In my “Friends Don’t Let Friends Use JSON” post, I noted that I preferred Avro to Parquet, because it was easier to write code to use it. I expected some pushback, and got it: Parquet is “much” more performant. So I decided to do some benchmarking.
The Data
Finding large and relevant datasets is a challenge. So, to make these examples realistic, I generated my data.
Chariot’s Data Engineering interview uses a simulated source of “clickstream” events: the sort of events that you might get when recording users’ actions on your e-Commerce website. I adapted this event generator to write JSON files to S3, and ran it for a simulated population of 1,000,000 users, 10,000 products, and 100,000,000 total events (I ignored some events for this post, so the numbers below won’t add up).
I then wrote a Glue script to transform this raw data into Avro and Parquet, and manually downloaded and GZipped the JSON. Then I created Athena tables for each of the datasets.
If you’d like to run this yourself, you’ll find the data generator and relevant CloudFormation templates here.
product_page
This table represents a person viewing the description page for a product, and identifies the user, the product, and the time that the event happened. All events also have a unique identifier.
{ "eventType": "productPage", "eventId": "6519d3be-8799-4a00-a69e-d5681047fd7d", "timestamp": "2023-04-24 19:10:49.234", "userId": "c5362ccc-7355-433d-9322-9b9b564276a5", "productId": "8155" }
Number of events: 59,693,900
File Type | Number of Files | Events per File | File Size | Total Size |
---|---|---|---|---|
Avro | 120 | 500,000 | 57 MB | 6.7 GB |
JSON | 597 | 100,000 | 18 MB | 11.1 GB |
GZipped JSON | 597 | 100,000 | 5 MB | 2.9 GB |
Parquet | 120 | 500,000 | 35 MB | 4.2 GB |
add_to_cart
This table represents customers adding items to their shopping cart. It contains most of the same information as the product page event, along with the number of units added.
{ "eventType": "addToCart", "eventId": "80ca509e-6493-48cc-92a1-4052045d507f", "timestamp": "2023-04-24 19:11:15.392", "userId": "c5362ccc-7355-433d-9322-9b9b564276a5", "productId": "8155", "quantity": 4 }
Number of events: 18,523,255
File Type | Number of Files | Events per File | File Size | Total Size |
---|---|---|---|---|
Avro | 37 | 500,000 | 57 MB | 2.1 GB |
JSON | 186 | 100,000 | 20 MB | 3.6 GB |
GZipped JSON | 186 | 100,000 | 5 MB | 0.9 GB |
Parquet | 37 | 500,000 | 36 MB | 1.3 GB |
checkout_complete
This table represents the last stage in the purchase funnel. It identifies the user, the total number of items purchased, and their total value.
{ "eventType": "checkoutComplete", "eventId": "aa243032-cffd-4fd7-ab9b-994e69567a76", "timestamp": "2023-04-24 19:16:42.581", "userId": "c5362ccc-7355-433d-9322-9b9b564276a5", "itemsInCart": 4, "totalValue": 6.00 }
Number of events: 9,853,549
File Type | Number of Files | Events per File | File Size | Total Size |
---|---|---|---|---|
Avro | 33 | 300,000 | 37 MB | 1.2 GB |
JSON | 99 | 100,000 | 20 MB | 2.0 GB |
GZipped JSON | 99 | 100,000 | 5 MB | 0.5 GB |
Parquet | 33 | 300,000 | 21 MB | 0.7 GB |
Digression: Using Glue to bulk-transform datasets
If you have a large amount of data to transform (for example, 19 GB of JSON to transform into Parquet and Avro), then Glue is the easiest way to do it. At its most basic, you tell Glue to read one file format and write another; if you need to, you can transform the data between those steps. There are, however, some quirks in the process of which you should be aware.
The first is the difference between a Spark `DataFrame` and a Glue `DynamicFrame`. These objects are both “table-like” data structures, and are used when reading and writing data. The `DynamicFrame` provides some common operations to transform your data, and also infers the schema for your data. This latter feature is great if you have dirty data, or are combining data from heterogeneous sources, because it lets you define the actual schema after the fact with a transform. However, if you know the schema of your data, and know that it won’t deviate from that schema, then it’s easier to apply the schema when reading the data, which only `DataFrame` supports. Fortunately, the two objects are interconvertible: you can start by reading into a `DataFrame`, then create a `DynamicFrame` to perform transforms and write your data in formats that `DataFrame` doesn’t support.
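To make that concrete, here is a minimal sketch of the round trip for the `add_to_cart` events. The bucket names and paths are placeholders, and a real job would likely also cast the timestamp string to a proper timestamp type before writing:

```python
# Minimal sketch of the DataFrame-to-DynamicFrame round trip (paths are placeholders)
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# The schema is known up front, so apply it at read time (a DataFrame-only feature)
schema = StructType([
    StructField("eventType", StringType()),
    StructField("eventId",   StringType()),
    StructField("timestamp", StringType()),
    StructField("userId",    StringType()),
    StructField("productId", StringType()),
    StructField("quantity",  IntegerType()),
])

df = spark.read.schema(schema).json("s3://example-bucket/raw/add_to_cart/")

# Convert to a DynamicFrame in order to use Glue's transforms and writers
dyf = DynamicFrame.fromDF(df, glue_context, "add_to_cart")

glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/parquet/add_to_cart/"},
    format="parquet",
)
```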
A second quirk is that the SerDe (serializer/deserializer, used to read and write files) might not behave as you expect or desire. The Glue Avro SerDe, for example, writes timestamps and decimals as strings by default; you must configure it to write them as Avro logical types. It also writes column names in mixed case, which will cause problems for Athena unless you provide an explicit mapping in the table definition — or do as I do, and lowercase the names in the Glue script.
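A hedged sketch of those two fixes, continuing the example above (the `version` format option is, as I understand it, what switches the Glue Avro writer to logical types; the output path is again a placeholder):

```python
# Continuing the sketch above: lowercase the column names so Athena sees them as-is,
# and enable Avro logical types when writing (the default writer version treats
# timestamps and decimals as strings)
df = df.toDF(*[name.lower() for name in df.columns])

dyf = DynamicFrame.fromDF(df, glue_context, "add_to_cart_lowercased")

glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/avro/add_to_cart/"},
    format="avro",
    format_options={"version": "1.8"},  # believed required for Avro logical types
)
```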
And lastly, the output of your SerDe must match the input expected by Athena (or whatever tool you’re using). Again looking at the Avro SerDe: it writes timestamps in microseconds, while Athena expects them in milliseconds. If you define the table as holding a `timestamp`, then Athena will report dates far in the future. Instead, you must define the column as a `bigint` and transform it for use (which is one of the JSON problems that I was trying to avoid in the first place!).
All of which is to say that reading the documentation is important, as is experimenting with your data in a notebook before firing off a Glue job. To keep costs down, I prefer using the aws-glue-libs Docker container, rather than a managed developer notebook.
The Queries
I’ve picked a selection of “analytics” queries that explore different access patterns. For each, I ran the query against the four datasets (Avro, JSON, GZipped JSON, and Parquet), and recorded Athena’s runtime and data scanned statistics. Runtime is in seconds, rounded to the nearest hundredth (and even that, I think, might give a false sense of precision: sometimes the numbers were replicable, sometimes they varied by a second or more between runs).
Query #1: top-10 best-selling products, as measured by cart adds
This is a simple aggregation over one table; it requires reading every row.
select productid, sum(quantity) as units_added from "athena-avro"."add_to_cart" group by productid order by units_added desc limit 10;
File Type | Runtime (s) | Data Scanned |
---|---|---|
Avro | 5.19 | 1.78 GB |
JSON | 2.36 | 3.52 GB |
GZipped JSON | 5.47 | 940 MB |
Parquet | 1.05 | 42 MB |
I expected Parquet to have an edge here, and was not disappointed: it ran much faster than the others, and scanned an order of magnitude less data. This is a result of its columnar data layout: Athena only had to read the two columns involved in the query. With the other formats, Athena had to read the entire file.
And I have to admit: Avro disappointed me with its performance. I was expecting it to be at least as fast as JSON, because internally it stores data in a binary format. But it wasn’t even competitive. This is, possibly, a result of the SerDe that Athena uses to read the data.
Query #2: best-selling products in a specific time range
From a business perspective, trend and current data is more important than an overall aggregation. From the perspective of comparing data formats, it adds another dimension to the query plan: only some of the rows will contribute to the result.
The other thing that this query highlights is the different forms of date handling. The Parquet query is pretty much what you’d expect:
select productid, sum(quantity) as units_added from "athena-parquet"."add_to_cart" where "timestamp" between from_iso8601_timestamp('2023-05-03T21:00:00') and from_iso8601_timestamp('2023-05-03T22:00:00') group by productid order by units_added desc limit 10;
The JSON query uses strings for the `between` clause. This works because of the way that the timestamps are formatted: lexical order is the same as timestamp order. If the timestamps used some other format, you might have to convert the `timestamp` column into an actual Athena timestamp, and compare using the same logic as the Parquet query.
select productid, sum(quantity) as units_added from "athena-json"."add_to_cart" where "timestamp" between '2023-05-03 21:00:00' and '2023-05-03 22:00:00' group by productid order by units_added desc limit 10;
And lastly, the Avro version. Because of the way that the Glue SerDe wrote timestamps, I needed to first convert them to an Athena-compatible value. If you were writing the file yourself, and could choose the Athena-compatible `timestamp-millis` logical type, there would be no need for this conversion.
select productid, sum(quantity) as units_added from "athena-avro"."add_to_cart" where from_unixtime("timestamp"/1000000) between from_iso8601_timestamp('2023-05-03T21:00:00') and from_iso8601_timestamp('2023-05-03T22:00:00') group by productid order by units_added desc limit 10;
File Type | Runtime (s) | Data Scanned |
---|---|---|
Avro | 4.81 | 1.78 GB |
JSON | 2.40 | 3.52 GB |
GZipped JSON | 4.06 | 940 MB |
Parquet | 0.85 | 8.16 MB |
Parquet is an even bigger win for this query, with over two orders of magnitude less data scanned. I wondered if perhaps the column metadata identified the minimum and maximum values stored in the file, allowing Athena to skip files that didn’t contain the date range; the Parquet format does define optional min/max statistics in its column-chunk metadata, so that may be part of the story. And even where a file can’t be skipped, the chunk of each file containing timestamps can be read and scanned very quickly.
Of course, in the real world you would probably partition your files by date, meaning that Athena would only read a subset of the files regardless of format. This is likely to reduce Parquet’s performance advantage.
Query #3: products that are often viewed but not added to a cart
This is a join between two tables, which Athena performs by selecting all rows from both tables and then doing a merge. One thing to note is that Athena, unlike Redshift, can’t use select-expression aliases (such as `views` and `adds`) in the select list. As a result, you have to calculate the difference outside the query or by repeating the `count` expressions.
```sql
select pp.productid,
       count(distinct pp.eventid) as views,
       count(distinct atc.eventid) as adds
from "athena-avro"."product_page" pp
join "athena-avro"."add_to_cart" atc
  on atc.userid = pp.userid
 and atc.productid = pp.productid
group by pp.productid
order by views - adds desc
limit 10
```
File Type | Runtime (s) | Data Scanned |
---|---|---|
Avro | 7.46 | 7.51 GB |
JSON | 5.13 | 14.15 GB |
GZipped JSON | 7.18 | 3.76 GB |
Parquet | 3.86 | 5.39 GB |
As I said above, Athena merges the two tables when performing a join. There’s nothing here that reduces the number of rows in the join, so the total data scanned roughly equals the size of the two tables. And since this query involves almost all of the columns of the two tables (excepting the timestamps and cart quantity), there’s little benefit to be gained from Parquet.
Query #4: customers that abandoned carts
If a customer puts an item in their cart but never completes checkout, we consider that cart abandoned. The easiest way to discover these users is with an outer join between the `add_to_cart` and `checkout_complete` tables, comparing the timestamps of each. Note that we don’t have to do any conversions on the timestamps: each of the data formats represents those timestamps in a way that a simple comparison will work.
```sql
select count(distinct user_id) as users_with_abandoned_carts
from (
    select atc.userid as user_id,
           max(atc."timestamp") as max_atc_timestamp,
           max(cc."timestamp") as max_cc_timestamp
    from "athena-avro"."add_to_cart" atc
    left join "athena-avro"."checkout_complete" cc
      on cc.userid = atc.userid
    group by atc.userid
)
where max_atc_timestamp > max_cc_timestamp
   or max_cc_timestamp is null
```
File Type | Runtime (s) | Data Scanned |
---|---|---|
Avro | 8.07 | 2.76 GB |
JSON | 4.07 | 5.48 GB |
GZipped JSON | 10.44 | 1.40 GB |
Parquet | 4.40 | 975 MB |
In comparison to the previous query, Parquet does have an advantage here: it only needs to read the `userid` and `timestamp` columns from the two tables.
Conclusions
Yes, Parquet performed better than the other options. In some cases, “much” better.
But one thing to consider is that this dataset is almost perfectly targeted to Parquet: it’s a relatively small number of relatively large files. If you created one file per day, or one per hour (reasonable from a busy stream), then you may pay more in file reads than you save from columnar storage. Or perhaps not: I modified the Glue job to generate a larger number of partitions, and while the queries took longer, it wasn’t that much more: the “top 10 cart adds” query ran in 1.4 seconds versus 1.05, and scanned 58 MB of data rather than 42.
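For reference, one way to produce more (and smaller) output files from a job like the one sketched earlier is to repartition the DataFrame before converting and writing it; the partition count below is an arbitrary example:

```python
# Hypothetical tweak to the earlier sketch: repartitioning before the write
# produces more, smaller output files
df_more_files = df.repartition(500)  # arbitrary example value

dyf_more_files = DynamicFrame.fromDF(df_more_files, glue_context, "add_to_cart_more_files")
glue_context.write_dynamic_frame.from_options(
    frame=dyf_more_files,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/parquet-more-files/add_to_cart/"},
    format="parquet",
)
```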
And I was disappointed by Avro’s performance, both in terms of query time and bytes scanned. Having to jump through hoops to compensate for the SerDe didn’t help.
So, if I have the opportunity to aggregate data into large files, then I think Parquet is the right way to go. But if I’m processing relatively small files, writing my own code to create the schema and translate the data, then I still lean toward Avro. Because developer time is “much” more valuable than machine time.