Athena Performance Comparison: Avro, JSON, and Parquet


In my “Friends Don’t Let Friends Use JSON” post, I noted that I preferred Avro to Parquet, because it was easier to write code to use it. I expected some pushback, and got it: Parquet is “much” more performant. So I decided to do some benchmarking.

The Data

Finding large and relevant datasets is a challenge. So, to make these examples realistic, I generated my data.

Chariot’s Data Engineering interview uses a simulated source of “clickstream” events: the sort of events that you might get when recording users’ actions on your e-Commerce website. I adapted this event generator to write JSON files to S3, and ran it for a simulated population of 1,000,000 users, 10,000 products, and 100,000,000 total events (I ignored some events for this post, so the numbers below won’t add up).
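The actual generator lives in the linked repository; as a rough illustration, here is a minimal, hypothetical sketch of what producing a single product-page event might look like (the field names come from the examples below; the function name and everything else are my own invention):

```python
import json
import random
import uuid
from datetime import datetime, timezone

def make_product_page_event(user_id: str, product_id: str) -> str:
    """Build one simulated "product page viewed" event as a JSON line."""
    event = {
        "eventType": "productPage",
        "eventId": str(uuid.uuid4()),
        # "YYYY-MM-DD hh:mm:ss.SSS" -- trim microseconds down to milliseconds
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S.%f")[:-3],
        "userId": user_id,
        "productId": product_id,
    }
    return json.dumps(event)

# One simulated user viewing one of the 10,000 products
line = make_product_page_event(str(uuid.uuid4()), str(random.randrange(10_000)))
```

The real generator also maintains per-user state (so that cart adds follow page views for the same user), which this sketch omits.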

I then wrote a Glue script to transform this raw data into Avro and Parquet, and manually downloaded and GZipped the JSON. Then I created Athena tables for each of the datasets.

If you’d like to run this yourself, you’ll find the data generator and relevant CloudFormation templates here.


product_page

This table represents a person viewing the description page for a product, and identifies the user, the product, and the time that the event happened. All events also have a unique identifier.

  "eventType": "productPage",
  "eventId": "6519d3be-8799-4a00-a69e-d5681047fd7d",
  "timestamp": "2023-04-24 19:10:49.234",
  "userId": "c5362ccc-7355-433d-9322-9b9b564276a5",
  "productId": "8155"

Number of events: 59,693,900

| File Type | Number of Files | Events per File | File Size | Total Size |
|-----------|-----------------|-----------------|-----------|------------|
| Avro | 120 | 500,000 | 57 MB | 6.7 GB |
| JSON | 597 | 100,000 | 18 MB | 11.1 GB |
| GZipped JSON | 597 | 100,000 | 5 MB | 2.9 GB |
| Parquet | 120 | 500,000 | 35 MB | 4.2 GB |


add_to_cart

This table represents customers adding items to their shopping cart. It contains most of the same information as the product page event, along with the number of units added.

  "eventType": "addToCart",
  "eventId": "80ca509e-6493-48cc-92a1-4052045d507f",
  "timestamp": "2023-04-24 19:11:15.392",
  "userId": "c5362ccc-7355-433d-9322-9b9b564276a5",
  "productId": "8155",
  "quantity": 4

Number of events: 18,523,255

| File Type | Number of Files | Events per File | File Size | Total Size |
|-----------|-----------------|-----------------|-----------|------------|
| Avro | 37 | 500,000 | 57 MB | 2.1 GB |
| JSON | 186 | 100,000 | 20 MB | 3.6 GB |
| GZipped JSON | 186 | 100,000 | 5 MB | 0.9 GB |
| Parquet | 37 | 500,000 | 36 MB | 1.3 GB |


checkout_complete

This table represents the last stage in the purchase funnel. It identifies the user, the total number of items purchased, and their total value.

  "eventType": "checkoutComplete",
  "eventId": "aa243032-cffd-4fd7-ab9b-994e69567a76",
  "timestamp": "2023-04-24 19:16:42.581",
  "userId": "c5362ccc-7355-433d-9322-9b9b564276a5",
  "itemsInCart": 4,
  "totalValue": 6.00

Number of events: 9,853,549

| File Type | Number of Files | Events per File | File Size | Total Size |
|-----------|-----------------|-----------------|-----------|------------|
| Avro | 33 | 300,000 | 37 MB | 1.2 GB |
| JSON | 99 | 100,000 | 20 MB | 2.0 GB |
| GZipped JSON | 99 | 100,000 | 5 MB | 0.5 GB |
| Parquet | 33 | 300,000 | 21 MB | 0.7 GB |

Digression: Using Glue to bulk-transform datasets

If you have a large amount of data to transform (for example, 19 GB of JSON to transform into Parquet and Avro), then Glue is the easiest way to do it. At its most basic, you tell Glue to read one file format and write another; if you need to, you can transform the data between those steps. There are, however, some quirks in the process of which you should be aware.

The first is the difference between a Spark DataFrame and a Glue DynamicFrame. These objects are both “table-like” data structures, and are used when reading and writing data. The DynamicFrame provides some common operations to transform your data, and also infers the schema for your data. This latter feature is great if you have dirty data, or are combining data from heterogeneous sources, because it lets you define the actual schema after the fact with a transform. However, if you know the schema of your data, and know that it won’t deviate from that schema, then it’s easier to apply it when reading the data, which only DataFrame supports. Fortunately, the two objects are interconvertible: you can start by reading into a DataFrame, then create a DynamicFrame to perform transforms and write your data in formats that DataFrame doesn’t support.

A second quirk is that the SerDe (serializer/deserializer, used to read and write files) might not behave as you expect or desire. The Glue Avro SerDe, for example, writes timestamps and decimals as strings by default; you must configure it to write them as Avro logical types. It also writes column names in mixed case, which will cause problems for Athena unless you provide an explicit mapping in the table definition — or do as I do, and lowercase the names in the Glue script.
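Putting those two quirks together, a Glue job skeleton might look like the following. This is a sketch, not the script from the repository: the S3 paths are placeholders, and the `"version": "1.8"` format option is the one the Glue docs describe for enabling Avro logical types (verify against the docs for your Glue version).

```python
import sys
from pyspark.context import SparkContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext()
glue_context = GlueContext(sc)

# We know the schema, so apply it at read time via a DataFrame rather than
# letting a DynamicFrame infer it from the data.
schema = StructType([
    StructField("eventType", StringType()),
    StructField("eventId", StringType()),
    StructField("timestamp", StringType()),
    StructField("userId", StringType()),
    StructField("productId", StringType()),
    StructField("quantity", IntegerType()),
])
df = glue_context.spark_session.read.schema(schema).json("s3://MY_BUCKET/raw/")

# Lowercase the column names so Athena doesn't need an explicit mapping
df = df.toDF(*[c.lower() for c in df.columns])

# Convert to a DynamicFrame for the Glue-native writer
dyf = DynamicFrame.fromDF(df, glue_context, "events")
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://MY_BUCKET/avro/"},
    format="avro",
    format_options={"version": "1.8"},  # write logical types, not strings
)
```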

And lastly, the output of your SerDe must match the input expected by Athena (or whatever tool you’re using). Again looking at the Avro SerDe: it writes timestamps in microseconds, while Athena expects them in milliseconds. If you define the table as holding a timestamp, then Athena will report dates far in the future. Instead, you must define the column as a bigint and transform it for use (which is one of the JSON problems that I was trying to avoid in the first place!).
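The scale of the mismatch is easy to demonstrate outside of Athena. Using the timestamp from the product-page example above, as the SerDe would write it in microseconds:

```python
from datetime import datetime, timezone

# The Glue Avro SerDe writes timestamp-micros; Unix timestamps are in seconds.
avro_micros = 1_682_363_449_234_000  # 2023-04-24 19:10:49.234 UTC, in microseconds

# Interpreting those microseconds as milliseconds would place the date
# tens of thousands of years in the future; dividing by 1,000,000
# recovers the intended instant.
dt = datetime.fromtimestamp(avro_micros / 1_000_000, tz=timezone.utc)
```

This is the same arithmetic that the `from_unixtime("timestamp"/1000000)` expression performs in the Avro query for Query #2 below.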

All of which is to say that reading the documentation is important, as is experimenting with your data in a notebook before firing off a Glue job. To keep costs down, I prefer using the aws-glue-libs Docker container, rather than a managed developer notebook.

The Queries

I’ve picked a selection of “analytics” queries that explore different access patterns. For each, I ran the query against the four datasets (Avro, JSON, GZipped JSON, and Parquet), and recorded Athena’s runtime and data scanned statistics. Runtime is in seconds, rounded to the nearest hundredth (and even that, I think, might give a false sense of precision: sometimes the numbers were replicable, sometimes they varied by a second or more between runs).

Query #1: top-10 best-selling products, as measured by cart adds

This is a simple aggregation over one table; it requires reading every row.

select  productid, sum(quantity) as units_added
from    "athena-avro"."add_to_cart"
group   by productid
order   by units_added desc
limit   10;

| File Type | Runtime (s) | Data Scanned |
|-----------|-------------|--------------|
| Avro | 5.19 | 1.78 GB |
| JSON | 2.36 | 3.52 GB |
| GZipped JSON | 5.47 | 940 MB |
| Parquet | 1.05 | 42 MB |

I expected Parquet to have an edge here, and was not disappointed: it ran much faster than the others, and scanned an order of magnitude less data. This is a result of its columnar data layout: Athena only had to read the two columns involved in the query. With the other formats, Athena had to read the entire file.

And I have to admit: Avro disappointed me with its performance. I was expecting it to be at least as fast as JSON, because internally it stores data in a binary format. But it wasn’t even competitive. This is, possibly, a result of the SerDe that Athena uses to read the data.

Query #2: best-selling products in a specific time range

From a business perspective, trends and current data are more important than an overall aggregation. From the perspective of comparing data formats, this query adds another dimension to the query plan: only some of the rows will contribute to the result.

The other thing that this query highlights is the different forms of date handling. The Parquet query is pretty much what you’d expect:

select  productid, sum(quantity) as units_added
from    "athena-parquet"."add_to_cart"
where   "timestamp" between from_iso8601_timestamp('2023-05-03T21:00:00') 
                        and from_iso8601_timestamp('2023-05-03T22:00:00')
group   by productid
order   by units_added desc
limit   10;

The JSON query uses strings for the between clause. This works because of the way that the timestamps are formatted: lexical order is the same as timestamp order. If the timestamps used some other format, you might have to convert the timestamp column into an actual Athena timestamp, and compare using the same logic as the Parquet query.

select  productid, sum(quantity) as units_added
from    "athena-json"."add_to_cart"
where   "timestamp" between '2023-05-03 21:00:00' 
                        and '2023-05-03 22:00:00'
group   by productid
order   by units_added desc
limit   10;
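String comparison works here because the format is fixed-width, zero-padded, and ordered from most-significant field to least: lexical order therefore matches chronological order. A quick demonstration, outside of Athena:

```python
# Zero-padded "YYYY-MM-DD hh:mm:ss" strings sort the same way the
# underlying instants do, so a string BETWEEN behaves correctly.
lower = "2023-05-03 21:00:00"
upper = "2023-05-03 22:00:00"

in_range = "2023-05-03 21:30:15.392"
too_late = "2023-05-03 22:00:00.001"

assert lower <= in_range <= upper
assert not (lower <= too_late <= upper)

# Contrast with a non-padded format, where lexical order breaks:
assert "2023-5-9" > "2023-10-01"   # string comparison; chronologically wrong
```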

And lastly, the Avro version. Because of the way that the Glue SerDe wrote timestamps, I needed to first convert them to an Athena-compatible value. If you were writing the file yourself, and could choose the Athena-compatible timestamp-millis logical type, there would be no need for this conversion.

select  productid, sum(quantity) as units_added
from    "athena-avro"."add_to_cart"
where   from_unixtime("timestamp"/1000000) between from_iso8601_timestamp('2023-05-03T21:00:00') 
                                               and from_iso8601_timestamp('2023-05-03T22:00:00')
group   by productid
order   by units_added desc
limit   10;

| File Type | Runtime (s) | Data Scanned |
|-----------|-------------|--------------|
| Avro | 4.81 | 1.78 GB |
| JSON | 2.40 | 3.52 GB |
| GZipped JSON | 4.06 | 940 MB |
| Parquet | 0.85 | 8.16 MB |

Parquet is an even bigger win for this query, with over two orders of magnitude less data scanned. Part of the explanation is that Parquet’s column-chunk metadata can record the minimum and maximum values stored in each chunk, letting the query engine skip row groups (or entire files) that can’t contain the date range. The rest is that the chunk of each file containing timestamps can be read and scanned very quickly.

Of course, in the real world you would probably partition your files by date, meaning that Athena would only read a subset of the files regardless of format. This is likely to reduce Parquet’s performance advantage.

Query #3: products that are often viewed but not added to a cart

This is a join between two tables, which Athena performs by selecting all rows from both tables and then doing a merge. One thing to note is that Athena, unlike Redshift, can’t reference select-expression aliases (such as “views” and “adds”) elsewhere in the select list. As a result, you have to calculate the difference outside the query, or repeat the count expressions.

select  pp.productid, 
        count(distinct pp.eventid) as views, 
        count(distinct atc.eventid) as adds
from    "athena-avro"."product_page" pp
join    "athena-avro"."add_to_cart" atc 
on      atc.userid = pp.userid 
and     atc.productid = pp.productid
group   by pp.productid
order   by views - adds desc
limit   10;

| File Type | Runtime (s) | Data Scanned |
|-----------|-------------|--------------|
| Avro | 7.46 | 7.51 GB |
| JSON | 5.13 | 14.15 GB |
| GZipped JSON | 7.18 | 3.76 GB |
| Parquet | 3.86 | 5.39 GB |

As I said above, Athena merges the two tables when performing a join. There’s nothing here that reduces the number of rows in the join, so the total data scanned roughly equals the size of the two tables. And since this query involves almost all of the rows of the two tables (excepting the timestamps and cart quantity), there’s little benefit to be gained from Parquet.
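The numbers from the data section above bear that out: for every format, the bytes scanned come within roughly 15% of the combined sizes of product_page and add_to_cart.

```python
# Table sizes (GB) from the data section: (product_page, add_to_cart)
table_sizes_gb = {
    "Avro": (6.7, 2.1),
    "JSON": (11.1, 3.6),
    "GZipped JSON": (2.9, 0.9),
    "Parquet": (4.2, 1.3),
}
# Data scanned (GB) reported by Athena for the join query
scanned_gb = {"Avro": 7.51, "JSON": 14.15, "GZipped JSON": 3.76, "Parquet": 5.39}

for fmt, sizes in table_sizes_gb.items():
    ratio = scanned_gb[fmt] / sum(sizes)
    print(f"{fmt}: scanned {scanned_gb[fmt]} GB of {sum(sizes):.1f} GB ({ratio:.0%})")
```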

Query #4: customers that abandoned carts

If a customer puts an item in their cart but never completes checkout, we consider that cart abandoned. The easiest way to discover these users is with an outer join between the add_to_cart and checkout_complete tables, comparing the timestamps of each. Note that we don’t have to do any conversions on the timestamps: each of the data formats represents them in a way that makes a simple comparison work.

select  count(distinct user_id) as users_with_abandoned_carts
from    (
        select  atc.userid as user_id,
                max(atc."timestamp") as max_atc_timestamp,
                max(cc."timestamp") as max_cc_timestamp
        from    "athena-avro"."add_to_cart" atc
        left join "athena-avro"."checkout_complete" cc
        on      cc.userid = atc.userid
        group   by atc.userid
        )
where   max_atc_timestamp > max_cc_timestamp
or      max_cc_timestamp is null;

| File Type | Runtime (s) | Data Scanned |
|-----------|-------------|--------------|
| Avro | 8.07 | 2.76 GB |
| JSON | 4.07 | 5.48 GB |
| GZipped JSON | 10.44 | 1.40 GB |
| Parquet | 4.40 | 975 MB |

In comparison to the previous query, Parquet does have an advantage here: it only needs to read the userid and timestamp columns from the two tables.


Conclusion

Yes, Parquet performed better than the other options. In some cases, “much” better.

But one thing to consider is that this dataset is almost perfectly targeted to Parquet: it’s a relatively small number of relatively large files. If you created one file per day, or one per hour (reasonable from a busy stream), then you may pay more in file reads than you save from columnar storage. Or perhaps not: I modified the Glue job to generate a larger number of partitions, and while the queries took longer, it wasn’t that much more: the “top-10 cart adds” query ran in 1.4 seconds versus 1.05, and scanned 58 MB of data rather than 42.

And I was disappointed by Avro’s performance, both in terms of query time and bytes scanned. Having to jump through hoops to compensate for the SerDe didn’t help.

So, if I have the opportunity to aggregate data into large files, then I think Parquet is the right way to go. But if I’m processing relatively small files, writing my own code to create the schema and translate the data, then I still lean toward Avro. Because developer time is “much” more valuable than machine time.