Performance Comparison: Athena versus Redshift

I’ve always been a fan of database servers: self-contained entities that manage both storage and compute, and give you knobs to turn to optimize your queries. The flip side is that I have an inherent distrust of services such as Athena, which promise to run queries efficiently on structured data split between many files in a data lake. It just doesn’t seem natural; where are the knobs?

So, since I had data generated for my post on Athena performance with different file types, I decided to use that data in a performance comparison with Redshift.

Athena Performance Comparison: Avro, JSON, and Parquet

In my “Friends Don’t Let Friends Use JSON” post, I noted that I preferred the Avro file format to Parquet, because it was easier to write code to use it. I expected some pushback, and got it: Parquet is “much” more performant. So I decided to do some benchmarking.

Rightsizing Data for Athena

Amazon Athena is a service that lets you run SQL queries against structured data files stored in S3. It takes a “divide and conquer” approach, spinning up parallel query execution engines that each examine only a portion of your data. The performance of these queries, however, depends on how you consolidate and partition your data. In this post I compare query times for a moderately large dataset, looking for the “sweet spot” between number of files and individual file size.