In this week’s TechChat, we welcome Keith Gregory, our Cloud & Data Engineering Practice Lead here at Chariot. Keith is a prolific writer both on the Chariot blog as well as on his own, and is a wealth of knowledge on all things AWS. We touch on Redshift execution plans, how to appropriately size Redshift … Read More
I’ve always been a fan of database servers: self-contained entities that manage both storage and compute, and give you knobs to turn to optimize your queries. The flip side is that I have an inherent distrust of services such as Athena, which promise to run queries efficiently on structured data split between many files in a data lake. It just doesn’t seem natural; where are the knobs?
So, since I had data generated for my post on Athena performance with different file types, I decided to use that data in a performance comparison with Redshift.
In my “Friends Don’t Let Friends Use JSON” post, I noted that I preferred the Avro file format to Parquet, because it was easier to write code to use it. I expected some pushback, and got it: Parquet is “much” more performant. So I decided to do some benchmarking.
Amazon Athena is a service that lets you run SQL queries against structured data files stored in S3. It takes a “divide and conquer” approach, spinning up parallel query execution engines that each examine only a portion of your data. The performance of these queries, however, depends on how you consolidate and partition your data. In this post I compare query times for a moderately large dataset, looking for the “sweet spot” between number of files and individual file size.