Rightsizing Data for Athena

Amazon Athena is a service that lets you run SQL queries against structured data files stored in S3. It takes a “divide and conquer” approach, spinning up parallel query execution engines that each examine only a portion of your data. For infrequently-run queries and large datasets, it can be very cost-effective compared to Redshift or a custom Glue job.

However, it’s not magical: in order to do its work it must retrieve the files, parse them, and then apply query predicates. Each of these steps takes time — individually a very small amount of time — but it adds up. As a result, depending on how you manage your data you can see 10x or more performance difference for the same queries.

The Performance Tuning section of the Athena User Guide has several suggestions for storing your data, some of which seemingly contradict each other. And finding numbers to go with those recommendations can be a challenge, so I decided to generate my own.

The Data

The first problem with any benchmark is picking an appropriate dataset. I’ve chosen to use Chariot’s CloudTrail event log, which is moderately large but may or may not represent your “real world” data. CloudTrail events tend to be bursty in the best of cases (coinciding with actual user actions), and are especially so in our environment, which has many “sandbox” accounts that get infrequent use. The result is a large number of generally small files, which violates one of Amazon’s tuning recommendations. However, I think this same pattern might appear in real-world “clickstream” data, and it highlights the problems with lots of small files.

Some numbers: we have 21 months of data, stored in approximately 1.7 million compressed JSON files that consume approximately 3.1 GB of data on S3; the uncompressed size is approximately 9.6 GB. I keep using the word “approximately” because this is a live repository: new files are constantly added, at a rate of around 100 files per hour. As a result, all of the query counts shown below are also approximate: the same query run at different times will have a slightly different set of data to work with.

I created three datasets for testing:

  • The raw data: 1.7 million files, 3.1 GB of stored data. Individual files range from 321 bytes to 1 MB, with a median size of 821 bytes.
  • Consolidated events by year/month/day: 684 files containing 1.8 GB of compressed JSON. Minimum file size approximately 64 KB, maximum 87 MB, and median 800 KB. (One way to build such a consolidated dataset is sketched just after this list.)
  • All events, as a single 1.8 GB compressed JSON file. This also goes against the recommendations in the Athena tuning guide, but I think it’s important to see its effect (I’ve been burned in the past when a multi-gigabyte “backfill” file got dropped into a data lake).
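
As promised, here’s a rough sketch of how the by-date consolidation could be done with Athena itself, via a CTAS (CREATE TABLE AS SELECT) query; this isn’t necessarily how I built these datasets. The bucket and table names are placeholders, the column list is abbreviated, and it assumes a partitioned table over the raw logs (cloudtrail_partitioned, sketched in the next section). Athena compresses text-format CTAS output with gzip by default.

CREATE TABLE cloudtrail_consolidated
WITH (
    external_location = 's3://my-bucket/cloudtrail-consolidated/',  -- placeholder bucket
    format            = 'JSON',
    partitioned_by    = ARRAY['year', 'month', 'day']
) AS
SELECT eventversion, eventtime, eventsource, eventname, awsregion, useridentity,
       year, month, day  -- partition columns must come last in the select list
FROM cloudtrail_partitioned;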

Table Definition and Partitioning

AWS provides an Athena table definition for CloudTrail logs, which was my starting point. This table definition relies on the CloudTrail SerDe (serializer-deserializer, used by Athena to read and write text files), and I discovered that it didn’t want to read my consolidated files (possibly because they are missing a top-level “Records” attribute). So for the consolidated logs I kept the same table structure, but used the Hive JSON SerDe to access the data.
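
As a sketch, a table over the consolidated files looks something like this (abbreviated to a handful of columns, with a placeholder bucket; the full definition mirrors the AWS-provided one, swapping in the Hive JSON SerDe):

CREATE EXTERNAL TABLE cloudtrail_consolidated (
    eventversion STRING,
    eventtime    STRING,
    eventsource  STRING,
    eventname    STRING,
    awsregion    STRING,
    useridentity STRUCT<type:STRING, principalid:STRING, accountid:STRING>
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://my-bucket/cloudtrail-consolidated/';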

Partitioning is how Athena reduces the amount of data that it reads for a query, by examining only the S3 prefixes that are relevant to that query. For example, a CloudTrail logfile is stored in S3 with a path that looks like this:

AWSLogs/o-x2b2e28quf/123456789012/CloudTrail/us-east-1/2020/10/03/123456789012_CloudTrail_us-east-1_20201003T0000Z_abBucVvli7n8ZW9S.json.gz

The path has several fields embedded in it: account number (123456789012), region (us-east-1), and the date that the file was created (2020/10/03). You can partition on any of these values, and doing so improves the performance of any query that has a predicate on the partition value (e.g., WHERE account_id = '123456789012'), because Athena need only look at S3 files where that value is part of the key.

There are two ways to partition a table: you can explicitly add partitions (or let a Glue crawler do so for you), or you can use partition projection. In the latter case, your table definition specifies legal values for the fields that you want to partition on, and Athena uses those specifications to generate paths. This can be more efficient, because Athena doesn’t need to read the partition metadata from Glue.
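
To make that concrete, here’s a sketch of what partition projection could look like for the raw logs. This is illustrative, not my exact configuration: the bucket, account IDs, regions, and year range are placeholders, the column list is heavily abbreviated, and the SerDe and input format classes come from the AWS-provided CloudTrail table definition.

CREATE EXTERNAL TABLE cloudtrail_partitioned (
    eventversion STRING,
    eventtime    STRING,
    eventsource  STRING,
    eventname    STRING,
    awsregion    STRING,
    useridentity STRUCT<type:STRING, principalid:STRING, accountid:STRING>
)
PARTITIONED BY (account_id STRING, region STRING, year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-bucket/AWSLogs/o-x2b2e28quf/'
TBLPROPERTIES (
    'projection.enabled'           = 'true',
    'projection.account_id.type'   = 'enum',
    'projection.account_id.values' = '123456789012,234567890123',
    'projection.region.type'       = 'enum',
    'projection.region.values'     = 'us-east-1,us-east-2,us-west-2',
    'projection.year.type'         = 'integer',
    'projection.year.range'        = '2019,2021',
    'projection.month.type'        = 'integer',
    'projection.month.range'       = '1,12',
    'projection.month.digits'      = '2',
    'projection.day.type'          = 'integer',
    'projection.day.range'         = '1,31',
    'projection.day.digits'        = '2',
    'storage.location.template'    = 's3://my-bucket/AWSLogs/o-x2b2e28quf/${account_id}/CloudTrail/${region}/${year}/${month}/${day}'
);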

Picking appropriate partitions is a bit of an art. Partitions only help when the query predicates reference partitioned values, and they can actually slow down queries that don’t, sometimes dramatically, as you’ll see below. And, unfortunately, real-world data sometimes forces your hand: CloudTrail partitions a multi-account trail by account ID whether or not you want to query by that field.

For the raw CloudTrail logs, I defined two tables: one unpartitioned, and one partitioned by account/region/date. The former has to scan all 1.6 million files for every query; the latter scans less data for most “reasonable” queries. For the consolidated logs, I partitioned by date, which matches the data. The “single file” dataset, of course, couldn’t be partitioned.

The Queries

These queries are intended to highlight the difference in processing the same basic data given different storage configurations. As such, they’re not terribly complex, using simple predicates that can be satisfied either via partition values or by parsing and extracting data fields.

For each query, I show the number of rows returned, execution time, and amount of data scanned; these three values are copied from the Athena Console. I also give an estimate of the number of files scanned, based on a directory listing.

Query 1: count of all events

This query has to scan the entire dataset; it does not benefit at all from partitioning. In fact, Athena takes far longer to run the query with the partitioned version of the raw logs. Again, remember that the counts are approximate, and depend on the state of the data when the queries were run.
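
The query itself is as simple as it gets; here and below, cloudtrail_consolidated stands in for whichever table is under test:

SELECT count(*) FROM cloudtrail_consolidated;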

Dataset                      Count      Time      Scanned    Files
Raw CloudTrail logs          6,819,062  3:15      2.94 GB    1,637,376
Partitioned CloudTrail logs  6,754,019  12:42     2.92 GB    1,637,376
Consolidated by date         6,448,023  6.3 sec   1.75 GB    684
Single file                  6,509,762  1:44      1.76 GB    1

Query 2: count of all events for single account

This query identifies events for our primary account. With the fully partitioned data, Athena only needs to look at a subset of the files. For all of the others, it must look at the useridentity.accountid field in the parsed data.
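
In rough form the two variants look like this, using the hypothetical table names from the sketches above and a placeholder account number:

-- partitioned table: predicate on the partition column
SELECT count(*)
FROM cloudtrail_partitioned
WHERE account_id = '123456789012';

-- all other tables: predicate on the parsed event data
SELECT count(*)
FROM cloudtrail_consolidated
WHERE useridentity.accountid = '123456789012';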

Note the significant discrepancy between the counts from the partitioned table and those from other queries. This isn’t due to timing, and you’ll see it in the next query as well. I’ll explain it at the end of the post.

Dataset                      Count    Time      Scanned    Files
Raw CloudTrail logs          522,333  3:14      2.94 GB    1,637,376
Partitioned CloudTrail logs  755,130  20.7 sec  342.96 MB  336,478
Consolidated by date         502,513  6.5 sec   1.75 GB    684
Single file                  516,393  1:53      1.76 GB    1

Query 3: count of all events for single account, single month

Using the same account as the previous query, this query adds a predicate on the year and month. This again benefits the partitioned datasets; the others must parse the data and extract a substring from the eventtime field.
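
Again in rough form, with the same caveats (cloudtrail_single_file is a placeholder for the single-file table):

-- partitioned tables: compare partition values directly
SELECT count(*)
FROM cloudtrail_partitioned
WHERE account_id = '123456789012'
  AND year = '2020' AND month = '10';

-- unpartitioned tables: extract a substring from the eventtime field
SELECT count(*)
FROM cloudtrail_single_file
WHERE useridentity.accountid = '123456789012'
  AND substr(eventtime, 1, 7) = '2020-10';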

Dataset                      Count   Time     Scanned   Files
Raw CloudTrail logs          25,595  3:12     2.94 GB   1,637,376
Partitioned CloudTrail logs  34,990  3.3 sec  19.22 MB  19,723
Consolidated by date         25,595  1.3 sec  37.25 MB  31
Single file                  25,595  1:56     1.76 GB   1

Conclusions

Partitioning definitely helps query performance, by reducing the number of files that Athena has to look at. But you’ll generally get a much better improvement by reducing the number of files and increasing the amount of data that each one contains. At no point did the heavily-partitioned raw dataset beat the lightly-partitioned consolidated dataset, even when partitioning let Athena scan a fraction of the data.

On the other hand, dumping all of the data into a single file isn’t a good idea either, because it doesn’t leverage Athena’s ability to process queries in parallel.

For this dataset, consolidating (and partitioning) by date seemed to be the sweet spot. For lower-volume data, consolidating by month might be sufficient.

And I want to leave you with another thought: partitioning might be needed for correct queries. The queries that used a predicate on account ID always reported higher counts when that predicate used a partition value than when they extracted the account ID from the event data. The reason is that a significant number of CloudTrail events don’t store the account ID (at least not in the field I used). Treat this as a reminder that performance is great, but not when it impairs accuracy!
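
To see whether your own data has this problem, a query along these lines counts the events whose body lacks an account ID (again using the hypothetical consolidated table):

SELECT count(*)
FROM cloudtrail_consolidated
WHERE useridentity.accountid IS NULL;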

If you’d like to try these queries yourself, I’ve uploaded the table definitions and queries to our GitHub site. You’ll need to run them against your own CloudTrail data, or a publicly accessible dataset like this one.
