I’ve always been a fan of database servers: self-contained entities that manage both storage and compute, and give you knobs to turn to optimize your queries. The flip side is that I have an inherent distrust of services such as Athena, which promise to run queries efficiently on structured data split between many files in a data lake. It just doesn’t seem natural; where are the knobs?
Since I had data generated for my post on Athena performance with different file types, I decided to use that data in a performance comparison with Redshift. For the latter, I chose both 4-node and 8-node `dc2.large` provisioned clusters, and an 8-RPU serverless workspace; I also measured the performance of Redshift Spectrum on those clusters.
The Tables
To access the data from Redshift Spectrum, I created an external schema (named `ext_parquet`) that referred back to the `athena-parquet` Glue database of the prior post.
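I won’t reproduce my exact setup, but the command looks something like this sketch; the IAM role ARN is a placeholder for a role that can read the Glue catalog and the underlying S3 bucket:

```sql
-- A sketch of the external schema definition; the iam_role ARN is a
-- placeholder for a role with access to Glue and the source S3 bucket.
create external schema ext_parquet
from data catalog
database 'athena-parquet'
iam_role 'arn:aws:iam::123456789012:role/ExampleSpectrumRole';
```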
Then I manually created tables on the Redshift cluster using the following DDL, and ran an `insert into ... select from ...` query to populate them from the Spectrum tables.
```sql
create table PRODUCT_PAGE
(
    eventtype   text not null,
    eventid     text not null,
    timestamp   timestamp not null,
    userid      text not null,
    productid   text not null
)
distkey ( userid )
sortkey ( timestamp );

create table ADD_TO_CART
(
    eventtype   text not null,
    eventid     text not null,
    timestamp   timestamp not null,
    userid      text not null,
    productid   text not null,
    quantity    int not null
)
distkey ( userid )
sortkey ( timestamp );

create table CHECKOUT_COMPLETE
(
    eventtype   text not null,
    eventid     text not null,
    timestamp   timestamp not null,
    userid      text not null,
    itemsincart int not null,
    totalvalue  decimal(8,2) not null
)
distkey ( userid )
sortkey ( timestamp );
```
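The population step was then a single insert-select per table; here’s a minimal sketch, assuming the column order of the external table matches the DDL above:

```sql
-- Copy all rows from the Spectrum (external) table into the local table;
-- select * works here because the column order matches the DDL.
insert into product_page
select * from ext_parquet.product_page;
```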
Two things to call out from that DDL: first, that I kept the column naming from the Parquet files. While I would normally add underscores (e.g., `product_id`), that would have meant different queries for the different sources, and I didn’t want to do that much editing.
More important, I set the distribution and sort keys. By distributing on `userid`, I should maximize Redshift’s ability to perform the example joins in parallel. By sorting on `timestamp`, I should enable Redshift to ignore irrelevant data for time-based queries. “Should” is an important qualifier here: the effect of these two knobs is highly dependent on how much data actually lives in the cluster, and how it’s queried.
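One way to verify that a distribution key isn’t skewed (a quiet killer of parallelism) is the `svv_table_info` system view; a sketch:

```sql
-- skew_rows compares the fullest slice to the emptiest: values near 1.0
-- mean the distkey spread rows evenly. unsorted is the percentage of
-- rows that are not in sortkey order.
select "table", diststyle, sortkey1, skew_rows, unsorted
from svv_table_info
where "table" in ('product_page', 'add_to_cart', 'checkout_complete');
```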
Digression: caching, or why you should take my numbers with a grain of salt
Reads from a local disk are slow; reads from S3 are slower still. As a result, computer systems aggressively cache data blocks in memory, so that they don’t have to re-read those blocks from disk. And Redshift goes one step further: it caches query results. If you execute the same query twice, the second run typically takes only a few milliseconds, regardless of how long the first run took.
You can disable the query cache, but not the block cache (since it’s part of the underlying operating system). That means that the cached data of one test query might impact the performance of the next. To compensate for this, I adopted the rather extreme approach of rebooting the Redshift provisioned clusters between queries. I also show timing for a second run without reboot, to give a sense of how the block cache might affect similar queries. I didn’t reboot the serverless cluster, because it was spinning up new workers for each query.
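For the record, disabling the result cache is a per-session setting:

```sql
-- Turn off the query result cache for this session, so that repeated
-- queries are actually re-executed rather than answered from cache.
set enable_result_cache_for_session to off;
```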
I also didn’t reboot between Spectrum queries, on either type of cluster, because most of the work happens in the “Spectrum fleet” rather than on the worker nodes. However, I do show two query times, to give a sense of whether that fleet does any caching of its own. Ditto for Athena, where I have seen timings drop significantly on a second run.
In the real world, you may or may not see the effects of the block cache. If your dataset is much larger than the available RAM on your cluster, and your users are running queries against different subsets of that data, then Redshift will be forced to load its data from disk every time. But if you have control over what queries are run when, then exploiting the cache becomes another knob to turn.
The Queries
The queries are essentially the same as in the prior post, so I won’t go into details on their real-world relevance. You’ll note that the Athena query times differ from those in that post: I reran the queries so that I could get multiple timings.
Query #1: top-10 best-selling products, as measured by cart adds
```sql
select productid, sum(quantity) as units_added
from "ext_parquet"."add_to_cart"
group by productid
order by units_added desc
limit 10;
```
| Platform | First Run (s) | Second Run (s) |
|---|---|---|
| Athena Parquet | 0.875 | 0.826 |
| Spectrum, 4 nodes | 5.105 | 2.658 |
| Spectrum, 8 nodes | 3.359 | 3.785 |
| Spectrum, Serverless | 2.221 | 1.707 |
| Provisioned, 4 nodes | 0.493 | 0.340 |
| Provisioned, 8 nodes | 0.307 | 0.279 |
| Serverless | 0.336 | 0.115 |
This is the type of query that should scale almost linearly with the number of workers. But while the 8-node cluster is indeed faster than the 4-node cluster, the speedup isn’t linear. There are a couple of possible explanations. One is general query overhead, incurred as Redshift parses the query and builds the execution plan. A more likely cause is the final aggregation step, which happens on the (single-threaded) leader node. I’ll dig into this a little more in a future post.
What’s more striking to me is the difference between Athena and Redshift Spectrum. They’re different execution engines, so I should expect different performance, but it surprised me just how different they were. When I look at the execution plan, it appears that the Spectrum query brings intermediate results into the Redshift engine for the final aggregation.
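You can see this for yourself by prefixing the query with `explain`: in the plans I’ve seen, the work delegated to the Spectrum fleet shows up as `S3 Query Scan` steps, with the final aggregation sitting above them in the Redshift engine:

```sql
-- Show the plan without executing the query; steps handled by the
-- Spectrum fleet appear as S3 Query Scan nodes.
explain
select productid, sum(quantity) as units_added
from "ext_parquet"."add_to_cart"
group by productid
order by units_added desc
limit 10;
```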
Query #2: best-selling products in specific time range
```sql
select productid, sum(quantity) as units_added
from "ext_parquet"."add_to_cart"
where "timestamp" between to_timestamp('2023-05-03 21:00:00', 'YYYY-MM-DD HH24:MI:SS')
                      and to_timestamp('2023-05-03 22:00:00', 'YYYY-MM-DD HH24:MI:SS')
group by productid
order by units_added desc
limit 10;
```
| Platform | First Run (s) | Second Run (s) |
|---|---|---|
| Athena Parquet | 1.017 | 0.826 |
| Spectrum, 4 nodes | 4.565 | 2.811 |
| Spectrum, 8 nodes | 3.544 | 2.274 |
| Spectrum, Serverless | 2.157 | 2.059 |
| Provisioned, 4 nodes | 0.723 | 0.292 |
| Provisioned, 8 nodes | 0.453 | 0.102 |
| Serverless | 0.148 | 0.078 |
The surprise in this case was just how fast serverless Redshift is, relative to either of the provisioned clusters. I have a feeling this was due to caching, but have no data to back that up.
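If you want to check for result-cache hits (this won’t tell you anything about the block cache), the `svl_qlog` system view records a `source_query` for any query answered from the cache; a sketch:

```sql
-- Recent queries for the current user; a non-null source_query means
-- the result was served from the cache of that earlier query.
select query, substring, elapsed, source_query
from svl_qlog
order by starttime desc
limit 10;
```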
Query #3: products that are often viewed but not added to a cart
In my last post, I made a snide comment about how Athena did not let you use select-list expression aliases in other expressions, unlike other SQL databases. I have to eat those words: Redshift, which is compatible with Postgres 8.x, doesn’t allow you to do that either, nor does it let you use those aliases elsewhere in the query, such as an `order by` clause (for the record, Postgres 12 does). So I had to rewrite this query with the aggregation as a subquery.
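For illustration, this is a reconstruction (not my exact original) of the aliased form that gets rejected:

```sql
-- Rejected: "views" and "adds" are select-list aliases, referenced
-- here in another select-list expression and in the order by clause.
select pp.productid,
       count(distinct pp.eventid) as views,
       count(distinct atc.eventid) as adds,
       (views - adds) as diff
from "ext_parquet"."product_page" pp
join "ext_parquet"."add_to_cart" atc
  on atc.userid = pp.userid
 and atc.productid = pp.productid
group by pp.productid
order by diff desc
limit 10;
```

Here’s the rewritten version: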
```sql
select productid, (views - adds) as diff
from (
       select pp.productid as productid,
              count(distinct pp.eventid) as views,
              count(distinct atc.eventid) as adds
       from "ext_parquet"."product_page" pp
       join "ext_parquet"."add_to_cart" atc
         on atc.userid = pp.userid
        and atc.productid = pp.productid
       group by pp.productid
     )
order by 2 desc
limit 10;
```
| Platform | First Run (s) | Second Run (s) |
|---|---|---|
| Athena Parquet | 4.172 | 3.304 |
| Spectrum, 4 nodes | 48.513 | 34.852 |
| Spectrum, 8 nodes | 33.361 | 23.478 |
| Spectrum, Serverless | 57.483 | 44.159 |
| Provisioned, 4 nodes | 21.693 | 16.298 |
| Provisioned, 8 nodes | 15.064 | 11.704 |
| Serverless | 5.973 | 5.947 |
Athena really impressed me here, as did Redshift Serverless. As I noted in my previous post, this query touches almost all of the columns from the involved tables, so there’s little benefit from columnar storage. However, the distribution key configuration should have allowed the join to proceed in parallel on Redshift, versus the redistribute-and-merge required by Athena. Yet Redshift took several times as long.
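`explain` is again useful if you want to see what Redshift actually did with the join: a `DS_DIST_NONE` attribute on the join step means no redistribution was needed, while `DS_BCAST_INNER` or `DS_DIST_BOTH` mean rows moved between nodes. A sketch, against the local tables:

```sql
-- Check the join step in the plan output: its DS_* attribute shows
-- whether rows were redistributed or broadcast to perform the join.
explain
select pp.productid,
       count(distinct pp.eventid) as views,
       count(distinct atc.eventid) as adds
from product_page pp
join add_to_cart atc
  on atc.userid = pp.userid
 and atc.productid = pp.productid
group by pp.productid;
```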
Query #4: customers that abandoned carts
```sql
select count(distinct user_id) as users_with_abandoned_carts
from (
       select atc.userid as user_id,
              max(atc."timestamp") as max_atc_timestamp,
              max(cc."timestamp") as max_cc_timestamp
       from "ext_parquet"."add_to_cart" atc
       left join "ext_parquet"."checkout_complete" cc on cc.userid = atc.userid
       group by atc.userid
     )
where max_atc_timestamp > max_cc_timestamp
   or max_cc_timestamp is null;
```
| Platform | First Run (s) | Second Run (s) |
|---|---|---|
| Athena Parquet | 4.441 | 3.532 |
| Spectrum, 4 nodes | 15.504 | 14.070 |
| Spectrum, 8 nodes | 12.782 | 9.443 |
| Spectrum, Serverless | 21.729 | 19.345 |
| Provisioned, 4 nodes | 5.805 | 5.567 |
| Provisioned, 8 nodes | 4.469 | 4.206 |
| Serverless | 1.828 | 1.712 |
This query, with a single-column join on the distribution key and limited columns accessed, should highlight Redshift’s strengths. And it didn’t disappoint, but Athena still came out on top.
Conclusions
Let’s get the elephant in the room out of the way: this dataset isn’t “big data.” It might have been 10 years ago, and it definitely would have been 20 years ago. But today, I can load this dataset into a Postgres server running on my laptop, and it performs respectably … at least for the single-table queries. However, I think that this dataset, limited as it is, serves as a useful model of how real “big data” queries will perform. The numbers may scale differently for different datasets, but you should be able to say “this is an appropriate technology for this job.”
Athena impressed me; it matched or beat Redshift in every test. I think it was able to exploit parallelism more effectively. As you’ll recall from the previous post, the `PRODUCT_PAGE` data was stored in 120 Parquet files, `ADD_TO_CART` in 37, and `CHECKOUT_COMPLETE` in 33. I don’t know how many workers Athena will assign to a query; it may not be one per file, but I’m willing to bet that it’s more than the 4 or 8 nodes of the Redshift cluster. However, as I noted in the last post, my use of relatively large Parquet files plays to Athena’s strengths; if you have a lot of small files, loading them into Redshift and running your queries there would be a better choice.
I was initially concerned about Athena’s pricing model: it seemed that $5 per terabyte scanned could add up quickly. But as I thought about it, looking back at some of the projects I’ve worked on, it seemed like less of an issue. Sure, you may have many terabytes of data in your data lake, but do you query all of that data all of the time? In my experience, the answer is no, especially if you have partitioned the data. Moreover, if you wanted to store a single terabyte in a provisioned Redshift cluster, you’d need at least an 8-node `dc2.large` cluster, which (at $0.25 per node-hour) would cost you $48/day to run. That’s enough to scan 9 terabytes with Athena.
I was also impressed by Redshift Serverless, although I think it’s an apples-to-oranges comparison with the provisioned clusters. For those, I used the `dc2.large` node type, which provides 2 vCPUs and 15 GB of RAM. For serverless, I used the minimum of 8 “Redshift Processing Units” (RPUs). I can’t find any definition of what an RPU represents; it doesn’t match cleanly to any of the existing node types, although I suspect that it’s a variant of an `ra3`. It definitely performed better than the 8-node provisioned cluster.
I think Redshift Serverless will also be a win in terms of cost. In the US regions an RPU-hour costs $0.36, which works out to approximately $2,100/month for the 8-RPU configuration running around the clock (storage costs are negligible: under $25 per terabyte-month). That’s 50% more than an 8-node provisioned cluster, but the serverless cluster probably won’t be running 100% of the time. If there’s no activity outside business hours, you’ll actually pay around $700/month, which is less than the 4-node provisioned cluster.
Beware, however, that the default allocation for a new serverless workgroup is 128 RPUs. Unless you know that you need that much capacity, pick a lower value, or you’ll be paying $47/hour rather than $3.
Redshift Spectrum was the big disappointment in these tests. I always assumed that it was using Athena behind the scenes. Clearly, it’s not. I still think it’s an invaluable tool for getting data into your cluster (versus using a `COPY` command), but I would steer clear of it for actual queries. For the same price of $5 per terabyte scanned, Athena will give you much better performance.
The TL;DR is that, knobs or not, Athena with Parquet is a cost-effective, performant choice for big data analytics.