Amazon Redshift’s launch in 2012 was one of the “wow!” moments in my experience with AWS. Here was a massively parallel database system that could be rented for 25 cents per node-hour. By comparison, in the early 90s I worked with a similar system that had 64 nodes, a then-astronomical 512 GB of disk, and cost three million dollars. And the financial services company that I worked for at the time thought it was a bargain, because it could run analysis queries that no contemporary Oracle or Sybase system could even attempt.
Here we are in 2021, and AWS has just announced Redshift Serverless, in which you pay for the compute and storage that you use, rather than a fixed monthly cost for a fixed number of nodes with a fixed amount of storage. And for a lot of use-cases, I think that’s a great idea.
Unfortunately, there’s not a lot of information out there. AWS released a blog post that announced the service, and if you’re signed up for re:Invent virtual, the ANT216 session goes into a little more detail, including comparisons with (the existing) Provisioned Redshift. There’s also a new section in the Cluster Management guide, published after I started writing this post.
As I said, I think it’s a great idea, and one that it could be useful to several of Chariot’s clients, so I decided to spend some of the $500 in “new user” credits that AWS provides and kick the tires. I don’t currently have access to a production-scale dataset, so my performance numbers are based on dummy data and should be taken with a grain of salt. Instead, I look at the user experience, from the perspective of a person who’s been working with Redshift for many years.
Decision Support versus OLTP, or “Why Redshift”
Before I get started, let’s set some context. If you’re looking at Redshift with the background of an online transaction processing (OLTP) database developer, it may seem very strange: where are the indexes?
An OLTP application typically “touches” only a few rows for each transaction. For example, if you have an eCommerce application and a customer adds an item to their cart, this translates into a single insert operation: add a row to the
USER_CART table. That row is probably very small: it contains the user’s ID, the ID of a row in the
PRODUCT table, and a quantity. Your database will be set up with foreign-key constraints to ensure that your code can’t insert a row with an invalid user or product ID.
In a decision support database, by comparison, you tend to write queries that summarize entire tables, perhaps with joins to other tables. For example, “select total revenue by month for the last year, stratified by how long the user had been a customer at the time of their first transaction.”
These sorts of queries require a completely different execution engine than that used for OLTP. For one thing, most decision support queries start with a table-scan, something that’s anathema to OLTP developers. Joins are typically merges, rather than the index-based “nested loops” of an OLTP database: a decision support database take two large rowsets, orders them by the join column, and then merges the rows where that column matches.
To make this performant, decision support databases are built on two main abstractions:
- Columnar data storage.
All relational database deal in terms of rows and columns. An OLTP database stores all of the columns for a single row together, while a decision support database stores each column separately, as a distinct list of values. If you need two columns from the same table, you look at the same position in two lists.
This has big benefits for both performance and storage efficiency. Most decision support queries access only a few columns in a table, and benefit by not reading data for other columns. For example, you might want to retrieve a year’s worth of orders based on their timestamp. A typical row-based OLTP database would recognize that an index doesn’t help that query, so would perform a table-scan instead. But each row in that OLTP table might be a few hundred bytes, which translates to reading tens of gigabytes from disk for that scan. By comparison, a columnar storage database keeps keeps all of the timestamps together, at a cost of roughly eight bytes per value. It can scan all of the timestamps in the table by reading a few tens of megabytes from disk.
- Distributed data
Even with columnar data storage, it takes time to scan large tables. Decision support databases attack this problem with a “divide and conquer” approach: the “database” is a cluster of multiple compute/storage nodes, and each table is distributed across the nodes. So if you have a billion row table and an 8 node cluster, each node holds 125,000,000 rows. If you need more performance, you can increase the size of the cluster and performance scales linearly.
Distributed data introduces many new challenges for database physical design. For example, to make joins performant, you need to ensure that both tables are distributed in the same way; if they aren’t, then one (or both!) need to be “shuffled” to other nodes in the cluster.
This may be more than you want to know about decision support databases (and believe me, I trimmed it down!), but I think it’s important to understanding the rest of this post. Redshift uses both techniques, but Redshift Serverless hides some of the details from you, in ways that I think might be disturbing to those already familiar with Redshift.
Difference between Provisioned and Serverless
In a traditional, “provisioned” Redshift cluster, the size of your cluster largely depends on the size of your data. This makes sense: the larger your data, the more nodes that you want in order to execute queries in parallel.
If all of your data is “active,” meaning that any given query might fully scan any given table, you typically use a “dense compute” (DC2) node type, which has a fixed allocation of SSD disk per node. If, instead, your workloads typically only access a fraction of your data (typically the most recent), you might choose a “dense storage” (DS2) node type, which stored data on hard drives and relied on caching to improve performance. In either case, you would divide your total amount of data by the storage size of each node, and then add nodes as needed to improve performance.
The RA3 node type, introduced at the end of 2019, was intended replace the DS2 node type, and uses a combination of attached SSD for “hot” data and S3 for “cold” data. This node type also separated the compute and storage components of a Redshift cluster: each node could manage up to 128 TB of data, so it was more important to pick the number of nodes based on compute requirements than on data size.
Redshift Serverless takes the split between compute and storage one step further: there is no (published) limit to the amount of storage that you can assign to the cluster, and it’s no longer tied to nodes. In fact, “nodes” don’t exist in a Redshift Serverless database. Instead, you configure “Redshift Processing Units,” which can automatically scale based on your query workload (the exact behavior is a little unclear from the documentation; based on my research, you assign a “base” RPU level, but Redshift may use more or less depending on the queries that you run).
This is a big change, and an even bigger one is that you pay for those RPUs only when you’re actively using the database. So if you don’t run any queries overnight, you don’t pay for an idle cluster.
As I said above, I don’t currently have access to a production-scale database for testing. Instead, I used a dataset that we use for the Chariot Data Engineering interview. This data is intended to replicate eCommerce “clickstream” data for different user activities: visiting a product page, adding that product to a cart, changing its quantity, starting the checkout flow, and completing the checkout flow.
To give you a sense of size, there are approximately 14.3 million product page events, 4.5 million add-to-cart events, and 2.3 million checkout-complete events – relatively small data for a data warehouse.
Our interview challenge has the following three queries:
- Top-10 viewed products.
This is a simple aggregation of the
- Abandoned carts.
This finds users that have added items to a cart but not completed checkout. It’s an outer join between the
- Total quantity of products sold, by product.
This is a complex query that involves window functions and multiple common table expressions. It’s modeled on some of the more interesting queries that I wrote for clients.
I spun up a Redshift Serverless endpoint with 32 base RPUs, and a Redshift Provisioned cluster with 4 dc2.large nodes. According to this doc an RPU is 2 virtual CPUs and 16 GB of RAM, which is roughly equivalent to a single dc2.large node, so the Serverless cluster should be twice as powerful as the Provisioned one. But with the small dataset that I used, an 8 node provisioned cluster would quite possibly give worse results. And I think the Serverless cluster agrees with that: while I set a base of 32 RPUs (the smallest you can do), it only used 12.8 for my queries.
I ran each query twice, and used the timing information reported by the
psql client. You’ll note that the first and second runs have dramatically different times. Redshift works by building and then compiling an actual program for each query, which is then shipped to the nodes for execution. This step takes an extremely long time the first time Redshift sees a query, but is aggressively cached for future executions. If you’re using Redshift for “dashboard” queries, you should see performance closer to the “second run” times.
Without further ado, the numbers.
|First Run||Second Run|
|Top-10 Products||5,344 ms||5,017 ms||3.4 ms||3.9 ms|
|Abandoned carts||6,533 ms||4,860 ms||4.1 ms||4.6 ms|
|Quantity of items sold||11,498 ms||20,193 ms||8.9 ms||7.8 ms|
And the conclusion: at least for the data sizes that I used, Redshift Serverless provides approximately the same performance as Redshift Provisioned. The biggest difference was the first-run time for a complex query, which I believe to be all about compilation time.
Things to be aware of
If you plan to try out Redshift Serverless in your environment, here are some things to think about first. See also the Known Issues and Limitations section of the Cluster Management guide.
You can only have one serverless endpoint for each AWS account … and it lasts forever
I don’t know if this will always be the case, or if it’s part of the preview functionality. It seems very shortsighted to me, as I’ve worked on several projects that had multiple Redshift clusters containing different data.
More concerning, if you’re just planning to experiment, is that there’s no way to delete the endpoint once you’ve created it – at least, there’s no way to do so from the Console. While you won’t be charged for the database when it isn’t running, you will be charged for storage. So be sure to drop your tables once you’re done.
And beware: if you’ve enabled logging, system operations will be written to the log even if you’re not actively using the database. The volume isn’t high, but it does add to your CloudWatch costs. You can disable (or enable) these logs from the “Data access” tab in the Console.
Limited CloudWatch metrics
With a Provisioned Redshift cluster, you can see per-node statistics such as CPU utilization and disk space. These are very useful tools to identify problem queries: for example, if the disk consumption is higher on node “Compute-0” relative to other nodes, you have a strong indication that the distribution key of your table or intermediate result has a high percentage of null values.
This information doesn’t exist for Redshift Serverless. Instead, you can see the number of queries that are running or completed, broken down by query type. Interestingly, even though you can get RPU consumption from the Redshift Serverless Console page, that’s not a metric that’s available from CloudWatch.
Diagnostic tables/views may not be available
Provisioned Redshift provides a plethora of system tables and views that give you insight into how your cluster is operating. Over time, many people (myself included) accumulate a collection of useful queries that use these system tables. Unfortunately, many of them are not supported by Redshift Serverless, and attempting to query them returns a “permission denied” error.
In some cases, there are replacements. For example,
SYS_QUERY_HISTORY has much the same content as
STL_QUERY. In other cases, such as
STV_BLOCKLIST, the information may not be relevant to a Serverless environment. But if you’ve written monitoring scripts that are based on these system tables, they’ll need to be rewritten for Serverless.
Redshift Spectrum is not supported … or is it?
As-of this writing, the documented list of features explicitly calls out Redshift Spectrum as unsupported. However, the billing section of that page says that Spectrum queries are priced in RPUs, just like normal queries. Moreover, I was able to create and successfully query an external table.
I’m going to assume that the list of features is inaccurate, or perhaps hasn’t been updated since Spectrum support was added. But if Redshift Spectrum is a key component of your deployment, beware that there may be limitations.
Does Redshift Serverless make sense for you?
In my experience, decision support databases tend to have bursty usage: scheduled jobs to aggregate and analyze new data, with occasional queries from dashboards or ad hoc users. This pattern is perfect for Redshift Serverless: not only do you not pay compute charges when there aren’t active queries, but it will scale as needed to match the size of your bursts. Overall, that should lead to both lower costs and better performance.
Of course, as of this writing Redshift Serverless is still in “preview” mode: it’s not available in all regions, and its capabilities may change. So I wouldn’t jump to using it in production.
But it’s easy to create (or overwrite) a serverless database with a snapshot from a provisioned database, so I’d definitely recommend trying it out in a development environment. Especially given the $500 in “free trial” credits that’s available for new users.