AWS describes availability zones as “isolated locations within each Region,” and for a long time, Amazon’s official take on the proper number of availability zones was that you should “[d]ivide your VPC network range evenly across all available Availability Zones (AZs) in a region.”
But is there a good reason to do that? In this post I’m going to look at why you might choose different numbers of AZs for your deployment.
The Cost of an Availability Zone
While availability zones do not have an inherent charge, they are not free to use.
First, because a multi-AZ deployment increases the complexity of your deployment scripts, and therefore the cost to maintain them. In 99% of cases this is minimal: you define lists of subnets and use the appropriate list in your resource definitions. The core infrastructure scripts, however — the ones that set up your subnets and routing tables — can’t avoid becoming more complex. Fortunately, you don’t often touch those.
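The subnet math behind "divide your VPC range evenly" is mechanical, which is part of why the per-AZ scripting cost stays small. A minimal sketch using Python's `ipaddress` module (the function name and the /16 VPC range are illustrative, not from any AWS tooling):

```python
import ipaddress

def split_vpc_cidr(vpc_cidr: str, az_count: int) -> list:
    """Divide a VPC CIDR evenly, one subnet per availability zone.
    Splits into the next power of two and keeps the first az_count blocks."""
    prefix_diff = (az_count - 1).bit_length()  # 4 AZs: a /16 becomes four /18s
    subnets = ipaddress.ip_network(vpc_cidr).subnets(prefixlen_diff=prefix_diff)
    return [str(s) for s in list(subnets)[:az_count]]

# A /16 VPC divided across four AZs yields four /18 blocks.
print(split_vpc_cidr("10.0.0.0/16", 4))
```

The complexity the post mentions isn't this calculation; it's threading the resulting subnet lists through route tables and resource definitions.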
Second, because you pay for data transferred between availability zones. This costs only pennies per gigabyte, but it can add up quickly, and it may not be obvious that it's happening. For example, you can easily generate gigabytes of log messages per day; if you send them all to a single fluentd server, you'll be billed for any that cross an availability zone to get there (and if you have one fluentd server per AZ, see point #1 above).
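Some back-of-the-envelope arithmetic makes the log-shipping charge concrete. The volume below is hypothetical, and the rate assumes the commonly quoted $0.01/GB charged in each direction for inter-AZ traffic; check current AWS pricing for your region.

```python
# Illustrative cost of shipping logs across an AZ boundary.
GB_PER_DAY = 5           # hypothetical log volume
RATE_PER_GB = 0.01 * 2   # assumed: $0.01/GB billed on both sending and receiving side

monthly_cost = GB_PER_DAY * 30 * RATE_PER_GB
print(f"${monthly_cost:.2f} per month")
```

A few dollars a month for one log stream sounds trivial, but the same meter runs on every chatty cross-AZ service call, which is how the bill sneaks up on you.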
Lastly, if you follow the standard VPC configuration of public and private subnets, you’ll probably need some way for instances in those private subnets to access the Internet. There are several ways to accomplish this: VPC Endpoints, NAT Gateways, and NAT Instances. I could write an entire blog post about the cost considerations of these options (in addition to the one I already wrote!), but assuming you pick a NAT Gateway you’ll pay $0.045 per hour, per Gateway, for US regions.
That’s not much: around $33 per month. But you need a separate NAT for every availability zone where you have private networks (yes, you do, unless you want to wake up to change route tables if the one AZ that holds your NAT goes down). And when you multiply that by the number of deployment environments your business uses, you’re talking about hundreds of dollars a month, just for NATs.
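The multiplication is worth writing out, because each factor looks harmless on its own. A sketch using the $0.045/hour figure from above; the AZ and environment counts are examples, not a recommendation:

```python
HOURLY_RATE = 0.045     # per NAT Gateway, US regions (hourly charge only;
                        # per-GB data-processing fees are billed on top)
HOURS_PER_MONTH = 730

per_nat = HOURLY_RATE * HOURS_PER_MONTH   # roughly $33/month
azs = 3                                   # one NAT per AZ with private subnets
environments = 4                          # e.g. dev, test, staging, prod

total = per_nat * azs * environments
print(f"${per_nat:.2f} per NAT, ${total:.2f}/month across all environments")
```

Twelve NATs at ~$33 each is nearly $400 a month before a single byte of traffic is processed.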
OK, with that in mind, let’s look at the options.
One: You Only Live Once!
This is not as crazy as it sounds.
For one thing, certain services and/or workloads can’t benefit from multiple AZs. Elastic Map Reduce (EMR) is one: all EMR instances must run in the same subnet, and that’s part of the cluster configuration. To support multiple availability zones, you need to manage multiple cluster configurations, or be able to spin up a new cluster quickly. In many cases, the AZ will be back online before you do that. Moreover, EMR jobs tend not to be time-critical, so if an availability zone is down you can just wait for it to come back up.
CI/CD pipelines are another case where one availability zone may be better than more. The reason is that deployment bundles tend to be quite big, primarily due to the number of dependencies that go into them. If you’re using a local repository server to manage those dependencies, you don’t want to be paying cross-AZ charges for every build. And, like EMR jobs, builds usually can be delayed without serious impact.
Taking this one step further, do your development environments need to be multi-AZ? Most of a developer’s work happens on their laptop; they use the cloud only for test deployments and integration. If that AZ happens to go down, they might not be able to do their testing. But it’s likely that they’ll be spending that time making sure that production isn’t having problems!
And lastly, what about running prod in a single zone? There are, in fact, times when this makes sense as well. For example, about five years ago I set up a Kafka cluster in a single AZ. At the time, ZooKeeper was a core component of Kafka, and there was a lot of debate over how well it could handle the lags of inter-AZ communication. Moreover, we expected enough volume that the inter-AZ data transfer charges would be significant. After discussing the impact of the service being unavailable, we decided that a single AZ was the way to go.
If you come to a similar conclusion for your workload(s), remember Andrew Carnegie’s admonition: “put all your eggs in one basket, and then watch that basket!” Know how long an outage you can sustain, and plan for recovery.
Two: High Availability
OK, maybe you’re not willing to put all of your eggs in one availability zone. AWS makes it easy to spread workloads across availability zones to achieve high availability:
High availability requires at least two availability zones. The idea is that only one zone will go down at a time: the proverbial backhoe cutting power and network cables. Since Amazon isolates the data centers for each availability zone, that backhoe won’t take out more than one AZ.
Multiple availability zones won’t protect you against region-wide outages, like the S3 outage of February 2017, or the Kinesis (and dependent services) outage of November 2020. You’ll need a multi-region strategy to compensate for those outages.
And it’s also important to understand that failover isn’t immediate. TCP/IP networking is designed to be resilient in the face of delays, which means that it takes time for you (or your load balancer) to discover that an AZ is down. That time is measured in seconds, but it’s still there. You may lose transactions when it happens.
Nor is failover transparent. When a database server goes down, all of the connections to that server stop responding. Notice that I didn’t say “close.” Instead, your application discovers that the database is down when it sends a request and doesn’t get a response within a timeout. At that point it either tries to reconnect or throws an exception. And when reconnecting, it may try to access the down server, as I describe here.
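The detect-and-reconnect loop that the application ends up owning can be sketched generically. Everything here is hypothetical scaffolding — `ConnectionLost`, `execute`, and `reconnect` stand in for whatever your database driver actually raises and provides:

```python
import time

class ConnectionLost(Exception):
    """Stand-in for a driver's timeout or broken-socket error."""

def with_reconnect(execute, reconnect, attempts=3, backoff=0.5):
    """Run execute(); on ConnectionLost, call reconnect() and retry.
    Reconnecting may itself land on the dead server, which is why
    it sits inside the retry loop rather than being done once."""
    for attempt in range(attempts):
        try:
            return execute()
        except ConnectionLost:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (attempt + 1))  # back off between tries
            reconnect()
```

The point of the sketch is the shape, not the code: the application, not the infrastructure, decides how long to wait, how often to retry, and what to do when retries run out.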
All of which is a long-winded way of saying that your application shoulders a big part of the responsibility for being highly available. It doesn’t matter how many availability zones you have if you can’t serve traffic.
I’m concerned that I might be sending the wrong message in this section, so I want to wrap up with a clear statement: high availability, implemented using two availability zones, should be considered the minimal deployment for a production system. AZs do go down, and you should be prepared for that. Depending on your business, however, you may want to go above and beyond.
Three: Maintain a Quorum
A few years ago I got into a discussion with a person who vehemently insisted that high availability meant at least three availability zones. It took me a while to realize that he was confusing high availability with maintaining a quorum, so I’ll start this section with some definitions.
- High availability refers to the ability to “fail over” to a second set of servers in case the first becomes unavailable. “Second” is important here: you can add additional sets of servers, to provide a fail-over for the fail-over, but these are not critical to the definition. Note, for example, that “multi-AZ RDS” only supports a single failover.
- Maintaining a quorum refers to the ability for a distributed storage system to have a majority of its members agree that an update occurred. If you have two nodes in such a system, and one goes down, the other node has no way to know if it is the only living node or if it just can’t communicate with the other node, so it can’t do anything. To maintain a quorum in the face of one AZ going down, you need nodes in at least two other AZs.
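The majority rule behind these definitions fits in one line, and writing it out shows why two nodes can't tolerate any failure while three can tolerate one:

```python
def has_quorum(alive_nodes: int, total_nodes: int) -> bool:
    """A strict majority of the cluster must be reachable to make progress."""
    return alive_nodes > total_nodes // 2

# Two nodes: losing either one drops you to 1 of 2 -- no majority.
# Three nodes, one per AZ: any single AZ can fail and 2 of 3 still agree.
```

This is why the quorum argument pushes you to three AZs where the plain high-availability argument only demands two.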
There are many AWS services that maintain a quorum internally, S3 and Aurora being two of the best known. For these services, you don’t have any control over which availability zones AWS uses to store your data; as a user of the service it’s immaterial.
On the other hand, if you’re running a self-managed service such as Apache Cassandra or Apache ZooKeeper, you care very much about the availability zones that a node runs in. In the case of ZooKeeper, it’s enough to ensure that each node is in a separate zone. For Cassandra, or other replicated storage services that might have many nodes, you need to ensure that data exists in multiple zones.
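For a small ensemble like ZooKeeper's, the placement rule amounts to round-robin assignment of nodes to zones. A toy sketch — the function and node names are invented for illustration; in practice Cassandra's rack/AZ-aware replication strategy handles placement for you:

```python
def assign_azs(node_count: int, azs: list) -> dict:
    """Spread cluster nodes round-robin across availability zones,
    so each zone holds roughly the same share of the cluster."""
    return {f"node-{i}": azs[i % len(azs)] for i in range(node_count)}

print(assign_azs(5, ["us-east-1a", "us-east-1b", "us-east-1c"]))
```

For a replicated store the extra requirement is that each piece of *data*, not just each node, ends up in multiple zones, which is what the replication strategy enforces.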
In terms of high availability, three zones in a region do not give you significantly more availability than two: two backhoes cutting power lines is an attack, not an accident. With AWS deployments, a third zone also means that you take on more operational responsibility. For example, rather than using the built-in “multi-AZ” synchronous replication that RDS provides, you need to explicitly create multiple replicas (at least one per AZ) and then manually promote a replica if the primary zone goes down (or use Aurora, which does this for you).
None: Serverless For The Win?
It’s possible to create a complete web application without involving a VPC at all: CloudFront, S3, API Gateway, Lambda, Cognito, and DynamoDB. Or, if you need a relational database, replace DynamoDB with Aurora Serverless and the RDS Data Service API. This is, to be honest, not an option that works for everyone: large numbers of small database transactions will take a noticeable performance hit. But for low-volume applications, it removes the entire question of how many availability zones you need.