Troubleshooting ECS Deployments


At Chariot, we’re fans of deploying on Amazon’s Elastic Container Service using Fargate. Container-based deployments simplify developers’ lives, because they can run the same thing on their desktop as in production. For operations, it removes much of the worry about unpatched vulnerabilities … as long as developers use up-to-date images. And for people familiar with the AWS ecosystem, ECS fits neatly into the slot traditionally occupied by EC2.

However, there are a few stumbling blocks for ECS deployments. In this post I cover some of the problems that I’ve seen, and the techniques that I use to avoid tripping over them.

Things that go wrong

No access to the Internet

Functionally, a container image is like a “pre-baked” EC2 AMI: both contain everything that they need to run and, at least theoretically, should be deployable in a subnet that has no Internet access. However, while theory and practice match in the case of EC2, they don’t for ECS: you need to access several AWS services to launch your containers.

This means that either (1) you give your containers public IP addresses, which is useful for development, but not something you’d want to do in production; (2) you create five VPC endpoints, plus additional endpoints to support the AWS services that you use; or (3) you use a NAT Gateway. The last choice is the easiest, but beware the cost of launching lots of large containers: you pay a per-GB charge to read the images.
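If you go the endpoint route, here’s a minimal sketch of what two of those endpoints look like in CloudFormation, assuming Fargate tasks that pull images from ECR; the VPC, subnet, security group, and route table references are placeholders:

EcrApiEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId:              !Ref Vpc
      ServiceName:        !Sub "com.amazonaws.${AWS::Region}.ecr.api"
      VpcEndpointType:    Interface
      PrivateDnsEnabled:  true
      SubnetIds:          [ !Ref PrivateSubnetA, !Ref PrivateSubnetB ]
      SecurityGroupIds:   [ !Ref EndpointSecurityGroup ]

# repeat for ecr.dkr and logs (and secretsmanager, if you use it), plus a
# Gateway endpoint for S3, which is where ECR stores the image layers
S3Endpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId:            !Ref Vpc
      ServiceName:      !Sub "com.amazonaws.${AWS::Region}.s3"
      VpcEndpointType:  Gateway
      RouteTableIds:    [ !Ref PrivateRouteTable ]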

If you can’t get to the Internet, your task will fail with a rather wordy message, displayed in the task’s detail page in the Console. In fact, two wordy messages, depending on whether you’re attempting to pull an image from ECR or from a public repository:

ResourceInitializationError: unable to pull secrets or registry auth: The task cannot pull
registry auth from Amazon ECR: There is a connection issue between the task and Amazon ECR.
Check your task network configuration. RequestError: send request failed caused by:
Post "https://api.ecr.us-east-1.amazonaws.com/": dial tcp 44.213.79.10:443: i/o timeout

CannotPullContainerError: pull image manifest has been retried 5 time(s): failed to resolve ref
docker.io/library/httpd:2.4: failed to do request: Head "https://registry-1.docker.io/v2/library/httpd/manifests/2.4": 
dial tcp 3.219.239.5:443: i/o timeout

In both cases, the relevant words are “i/o timeout”: ECS tries to open a connection to the repository, but the request never gets to its destination.

You’re unlikely to have a timeout in a production environment, because those are usually set up with NATs. It’s far more likely to occur in a development environment, where you might omit NATs in order to save money. In that case, you’ll need to deploy into a public subnet with a public IP address.

x86 versus ARM

Ten years ago the world was simple: wherever you went, there was an Intel-compatible (x86) machine. Your laptop ran x86, your build server ran x86, your deployment server ran x86, and you could run the same image on all of them. Then ARM made its way into the data center, and perhaps more important, onto developers’ desktops in the form of Apple Silicon.

If you deploy an image with the wrong architecture, the task stops and you’ll see the message “Essential container in task exited.” This is the same message that you’ll see if the task has a fatal error, or if it stops normally. To discover the problem you need to go one step further, to CloudWatch Logs:

exec /usr/local/bin/httpd-foreground: exec format error

The relevant words are “exec format error”; the rest of the message indicates the program (in this case the Apache HTTP server) that failed.

Again, this is typically an issue with development, when developers build images and push them directly to a shared repository. Moving builds off the developer workstation and onto a build server, such as CodeBuild or GitHub Actions, will ensure that it never bites you.
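If builds do stay on developer laptops, another option is to build explicitly for the target architecture. A sketch using Docker’s buildx from an Apple Silicon machine (the repository URI is a placeholder):

docker buildx build \
    --platform linux/amd64 \
    --tag 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest \
    --push \
    .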

Incorrect privileges in Task Execution Role

As of this writing, AWS has 34 services that use some form of an “execution” role. In most of those services, the “execution” role grants permissions to whatever is executing (such as a Lambda function). Not ECS. Instead, the ECS task execution role is the role that ECS uses to launch a task. The executing task uses the task role.

If you add permissions to the task execution role rather than the task role, then you’ll get a permission error in your application; this is fairly easy to track down and fix. On the other hand, if you add permissions to the task role rather than the task execution role, then your task won’t start, and the messaging is rather confusing.

AWS provides the managed policy AmazonECSTaskExecutionRolePolicy, which grants permissions to retrieve images from ECR, create a log stream in CloudWatch Logs, and write log messages to it. This policy doesn’t grant excessive privileges, so it makes sense to include it in your task execution role rather than granting the individual privileges via an inline policy. But in many cases, this predefined policy is not enough.

For example, if you populate task environment variables from secrets using ValueFrom (which you should, if you use environment variables for configuration), the execution role requires secretsmanager:GetSecretValue for the secrets you use; if it doesn’t have that permission, you’ll see a message like this:

ResourceInitializationError: unable to pull secrets or registry auth: execution 
resource retrieval failed: unable to retrieve secret from asm: service call has 
been retried 1 time(s): failed to fetch secret arn:aws:secretsmanager:us-east-1:123456789012:secret:DefaultSecret-Z89PLm 
from secrets manager: AccessDeniedException: User: arn:aws:sts::123456789012:assumed-role/ECS-Test-TaskExecutionRole-us-east-1/2707875552b74681925ba8f676aa24ed 
is not authorized to perform: secretsmanager:GetSecretValue on resource: arn:aws:secretsmanager:us-east-1:123456789012:secret:DefaultSecret-Z89PLm 
because no identity-based policy allows the secretsmanager:GetSecretValue action status code: 400, request id: b5ece521-1add-4438-b50e-9bcb25dc39c5

This message looks a lot like the earlier message “unable to pull secrets or registry auth.” If you stopped reading there, you might go down the dead-end path of checking your network configuration. Instead, you have to read on to find the AccessDeniedException in the middle of the message, which tells you that the task execution role is not authorized to retrieve the secret value.
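As a sketch of the fix, assuming a CloudFormation template that defines the secret as a resource named DefaultSecret: the container definition references the secret with ValueFrom, and the task execution role gets an inline policy that allows reading it.

# in the container definition: populate an environment variable from the secret
Secrets:
  - Name:       DB_PASSWORD
    ValueFrom:  !Ref DefaultSecret

# on the task execution role: allow ECS to retrieve that secret at launch time
Policies:
  - PolicyName: RetrieveConfigSecrets
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect:   Allow
          Action:   secretsmanager:GetSecretValue
          Resource: !Ref DefaultSecret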

Failed health checks

I’m torn when it comes to load balancer health checks. On the one hand, it’s nice to have a clear signal that the target instance is up and running (or not). On the other, simply hitting a URL and checking the status code doesn’t provide a lot of information. And the penalty for failing a health check — terminating the task — gets in the way of diagnosing problems.

Health checks also introduce lag into your service startup: the target group must see at least two successful checks before it considers a target healthy; the default is five. And these health checks are invoked on a schedule, defaulting to 30 seconds between checks.
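Both numbers are configurable on the target group. A minimal CloudFormation sketch, assuming an HTTP service with a /health endpoint:

TargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      TargetType:                  ip          # required for Fargate tasks
      Protocol:                    HTTP
      Port:                        80
      VpcId:                       !Ref Vpc
      HealthCheckPath:             /health
      HealthCheckIntervalSeconds:  15          # default 30
      HealthyThresholdCount:       2           # default 5
      UnhealthyThresholdCount:     2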

But the big problem with health checks is how they interact with CloudFormation or other infrastructure-as-code (IaC) tools: if the tool waits until it has a successful deployment (and CloudFormation does), then it may never finish. The deployment will enter a loop where the ECS Service brings the task up, only to have it killed by the load balancer’s Target Group.

Detecting a failed health check is simple: the load balancer target group reports a count of healthy and unhealthy targets. Below, I’ll give some tips for mitigation, in particular preventing a respawn loop.

Misconfigured port mappings

One of the things that can cause a health check to fail is if the target group attempts to contact the service instance on the wrong port. I call this out because there are four places that you need to configure the port number: the load balancer target group, the ECS Service load balancer configuration, the security group attached to the service, and the task definition. It’s easy to miss one of those, especially if you’re copying an IaC script.

However, if you are using IaC, you can avoid misconfiguration by defining the port number as a parameter rather than hardcoding it in every place that needs it. And if you reuse templates for different deployments, remember that different frameworks expect different port numbers: a configuration for Apache (port 80) won’t work for NodeJS (port 3000). That’s burned me more than once.
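As a sketch (resource names are illustrative; the same parameter also belongs in the target group’s Port and the service security group’s ingress rule):

Parameters:
  ContainerPort:
    Type:     Number
    Default:  80

Resources:
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      ContainerDefinitions:
        - Name:  app
          Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
          PortMappings:
            - ContainerPort: !Ref ContainerPort

  Service:
    Type: AWS::ECS::Service
    Properties:
      LoadBalancers:
        - ContainerName:  app
          ContainerPort:  !Ref ContainerPort
          TargetGroupArn: !Ref TargetGroup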

Making life difficult: disappearing tasks

Tasks disappear in two ways. First, by default the Console only shows running and pending tasks. As soon as the task terminates, for whatever reason, it disappears from the list. You have to remember to select the “Any desired status” filter to see stopped tasks. It’s annoying, especially since you can’t set it as a default. You quickly get in the habit of opening the task’s page while it’s in Pending state, so that you can see what happens to it once it stops.

On a longer time scale – tens of minutes – tasks disappear from the task listing entirely. If you have a production task that fails to start in the middle of the night, there will be no sign of it in the morning.

The only way that I know of to preserve this information is to set up a Lambda (like this one) that responds to task state change events. When a task stops running, the Lambda captures the final status and records it in its own logs.
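If you’d rather wire it up yourself, the trigger side is just an EventBridge rule on ECS task state changes. A sketch, where TaskLoggerLambda is a hypothetical function that writes the event to its own log stream:

TaskStoppedRule:
    Type: AWS::Events::Rule
    Properties:
      Description:  Capture the final status of stopped ECS tasks
      EventPattern:
        source:       [ "aws.ecs" ]
        detail-type:  [ "ECS Task State Change" ]
        detail:
          lastStatus: [ "STOPPED" ]
      Targets:
        - Id:   task-logger
          Arn:  !GetAtt TaskLoggerLambda.Arn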

Tips to make diagnosis easier

Test locally first

One of the best features of containers is that you can run the same environment on your desktop as you do in production. What’s surprising is that a lot of developers don’t take advantage of this. Instead, they run the various components in stand-alone mode during development, then build an image and deploy. While that can provide fast development cycles, you won’t know if your image is usable unless you test it.
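Testing the image doesn’t take much: build it, run it, and hit the same endpoint that the load balancer will. A sketch, assuming the container listens on port 8080 and exposes a /health endpoint:

docker build -t my-app:latest .
docker run --rm -p 8080:8080 my-app:latest

# in another terminal
curl -i http://localhost:8080/health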

If you have a complex environment, with databases and separate front-end and back-end services, look into Docker Compose. But beware that the containers in a Compose file are started concurrently. If your application assumes that the database will be fully operational when it’s started, you will need to introduce a delay script, which makes your configuration more complex and less representative of production.
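For reference, a minimal Compose sketch (image names and credentials are illustrative). Note that depends_on only controls start order, not readiness, so the application still has to cope with a database that isn’t accepting connections yet:

services:
  database:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: local-dev-only
  app:
    build: .
    ports:
      - "8080:8080"
    depends_on:
      - database
    environment:
      DATABASE_HOST: database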

When using IaC, create services with zero tasks and then increase for testing

Infrastructure-as-code tools are a great way to deploy a service on ECS … until you run into an error. That can leave your deployment half-complete, possibly unrecoverable. And waiting for your tool to create a service when its tasks are failing is a waste of time.

The solution is simple: create services with zero tasks. Then, once the infrastructure is fully deployed, increase the task count. Do this manually during development. If you use a CI/CD pipeline to deploy to production, increasing the task count is a matter of a single commit.
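During development, bumping the count is a single CLI call (cluster and service names are placeholders):

aws ecs update-service \
    --cluster my-cluster \
    --service my-service \
    --desired-count 1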

Incidentally, a zero-task service is a great way to have a single script that creates all of your infrastructure, including the ECR repository (which has to be populated before you can run a task that uses it).

Enable a circuit breaker for service deployments

At release, ECS Services would retry failed tasks forever, never becoming active. In 2020, AWS announced ECS circuit breakers, which will either cause a deployment to fail or be rolled back to a previous state. If you create your service via the Console, a circuit breaker is enabled by default; if you create it via CloudFormation, you need to add the following lines to the DeploymentConfiguration:

DeploymentCircuitBreaker:
    Enable:                       true
    Rollback:                     false

You’ll note that I’ve disabled rollback; instead, the deployment will simply fail. This is because — at least in development — you might not have a valid state to roll back to. And in production, CloudFormation will automatically roll back the stack to the previous valid version (which might be a zero-task deployment).

Start tasks outside target group

Failing health checks are one of the most painful cases to debug because you can’t disable them, you don’t get much information about why the health check failed, and the tasks often shut down while you’re doing diagnosis. To make progress, I’ve found that it’s useful to start the task manually, not connected to the target group.

This tells you several things. First, if the task shuts down on its own shortly after startup, you know that the health check isn’t to blame; look for something within the task itself. Second, you can hit the health check endpoint with curl, to see what it actually returns, and update the actual health check as needed. If you’ve deployed into private subnets, you’ll need to spin up an EC2 instance to do this.

When you run a task manually you’ll have the option to attach whatever security group(s) you want. Always pick the group(s) that would be attached to the actual task running in a service; don’t be tempted to use a default group or one that gives you direct access. If you use the same groups as the actual service, you’ll discover when those groups are misconfigured (e.g., the group doesn’t allow access from the load balancer).
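A sketch of that manual launch with the AWS CLI, reusing the service’s own subnets and security group (all names and IDs are placeholders):

aws ecs run-task \
    --cluster my-cluster \
    --task-definition my-task:3 \
    --launch-type FARGATE \
    --network-configuration \
        'awsvpcConfiguration={subnets=[subnet-0abc123],securityGroups=[sg-0def456],assignPublicIp=DISABLED}'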

Turn on access logs at load balancer as well as in container

You should enable load balancer access logs as a matter of course: they’re a way to analyze your traffic by endpoint, including performance. But beware that the load balancer delivers them to S3 every five minutes, so they aren’t a good source of real-time data.
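Enabling them is a matter of setting load balancer attributes; a sketch with the AWS CLI (the ARN, bucket, and prefix are placeholders, and the bucket policy must allow delivery from the ELB log service):

aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/50dc6c495c0c9188 \
    --attributes Key=access_logs.s3.enabled,Value=true \
                 Key=access_logs.s3.bucket,Value=my-alb-access-logs \
                 Key=access_logs.s3.prefix,Value=my-service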

To fill in this gap, you can enable access logging from within the container. Exactly how you do this will depend on the framework that you’re using. Some base images (such as Apache and Nginx) send their logs to the container’s standard output and error by default (which can be annoying, as access logs will be interleaved with application logs).

By comparing the two logs, you can determine whether requests that were accepted by the load balancer ever got to the container. You’ll also see the health checks (and learn that a load balancer actually makes multiple health check calls, one from each availability zone where it’s deployed). Of course, if you don’t see health checks you know that something’s not right between the load balancer and your container.

ECS Exec

When running Docker locally, you can start a shell process inside the container with a command like this (2612f3d49b21 is the target container’s ID):

docker exec -it 2612f3d49b21 /bin/bash

Once you’ve connected to the container, you have root access to all the processes and files contained within. And if you need a tool that isn’t already installed in the image, you can install it. It’s a tremendously powerful debugging feature, one that I often use during development and testing. ECS offers something similar: ECS Exec, which uses Session Manager to manage a connection into the container.

To enable ECS Exec, your container must be able to access Session Manager, either via a NAT or dedicated VPC endpoint. You must grant the task permission to access Session Manager via the task role. And you must explicitly enable ECS Exec when you launch the task.

Unfortunately, you can’t enable ECS Exec when you launch from the Console. You must either launch via a service (and enable it there), or use the command line or ECS API.
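For an existing service, that’s one CLI call plus a new deployment so that replacement tasks pick up the setting (names are placeholders):

aws ecs update-service \
    --cluster my-cluster \
    --service my-service \
    --enable-execute-command \
    --force-new-deployment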

With all these prerequisites met, you establish a connection using the AWS CLI (replacing CLUSTER-NAME and TASK-ID with your deployment’s values):

aws ecs execute-command \
    --cluster CLUSTER-NAME \
    --task TASK-ID \
    --interactive \
    --command "/bin/bash"

There are a few gotchas to this process. The biggest, of course, is that you might not have deployed with ECS Exec enabled; once the task is running, it’s too late to change. You might also find that your image doesn’t have the bash shell; I’ve seen a few images like this, although they all had the older Bourne (sh) shell. And there might be nothing worth looking at: if a container redirects all logs to /dev/stderr, then searching for logfiles is pointless.

Wrapping up

This has been a long post, and there’s more that I could write, but debugging stories are rarely as interesting to the reader as they are to the author. To summarize, I’d say that the most important thing is to verify that your image works before you try to deploy it on ECS. If you do that, then the most likely problems you’ll run into are network-related, often misconfigured ports.

Happy troubleshooting!