Keith Gregory

What’s the point of Lambda SnapStart?

Lambda SnapStart is intended to improve the cold start time for a Lambda function. It’s been available for Java workloads since 2022, and was recently released for Python and .Net workloads. It works by running the initialization code of your Lambda function when you release a version, and then storing an image of the Lambda execution environment. Cold starts load this image rather than running the initialization themselves. Given that cold starts happen unpredictably, and may be measured in seconds, this seems like a win-win situation.

The reality, as usual, is more nuanced. SnapStart introduces its own cold start delays, as it loads the image into the runtime. And it increases the time and effort of deployment. In this post I drill down into the nuance, so that you can decide whether it’s a worthwhile choice fo your project.

Troubleshooting ECS Deployments

At Chariot, we’re fans of deploying on Amazon’s Elastic Container Service using Fargate. However, there are a few stumbling blocks for ECS deployments. In this post I cover the common problems that I’ve seen, and some of the techniques that I use to avoid tripping over them.

Websockets feeding Kinesis

We recently explored a project to retrieve data from a third-party service. They didn’t offer any push capabilities such as writing to a Kafka or Kinesis stream, or even a web-hook. But they did offer a WebSocket interface, so we explored whether we could use that as our streaming source. We didn’t go that route, but I was intrigued by the idea enough to make a proof-of-concept.

Lambda Four Ways, a Rosetta Stone for AWS

When I write Lambdas professionally, Python is my preferred language. It offers decent performance, a straightforward syntax, and high developer productivity. I’ve also used Java, both in demonstration apps and actual client work. But while I have some familiarity with other languages supported by the platform, I’ve never used them. So, with some downtime, I decided to implement the same Lambda in four different languages: Python, Java, JavaScript, and Go, to get a better sense of their strengths and weaknesses.

Perils of Partitioning

Partitioning is one of the easiest ways to improve the performance of your data lake, because it reduces the amount of data scanned. But implementing partitions can be surprisingly challenging, as can their effective use. In this post I look at several of the issues that you should consider when partitioning your data.

Transforming Data with Amazon Athena

My prior posts used Lambda to do data transformation. But what if we could use a non-programmatic tool, in keeping with the Extract-Load-Transform mindset of the modern data pipeline. As it turns, we can: Amazon Athena can write data as well as query it. There are, of course, a few stumbles along the way. In this blog post I walk through the process of aggregating CloudTrail data using SQL.

Aggregating Files in your Data Lake – Part 2

When I ran the Lambda from my previous post against Chariot’s CloudTrail repository, it took almost four minutes to process a single day’s worth of data. That seems like a long time, and as a developer I want to optimize everything I write. In this post I look into analyzing the current runtime, and options for improving it.

Aggregating Files in your Data Lake – Part 1

As I’ve written in the past, large numbers of small files make for an inefficient data lake. But sometimes, you can’t avoid small files. Our CloudTrail repository, for example, has 4,601,675 files as-of this morning, 44% of which are under 1,000 bytes long. In this post, I develop a Lambda-based data pipeline to aggregate these files, storing them in a new S3 location partitioned by date. Along the way I call out some of the challenges that face such a pipeline.

Data Engineering is more SRE than SQL

Following my post about the Chariot Data Engineering interview, I received some comments along the lines of “wait, you don’t test their SQL skills?!?” Actually, we do: after loading up the test data into Redshift, the candidate creates three progressively difficult queries. But by then, I’m pretty sure they’ve got the skills we need, because … Read More