aws

Populating a Data Lake with AWS Database Migration Service and Amazon Data Firehose

Data lakes are great for holding large volumes of data, such as clickstream logs. But such data has limited usefulness unless you can combine it with data from your transactional, line-of-business databases. And this is where things get tricky. Simple approaches, such as replicating entire tables, don’t scale. Streaming approaches that include updates and deletes require logic to determine the latest value (or existence!) of any given row. All of which has to be translated into static data files in a data lake.

In this post I look at one approach to solve this problem: AWS Data Migration Service to capture changes from the source database and write them to a Kinesis Data Stream, and Amazon Data Firehose to load those records into Iceberg tables.

Cost Optimizing an ML Feature Store

A client recently started building a new machine learning (ML) architecture with a feature store as one of the key pieces. The feature store was already burning through a lot of money on AWS Elasticache and it wasn’t even scaled up in production yet! The project was in danger of being shelved without serious cost … Read More

Websockets feeding Kinesis

We recently explored a project to retrieve data from a third-party service. They didn’t offer any push capabilities such as writing to a Kafka or Kinesis stream, or even a web-hook. But they did offer a WebSocket interface, so we explored whether we could use that as our streaming source. We didn’t go that route, but I was intrigued by the idea enough to make a proof-of-concept.

Lambda Four Ways, a Rosetta Stone for AWS

When I write Lambdas professionally, Python is my preferred language. It offers decent performance, a straightforward syntax, and high developer productivity. I’ve also used Java, both in demonstration apps and actual client work. But while I have some familiarity with other languages supported by the platform, I’ve never used them. So, with some downtime, I decided to implement the same Lambda in four different languages: Python, Java, JavaScript, and Go, to get a better sense of their strengths and weaknesses.

Perils of Partitioning

Partitioning is one of the easiest ways to improve the performance of your data lake, because it reduces the amount of data scanned. But implementing partitions can be surprisingly challenging, as can their effective use. In this post I look at several of the issues that you should consider when partitioning your data.

Transforming Data with Amazon Athena

My prior posts used Lambda to do data transformation. But what if we could use a non-programmatic tool, in keeping with the Extract-Load-Transform mindset of the modern data pipeline. As it turns, we can: Amazon Athena can write data as well as query it. There are, of course, a few stumbles along the way. In this blog post I walk through the process of aggregating CloudTrail data using SQL.

Aggregating Files in your Data Lake – Part 2

When I ran the Lambda from my previous post against Chariot’s CloudTrail repository, it took almost four minutes to process a single day’s worth of data. That seems like a long time, and as a developer I want to optimize everything I write. In this post I look into analyzing the current runtime, and options for improving it.