Aggregating Files in your Data Lake – Part 3
In this final part of a three-part series, I add another aggregation step to combine a month’s worth of data and write it as Parquet.
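The post itself walks through the real implementation; the sketch below only illustrates the general shape of such a monthly roll-up. The bucket name and key layout are hypothetical, and it assumes the daily aggregates are newline-delimited JSON, using boto3 and PyArrow to combine them into a single Parquet file.

```python
# Hypothetical sketch of a monthly roll-up; bucket name and key layout
# are placeholders, not the pipeline described in the post.
import gzip
import json

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")
BUCKET = "example-cloudtrail-aggregated"   # placeholder bucket name


def aggregate_month(year: int, month: int) -> None:
    prefix = f"daily/{year:04d}/{month:02d}/"   # placeholder daily-partition layout
    records = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            if obj["Key"].endswith(".gz"):
                body = gzip.decompress(body)
            # each daily aggregate is assumed to hold newline-delimited JSON
            records.extend(json.loads(line) for line in body.splitlines() if line)
    # write the whole month as one Parquet file, then upload it
    table = pa.Table.from_pylist(records)
    pq.write_table(table, "/tmp/month.parquet", compression="snappy")
    s3.upload_file("/tmp/month.parquet", BUCKET,
                   f"monthly/{year:04d}/{month:02d}/data.parquet")
```

One constraint worth noting: Lambda's /tmp filesystem is 512 MB by default (configurable up to 10 GB), which is one reason assembling a month of data in a single invocation can be awkward.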
When I ran the Lambda from my previous post against Chariot’s CloudTrail repository, it took almost four minutes to process a single day’s worth of data. That seems like a long time, and as a developer I want to optimize everything I write. In this post I analyze the current runtime and look at options for improving it.
As I’ve written in the past, large numbers of small files make for an inefficient data lake. But sometimes you can’t avoid small files. Our CloudTrail repository, for example, has 4,601,675 files as of this morning, 44% of which are under 1,000 bytes. In this post, I develop a Lambda-based data pipeline to aggregate these files, storing them in a new S3 location partitioned by date. Along the way I call out some of the challenges that such a pipeline faces.
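As a minimal sketch of that first step (the account ID, region, and bucket names below are placeholders, not the post’s actual configuration), a daily job might concatenate the small gzipped CloudTrail files for one day into a single compressed NDJSON object under a date-partitioned prefix:

```python
# Hypothetical sketch: concatenate one day's small CloudTrail files into a
# single gzipped NDJSON object under a date-partitioned prefix. Bucket
# names and the account/region in the key are placeholders.
import gzip
import io
import json

import boto3

s3 = boto3.client("s3")
SRC_BUCKET = "example-cloudtrail-logs"   # placeholder source bucket
DST_BUCKET = "example-cloudtrail-agg"    # placeholder destination bucket


def aggregate_day(year: int, month: int, day: int) -> None:
    # CloudTrail delivers gzipped files under
    # AWSLogs/<account>/CloudTrail/<region>/<yyyy>/<mm>/<dd>/
    prefix = (f"AWSLogs/123456789012/CloudTrail/us-east-1/"
              f"{year:04d}/{month:02d}/{day:02d}/")
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as out:
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=prefix):
            for obj in page.get("Contents", []):
                raw = s3.get_object(Bucket=SRC_BUCKET, Key=obj["Key"])["Body"].read()
                # each CloudTrail file is a gzipped JSON document with a
                # top-level "Records" array
                for event in json.loads(gzip.decompress(raw))["Records"]:
                    out.write((json.dumps(event) + "\n").encode("utf-8"))
    s3.put_object(Bucket=DST_BUCKET,
                  Key=f"date={year:04d}-{month:02d}-{day:02d}/events.ndjson.gz",
                  Body=buf.getvalue())
```

Writing one object per day under a Hive-style `date=` partition key keeps the output friendly to Athena and Glue, which is the usual payoff for this kind of aggregation.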
Clickstream data – the behavior data collected from a user’s path through a website or app – is often used for business intelligence reports. It helps many companies answer questions like, ‘Which of my products are people adding to their cart?’ or ‘What does our online purchase funnel look like?’ But our AWS Practice Lead, …
This post is a quick primer on the basic titles and skills best suited to fulfill responsibilities along your company’s data pipeline.
Keith Gregory talks to Andrew Ganim, one of Chariot’s experienced software consultants, about his recent project: building a data pipeline for a multinational company.
This talk will cover the high-level design of Kinesis, how it scales, how clients can retrieve records that have been stored in it, and the use of Kinesis Analytics to transform data and extract outliers.