data engineering

S3 Table Buckets vs Redshift

AWS released S3 Table Buckets at re:Invent 2024, and at release they were pretty much only usable with Elastic MapReduce (EMR). However, over the past year, the S3 Tables team has been making improvements. And while there are still some limitations, S3 Tables with Athena gives a user experience similar to traditional data warehouses such as Redshift.

Which leads to the question: can Athena and S3 Tables be a cost-effective replacement for Redshift? In this post I show how to use S3 Tables, and run some performance comparisons to answer that question.

D2C242: Data Engineering and its Streams, Rivers, and Lakes

Keith Gregory teaches Day Two Cloud about data engineering in a way DevOps folks (and hydrologists) can understand. He explains that the role of a data engineer is to create pipelines to transport data from metaphorical rivers and make it usable for data analysts. Keith walks us through the testing process; the difference between streaming … Read More

Perils of Partitioning

Partitioning is one of the easiest ways to improve the performance of your data lake, because it reduces the amount of data scanned. But implementing partitions can be surprisingly challenging, as can their effective use. In this post I look at several of the issues that you should consider when partitioning your data.

Transforming Data with Amazon Athena

My prior posts used Lambda to do data transformation. But what if we could use a non-programmatic tool, in keeping with the Extract-Load-Transform mindset of the modern data pipeline? As it turns out, we can: Amazon Athena can write data as well as query it. There are, of course, a few stumbles along the way. In this blog post I walk through the process of aggregating CloudTrail data using SQL.

Aggregating Files in your Data Lake – Part 2

When I ran the Lambda from my previous post against Chariot’s CloudTrail repository, it took almost four minutes to process a single day’s worth of data. That seems like a long time, and as a developer I want to optimize everything I write. In this post I analyze the current runtime and look at options for improving it.

Aggregating Files in your Data Lake – Part 1

As I’ve written in the past, large numbers of small files make for an inefficient data lake. But sometimes, you can’t avoid small files. Our CloudTrail repository, for example, has 4,601,675 files as of this morning, 44% of which are under 1,000 bytes long. In this post, I develop a Lambda-based data pipeline to aggregate these files, storing them in a new S3 location partitioned by date. Along the way I call out some of the challenges that face such a pipeline.

TechChat Tuesdays #65: Redshift Execution Plans with Keith Gregory

In this week’s TechChat, we welcome Keith Gregory, our Cloud & Data Engineering Practice Lead here at Chariot. Keith is a prolific writer both on the Chariot blog as well as on his own, and is a wealth of knowledge on all things AWS. We touch on Redshift execution plans, how to appropriately size Redshift … Read More