Keith Gregory

Perils of Partitioning

Partitioning is one of the easiest ways to improve the performance of your data lake, because it reduces the amount of data scanned. But implementing partitions can be surprisingly challenging, as can their effective use. In this post I look at several of the issues that you should consider when partitioning your data.

Transforming Data with Amazon Athena

My prior posts used Lambda to do data transformation. But what if we could use a non-programmatic tool, in keeping with the Extract-Load-Transform mindset of the modern data pipeline. As it turns, we can: Amazon Athena can write data as well as query it. There are, of course, a few stumbles along the way. In this blog post I walk through the process of aggregating CloudTrail data using SQL.

Aggregating Files in your Data Lake – Part 2

When I ran the Lambda from my previous post against Chariot’s CloudTrail repository, it took almost four minutes to process a single day’s worth of data. That seems like a long time, and as a developer I want to optimize everything I write. In this post I look into analyzing the current runtime, and options for improving it.

Aggregating Files in your Data Lake – Part 1

As I’ve written in the past, large numbers of small files make for an inefficient data lake. But sometimes, you can’t avoid small files. Our CloudTrail repository, for example, has 4,601,675 files as-of this morning, 44% of which are under 1,000 bytes long. In this post, I develop a Lambda-based data pipeline to aggregate these files, storing them in a new S3 location partitioned by date. Along the way I call out some of the challenges that face such a pipeline.

Data Engineering is more SRE than SQL

Following my post about the Chariot Data Engineering interview, I received some comments along the lines of “wait, you don’t test their SQL skills?!?” Actually, we do: after loading up the test data into Redshift, the candidate creates three progressively difficult queries. But by then, I’m pretty sure they’ve got the skills we need, because … Read More

Developing A Coding Test for Data Engineering

Hiring good candidates is difficult. After nearly 40 years in this business, and interviewing hundreds of candidates, I’m not going to claim that I have the answer. Just some ideas.

Putting Amazon CodeGuru Reviewer To The Test

Amazon CodeGuru Reviewer promises to “detect potential defects that are difficult for developers to find,” using machine learning to identify potential problems. But how does it compare to existing rule-based tools? In this post I turn CodeGuru loose on a seven-year-old library that’s in use by 3,000 people, to see what issues it flags.

Java Virtual Threads: Worth the wait?

With version 21, Java got virtual (lightweight) threads. This feature has received a lot of press, but will it actually help you? In this post I review the theoretical benefits of virtual threads, and then show actual results from a benchmark.

Small Data: a pipeline for low-latency decision support

In my last post, I said that I didn’t think Postgres was a good choice for a decision support database, versus a task-specific DBMS such as Redshift. In this post I’m going to take the opposite stand, and say that there are cases where Postgres is appropriate: namely, low-latency systems that contain a limited amount of data.