Keith Gregory, Author at Chariot Solutions

Aggregating Files in your Data Lake – Part 1

As I’ve written in the past, large numbers of small files make for an inefficient data lake. But sometimes, you can’t avoid small files. Our CloudTrail repository, for example, has 4,601,675 files as of this morning, 44% of which are under 1,000 bytes long. In this post, I develop a Lambda-based data pipeline to aggregate these files, storing them in a new S3 location partitioned by date. Along the way I call out some of the challenges that face such a pipeline.

Data Engineering is more SRE than SQL

Following my post about the Chariot Data Engineering interview, I received some comments along the lines of “wait, you don’t test their SQL skills?!?” Actually, we do: after loading up the test data into Redshift, the candidate creates three progressively difficult queries. But by then, I’m pretty sure they’ve got the skills we need, because …

Developing A Coding Test for Data Engineering

Hiring good candidates is difficult. After nearly 40 years in this business, and interviewing hundreds of candidates, I’m not going to claim that I have the answer. Just some ideas.

Putting Amazon CodeGuru Reviewer To The Test

Amazon CodeGuru Reviewer promises to “detect potential defects that are difficult for developers to find,” using machine learning to identify potential problems. But how does it compare to existing rule-based tools? In this post I turn CodeGuru loose on a seven-year-old library that’s in use by 3,000 people, to see what issues it flags.

Java Virtual Threads: Worth the wait?

With version 21, Java got virtual (lightweight) threads. This feature has received a lot of press, but will it actually help you? In this post I review the theoretical benefits of virtual threads, and then show actual results from a benchmark.

Small Data: a pipeline for low-latency decision support

In my last post, I said that I didn’t think Postgres was a good choice for a decision support database, versus a task-specific DBMS such as Redshift. In this post I’m going to take the opposite stand, and say that there are cases where Postgres is appropriate: namely, low-latency systems that contain a limited amount of data.

Why Not Just Use Postgres?

My last few posts have focused on Redshift and Athena, two specialized tools for managing and querying Big Data. But there’s a meme that’s been floating around for at least a few years that you should just use Postgres for anything data-related. It may not provide all of the features and capabilities of a dedicated tool, but it is one less thing to learn and manage. Should this advice also apply to your data warehouse?

A Deep Dive on Redshift Execution Plans

In this post I walk through several execution plans, explain what Redshift is doing in each, and highlight the parts of plans that indicate problems.

Performance Comparison: Athena versus Redshift

I’ve always been a fan of database servers: self-contained entities that manage both storage and compute, and give you knobs to turn to optimize your queries. The flip side is that I have an inherent distrust of services such as Athena, which promise to run queries efficiently on structured data split between many files in a data lake. It just doesn’t seem natural; where are the knobs?

So, since I had data generated for my post on Athena performance with different file types, I decided to use that data in a performance comparison with Redshift.

Athena Performance Comparison: Avro, JSON, and Parquet

In my “Friends Don’t Let Friends Use JSON” post, I noted that I preferred the Avro file format to Parquet, because it was easier to write code to use it. I expected some pushback, and got it: Parquet is “much” more performant. So I decided to do some benchmarking.