data engineering

Unbalanced Data in Redshift

Decision support databases have a number of quirks that are not obvious to the casual user, particularly someone coming from an OLTP background. In this post I look at how unbalanced distributions can impact your query performance, how you can identify imbalances, and what you can do to fix them.

Avro Three Ways

In my last post I recommended using Avro for file storage in a data lake. It has the benefits of compact storage and a schema in every file that tells you what data it holds. In this post I show three ways to generate Avro files: one in Java, and two in Python.

Friends Don’t Let Friends Use JSON (in their data lakes)

I’ve never been a JSON hater, but I’ve recently run into enough pain with JSON as a data serialization format that my feelings are edging toward dislike. However, JSON is a fact of life in most data pipelines, especially those that receive event-stream data from a third-party supplier. This post reflects on some of the problems that I’ve seen, and solutions that I’ve used

Twenty Years of Big Data

More, cheaper, faster: our own Keith Gregory recounts the changes in big data, data storage, and data engineering over the last two decades.