Keith Gregory

Avro Three Ways

In my last post I recommended using Avro for file storage in a data lake. It has the benefits of compact storage and a schema in every file that tells you what data it holds. In this post I show three ways to generate Avro files: one in Java, and two in Python.
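
As a small preview, here's a minimal sketch of one Python approach, using the fastavro package (my choice for this sketch; the post's examples may use different libraries):

```python
from fastavro import parse_schema, writer

# Define the record schema; it will be embedded in the file header,
# so any reader can interpret the records without out-of-band information.
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id",   "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

records = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

with open("users.avro", "wb") as f:
    writer(f, schema, records)
```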

Friends Don’t Let Friends Use JSON (in their data lakes)

I’ve never been a JSON hater, but I’ve recently run into enough pain with JSON as a data serialization format that my feelings are edging toward dislike. However, JSON is a fact of life in most data pipelines, especially those that receive event-stream data from a third-party supplier. This post reflects on some of the problems that I’ve seen and solutions that I’ve used.
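
To give one small taste of the pain (my illustration, not necessarily an example from the post): JSON has no native timestamp or decimal type, so type fidelity depends entirely on conventions shared by writer and reader.

```python
import json

# JSON has no timestamp or decimal type; correctness rests on convention.
event = {"timestamp": "2021-12-10T00:00:00Z", "amount": 19.99}
parsed = json.loads(json.dumps(event))

print(type(parsed["timestamp"]))  # <class 'str'> -- the reader must "just know"
print(type(parsed["amount"]))     # <class 'float'> -- binary float, not an exact decimal
```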

Twenty Years of Big Data

More, cheaper, faster: our own Keith Gregory recounts the changes in big data, data storage, and data engineering over the last two decades.

Limiting Cross-stack References in CDK

Several years ago I wrote CloudFormation Tips and Tricks, in which I gave the advice to “use outputs lavishly, exports sparingly.” The reason is that when you export a value from one stack and import it into another you bind those stacks tightly together, and can’t change that exported value. For example, you might create …
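
One way to loosen that coupling (sketched below in Python; this is an assumption on my part, not necessarily the technique the post recommends) is to publish the value to SSM Parameter Store and read it at deploy time, avoiding Fn::ImportValue entirely:

```python
from aws_cdk import Stack
from aws_cdk import aws_s3 as s3
from aws_cdk import aws_ssm as ssm
from constructs import Construct

class ProducerStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        bucket = s3.Bucket(self, "DataBucket")
        # Publish the name via Parameter Store instead of a stack export.
        ssm.StringParameter(self, "BucketNameParam",
            parameter_name="/shared/data-bucket-name",  # hypothetical name
            string_value=bucket.bucket_name)

class ConsumerStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Resolved at deploy time; no Fn::ImportValue, so the producer
        # stack remains free to change the value later.
        bucket_name = ssm.StringParameter.value_for_string_parameter(
            self, "/shared/data-bucket-name")
```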

Managing Internet Access for AWS Workloads

Two months ago I didn’t give much thought to controlling a program’s access to the Internet. Then Log4Shell happened. This post looks at three ways that you can control what an in-VPC application is allowed to talk to.
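
As a taste of what such control can look like (my illustration, not necessarily one of the post's three approaches), here's a boto3 snippet that replaces a security group's default allow-all egress with a rule permitting HTTPS to a single approved CIDR:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical VPC ID and destination CIDR, for illustration only.
sg = ec2.create_security_group(
    GroupName="restricted-egress",
    Description="Only allows HTTPS to an approved CIDR",
    VpcId="vpc-0123456789abcdef0")
sg_id = sg["GroupId"]

# Remove the default rule that allows all outbound traffic ...
ec2.revoke_security_group_egress(
    GroupId=sg_id,
    IpPermissions=[{"IpProtocol": "-1",
                    "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}])

# ... and allow HTTPS to the approved destination only.
ec2.authorize_security_group_egress(
    GroupId=sg_id,
    IpPermissions=[{"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
                    "IpRanges": [{"CidrIp": "203.0.113.0/24"}]}])
```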

Using Cloud Deployments To Mitigate Log4Shell and Similar Vulnerabilities

It’s been a week since CVE-2021-44228, a remote code execution vulnerability in Log4J 2.x, hit the world. Hopefully by now everybody reading this has updated their Java deployments with the latest Log4J libraries. But no doubt there’s another vulnerability, in some popular framework or library, just waiting to make its presence known. This post is about cloud features that minimize the blast radius of such vulnerabilities.

First Look at Amazon Redshift Serverless

Amazon Redshift’s launch in 2012 was one of the “wow!” moments in my experience with AWS. Here was a massively parallel database system that could be rented for 25 cents per node-hour. Here we are in 2021, and AWS has just announced Redshift Serverless, in which you pay for the compute and storage that you use, rather than a fixed monthly cost for a fixed number of nodes with a fixed amount of storage. And for a lot of use cases, I think that’s a great idea. So I spent some time kicking the tires, and this is what I learned.
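
For reference, querying a serverless workgroup looks almost identical to querying a provisioned cluster; here's a minimal sketch using the Redshift Data API, with made-up workgroup and database names:

```python
import boto3

client = boto3.client("redshift-data")

# WorkgroupName replaces ClusterIdentifier when targeting a serverless endpoint.
resp = client.execute_statement(
    WorkgroupName="my-workgroup",   # hypothetical
    Database="dev",
    Sql="select count(*) from events")

# Statements run asynchronously; poll for completion before fetching results.
status = client.describe_statement(Id=resp["Id"])
```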

Serverless Databasing with the Aurora Data API

Earlier this year I wrote that Amazon Aurora Serverless allows you to implement a fully serverless application with a relational database. In this post I show you how to use the Data API to do just that.
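
The core idea, sketched below with placeholder ARNs, is that the Data API turns SQL statements into plain HTTPS calls, so your Lambda function needs neither a connection pool nor a VPC attachment:

```python
import boto3

client = boto3.client("rds-data")

# Cluster and secret ARNs are placeholders for illustration.
resp = client.execute_statement(
    resourceArn="arn:aws:rds:us-east-1:123456789012:cluster:example",
    secretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:example",
    database="example",
    sql="select id, name from users where id = :id",
    parameters=[{"name": "id", "value": {"longValue": 123}}])

for row in resp["records"]:
    print(row)
```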

Rightsizing Data for Athena

Amazon Athena is a service that lets you run SQL queries against structured data files stored in S3. It takes a “divide and conquer” approach, spinning up parallel query execution engines that each examine only a portion of your data. The performance of these queries, however, depends on how you consolidate and partition your data. In this post I compare query times for a moderately large dataset, looking for the “sweet spot” between number of files and individual file size.
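
If you want to run similar measurements yourself, Athena reports per-query engine time that you can capture programmatically; this sketch uses placeholder table, database, and bucket names:

```python
import time
import boto3

athena = boto3.client("athena")

# Placeholders: your query, database, and results bucket will differ.
qid = athena.start_query_execution(
    QueryString="select count(*) from my_table",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-results-bucket/athena/"},
)["QueryExecutionId"]

# Poll until the query finishes, then read the engine execution time.
while True:
    info = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
    if info["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print(info["Statistics"]["EngineExecutionTimeInMillis"])
```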

Using CodeArtifact with Poetry

In my last post I discussed how an artifact server is the best way to publish locally developed Python packages. In this post, I show you how to set up the AWS CodeArtifact service and use it with pip and Poetry.
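
The heart of the setup is getting a short-lived authorization token and the repository's pypi endpoint; here's a sketch with placeholder domain and repository names, showing the corresponding Poetry configuration commands as comments:

```python
import boto3

client = boto3.client("codeartifact")

# Domain and repository names are placeholders.
token = client.get_authorization_token(
    domain="example-domain")["authorizationToken"]

endpoint = client.get_repository_endpoint(
    domain="example-domain",
    repository="example-repo",
    format="pypi")["repositoryEndpoint"]

# With those values, the Poetry configuration looks roughly like:
#   poetry config repositories.example-repo <endpoint>
#   poetry config http-basic.example-repo aws <token>
print(endpoint)
```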