aws

TechChat Tuesdays #65: Redshift Execution Plans with Keith Gregory

In this week’s TechChat, we welcome Keith Gregory, our Cloud & Data Engineering Practice Lead here at Chariot. Keith is a prolific writer, both on the Chariot blog and on his own, and is a wealth of knowledge on all things AWS. We touch on Redshift execution plans, how to appropriately size Redshift …

Why Not Just Use Postgres?

My last few posts have focused on Redshift and Athena, two specialized tools for managing and querying Big Data. But there’s a meme that’s been floating around for at least a few years that you should just use Postgres for anything data-related. It may not provide all of the features and capabilities of a dedicated tool, but it is one less thing to learn and manage. Should this advice also apply to your data warehouse?

Performance Comparison: Athena versus Redshift

I’ve always been a fan of database servers: self-contained entities that manage both storage and compute, and give you knobs to turn to optimize your queries. The flip side is that I have an inherent distrust of services such as Athena, which promise to run queries efficiently on structured data split between many files in a data lake. It just doesn’t seem natural; where are the knobs?

So, since I had data generated for my post on Athena performance with different file types, I decided to use that data in a performance comparison with Redshift.

Athena Performance Comparison: Avro, JSON, and Parquet

In my “Friends Don’t Let Friends Use JSON” post, I noted that I preferred the Avro file format to Parquet, because it was easier to write code to use it. I expected some pushback, and got it: Parquet is “much” more performant. So I decided to do some benchmarking.

Beyond the Bastion: Connecting to Your Resources in AWS

In a perfect world, there would never be a need to connect to your resources running on AWS. In the real world, it’s sometimes necessary to get your hands dirty and look at what’s happening on the actual machine, especially during development. This post dives into a few ways to connect your workstation to resources running inside a VPC. It started out as a how-to for using bastion hosts, but quickly expanded to look beyond the bastion.

Unbalanced Data in Redshift

Decision support databases have a number of quirks that are not obvious to the casual user, particularly someone coming from an OLTP background. In this post I look at how unbalanced distributions can impact your query performance, how you can identify imbalances, and what you can do to fix them.

AWS CodeBuild and Flyway Database Migrations

Are you running a database with RDS? Would you like to manage it via migrations? This article explains how to use AWS CodeBuild to keep a database schema updated using Flyway, an open-source database migration tool. Configuration is outlined via CloudFormation snippets. An AWS example repository is provided.

Analyzing Glue Jobs with AWS X-Ray

It’s possible to analyze your Glue jobs using just the logs they produce. Possible. But it’s not a pleasant task: your log messages are buried in messages from the framework, and in the case of a distributed PySpark job they’ll be spread amongst multiple CloudWatch log streams. In this post I look at an alternative: AWS X-Ray, which captures and aggregates “trace segments” that monitor specific sections of your code. With X-Ray, you can easily see where your jobs are spending their time, and compare different runs.

TechChat Tuesdays #57: AWS re:Invent Announcements with Keith Gregory, AWS Practice Lead

Today we welcome Keith Gregory to the show! Keith is our AWS Practice Lead here at Chariot. We cover some announcements from AWS re:Invent, and do a deep dive into CodeCatalyst, OpenSearch Serverless, Lambda SnapStart, Redshift streaming ingestion from Kafka/Kinesis, and EventBridge Pipes.

Friends Don’t Let Friends Use JSON (in their data lakes)

I’ve never been a JSON hater, but I’ve recently run into enough pain with JSON as a data serialization format that my feelings are edging toward dislike. However, JSON is a fact of life in most data pipelines, especially those that receive event-stream data from a third-party supplier. This post reflects on some of the problems that I’ve seen, and solutions that I’ve used.