All Blog Posts

Populating a Data Lake with AWS Database Migration Service and Amazon Data Firehose

Keith Gregory | March 18, 2025 | aws, aws data firehose, aws database migration service, data pipeline

Data lakes are great for holding large volumes of data, such as clickstream logs. But such data has limited usefulness unless you can combine it with data from your transactional, line-of-business databases. And this is where things get tricky. Simple approaches, such as replicating entire tables, don’t scale. Streaming approaches that include updates and deletes require logic to determine the latest value (or existence!) of any given row. All of which has to be translated into static data files in a data lake.

In this post I look at one approach to solving this problem: using AWS Database Migration Service to capture changes from the source database and write them to a Kinesis Data Stream, and Amazon Data Firehose to load those records into Iceberg tables.

Populating Iceberg Tables with Amazon Data Firehose

Keith Gregory | January 16, 2025 (updated January 17, 2025) | amazon data firehose, aws, data engineering, iceberg, parquet

In this post I look at using Amazon Data Firehose to populate Iceberg tables, with the automatic optimization features that AWS announced in November 2024.

What’s the point of Lambda SnapStart?

Keith Gregory | December 4, 2024 | AWS lambda

Lambda SnapStart is intended to improve the cold start time for a Lambda function. It’s been available for Java workloads since 2022, and was recently released for Python and .NET workloads. It works by running the initialization code of your Lambda function when you publish a version, and then storing an image of the Lambda execution environment. Cold starts load this image rather than running the initialization themselves. Given that cold starts happen unpredictably, and may be measured in seconds, this seems like a win-win situation.

The reality, as usual, is more nuanced. SnapStart introduces its own cold start delays, as it loads the image into the runtime. And it increases the time and effort of deployment. In this post I drill down into the nuance, so that you can decide whether it’s a worthwhile choice for your project.

Cost Optimizing an ML Feature Store

Will Vuong | June 21, 2024 | aws, elasticache, feature store, json, machine learning, ml, protobufs, protocol buffers, redis

A client recently started building a new machine learning (ML) architecture with a feature store as one of the key pieces. The feature store was already burning through a lot…

Websockets feeding Kinesis

Keith Gregory | June 11, 2024 | aws, kinesis, Websockets

We recently explored a project to retrieve data from a third-party service. They didn’t offer any push capabilities such as writing to a Kafka or Kinesis stream, or even a webhook. But they did offer a WebSocket interface, so we explored whether we could use that as our streaming source. We didn’t go that route, but I was intrigued enough by the idea to build a proof-of-concept.

Lambda Four Ways, a Rosetta Stone for AWS

Keith Gregory | May 14, 2024 | aws, AWS lambda, golang, java, javascript, python

When I write Lambdas professionally, Python is my preferred language. It offers decent performance, a straightforward syntax, and high developer productivity. I’ve also used Java, both in demonstration apps and actual client work. But while I have some familiarity with other languages supported by the platform, I’ve never used them. So, with some downtime, I decided to implement the same Lambda in four different languages: Python, Java, JavaScript, and Go, to get a better sense of their strengths and weaknesses.

Perils of Partitioning

Keith Gregory | March 22, 2024 | athena, aws, data engineering, performance

Partitioning is one of the easiest ways to improve the performance of your data lake, because it reduces the amount of data scanned. But implementing partitions can be surprisingly challenging, as can their effective use. In this post I look at several of the issues that you should consider when partitioning your data.

Transforming Data with Amazon Athena

Keith Gregory | March 15, 2024 | amazon athena, aws, data engineering

My prior posts used Lambda to do data transformation. But what if we could use a non-programmatic tool, in keeping with the Extract-Load-Transform mindset of the modern data pipeline? As it turns out, we can: Amazon Athena can write data as well as query it. There are, of course, a few stumbles along the way. In this blog post I walk through the process of aggregating CloudTrail data using SQL.

Featured Videos

All the AWS CodeBuild You Can Stomach in 45 Minutes

In this 45-minute talk, Ken Rimple gives a quick overview of AWS CodeBuild, then dives into a few of the challenges he’s faced: dealing with build errors properly, configuring CodeBuild to run inside of AWS, testing locally so you don’t go crazy waiting 15 minutes each time you deploy a new build, properly accessing your build artifacts and reports, running tools like Cypress, building and deploying Docker containers to ECS, and more.

AWS: Things I Learned the Hard Way

Amazon Web Services (AWS) is a collection of nearly 200 services. They can be intimidating to the newcomer, and offer many opportunities for mistakes: some expensive, some just inconvenient. In this Lunch and Learn, our panel of AWS experts looks at some of the mistakes they made, and how these could have been avoided.

All Videos

Build a Cloud-Native Web App in 8 Weeks

Becca Refford | December 11, 2020 | aws, codebuild, cognito, cypress, docker

In this tutorial, Ken Rimple explains how to take a new application from concept to production in AWS in eight weeks.

All the AWS CodeBuild You Can Stomach in 45 Minutes

Becca Refford | September 23, 2020 | aws, aws codebuild, cypress, docker, ECS

In this 45-minute webinar, Ken Rimple will give a quick overview of AWS CodeBuild, then dive into a few of the challenges he’s faced.

Philly ETE 2020 – Neel Mitra – Bring IoT and AI Together

Becca Refford | June 23, 2020 (updated July 9, 2020) | amazon sagemaker, aws, aws iot analytics, aws IoT core, aws iot greengrass, iot, machine learning, philly ete 2020

Check out our YouTube playlist to watch all the talks from Emerging Technologies for the Enterprise 2020. Abstract: Machine learning and IoT have become commonplace words in the enterprise workplace….

Philly ETE 2020 – Ken Rimple – Serverless, Schmerverless: Why Should I Care?

Becca Refford | June 23, 2020 (updated July 9, 2020) | architect, aws amplify, AWS lambda, cloud, cod.es, cognito, docker, lambda, serverless

Check out our YouTube playlist to watch all the talks from Emerging Technologies for the Enterprise 2020. Abstract: Ah, Serverless. The term that means a dozen different things to a…

Philly ETE 2020 – Keith Gregory – Accounts as a Service: Why we have 50+ AWS accounts, and why you should too

Becca Refford | June 23, 2020 (updated July 9, 2020) | aws, aws account management, aws deployment, cloud computing

Check out our YouTube playlist to watch all the talks from Emerging Technologies for the Enterprise 2020. Abstract: One of the chief benefits of cloud computing is the ability to…

Philly ETE 2020 – Dan Pilone – Looking over the edge: Bridging the gaps between geospatial data, cloud computing, and local disaster response organizations

Becca Refford | June 23, 2020 (updated July 9, 2020) | aws, big, big data, geospatial data, nasa

Check out our YouTube playlist to watch all the talks from Emerging Technologies for the Enterprise 2020. Abstract: In this talk we look at the challenges of making geospatial data…

Philly ETE 2020 – Brian LeRoux – Less, but better, serverless with OpenJS Architect

Becca Refford | June 23, 2020 (updated July 9, 2020) | architect, aws, cloud computing, cloudformation, openjs, openjs architect

Check out our YouTube playlist to watch all the talks from Emerging Technologies for the Enterprise 2020. Abstract: OpenJS Architect is the fastest and simplest framework for rapidly building web…

Lunch & Learn: AWS — Things I Learned the Hard Way

Becca Refford | June 21, 2020 (updated June 24, 2020) | aws

Amazon Web Services (AWS) is a collection of nearly 200 services. They can be intimidating to the newcomer, and offer many opportunities for mistakes: some expensive, some just inconvenient. In…
