All Blog Posts

Aggregating Files in your Data Lake – Part 1

Keith Gregory | February 14, 2024 (updated February 15, 2024) | aws, cloudtrail, data engineering, data pipeline, lambda, performance

As I’ve written in the past, large numbers of small files make for an inefficient data lake. But sometimes, you can’t avoid small files. Our CloudTrail repository, for example, has 4,601,675 files as of this morning, 44% of which are under 1,000 bytes long. In this post, I develop a Lambda-based data pipeline to aggregate these files, storing them in a new S3 location partitioned by date. Along the way I call out some of the challenges that face such a pipeline.

Small Data: a pipeline for low-latency decision support

Keith Gregory | August 8, 2023 (updated November 21, 2023) | aws, big data, cloud native streaming, kinesis, lambda, Postgres

In my last post, I said that I didn’t think Postgres was a good choice for a decision support database, versus a task-specific DBMS such as Redshift. In this post I’m going to take the opposite stand, and say that there are cases where Postgres is appropriate: namely, low-latency systems that contain a limited amount of data.

Why Not Just Use Postgres?

Keith Gregory | July 19, 2023 (updated November 21, 2023) | amazon web services, aws, Postgres, redshift

My last few posts have focused on Redshift and Athena, two specialized tools for managing and querying Big Data. But there’s a meme that’s been floating around for at least a few years that you should just use Postgres for anything data-related. It may not provide all of the features and capabilities of a dedicated tool, but it is one less thing to learn and manage. Should this advice also apply to your data warehouse?

Performance Comparison: Athena versus Redshift

Keith Gregory | May 25, 2023 (updated November 21, 2023) | athena, aws, redshift

I’ve always been a fan of database servers: self-contained entities that manage both storage and compute, and give you knobs to turn to optimize your queries. The flip side is that I have an inherent distrust of services such as Athena, which promise to run queries efficiently on structured data split between many files in a data lake. It just doesn’t seem natural; where are the knobs?

So, since I had data generated for my post on Athena performance with different file types, I decided to use that data in a performance comparison with Redshift.

Athena Performance Comparison: Avro, JSON, and Parquet

Keith Gregory | May 16, 2023 (updated November 21, 2023) | athena, avro, aws, data engineering, data warehouse, parquet

In my “Friends Don’t Let Friends Use JSON” post, I noted that I preferred the Avro file format to Parquet, because it was easier to write code to use it. I expected some pushback, and got it: Parquet is “much” more performant. So I decided to do some benchmarking.

Beyond the Bastion: Connecting to Your Resources in AWS

Keith Gregory | April 17, 2023 (updated January 29, 2024) | aws, aws-vpc, networking

In a perfect world, there would never be a need to connect to your resources running on AWS. In the real world, it’s sometimes necessary to get your hands dirty and look at what’s happening on the actual machine, especially during development. This post dives into a few ways to connect your workstation to resources running inside a VPC. It started out as a how-to for using bastion hosts, but quickly expanded to look beyond the bastion.

Unbalanced Data in Redshift

Keith Gregory | March 27, 2023 (updated November 21, 2023) | aws, data engineering, redshift

Decision support databases have a number of quirks that are not obvious to the casual user, particularly someone coming from an OLTP background. In this post I look at how unbalanced distributions can impact your query performance, how you can identify imbalances, and what you can do to fix them.

AWS CodeBuild and Flyway Database Migrations

Ken Rimple | March 13, 2023 | aws, cloudformation, codebuild, rds, security groups

Are you running a database with RDS? Would you like to manage it via migrations? This article explains how to use AWS CodeBuild to keep a database schema updated using Flyway, an open-source data migrations tool. Configuration is outlined via CloudFormation snippets. An AWS example repository is provided.


Featured Videos

All the AWS CodeBuild You Can Stomach in 45 Minutes

In this 45-minute talk, Ken Rimple gives a quick overview of AWS CodeBuild, then dives into a few of the challenges he’s faced: dealing with build errors properly, configuring CodeBuild to run inside of AWS, testing locally so you don’t go crazy waiting 15 minutes each time you deploy a new build, properly accessing your build artifacts and reports, running tools like Cypress, building and deploying Docker containers to ECS, and more.

AWS: Things I Learned the Hard Way

Amazon Web Services (AWS) is a collection of nearly 200 services. They can be intimidating to the newcomer, and offer many opportunities for mistakes: some expensive, some just inconvenient. In this Lunch and Learn, our panel of AWS experts look at some of the mistakes they made, and how these could have been avoided.

All Videos

Application Development Approaches in AWS Webinar

Ken Rimple | March 31, 2020 (updated June 8, 2020) | aws, docker, EC2, ECS, lambda, webinar

Chariot’s Ken Rimple, director of Training/Mentoring Services, will take you through some sample AWS architectures and the pros/cons of complexity, cost, and technical considerations for each one.

Introduction to MQTT – IoT on AWS – A Philly Cloud Computing Event

Ken Rimple | November 18, 2019 (updated June 8, 2020) | aws, aws IoT core, mqtt

What is MQTT? How does it work? Why should you care? We’ll discuss the MQTT protocol and how AWS IoT Core is an MQTT Broker able to send and receive messages to and from devices.

Connecting to AWS IoT Core – IoT on AWS – A Philly Cloud Computing Event

Ken Rimple | November 18, 2019 (updated June 8, 2020) | aws, aws IoT core, key management, pke

AWS IoT provides connectivity to IoT devices through HTTP and MQTT. In this session we learn how to leverage AWS IoT Core as an MQTT broker, how to connect your devices using a client certificate, how policies can enforce data security, and how rules are used to move data elsewhere in the AWS infrastructure.

Building Data Pipelines with Kinesis – IoT on AWS – A Philly Cloud Computing Event

Ken Rimple | November 18, 2019 (updated June 10, 2020) | aws, data pipelines, kinesis

Coping with Aging (Data) – IoT on AWS – A Philly Cloud Computing Event

Ken Rimple | November 18, 2019 (updated June 8, 2020) | aws, data management

Data has different purposes over time: when fresh, it can be used for real-time decision-making; as it ages, it becomes useful for analytics; eventually, it becomes a record, useful or perhaps not. Each of these stages requires a different approach to storage and management, and this talk looks at appropriate ways to work with your data at the different stages of its life.

That’s not a Data Lake, THIS is a Data Lake – IoT on AWS – A Philly Cloud Computing Event

Ken Rimple | November 18, 2019 (updated June 8, 2020) | apache beam, apache flink, aws, big data, data science, data-lake

This talk will review two common use cases for captured metric data: 1) real-time analysis, visualization, and quality assurance, and 2) ad-hoc analysis. The most common open source streaming options will be mentioned; however, this talk will be concerned with Apache Flink specifically. A brief discussion of Apache Beam will also be included in the context of the larger discussion of a unified data processing model.

Amplify your Mobile App – IoT on AWS – A Philly Cloud Computing Event

Ken Rimple | November 18, 2019 (updated June 8, 2020) | amplify sdk, aws, aws IoT core, cognito, mobile, security

In this session we will walk through the steps required to securely communicate with your device using the Device Shadow service. This will include an overview of user authentication and authorization, connecting to AWS IoT, and using MQTT to communicate with the device’s “Device Shadow” to read and update its state. All this, using the AWS Amplify CLI and SDK.

Data Protection at AWS – Steve Pressman at AWS – Alpine Cyber Solutions

Ken Rimple | November 18, 2019 (updated June 8, 2020) | aws, security

This presentation will take you through the biggest areas where you need to focus your efforts in order to keep your data safe at AWS, and will show some real-life examples of what could go wrong if you make compromises or allow bad practices.

