Aggregating Files in your Data Lake – Part 3
In this final part of a three-part series, I add another aggregation step to combine a month’s worth of data and write it as Parquet.
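To make the monthly roll-up concrete, here is a minimal sketch of the idea rather than the code from the post: it assumes the daily aggregates are stored as JSON Lines under a hypothetical year=/month= prefix, reads them with pandas (which needs s3fs for s3:// paths), and writes a single Parquet file for the month via pyarrow.

```python
# A minimal sketch (not the post's actual code): combine one month of daily
# aggregates into a single Parquet file. The bucket names, prefix layout, and
# JSON-Lines-per-day format are assumptions for illustration.
import boto3
import pandas as pd

SRC_BUCKET = "example-daily-aggregates"      # hypothetical
DST_BUCKET = "example-monthly-aggregates"    # hypothetical

def aggregate_month(year: int, month: int) -> None:
    s3 = boto3.client("s3")
    prefix = f"year={year}/month={month:02d}/"

    # find every daily file under the month's partition
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=SRC_BUCKET, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    # read each daily JSON-Lines file and concatenate (requires s3fs)
    frames = [pd.read_json(f"s3://{SRC_BUCKET}/{key}", lines=True)
              for key in keys]
    combined = pd.concat(frames, ignore_index=True)

    # write a single Parquet file for the month (requires pyarrow)
    combined.to_parquet(f"s3://{DST_BUCKET}/{prefix}data.parquet", index=False)

if __name__ == "__main__":
    aggregate_month(2023, 6)
```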
So you want to execute some custom CUDA-based AI processing on a GPU, but don’t have the hardware? Have an AWS account? Try an EC2 instance running a Deep Learning AMI (DLAMI). This article explains how to get started if you need OS-level access.
When I ran the Lambda from my previous post against Chariot’s CloudTrail repository, it took almost four minutes to process a single day’s worth of data. That seems like a long time, and as a developer I want to optimize everything I write. In this post I analyze the current runtime and look at options for improving it.
As I’ve written in the past, large numbers of small files make for an inefficient data lake. But sometimes you can’t avoid small files. Our CloudTrail repository, for example, has 4,601,675 files as of this morning, 44% of which are under 1,000 bytes. In this post, I develop a Lambda-based data pipeline to aggregate these files, storing them in a new S3 location partitioned by date. Along the way I call out some of the challenges that such a pipeline faces.
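As a rough illustration of that pipeline (not the code from the post), the sketch below reads one day’s worth of small gzipped CloudTrail files, pulls out their Records arrays, and writes a single compressed JSON Lines object under a date-partitioned key. The bucket names and key layout are assumptions.

```python
# A minimal sketch of the aggregation idea (not the post's actual pipeline):
# read a day's worth of small gzipped CloudTrail files and write their records
# back out as one gzipped JSON-Lines object partitioned by date.
import gzip
import json
import boto3

SRC_BUCKET = "example-cloudtrail-logs"        # hypothetical
DST_BUCKET = "example-cloudtrail-aggregated"  # hypothetical

def aggregate_day(prefix: str, date: str) -> None:
    """prefix: source key prefix for one day; date: 'yyyy-mm-dd'."""
    s3 = boto3.client("s3")
    records = []

    # walk every small file under the day's prefix
    for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=SRC_BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=SRC_BUCKET, Key=obj["Key"])["Body"].read()
            payload = json.loads(gzip.decompress(body))
            records.extend(payload.get("Records", []))

    # write one combined, compressed JSON-Lines object for the day
    out = gzip.compress(
        "\n".join(json.dumps(r) for r in records).encode("utf-8"))
    year, month, day = date.split("-")
    s3.put_object(
        Bucket=DST_BUCKET,
        Key=f"year={year}/month={month}/day={day}/records.json.gz",
        Body=out)
```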
In my last post, I said that I didn’t think Postgres was a good choice for a decision support database, versus a task-specific DBMS such as Redshift. In this post I’m going to take the opposite stand, and say that there are cases where Postgres is appropriate: namely, low-latency systems that contain a limited amount of data.
My last few posts have focused on Redshift and Athena, two specialized tools for managing and querying Big Data. But there’s a meme that’s been floating around for at least a few years: just use Postgres for anything data-related. It may not provide all of the features and capabilities of a dedicated tool, but it’s one less thing to learn and manage. Should this advice also apply to your data warehouse?
I’ve always been a fan of database servers: self-contained entities that manage both storage and compute, and give you knobs to turn to optimize your queries. The flip side is that I have an inherent distrust of services such as Athena, which promise to run queries efficiently on structured data split between many files in a data lake. It just doesn’t seem natural; where are the knobs?
So, since I had data generated for my post on Athena performance with different file types, I decided to use that data in a performance comparison with Redshift.
In my “Friends Don’t Let Friends Use JSON” post, I noted that I preferred the Avro file format to Parquet, because it was easier to write code to use it. I expected some pushback, and got it: Parquet is “much” more performant. So I decided to do some benchmarking.
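For a flavor of what such a benchmark looks like (this is not the benchmark from the post), the sketch below times a full read of the same data set from a Parquet file via pyarrow and from an Avro file via fastavro; the file names are placeholders.

```python
# Rough timing sketch (not the post's benchmark): read the same data set
# from Parquet and Avro and report the elapsed time for each.
import time
import pyarrow.parquet as pq
from fastavro import reader as avro_reader

def time_parquet(path: str) -> float:
    start = time.perf_counter()
    table = pq.read_table(path)        # columnar read of the whole file
    _ = table.num_rows
    return time.perf_counter() - start

def time_avro(path: str) -> float:
    start = time.perf_counter()
    with open(path, "rb") as fo:
        _ = sum(1 for _ in avro_reader(fo))   # record-by-record read
    return time.perf_counter() - start

if __name__ == "__main__":
    print("parquet:", time_parquet("events.parquet"))  # placeholder file
    print("avro:   ", time_avro("events.avro"))        # placeholder file
```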
In this 45-minute talk, Ken Rimple gives a quick overview of AWS CodeBuild, then dives into a few of the challenges he’s faced: handling build errors properly, configuring CodeBuild to run inside of AWS, testing locally so you don’t go crazy waiting 15 minutes each time you deploy a new build, accessing your build artifacts and reports, running tools like Cypress, building and deploying Docker containers to ECS, and more.
Amazon Web Services (AWS) is a collection of nearly 200 services. They can be intimidating to the newcomer, and offer many opportunities for mistakes: some expensive, some just inconvenient. In this Lunch and Learn, our panel of AWS experts looks at some of the mistakes they’ve made, and how those mistakes could have been avoided.
Chariot’s Ken Rimple, director of Training/Mentoring Services, will take you through some sample AWS architectures and the pros/cons of complexity, cost, and technical considerations for each one.
What is MQTT? How does it work? Why should you care? We’ll discuss the MQTT protocol and how AWS IoT Core acts as an MQTT broker, able to send and receive messages to and from devices.
AWS IoT provides connectivity to IoT devices through HTTP and MQTT. In this session we learn how to use AWS IoT Core as an MQTT broker, how to connect your devices using a client certificate, how policies enforce data security, and how rules are used to move data elsewhere in the AWS infrastructure.
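As a minimal illustration of the client-certificate connection (the session may well use an AWS device SDK; this sketch uses the generic paho-mqtt 1.x library instead), here is what publishing a message to AWS IoT Core over mutual TLS can look like. The endpoint, certificate file names, and topic are placeholders.

```python
# A sketch of connecting a device to AWS IoT Core over MQTT with a client
# certificate, using generic paho-mqtt (1.x constructor) rather than an AWS
# device SDK. Endpoint, certificate paths, and topic are placeholders.
import ssl
import paho.mqtt.client as mqtt

ENDPOINT = "your-ats-endpoint.iot.us-east-1.amazonaws.com"  # placeholder
TOPIC = "devices/thermostat-1/telemetry"                    # placeholder

def on_connect(client, userdata, flags, rc):
    # publish one reading once the broker has acknowledged the connection
    client.publish(TOPIC, '{"temperature": 21.5}', qos=1)

client = mqtt.Client(client_id="thermostat-1")   # paho-mqtt 1.x style
client.on_connect = on_connect

# mutual TLS: the CA bundle verifies AWS, the cert/key identify this device
client.tls_set(ca_certs="AmazonRootCA1.pem",
               certfile="device-certificate.pem.crt",
               keyfile="device-private.pem.key",
               tls_version=ssl.PROTOCOL_TLSv1_2)

client.connect(ENDPOINT, port=8883)   # MQTT over TLS
client.loop_forever()
```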
Data has different purposes over time: when fresh, it can be used for real-time decision-making; as it ages, it becomes useful for analytics; eventually, it becomes a record, useful or perhaps not. Each of these stages requires a different approach to storage and management, and this talk looks at appropriate ways to work with your data at the different stages of its life.
This talk will review two common use cases for captured metric data: 1) real-time analysis, visualization, and quality assurance, and 2) ad-hoc analysis. The most common open-source streaming options will be mentioned; however, this talk will focus on Apache Flink specifically. A brief discussion of Apache Beam will also be included, in the context of the larger discussion of a unified data processing model.
In this session we will walk through the steps required to securely communicate with your device using the Device Shadow service. This will include an overview of user authentication and authorization, connecting to AWS IoT, and using MQTT to communicate with the device’s “Device Shadow” to read and update its state. All this, using the AWS Amplify CLI and SDK.
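The session itself works through the Amplify CLI and SDK; as a language-neutral illustration of the underlying MQTT conversation, the sketch below talks to the documented shadow topics directly using paho-mqtt, with a placeholder endpoint, certificates, and thing name.

```python
# A sketch of the Device Shadow MQTT conversation (the session uses the
# Amplify CLI and SDK; this only shows the underlying shadow topics).
# Endpoint, certificate paths, and thing name are placeholders.
import json
import ssl
import paho.mqtt.client as mqtt

ENDPOINT = "your-ats-endpoint.iot.us-east-1.amazonaws.com"  # placeholder
THING = "thermostat-1"                                      # placeholder
SHADOW = f"$aws/things/{THING}/shadow"   # documented shadow topic prefix

def on_connect(client, userdata, flags, rc):
    # once connected, subscribe to the response topics, request the current
    # shadow document, and report a new state
    client.subscribe(f"{SHADOW}/get/accepted")
    client.subscribe(f"{SHADOW}/update/accepted")
    client.publish(f"{SHADOW}/get", "")
    client.publish(f"{SHADOW}/update",
                   json.dumps({"state": {"reported": {"temperature": 21.5}}}))

def on_message(client, userdata, msg):
    # shadow responses arrive on the .../accepted (or .../rejected) topics
    print(msg.topic, json.loads(msg.payload))

client = mqtt.Client(client_id=THING)    # paho-mqtt 1.x style
client.on_connect = on_connect
client.on_message = on_message
client.tls_set(ca_certs="AmazonRootCA1.pem",
               certfile="device-certificate.pem.crt",
               keyfile="device-private.pem.key",
               tls_version=ssl.PROTOCOL_TLSv1_2)
client.connect(ENDPOINT, port=8883)
client.loop_forever()
```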
This presentation will take you through the biggest areas where you need to focus your efforts in order to keep your data safe at AWS, and will show some real-life examples of what could go wrong if you make compromises or allow bad practices.