AWS Resources

All Blog Posts

Aggregating Files in your Data Lake – Part 1

As I’ve written in the past, large numbers of small files make for an inefficient data lake. But sometimes, you can’t avoid small files. Our CloudTrail repository, for example, has 4,601,675 files as of this morning, 44% of which are under 1,000 bytes long. In this post, I develop a Lambda-based data pipeline to aggregate these files, storing them in a new S3 location partitioned by date. Along the way I call out some of the challenges that face such a pipeline.

Small Data: a pipeline for low-latency decision support

In my last post, I said that I didn’t think Postgres was a good choice for a decision support database, versus a task-specific DBMS such as Redshift. In this post I’m going to take the opposite stand, and say that there are cases where Postgres is appropriate: namely, low-latency systems that contain a limited amount of data.

Why Not Just Use Postgres?

My last few posts have focused on Redshift and Athena, two specialized tools for managing and querying Big Data. But there’s a meme that’s been floating around for at least a few years that you should just use Postgres for anything data-related. It may not provide all of the features and capabilities of a dedicated tool, but it is one less thing to learn and manage. Should this advice also apply to your data warehouse?

Performance Comparison: Athena versus Redshift

I’ve always been a fan of database servers: self-contained entities that manage both storage and compute, and give you knobs to turn to optimize your queries. The flip side is that I have an inherent distrust of services such as Athena, which promise to run queries efficiently on structured data split between many files in a data lake. It just doesn’t seem natural; where are the knobs?

So, since I had data generated for my post on Athena performance with different file types, I decided to use that data in a performance comparison with Redshift.

Beyond the Bastion: Connecting to Your Resources in AWS

In a perfect world, there would never be a need to connect to your resources running on AWS. In the real world, it’s sometimes necessary to get your hands dirty and look at what’s happening on the actual machine, especially during development. This post dives into a few ways to connect your workstation to resources running inside a VPC. It started out as a how-to for using bastion hosts, but quickly expanded to look beyond the bastion.

Featured Videos

All the AWS CodeBuild You Can Stomach in 45 Minutes

In this 45-minute talk, Ken Rimple gives a quick overview of AWS CodeBuild, then dives into a few of the challenges he’s faced: dealing with build errors properly, configuring CodeBuild to run inside of AWS, testing locally so you don’t go crazy waiting for 15 minutes each time you deploy a new build, properly accessing your build artifacts and reports, running tools like Cypress, building and deploying Docker containers to ECS, and more.

AWS: Things I Learned the Hard Way

Amazon Web Services (AWS) is a collection of nearly 200 services. They can be intimidating to the newcomer, and offer many opportunities for mistakes: some expensive, some just inconvenient. In this Lunch and Learn, our panel of AWS experts looks at some of the mistakes they made, and how these could have been avoided.

All Videos

Coping with Aging (Data) – IoT on AWS – A Philly Cloud Computing Event

Data has different purposes over time: when fresh, it can be used for real-time decision-making; as it ages, it becomes useful for analytics; eventually, it becomes a record, useful or perhaps not. Each of these stages requires a different approach to storage and management, and this talk looks at appropriate ways to work with your data at the different stages of its life.

That’s not a Data Lake, THIS is a Data Lake – IoT on AWS – A Philly Cloud Computing Event

This talk will review two common use cases for captured metric data: 1) real-time analysis, visualization, and quality assurance, and 2) ad-hoc analysis. The most common open source streaming options will be mentioned; however, this talk will be concerned with Apache Flink specifically. A brief discussion of Apache Beam will also be included in the context of the larger discussion of a unified data processing model.

Amplify your Mobile App – IoT on AWS – A Philly Cloud Computing Event

In this session we will walk through the steps required to securely communicate with your device using the Device Shadow service. This will include an overview of user authentication and authorization, connecting to AWS IoT, and using MQTT to communicate with the device’s “Device Shadow” to read and update its state. All this, using the AWS Amplify CLI and SDK.

Looking to discuss an AWS project with our team? Contact us.