Aggregating Files in your Data Lake – Part 3
In this final part of a three-part series, I add another aggregation step to combine a month’s worth of data and write it as Parquet.
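To make the monthly roll-up concrete, here is a minimal sketch of the idea rather than the code from the post: it assumes the daily aggregates are stored as JSON Lines under a hypothetical year=/month= prefix, reads them with pandas (which needs s3fs for s3:// paths), and writes a single Parquet file for the month via pyarrow.

```python
# A minimal sketch (not the post's actual code): combine one month of daily
# aggregates into a single Parquet file. The bucket names, prefix layout, and
# JSON-Lines-per-day format are assumptions for illustration.
import boto3
import pandas as pd

SRC_BUCKET = "example-daily-aggregates"      # hypothetical
DST_BUCKET = "example-monthly-aggregates"    # hypothetical

def aggregate_month(year: int, month: int) -> None:
    s3 = boto3.client("s3")
    prefix = f"year={year}/month={month:02d}/"

    # find every daily file under the month's partition
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=SRC_BUCKET, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    # read each daily JSON-Lines file and concatenate (requires s3fs)
    frames = [pd.read_json(f"s3://{SRC_BUCKET}/{key}", lines=True)
              for key in keys]
    combined = pd.concat(frames, ignore_index=True)

    # write a single Parquet file for the month (requires pyarrow)
    combined.to_parquet(f"s3://{DST_BUCKET}/{prefix}data.parquet", index=False)

if __name__ == "__main__":
    aggregate_month(2023, 6)
```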
So you want to execute some custom CUDA-based AI processing on a GPU, but don’t have the hardware? Have an AWS account? Try an EC2 instance running a Deep Learning AMI (DLAMI). This article explains how to get started if you need OS-level access.
When I ran the Lambda from my previous post against Chariot’s CloudTrail repository, it took almost four minutes to process a single day’s worth of data. That seems like a long time, and as a developer I want to optimize everything I write. In this post I analyze the current runtime and look at options for improving it.
As I’ve written in the past, large numbers of small files make for an inefficient data lake. But sometimes you can’t avoid small files. Our CloudTrail repository, for example, has 4,601,675 files as of this morning, 44% of which are under 1,000 bytes. In this post, I develop a Lambda-based data pipeline to aggregate these files, storing them in a new S3 location partitioned by date. Along the way I call out some of the challenges that such a pipeline faces.
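As a rough illustration of that pipeline (not the code from the post), the sketch below reads one day’s worth of small gzipped CloudTrail files, pulls out their Records arrays, and writes a single compressed JSON Lines object under a date-partitioned key. The bucket names and key layout are assumptions.

```python
# A minimal sketch of the aggregation idea (not the post's actual pipeline):
# read a day's worth of small gzipped CloudTrail files and write their records
# back out as one gzipped JSON-Lines object partitioned by date.
import gzip
import json
import boto3

SRC_BUCKET = "example-cloudtrail-logs"        # hypothetical
DST_BUCKET = "example-cloudtrail-aggregated"  # hypothetical

def aggregate_day(prefix: str, date: str) -> None:
    """prefix: source key prefix for one day; date: 'yyyy-mm-dd'."""
    s3 = boto3.client("s3")
    records = []

    # walk every small file under the day's prefix
    for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=SRC_BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=SRC_BUCKET, Key=obj["Key"])["Body"].read()
            payload = json.loads(gzip.decompress(body))
            records.extend(payload.get("Records", []))

    # write one combined, compressed JSON-Lines object for the day
    out = gzip.compress(
        "\n".join(json.dumps(r) for r in records).encode("utf-8"))
    year, month, day = date.split("-")
    s3.put_object(
        Bucket=DST_BUCKET,
        Key=f"year={year}/month={month}/day={day}/records.json.gz",
        Body=out)
```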
In my last post, I said that I didn’t think Postgres was a good choice for a decision support database, versus a task-specific DBMS such as Redshift. In this post I’m going to take the opposite stand, and say that there are cases where Postgres is appropriate: namely, low-latency systems that contain a limited amount of data.
My last few posts have focused on Redshift and Athena, two specialized tools for managing and querying Big Data. But there’s a meme that’s been floating around for at least a few years: just use Postgres for anything data-related. It may not provide all of the features and capabilities of a dedicated tool, but it’s one less thing to learn and manage. Should this advice also apply to your data warehouse?
I’ve always been a fan of database servers: self-contained entities that manage both storage and compute, and give you knobs to turn to optimize your queries. The flip side is that I have an inherent distrust of services such as Athena, which promise to run queries efficiently on structured data split between many files in a data lake. It just doesn’t seem natural; where are the knobs?
So, since I had data generated for my post on Athena performance with different file types, I decided to use that data in a performance comparison with Redshift.
In my “Friends Don’t Let Friends Use JSON” post, I noted that I preferred the Avro file format to Parquet, because it was easier to write code to use it. I expected some pushback, and got it: Parquet is “much” more performant. So I decided to do some benchmarking.
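For a flavor of what such a benchmark looks like (this is not the benchmark from the post), the sketch below times a full read of the same data set from a Parquet file via pyarrow and from an Avro file via fastavro; the file names are placeholders.

```python
# Rough timing sketch (not the post's benchmark): read the same data set
# from Parquet and Avro and report the elapsed time for each.
import time
import pyarrow.parquet as pq
from fastavro import reader as avro_reader

def time_parquet(path: str) -> float:
    start = time.perf_counter()
    table = pq.read_table(path)        # columnar read of the whole file
    _ = table.num_rows
    return time.perf_counter() - start

def time_avro(path: str) -> float:
    start = time.perf_counter()
    with open(path, "rb") as fo:
        _ = sum(1 for _ in avro_reader(fo))   # record-by-record read
    return time.perf_counter() - start

if __name__ == "__main__":
    print("parquet:", time_parquet("events.parquet"))  # placeholder file
    print("avro:   ", time_avro("events.avro"))        # placeholder file
```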
In this 45-minute talk, Ken Rimple gives a quick overview of AWS CodeBuild, then dives into a few of the challenges he’s faced: handling build errors properly, configuring CodeBuild to run inside of AWS, testing locally so you don’t go crazy waiting 15 minutes each time you deploy a new build, accessing your build artifacts and reports, running tools like Cypress, building and deploying Docker containers to ECS, and more.
Amazon Web Services (AWS) is a collection of nearly 200 services. They can be intimidating to the newcomer, and offer many opportunities for mistakes: some expensive, some just inconvenient. In this Lunch and Learn, our panel of AWS experts looks at some of the mistakes they’ve made, and how those mistakes could have been avoided.
Chariot’s Ken Rimple, director of Training/Mentoring Services, will take you through some sample AWS architectures and the pros/cons of complexity, cost, and technical considerations for each one.
What is MQTT? How does it work? Why should you care? We’ll discuss the MQTT protocol and how AWS IoT Core acts as an MQTT broker, able to send and receive messages to and from devices.
AWS IoT provides connectivity to IoT devices through HTTP and MQTT. In this session we learn how to use AWS IoT Core as an MQTT broker, how to connect your devices using a client certificate, how policies enforce data security, and how rules are used to move data elsewhere in the AWS infrastructure.
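As a minimal illustration of the client-certificate connection (the session may well use an AWS device SDK; this sketch uses the generic paho-mqtt 1.x library instead), here is what publishing a message to AWS IoT Core over mutual TLS can look like. The endpoint, certificate file names, and topic are placeholders.

```python
# A sketch of connecting a device to AWS IoT Core over MQTT with a client
# certificate, using generic paho-mqtt (1.x constructor) rather than an AWS
# device SDK. Endpoint, certificate paths, and topic are placeholders.
import ssl
import paho.mqtt.client as mqtt

ENDPOINT = "your-ats-endpoint.iot.us-east-1.amazonaws.com"  # placeholder
TOPIC = "devices/thermostat-1/telemetry"                    # placeholder

def on_connect(client, userdata, flags, rc):
    # publish one reading once the broker has acknowledged the connection
    client.publish(TOPIC, '{"temperature": 21.5}', qos=1)

client = mqtt.Client(client_id="thermostat-1")   # paho-mqtt 1.x style
client.on_connect = on_connect

# mutual TLS: the CA bundle verifies AWS, the cert/key identify this device
client.tls_set(ca_certs="AmazonRootCA1.pem",
               certfile="device-certificate.pem.crt",
               keyfile="device-private.pem.key",
               tls_version=ssl.PROTOCOL_TLSv1_2)

client.connect(ENDPOINT, port=8883)   # MQTT over TLS
client.loop_forever()
```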
Data has different purposes over time: when fresh, it can be used for real-time decision-making; as it ages, it becomes useful for analytics; eventually, it becomes a record, useful or perhaps not. Each of these stages requires a different approach to storage and management, and this talk looks at appropriate ways to work with your data at the different stages of its life.
This talk will review two common use cases for captured metric data: 1) real-time analysis, visualization, and quality assurance, and 2) ad-hoc analysis. The most common open-source streaming options will be mentioned; however, this talk will focus on Apache Flink specifically. A brief discussion of Apache Beam will also be included, in the context of the larger discussion of a unified data processing model.
In this session we will walk through the steps required to securely communicate with your device using the Device Shadow service. This will include an overview of user authentication and authorization, connecting to AWS IoT, and using MQTT to communicate with the device’s “Device Shadow” to read and update its state. All this, using the AWS Amplify CLI and SDK.
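The session itself works through the Amplify CLI and SDK; as a language-neutral illustration of the underlying MQTT conversation, the sketch below talks to the documented shadow topics directly using paho-mqtt, with a placeholder endpoint, certificates, and thing name.

```python
# A sketch of the Device Shadow MQTT conversation (the session uses the
# Amplify CLI and SDK; this only shows the underlying shadow topics).
# Endpoint, certificate paths, and thing name are placeholders.
import json
import ssl
import paho.mqtt.client as mqtt

ENDPOINT = "your-ats-endpoint.iot.us-east-1.amazonaws.com"  # placeholder
THING = "thermostat-1"                                      # placeholder
SHADOW = f"$aws/things/{THING}/shadow"   # documented shadow topic prefix

def on_connect(client, userdata, flags, rc):
    # once connected, subscribe to the response topics, request the current
    # shadow document, and report a new state
    client.subscribe(f"{SHADOW}/get/accepted")
    client.subscribe(f"{SHADOW}/update/accepted")
    client.publish(f"{SHADOW}/get", "")
    client.publish(f"{SHADOW}/update",
                   json.dumps({"state": {"reported": {"temperature": 21.5}}}))

def on_message(client, userdata, msg):
    # shadow responses arrive on the .../accepted (or .../rejected) topics
    print(msg.topic, json.loads(msg.payload))

client = mqtt.Client(client_id=THING)    # paho-mqtt 1.x style
client.on_connect = on_connect
client.on_message = on_message
client.tls_set(ca_certs="AmazonRootCA1.pem",
               certfile="device-certificate.pem.crt",
               keyfile="device-private.pem.key",
               tls_version=ssl.PROTOCOL_TLSv1_2)
client.connect(ENDPOINT, port=8883)
client.loop_forever()
```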
This presentation will take you through the biggest areas where you need to focus your efforts in order to keep your data safe at AWS, and will show some real-life examples of what could go wrong if you make compromises or allow bad practices.