Aggregating Files in your Data Lake – Part 3
In this final part of a three-part series, I add another aggregation step to combine a month’s worth of data and write it as Parquet.
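The post itself walks through the real implementation; the sketch below only illustrates the general shape of such a monthly roll-up. The bucket name and key layout are hypothetical, and it assumes the daily aggregates are newline-delimited JSON, using boto3 and PyArrow to combine them into a single Parquet file.

```python
# Hypothetical sketch of a monthly roll-up; bucket name and key layout
# are placeholders, not the pipeline described in the post.
import gzip
import json

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")
BUCKET = "example-cloudtrail-aggregated"   # placeholder bucket name


def aggregate_month(year: int, month: int) -> None:
    prefix = f"daily/{year:04d}/{month:02d}/"   # placeholder daily-partition layout
    records = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            if obj["Key"].endswith(".gz"):
                body = gzip.decompress(body)
            # each daily aggregate is assumed to hold newline-delimited JSON
            records.extend(json.loads(line) for line in body.splitlines() if line)
    # write the whole month as one Parquet file, then upload it
    table = pa.Table.from_pylist(records)
    pq.write_table(table, "/tmp/month.parquet", compression="snappy")
    s3.upload_file("/tmp/month.parquet", BUCKET,
                   f"monthly/{year:04d}/{month:02d}/data.parquet")
```

One constraint worth noting: Lambda's /tmp filesystem is 512 MB by default (configurable up to 10 GB), which is one reason assembling a month of data in a single invocation can be awkward.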
When I ran the Lambda from my previous post against Chariot’s CloudTrail repository, it took almost four minutes to process a single day’s worth of data. That seems like a long time, and as a developer I want to optimize everything I write. In this post I analyze the current runtime and look at options for improving it.
As I’ve written in the past, large numbers of small files make for an inefficient data lake. But sometimes you can’t avoid small files. Our CloudTrail repository, for example, has 4,601,675 files as of this morning, 44% of which are under 1,000 bytes. In this post, I develop a Lambda-based data pipeline to aggregate these files, storing them in a new S3 location partitioned by date. Along the way I call out some of the challenges that such a pipeline faces.
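As a minimal sketch of that first step (the account ID, region, and bucket names below are placeholders, not the post’s actual configuration), a daily job might concatenate the small gzipped CloudTrail files for one day into a single compressed NDJSON object under a date-partitioned prefix:

```python
# Hypothetical sketch: concatenate one day's small CloudTrail files into a
# single gzipped NDJSON object under a date-partitioned prefix. Bucket
# names and the account/region in the key are placeholders.
import gzip
import io
import json

import boto3

s3 = boto3.client("s3")
SRC_BUCKET = "example-cloudtrail-logs"   # placeholder source bucket
DST_BUCKET = "example-cloudtrail-agg"    # placeholder destination bucket


def aggregate_day(year: int, month: int, day: int) -> None:
    # CloudTrail delivers gzipped files under
    # AWSLogs/<account>/CloudTrail/<region>/<yyyy>/<mm>/<dd>/
    prefix = (f"AWSLogs/123456789012/CloudTrail/us-east-1/"
              f"{year:04d}/{month:02d}/{day:02d}/")
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as out:
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=prefix):
            for obj in page.get("Contents", []):
                raw = s3.get_object(Bucket=SRC_BUCKET, Key=obj["Key"])["Body"].read()
                # each CloudTrail file is a gzipped JSON document with a
                # top-level "Records" array
                for event in json.loads(gzip.decompress(raw))["Records"]:
                    out.write((json.dumps(event) + "\n").encode("utf-8"))
    s3.put_object(Bucket=DST_BUCKET,
                  Key=f"date={year:04d}-{month:02d}-{day:02d}/events.ndjson.gz",
                  Body=buf.getvalue())
```

Writing one object per day under a Hive-style `date=` partition key keeps the output friendly to Athena and Glue, which is the usual payoff for this kind of aggregation.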
Clickstream data – the behavior data collected from a user’s path through a website or app – is often used for business intelligence reports. It helps many companies answer questions like, ‘Which of my products are people adding to their cart?’ or ‘What does our online purchase funnel look like?’ But our AWS Practice Lead, …
This post is a quick primer on the basic titles and skills best suited to fulfill responsibilities along your company’s data pipeline.
Keith Gregory talks to Andrew Ganim, one of Chariot’s experienced software consultants, about his recent project: building a data pipeline for a multinational company.
This talk will cover the high-level design of Kinesis, how it scales, how clients can retrieve records that have been stored in it, and the use of Kinesis Analytics to transform data and extract outliers.