Aggregating Files in your Data Lake – Part 1

As I’ve written in the past, large numbers of small files make for an inefficient data lake. But sometimes, you can’t avoid small files. Our CloudTrail repository, for example, has 4,601,675 files as-of this morning, 44% of which are under 1,000 bytes long. In this post, I develop a Lambda-based data pipeline to aggregate these files, storing them in a new S3 location partitioned by date. Along the way I call out some of the challenges that face such a pipeline.

Delving into CloudTrail events

CloudTrail provides you with an audit log of every successful API call made in your AWS account. This post focuses on management events in CloudTrail, and techniques for exploring and analyzing those events using a search engine such as Elasticsearch with Kibana.