Small files are the bane of a data lake, increasing your query times and processing costs. However, you often don’t control the data that you receive. For example, CloudTrail writes one file for each account and region approximately every 15 minutes: dozens or even hundreds of files a day, some of which contain only a few events.
An Athena query against the raw CloudTrail data might take minutes to execute; most of that time is overhead from opening and reading each small file. By comparison, after aggregating the CloudTrail logs into one file per day, the same query takes only a few seconds.
In this talk, Keith Gregory walks through a data pipeline that uses Lambda to aggregate these files into a form that can be queried efficiently. He looks at the general design of such a pipeline, how to trigger it, how to monitor it, and how to make it resilient to processing errors.
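To give a flavor of the aggregation step the talk describes, here is a minimal sketch of the core merge logic. It is not the speaker's implementation: it reads gzipped CloudTrail-style files from a local directory rather than S3, and the function name and file layout are assumptions for illustration. In a real pipeline this logic would live inside a Lambda handler and use boto3 to read and write objects.

```python
import gzip
import json
from pathlib import Path

def aggregate_cloudtrail_files(src_dir: str, dest_file: str) -> int:
    """Merge the 'Records' arrays from many small gzipped CloudTrail
    files into a single gzipped file; returns the event count.
    (Illustrative sketch: a Lambda-based pipeline would read from and
    write to S3 via boto3 instead of the local filesystem.)"""
    records = []
    # CloudTrail log files are gzipped JSON objects with a top-level
    # "Records" array of events.
    for path in sorted(Path(src_dir).glob("*.json.gz")):
        with gzip.open(path, "rt") as f:
            records.extend(json.load(f).get("Records", []))
    # Write one combined file so Athena reads a single object per day
    # instead of dozens or hundreds of tiny ones.
    with gzip.open(dest_file, "wt") as f:
        json.dump({"Records": records}, f)
    return len(records)
```

The interesting design questions — triggering the merge, monitoring it, and recovering from partial failures — are exactly what the talk covers.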
Keith Gregory is a frequent speaker at the Philadelphia AWS meetup. He has been a professional software developer since 1984. In that time he has worked in areas as diverse as scalable web applications and hard-real-time data acquisition. But for the past 30+ years, his path has always returned to database systems, big data, and performance optimization.