performance

Perils of Partitioning

Partitioning is one of the easiest ways to improve the performance of your data lake, because it reduces the amount of data scanned. But implementing partitions, and then using them effectively, can be surprisingly challenging. In this post I look at several of the issues that you should consider when partitioning your data.

Aggregating Files in your Data Lake – Part 1

As I’ve written in the past, large numbers of small files make for an inefficient data lake. But sometimes you can’t avoid small files. Our CloudTrail repository, for example, has 4,601,675 files as of this morning, 44% of which are under 1,000 bytes long. In this post, I develop a Lambda-based data pipeline to aggregate these files, storing them in a new S3 location partitioned by date. Along the way I call out some of the challenges that face such a pipeline.

Philly ETE 2022 — Next.js, Remix.run and Accelerating React Performance — Ken Rimple

Abstract: People love React for its simplicity: you can learn the basics in half a day, bring your favorite tools and APIs, and dive right in. But that rapid application development also comes at a cost: bloated, slow, unstable Single Page Applications that grind your browser to a halt. Tools like Next.js, Remix.run and …

Philly ETE 2021 — Modern (In)Efficiencies: Performance on Modern Hardware — Todd Montgomery

Abstract: How can anyone keep up with new technologies in computing today? We have new CPUs, GPUs, drives, network gear, libraries, and OS versions all the time. How do those with an eye for performance deal with this rapidly changing space? In this session, we will explore some modern technologies. How they make us reconsider …