Populating Iceberg Tables with Amazon Data Firehose
In this post I look at using an Amazon Data Firehose to populate Iceberg tables, with the automatic optimization features that AWS announced in November 2024.
In this post I look at using an Amazon Data Firehose to populate Iceberg tables, with the automatic optimization features that AWS announced in November 2024.
Data Scientists spend their days working in Jupyter notebooks, which are then passed to an implementation team to prepare for production. This post guides you through that process, emphasizing iterative refinement. I will be using the scikit-learn and XGBoost libraries, but other ML libraries could be swapped in. While scikit-learn offers a comprehensive library of … Read More
We recently explored a project to retrieve data from a third-party service. They didn’t offer any push capabilities such as writing to a Kafka or Kinesis stream, or even a web-hook. But they did offer a WebSocket interface, so we explored whether we could use that as our streaming source. We didn’t go that route, but I was intrigued by the idea enough to make a proof-of-concept.
My prior posts used Lambda to do data transformation. But what if we could use a non-programmatic tool, in keeping with the Extract-Load-Transform mindset of the modern data pipeline. As it turns, we can: Amazon Athena can write data as well as query it. There are, of course, a few stumbles along the way. In this blog post I walk through the process of aggregating CloudTrail data using SQL.
In this final part of a three-part series, I add another aggregation step to combine a month’s worth of data and write it as Parquet.
When I ran the Lambda from my previous post against Chariot’s CloudTrail repository, it took almost four minutes to process a single day’s worth of data. That seems like a long time, and as a developer I want to optimize everything I write. In this post I look into analyzing the current runtime, and options for improving it.
As I’ve written in the past, large numbers of small files make for an inefficient data lake. But sometimes, you can’t avoid small files. Our CloudTrail repository, for example, has 4,601,675 files as-of this morning, 44% of which are under 1,000 bytes long. In this post, I develop a Lambda-based data pipeline to aggregate these files, storing them in a new S3 location partitioned by date. Along the way I call out some of the challenges that face such a pipeline.
Decision support databases have a number of quirks that are not obvious to the casual user, particularly someone coming from an OLTP background. In this post I look at how unbalanced distributions can impact your query performance, how you can identify imbalances, and what you can do to fix them.
More, cheaper, faster: our own Keith Gregory recounts the changes in big data, data storage, and data engineering over the last two decades.
A well-designed data strategy is critical to success. Here are 3 philosophies to help you design an optimal data strategy for your business.