tutorial

Getting Started with Burpless: Writing Cucumber Tests in Clojure

No matter how rapidly the world of software development may change, one constant is the need to ensure the quality, functionality, and reliability of our software applications. As our demand for more and more complex applications continues to increase, so does the risk, not only that developers might program something incorrectly thereby introducing bugs, but …

An ML tale: From notebook to production

Data Scientists spend their days working in Jupyter notebooks, which are then passed to an implementation team to prepare for production. This post guides you through that process, emphasizing iterative refinement. I will be using the scikit-learn and XGBoost libraries, but other ML libraries could be swapped in. While scikit-learn offers a comprehensive library of …

Automate the Boring Stuff with AI

My motivation for creating tools often stems from a desire to get familiar with new technologies. This project was no different; I wanted to deepen my understanding of Generative AI. However, this wasn’t the primary reason for its creation. The real driving force was a persistent gap in my workflow that I couldn’t ignore any …

Large Language Model (LLM) Coding Assistance

Note: It has been about three months since this was originally written, so there is a certain amount of information that is out of date. See the addendum for updated information. With all the hype surrounding Generative AI/LLM, and all the hallucinations mentioned in the news, what are these actually good for? As it turns …

Transforming Data with Amazon Athena

My prior posts used Lambda to do data transformation. But what if we could use a non-programmatic tool, in keeping with the Extract-Load-Transform mindset of the modern data pipeline? As it turns out, we can: Amazon Athena can write data as well as query it. There are, of course, a few stumbles along the way. In this blog post I walk through the process of aggregating CloudTrail data using SQL.

From RAGs to Riches – Adding Context to Your LLM

In my previous post, Experiences in Fine-Tuning LLMs: Time + Power = Potato?, I covered my experiences around trying to fine-tune an LLM (large language model) with a dataset, which gave me less than stellar results. Ultimately, fine-tuning is best for a use-case where additional reasoning & logic needs to be added to an LLM, …

Aggregating Files in your Data Lake – Part 3

In this final part of a three-part series, I add another aggregation step to combine a month’s worth of data and write it as Parquet.

Apple Silicon GPUs, Docker and Ollama: Pick two.

If you’ve tried to use Ollama with Docker on an Apple GPU lately, you may have found that the GPU is not supported. But you can get Ollama to run with GPU support on a Mac. This article explains the problem, how to detect it, and how to get your Ollama workflow running with all of your VRAM (which, on a Mac, is your DRAM too)!

Getting started with LLM in the Cloud with Amazon DLAMI EC2 Instances

So you want to execute some custom CUDA-based AI processing on a GPU, but don’t have the hardware? Have an AWS account? Try using the DLAMI machine instances. This article explains how to get started if you need OS-level access.

Aggregating Files in your Data Lake – Part 1

As I’ve written in the past, large numbers of small files make for an inefficient data lake. But sometimes, you can’t avoid small files. Our CloudTrail repository, for example, has 4,601,675 files as of this morning, 44% of which are under 1,000 bytes long. In this post, I develop a Lambda-based data pipeline to aggregate these files, storing them in a new S3 location partitioned by date. Along the way I call out some of the challenges that face such a pipeline.