Small Data: a pipeline for low-latency decision support


In my last post, I said that I didn’t think Postgres was a good choice for a decision support database, versus a task-specific DBMS such as Redshift. In this post I’m going to take the opposite stand, and say that there are cases where Postgres is appropriate: namely, low-latency systems that contain a limited amount of data.

Background and Architecture

A few years ago I worked on a project for an online gaming company that wanted the ability to run promotions in response to near-real-time wager data. The goal for any sportsbook, online or not, is to balance the amount of money wagered on either side of the game. If they can do that, then the outcome of the game doesn’t matter: the losers cover the winners, and the sportsbook takes a percentage (the “vig”) for themselves.

But sometimes, sports bettors have their own ideas. Especially when Tom Brady is playing in the Super Bowl. And when that happens, a sportsbook that doesn’t make its own odds needs to find some way to bring the numbers back in balance. For example, they might offer promotional cash for bets on the other team.

And to do that, they need up-to-the-minute betting totals, something this company couldn’t get from their existing Redshift data warehouse, which was rebuilt every morning based on the prior day’s updates. However, they did have a feed of live data via Segment, which we fed into a Postgres database:

Architecture for a near-real-time decision-support database, fed from Segment events.

Kinesis and Lambda

Kinesis is Amazon’s event streaming service; it is roughly comparable to Apache Kafka. For those not familiar with event streaming, it’s a tool for sending messages to multiple consumers. However, unlike traditional pub-sub messaging, where consumers miss messages when they aren’t listening, the stream retains messages for a set time interval; consumers can “catch up” if they’re delayed for any reason.

Kinesis easily integrates with Lambda, sending batches of up to 10,000 events at a time. One of the nice things about Kinesis in this use-case is that you can feed the same stream into multiple unrelated Lambda functions. In our case, we had one Lambda to process sportsbook data, and another Lambda that sent data to the company’s loyalty program from the same stream.
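As a concrete illustration, here is a minimal sketch of such a Lambda handler. It assumes the producer writes JSON wager events to the stream; the field names and the handler’s return value are hypothetical, and the actual database write is elided.

```python
import base64
import json

def handler(event, context):
    """Sketch of a Lambda handler behind a Kinesis event source.

    Each record's payload arrives base64-encoded in the event; here we
    assume the payload is a JSON wager event (field names hypothetical).
    """
    wagers = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        wagers.append(json.loads(payload))
    # ... write the decoded batch to Postgres here ...
    return {"batchSize": len(wagers)}
```

Note that the handler receives the whole batch at once, so it can decode every record before touching the database and then write them in a single transaction.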

If the Lambda fails while processing a batch of data, Kinesis can retry. It can also split the batch in half and retry each half separately, repeating this process in an attempt to isolate the message that’s causing the problem. Then, after a configurable number of retries, it can put the bad message(s) on an SQS queue and continue to process messages from the stream.
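This retry behavior lives on the event source mapping, not in the function itself. The sketch below shows the relevant settings as you might pass them to boto3’s `create_event_source_mapping`; the ARNs, function name, and specific values are placeholders.

```python
# Event-source-mapping settings controlling the retry behavior described
# above (ARNs and names are placeholders). Pass these as keyword arguments
# to boto3's Lambda client: client.create_event_source_mapping(**params).
event_source_params = {
    "EventSourceArn": "arn:aws:kinesis:us-east-1:123456789012:stream/wagers",
    "FunctionName": "process-wagers",
    "StartingPosition": "LATEST",
    "BatchSize": 10000,
    "BisectBatchOnFunctionError": True,  # split a failing batch to isolate bad records
    "MaximumRetryAttempts": 3,
    "DestinationConfig": {               # where unprocessable records end up
        "OnFailure": {"Destination": "arn:aws:sqs:us-east-1:123456789012:wager-dlq"}
    },
}
```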

One critical thing to understand about this retry behavior is that messages may be processed multiple times. As a result, the Lambda must be able to either process messages idempotently, so that multiple updates have the same result, or do everything in a transaction that’s rolled-back on failure.
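With Postgres as the target, the idempotent option is straightforward: `INSERT ... ON CONFLICT` makes a replayed record overwrite the row with the same values. A minimal sketch, assuming a `bets` table keyed on `bet_id` (all table and column names are illustrative):

```python
# Idempotent upsert: re-processing the same Kinesis record rewrites the
# row rather than duplicating it, and the timestamp guard ignores stale
# replays that arrive after a newer event has already been applied.
UPSERT_SQL = """
INSERT INTO bets (bet_id, status, amount, updated_at)
VALUES (%(bet_id)s, %(status)s, %(amount)s, %(updated_at)s)
ON CONFLICT (bet_id) DO UPDATE
   SET status     = EXCLUDED.status,
       amount     = EXCLUDED.amount,
       updated_at = EXCLUDED.updated_at
 WHERE bets.updated_at <= EXCLUDED.updated_at;
"""

def write_batch(conn, wagers):
    """Write a whole batch in one transaction (psycopg2-style connection
    assumed): if any record fails, the transaction rolls back and the
    Kinesis retry starts from a clean slate."""
    with conn, conn.cursor() as cur:
        for wager in wagers:
            cur.execute(UPSERT_SQL, wager)
```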


Why Not Redshift?

As I mentioned above, the company already used Redshift for their main data warehouse. It might seem like a logical choice to use that data warehouse for the streaming data as well, but there were a couple of reasons not to do so.

First, Redshift very much wants to load data via parallel COPY from S3. In the words of the Redshift Developer Guide, individual INSERT statements are “prohibitively slow.” It would certainly be possible to stage data on S3 and then load it, but that would have added latency.

And second, Redshift does not have an UPSERT operation, nor does it enforce primary keys. Given that, you must use a chain of operations to insert new rows, update old rows, and avoid duplicates. This was a particular challenge for us, because bets went through a distinct life-cycle, with multiple events between placement and payoff.
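For reference, that chain of operations is the standard Redshift staging-table merge: stage the new rows, delete matching keys from the target, then insert the staged rows. The sketch below shows the shape of it; the table names, S3 location, and IAM role are illustrative.

```python
# The staging-table merge pattern Redshift requires in place of UPSERT
# (all names and ARNs are placeholders). Each batch needs this whole
# chain, which is why we ruled Redshift out for low-latency writes.
REDSHIFT_MERGE = [
    "CREATE TEMP TABLE bets_stage (LIKE bets);",
    "COPY bets_stage FROM 's3://example-bucket/staged-bets/' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' JSON 'auto';",
    "BEGIN;",
    "DELETE FROM bets USING bets_stage WHERE bets.bet_id = bets_stage.bet_id;",
    "INSERT INTO bets SELECT * FROM bets_stage;",
    "END;",
]
```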

After considering these issues, it became clear that we needed all of the capabilities of a transactional database system. It just happened that the data in this database would be used for decision-support purposes.

Data Partitioning

Something else that Postgres provides – but Redshift doesn’t – is the ability to easily partition tables. In the case of this application, partitioning wasn’t so much a performance improvement as a way to discard old data when it was no longer relevant.

One problem with Postgres partitioning is that, unlike MySQL, you can’t split partitions once they have data in them. This meant performing the following steps for each batch:

  1. Iterate the source records to identify all necessary partitions.
  2. For each partition, determine whether it already exists.
  3. Create any partitions that don’t already exist. Be aware that you might have a concurrently-executing Lambda that’s attempting to do the same thing, so use IF NOT EXISTS when creating the partition.
  4. Write the source records into the database.

We cached the results of step #2, so that multiple invocations of the same Lambda wouldn’t have to make the same database calls.
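The steps above might look something like the following sketch, which assumes a parent `bets` table partitioned by day on a `placed_at` timestamp (all names are illustrative). The module-level set is the step-#2 cache: it survives across warm invocations of the same Lambda, and `IF NOT EXISTS` covers the case where a concurrent Lambda creates the partition first.

```python
from datetime import date, timedelta

# Partitions this Lambda instance has already created or verified;
# persists across warm invocations, so step 2 hits the database once.
_known_partitions: set[str] = set()

def partition_ddl(day: date) -> str:
    """DDL for one daily partition; IF NOT EXISTS makes concurrent
    creation by another Lambda harmless."""
    name = f"bets_{day:%Y%m%d}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF bets "
        f"FOR VALUES FROM ('{day}') TO ('{day + timedelta(days=1)}');"
    )

def ensure_partitions(cur, wagers):
    """Steps 1-3: find the days in this batch and create any missing
    partitions before the batch is inserted (step 4)."""
    for day in {w["placed_at"].date() for w in wagers}:
        name = f"bets_{day:%Y%m%d}"
        if name not in _known_partitions:
            cur.execute(partition_ddl(day))
            _known_partitions.add(name)
```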

We also wrote the Lambda to drop old partitions after a set time period. There is always the risk of a race condition when you have concurrent Lambdas making DDL changes: one creates a partition for an old message, another drops that partition, and the first fails when it tries to insert the data. This is one area where the Kinesis/Lambda automatic retry served us well: as long as race conditions were infrequent, they would resolve themselves.
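The drop side can be equally simple. A hypothetical retention sweep, assuming daily partitions named `bets_YYYYMMDD`: `IF EXISTS` means two concurrent Lambdas can both attempt the same drop without either one failing.

```python
from datetime import date, timedelta

RETENTION_DAYS = 14  # illustrative retention window

def expired_partition_ddl(today: date, retention_days: int = RETENTION_DAYS) -> list[str]:
    """DROP statements for daily partitions past the retention window.

    Sweeps a bounded window behind the cutoff rather than querying the
    catalog; IF EXISTS makes a concurrent drop of the same partition a
    no-op instead of an error.
    """
    cutoff = today - timedelta(days=retention_days)
    return [
        f"DROP TABLE IF EXISTS bets_{cutoff - timedelta(days=n):%Y%m%d};"
        for n in range(1, 8)
    ]
```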

Wrapping Up

This application is an example of what I call a “micro-data pipeline,” in which purpose-built Lambda functions tap into an existing source of streaming data. Like micro-services, this lets you focus on the task at hand, and deploy functionality independently. However, also like micro-services, you can easily find yourself with dozens of Lambda functions and no clear relationships between them. Documentation is critical.

If I were to do this in the future, I would consider using EventBridge Pipes to receive records from Kinesis, and dispatch them to an EventBridge event bus. This promises to give you a centralized place to define all of the rules that apply to a particular stream. As of this writing, however, I’ve found that you still need a transformation Lambda in the stream, so it might not actually save any effort.


