Data Engineering is more SRE than SQL

Following my post about the Chariot Data Engineering interview, I received some comments along the lines of “wait, you don’t test their SQL skills?!?” Actually, we do: after loading the test data into Redshift, the candidate writes three progressively more difficult queries. But by then, I’m pretty sure they’ve got the skills we need, because in my experience, SQL is only a small part of a Data Engineer’s job.

Site Reliability Engineering (SRE) originated at Google, and focuses on “improv[ing] the design and operation of systems to make them more scalable, more reliable, and more efficient.” This doesn’t seem to have anything to do with data engineering, until you think about the path that data takes from its source, possibly in a third-party system, to its destination in your data warehouse. There are a lot of places on that path where things can go wrong.

In this post I’m going to take a look at a number of parts of the data pipeline, and how an SRE mindset can make that pipeline better. I want to stress that SRE is not Ops, even though they sometimes overlap, and even though some companies have renamed their Ops team to “SRE” to make it sound more exciting. SRE is about applying engineering principles to improve operations.

In the rest of this post, I’ll touch on several tasks of a data engineer that I believe are closely aligned with SRE – or devops, if you prefer that term. You may be familiar with many of these practices, but I think that data engineering puts its own twist on them.

Infrastructure as Code

I’ve written a surprising amount of Terraform and CloudFormation while working as a data engineer. It’s surprising because data engineering projects are often one-offs: send the data from source X to the data lake or warehouse, maybe with a transformation script in the middle. And data engineering, unlike traditional front- and back-end development, doesn’t often have separate dev/test/prod deployment environments. Instead, you send new data directly into the production environment, so that your analysts can figure out how it integrates with the data you’ve already got.

But while any particular data pipeline might be a unique snowflake, in aggregate they all tend to look alike. You might have dozens of pipelines that all follow the same general structure (e.g., a Kinesis stream that feeds a transformation Lambda, which writes data into a data lake). In this case, IaC tools let you formalize that structure, rounding off the edges of the snowflakes.
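To make that concrete, here’s a minimal sketch of formalizing the Kinesis-to-Lambda-to-data-lake pattern as a reusable construct, written in Python with the AWS CDK (which synthesizes CloudFormation). This isn’t taken from a real project; the construct name, runtime, and parameters are all illustrative, and the same idea works just as well as a Terraform module.

```python
from aws_cdk import Duration, aws_kinesis as kinesis, aws_lambda as lambda_, aws_s3 as s3
from aws_cdk.aws_lambda_event_sources import KinesisEventSource
from constructs import Construct


class StreamingPipeline(Construct):
    """One reusable construct per source: stream -> transformation Lambda -> data lake."""

    def __init__(self, scope: Construct, construct_id: str, *,
                 source_name: str, data_lake: s3.Bucket) -> None:
        super().__init__(scope, construct_id)

        # One Kinesis stream per data source, named after the source.
        stream = kinesis.Stream(self, "Stream", stream_name=f"{source_name}-events")

        # The per-source transformation script lives in its own directory.
        transform = lambda_.Function(
            self, "Transform",
            runtime=lambda_.Runtime.PYTHON_3_11,
            handler="transform.handler",
            code=lambda_.Code.from_asset(f"lambdas/{source_name}"),
            timeout=Duration.minutes(1),
            environment={"DATA_LAKE_BUCKET": data_lake.bucket_name},
        )

        # Wire the stream to the Lambda and let the Lambda write to the lake.
        transform.add_event_source(
            KinesisEventSource(stream, starting_position=lambda_.StartingPosition.LATEST)
        )
        data_lake.grant_write(transform)
```

Each new pipeline is then one instantiation of the construct, which is exactly the “rounding off the edges of the snowflakes” that IaC buys you.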

IaC scripts have another, often under-appreciated, benefit: they’re a form of documentation. Like unit tests, they give an unambiguous description of “this is what the system is.” And for a new developer, a library of IaC scripts says “this is what our systems should look like.”

Data Protection

Data protection is one of the most complex topics in any data engineering project. On the one hand is control of the data: allowing it to be used in some cases but not others. On the other is ensuring that there’s an audit trail and lineage for each piece of data from source to destination. And on the third hand, there’s data preservation: ensuring that, once data has entered the system, it will always remain available unless intentionally destroyed.

Each of these three topics is worthy of its own blog post. All of them, I think, fall under the category of “infrastructure engineering” rather than “pipeline development.” They require knowledge of your deployment technologies, as well as which of their options are appropriate for each purpose.
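As one small example of the preservation side, assuming your data lake lives in S3, a sketch like the following turns on bucket versioning and retains noncurrent object versions for a year, so an accidental overwrite or delete is recoverable rather than permanent. The bucket name and retention window are illustrative.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # illustrative bucket name

# Keep every version of every object, so overwrites and deletes are recoverable.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Expire noncurrent versions only after a long retention window.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "retain-noncurrent-versions",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "NoncurrentVersionExpiration": {"NoncurrentDays": 365},
        }]
    },
)
```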

Monitoring

Data pipeline monitoring is almost, but not completely, unlike application monitoring. For a typical website, you care about whether the site is up, along with error rates and performance. In a data engineering pipeline, you care whether the pipeline is running and able to retrieve and process data, but you also care about the contents of the data itself.

When developing a data pipeline, a good data engineer reads whatever documentation is available, and looks at real-world data. It’s remarkable how frequently the two don’t agree, most commonly with incorrect data types and misspelled field names. Write code to validate the schema of a source record, and flag records that don’t match that schema.
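As a minimal sketch, assuming a click-stream record with a handful of expected fields (the field names and types here are illustrative), that validation can be as simple as:

```python
# Hypothetical expected schema for a click-stream record: field name -> expected type.
EXPECTED_SCHEMA = {
    "user_id": str,
    "event_type": str,
    "timestamp": int,  # epoch milliseconds
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record matches the schema."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"bad type for {field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    # Fields you didn't expect are often the misspelled ones.
    unexpected = set(record) - set(EXPECTED_SCHEMA)
    if unexpected:
        problems.append(f"unexpected fields (possibly misspelled): {sorted(unexpected)}")
    return problems
```

In practice you might reach for a library such as jsonschema or pydantic, but the principle is the same: know what the record should look like, and flag anything that doesn’t.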

Going a little deeper, develop some statistical metrics for the data. If you’re receiving click-stream data from a mobile app, and that data is supposed to hold a user identifier, count the number of distinct IDs, or at least build a histogram of the top 10 IDs, over some time period. Aside from capturing obvious mistakes, such as every user having the same ID, over time you’ll be able to set guardrails for what the data feed should look like.
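Continuing the click-stream example (again with an illustrative user_id field), a per-window metric can be little more than a distinct count and a top-10 histogram:

```python
from collections import Counter
from typing import Iterable


def user_id_metrics(records: Iterable[dict]) -> dict:
    """Distinct count and top-10 histogram of user IDs over one batch or time window."""
    counts = Counter(r.get("user_id", "<missing>") for r in records)
    total = sum(counts.values())
    top_10 = counts.most_common(10)
    return {
        "total_records": total,
        "distinct_user_ids": len(counts),
        "top_10_user_ids": top_10,
        # A single ID dominating the window is the "every user has the same ID" smell.
        "top_id_share": (top_10[0][1] / total) if total else 0.0,
    }
```

Publish those numbers to whatever monitoring system you already use, and the guardrails fall out of the history.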

Error Handling

Once you identify anomalous data, what do you do with it? You don’t want it to make its way into your destination data lake or warehouse, but you also don’t want to throw it away. Instead, your data pipeline should contain “taps” at different points, where bad records can be set aside for later analysis and possible replay. Streaming services, like Kinesis or Kafka, are a great way to implement such taps.
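Here’s a hedged sketch of such a tap, assuming the validation function from the monitoring section above and an illustrative reject-stream name: records that fail validation are diverted to a separate Kinesis stream, along with the reasons they were rejected, instead of being dropped or written to the warehouse.

```python
import json

import boto3

kinesis = boto3.client("kinesis")
REJECT_STREAM = "clickstream-rejects"  # illustrative "tap" stream name


def process(record: dict) -> None:
    problems = validate_record(record)  # from the earlier monitoring sketch
    if problems:
        # Set the record aside, with the reasons it was rejected, for later
        # analysis and possible replay.
        kinesis.put_record(
            StreamName=REJECT_STREAM,
            Data=json.dumps({"record": record, "problems": problems}).encode("utf-8"),
            PartitionKey=str(record.get("user_id", "unknown")),
        )
        return
    write_to_data_lake(record)  # hypothetical downstream writer
```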

Designing for Scale

Scale is perhaps the most “engineering” part of data engineering. You need to understand how much data you’ll receive, and then pick appropriate technologies. It might also mean completely rethinking your desired architecture, if the services that you plan to use can’t support the scale.

For example, my Right-sizing Data for Athena post explores the limitations of small files on S3. The back-story to that post was a client who wanted to partition their data lake using the primary key of their main entity. This would have resulted in several million partitions, and while S3 is extremely fast for single-file reads, the sheer number of files would have made any query excessively slow.

Part of designing for scale is also designing for cost. I am constantly amazed at the number of people who pick DynamoDB as a data warehouse without considering how many read capacity units they’ll need to support their query load.
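To make that concrete, here’s a back-of-envelope calculation with purely illustrative numbers. It relies only on the documented DynamoDB pricing model: one read capacity unit covers a read of up to 4 KB per second, and eventually consistent reads cost half as much.

```python
# Back-of-envelope arithmetic (illustrative numbers, not a real workload) showing
# why scanning a DynamoDB table for analytics burns read capacity quickly.
TABLE_SIZE_GB = 100            # hypothetical table size
SCAN_WINDOW_SECONDS = 15 * 60  # suppose each full scan must finish within 15 minutes

table_size_kb = TABLE_SIZE_GB * 1024 * 1024
# Eventually consistent reads: one read capacity unit covers 2 x 4 KB per second.
read_units_per_full_scan = table_size_kb / 4 / 2
sustained_rcus = read_units_per_full_scan / SCAN_WINDOW_SECONDS

print(f"Read units per full scan: {read_units_per_full_scan:,.0f}")       # ~13.1 million
print(f"Sustained RCUs to finish in 15 minutes: {sustained_rcus:,.0f}")   # ~14,600
```

That’s roughly 13 million read units per scan, and around 14,600 provisioned RCUs sustained, before you even consider that analytic queries rarely line up with a single partition key.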

Conclusion

While writing this post I realized that I could write multiple posts on each of the topics covered, as well as add more topics. But that’s the nature of data engineering: while an individual pipeline seems simple, there is ever-expanding complexity as your data platform grows. SRE principles help you tame that complexity.
