Developing A Coding Test for Data Engineering

Hiring good candidates is difficult. After nearly 40 years in this business, and interviewing hundreds of candidates, I’m not going to claim that I have the answer. Just some ideas.

On the one hand, you want to be respectful of the candidate’s time. On the other, you don’t want to hire someone who can’t do the job. Chariot Solutions is a consulting company, so this second concern takes on extra importance: we don’t want to put an unqualified person on a client project.

A well-designed coding challenge is, in my opinion, the best way to evaluate a developer. There are a lot of people with strong opinions to the contrary, but I believe that seeing a person’s relevant work is a valid predictor of their future performance. Certainly better than “leet code” whiteboard questions or brain-teaser riddles. Not as good as looking at actual past work. But very few candidates have the ability to show relevant past work, usually because it’s owned by their past employer.

Chariot already had a coding challenge as part of our interview process. But our business has historically been web and mobile application development, so it wasn’t relevant to data engineers. Developing a new challenge became my challenge.

Creating the Challenge

I’ve used the word “relevant” several times now. In my opinion, that’s the key to a valid challenge: it must represent what the candidate will do in their day-to-day work.

Unfortunately, data engineering covers a wide range of topics: infrastructure, coding, queries, visualization … the list goes on. Trying to capture all of that would mean a gigantic test. In addition to not being respectful of the candidate’s time, such a test would be prone to false negatives.

I settled on a Lambda function that would accept records from Kinesis, convert them to CSV, and write them to S3. I think I’ve written some variant of this program a dozen times or more; sources change, destinations change, but programmatic transforms are one of the core tasks for a working data engineer.
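To make the shape of that task concrete, here is a minimal sketch of such a function. It is not the challenge's reference solution: the column list, bucket name, object key, and the injectable `s3_client` parameter are all assumptions made for illustration, and the sketch assumes each Kinesis record carries one JSON object.

```python
import base64
import csv
import io
import json

# Hypothetical column order; the challenge's real schema is not given here.
FIELDS = ["id", "timestamp", "value"]


def decode_records(event):
    """Yield one dict per Kinesis record in a Lambda event."""
    for record in event.get("Records", []):
        # Lambda delivers Kinesis payloads base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        yield json.loads(payload)


def to_csv(rows, fields=FIELDS):
    """Render dicts as CSV text, one line per row, columns in `fields` order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    for row in rows:
        writer.writerow(row)  # fields missing from a row are written as empty strings
    return buffer.getvalue()


def handler(event, context, s3_client=None,
            bucket="example-bucket", key="output/batch.csv"):
    """Lambda entry point: one Kinesis batch in, one CSV object out."""
    if s3_client is None:
        import boto3  # imported lazily so the transform can run without AWS
        s3_client = boto3.client("s3")
    body = to_csv(decode_records(event)).encode("utf-8")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return {"bytes_written": len(body)}
```

Keeping the decode and CSV steps as plain functions, separate from the S3 call, is a design choice for the sketch: it means most of the logic can be exercised locally with a stub client.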

For implementation language, I picked Python (with the caveat that the candidate could choose an alternative but would then have to provide build and deployment scripts). In my experience, Python is the lingua franca of data engineering, so this shouldn’t be an issue.

To get the candidate started, we provide a skeleton program and unit tests. This limits candidate toil, and it also lets us quickly verify that they did, in fact, implement the challenge. And it lets them develop their solution without needing AWS infrastructure.
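A unit test of this kind can run entirely on a laptop, because a Kinesis batch is just a nested dictionary by the time it reaches the function. The sketch below shows one way to do it; `make_kinesis_event` and `extract_payloads` are hypothetical names, and the function under test is a stand-in rather than the skeleton we actually hand out.

```python
import base64
import json
import unittest


def make_kinesis_event(payloads):
    """Wrap dicts in the shape Lambda uses for Kinesis batches (test helper)."""
    return {
        "Records": [
            {"kinesis": {"data": base64.b64encode(
                json.dumps(p).encode("utf-8")).decode("ascii")}}
            for p in payloads
        ]
    }


def extract_payloads(event):
    """Stand-in for the code under test: decode each record back to a dict."""
    return [
        json.loads(base64.b64decode(r["kinesis"]["data"]))
        for r in event["Records"]
    ]


class ExtractPayloadsTest(unittest.TestCase):
    def test_round_trip(self):
        payloads = [{"id": "a1"}, {"id": "b2"}]
        self.assertEqual(extract_payloads(make_kinesis_event(payloads)), payloads)

    def test_empty_batch(self):
        self.assertEqual(extract_payloads(make_kinesis_event([])), [])
```

Saved as a test module, this runs with `python -m unittest` and needs no AWS credentials.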

Our estimate is that it should take a qualified candidate under two hours to complete the challenge. This isn’t an arbitrary number: we “play tested” the challenge with several Chariot consultants before giving it to real candidates. I highly recommend doing the same for every part of your interview process, lest you discover that your own people don’t pass the bar.

Administering the Challenge

The existing Chariot test is in-person (or, once COVID hit, on Zoom). We have a laptop configured with a variety of IDEs and all of the other software that you might need to accomplish it. And candidates are allowed to go out to the Internet to look up documentation. But the “live” component still adds unnecessary stress in my opinion: many people like to put their thoughts in order without someone sitting across the table from them.

So, instead, the coding challenge is “take-home”: we give the candidate access to our GitHub repo on Friday morning, and expect them to turn the results in the following Monday. We believe this gives them the most flexibility, but see below.

When the candidate returns the completed challenge, we review it and decide whether or not to move forward. If we move forward, then the candidate will have an “in-person” interview (to date, all have been over Zoom) to actually run their code.

Live Debugging

On the day of the in-person session, we spin up the necessary infrastructure in a dedicated AWS account. Once the interview starts, we deploy the candidate’s code as a Lambda (preferably, they deploy it themselves, giving a hint as to their AWS experience). And then we run an event generator, which writes approximately 10,000 events to Kinesis. The candidate’s Lambda processes these events, writes them to S3, and we upload those files to Redshift. Then we run counts on each table, and compare them to what the event generator reported.
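The final comparison is simple enough to sketch. The example below assumes that both the generator's report and the Redshift counts have already been collected into dictionaries keyed by table name; the table names and numbers are made up for illustration.

```python
def compare_counts(expected, actual):
    """Return {table: (expected, actual)} for every table whose counts differ."""
    mismatches = {}
    for table in sorted(set(expected) | set(actual)):
        want = expected.get(table, 0)  # a table missing on one side counts as 0
        got = actual.get(table, 0)
        if want != got:
            mismatches[table] = (want, got)
    return mismatches


# Hypothetical numbers, for illustration only.
generator_report = {"checkout": 4000, "page_view": 6000}
redshift_counts = {"checkout": 3990, "page_view": 6000}
print(compare_counts(generator_report, redshift_counts))
```

An empty result means every table matched; anything else names the tables worth investigating.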

And, invariably, the counts will be different, because there is a bug in the generator: it omits fields in some of the records, which causes Redshift to reject them.

This is, for me, the core of the interview: how well the candidate debugs problems. Because something else that I’ve found invariable is that data isn’t clean, and you have to be ready for that.

There are many approaches that the candidate could use. If they were familiar with Redshift, then they might look at the STL_LOAD_ERRORS table. If they had followed the instructions and logged records that didn’t match the defined schemas, then they could look at the logs of their function. Or they could extract the IDs of the records in Redshift and compare that to the files in S3. Whatever worked for them.
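Two of those approaches, logging records that don't match a schema and comparing IDs between S3 and Redshift, can be sketched in a few lines. The schemas, the `type` field, and every function name below are assumptions for the sake of the example, not the definitions used in the challenge.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical schemas: required field names per event type.
SCHEMAS = {
    "page_view": {"id", "timestamp", "url"},
    "checkout": {"id", "timestamp", "total"},
}


def missing_fields(record):
    """Return the required fields absent from `record`, or None for an unknown type."""
    required = SCHEMAS.get(record.get("type"))
    if required is None:
        return None
    return sorted(required - set(record))


def split_valid(records):
    """Separate records that match their schema from those that don't, logging each reject."""
    valid, rejected = [], []
    for record in records:
        missing = missing_fields(record)
        if missing is None:
            logger.warning("unknown event type: %r", record)
            rejected.append(record)
        elif missing:
            logger.warning("missing fields %s: %r", missing, record)
            rejected.append(record)
        else:
            valid.append(record)
    return valid, rejected


def ids_missing_from_warehouse(s3_ids, warehouse_ids):
    """Return IDs that were written to S3 but never made it into the warehouse."""
    return sorted(set(s3_ids) - set(warehouse_ids))
```

In a Lambda, the warnings end up in the function's CloudWatch logs, which is where a candidate who logged rejects would go looking.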

I might nudge them toward the logging solution if they haven’t already tried it, but what I want to see is how they respond when things don’t work out as planned.

Some Challenges with a Take-Home Challenge

If you want to try something similar, here are some lessons that we learned.

If you host on GitHub, don’t allow forks

In retrospect, of course, this was an obvious problem. Any candidate familiar with the GitHub workflow will fork the repository and submit a pull request. Unfortunately, pull requests live forever, meaning that any future candidate can see past candidates’ work.

As it turns out, GitHub Support will delete pull requests, but it’s probably not something they’ll do on a regular basis. We decided not to find out.

Instead, we disabled forks, granted read access on the repository, and requested that candidates upload a ZIP of their work to our candidate tracking system. Overall, that’s a better approach anyway, since the response will be attached to the candidate’s record in the tracking software.

You should expect candidate questions at any time

As I said, we released the repository to the candidate on Friday morning, with the expectation that they’d give it a quick read and ask any questions during the day. But that puts more of a burden on the candidates than necessary, especially if they have a full-time job. There were a few cases where candidates would send out questions during the weekend; fortunately, I was checking email.

I think it might be better to give the candidate a full week to do the challenge: Monday to Monday. That gives them the most flexibility to work it into their schedule.

Candidates will put in excessive time

The risk of giving candidates extra time is that they’ll use it. We had a few candidates who, even though they were told that the challenge should take no more than two hours, put in much more. To me, this indicated that the challenge (and therefore the job) was beyond their abilities, a feeling that was always confirmed by the in-person portion of the interview.

I don’t have a good answer for this. And I fear that the people least suited for the role will be most likely to devote excessive time, in the hope that it will get them the job.

Wrapping Up

Like any interviewing technique, this one is biased: it selects for people who are adaptable and who have good debugging ability. I happen to think that those two attributes are key to data engineering, so I’m OK with that.

I don’t think it’s sufficient to make a hiring decision: it doesn’t say anything about a person’s interpersonal skills, or their ability to design a solution from scratch. At Chariot we have other phases of the interview process that attempt to evaluate those factors.

And, of course, we don’t track the people that were not successful at the challenge. It’s entirely possible that all of them have become successful data engineers at other companies.
