Friends Don’t Let Friends Use JSON (in their data lakes)

Tags: , , , ,

I’ve never been a JSON hater. In fact, I think it’s the only sane choice for structured logging. But I’ve recently run into enough pain with JSON as a data serialization format that my feelings are edging toward dislike.

Unfortunately, JSON is a fact of life in most data pipelines, especially those that receive event-stream data from a third-party supplier. This post reflects on some of the problems that I’ve seen, and solutions that I’ve used.

The problems

Limited data types

The JSON spec defines three scalar data types: strings, numbers, and booleans. There’s no support for the dates and timestamps that are an ever-present component of business data (especially event streams). And while the spec assumes infinite-precision for numbers, most real-world JSON parsers use double precision floating point numbers, matching JavaScript. This means that you can’t use JSON numbers to represent fixed-point money types or numeric IDs that are more than 14 digits long. Not that that stops people from doing so.

Data type inference

Services such as AWS Glue can look at your data to infer its structure. For simple JSON structures this is easy — after all, you just have numbers and strings to work with. But real-world data often isn’t easy. For one thing, fields may be present in some records but not others; while JSON has the idea of null, serializers often just omit null values. And nested objects may contain vastly different data structures, especially if the top-level object is a container for event data.

Worse, field contents often change. A field that only held integers when the data was first crawled now contains a floating-point number or a string. That might be caught by your crawler, or it might not: crawlers don’t always look at every file. And your crawler might respond in different ways: perhaps it leaves the original field definition in place; perhaps it creates new table definitions for the file(s) that have different field values. Regardless, the main way that you learn about such problems is when your query fails or doesn’t return the expected data.

Mixed-case field keys

JSON field keys are quoted strings that can contain almost any Unicode character, including spaces and emojis, and can be arbitrarily long. If you access that data using a SQL-based interface, however, you’ll have to deal with the rules for a SQL identifier, which allows a much more limited set of characters. And while you can quote use SQL quotes to allow a wider range of characters, you may still be burned by having two identifiers that differ only in the case of their characters (eg, “ID” and “id”).

Speaking from experience, debugging such problems is painful: the error messages aren’t that helpful, and you often have to search through many files to find the conflicting records.

Some solutions

JSON isn’t going anywhere. It’s the “lingua franca” of computer data interchange: every system can write it, every system can read it, and there are tools that allow developers to easily examine and transform it. So as data engineers, we have to accept that we’ll get JSON data, and must ensure that it supports our needs without unpleasant surprises.

However, we can take steps to avoid JSON’s limitations and prevent it from polluting our data lakes.

Perimeter defense

This is an outmoded cyber-security practice, but one that I think remains viable for data management. Stated simply: prevent bad data from getting into your data lake in the first place, and you won’t have to constantly deal with its presence.

Perimeter defense is feasible because you’re already handling the data in some way. You might be pulling it off a Kinesis stream, or responding to files dropped in an S3 bucket, or retrieving from an external data source. In any of those cases, you’ve already written code to retrieve and store the data, so it’s a relatively short stretch to transform it into a better form.

The risk with perimeter defense is that it’s no defense at all against data generated inside the perimeter. So if you’re performing transformations that both read from and write to the data lake, you’ll need to be particularly vigilant about how you write that data.

Use strings everywhere

This is perhaps the simplest solution to the problem of corrupted numeric values (treating fixed-point values or large integers as floating point), but it does require that all consumers of the data internally cast it to the desired value.

How easy it is to implement depends on the JSON library that you’re using. Most of them give you some control over parsing, for example to use an infinite-precision numeric data type rather than double precision floating point. Python’s built-in json package is one of the easiest: it uses configurable functions to parse numeric values, and you can use the built-in str() function to read them as strings:

json.load(src, parse_float=str, parse_int=str)

Taking this idea a little further, you can serialize any nested objects or arrays as a string:

raw = json.loads(line, parse_float=str, parse_int=str)
cooked = {}
for k,v in raw.items():
    if (v is None) or (type(v) == str):
        cooked[k] = v
        cooked[k] = json.dumps(v)

Then, you can use tools such as the Athena json_extract_scalar() function to extract relevant data for your queries:

SELECT  json_extract_scalar(useridentity, '$.type') as identity_type, count(*)
FROM    "default"."cloudtrail_json_stringified"
WHERE   eventtime between '2021-01-01T00:00:00Z' and '2021-01-11T23:59:59Z'
GROUP   by 1
ORDER   by 2 desc

Convert to a different file format

While you can work around many of JSON’s limitations by storing everything as a string and parsing on read, it’s nicer to have well-defined data types. And nicer still if the file incorporates a definition of the data that it contains. And there are plenty of formats that will do just that. In my AWS-centric world, here are three common ones:

  • Parquet

    Apache Parquet is a favorite format for data lakes, especially those that store large amounts of data. It is a column-oriented format, which means that individual fields can be efficiently queried: you can look at just the portion of the file that contains that field’s data, ignoring the rest. While Parquet supports a limited number of “physical” data types, it provides “logical” types on top of them to support things like timestamps and fixed-point decimal.

    However, Parquet is tightly bound to the Hadoop ecosystem and libraries. This is fine if you’re using Spark to do your data transformations, less so if you’re writing custom code. If you’re using Python, then you’ll probably use PyArrow, which requires native libraries (the pure-Python parquet library can read files but not write them). And even in Java it’s challenging to write a stand-alone Parquet program: you have to wade through a long list of deprecated and poorly-documented functions, and end up with a deployment bundle that has over 100 dependencies (and is too big to deploy on AWS Lambda).

  • ORC

    Apache ORC is another file format from the Hadoop project (are you noticing a theme here?). Like Parquet, it’s a columnar format, so can be queried very efficiently. Unlike Parquet, ORC supports indexes that can improve performance. And also unlike Parquet, ORC provides a rich set of datatypes, including timestamps and fixed-point decimal.

    For the Java developer, writing an ORC file is pretty straightforward: you define a schema (which can be constructed as a string), and write batches of rows. The API isn’t terribly consistent, and there are a few gotchas such as timestamps, but it’s a lot better than the Parquet API. And the number of dependencies is dramatically lower, although the size of a deployment bundle is still large, at 50+ MB.

    For Python developers, you can create ORC files using PyArrow, which as I noted above has a native library requirement.

  • Avro

    Apache Avro describes itself as a serialization format, rather than as a storage format. My understanding is that its original purpose was to serialize data for transport between nodes in a Hadoop cluster. The practical effect of this is that, unlike Parquet and ORC, Avro is row-oriented: the data for all of the fields in a row is stored together.

    Like Parquet, Avro is based on a limited set of “primitive types” and uses “logical types” for fixed-point decimals and timestamps. This can be a little challenging for the developer, as you often need to use transformation function before storing a value.

    The Java Avro library has very few dependencies; a Lambda deployment bundle is around 5 MB. For Python, there is the “official” avro library, and the third-party fastavro library. The former is written entirely in Python (and easily portable); the latter uses a native library, and as its name indicates is much faster as a result.

Whether you should choose a row-oriented or column-oriented format depends on how much data you’ll be ingesting, and how you’ll be accessing that data. Columnar formats are more efficient to query when you’re dealing with large data files and specific query predicates. However, if you’re processing small batches of streaming events, or reading the entire file (for example, to load it into a data warehouse), you won’t see any benefit.

As a developer, I lean toward Avro because of its relative simplicity and lack of dependencies. However, it’s lacking in documentation: they have good examples of reading and writing primitive types, but you have to read code to understand logical types (something I plan to rectify with an upcoming post).

Wrapping Up

JSON is a great format for client-server communications, where both parties know exactly how to interpret the data. For long-term storage in a data lake, however, that knowledge is missing: is that received field an ISO-8601 date, or timestamp, or is it something else? All you know is that it’s a JSON string, and you have to dig into the actual data to learn more (and hope that you dug deeply enough to avoid surprises).

To be useful in the long term, I believe that you must define a schema for your data, and ensure that schema is available at the time you query the data. And the only way that I can see to do this with current tools is to use a file format that embeds that schema information.



Can we help you?

Ready to transform your business with customized data engineering solutions? Chariot Solutions is your trusted partner. Our consultants specialize in managing software and data complexities, tailoring solutions to your unique needs. Explore our data engineering offerings or reach out today to discuss your project.