Twenty Years of Big Data

My first “big data” project was actually 30 years ago, at a large financial services company. We pulled information about our 10,000,000 retail customers from our mainframe systems, and stored it in a 64-processor parallel database server that I dubbed “my $3 million PC.” This machine had an almost-inconceivable amount of storage: 512 gigabytes, split into “active” and “staging” databases. At the start of each month, the current staging database would become active, and we’d start extracting data for the new staging database.

At the time, data warehouses were supposed to provide the way to find unexpected correlations, such as people buying beer and diapers together. We didn’t find any such correlations – the only one that stands out in my mind was that people with more money in their accounts tended to score higher on surveys about investor sophistication. But we did perform lots of ad hoc reporting (that was my title: Manager of Ad Hoc Programming) that gave marketing VPs a better understanding of their customer base.

And in my recollection, that was the story of data engineering for the next dozen or more years: a tool for well-funded research organizations in large companies (and, oh yes, the IRS) to perform analyses that were beyond the capabilities of OLTP databases. Maybe even finding an unexpected correlation or two. And in that time, processors got more powerful, and disks got bigger, but that didn’t mean that The Rest of Us were doing much data engineering: the barriers to entry were still too high.

Then, in 2006, Hadoop arrived on the scene, implementing the ideas that Google had pioneered to build their search index. With Hadoop, you could use a cluster of “commodity” hardware to perform parallel queries against a large dataset. You could get better query performance – and much more disk storage – from a $10,000 rack of hardware than the custom parallel database that I’d used a decade earlier.

The downside was that writing map-reduce programs was, shall we say, challenging. It was easy enough to write a word-count program, the “hello world” of map-reduce, but implementing a simple join between two data sources required careful thought and a surprising amount of Java code. However, SQL-compatible products such as Hive soon appeared to give developers a more comfortable experience.
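To give a sense of the difference, here is a minimal sketch of what such a join might look like in Hive’s SQL dialect, using hypothetical customers and page_views tables (the names and columns are assumptions for the sake of the example); the hand-written MapReduce equivalent would need custom mapper and reducer classes just to tag, shuffle, and merge records from the two inputs.

```sql
-- A minimal sketch of a Hive-style join, assuming two hypothetical tables:
-- customers(customer_id, region) and page_views(customer_id, url, view_time).
-- Hive compiles this statement into the underlying map and reduce stages.
SELECT c.region,
       COUNT(*) AS views
FROM   page_views pv
JOIN   customers c
  ON   pv.customer_id = c.customer_id
GROUP  BY c.region;
```

The query itself is unremarkable, which is exactly the point: the cluster does the parallel work, and the developer writes ordinary SQL.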

At the same time that it became cost-effective to process large amounts of data, it became cost-effective to produce and capture that data. While 10,000,000 customers might seem like a big number, my laptop today could easily perform all of the queries of that parallel behemoth from 30 years ago. Today, the big numbers come from capturing those customers’ activities, down to individual web page accesses: terabytes per day. It only became feasible to capture such data once hard disk capacities were measured in terabytes rather than gigabytes, and networks in gigabits per second rather than megabits per second.

Cloud computing brought another transformation in “big data for the rest of us.” Rather than build that $10,000 Hadoop cluster out of physical hardware, you could spin up a few dozen virtual machines when needed and shut them down when you were done. Big data became an operational expense, payable via credit card, rather than a line item in the company’s capital budget.

Better still, cloud providers gave you alternative ways to manage and access your data, again on a “pay as you go” basis. Data lakes use relatively inexpensive and effectively infinite cloud storage to hold your data. Analytics tools such as Amazon Athena or Google BigQuery let you write SQL that the service turns into hundreds or thousands of background tasks to scan the data in those data lakes. Machine learning services even help you find the unexpected correlations.
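As a rough illustration, here is the kind of query you might run with Athena (BigQuery looks much the same), written against a hypothetical page_views table defined over files sitting in cloud storage; the table and column names are assumptions for the sake of the example. You write the SQL, the service fans the scan out across its fleet, and you pay for the data it reads.

```sql
-- A sketch of an Athena-style query over a data lake, assuming a hypothetical
-- external table page_views(user_id, url, event_date) backed by files in
-- cloud storage. The service parallelizes the scan behind the scenes.
SELECT event_date,
       COUNT(DISTINCT user_id) AS daily_visitors
FROM   page_views
WHERE  event_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31'
GROUP  BY event_date
ORDER  BY event_date;
```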

Like everything else in the computing field, big data has benefited from the trend toward more, cheaper, and faster hardware. Capability that was out of reach of all but the best-funded companies is now available to everyone. But one thing that hasn’t changed is the need for data engineering: picking the right data, making sure that it is valid, and transforming it to better meet the needs of its users.


How can we help advance your company’s data efforts? Let’s talk. Contact us today.