According to IDC, there will be 175 zettabytes of new data created by 2025.
Zettabytes? At some point, it becomes hard to even imagine what these numbers mean because there’s no equivalency in our daily lives. But it’s fun to try.
Consider watching Netflix. Consumer Reports tells us that you would need to watch Netflix for 20 hours a day, or a total of 416 90-minute videos, to reach just one terabyte in a month. And there are a billion terabytes in a single zettabyte!
Prefer to think in gigabytes? Way back in 2016, Cisco blog author Taru Khurana pointed out that if every gigabyte equaled one brick, one zettabyte would give you enough bricks to build 358 Great Walls of China.
The long and short of it is that a mind-numbing amount of data exists in the world. And our smartphones, laptops, tablets, and smart TVs are pumping out more and more of it every second. Smart businesses are capturing and analyzing these data sets to outmaneuver their competitors and accelerate growth through more relevant offerings.
This process involves a number of components built around a data pipeline, which aggregates, organizes, and moves data from collection points to a destination for storage, analysis, and insights. Modern systems automate the ETL (extract, transform, load) process, handling data ingestion, processing, filtering, transformation, and movement across any type of cloud architecture while adding layers of resiliency against failure.
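To make that flow concrete, here is a minimal sketch of an extract-transform-load step in Python. The endpoint, file path, and field names are placeholders for illustration only; a production pipeline would layer on orchestration, retries, and monitoring.

```python
import csv
import json
from urllib.request import urlopen

# Hypothetical source and destination -- placeholders, not real services.
SOURCE_URL = "https://example.com/api/orders"
DESTINATION_PATH = "orders_clean.csv"

def extract(url):
    """Pull raw JSON records from a collection point."""
    with urlopen(url) as response:
        return json.load(response)

def transform(records):
    """Keep only the fields downstream analysis needs and normalize values."""
    for record in records:
        yield {
            "order_id": record["id"],
            "amount_usd": round(float(record["amount"]), 2),
            "created_date": record["created_at"][:10],  # date portion only
        }

def load(rows, path):
    """Write transformed rows to the storage destination."""
    rows = list(rows)
    if not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    load(transform(extract(SOURCE_URL)), DESTINATION_PATH)
```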
Every day, we run up against small and medium-sized businesses that think this approach is beyond their means or that they don’t generate enough data. Nothing could be further from the truth. In fact, data and analytics author Bernard Marr says that “in many ways, big data is more suited to small businesses, because they’re generally more agile and able to act more quickly on data-driven insights.”
For companies that do leverage data insights, the benefits are real and tangible. From tracking leaking toilets to helping viewers choose their next show, the bottom-line impact can be enormous.
Still, a note of caution is in order: diving headlong into a data-centric process without understanding goals or limits can be an expensive and frustrating experience. So how do you right-size your data analytics effort? Our experts recommend asking these five questions to set reasonable project parameters.
Is it actionable?
Not all data is created equal. Some of it can deliver insights better suited to your end business goals. Are you tracking and analyzing data that creates more busywork for your teams, or are you generating insights that deliver bottom-line growth?
Unfortunately, we often find teams wait until too late in the development of a pipeline to ask these questions. It’s imperative that you map your data strategy to your business needs so you can identify the proper data pools from the outset.
How “fresh” is your data?
Like the fruit and vegetables you buy at the grocery store, your data is perishable, in both quality and shelf life. The reality is that there’s a lot of bad data out there. Fortunately, data can be fixed or cleaned before it’s used in downstream systems – but it’s important to do that as close to the source as possible. Identifying bad data sources and instituting either manual or automated fixes is critical.
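As an illustration, here is a small sketch of that kind of at-the-source cleanup in Python. The field names and rules are made up; the real rules come from profiling your own data.

```python
def clean_record(record):
    """Fix common data-quality problems before the record moves downstream."""
    cleaned = dict(record)
    # Normalize whitespace and casing in free-text fields.
    cleaned["email"] = (cleaned.get("email") or "").strip().lower()
    # Coerce numeric fields, flagging anything that cannot be parsed.
    try:
        cleaned["amount_usd"] = float(cleaned.get("amount_usd", ""))
    except (TypeError, ValueError):
        cleaned["amount_usd"] = None  # leave for review rather than guess
    return cleaned

def is_valid(record):
    """Drop records that are unusable even after cleaning."""
    return bool(record["email"]) and record["amount_usd"] is not None
```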
It also takes time to collect, clean, restructure, and move data through your pipeline. That lapse can be mere seconds, or might extend for weeks. Depending on your business and the questions you’re trying to address, the acceptable delays in your data pipeline also vary. There is nothing worse than having valuable data arrive too late to be useful. As part of the design process, be sure to match your data delivery timelines to fit your needs.
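A freshness check can be as simple as comparing a record’s timestamp against the delay your business can tolerate. Below is a small sketch with an assumed 24-hour tolerance; your window might be seconds or weeks.

```python
from datetime import datetime, timedelta, timezone

# Assumed tolerance: how stale data can be before it stops being useful.
MAX_AGE = timedelta(hours=24)

def is_fresh(record_timestamp: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """Return True if a timezone-aware timestamp falls within the acceptable delay."""
    return datetime.now(timezone.utc) - record_timestamp <= max_age
```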
How much data are we really talking about?
You have fresh, actionable data? Terrific! How much of it do you anticipate? That question is often met with silence. Or worse, companies vastly underestimate the amount of data they will produce.
When building your pipeline, it’s crucial to understand the data size both in terms of record count and also size of record. For example, consumer demographic information is often on the order of tens of millions of records, with a few dozen different record types that each contain tens of attributes.
This is relatively small in the grand scheme of things, and the attributes tend to be fairly stable. On the other hand, consumer behavior data (clickstreams, activity data, etc.) can easily climb into the billions of rows and grow steadily over time because users are constantly taking actions across multiple platforms.
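A quick back-of-envelope calculation helps surface these differences early. The numbers below are assumptions for illustration only; swap in your own record counts and sizes.

```python
# Assumed figures -- substitute your own estimates.
demographic_records = 30_000_000         # tens of millions of records
demographic_bytes_per_record = 1_000     # a few dozen attributes per record

clickstream_events_per_day = 50_000_000  # behavior data grows steadily
clickstream_bytes_per_event = 300
retention_days = 365

demographic_gb = demographic_records * demographic_bytes_per_record / 1e9
clickstream_tb = (clickstream_events_per_day * clickstream_bytes_per_event
                  * retention_days) / 1e12

print(f"Demographics: ~{demographic_gb:.0f} GB")              # ~30 GB
print(f"One year of clickstream: ~{clickstream_tb:.1f} TB")   # ~5.5 TB
```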
Does your business demand real-time?
Most companies default to wanting real-time data and statistics. However, there are appropriate uses for both streamed (real-time) and batch (delayed) data pipelines and analysis. The key is understanding which one your business really needs, because each requires vastly different architecture, resources, and investment. Opting for real-time unnecessarily wastes time and money in development and leaves you with a streaming architecture that is more prone to problems over time.
First, ask yourself if your business will actually change based on the actions of a limited number of users over the course of an hour. Would McDonald’s stop selling Big Macs because a higher number than usual went unsold during the lunch hour in two states for one day? Unlikely. For some companies, the more reasonable course might just be to track and analyze data in batches over longer windows to identify trends that inform the business and are actionable.
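For the batch route, the sketch below shows the idea: a once-a-day job that aggregates the day’s records in a single pass, rather than a streaming system reacting to every event. The file layout and field names are hypothetical.

```python
from collections import defaultdict
import csv

def daily_unsold_totals(path):
    """Batch job: aggregate one day's records in a single pass."""
    totals = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["store_id"]] += int(row["unsold_units"])
    return totals

# Run once per day from a scheduler and compare against prior days to spot
# trends, rather than reacting to every individual event in real time.
```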
What are your security/privacy concerns?
Most of the data that exists in large corporations these days is sensitive in some way or another. Whether it’s regulated, as in healthcare and finance, or more loosely defined, as in retail, the protection of personal information is a basic requirement for nearly every company that gathers and stores data.
Ideally, a company will have an internal data classification system under which they can specify what sorts of storage, transmission, and usage requirements exist for which data sets. Whenever a pipeline is built to move data from point A to point B, you must understand the legal and ethical ramifications of the data being moved to ensure that it is treated appropriately. We recommend engaging legal and regulatory teams from the outset to minimize risk throughout your pipeline.
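As a sketch of what classification-driven handling might look like inside a pipeline, here is a small Python example. The policy map and field names are hypothetical; the actual rules should come from your classification system and your legal and regulatory teams.

```python
import hashlib

# Hypothetical classification map: which fields are sensitive and how to treat them.
FIELD_POLICY = {
    "email": "hash",      # pseudonymize before it leaves the source system
    "ssn": "drop",        # never move regulated identifiers through this pipeline
    "zip_code": "keep",   # low sensitivity under this assumed classification
}

def apply_policy(record):
    """Enforce the data classification before records move from point A to point B."""
    out = {}
    for field, value in record.items():
        action = FIELD_POLICY.get(field, "drop")  # default to the safest option
        if action == "keep":
            out[field] = value
        elif action == "hash":
            out[field] = hashlib.sha256(str(value).encode()).hexdigest()
        # "drop": omit the field entirely
    return out
```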
Ultimately, data is the oil that keeps your company running. Cleaner, fresher data, delivered efficiently to where it’s needed, will allow your company to perform at its best. Reach out today to learn more about how you can optimize your data pipeline.