Big data is proving to be a powerful tool, but many companies face challenges or outright problems when implementing big data programs. Today, I'm going to dig into three broad challenges to implementing big data and prescribe some ways to address them.
Big data challenges are typically thought of in terms of "the three Vs": volume, variety and velocity.
Volume refers simply to the amount of data available, but big data challenges can arise even when the amount of data isn’t massive.
For instance, if there's a large variety of data coming from several different sources, it can be difficult not only to analyze the data, but also to enrich it and derive meaning, especially when the data sets are complicated to join to each other.
When a variety of data sources must be analyzed, often new and unforeseen difficulties arise. For example, it can be difficult to relate sales data that comes aggregated to fiscal months with marketing data that could be aggregated to campaigns (or other unrelated dimensions) which don’t naturally mesh across all data sets in the analysis. Analytics that rely on a variety of data sources often require significant data preparation cycles.
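The core of this preparation problem is getting disparate data sets onto a common grain before joining them. The sketch below uses pandas with hypothetical column names (sales aggregated to fiscal months, marketing spend aggregated to campaigns); the shared `fiscal_month` key is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical sales data, aggregated to fiscal months
sales = pd.DataFrame({
    "fiscal_month": ["2024-M01", "2024-M02"],
    "revenue": [120_000, 135_000],
})

# Hypothetical marketing data, aggregated to campaigns
marketing = pd.DataFrame({
    "campaign": ["spring_promo", "spring_promo", "email_blast"],
    "fiscal_month": ["2024-M01", "2024-M02", "2024-M02"],
    "spend": [8_000, 5_000, 3_000],
})

# Re-aggregate marketing spend to fiscal months so both data
# sets share a common grain before joining
marketing_by_month = (
    marketing.groupby("fiscal_month", as_index=False)["spend"].sum()
)

combined = sales.merge(marketing_by_month, on="fiscal_month", how="left")
print(combined)
```

In practice the hard part is rarely the merge itself but deciding which dimension both data sets can honestly be rolled up to, which is why these preparation cycles dominate multi-source analytics.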
Lastly, velocity refers to how quickly data is flowing in, and how quickly business conditions can change. For example, is there a need for true real-time streaming data? Identifying legitimate streaming needs versus mere wants is critical. In a dairy pasteurization operation, for instance, the data streaming from the pasteurization equipment's sensors could contain a critical signal that, if missed by plant operators, could result in thousands of gallons of spoiled milk products.
Such a use case necessitates an architecture that can instantly flag anomalies in the sensor data, but for most use cases where speed is a factor, processing data in batch intervals of one to 15 minutes is usually sufficient. Companies working with time-sensitive data need to differentiate between business problems that truly require real-time streaming architectures and those that can be met with less costly, less complex high-speed batching solutions.
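To make the batching idea concrete, here is a minimal sketch of flagging anomalies per micro-batch. The threshold and readings are illustrative assumptions, not real plant parameters: the check simply asks whether any reading in a batch fell below a minimum holding temperature.

```python
from statistics import mean

# Illustrative threshold: assume milk must be held at or above
# 72 °C, and a batch whose minimum reading dips below it is flagged
PASTEURIZATION_MIN_C = 72.0

def batch_is_anomalous(batch):
    """Return True if any reading in this micro-batch dipped below the threshold."""
    return min(batch) < PASTEURIZATION_MIN_C

# Simulated one-minute micro-batches of sensor readings (°C)
batches = [
    [72.4, 72.6, 72.5],
    [72.3, 71.8, 72.1],  # one reading dipped below the threshold
]

for i, batch in enumerate(batches):
    if batch_is_anomalous(batch):
        print(f"batch {i}: ALERT (min={min(batch)}, mean={mean(batch):.1f})")
```

The same check could run per event on a true streaming platform; the business question is whether a one-minute delay in raising the alert is acceptable, because that answer decides how much architecture you need to buy and operate.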
Let’s dig into some specific examples of problems that companies encounter when implementing big data.
Many websites generate a petabyte of data or more in a single month. Higher-traffic websites, such as Amazon or the New York Times, can generate even larger data loads. In the past, companies wanting to analyze such enormous volumes of data had to expand or lift-and-shift their data into data warehouses, or hire specialized analytics vendors to process their data in proprietary computing clouds.
Even with these massive data warehouses, hours or even days were needed to analyze the data, prepare it for consumption and upload it to the customer—and those were only the first few steps in a long process of realizing insights.
After the initial read of the data was captured, it often had to be put into one or more presentable formats and then analyzed by experts. Yet many companies never even made it that far. Data warehouses are expensive and often lack efficient resources for large storage and compute workloads. Furthermore, owing to high demand for the limited number of data experts and computing resources available, analytic work backlogs often ran weeks or months long. By the time data was analyzed and returned to companies for action, it was often out of date and not particularly useful.
Then came Hadoop, an open-source framework that allows companies to store and analyze extremely large data sets using a cluster of computers running on inexpensive commodity hardware.
In other words, servers can be networked together to process massive data sets at significant savings compared to older technologies. However, using Hadoop directly is much easier said than done. For most companies, it's necessary either to hire a skilled Hadoop programming team or to purchase off-the-shelf tools that make Hadoop easier to use.
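The programming model behind Hadoop is MapReduce: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step aggregates each group. The sketch below simulates that flow in-process with the classic word count; on a real cluster, the map and reduce phases would run in parallel across many nodes.

```python
from collections import defaultdict

def mapper(line):
    # Map: emit (word, 1) for every word in the input line
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce: sum all counts emitted for a single word
    return word, sum(counts)

def map_reduce(lines):
    shuffled = defaultdict(list)
    for line in lines:                       # map phase
        for key, value in mapper(line):
            shuffled[key].append(value)      # shuffle: group by key
    return dict(reducer(k, v) for k, v in shuffled.items())  # reduce phase

counts = map_reduce(["big data is big", "data is everywhere"])
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Writing, tuning and operating jobs like this across hundreds of nodes is where the specialized skill requirement comes from, which is exactly the gap that off-the-shelf tooling tries to close.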
Comcast, for example, compiles huge amounts of data in near real-time batch intervals. This allows them not only to gain a view of how their customers are interacting with their services, but also to identify problems such as service outages.
That leads us to our next challenge.
In order to implement most big data solutions, and especially in-house or data warehouse solutions, a team of highly skilled programmers and data scientists is needed. The problem is, these workers are always in short supply and high demand. Hiring programmers is difficult and expensive, as is retaining them.
Additionally, even if you put together a skilled, stable team of programmers, that team sometimes becomes a bottleneck for the rest of the business. With some big data implementations, much of the analytic work is taken out of the hands of analysts and given to programmers instead. This workflow is slower, and by the time data is properly sorted, analyzed and then sent back to business analysts and decision makers, it's often out of date.
Owing to the scarcity of big data programming talent, it's efficient for big data solutions to be as labor-light as possible. While finding good programming talent is a widespread problem for many industries and companies, recent advances in out-of-the-box, analyst-friendly big data solutions have made it possible to dramatically reduce workloads on programmers. Now, analysts can be more directly involved in both compiling and, of course, analyzing data. And programmers can focus on more specialized work.
It’s fair to wonder, however, if programmers and analysts can keep up with the rate of technological change, which brings us to another major challenge of big data: the complexity and rapid evolution of the technology ecosystem.
Technology is always evolving. Think back just 10 years. The first iPhone was months away from hitting the market, ushering in the age of the smartphone, and many cars still had tape players as a standard feature. Tablets, self-driving cars, blazing-fast wireless data services that effortlessly link phones to every other device (including cars): all of these things were still castles in the sky.
The point is, technology changes quickly. Some of the technologies we’re using today could become obsolete within a couple years. Like every other type of technology, big data is going to undergo immense changes in the future.
This is especially true given the emergence of the Internet of Things (IoT), which will see an ever-increasing number of devices (likely tens or hundreds of billions) interconnected into a complex web, with many of those devices capturing and reporting several types of signal and logging data.
Big data technology is rapidly changing. That's true for each component of the architecture, from storage hardware to the computing frameworks needed to crunch big data, to the business intelligence platforms that empower analysts. The best business intelligence platforms sit on top of these changing technologies and are capable of evolving at the same pace as the industry itself.
There are challenges to big data, but the benefits are many. From discovering new insights and revenue streams to uncovering untapped areas of efficiency, there's a lot big data can do for your organization. Of course, challenges lie ahead, but thinking about these issues ahead of time, choosing the right team and selecting the best software for your needs will go a long way towards easing any difficulties.