The Cloud Changes Big Data Analytics, and Big Data Analytics Needs to Change

Andrew Brust
March 5, 2018

What’s perhaps less obvious is how the world of cloud computing has changed attitudes toward data collection, retention, and analysis. In turn, that is becoming a hugely significant factor in the adoption and effectiveness of Big Data and data lake technologies.

From Rags to Riches

The finite storage of hard drives, the high cost of enterprise storage, and the even higher cost of storage expansion in appliance-based data warehouse platforms led customers to triage their data and minimize what they kept. Only the must-have data has been placed in most data warehouses. Even if viewed as interesting and useful, data sets stored in flat files have also been viewed as dispensable if they haven’t been operationally critical. Often, that caused them to be tossed aside.

On the other hand, cloud object storage like Amazon’s S3 is relatively cheap – especially if you factor in operating costs like management, data center power, and real estate. And perhaps more importantly, cloud storage is provisionable at a very fine-grained level and on-demand. As you need more space, you can get it immediately, in small increments, paying just for what you need. Gone are the days of provisioning storage based on projected need and then actually using it in a miserly fashion to stave off expensive and disruptive further expansion.

This shift in the model has also shifted the mentality of data retention, from keeping only what’s necessary on the one hand to getting rid of only what’s demonstrably valueless on the other. It’s now cheaper to keep data in many instances than to devote time and resources to determine whether to discard it. The default has flipped; we’ve gone from an abundance of caution to an inclination towards inclusion.
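The keep-versus-review trade-off can be made concrete with a little arithmetic. The sketch below is illustrative only: the storage price, analyst rate, data set size, and review time are hypothetical placeholders, not quoted prices from any provider.

```python
def cost_to_keep(size_gb: float, price_per_gb_month: float, months: int) -> float:
    """Storage cost of retaining a data set for a number of months."""
    return size_gb * price_per_gb_month * months


def cost_to_review(review_hours: float, hourly_rate: float) -> float:
    """Labor cost of deciding whether the data set can be discarded."""
    return review_hours * hourly_rate


def cheaper_to_keep(size_gb, price_per_gb_month, months, review_hours, hourly_rate) -> bool:
    """True when retaining the data costs less than evaluating it."""
    keep = cost_to_keep(size_gb, price_per_gb_month, months)
    review = cost_to_review(review_hours, hourly_rate)
    return keep < review


# Hypothetical inputs: 500 GB kept for 36 months at $0.02 per GB-month,
# versus 8 hours of analyst time at $75 per hour.
keep = cost_to_keep(500, 0.02, 36)    # roughly $360
review = cost_to_review(8, 75)        # $600
print(cheaper_to_keep(500, 0.02, 36, 8, 75))
```

Under these assumed numbers, three years of storage costs less than a single day of review, which is the flipped default described above. With much larger data sets or cheaper review, the comparison can go the other way.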

Purpose-Driven Infrastructure

Beyond the revolution in storage, another key advantage of the cloud lies in the elasticity of computing resources. Most technology professionals understand the basic rubric of this by now: instead of procuring and budgeting for infrastructure based on peak demand, customers can outfit themselves with infrastructure to handle the routine load, then supplement that infrastructure temporarily to handle spikes in demand.
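A rough comparison shows why sizing for routine load and bursting for spikes is attractive. This is a minimal sketch with hypothetical node counts, hours, and prices; it also assumes burst capacity costs the same per node-hour as baseline capacity, which may not hold in practice.

```python
def fixed_peak_cost(peak_nodes: int, node_hour_price: float, hours: int) -> float:
    """Cost of running peak-sized infrastructure for the whole period."""
    return peak_nodes * node_hour_price * hours


def elastic_cost(base_nodes: int, peak_nodes: int, node_hour_price: float,
                 hours: int, spike_hours: int) -> float:
    """Cost of a routine-load baseline plus extra nodes only during spikes."""
    baseline = base_nodes * node_hour_price * hours
    burst = (peak_nodes - base_nodes) * node_hour_price * spike_hours
    return baseline + burst


# Hypothetical month: 720 hours, 10 nodes for routine load, 40 nodes at peak,
# spikes totaling 36 hours, $1 per node-hour.
fixed = fixed_peak_cost(40, 1.0, 720)           # 28800.0
elastic = elastic_cost(10, 40, 1.0, 720, 36)    # 7200.0 + 1080.0 = 8280.0
print(fixed, elastic)
```

The gap between the two figures comes entirely from the hours in which peak capacity would otherwise sit idle.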

In the analytics world, this advantage of cloud architectures is perhaps even more pronounced, as there is an abundance of ad hoc workloads that repeatedly, though intermittently, cause spikes in required compute resources. That means customers won’t merely want to “top off” their existing infrastructure. Instead, they will need to provision new infrastructure dedicated to these ephemeral workloads, and then they will want to deprovision it completely. In the world of data lake technology, that means not just adding nodes to Hadoop and Spark clusters, but actually creating entire clusters to handle the emergent workload, then deprovisioning them entirely.
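The provision, run, and deprovision lifecycle can be sketched as a context manager that guarantees teardown. Everything named here is hypothetical: `InMemoryProvider` stands in for a real cloud provider's cluster API, and `ephemeral_cluster` is an illustrative helper, not part of any actual platform.

```python
from contextlib import contextmanager


class InMemoryProvider:
    """Hypothetical stand-in for a cloud provider's cluster API."""

    def __init__(self):
        self.active = {}      # cluster id -> node count
        self._next_id = 0

    def create_cluster(self, nodes: int) -> str:
        self._next_id += 1
        cluster_id = f"cluster-{self._next_id}"
        self.active[cluster_id] = nodes
        return cluster_id

    def delete_cluster(self, cluster_id: str) -> None:
        del self.active[cluster_id]


@contextmanager
def ephemeral_cluster(provider, nodes: int):
    """Provision a dedicated cluster and always tear it down afterwards."""
    cluster_id = provider.create_cluster(nodes)
    try:
        yield cluster_id
    finally:
        # Runs even if the workload raises, so no cluster is left behind.
        provider.delete_cluster(cluster_id)


provider = InMemoryProvider()
with ephemeral_cluster(provider, nodes=20) as cluster_id:
    # The ad hoc workload would be submitted to cluster_id here.
    running_during_job = dict(provider.active)
running_after_job = dict(provider.active)
print(running_during_job, running_after_job)
```

Tying teardown to the end of the workload, rather than to a separate cleanup step, is one way to keep a dedicated cluster from outliving the job it was created for.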

Analytics Utopia

Combining these changes (the inclusion bias in data retention and the accommodation of spontaneous workloads using dedicated, disposable infrastructure) is very bullish for analytics. With these changes in place, we have access to much more data, and we have the ability to provision workload-exclusive infrastructure instantly and tear it down just as quickly.

As long as we’re intrepid enough to take advantage of these resources newly available to us, as long as we’re proactive about wading through the data, and as long as we lean forward and do the analytics work, the potential for analytical insight, data-driven competitive advantage, and bona fide digital transformation is huge.

Unintended Consequences

There are downsides to these newfound capabilities, though. All that data and all of that infrastructure flexibility are great, but there is a burden to the bounty. Newly obsessive data retention leads to less formality, less vetting, and less organization around what data is available and its location. The lowered barrier to retention means that while more data is kept, it’s also more off the radar. In effect, the ability to throw data in S3 and other cloud object storage willy-nilly has led to the shadow data lake phenomenon.

Despite the slightly nefarious name, shadow data lakes have upsides in addition to their downsides. The context is that veritable data lakes can emerge simply from people casually saving data sets to cloud object storage. The downside is that such repositories are unofficial and unmonitored. The upside is that the people saving the data sets, however accidentally, are doing the otherwise hard work of putting data in a landing zone and tacitly endorsing it. Organizations that do nothing to onboard that data in a more formal manner leave themselves vulnerable to potential regulatory non-compliance and miss out on the potential value of the data itself.

But organizations with the right analytics tools can take those lemons and make lemonade – they can explore the data, verify its value and accuracy, and make their organizations more proactively data-driven. Shadow data lakes represent a great opportunity if handled correctly. The right procedures and the right discovery tools are the keys to success.
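As a rough illustration of what a first discovery step might look like, the sketch below compares an object listing against the prefixes an organization already tracks and summarizes everything else. The keys, sizes, and the `find_shadow_prefixes` helper are hypothetical; a real listing would come from the storage provider's API rather than a hard-coded list.

```python
from collections import defaultdict


def find_shadow_prefixes(objects, cataloged_prefixes):
    """Group uncataloged objects by top-level prefix and total their size.

    objects: iterable of (key, size_bytes) pairs from a bucket listing.
    cataloged_prefixes: key prefixes the organization already tracks.
    """
    shadow = defaultdict(lambda: {"objects": 0, "bytes": 0})
    for key, size in objects:
        if any(key.startswith(prefix) for prefix in cataloged_prefixes):
            continue  # already known and governed
        if "/" in key:
            top = key.split("/", 1)[0] + "/"
        else:
            top = "(root)"
        shadow[top]["objects"] += 1
        shadow[top]["bytes"] += size
    return dict(shadow)


# Hypothetical bucket listing and catalog.
listing = [
    ("warehouse/sales/2018.parquet", 4000),
    ("tmp_exports/clicks.csv", 1500),
    ("tmp_exports/clicks_v2.csv", 1700),
    ("scratch/model_inputs.json", 300),
]
report = find_shadow_prefixes(listing, cataloged_prefixes=["warehouse/"])
print(report)
```

A report like this only says where untracked data lives and how much of it there is; verifying its value and accuracy still requires the exploration described above.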

And that ability to dedicate whole compute clusters to bursty, ad hoc analytics workloads? That’s nice, too, except that most higher-level Big Data analytics tools are built to run on single, persistent compute clusters. While you could, of course, do your analytics work by writing your own code to run on the ephemeral cluster’s execution frameworks like MapReduce, Tez, or Spark, that approach severely curtails self-service business user access to data lake analytics.

It’s possible to take these significant challenges and turn them into a huge opportunity. What’s required to do that? Simple – a self-service Big Data analytics platform designed for the new paradigms. Such a platform must be able to ingest data from and store data to cloud object storage. It must be designed for the scenario of data sprawl across such storage, and it should tackle the phenomenon of shadow data lakes head-on. The platform must also be designed for – and not merely tolerant of – the notion of ephemeral infrastructure spun up exclusively to take on emergent, ad hoc workloads.

Achieving Balance

At the same time, we can’t go overboard. As much as the cloud facilitates experimental, greenfield data discovery on an ad hoc basis, repeatable production workloads aren’t going away. For these workloads, capacity is more predictable and static, such that provisioning on-premises infrastructure, or just more statically configured cloud infrastructure, to execute them is quite feasible.

This analytics work will still be carried out, and it remains critical to the digitally transformed organization. Indeed, analytics carried out in ad hoc, ephemeral environments that turns out to be of lasting value becomes the fodder for new production work of this kind. Think of the “bursty,” ephemeral work in the cloud as hunting. Think of the steady, predictable production work as gathering. Both are important, and they are inextricably tied.

So, a self-service analytics platform needed for the new paradigms must be respectful of the older ones. It must be versatile; it must be ambidextrous. Its sweet spot must encompass both old and new workflows, such that neither becomes a stressor.

Such a platform will also facilitate customers’ shift from conventional data warehousing platforms because it will allow those customers to adopt data lake technology for their non-operational workloads first, running in parallel with the production processing on older platforms. Customers can then verify the value of Big Data and data lake technologies, and with that confidence and due diligence established, they can gradually move production workloads to them. And they’ll do it with conviction.

Enlightened Path

That’s the virtuous cycle of prudent adoption and use of Big Data technology, spanning the special-purpose to the everyday. That’s the path to digital transformation. And it rationally acknowledges that customers have to ramp up to new technologies and can’t just cut over to them all at once.

In short, this is how to make the data lake and Big Data analytics a value-positive tool for enterprises instead of an adoption burden and migration quagmire. It’s how analytical insights will transpire organically, helpfully, powerfully, and harmoniously.
