Datameer Blog post
The Cloud Changes Big Data Analytics, and Big Data Analytics Needs to Change
by Andrew Brust on Mar 05, 2018
The cloud computing revolution has changed technology, and how businesses use it. It’s made for unprecedented levels of agility and flexibility in the use of, and budgeting for, technology. It’s lowered barriers to entry, it’s leveled the proverbial playing field and it’s also pressured organizations to accelerate their adoption of technology in order to stay competitive. But you knew all that.
What’s perhaps less obvious is how the world of cloud computing has changed attitudes toward data collection, retention and analysis. And that, in turn, is becoming a hugely significant factor in the adoption and effectiveness of Big Data and data lake technologies.
From Rags to Riches
The finite storage of hard drives, the high cost of enterprise storage and the even higher cost of storage expansion in appliance-based data warehouse platforms led customers to triage their data and minimize what they kept. Only must-have data was placed in most data warehouses. Data sets stored in flat files, even if viewed as interesting and useful, were treated as dispensable if they weren't operationally critical. Often, that caused them to be tossed aside.
On the other hand, cloud object storage, like Amazon's S3, is relatively cheap – especially if you factor in operating costs like management, data center power and real estate. Perhaps more importantly, cloud storage is provisionable on demand and at a very fine-grained level. As you need more space, you can get it immediately, in small increments, paying just for what you need. Gone are the days of provisioning storage based on projected need and then using it in a miserly fashion to stave off expensive and disruptive further expansion.
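The economics behind that shift can be sketched in a few lines. This is an illustrative back-of-the-envelope model, not real vendor pricing: the $/TB-month rate, the 100 TB provisioned figure and the usage ramp are all invented assumptions.

```python
# Hypothetical cost sketch: provisioning storage up front for projected
# peak vs. growing object storage incrementally, month by month.
# All prices and capacities below are illustrative assumptions.

UPFRONT_TB = 100            # capacity bought ahead of projected need
PRICE_PER_TB_MONTH = 20.0   # assumed flat $/TB-month for both models

def upfront_cost(months):
    # Pay for the full provisioned capacity every month, used or not.
    return UPFRONT_TB * PRICE_PER_TB_MONTH * months

def on_demand_cost(usage_tb_by_month):
    # Pay only for what is actually stored each month.
    return sum(tb * PRICE_PER_TB_MONTH for tb in usage_tb_by_month)

# Usage ramps from 10 TB to 65 TB over a year; peak never hits 100 TB.
usage = [10 + 5 * m for m in range(12)]
print(upfront_cost(12))        # 24000.0
print(on_demand_cost(usage))   # 9000.0
```

With these made-up numbers, paying only for actual usage costs less than half of what provisioning for projected peak would – which is exactly the incentive that flips the retention default described above.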
This shift in model has also shifted the mentality of data retention, from keeping only what’s necessary on the one hand, to getting rid of only what’s demonstrably valueless on the other. In many instances, it’s now cheaper to keep data than it is to devote time and resources to determining whether to discard it. The default has flipped; we’ve gone from an abundance of caution to an inclination towards inclusion.
Beyond the revolution in storage lies another key advantage of the cloud: elasticity of compute resources. Most technology professionals understand the basic logic by now: instead of procuring and budgeting for infrastructure based on peak demand, customers can outfit themselves with infrastructure to handle routine load, then supplement it temporarily to handle spikes in demand.
In the analytics world, this advantage of cloud architectures is even more pronounced, as an abundance of ad hoc workloads repeatedly, though intermittently, spikes the required compute resources. That means customers won't merely want to "top off" their existing infrastructure. Instead, they will provision new infrastructure dedicated to these ephemeral workloads, and then deprovision it completely. In the world of data lake technology, that means not just adding nodes to Hadoop and Spark clusters, but creating entire clusters to handle the emergent workload, then tearing them down entirely.
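The create-run-deprovision lifecycle described above can be sketched as follows. The `CloudClient` class here is a stand-in invented for illustration; real provider SDKs (Amazon EMR, for example) differ in their APIs, but the pattern – the cluster exists only for the lifetime of one job – is the same.

```python
# Hypothetical sketch of the ephemeral-cluster pattern: create a
# dedicated cluster for an ad hoc job, run the job, then tear the
# whole cluster down. CloudClient is an invented stand-in, not a
# real cloud provider SDK.

class CloudClient:
    def __init__(self):
        self.clusters = {}
        self._next_id = 0

    def create_cluster(self, nodes):
        self._next_id += 1
        cid = f"cluster-{self._next_id}"
        self.clusters[cid] = {"nodes": nodes, "state": "RUNNING"}
        return cid

    def terminate_cluster(self, cid):
        self.clusters[cid]["state"] = "TERMINATED"

def run_ephemeral_job(client, nodes, job):
    # The cluster exists only for the lifetime of this one job.
    cid = client.create_cluster(nodes)
    try:
        return job(cid)
    finally:
        client.terminate_cluster(cid)   # deprovision once the spike is over

client = CloudClient()
result = run_ephemeral_job(client, nodes=50, job=lambda cid: f"ran on {cid}")
print(result)                                   # ran on cluster-1
print(client.clusters["cluster-1"]["state"])    # TERMINATED
```

The `try`/`finally` is the important design choice: the cluster is deprovisioned even if the job fails, so a crashed ad hoc workload never leaves expensive infrastructure running.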
The combination of these changes (the inclusion bias in data retention and the accommodation of spontaneous workloads using dedicated, disposable infrastructure) is very bullish for analytics. With these changes in place, we have access to much more data, and we have the ability to provision workload-exclusive infrastructure instantly and tear it down just as quickly.
As long as we’re intrepid enough to take advantage of these resources newly available to us, as long as we’re proactive about wading through the data, and as long as we lean forward and do the analytics work, the potential for analytical insight, data-driven competitive advantage and bona fide digital transformation is huge.
There are downsides to these newfound capabilities, though. All that data and all of that infrastructure flexibility are great, but there is a burden to the bounty. Newly obsessive data retention leads to less formality, less vetting, and less organization around what data is available and where it's located. The lowered barrier to retention means that while more data is kept, it's also more off the radar. In effect, the ability to throw data into S3 and other cloud object storage willy-nilly has led to the phenomenon of the shadow data lake.
Despite the slightly nefarious name, shadow data lakes have upsides as well as downsides. Veritable data lakes can emerge simply because people save data sets to cloud object storage in a casual manner. The downside is that such repositories are unofficial and unmonitored. The upside is that the people saving the data sets, however accidentally, are doing the otherwise hard work of putting data in a landing zone and tacitly endorsing it. Organizations that do nothing to onboard that data in a more formal manner leave themselves vulnerable to potential regulatory non-compliance and miss out on the potential value of the data itself.
But, organizations that have the right analytics tools can take those lemons and make lemonade – they can explore the data, verify its value and accuracy, and make their organizations more proactively data-driven. Shadow data lakes represent a great opportunity if handled correctly. The right procedures and the right discovery tools are the keys to success.
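A first step in that discovery work can be as simple as inventorying what people have already dumped into object storage. The sketch below groups raw object keys by their top-level prefix so unofficial data sets become visible; the keys are invented examples, and in practice they would come from an object-store listing API rather than a hard-coded list.

```python
# A minimal sketch of discovery over a shadow data lake: given the raw
# object keys people have casually saved to a bucket, inventory them by
# top-level prefix so unofficial data sets surface. The keys below are
# invented examples for illustration.

from collections import defaultdict

keys = [
    "marketing/campaign_2017.csv",
    "marketing/campaign_2018.csv",
    "tmp/scratch.json",
    "sensor-dumps/plant7/feb.parquet",
    "sensor-dumps/plant7/mar.parquet",
]

def inventory(object_keys):
    # Map each top-level "folder" to the number of objects under it.
    counts = defaultdict(int)
    for key in object_keys:
        prefix = key.split("/", 1)[0]
        counts[prefix] += 1
    return dict(counts)

print(inventory(keys))
# {'marketing': 2, 'tmp': 1, 'sensor-dumps': 2}
```

Even a crude inventory like this turns "off the radar" data into a candidate list for the formal onboarding, vetting and compliance review the surrounding paragraphs call for.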
And that ability to dedicate whole compute clusters to bursty, ad hoc analytics workloads? That’s nice too, except that most higher-level Big Data analytics tools are built to run on single, persistent compute clusters. While you could, of course, do your analytics work by writing your own code to run on the ephemeral cluster’s execution frameworks like MapReduce, Tez or Spark, that approach severely curtails self-service business user access to data lake analytics.
It’s possible to take these significant challenges and turn them into a huge opportunity. What’s required to do that? Simple – a self-service Big Data analytics platform designed for the new paradigms. Such a platform must be able to ingest data from, and store data to, cloud object storage. It must be designed for the scenario of data sprawl across such storage and it should tackle the phenomenon of shadow data lakes head-on. The platform must also be designed for – and not merely tolerant of – the notion of ephemeral infrastructure spun up exclusively to take on emergent, ad hoc workloads.
At the same time, we can’t go overboard. Because as much as the cloud facilitates experimental, greenfield data discovery on an ad hoc basis, repeatable production workloads aren’t going away. For these workloads, capacity is more predictable and static, such that provisioning on-premises infrastructure, or just more statically configured cloud infrastructure, to execute these workloads is quite feasible.
This analytics work will still be carried out at great scale and will be critical to the digitally transformed organization. Indeed, analytics carried out in ad hoc, ephemeral environments that turns out to be of lasting value becomes the fodder for new such production work. Think of the “bursty,” ephemeral work in the cloud as hunting. Think of the steady, predictable production work as gathering. Both are important and they are inextricably tied.
So, a self-service analytics platform needed for the new paradigms must be respectful of the older ones as well. It must be versatile; it must be ambidextrous. Its sweet spot must encompass both old and new workflows, such that neither becomes a stressor.
Such a platform will also facilitate customers’ shift from conventional data warehousing platforms because it will allow those customers to adopt data lake technology for their non-operational workloads first, running in parallel with the production processing on older platforms. Customers can then verify the value of Big Data and data lake technologies, and with that confidence and due diligence established, gradually move production workloads to them as well. And they’ll do it with conviction.
That’s the virtuous cycle to prudent adoption and use of Big Data technology, spanning the special-purpose to the everyday. That’s the path to digital transformation. That’s the route to rational acknowledgement that customers have to ramp up to new technologies and can’t just cut over to them all at once.
In short, this is how to make the data lake and Big Data analytics a value-positive tool for enterprises instead of an adoption burden and migration quagmire. It’s how analytical insights will transpire organically, helpfully, powerfully and harmoniously.
To learn more about how to get immediate value from your shadow data lakes in the cloud, register for the upcoming November 29th webinar: Tap the Power of the Cloud for Big Data
Andrew Brust is CEO and founder of Blue Badge Insights, advising big data vendors and customers on strategy and implementation. He covers big data and analytics for ZDNet, is conference co-chair for Visual Studio Live! and is a Microsoft Data Platform MVP.