Datameer Blog post
Cloud Platforms for Analytics: The House Brand Ain’t Always Enough – conclusion
by Andrew Brust on May 08, 2018
Last week we looked at the industry trends that have created market demand for public cloud solutions, what’s in the cloud analytics stack and how Amazon’s services pair up for integration. If you missed it, check out part one first.
A Difference of Agendas
The cloud providers seek to supply you with core components, plus enough platform programmability to let you cobble them together. The cloud providers could present a much more integrated experience, with tools that provide a higher level of abstraction over these base components – but they don’t.
Each public cloud vendor provides a platform, rather than full-on solutions that run on the platform. The goal is to provide building blocks that others can compose into full solutions. The cloud providers see more value in providing innovative, breakthrough capabilities in raw form than they do in the finishing work of tying these services together. That tying is something they leave to partners, and/or the internal technology personnel at their more sophisticated customers.
While the cloud providers aren’t trying to be hostile to customers, the analytics stack approach they take nonetheless poses challenges and threats to said customers:
- On-premises data sources are often left behind
- A one-size-fits-all approach creates a lack of differentiated offerings for startups, small and medium businesses, and enterprise customers
- A significant integration burden is left for the customer to solve on its own
These facts leave the customer holding the bag. We will now look at these burdens and see that the risks are quite real. Then we’ll explore the advantages of using third-party analytics software that sits atop the cloud analytics stack components. We’ll see how such products make using the tools in that stack more feasible and reduce the possibility of project failure.
Cloud Analytics Survival Skills
First, let’s be clear that integrating cloud analytics components on one’s own is not by any means impossible. The procedures for doing so are documented and, on a one-off basis at least, may be quite feasible. But customers who wish to go this route will need specific skill sets in-house.
For example, guidance from Amazon has it that EMR should be used for complex data transformation tasks. Whether writing code that runs natively on Spark Core or Hadoop’s Tez execution engine, or using higher-level components like Pig, Hive or Spark SQL, sophisticated processing can be done on data sitting in S3 as it makes its way from one service to another.
But while that may sound like an elegant integration, the reality is that it’s anything but seamless. For one thing, creating the EMR job that performs the data transformation could require knowledge of Java programming and the Hadoop or Spark APIs. In a less dire situation, it would require knowledge of Pig and its Pig Latin language, or else the intricacies of HiveQL combined with creating user-defined functions (UDFs – again, in Java) that are callable from the Hive queries. And regardless of which of those skill set combinations comes into play, an understanding of string manipulation, the S3 folder structure, column data types and column naming all figure into the exercise.
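To make the skill set concrete, here is a minimal PySpark sketch of the kind of EMR transformation job described above: read raw data sitting in S3, apply string manipulation and column typing with Spark SQL, and write the result back to S3. The bucket paths, table name and column names are hypothetical placeholders, not values from this post, and running it presumes a provisioned EMR/Spark cluster with S3 access.

```python
# Hedged sketch of an EMR data transformation job, assuming a Spark
# cluster with S3 access. Paths and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-transform").getOrCreate()

# Read raw CSV data sitting in S3 (knowledge of the S3 folder
# structure figures in here).
raw = spark.read.csv("s3://my-raw-bucket/events/", header=True)
raw.createOrReplaceTempView("raw_events")

# The string manipulation and column typing mentioned above,
# expressed in Spark SQL rather than Java or Pig Latin.
cleaned = spark.sql("""
    SELECT
        CAST(event_id AS BIGINT)      AS event_id,
        TRIM(LOWER(user_name))        AS user_name,
        CAST(event_time AS TIMESTAMP) AS event_time
    FROM raw_events
    WHERE event_id IS NOT NULL
""")

# Write the transformed data back to S3 for the next service to pick up.
cleaned.write.mode("overwrite").parquet("s3://my-curated-bucket/events/")
```

Even this small sketch touches several of the competencies discussed below: Spark APIs, SQL, and the mechanics of data moving through S3.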
Orchestrating the execution of the job in some automated fashion, as Kinesis Streams data arrives in S3, for example, would require further skills, whether in authoring and operating Data Pipeline jobs, in using Lambda, or in something more homespun.
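As an illustration of the Lambda route, the sketch below shows the skeleton of a handler that fires when a new object lands in S3 (data delivered from Kinesis, say) and extracts what a real function would need to kick off the downstream EMR job. The event shape follows S3’s notification format; the bucket and key values, and the idea of submitting an EMR step from here, are illustrative assumptions rather than anything prescribed in this post.

```python
# Hedged sketch of a Lambda handler triggered by an S3 object-created
# event. In a real deployment, boto3 would be used at the marked spot
# to submit an EMR step or start a Data Pipeline run.
import urllib.parse

def handler(event, context):
    """Pull the bucket and key of the newly arrived S3 object."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # S3 event keys are URL-encoded; decode before use.
    key = urllib.parse.unquote_plus(record["object"]["key"])
    # Here a real function would trigger the transformation job
    # against s3://<bucket>/<key> (e.g. via an EMR step).
    return {"bucket": bucket, "key": key}
```

Small as it is, this is yet another runtime, permission model and event format for the in-house team to learn and operate.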
If you want to build major, production-ready analytics pipelines on a public cloud platform, you’re going to need several competencies in place:
- Database and SQL skills
- ETL skills and an appreciation for the infrastructure mechanics of ETL processes
- Familiarity with Spark and/or Hadoop
- Readiness and budget for in-house solution development and the project management prowess to make those solutions successful
Beyond this, maintenance of the solution will be a continual responsibility. Risk tolerance is a factor as well: whether the solution is developed by in-house resources or a solution provider, risks of cost overruns and project failure will remain pronounced, all the way through to delivery. Integration projects are significant undertakings in terms of development effort, ongoing management and maintenance.
Buy is better than DIY
But, even in the cloud, the pioneering days are over, and the need for do-it-yourself, bespoke solutions in the world of data analytics is long past. It’s better to buy than DIY.
But buy what, exactly? Specifically, when it comes to managing analytic data in the cloud, the answer is to procure a well-engineered third-party software product that leverages the public cloud provider components, while providing an agile, end-to-end experience. Such products do several important and powerful things, including:
- Leveraging the full power of the base components
- Shielding customers from needing skill sets for each such component, or from working in the components’ low-productivity tooling or command-line environments
- Integrating the base components and orchestrating their cooperative operation, while providing a simplified, abstracted user experience
- Having been rigorously tested to run on the particular combination of components that they utilize, and do so in robustly supported scenarios
- Having been proven out, in real deployments, running in production, with big enterprise customers
When third-party products are used in this capacity, the somewhat esoteric open source and cloud technologies they exploit suddenly become friendly tools that are much less risk-laden than when used in “raw” form.
To the user, cloud object storage starts to feel like conventional storage. Hadoop and Spark shine through in terms of power, and the complexity of operating them fades in importance and impact. The exploration, integration, preparation, enrichment and analysis of structured and unstructured data, all together, becomes a reality for your users.
In addition to the architectural advantages, a complete third-party analytics product offers a number of other benefits:
- The ability to auto-scale, based on the needs of your processing jobs
- End-to-end governance and security
- The elimination of resource-wasting data copying
- Complete operationalization of data workflows
- Out-of-the-box integration with familiar BI and analytics tools
The abstraction created by such tools allows customers to organize and curate their data in a way that uses cloud storage, without needing to familiarize themselves with its mechanics. And once the product delivers the final data for consumption, it can be pushed to any destination – data warehouse platforms, BI tools, data science tools, etc. – whether on-premises, in the private cloud or in the public cloud.
The innovation the underlying components provide is harvested and used to maximum advantage. The rough edges, the mismatches, and the required technical knowledge disappear as factors because the third-party product absorbs these challenges. The product’s engineering team assumes the skill set burden; the product’s user experience smooths out the components’ rough edges and interface mismatches; and the product’s adoption by other customers removes risks of unproven deployment, compatibility or performance.
Make it Repeatable
Another advantage of certain third-party analytics products is that they work as ad hoc, interactive, exploratory data tools and yet can also work in an operational capacity, executing formalized data pipelines in production. Products with that versatility allow users to start out working interactively and then, for work with lasting value, promote the assets to scheduled jobs that run in production.
Certain third-party products even allow customers to compose multiple assets built within the products, so that one file or project always executes subsequent to another. These products allow customers to define the dependencies between assets and make certain that re-execution of one automatically triggers re-execution of those dependent on it.
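The dependency behavior just described can be illustrated with a toy sketch: assets declare what they depend on, and re-executing one asset automatically triggers re-execution of everything downstream of it. The asset names and the graph here are invented for illustration; real products implement this with full scheduling and workflow machinery, not a dozen lines of Python.

```python
# Toy illustration of dependency-triggered re-execution between
# composed assets. Asset names are hypothetical.
from collections import defaultdict

# downstream[x] = assets that must re-run after x re-runs
downstream = defaultdict(list)

def depends_on(asset, dependency):
    """Declare that `asset` is built from `dependency`."""
    downstream[dependency].append(asset)

def rerun(asset):
    """Return the re-execution order triggered by re-running `asset`."""
    order, seen = [], set()
    def visit(a):
        if a in seen:
            return
        seen.add(a)
        order.append(a)
        for child in downstream[a]:
            visit(child)
    visit(asset)
    return order

# Example pipeline: ingest feeds clean, which feeds report and dashboard.
depends_on("clean", "ingest")
depends_on("report", "clean")
depends_on("dashboard", "clean")
```

With this wiring, re-running the "clean" asset automatically re-runs "report" and "dashboard" as well, while leaving "ingest" untouched.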
Analytics products that do all of this, utilizing public cloud analytics components and running in the cloud themselves, typically do so in a self-contained manner. They do not require additional tooling, management frameworks or workflow environments. The best products offer an experience that is turn-key.
Such products tame the wildness of public cloud analytics stacks and make them work for the customer.
Public cloud analytics technology is incredibly powerful and innovative; implementing it successfully can be complex and difficult, and projects may not succeed. The cloud analytics stacks are collections of individual open source projects, not unified, proprietary suites from a single vendor. That pedigree underlies both the stack’s power and its complexity, and is something enterprise customers need to pay very close attention to.
This approach by the cloud providers can leave enterprise customers in a tough spot. Not only do customers face significant challenges to successful cloud analytics implementations but, because they trust their cloud providers, they may also be lulled into a false sense of security that those challenges have been eliminated. It’s double jeopardy.
The good news is that the complexities can be conquered, and the risks largely mitigated, through the use of commercial third-party analytics applications. These products run atop these powerful components, exploit their power and hide their complexity, giving users the best of both worlds. Runaway skill set requirements are eliminated; integration challenges are removed; abrupt context switches between disparate open source technologies go away.
But using such products hasn’t been the mainstream strategy in the Big Data analytics world, especially in the cloud. Refusal to adopt this strategy has led to project failures and a reputation deficit for powerful open source analytics technologies. If use of third-party analytics products like Datameer by enterprise customers were de rigueur, that reputation could be turned around very quickly.
The agility you expect from the cloud will prove to be elusive if you limit yourself to its first party offerings. First party + third party = customer solution success. It’s a simple equation that can yield excellent results. But without third-party offerings, success will play hard-to-get. The house brand just ain’t enough.
Be sure to check out our on-demand webinar, Analytics in the Cloud: Is Your Data Ready? Learn where it makes sense to embrace data in the cloud, challenges and approaches to agile data pipelines in the cloud and how companies are getting value from cloud analytics today.
Andrew Brust is CEO and founder of Blue Badge Insights, advising big data vendors and customers on strategy and implementation. He covers big data and analytics for ZDNet, is conference co-chair for Visual Studio Live! and is a Microsoft Data Platform MVP.