A funny thing happened to big data, Hadoop, and Apache Spark on the way to the enterprise mainstream. Although most enterprise IT groups and many business units have bought into the value of big data, its deployment, in many cases, still falls into one of two models. And both of these common big data implementations have drawbacks, because they are rooted in big data’s compromised, pre-mainstream adoption.
In one case, legacy BI platforms, on which customers may have standardized long ago, are made to work with Hadoop. This is usually done via some abstraction layer that makes Hadoop look like one or another older, conventional data server technology.
In the other case, customers approach big data with tools that are designed specifically for it, but which lack the usability of enterprise or self-service BI. While each approach has its merits, you should also be aware of the limitations they impose. There is also a third, “best of both worlds” option to consider.
Read on to learn more, and dig even deeper in this free eBook, “Getting Value from Hadoop: Build, Bridge or Buy.”
Building a Bridge to Leverage Traditional BI
Not surprisingly, lots of organizations have invested significantly – sometimes in the millions of dollars – into traditional enterprise data warehouse and BI tooling, and they don’t want that investment to gather cobwebs. As such, they’re compelled to look at what are essentially bridging technologies – or workarounds – to get older technologies to work in the new big data world.
To this end, a number of technologies exist in the market, which fall under the larger category of “SQL on Hadoop.” All of these technologies expose data on the Hadoop cluster through a relational database management system (RDBMS) interface. These technologies help conventional products that were built for RDBMSes to query data in Hadoop – essentially by fooling them into thinking they’re still talking to a relational database.
While this solves the problem of mere access to big data, it doesn’t do so in a very elegant fashion. In effect, customers’ tools can now connect to Hadoop, but only because the tools think they’re connecting to something else. Hadoop is made to bend to the will of the very technologies it seeks to usurp: the tools connect to it only because it has been hidden, suppressed, and made to change its stripes.
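To make the “fooling the tools” idea concrete, here is a toy sketch of what a SQL-on-Hadoop translation layer does conceptually. All names (`SqlOnFiles`, the records, the tiny SQL subset) are hypothetical, and JSON-lines stand in for files on HDFS; real engines like Hive support far richer SQL, but the shape is the same: accept a relational query, translate it into a scan-and-filter over raw files.

```python
import io
import json
import re

class SqlOnFiles:
    """Toy translation layer: exposes raw JSON-lines records (standing in
    for files on HDFS) through a minimal SELECT ... FROM ... WHERE interface."""

    def __init__(self, lines):
        # Each line is one JSON record, as it might sit in an HDFS file.
        self.records = [json.loads(l) for l in lines if l.strip()]

    def query(self, sql):
        # Parse only: SELECT cols FROM table [WHERE col = 'value']
        m = re.match(
            r"SELECT\s+(.+?)\s+FROM\s+\w+(?:\s+WHERE\s+(\w+)\s*=\s*'([^']*)')?\s*$",
            sql, re.IGNORECASE)
        if not m:
            raise ValueError("unsupported SQL: " + sql)
        cols = [c.strip() for c in m.group(1).split(",")]
        col, val = m.group(2), m.group(3)
        rows = self.records
        if col is not None:  # translate the WHERE clause into a record filter
            rows = [r for r in rows if str(r.get(col)) == val]
        if cols == ["*"]:
            return rows
        return [{c: r.get(c) for c in cols} for r in rows]

raw = io.StringIO(
    '{"user": "ana", "clicks": 3}\n'
    '{"user": "bo", "clicks": 7}\n'
)
db = SqlOnFiles(raw)
print(db.query("SELECT user FROM events WHERE clicks = '7'"))  # [{'user': 'bo'}]
```

The BI tool on top sees an ordinary relational result set; everything Hadoop-specific is hidden behind the adapter, which is exactly the compromise the article describes.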
Some other commercial tools exist that create an OLAP (online analytical processing) abstraction over Hadoop. Some of these only create a semantic data model over Hadoop. When queried, they in turn dispatch SQL queries using one or more of the tools we’ve already discussed. Such OLAP tools introduce two layers of indirection (OLAP and RDBMS), making things even more complicated than the SQL-on-Hadoop scenarios.
Other products are more efficient, implementing physical (“materialized”) OLAP cubes natively on Hadoop. This approach is more compelling, because it avoids emulation, and instead creates an execution engine that can query data stored in Hadoop directly. That’s a nice touch, and they can still function in a way that is compatible with older, non-Hadoop-based OLAP tools.
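The idea behind a materialized cube can be sketched in a few lines. This is a hypothetical illustration, not any vendor’s implementation: raw fact records are pre-aggregated along chosen dimensions (including “all” roll-ups, written `*`), so later queries become dictionary lookups instead of full scans.

```python
from collections import defaultdict

def build_cube(facts, dimensions, measure):
    """Pre-aggregate `measure` for every combination of the dimensions,
    including roll-ups where a dimension is collapsed to '*' (all values)."""
    cube = defaultdict(float)
    for fact in facts:
        # Each fact contributes to every roll-up level: 2**len(dimensions) cells.
        for mask in range(2 ** len(dimensions)):
            key = tuple(
                fact[d] if (mask >> i) & 1 else "*"
                for i, d in enumerate(dimensions))
            cube[key] += fact[measure]
    return dict(cube)

facts = [
    {"region": "east", "product": "gizmo", "sales": 10.0},
    {"region": "east", "product": "widget", "sales": 5.0},
    {"region": "west", "product": "gizmo", "sales": 2.0},
]
cube = build_cube(facts, ["region", "product"], "sales")
print(cube[("east", "*")])   # total east sales: 15.0
print(cube[("*", "gizmo")])  # total gizmo sales: 12.0
```

Materializing the cube moves the aggregation cost to build time, which is why this approach can serve OLAP-style queries quickly while still reading its source data directly from Hadoop.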
Schema ain’t right
But whether it’s SQL-on-Hadoop or OLAP-on-Hadoop, and whether the queries are translated or executed directly, all of these tools require that the data’s schema be pre-declared and strictly adhered to. In other words, they require the data to conform to a schema before it can be stored. It can be argued that this approach, sometimes called “schema on write,” is the antithesis of working with big data.
One reason for this: big data often involves querying unconventional data sources whose structure can vary between data records. Another: two different analyses on the same data may warrant its structure being interpreted differently. Forcing structure in advance removes that flexibility at analysis time.
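The contrast can be shown in a small sketch, with hypothetical field names and Python’s built-in `sqlite3` standing in for any schema-on-write store. The fixed table silently loses a field the schema didn’t anticipate, while schema-on-read keeps the raw records and lets each analysis interpret them its own way at query time.

```python
import json
import sqlite3

raw_events = [
    '{"user": "ana", "clicks": 3}',
    '{"user": "bo", "clicks": 7, "referrer": "ad"}',  # record with an extra field
]

# --- Schema-on-write: the shape is fixed before any data lands.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
for line in raw_events:
    r = json.loads(line)
    # The 'referrer' field has no column to go to; it is silently dropped.
    db.execute("INSERT INTO events VALUES (?, ?)", (r["user"], r["clicks"]))

# --- Schema-on-read: the same raw lines, interpreted per analysis.
records = [json.loads(line) for line in raw_events]
total_clicks = sum(r["clicks"] for r in records)   # analysis 1: a simple total
ad_users = [r["user"] for r in records
            if r.get("referrer") == "ad"]          # analysis 2 still sees 'referrer'

print(total_clicks, ad_users)  # 10 ['bo']
```

Under schema-on-write, the second analysis would be impossible without re-declaring the table and reloading the data; under schema-on-read, both analyses coexist over the same untouched records.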
Starting Down a New Path: Purpose-built for Hadoop
On the other hand, there are tools that are very much designed for Hadoop and Spark. For example, there’s Hue, a Web browser-based development environment for running jobs (for example, on MapReduce), executing individual commands and scripts, or submitting SQL queries. Another category, so-called “notebook” tools, like Apache Zeppelin and Jupyter, allow for working against Apache Spark. Although they have data visualization facilities built right in, they are designed for use by developers and not business users.
Commercial tools are also available for working with big data. Many of these allow data transformation pipelines to be designed and executed, a task these tools doubtless make much easier; but they are, if anything, developer-oriented, and can hardly be called “self-service.”
Even with all its advantages, harnessing the power of Hadoop directly has its challenges. Programming all the analytic process steps inside of Hadoop is a time-consuming, costly endeavor that requires very specialized skills. In addition, there are analytic infrastructure tasks that may need to be hand-coded.
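To make the hand-coding point concrete, here is a minimal word count written in the MapReduce style, in plain Python. This is a sketch only: a real Hadoop job also needs job setup, serialization, and cluster configuration, but even this toy shows that the developer must write every phase that a self-service platform would let an analyst configure instead.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit an intermediate (key, value) pair for every word.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group intermediate values by key, as the framework
    # would between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big tools", "big plans"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])  # 3
```

Every one of these steps, plus scheduling, security, and downstream integration, has to be written, tested, and maintained by specialists, which is exactly the cost the list below enumerates.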
This would dramatically impact the time it takes to deliver your big data projects:
- Programming the individual analytic steps takes far more time than being able to configure and model them in an analytic platform.
- The programmatic approach will often have team members focusing on different steps, requiring complex hand-offs that add extra time.
- Re-use of components in a programmatic approach is more difficult, eliminating ramp-up time efficiencies in subsequent projects.
- Operationalizing programmatic analytics adds even more coding and risk for items such as scheduled jobs, security, and integration with downstream business applications.
Meet in the Middle: Schema-on-Read AND Self-Service
The important, and perhaps fairly obvious, question by now is: where is the middle ground? How can you avoid the use of RDBMS or OLAP abstraction layers, with their schema-on-write approaches, yet still offer a business-user interface AND avoid developer minutiae, be it code, command lines, or detailed data flow diagrams? What about an approach that joins the flexibility of schema-on-read with a user interface that’s self-service by design?
The Advantages of the Modern BI Platform Approach
It’s a good question. A modern BI platform built natively on Hadoop, like Datameer, leverages the unlimited compute and storage power of Hadoop while abstracting the technical complexity AND still allowing you to leverage existing BI investments. It is directly usable by business analysts, who can define all the steps in the analytic process themselves, so there is no programming required.
What makes a modern BI platform? Read this: “7 Requirements of a Modern BI Platform”
As is the case with the other categories explained above, there are differences in what you get with different tools that classify themselves as modern BI platforms. Here’s what to look for to achieve the “best of both worlds” approach outlined above:
- A self-service end-to-end platform, making data integration, preparation, analytics, visualization and operationalization easy for the end-user
- Dynamic modeling capabilities, giving you the schema flexibility you need
- The ability to collaborate and re-use analytic results as data sets for further analysis, encouraging data exploration
- Functionality to operationalize the insights you find, feeding meaningful business processes and making your insights actionable
- Governance controls so you can democratize data access without introducing chaos
- The ability to leverage existing BI investments, with the right import and export functionality and data source connectivity, so switching between tools is seamless
There’s plenty more to discuss here on each of the three approaches. To dig in even further and understand how each approach will affect your outcomes, both from a time and staffing perspective, be sure to check out this free eBook, “Getting Value from Hadoop: Build, Bridge or Buy”.