Datameer Blog

Are You Thinking About Hadoop All Wrong?

By Stefan Groschupf on January 30, 2014

**First printed in rediscoveringBI Special Edition, January 2014. Reprinted with permission from Radiant Advisors.**

We are at a technological crossroads.

Forty years ago, when databases first came into play, hardware was a far bigger cost than human capital or time. The traditional three-tier architecture, in which data is first extracted and transformed, then loaded into a data warehouse, with a business intelligence (BI) tool placed on top, was the best we could do given the limitations of proprietary hardware that was extremely expensive to scale. This approach worked very well, and still does to this day, but business needs today go beyond what traditional databases are capable of doing.

Today, human capital and time are far bigger expenses than hardware, and businesses have less and less time to make decisions. Yet, with the exponential increase in data complexity, the time it takes to get data integrated and analyzed in traditional systems is increasing. This leaves businesses with traditional systems stuck either accepting old, incomplete data to inform decisions or relying on gut feelings.

Think about this: TDWI says the average change cycle to add a new data source to a data warehouse is 18 months. I don’t know a single department that could possibly wait 18 months for an answer. This isn’t about teaching an old dog new tricks; it’s about letting RDBMSs continue to work on the traditional transaction-based use cases they were built for, and then bringing in new systems that are purpose-built for big data workloads.

Enter Hadoop

Moore’s Law is what paved the way for Hadoop, a linearly scalable storage and compute platform that is optimal for data analytic workloads. Hadoop brings to the table a schema-on-read approach, as opposed to the traditional schema-on-write with ETL: data is stored raw, and structure is applied only when the data is read for analysis. And it’s this fact, that ETL is no longer needed, that opens up big data analytics to solving new business use cases that traditional systems simply can’t. There’s no longer a prohibitive 18-month change cycle.
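
To make the schema-on-read idea concrete, here is a minimal sketch in plain Python. It is not Hadoop or Datameer code; the sample records, the `read_with_schema` helper, and the field names are all hypothetical and exist only for illustration. The point is that the raw lines are stored untouched and each analysis decides at read time which fields it needs and how to type them.

```python
import json

# Raw events stored exactly as they arrived; no schema was enforced on write.
# (Hypothetical sample data.)
RAW_LINES = [
    '{"user": "a1", "action": "click", "ts": "2014-01-30T10:00:00"}',
    '{"user": "b2", "action": "purchase", "amount": "19.99"}',
    '{"user": "c3", "campaign": "winter", "action": "click"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: pick fields, cast types, default to None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Two different "views" over the same raw data, defined after the fact.
clicks_view = read_with_schema(RAW_LINES, {"user": str, "action": str})
revenue_view = read_with_schema(RAW_LINES, {"user": str, "amount": float})
```

Adding a new question here means writing a new schema dictionary, not reloading the data, which is the flexibility the schema-on-read argument rests on.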

Let me be clear. Potential cost savings aside, the most immediate benefit your business can realize from implementing Hadoop with a self-service big data analytics tool like Datameer is significant time savings when it comes to integrating data. This, again, is thanks to the fact that Hadoop is linearly scalable on commodity hardware and does not require a data model to be created before data is stored.

The basic concept is this: use a self-service big data analytics tool, like Datameer, to integrate any and all data (structured, semi-structured, unstructured, big or small) in Hadoop, in its raw format. Call it a data lake, a data reservoir, or whatever you will; let it be your central repository for raw data. Once you have all your data integrated (and remember, you can easily add a new data source anytime one crops up), you begin your analysis by simply building “views” or “lenses” on your data with Datameer to find the insights that matter to your business.

Think of big data analytics on top of Hadoop as 3D printing. The raw data is your raw material. Just as the material stays the same no matter what you want to print, your data stays raw no matter what kind of analysis you want to perform; it is simply pushed through a template you build in Datameer to unveil the insights you’re looking for.

Don’t Get Stung By Hive

One thing I want to make very clear is that Hadoop is not “just another data source,” and any BI tool that simply connects to Hadoop as a data source is severely limiting Hadoop’s potential benefit to your business. In fact, if a BI tool is your only interface to Hadoop, you’re leaving a lot on the table and minimizing your ROI from implementing Hadoop in the first place. If you truly have a big data use case, involving structured and unstructured data, you need a tool that is purpose-built for Hadoop.

Traditional BI tools that connect to Hadoop usually do so through Hive, a data warehouse infrastructure built on top of Hadoop that allows for querying and analysis of structured data only. Like the structured data stores used in traditional BI, Hive requires tables and schemas that are then queried via a SQL-like language. This approach carries the same limitations as many existing systems: the questions that can be explored are limited to those answerable from data in the Hive schema, rather than from the full raw data that can be analyzed with Datameer. Forcing data into a schema with Hive negates the flexibility that Hadoop provides.

In short, if you’ve invested in Hadoop and want a business user to build an analysis on data stored there, that’s great, but a BI tool limits them to structured data only. This is why using a self-service tool that works directly with MapReduce and HDFS, like Datameer, is critical.

Ultimately, if you want to bring Hadoop in to complement your existing data warehouse, it all comes down to your particular business use case(s). Let me illustrate what’s possible by bringing Hadoop to the table with three different examples.

Sales Funnel Optimization

A leading software security company used Datameer and Hadoop to integrate and analyze all of its customer acquisition data to understand and measure how people move through a sales funnel to eventually become customers. This meant bringing together data sources housed in several different systems, including search engine advertising, marketing automation and email nurturing, and CRM systems. That level of integration, which included joining structured and unstructured data, was extremely time-consuming and cost-prohibitive with the company’s traditional systems.

Datameer’s 55 pre-built data connectors enabled the company to quickly load data from Google AdWords, web logs, logs from a content delivery network, Marketo, JSON from product logs, and a CRM system, all in less than a week. After the initial load, Datameer was set up to load data on an hourly basis.

From there, business users joined all the data together using Datameer’s spreadsheet user interface and started to build analyses using Datameer’s pre-built analytic functions. Through the analysis, they identified the bottlenecks in the conversion process; addressing them enabled the company to triple customer conversion and increase revenue by $20 million within six months.
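
The kind of funnel analysis described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the company’s analysis or Datameer’s functions: the three record lists, the shared `lead` key, the stage names, and both helper functions are hypothetical. It joins the sources on a common ID, counts how many leads reach each stage, and picks out the stage-to-stage step with the lowest conversion rate.

```python
# Hypothetical records from three separate systems, keyed by a shared lead ID.
ad_clicks = [{"lead": "L1"}, {"lead": "L2"}, {"lead": "L3"}, {"lead": "L4"}]
email_opens = [{"lead": "L1"}, {"lead": "L2"}, {"lead": "L4"}]
crm_deals = [{"lead": "L1", "status": "won"}, {"lead": "L2", "status": "lost"}]

FUNNEL_STAGES = ["clicked_ad", "opened_email", "became_customer"]


def build_funnel(clicks, opens, deals):
    """Join the three sources on lead ID and count leads reaching each stage."""
    clicked = {r["lead"] for r in clicks}
    opened = clicked & {r["lead"] for r in opens}
    won = opened & {r["lead"] for r in deals if r["status"] == "won"}
    return dict(zip(FUNNEL_STAGES, (len(clicked), len(opened), len(won))))


def stage_conversion(funnel):
    """Conversion rate from each stage to the next; the lowest is the bottleneck."""
    rates = {}
    for prev, nxt in zip(FUNNEL_STAGES, FUNNEL_STAGES[1:]):
        rates[f"{prev}->{nxt}"] = funnel[nxt] / funnel[prev] if funnel[prev] else 0.0
    return rates


funnel = build_funnel(ad_clicks, email_opens, crm_deals)
rates = stage_conversion(funnel)
bottleneck = min(rates, key=rates.get)
```

With this toy data, three of the four leads who clicked an ad opened an email but only one became a customer, so the email-to-customer step is flagged as the bottleneck.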

Predictive Maintenance

While a lot of big data use cases are about making money, using data to optimize production is a great way to save time and money. One study found that auto-manufacturing executives estimate the cost of production downtime at anywhere from $22,000 to $50,000 per minute.
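
To put those per-minute figures in perspective, the short calculation below scales them to a longer outage. The two constants are the figures quoted above; the one-hour outage length and the `downtime_cost_range` helper are assumptions chosen purely for illustration.

```python
# Per-minute downtime cost range quoted in the text, in US dollars.
COST_PER_MINUTE_LOW = 22_000
COST_PER_MINUTE_HIGH = 50_000


def downtime_cost_range(minutes):
    """Return the (low, high) estimated cost of an outage of the given length."""
    return minutes * COST_PER_MINUTE_LOW, minutes * COST_PER_MINUTE_HIGH


low, high = downtime_cost_range(60)  # a hypothetical one-hour outage
print(f"One hour of downtime: ${low:,} to ${high:,}")
# prints "One hour of downtime: $1,320,000 to $3,000,000"
```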

One of the leading global auto manufacturers used Hadoop and Datameer to combine unstructured and structured data from Programmable Logic Controllers (PLCs) and proprietary factory and maintenance ticketing systems. The PLCs housed detailed robot data, including the temperature of components when a robot broke down. By pulling together and analyzing temperature and vibration sensor log files with maintenance history, the manufacturer was able to understand why certain robots had broken down in the past. With this knowledge, the manufacturer created a robot maintenance schedule to identify and service robots before failure occurred, which lowered its factory outage time by 15 percent.
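
As a rough illustration of how sensor logs and maintenance history might be combined, the sketch below derives a warning threshold from the temperatures seen at past breakdowns and flags robots whose readings approach it. This is an assumption-laden simplification, not the manufacturer’s method: it uses temperature only (the analysis described above also used vibration data), and the sample readings, the 0.9 safety margin, and both helper functions are hypothetical.

```python
from statistics import mean

# Hypothetical sensor readings: (robot_id, temperature_celsius).
sensor_log = [
    ("R1", 61.0), ("R1", 88.5), ("R2", 59.0), ("R2", 63.0), ("R3", 84.0),
]
# Hypothetical maintenance history: temperature recorded at each past breakdown.
breakdown_temps = [90.0, 86.0, 94.0]


def failure_threshold(temps, margin=0.9):
    """Warning level: a fraction (margin) of the mean breakdown temperature."""
    return mean(temps) * margin


def robots_to_service(log, threshold):
    """Return robots whose hottest reading reaches the threshold, sorted by ID."""
    hottest = {}
    for robot, temp in log:
        hottest[robot] = max(temp, hottest.get(robot, temp))
    return sorted(r for r, t in hottest.items() if t >= threshold)


threshold = failure_threshold(breakdown_temps)
due = robots_to_service(sensor_log, threshold)
```

A real schedule would likely weigh several signals and their trends over time; a single static threshold is used here only to keep the join between the two data sources easy to follow.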

Competitive Pricing

Agility is a must when it comes to making rapid pricing decisions based on competitive data. Using a traditional BI tool and data warehouse, a major retailer’s IT team struggled to prepare the necessary data in a timely fashion, and the cost of expanding the existing data warehousing systems proved prohibitive. The team needed a single hub for competitive pricing information across all lines of business, one flexible enough to handle the variety and volume of new data coming in. The team also required a single datastore as the entry point and hub for all enterprise data assets, which could in turn feed other decision systems.

Using Datameer, the team was able to load raw data of all different sizes and formats into Hadoop daily, and then cleanse and transform the data using pre-built functionality in Datameer. Datameer and Hadoop then fed lower-capacity data warehouses like Netezza, Oracle, and Teradata. By bringing all the data together, the retailer was able to compare product pricing with competitive stores, test hypotheses, and gain competitive insights.

The Bottom Line

When leveraged properly as the powerful storage and compute platform that it is, Hadoop brings an unprecedented amount of flexibility to data analytics. With Datameer on top, Hadoop’s no-ETL, schema-on-read approach not only eliminates the 18-month change cycle but also gives business users the flexibility to finally interact with their data on a truly iterative basis. Only then are businesses able to ask and answer questions they’ve never been able to ask before, allowing them to make data-driven decisions that drive their business forward.


Stefan Groschupf

Stefan Groschupf is a big data veteran and serial entrepreneur with strong roots in the open source community. He was one of the very few early contributors to Nutch, the open source project that spun out Hadoop, which, 10 years later, is considered a 20-billion-dollar business. Open source technologies designed and coded by Stefan can be found running in all 20 of the Fortune 20 companies in the world, and innovative open source technologies like Kafka, Storm, Katta and Spark all rely on technology Stefan designed more than half a decade ago. In 2003, Groschupf was named one of the most innovative Germans under 30 by Stern Magazine. In 2013, Fast Company named Datameer one of the most innovative companies in the world. Stefan is currently CEO and Chairman of Datameer, the company he co-founded in 2009 after several years of architecting and implementing distributed big data analytic systems for companies like Apple, EMI Music, Hoffmann La Roche, AT&T, the European Union, and others. After two years in the market, Datameer was commercially deployed in more than 30 percent of the Fortune 20. Stefan is a frequent conference speaker and contributor to industry publications and books, holds patents, and advises a set of startups on product, scale and operations. When not working, Stefan is backpacking, sea kayaking, kite boarding or mountain biking. He lives in San Francisco, California.