
Datameer Blog

Bursting the Big Data Bubble

By Stefan Groschupf on September 10, 2012

Below is an article I originally wrote for ZDNet’s Big On Data blog, republished here with permission. Special thanks to Andrew Brust (@AndrewBrust)!

We’re in the middle of a Big Data and Hadoop hype cycle, and it’s time for the Big Data bubble to burst.

Yes, moving through a hype cycle enables a technology to cross the chasm from the early adopters to a broader audience. And, at the very least, it indicates a technology’s advancement beyond academic conversations and pilot projects. But the broader audience adopting the technology may just be following the herd, and missing some important cautionary points along the way. I’d like to point out a few of those here.

Riding the Bandwagon
Hype cycles often come with a “me too” crowd of vendors who hastily rush to implement a hyped technology, in an effort to stay relevant and not get lost in the shuffle. But offerings from such companies may confuse the market, as they sometimes end up implementing technologies in inappropriate use cases.

Projects using these products run the risk of failure, yielding virtually no ROI, even when customers pony up significant resources and effort. Customers may then begin to question the hyped technology. The Hadoop stack is beginning to find itself receiving such criticism right now.

Bursting the Big Data bubble starts with appreciating certain nuances about its products and patterns. Following are some important factors, broken into three focus areas, that you should understand before considering a Hadoop-related technology.

Hadoop is not an RDBMS killer
Hadoop runs on commodity hardware and storage, making it much cheaper than traditional Relational Database Management Systems (RDBMSes), but it is not a database replacement. Hadoop was built to take advantage of sequential data access, where data is written once then read many times, in large chunks, rather than single records. Because of this, Hadoop is optimized for analytical workloads, not the transaction processing work at which RDBMSes excel.

Low-latency reads and writes won’t, quite frankly, work on Hadoop’s Distributed File System (HDFS). Merely coordinating the write or read of a few bytes of data requires multiple TCP/IP round trips to HDFS, and this creates very high latency for transactional operations.

However, the throughput for reading and writing larger chunks of data in a well-optimized Hadoop cluster is very fast. It’s good technology, when well-understood and appropriately applied.
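The latency-versus-throughput tradeoff above can be sketched with a simple cost model. The numbers below are illustrative assumptions, not HDFS benchmarks: each request pays a fixed coordination overhead, plus transfer time at some sustained rate.

```python
# Hypothetical cost model (illustrative numbers, not measured benchmarks):
# every read request pays a fixed round-trip latency, then transfers data
# at a sustained throughput rate.
LATENCY_S = 0.010        # assumed per-request overhead (coordination, TCP)
THROUGHPUT_BPS = 100e6   # assumed sustained transfer rate, bytes/second

def read_time(total_bytes, requests):
    """Total time to read total_bytes split across the given number of requests."""
    return requests * LATENCY_S + total_bytes / THROUGHPUT_BPS

one_gb = 1_000_000_000
bulk = read_time(one_gb, requests=8)                       # a few large sequential reads
record_at_a_time = read_time(one_gb, requests=1_000_000)   # ~1 KB per request

print(f"bulk: {bulk:.2f}s, record-at-a-time: {record_at_a_time:.0f}s")
```

Under these assumed numbers, the per-request overhead dominates record-at-a-time access by orders of magnitude, while a handful of large sequential reads run at nearly full transfer speed. That is the workload shape Hadoop was built for.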

Hives and Hive-nots
Hive allows developers to query data within Hadoop using a familiar Structured Query Language (SQL)-like language. A lot more people know SQL than can write Hadoop’s native MapReduce code, which makes Hive an attractive, cheaper alternative to hiring new talent or retraining developers in Java and MapReduce programming patterns.

There are, however, some very important tradeoffs to note before making any decision on Hive as your big data solution:

  • HiveQL (Hive’s dialect of SQL) allows you to query structured data only. If you need to work with both structured and unstructured data, Hive simply won’t work without certain preprocessing of the unstructured data.
  • Hive doesn’t have an Extract/Transform/Load (ETL) tool, per se. So while you may save money using Hadoop and Hive as your data warehouse, along with in-house developers sporting SQL skill sets, you might quickly burn through those savings maintaining custom ETL scripts and prepping data as requirements change.
  • Hive uses HDFS and Hadoop’s MapReduce computational approach under the covers. This means, for reasons already discussed, that end users accustomed to normal SQL response times from traditional RDBMSes are likely to be disappointed with Hive’s somewhat clunky batch approach to “querying”.

Real-time Hadoop? Not really.
At Datameer, we’ve written a bit about this in our blog, but let’s explore some of the technical factors that make Hadoop ill-suited to real-time applications.

Hadoop’s MapReduce computational approach employs a Map pre-processing step and a Reduce data aggregation/distillation step. While it is possible to apply the Map step to real-time streaming data, you can’t do so with the Reduce step. That’s because the Reduce step requires all input data for each unique data key to be mapped and collated first. While there is a hack for this process involving buffers, even the hack doesn’t operate in real time, and buffers can only hold a limited amount of data.

NoSQL products like Cassandra and HBase also use MapReduce for analytics workloads. So while those data stores can perform near real-time data look-ups, they are not tools for real-time analytics.

Three blind mice
While there are certainly other Big Data myths out there that need busting, Hadoop’s inability to act as an RDBMS replacement, Hive’s various shortcomings, and MapReduce’s unsuitability for real-time streaming applications are, in our observation, the biggest stumbling blocks.

In the end, realizing the promise of Big Data will require getting past the hype and understanding appropriate application of the technology. IT organizations must burst the Big Data bubble and focus their Hadoop efforts in areas where it provides true, differentiated value.


Stefan Groschupf

Stefan Groschupf is a big data veteran and serial entrepreneur with strong roots in the open source community. He was one of the very few early contributors to Nutch, the open source project that spun out Hadoop, which, 10 years later, is considered a 20 billion dollar business. Open source technologies designed and coded by Stefan can be found running in all 20 of the Fortune 20 companies, and innovative open source technologies like Kafka, Storm, Katta and Spark all rely on technology Stefan designed more than half a decade ago. In 2003, Groschupf was named one of the most innovative Germans under 30 by Stern Magazine. In 2013, Fast Company named Datameer one of the most innovative companies in the world. Stefan is currently CEO and Chairman of Datameer, the company he co-founded in 2009 after several years of architecting and implementing distributed big data analytic systems for companies like Apple, EMI Music, Hoffmann La Roche, AT&T, the European Union, and others. After two years in the market, Datameer was commercially deployed in more than 30 percent of the Fortune 20. Stefan is a frequent conference speaker, contributor to industry publications and books, holds patents, and advises a set of startups on product, scale and operations. When not working, Stefan is backpacking, sea kayaking, kite boarding or mountain biking. He lives in San Francisco, California.