Bursting the Big Data Bubble
Below is an article I originally wrote for ZDNet’s Big On Data blog, republished here with permission. Special thanks to Andrew Brust (@AndrewBrust)!
We’re in the middle of a Big Data and Hadoop hype cycle, and it’s time for the Big Data bubble to burst.
Yes, moving through a hype cycle enables a technology to cross the chasm from the early adopters to a broader audience. And, at the very least, it indicates a technology’s advancement beyond academic conversations and pilot projects. But the broader audience adopting the technology may just be following the herd, and missing some important cautionary points along the way. I’d like to point out a few of those here.
Riding the Bandwagon
Hype cycles often come with a “me too” crowd of vendors who hastily rush to implement a hyped technology, in an effort to stay relevant and not get lost in the shuffle. But offerings from such companies may confuse the market, as they sometimes end up implementing technologies in inappropriate use cases.
Projects using these products run the risk of failure, yielding virtually no ROI, even when customers pony up significant resources and effort. Customers may then begin to question the hyped technology. The Hadoop stack is beginning to find itself receiving such criticism right now.
Bursting the Big Data bubble starts with appreciating certain nuances about its products and patterns. Following are some important factors, broken into three focus areas, that you should understand before considering a Hadoop-related technology.
Hadoop is not an RDBMS killer
Hadoop runs on commodity hardware and storage, making it much cheaper than traditional Relational Database Management Systems (RDBMSes), but it is not a database replacement. Hadoop was built to take advantage of sequential data access, where data is written once then read many times, in large chunks, rather than single records. Because of this, Hadoop is optimized for analytical workloads, not the transaction processing work at which RDBMSes excel.
Low-latency reads and writes won’t, quite frankly, work on Hadoop’s Distributed File System (HDFS). Merely coordinating the write or read of a single byte of data requires multiple TCP/IP connections to HDFS, and this creates very high latency for transactional operations.
However, the throughput for reading and writing larger chunks of data in a well-optimized Hadoop cluster is very fast. It’s good technology, when well-understood and appropriately applied.
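To make the latency-versus-throughput point concrete, here is a back-of-the-envelope sketch in Python. It is a toy cost model, not a benchmark: the function name and the figures for setup latency, bandwidth and data size are all assumptions chosen for illustration, and real clusters will behave differently.

```python
def transfer_seconds(total_bytes, bytes_per_request, setup_latency_s, bandwidth_bytes_per_s):
    """Estimate the time to move total_bytes when every request pays a fixed setup cost."""
    requests = -(-total_bytes // bytes_per_request)  # ceiling division
    return requests * setup_latency_s + total_bytes / bandwidth_bytes_per_s


# Assumed, illustrative figures only (not measurements of any real cluster).
SETUP_LATENCY_S = 0.005            # 5 ms of connection/coordination overhead per request
BANDWIDTH = 100 * 1024 * 1024      # 100 MiB/s sustained transfer rate
TOTAL = 1024 * 1024 * 1024         # 1 GiB of data to move

# Transactional pattern: many tiny requests of 1 KiB each.
record_at_a_time = transfer_seconds(TOTAL, 1024, SETUP_LATENCY_S, BANDWIDTH)

# Analytical pattern: a few large sequential chunks of 64 MiB each.
large_chunks = transfer_seconds(TOTAL, 64 * 1024 * 1024, SETUP_LATENCY_S, BANDWIDTH)

print(f"1 KiB requests:  {record_at_a_time:.2f} s")
print(f"64 MiB requests: {large_chunks:.2f} s")
```

Under these assumed numbers, the record-at-a-time pattern comes out at roughly 5,253 seconds, against about 10 seconds for large chunks. The data volume is identical in both cases; the per-request overhead is what dominates once requests get small.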
Hives and Hive-nots
Hive allows developers to query data within Hadoop using a familiar Structured Query Language (SQL)-like language. A lot more people know SQL than can write Hadoop’s native MapReduce code, which makes Hive an attractive, cheaper alternative to hiring new talent or making developers learn Java and MapReduce programming patterns.
There are, however, some very important tradeoffs to note before making any decision on Hive as your big data solution:
- HiveQL (Hive’s dialect of SQL) allows you to query structured data only. If you need to work with both structured and unstructured data, Hive simply won’t work unless the unstructured data is preprocessed into a structured form first.
- Hive doesn’t have an Extract/Transform/Load (ETL) tool, per se. So while you may save money using Hadoop and Hive as your data warehouse, along with in-house developers sporting SQL skill sets, you might quickly burn through those savings maintaining custom ETL scripts and prepping data as requirements change.
- Hive uses HDFS and Hadoop’s MapReduce computational approach under the covers. This means, for reasons already discussed, that end users accustomed to normal SQL response times from traditional RDBMSes are likely to be disappointed with Hive’s somewhat clunky batch approach to “querying”.
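To see why a query running on MapReduce feels like a batch job, it may help to sketch how a simple SQL-style aggregation (something like `SELECT page, COUNT(*) ... GROUP BY page`) breaks down into map, shuffle and reduce phases. The following is plain Python, not Hive or Hadoop code; the function names and sample rows are hypothetical, and Hive’s real query planning is far more involved.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit a (key, 1) pair for every input row, one row at a time."""
    for row in rows:
        yield row["page"], 1


def shuffle_phase(pairs):
    """Collate all values by key; this must consume every pair before it can return."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each key's complete list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical sample data standing in for a table of page views.
rows = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]

counts = reduce_phase(shuffle_phase(map_phase(rows)))
print(counts)
```

Even in this toy version, no result is available until every row has been scanned and grouped. On a real cluster each phase also involves scheduling tasks and moving data between machines, which is where the wait comes from.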
Real-time Hadoop? Not really.
At Datameer, we’ve written a bit about this in our blog, but let’s explore some of the technical factors that make Hadoop ill-suited to real-time applications.
Hadoop’s MapReduce computational approach employs a Map pre-processing step and a Reduce data aggregation/distillation step. While it is possible to apply the Map step on real-time streaming data, you can’t do so with the Reduce step. That’s because the Reduce step requires all input data for each unique data key to be mapped and collated first. While there is a hack for this process involving buffers, even the hack doesn’t operate in real-time, and buffers can only hold smaller amounts of data.
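The difference between the two steps can be sketched in a few lines of plain Python. This is a simplified illustration, not Hadoop’s API: the event data, the function names and the fixed-size window standing in for the buffer hack are all assumptions made for the example.

```python
from collections import Counter


def map_record(record):
    """Map works on one record in isolation, so it can run as each event arrives."""
    return record["user"], record["amount"]


def reduce_pairs(pairs):
    """Reduce needs every value for a key before its total is final."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def windowed_totals(stream, window_size):
    """Buffer hack: reduce each fixed-size window on its own, yielding partial totals."""
    buffer = []
    for record in stream:
        buffer.append(map_record(record))
        if len(buffer) == window_size:
            yield reduce_pairs(buffer)
            buffer = []
    if buffer:  # flush whatever is left when the stream ends
        yield reduce_pairs(buffer)


# Hypothetical event stream.
events = [
    {"user": "ann", "amount": 5},
    {"user": "bob", "amount": 3},
    {"user": "ann", "amount": 2},
    {"user": "bob", "amount": 1},
    {"user": "ann", "amount": 4},
]

partials = list(windowed_totals(events, window_size=2))
final = reduce_pairs(map_record(e) for e in events)

print(partials)
print(final)
```

Each window produces only a partial answer, delayed until its buffer fills, and none of the partial results matches the true per-user total. That total only exists once the whole input has been seen, which is the batch behavior described above.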
NoSQL products like Cassandra and HBase also use MapReduce for analytics workloads. So while those data stores can perform near real-time data look-ups, they are not tools for real-time analytics.
Three blind mice
While there are certainly other Big Data myths out there that need busting, Hadoop’s inability to act as an RDBMS replacement, Hive’s various shortcomings and MapReduce’s poor fit for real-time streaming data applications present the biggest stumbling blocks, in our observation.
In the end, realizing the promise of Big Data will require getting past the hype and understanding appropriate application of the technology. IT organizations must burst the Big Data bubble and focus their Hadoop efforts in areas where it provides true, differentiated value.