On Tuesday, my colleague Stefan and I gave a presentation at the Open Source Business Conference in San Francisco where we discussed how our customers are using Datameer and Hadoop for big data analytics. We also got a chance to listen to a few other presentations and talk to other folks in the industry, and one thing quickly became clear: the incumbent BI vendors are creating a lot of confusion around Hadoop. The most surprising part of the day for me was a presentation by a major incumbent claiming that their Hadoop solution is real-time.
There are two types of data management and business intelligence incumbents in the market today:
1) Those that claim they’re “Big Data” companies now because they’ve connected to Hive on Hadoop.
2) Those that criticize latency in Hadoop and claim they offer real-time analytics, but use Hadoop for preprocessing and then copy the data from Hadoop into their own database or system.
Since we’ve already talked a bit about the limitations that Hive imposes on Hadoop in a few other posts (which you can read here and here), let’s focus on the second type of BI vendor – those that claim Hadoop’s downfall is its latency and lack of real-time processing.
We want to be clear on this: Hadoop is not, and never will be, real-time.
Hadoop has latency because it is batch processing. The traditional alternative, ETL (Extract, Transform, Load), is also batch processing, with latency that is far worse when you need to add a new data source to your static schema. If it is even possible, it will take weeks or months for IT to redo the schema. Also, let’s not forget that loading data into an RDBMS, especially a lot of data, is never real-time, since the database has to build a B-tree data structure in the back end.
Decision makers today need changes made to their data pipeline in hours, not weeks or months. Data has become more complex, and traditional ETL and data warehousing tools simply don’t fit today’s data supply chains. We buy data integration from one vendor, a data warehouse from another and business intelligence tools from a third. We now have three groups of people involved: ETL engineers, DBAs, and data or business analysts. We have three vendors and three phone numbers, and when something goes wrong, they all blame each other.
There is a better way.