
Datameer Blog

Incumbents are Confused… and BTW, Hadoop is not Real-time Data Analytics

Posted on May 25, 2012

On Tuesday, my colleague Stefan and I gave a presentation at the Open Source Business Conference in San Francisco, where we discussed how our customers are using Datameer and Hadoop for big data analytics. We also got a chance to listen to a few other presentations and talk to other folks in the industry, and one thing quickly became clear: the incumbent BI vendors are creating a lot of confusion around Hadoop. The most surprising part of the day for me was a presentation by a major incumbent claiming that their Hadoop solution is real-time.

There are two types of data management and business intelligence incumbents in the market today:

1) Those that claim they’re “Big Data” companies now because they’ve connected to Hive on Hadoop.
2) Those that criticize latency in Hadoop and claim they offer real-time analytics, but use Hadoop for preprocessing and then copy the data from Hadoop into their own database or system.

Since we’ve already talked a bit about the limitations that Hive imposes on Hadoop in a few other posts (which you can read here and here), let’s focus on the second type of BI vendor – those that claim Hadoop’s downfall is its latency and lack of real-time processing.

We want to be clear on this: Hadoop is not, and never will be, real-time.

Hadoop has latency; it is batch processing. The traditional alternative, ETL (Extract, Transform, Load), is also batch processing, with latency that is far worse when you need to add a new data source to your static schema. If it is even possible, it will take weeks or months for IT to redo the schema. Also, let’s not forget that loading data into an RDBMS, especially a lot of data, is never real-time, since the database has to build its B-tree data structures in the backend.
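To make the static-schema point concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a warehouse database. The table name, column names and sample records are hypothetical, chosen only for illustration; this is not how any particular vendor's pipeline works, and it does not measure latency. It only shows that a record carrying a new field is rejected until someone changes the schema.

```python
import sqlite3


def make_warehouse():
    """Create an in-memory table with a fixed schema and an index (hypothetical names)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
    # SQLite stores indexes as B-trees and updates them on every insert.
    conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
    return conn


def load_rows(conn, rows):
    """Insert dict records into 'events'; return how many the schema rejected."""
    rejected = 0
    for row in rows:
        # Column names are interpolated only because this is a toy example
        # with trusted input; the values themselves are bound as parameters.
        cols = ", ".join(row)
        marks = ", ".join("?" for _ in row)
        try:
            conn.execute(
                f"INSERT INTO events ({cols}) VALUES ({marks})",
                tuple(row.values()),
            )
        except sqlite3.OperationalError:
            # Raised when the record has a column the table does not know.
            rejected += 1
    return rejected


# A new data source that adds a 'device' field the schema never anticipated.
new_source = [{"user_id": 2, "action": "click", "device": "mobile"}]

conn = make_warehouse()
rejected_before = load_rows(conn, new_source)

# The schema change that, in a real warehouse, goes through IT.
conn.execute("ALTER TABLE events ADD COLUMN device TEXT")
rejected_after = load_rows(conn, new_source)

print(rejected_before, rejected_after)
```

In this toy case the fix is a single ALTER TABLE statement; the weeks or months described above come from coordinating that change across ETL jobs, the warehouse and the reports that depend on it.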

Decision makers today need changes made to their data pipeline in hours, not weeks or months. Data has become more complex, and traditional ETL and data warehousing tools simply don’t fit today’s data supply chains. We buy data integration from one vendor, a data warehouse from another and business intelligence tools from a third. We now have three groups of people involved: ETL engineers, DBAs, and data or business analysts. We have three vendors and three phone numbers, and when something goes wrong, they all blame each other.

There is a better way.




Rich Taylor
