Big Data & Brews: Cloudera on SQL
As I continued my conversation with Mike Olson, Cloudera’s co-founder and current chief strategy officer, he shared a really interesting perspective on SQL. Given that there is a lot of talk right now around this topic, you’ll definitely want to tune in!
Stefan: Welcome back to Big Data and Brews with Mike Olson from Cloudera. Cheers.
Mike: Cheers, man.
Stefan: Isn’t it great? I came up with this idea of doing Big Data and Brews, obviously German heritage. Drinking beer at work and creating some [inaudible 00:00:22] content at the same time.
Mike: That’s fantastic. I told my PR staff, I really have no choice. I have to drink at work. It’s a requirement.
Stefan: Yeah. I’m not sure what the legal impact is, but let’s ignore this for now.
We talked about SQL a little bit in the last segment, but one more thing I want to dive into is now there’s Hive. You guys support Hive.
Stefan: You have Impala as well, but my understanding is Impala has the same SQL dialect/engine as Hive?
Mike: No. It’s an entirely different ground up implementation. I’ll go into detail if you want.
Stefan: The big question here is really, everybody has their own little SQL story now. I think that was one of the really challenging things in the database world, where you obviously have tremendous experience.
How do you see this? Is it coming together? Is that a problem for customers?
Mike: I think the brutally frank answer is no, it’s not going to come together. The deal is, there is so much money in the SQL query processing market, and there are such large, entrenched players that there’s huge interest in exclusive control over the different engines in the market, which means nobody wants to collaborate, really, with anybody else because they’d like to have their differentiated offering.
The traditional vendors, think IBM and Oracle and Teradata and others, have decades invested in a really high performance query processing analysis and they’d love to move that to a big data platform, right?
Our point of view is that in the long term, any native SQL engine running on the Hadoop infrastructure, what we call an enterprise data hub, one place to land all your data, any native SQL implementation has to be competitive on performance with those engines, right?
If you think about where SQL on Hadoop came from, so at Facebook, they needed to type SQL queries against their ...
Stefan: Yeah, they had a whole bunch of SQL developers, right?
Mike: Yeah. At the time, by the way, Hadoop was MapReduce plus the storage layer. That’s all.
They did what you would do. They wrote some code that sat on top of MapReduce that took queries and turned them into MapReduce jobs. Batch mode, high latency, complex processing. It’s not that it worked well, but hey man, an elephant that can dance at all is a pretty remarkable thing.
Hive was born on MapReduce and was designed for the MapReduce kind of high latency execution framework. It was designed to basically submit batch jobs to MapReduce. That’s where we were.
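Editor's note: as a rough illustration of the approach Mike describes here, the following is a minimal Python sketch of how a simple GROUP BY query could be expressed as map, shuffle, and reduce steps. It is not Hive's actual planner or API. The sample rows, column names, and function names are all hypothetical, and everything runs in a single process rather than as a distributed batch job.

```python
from collections import defaultdict

# Hypothetical sample table; rows and column names are made up for illustration.
ROWS = [
    {"user": "a", "country": "US"},
    {"user": "b", "country": "DE"},
    {"user": "c", "country": "US"},
]


def map_phase(rows, key_column):
    """Emit one (key, 1) pair per input row, as a mapper would."""
    for row in rows:
        yield row[key_column], 1


def shuffle_phase(pairs):
    """Group emitted values by key, standing in for the framework's shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


def count_group_by(rows, key_column):
    """Toy equivalent of: SELECT key_column, COUNT(*) FROM rows GROUP BY key_column."""
    return reduce_phase(shuffle_phase(map_phase(rows, key_column)))


if __name__ == "__main__":
    print(count_group_by(ROWS, "country"))
```

In a real cluster each phase would run across many machines and write intermediate results to disk, which is where the batch-mode latency Mike mentions comes from.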
If you look at what Google runs internally for their infrastructure, and they invented MapReduce and the storage layer, right? They built a distributed query processing engine, initially called F1, which evolved into something called Spanner, but it is a native distributed query processing engine built from the ground up to run in parallel on the same hardware infrastructure but not to rely on MapReduce.
Special purpose engine, the thinking went, is going to out-perform any general purpose engine with a thin layer on top. And they’ve demonstrated that was true.
That was our intent with Impala. Impala v Hive, that was a very easy decision for us. Of course, the Hive ecosystem is evolving now as well.
Stefan: Yeah, but isn’t what really holds us back here the storage layer? I mean, it’s a sequential file system. If you do, if you really want to do a random access to it, I mean, you have to do a whole bunch of network hops to do seeks.
Isn’t that what holds us back architecturally at this point to make any meaningful SQL on top of that? And if we replace that, don’t we have another Greenplum, Vertica, Aster Data or Teradata here?
What’s the difference between a native SQL and Hadoop?
Mike: I lost my chalk. Hang on. I’ll just grab another one out. For the record, for the people watching on television, note Hive sits on top of MapReduce, at least the original generation, and Impala is a native engine.
You asked a good question, right? Transactional storage is sort of fundamental in relational databases, or at least the relational databases that we think about when we think about Oracle and SQL Server.
Hive isn’t used for update intensive workloads. It’s used, kind of, for batch mode data transformation and some large query reporting.
Our vision for Impala is likewise not an OLTP engine. It’s an analytic database, so I want to mostly read, and when I read I want to run sophisticated analyses. I want to create window functions and look at the data.
The key missing feature from the storage layer, frankly, is transactions. Doug Cutting, one of the Hadoop project founders and our chief architect, has gone on record as saying he believes the storage layer will offer transactions at some point in the future.
I tell you what, I’m an old-guard database guy. That is hard. That is just blindingly hard.
I think that what happens in the near term is for analytic workloads, very good interactive SQL is going to happen. I think that Hive, which has dramatically improved performance lately because it has swapped out the MapReduce engine for a second generation …
Mike: Yeah, Tez, and then we’re going to be working on a port to Spark. It’s not burdened by all the latencies of MapReduce but it’s still on top of a quasi-batch mode engine, and …
Stefan: The whole storage has kind of been optimized for batch from the beginning, right? That’s the challenge, though.
Mike: Large sequential transfers, right, append-mostly workloads.
Stefan: That’s why I really question how far we can get with this, because, if you oversimplify, the way HDFS works is like a tape drive.
Mike: That’s actually not that over-simplified, really.
Stefan: So if we want to have the skip function of a CD, we can’t really build this on top of a tape. And then again, if we take all of this apart, we already had Greenplum, and it was kind of a decent analytical database at this point, right?
Mike: Yeah, I’ll say a couple of things. So first of all, there are things-
Stefan: Where are the advantages, then? Where is the whole schema on read, the beauty of all the unstructured data?
Mike: Again, I don’t claim that SQL should be the only way, but I do think it’s an important way. And schema is a valuable addition to data you manage here. You don’t want to require it, but if you can exploit it, that’s great.
This storage layer is getting better quickly. A year and a half or more ago, a couple years ago, we began working on adding performance enhancements. Most recently, we have included read caching, in-memory read caches for HDFS, and that was largely a Cloudera-driven project. There’s work going on by some engineers at Hortonworks right now to add write to in-memory, basically, so write caching as well.
The addition of good memory management to the storage layer really helps these guys. It doesn’t solve the transaction problem, and that’s the problem I think is hard, so when will this overtake traditional OLTP? When will Oracle or SQL Server have to worry? You know what? I think those days are pretty far away. Just me, but I do think we can do really important analytic and processing and transformational workloads given the tools we’ve got today.