Big Data & Brews: Cloudera on Spark, Tez & the Enterprise Data Hub
In our last installment of Big Data & Brews with Mike Olson, he shares his thoughts around Spark, Tez, and Cloudera’s vision for a data hub.
I wanted to call out Mike’s comment that we’re now seeing the emergence of what he thinks of as the next generation of analytic and exploratory applications aimed at business users, delivering business value that hides the technology underneath. I couldn’t agree more: big data shouldn’t be complicated and inaccessible.
Tune in below to hear more.
Stefan: Let’s talk a little bit about Spark. There’s a lot of hype around Spark and there’s a lot of challenges with Spark. What’s the good and ugly?
Mike: The good is that a group of researchers at UC Berkeley took the lessons that we collectively had learned from a decade or so of use of MapReduce and used those lessons to design what you might call a second system. If you were starting from a clean sheet of paper today on this scale or architecture, what would you build? Would it be exactly what Jeff Dean and Sanjay Ghemawat conceived as MapReduce in 2002 or so? No, probably not.
Spark is a general purpose engine for executing directed acyclic graphs of processing. MapReduce has a problem: you map, you shuffle, you reduce. If you’re going to do complicated stuff, you’ve got to string a bunch of those together, and that middle step, shuffle?
Stefan: The shuffle, that’s why it’s batch.
Mike: That’s exactly, just stand on the parking brake for a while.
Spark uses memory wisely so they say it’ll run very quickly, but it also doesn’t force you to put that synchronization step in between every useful operation, so it runs much better. What’s another good thing is it began at Berkeley but now it’s got a huge ecosystem of contributors and developers, like 500 people are writing and committing code into the project. It’s very widely embraced by much of the industry.
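Editor's note: the contrast Mike draws between chained MapReduce jobs and a DAG engine can be illustrated with a small, self-contained Python sketch. It is a toy model only, not Spark or Hadoop code. The `ToyEngine` class and every function name here are hypothetical, invented for this illustration. The task is to count words and keep only those that appear more than once. Under a rigid map-shuffle-reduce model the filter becomes a second job with its own shuffle, while a DAG-style plan can pipeline the filter directly onto the output of the first reduce.

```python
from collections import defaultdict


class ToyEngine:
    """Hypothetical in-process engine that counts shuffle barriers."""

    def __init__(self):
        self.shuffles = 0

    def shuffle(self, pairs):
        # Group every (key, value) pair by key. Nothing downstream can start
        # until all pairs have arrived, which is the synchronization barrier.
        self.shuffles += 1
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def mapreduce_job(self, records, mapper, reducer):
        # One rigid map -> shuffle -> reduce job with materialized output.
        mapped = [pair for record in records for pair in mapper(record)]
        return [reducer(k, vs) for k, vs in self.shuffle(mapped).items()]


def word_count_mapper(line):
    return [(word, 1) for word in line.split()]


def sum_reducer(word, ones):
    return (word, sum(ones))


def repeated_only_mapper(pair):
    word, count = pair
    return [(word, count)] if count > 1 else []


def identity_reducer(word, counts):
    return (word, counts[0])


def chained_mapreduce(engine, lines):
    # Job 1 counts words; job 2 exists only to apply a filter, yet it still
    # pays for a full shuffle because every job has the same fixed shape.
    counts = engine.mapreduce_job(lines, word_count_mapper, sum_reducer)
    return dict(engine.mapreduce_job(counts, repeated_only_mapper, identity_reducer))


def dag_style(engine, lines):
    # The same logic as a small graph of operations: one shuffle for the
    # aggregation, with the filter pipelined onto its output.
    pairs = (pair for line in lines for pair in word_count_mapper(line))
    counts = ((w, sum(ones)) for w, ones in engine.shuffle(pairs).items())
    return {w: c for w, c in counts if c > 1}


if __name__ == "__main__":
    sample = ["spark runs dags", "mapreduce runs jobs", "spark uses memory"]
    for plan in (chained_mapreduce, dag_style):
        engine = ToyEngine()
        print(plan.__name__, plan(engine, sample), "shuffles:", engine.shuffles)
```

Both plans return the same answer. The only difference the sketch measures is how many shuffle barriers each plan executes, which is the "parking brake" cost described above.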
I believe, we believe at Cloudera, that it’s the likeliest successor to MapReduce. That doesn’t say that MapReduce, the original Google concept, ever leaves the platform. That engine will be part of CDH forever, our distro. But we see more new workloads launching on Spark than on MapReduce these days. Easier to program. Addresses a bunch of those latency and other issues.
I think it’s got a great, great future. Databricks, the company, got funded by Andreessen Horowitz and others – they’re driving innovation on the platform. We’ve got people contributing. The ecosystem looks pretty healthy there.
We also think that it’s the right substrate for Hive. There’s no question that MapReduce has been a challenge. People have huge latency issues. Making Hive run better is an important goal because so many workloads depend on Hive for the transformation and processing and exploratory workloads.
We think by swapping MapReduce out and Spark in we get two benefits. One is Spark, with its huge ecosystem, is going to continue to innovate and get faster and better in lots of ways. Second, we dramatically reduce the latency in queries running here. Still, by the way, we expect Hive running on Spark to be much slower than Impala running natively on that data, but they’re used for different workloads already.
Stefan: So you continue to invest in Impala.
Mike: Oh yeah, no question. Our attitude is that analytic workloads run on Impala; the transformation and batch workloads are going to run on just a much better Hive with a faster substrate than MapReduce.
Stefan: Where is Tez coming in in your strategy?
Mike: Tez is a good engine. It’s solid work. Right now, really the only company contributing to the Tez project is Hortonworks. Right now, we don’t have a Tez strategy. That’s not to say we wouldn’t evolve one later, and that’s not in any way to impugn the work, but a couple of observations.
One is, it’s still a very new project, and you know and I know that software becomes reliable by being deployed and running in production at lots and lots of places. Early projects are always unreliable.
Stefan: It’s very dangerous, yeah.
Mike: I think Tez is likely to mature. Certainly Hortonworks has made a big bet on it, but early deployments are going to be challenging.
The fact that Spark has such a large community, the fact that it’s been in production deployment for a couple of years already, we think make it a safer substrate here. We’ll continue to evaluate how and when we would embrace Tez.
Right now, the strategy is, and we actually announced this a week or two back. We’re working with IBM, Databricks, Intel, MapR and us, Cloudera, to take this suite of tools that run on MapReduce, Pig, Hive, Sqoop and others, and port them to run on Spark. We think that’s going to be a better bet long term. Just the natural successor engine to MapReduce and less complexity to the platform. Less proliferation of different alternative SQL and so on.
Stefan: What’s with the challenges that Tez has around data guarantees? You kill one server and you’re not sure how much data you’ve processed. That was the whole reason we had the traffic stage, right? Because if a mapper died, you could re-execute the whole thing. So Tez seemed to be a little bit challenged there. As we go to financial service companies, they’re really, really concerned about, you know, calculating certain risk scores twice by mistake because of other issues.
Mike: Look, I’ve said it before, building database systems is a very difficult thing. I’ve got huge respect for the existing products that are in that market. Because we’re not including Tez in CDH at present, and because frankly I don’t know nearly enough about the limitations, I don’t want to pick on it.
What I will say is that this is young technology aimed at seasoned workloads. People have been running those kinds of analytics and those kinds of processes on databases for a very long time. It’s a risky thing to do. You’re exactly right. You want to know the semantics of the underlying engine, its predictability and so on.
I think those are the hurdles that Hive on Tez, that new combination on that new engine, needs to clear. Probably it will, probably it’s going to take some time.
Stefan: Let’s switch gears a little bit. You guys are driving the concept of a data hub very hard. What is that? What’s the difference from what we did before, and what other use cases do you see in the different industry areas?
Mike: The first thing that I would say is that data volumes are exploding and data is becoming not people generated but machine generated: the phone you carry in your pocket, the sensors in your building and so on. It’s getting produced at a rate that’s just totally out of whack with whatever happened before. Traditional centralized systems can’t scale up to store it. That’s why the Google architecture, the scale out system, is so attractive.
The existing systems our customers run … We walk into a data center and they’re running Teradata for mission critical applications, like if that application gives a wrong answer the CFO goes to jail. Those workloads are unlikely to migrate to any new platform for a very long time if ever. That said, data volumes are exploding. We’ve got these new capabilities, and I’ll just add Spark because we’ve been talking about it, coming into this platform. We believe that this collection, what the Hadoop ecosystem has produced, is a natural place for vast new data sets and for a bunch of new workloads to migrate.
By the way, I should say this as well. There’s a resource management layer in here called YARN which is very important. If you’re going to be using lots of ways to get at data, you want to be sure that you can control resource consumption, memory, CPU and all the rest. Without this layer, this whole thing doesn’t work.
Stefan: Let me pause you there for just one second. One push back we heard from a few customers is Impala isn’t sitting on YARN at this point. Is that …?
Mike: Fixed in the next release. We’ve done it. We began our work on Impala several years ago, and when we did, YARN was very far from ready for prime time, but you need everybody to pay attention to resource management here.
With this architecture, you’ve got lots and lots of data stored in a convenient place, you’ve got different ways to get at it, and you’re able to manage all those workloads in a clean way. You’ve got security and data governance and data lineage and a bunch of enterprise cloud services, so this becomes a natural place to land data. Maybe even to take some workloads that traditionally ran on a big data warehouse, ETL is a good example, and move them to this scale out, cheaper and also, by the way, faster infrastructure: tens, hundreds, thousands of computers are going to be faster.
This is a natural, central place to land data. We call it an enterprise data hub, because those spokes have to connect to other systems: your existing data warehouse, your document management system, or your users, the folks that you serve with Datameer. The centralization of the data is good and important. It allows people to capture data that they couldn’t have afforded to capture before, and to explore it in ways that they never could before because they’ve got these new algorithms and tools, but it has to integrate into the rest of your infrastructure. That’s why it’s a hub. It must connect to these other systems.
Stefan: Of course, besides Datameer, what are you seeing living on top of the data hub? What are the access mechanisms? Just typing in SQL queries is maybe not the future.
Mike: When we first started out, the only thing there was MapReduce, and I talked to people in the database industry that said, “You have to sell to Java programmers, you’re doomed.”
What they didn’t recognize was that you would see these other engines come into the market that could raise or simplify the interfaces that people dealt with.
If you look at the evolution of the relational database market, relational databases were brand new in the ’80s when I was working on that technology. There were no DBAs. SQL had not even won yet. There were still IDL and QUEL. There were no applications that ran on this platform. That’s where we were when Cloudera started. We had this great new architecture with a totally unskilled user base and no apps. That’s gotten much better. You guys were very early in the market making data stored in this infrastructure available, but it was merely that you were visionary and early compared to the rest of the market.
We’re seeing more tools come out now that run on this infrastructure.
Stefan: Mostly connected to an SQL engine then, yeah?
Mike: There are lots of SQL applications but, for example, we’ve got a couple of ISV partners, independent software vendor partners, that build full stack applications. So you can go right now to NICE Systems and you can buy an application that’s basically a next best action recommender for retailers. It runs on all this infrastructure and the user doesn’t even know that there’s HDFS and MapReduce and all that stuff. They’re just running that app.
Stefan: And Datameer is behind that.
Mike: And Datameer is in there as well.
We’re seeing now the emergence of what I think of as the next generation of analytic and exploratory applications aimed at business users, so built on top of this level, but delivering business value that hides the technology underneath.
That’s what happened with relational databases, right? People started paying attention to Oracle Financials and to Oracle database.
Stefan: That will be the future. Delivering value on top. Thank you so much for joining. Thanks for the great refreshing beer. It’s such a great summer day in San Francisco. I hope you come back.
Mike: I will.
Mike: Good to see you, man.