Given it’s opening day at O’Reilly’s Strata Conference, today seems an appropriate day to share this second part of my discussion with Eric Baldeschwieler, who shares the rest of his story of how Hadoop came to be within Yahoo.
Stefan: I really have a tough job. Sitting here, drinking…
Eric: It’s good to bring your passions together.
Stefan: Yeah. Good. Absolutely.
Welcome back to Big Data & Brews, with Eric14.
Eric: Hello again.
Stefan: As promised, we had a few more beers and we shared a few more laughs here … well, a few more sips. We stopped a little bit where the history of Hadoop got really interesting. You said your team worked on your own system, and then you kind of got convinced or talked into adopting that very, very early version of Hadoop that ran under the Apache license. Let’s double click a little bit on this. I’m certainly curious about all the conversations and discussions that you guys had. Internally it was all C++ based, right?
Stefan: You guys were all hardcore, low-level, deep down there. Now there’s those guys coming along, saying, “Hey, why don’t we …” or maybe those hippies, the Steve Altman hippies … “Hey, why don’t we do Java? Everything is good, we have garbage collection.” That’s one of the never-ending stories in conversations and fights in the Hadoop land. It’s like, “Why aren’t you doing C++?” What’s maybe your experience with that as well?
Eric: Yeah, sure. Everybody on the original team at Yahoo was a C++ coder, so it wasn’t a decision that we took without some consideration.
Stefan: (laughs) They had to learn a new programming language, basically?
Eric: They did. The day I came into the staff meeting and said, “Guys, Rami and I have been talking about it, and he’s convinced me. We’re going to go with Doug Cutting’s project. We’re going to take all of our learnings and bring it into what was Nutch. [00:02:00] We’re going to create a new project, he’s going to spin it out, and we’re going to just commit to that.”
Stefan: Did they throw things?
Eric: Very long faces. I would say it took them about six months before people started to see that this had been a good decision. For about six months, I was the least popular person on the floor.
But yeah, so why Java? First off, it was a bit coincidental. The thing that really mattered to us in the short term was that we were adopting an existing project. That just completely changed the dynamic of the internal conversation about whether our company could contribute to open source. Because if you’ve ever tried to convince a company to make a major investment in an open-source project, it’s hard.
I’ve watched a lot of companies that have made the decision that they’re going to compete in the Hadoop ecosystem … I guess we should have stopped at five beers. A number of companies that have decided they’re going to compete in the Hadoop ecosystem … not just use it, but actually have Hadoop-based products … haven’t yet figured out how to contribute to an open-source project. So for a company whose business was something else entirely, it was a long, drawn-out process.
Stefan: I think I remember Doug Cutting had to actually commit all the patches you guys did for the first, what, year or something?
Eric: It took us 18 months to really get to a fully normalized situation. For the first 18 months, Doug was doing basically all the commits. Then we got Owen as the second committer, and they shared it. That was a full-time job for them during that period. So yeah, that was pretty nutty. But in the end, we got there.
I didn’t appreciate it at the time, but it was a really revolutionary decision that Yahoo did that, so I give those guys a lot of [00:04:00] kudos for backing us. The open source decision was really hard, so adopting an external project meant that it’s a much easier decision to say we’re going to improve something that exists, versus to say we’re going to take our own artifact and move it into open source. It’s just much harder for a company to do that. As our initial project, it made it much easier. I think we’d still be debating with the legal team-
Eric: … if we’d built Juggernaut to completion and then tried to open-source it. We wouldn’t have succeeded yet. But that’s legal.
In terms of Java, part of it is, given that what exists was in Java, we kind of inherited that. In retrospect, did it make sense? I think it made sense for a number of reasons. The first is just that Java is so much more productive. There’s so much better tooling, in terms of debuggers. You don’t have the garbage collection issues. You just have far fewer bugs to begin with.
Then you’ve got all the free tooling to do all kinds of code analysis, all kinds of things. Yes, similar tools exist for C, but they’re just … All the new academic work happens in Java because it’s so much easier to do academic work in. It’s just a much richer set of tools. It’s a much easier language in which to write correct code, which meant that the first version of Hadoop was correct much sooner.
That’s not to be ignored, because in this game, agility matters. Getting it right, and getting it working, and getting it out into people’s hands is a huge piece of what’s needed to make open source succeed. The fact that Hadoop was working, and succeeding, and visibly improving mattered a heck of a lot more than its ultimate performance in its first … even today, but certainly in its first few years of life.
I think Java [00:06:00] really bootstrapped that. It gave us a lot of leverage in terms of the amount of correct working code we could get in people’s hands quickly. The fact that it was slower than it might have been if it was handcrafted C++ code didn’t matter as much as the fact that it was there and it worked today. That factor can’t be ignored, because in an open source project, when you always have new people coming in, the learning curve and the code-correctness curve issue never goes away.
If you’re going to have a project with hundreds of contributors that’s doing complicated stuff, Java’s a real advantage. Just code transparency and all those analytics tools and everything else, we really leveraged all that.
Then you get to the argument about whether … that doesn’t matter. Ultimately, C++ could be better. There’s people that’ll stand up and say, “That’s just not right. Java can produce as good code as C++.” There’s a case. We’ve been doing a bunch of work on Stinger, Hive, where I’ve looked at one part. The Impala crowd is doing LLVM, really tight C++ code, and the other crowd is doing vectorized code and letting the JIT do its thing. Really …
Stefan: Same thing.
Eric: It’s the same thing. The performance difference isn’t that noticeable. Java has one real Achilles heel, which is that it’s much harder to do dense memory management. HBase, for example, or the NameNode, is a place where the amount of data you can put in memory really matters. Then you start having to invest a lot more to use memory effectively in Java. That kind of obviates a lot of the advantages of Java. But that’s kind of third order. First, you need to get there. [00:08:00] Then, at some point it becomes as complex as C++, but it’s still not worse than C++.
I think in the context of trying to build a big open-source ecosystem, I think Java has been a huge advantage for Hadoop, and all the people that have leapt up and … Every year, there’s been a challenger to Hadoop, and every year, these things have not proven relevant. I’m pretty bullish. I actually think a lot of folks, they have reasons why … Over time, will pieces of the Hadoop ecosystem be recoded in C? Maybe, sure. There’s things in JNI in the main Hadoop code line now, and there will probably be more in the future.
Stefan: Compression, for example, right?
Eric: Yeah, compression, that’s an easy one. System calls in Java are terrible.
Stefan: Yeah. I remember the first version of Hadoop where the disk-free check was hard-coded as a df call. That was one of the main reasons why Hadoop worked awesome on Linux machines. As soon as you run it on Windows, if you call df, it’s of course a completely different command. The way it was implemented, it called df and then it parsed the results. If you actually go on a different Unix version where df had a different syntax, you’re screwed.
The first time I saw that implementation, I’m like, “Wow. You’re really just executing a command and parsing the text that’s coming back to make a decision like how much space can I use of this hard drive?” Obviously, things got better over time.
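As an illustration of the pattern Stefan describes, here is a minimal Java sketch, not Hadoop's actual code. The class `DiskSpaceSketch`, its method names, and the sample `df -k` text are all hypothetical, and the column layout is an assumption about one typical Linux `df`. It contrasts parsing command output with asking the JVM directly through the standard `java.io.File.getUsableSpace()` call.

```java
import java.io.File;

final class DiskSpaceSketch {

    // Hypothetical helper: pull the "Available" column (in KB) out of text
    // shaped like the output of `df -k <path>` on a typical Linux machine.
    // The column position is an assumption, which is exactly the fragility:
    // a different df syntax silently breaks this.
    static long parseAvailableKb(String dfOutput) {
        String[] lines = dfOutput.trim().split("\\R");
        if (lines.length < 2) {
            throw new IllegalArgumentException("unexpected df output: " + dfOutput);
        }
        // The last line is taken to be the data row; split on whitespace runs.
        String[] cols = lines[lines.length - 1].trim().split("\\s+");
        if (cols.length < 5) {
            throw new IllegalArgumentException("unexpected df columns: " + lines[lines.length - 1]);
        }
        // Counting from the right: Available, Use%, Mounted on.
        return Long.parseLong(cols[cols.length - 3]);
    }

    // Portable alternative: ask the JVM, no external command involved.
    static long usableBytes(File dir) {
        return dir.getUsableSpace();
    }

    public static void main(String[] args) {
        // Made-up sample text, not captured from a real machine.
        String sample =
            "Filesystem     1K-blocks     Used Available Use% Mounted on\n"
          + "/dev/sda1       41152736 12345678  26700000  32% /\n";
        System.out.println("parsed from df text: " + parseAvailableKb(sample) + " KB");
        System.out.println("from the File API:   " + usableBytes(new File(".")) + " bytes");
    }
}
```

The parser works only as long as the text keeps the assumed shape (a mount point containing a space would already break it), while the `File` call behaves the same on Linux and Windows.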
Eric: Yeah, and I think Hadoop is getting to the point where in the future, it’s going to start driving its requirements into Java. That’ll be fun to see how those things co-evolve.
No, I mean, there are limits to Java. I wouldn’t argue that it’s the right thing [00:10:00] for everything. But I would argue that agility really matters in this game because, yes, you can statically code what’s in Hadoop today better in C++, but what you want to be doing is not recoding the inner loop for a 2X performance gain. What you want to be doing is taking all your learning from the last n years and building a new algorithm, new data structure, new approach that’s qualitatively better. Java’s a better language to do that next version in.
I think Java’s going to continue to be the major language of Hadoop, or at least the JVM will be the major platform for Hadoop for the immediate future.
Stefan: As you started building Hadoop then with your team, how was the adoption path within Yahoo, and what kind of services and projects moved on there? Was it people that were more right onto it, and some people that came later in the game? I hear there was like a camp that really liked a database that starts with O, and then there was you guys, and …
Eric: No, not really an Oracle camp. There’s … I mean, how many 20-minute segments do we have to talk about this?
Stefan: (laughs) Well, we’ll make more. That’s for sure already.
Eric: Adoption was interesting. A company like Yahoo, there’s so many different organizations and teams doing so many different things. The first year or two of Hadoop, when we were just getting off the ground, we kept discovering other projects that asserted that they were also solving similar problems. They wanted to figure out why we shouldn’t abandon Hadoop and adopt their C++ version.
Stefan: (laughs) Of course.
Eric: That they were using to manage data processing for their ad log pipeline, or something like that. So for the first year [00:12:00], it was really just focusing down and saying, “Look, none of these other things that we’re discovering internally solve the search problem. Our goal is to write something that can work at internet scale. There doesn’t exist such a thing. That’s why we’re building this.” We just had to sort of stay focused.
The next thing that happened was those guys that I told you had been coming into my office and saying, “I want to do more research. You’ve got the crawl data. You’ve got the search logs. If I could get that, I could do better science and I could make more money for Yahoo.”
We kind of got to the point where we could make Hadoop work when we started … when Doug took it out of Nutch and put it into the Hadoop project … to working on about 20 nodes. We got to the point where it was working on about 100 nodes. Then we realized, to get it to the point where it could work on 1,000 nodes was going to take us another 18 months or so.
With all these other parties saying that they had competing systems … It’s hard on a company to not have a product for two years. So we said, “Let’s put out a Hadoop cluster for the science teams. That’ll be a great proof of concept. It’ll show the rest of Yahoo that we’re building something valuable. That’ll keep our managers happy, guarantee funding. Good thing.” (laughing) We weren’t thinking of it as our mission. We were thinking of it as a way to stop Arcotti from coming into my office every month and demanding …
Stefan: Demanding results.
Eric: Yeah, and give the science team something to work with. But mainly, it was just to show that we were producing value, because our goal was to rebuild the web crawl infrastructure with it. But that was, like I said, just a long ways out.
So we put it out in the science teams’ hands. What we expected was they’d get some interesting, basically, research results. Maybe they would get data results, where they would [00:14:00] work on the data, and come up with a new spelling correction dictionary, which would then get put into a production system.
But what happened was, they did all that. They were very excited, because basically we blew their minds. The productivity went up orders of magnitude because instead of spending their day trying to find data around Yahoo and then figure out what subset they could get on whatever storage they had … Half their time was spent doing IT work, basically, finding and moving data and not doing research. When they did the research, they would always have to do it on a tiny subset of the data. Now they could just have all the data, move it once to one place, share it, and do their work. They were just much more productive.
Because they had so much more compute resource, they could write their code in Java and be much more productive. There was this explosion of research results. More importantly, they started to build prototypes of production systems. They said, “Look, we want to take the ad logs and process them every 15 minutes to come up with a model of what people are interested in, and put that back into production every 15 minutes. If we do that, Yahoo will make more money.”
That changed the game. All of a sudden, this created this completely unanticipated virtuous cycle where teams were demanding that we build and support Hadoop clusters for them, so that they could build production applications, not in search at all. It just started to grow. I found myself running a Hadoop service with ultimately thousands of customers inside Yahoo and 40,000 nodes.
But yeah, early days, it was not … People had to think differently to use it. There’s this guy Larry Heck who ran the search and advertising science team, who saw what it could do. Originally, there was this guy Arcotti and a couple of other scientists who used it and got good results. Larry saw this and said [00:16:00], “I’m going to make everyone on the science team use Hadoop.” We started to maintain metrics, of just how many people on his team, what percentage of every subgroup was using Hadoop. He started taking that to his staff meeting and asking people, “Why aren’t you using Hadoop?” This just caused it to explode.
Yeah, it was this huge unanticipated success with the science teams doing … We thought of data science as something that was going to just produce, as a side project. Ultimately, it became a pipeline of new applications. It drove most of the use of Hadoop in Yahoo.
Stefan: What was, over that period of time … I’m sure it changed, but just your first thought, [00:16:43] … what was your most requested feature at that period of time?
Eric: Oh gosh. More data? (laughs) It was mostly just more scale, more data, and then obviously more speed. The focus was never on more features. It was really always on just more stability, more scale, and more performance. Really, our first Hadoop t-shirts from the summits had “Yahoo: Scaling Hadoop” on them, because that was it. It wasn’t about new APIs. I think the APIs were satisficing for a lot of work in the early days. I give the Google guys a lot of credit for that. Doug built something that worked, in terms of the APIs.
It was really much more of that just taking everything we’d learned and building internet-scale systems. We improved the performance orders of magnitude. We improved the scale orders of magnitude. That’s a lot of engineering, a lot of learning.
Stefan: Let’s switch gears here a little bit. What are the really cool things going on right now, in today’s world? What are you really excited about in the Hadoop ecosystem, or what’s going on in the Hadoop backend?
Eric: There’s a tremendous amount of stuff going on, [00:18:00] of course. Obviously, the transition to YARN is very exciting. That has been a journey we started … I’m afraid to guess when we really started that. 2009, 2010?
Stefan: Can you explain this a little bit more, for the folks?
Eric: Sure. The original Hadoop version …
Stefan: Here, if you want to …
Eric: (laughs) I don’t think there’s a lot of drawing. There’s a lot better diagrams you’ll find on the web. But the original Hadoop version is built with basically two systems. I haven’t used a chalkboard since I TA’d.
Stefan: Isn’t it fun? (laughing)
Eric: Yeah, it’s taking me back to ’95. HDFS and MapReduce are the two basic layers of Hadoop. Each one of these runs, basically you have a [inaudible 00:18:55] node. HDFS handles your storage, MapReduce handles your compute layer.
The problem with the MapReduce layer is that it assumes one model of programming. It’s a very powerful model of programming. Look at all the things people have done with Hadoop. But there are of course many other possible ways that you could approach distributed computation or just cluster-sharing.
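To make the "one model of programming" concrete, here is a toy, single-process Java sketch of the map, group-by-key, reduce shape, using word count as the example. It does not use the Hadoop API; `MapReduceSketch` and its method names are hypothetical, and a real cluster would spread the map and reduce calls across many nodes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapReduceSketch {

    // Map phase: one input line in, zero or more (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return out;
    }

    // Reduce phase: all values emitted for one key in, one result out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Stand-in for the framework: run map over every line, group the
    // emitted pairs by key (the "shuffle"), then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("big data and brews", "big data")));
    }
}
```

The user supplies only `map` and `reduce`, and everything else is fixed by the framework. That makes the model easy to scale, but work that does not fit this shape has to be forced into it.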
A lot of what people do with Hadoop is actually very simple things that you could do on cloud. You just need to launch a process and do some computation on the data. MapReduce wasn’t designed for that. It certainly wasn’t designed for running all these new frameworks that are emerging. Even worse than that, because every time you change MapReduce you have to change all the daemons across all the nodes in your cluster, even just evolving MapReduce has been relatively slow, because you have to do it carefully.
We had this thought, which was, what if we break it into three layers, basically? [00:20:00] HDFS, YARN … which technically stands for Yet Another Resource Negotiator, but really we just wanted a name with a “y” in it so that people knew it came from Yahoo (laughing) … and then we could have multiple frameworks on top. You can have MapReduce. You could have, let’s say … MPI was one we thought about a lot in the early days, although that hasn’t been realized yet. You could have lots of others.
Today, draw the list. Storm is really interesting, Spark is coming along. Both of those are happening at Yahoo today. People are looking at web app containers like Tomcat and figuring out how to run that in the cluster. Just all kinds of long-running services, and HBase, that’s a fun one to run in the cluster.
Anyway, the idea is lots and lots of different frameworks. Every user can choose a different kind of compute model. MapReduce basically was doing two things. It was doing resource management choosing …
Stefan: Right, the hard-coded model, for its own purpose.
Eric: Right, exactly. The resource management is deciding what resources you get on what nodes. And then the actual logic of the MapReduce: where does the user code run, what happens when a node fails, et cetera. So we broke that up into call it a user component and a system component. One of the reasons we did this was not even so that you could run lots of frameworks. It’s so that you could evolve MapReduce more quickly.
Stefan: And you have maybe multiple versions. It’s almost kind of a class loader or virtualization framework.
Eric: Exactly. Now, if you believe you can improve MapReduce, that just becomes something that you can run a test version of from your desktop and see if [00:22:00] the new one or the old one was better. Whereas before, you had to go and change the clusters, schedule downtime, lots of pain. It will make all the existing work much more agile. It’ll also hopefully just untap a huge amount of innovation. Watching that innovation is going to be really interesting over the next few years.
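The split Eric describes, a system component that only hands out resources and a user component that holds the framework logic, can be sketched in a few lines of Java. This is a toy model, not the YARN API: `ResourceNegotiator`, `Framework`, the slot counts, and the framework names are all hypothetical. It shows two versions of a framework sharing one cluster without the resource layer knowing anything about either.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ResourceSplitSketch {

    // System component (hypothetical): tracks free slots per node and hands
    // them out. It knows nothing about MapReduce or any other framework.
    static final class ResourceNegotiator {
        private final Map<String, Integer> freeSlots = new LinkedHashMap<>();

        void addNode(String node, int slots) {
            freeSlots.put(node, slots);
        }

        // Grants up to `wanted` slots; returns one node name per granted slot.
        List<String> allocate(int wanted) {
            List<String> granted = new ArrayList<>();
            for (Map.Entry<String, Integer> e : freeSlots.entrySet()) {
                while (e.getValue() > 0 && granted.size() < wanted) {
                    e.setValue(e.getValue() - 1);
                    granted.add(e.getKey());
                }
            }
            return granted;
        }
    }

    // User component (hypothetical): any framework that can say how many
    // slots it wants and what to do with the nodes it is given.
    interface Framework {
        String name();
        int slotsNeeded();
        String launch(List<String> nodes);
    }

    // Small factory so the example can create several framework versions.
    static Framework framework(final String name, final int slots) {
        return new Framework() {
            public String name() { return name; }
            public int slotsNeeded() { return slots; }
            public String launch(List<String> nodes) {
                return name + " running on " + nodes;
            }
        };
    }

    public static void main(String[] args) {
        ResourceNegotiator rn = new ResourceNegotiator();
        rn.addNode("node1", 2);
        rn.addNode("node2", 2);

        // A stable and a test version of the same framework share the cluster.
        Framework stable = framework("mapreduce-stable", 3);
        Framework test = framework("mapreduce-test", 2);
        System.out.println(stable.launch(rn.allocate(stable.slotsNeeded())));
        System.out.println(test.launch(rn.allocate(test.slotsNeeded())));
    }
}
```

In this sketch the test version simply gets whatever slots are left; a real scheduler would also handle queues, priorities, and node failures, none of which is modeled here.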
Stefan: Eric, thank you very much for joining me for a drink here at Big Data & Brews. Hope to see you back soon.
Eric: Yeah, it’s been fun. Thank you.