Stefan's Blog

Big Data Musings From Datameer's CEO

Big Data & Brews: Eric Baldeschwieler on the History of Hadoop

February 4, 2014


Eric Baldeschwieler is an influential figure in the Big Data and Hadoop community. I was honored to have him in for a chat and to hear his view on the history of Hadoop.

TRANSCRIPT

STEFAN:          Welcome to Big Data and Brews, today with E-14 [00:00:10], Eric Baldeschwieler. Welcome.

ERIC:               Thank you.

STEFAN:          Can you introduce yourself and- nice, cold drink you brought with you.

ERIC:               Sure. I'm Eric Baldeschwieler. I've been working with Hadoop since its inception. Before that I was building search engines for Yahoo and Inktomi, so I've been working with big data since '96, by some reckoning.

STEFAN:          So it was Inktomi before… and the team joined Yahoo after the acquisition.

ERIC:               That’s right, Inktomi was acquired by Yahoo in 2003. And the beer! This was in my fridge.

STEFAN:          We’re both German, so…

ERIC:               That’s right, there’s that cultural heritage here. This is actually a California beer. There’s a lot of great beer in California.

STEFAN:          I do agree.

ERIC:               This one is from… I’m having a senior moment. It’s from the south of us near San Luis Obispo, Paso Robles.

STEFAN:          So, a microbrewery?

ERIC:               I found it in my fridge.

STEFAN:          Nice.

ERIC:               I actually drink it fairly regularly, obviously.

STEFAN:          Okay. Well, then let's do it. So, the first question that I have, when we hang out somewhere at one of those events [00:01:41], is: "Why E-14?" What's the background there? That was your e-mail address at Yahoo, I assume?

ERIC:               Eric-14 is a label that’s been with me for a long time, and actually goes all the way back to my sister’s grade-school, where there were two Karen B.’s. There was a Karen-6 [00:02:00] and a Karen-14, because Baldeschwieler has 14 letters in it.

STEFAN:          Okay.

ERIC:               So when I had to choose an e-mail address in college, it seemed the obvious choice.

STEFAN:          It’s very obvious. Let’s see, so that’s two different kinds? I have a double-barrel ale…

ERIC:               I think it’s the same beer with two labels.

STEFAN:          Oh, okay, good. Well, just mixing it up. It has 5% alcohol, so… okay.

ERIC:               So, we can’t be held responsible for what comes next.

STEFAN:          Yeah, we just talk about technology, right?

ERIC:               Let’s try.

STEFAN:          Cheers! [00:02:36]. Yeah, that’s a good, strong beer. It’s not one of those American beers that comes in a can and is brewed from, maybe corn or something.

ERIC:               I have this hypothesis that they never stopped brewing beer in California during Prohibition.

STEFAN:          Yes?

ERIC:               If you go out of the cities there’s all these really good microbreweries, so I think they just never stopped.

STEFAN:          Yeah, that makes sense. So, help me a little bit with your history. What did you do before Inktomi, how did your career go there [00:03:20] and then, I guess, most famously, how did you come to run the whole Hadoop team and really build it up? So, help us get there, from a technology perspective.

ERIC:               All right. I came to the Valley… God, ’87, from Carnegie-Mellon and did some… worked at a video startup called DigitalFX, so built lots of systems. Then went off to school, came back, worked at Electronic Arts on video games.

STEFAN:          Oh, cool! I didn’t know that.

ERIC:               That was fun. We made a flying game on the 3DO which was a platform that was very [00:04:00] interesting for a short period of time.

STEFAN:          And it’s all hardcore C++.

ERIC:               Mm-hm.

STEFAN:          The good old stuff.

ERIC:               That's right. Trying to figure out- the fun thing about those applications was that you really had to figure out how to use the entire machine. Your app didn't fit comfortably in the amount of memory and the amount of storage and the amount of CPU you had, so you had to understand the machine. That's one of the things I look for to this day when I hire people for this kind of work: you want people who haven't just written in Java or C and plugged things together. You want people who have had to struggle with something that doesn't fit in the amount of resources they have, and have had to really learn how to program as a result.

STEFAN:          Yeah. One of my favorite papers is, “Why you as a Java developer should learn assembly.” Did you ever read this thing?

ERIC:               No. It makes sense to me, you should.

STEFAN:          Yeah, right? To really understand the whole [HOP-son 00:04:58] switches and memory management and all that kind of fun stuff. It was a really interesting paper. They basically say, "You don't need to write everything in assembly, but if you really understand the concepts, then a lot of stuff, including garbage collection and what-have-you, really makes sense."

ERIC:               It’s always fun to just ask people, “How does a function call, how is it implemented,” or, “How does garbage collection work?”

STEFAN:          Right.

ERIC:               Right. You're not ready to work in systems infrastructure if you don't understand those things…

STEFAN:          Yeah. Okay. So, the flight simulator, and then…?

ERIC:               Then back to… then I hitch-hiked around Europe for a couple of years.

STEFAN:          Oh, that's cool! A couple of years, even? Besides Germany, of course, what was your favorite?

ERIC:               Actually, my grandparents are Swiss, so I managed to get a job at the ETH in Zürich.

STEFAN:          Oh, cool. Beautiful city.

ERIC:               So that was a home base. Not only is it [00:06:00] a beautiful city, but you can take the train in an hour and you can be in Italy, France, Germany, the world changes very quickly from there, so it was a great place to explore Europe from. So, did that, and then back to Berkeley for a couple of years, and then into Inktomi.

STEFAN:          Okay.

ERIC:               Which, it was just an amazing time to be looking for a job, of course, because the dot-com thing was just really starting.

STEFAN:          Yeah.

ERIC:               Eric Brewer, who was one of the founders of Inktomi, was my advisor. I looked at a number of other places, but ultimately I decided I'd be nuts not to join this thing.

STEFAN:          Right.

ERIC:               So really, it was at the beginning of a sort of search engine revolution that happened throughout the dot-com era. Inktomi, kind of, is one of these companies that was huge, and then got very small near the end.

STEFAN:          Yeah.

ERIC:               But throughout that period we were just building and rebuilding and rebuilding this search engine. My part of it was actually building the content system. How do you crawl every document in the world and tear it apart and index it so that it can be found by the runtime search engine? So, that's building a big data system, if you think about it. I ran a team from '97 to 2005, both at Inktomi and then after the acquisition at Yahoo, that did that, and by the end… the 'marketing number' was that we had a hundred-billion-document crawl. More precisely, we knew of a hundred billion URLs. That was how Google was putting it, so we did the same thing. We were crawling tens of billions of documents and managing hundred-terabyte data sets, things of that sort, which back in 2003, 4, 5 were really big numbers.

STEFAN:          Yeah, definitely. What do you mean by the content system? Was that part of the crawling or the [post 00:07:55] processing?

ERIC:               All of that, of discovering and fetching every document [00:08:00] in the Web, i.e., crawling it. Then, okay, you have a pile of tens of billions of documents, how do you turn that into something that a search engine can use? That means building a web map so you can do page rank and all those things, so taking every document, tearing it apart into sets of terms, then sorting all that information by the terms, sorting it by the document that the link is referring to. Basically, doing petabyte scale sorts. We built a series of infrastructures to do that, and just storing and managing all that data. We built, rebuilt, and rebuilt that system about four times by 2005.
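
(To make the "sort by term, sort by the document a link refers to" step concrete, here is a toy Java sketch of the web-map grouping Eric describes: extracted source-to-target links are grouped by the page they point to, which is the shape of data a PageRank-style calculation needs. The links and class names are illustrative assumptions, not the actual Inktomi/Yahoo pipeline, which did this as a distributed, petabyte-scale sort rather than an in-memory map.)

    import java.util.*;

    // Toy sketch: group extracted links by the page they point to.
    // At Yahoo scale this grouping was a distributed sort, not an in-memory map.
    public class WebMapSketch {
        public static void main(String[] args) {
            // Hypothetical extracted links: {source URL, target URL}.
            String[][] links = {
                {"a.com/1", "b.com/index"},
                {"c.com/2", "b.com/index"},
                {"a.com/1", "c.com/2"},
            };

            // Group inbound links by target document ("sort by the document
            // that the link is referring to").
            Map<String, List<String>> inbound = new TreeMap<>();
            for (String[] link : links) {
                inbound.computeIfAbsent(link[1], k -> new ArrayList<>()).add(link[0]);
            }

            // Each entry is one web-map record: target -> list of referring pages.
            inbound.forEach((target, sources) ->
                System.out.println(target + " <- " + sources));
        }
    }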

STEFAN:          Wow.

ERIC:               Then in 2005, actually what happened was, we were thinking of re-architecting it again, and the science team started coming and banging on our door- they didn't stop at my door, they were coming and giving me a lot of advice (laughter)- saying that they would love to have a system like the ones in these papers they were reading from Google, because they wanted to do research on all the documents that we were crawling. We'd built these thousand-computer clusters: one of them to crawl the web, one of them to do the sort of PageRank calculations, one of them to actually build the final indexes that the search engine used. But all of that hardware was dedicated to that one purpose…

STEFAN:          One thing, yeah.

ERIC:               …and they couldn't use it. We were looking to re-architect all that anyway, so that was the point where we said, "Let's build the system based on the MapReduce interfaces and the HDFS interfaces"- well, the Google File System interfaces. We knew we could do it because we'd built four of these before. We already had a thousand-node system that was effectively running MapReduce, but the APIs suggested by the Google guys were better. We looked at that and said, "Okay, if we rebuild it this way, then we'll be able to [00:10:00] reuse that software in many more applications."
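
(For readers who have never seen the programming model Eric is referring to, here is a minimal word-count job written against today's Hadoop Java MapReduce API. This is the modern org.apache.hadoop.mapreduce API, which postdates the 2005-era interfaces discussed here; it is only meant to illustrate the map / sort-by-key / reduce contract from the Google papers, not anything Yahoo actually shipped at the time.)

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // The mapper emits (term, 1) for each token; the framework sorts and
    // groups by key; the reducer sums the counts for each term.
    public class WordCount {
        public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }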

STEFAN:          Before we go into more of that really cool infrastructure, I want to… [I will click 00:10:08] a little bit on the challenges around search engines, because I worked in that area as well and was always like, "Boy, how do you get to such a big crawl?" And there are all those little tricks, HTTP keep-alive, and DNS caching was a really big problem for us, especially in the Nutch days, because we basically didn't go to one host and download all the documents; we took it URL by URL, so we had to do a DNS lookup for every URL we had to… how did you guys manage to crawl that much data?

ERIC:               Some of those problems get easier at scale. Some of them get harder. So, you would partition the crawl by host, so things like DNS-caching were not a big deal. You obviously do need to do it, but the biggest problem we had was being nice to the world. I learned a lesson back in, around ’99, when one of my engineers running a test crawl on his laptop took down Microsoft.

STEFAN:          (laughter) That was you?

ERIC:               It was just one… but you think about it, how many websites, especially back then, were prepared to have somebody open up and start to fetch thousands and thousands of pages simultaneously. They just weren’t architected for it. Today I don’t think you could take down Microsoft from your laptop by just asking for its pages, but back then you could take down any site in the world that way.

STEFAN:          Right.

ERIC:               Just figuring out how to be polite. Then you have the reverse problem, which is…

STEFAN:          Right, you’re not fast enough.

ERIC:               …well, I need to get a billion, I need to get millions and millions of pages out of this website. How do I sequence that? So yeah, there’s a lot of tricks. You partition it where you could, then you had to keep a list of active hosts. The other one that really was a pain was [00:12:00]-

STEFAN:          And active host means open connections?

ERIC:               Yeah. You had to keep a list of the hosts that were big enough that you couldn’t just take the… by default you’d just take all the URLs and just randomize them and you’re done. That’s a strategy that Nutch probably used, [hash by the URL 00:12:15]…

STEFAN:          But then we had the DNS lookup problem that slowed us down.

ERIC:               Which, yeah, there's that problem, and there's the problem that some hosts are just too big for that, and you're still going to violate politeness if you do that. For those big hosts you would need to come up with a strategy. Some of them were just like Yahoo, they have a lot of pages, but what really got us in trouble were things like affiliate networks. If you look at Amazon, they keep your cookie in the URL, and that means that every… you can crawl the whole site and every URL looks unique in the world; it'll just dish you out a new one every time you visit. When we first… we'd go back and look at the crawl logs and realize, "My gosh, we have a billion pages from one website. What is going on?" Then we'd have to go back and figure out rules. You could just ban the site, that was the first solution. There are better solutions.
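
(A toy sketch of the politeness scheduling Eric describes: pending URLs are partitioned by host, and each host is fetched at most once per delay interval so no single site gets hammered. The host names, the two-second delay, and the data structures below are illustrative assumptions, not the real crawler.)

    import java.util.*;

    // Toy per-host politeness scheduler: partition URLs by host and enforce
    // a minimum delay between fetches to the same host.
    public class PolitenessSketch {
        private static final long DELAY_MS = 2_000; // hypothetical per-host delay
        private final Map<String, Deque<String>> queues = new LinkedHashMap<>(); // host -> pending URLs
        private final Map<String, Long> nextAllowed = new HashMap<>();           // host -> earliest next fetch

        void add(String host, String url) {
            queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }

        // Return the next URL whose host may be fetched now, or null if every
        // host with pending work is still inside its politeness window.
        String nextFetch(long now) {
            for (Map.Entry<String, Deque<String>> e : queues.entrySet()) {
                String host = e.getKey();
                if (!e.getValue().isEmpty() && now >= nextAllowed.getOrDefault(host, 0L)) {
                    nextAllowed.put(host, now + DELAY_MS);
                    return e.getValue().poll();
                }
            }
            return null;
        }

        public static void main(String[] args) {
            PolitenessSketch crawler = new PolitenessSketch();
            crawler.add("example.com", "http://example.com/a");
            crawler.add("example.com", "http://example.com/b");
            crawler.add("other.org", "http://other.org/x");
            long now = System.currentTimeMillis();
            System.out.println(crawler.nextFetch(now)); // http://example.com/a
            System.out.println(crawler.nextFetch(now)); // http://other.org/x (example.com is cooling down)
            System.out.println(crawler.nextFetch(now)); // null until DELAY_MS has passed
        }
    }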

STEFAN:          Did you build a cookie management system?

ERIC:               Mmm (affirmative). We did a little bit of that. After I moved off the Hadoop project they got very sophisticated. They actually built a crawler that used Mozilla as its core.

STEFAN:          Oh, okay.

ERIC:               So they were building the whole page [00:13:27].

STEFAN:          To execute the JavaScript… oh, cool.

ERIC:               Because, again, in the early days, a page was a page, but at this point, what is a page? Unless you can… is it Flash? Is it… who knows what, right? Just rendering it out so you can understand what a human is seeing gets to be a hard problem.

STEFAN:          Now I know why Chrome is that high up on the whole stats for the [process 00:13:50] because Google is just using Chrome on them. Sorry, I interrupted you. You were going into more of the infrastructure side before I sidetracked you with all that cool [00:14:00] search engine stuff.

ERIC:               Sure. That was a fun journey, of course, but by 2005 we looked at this and it became clear that there was an opportunity to re-architect our system. We had a number of different problems that we were trying to solve. One was to make the data accessible to scientists; one was to build a larger-scale version of the crawl. It always needed to get bigger every year, and the clustering systems that we had worked at a thousand nodes, but they were starting to spend more time, sort of, handling error recovery than actually running the code, because as you get to that scale, things fail all the time. If you're not very clever about how you handle failure, what you think of as an exception case is the dominant case in the code. Every time a node would fail, the system would go down for a few hours to rebuild…

STEFAN:          Ooh, okay.

ERIC:               …so nodes were failing frequently enough that the system was spending about half its time rebuilding. That was still okay, but if we doubled it again, then you wouldn't see any improvement at that size. Not unless you start spending more for your hardware, which you don't want to do, since that was a multi-million dollar budget as it was. We were looking at all that. Hadoop solved that problem; it was clear how it could solve that problem. It had a built-in failure design that was more sophisticated than what we were doing. We also had this sort of science recruitment problem, which was that, to build the best search engine and advertising systems in the world, we needed to hire great people. At that point in time, nobody thought of Yahoo as a place that did big data science, so we asked, "How are we going to put ourselves on the map?" We could build a different system than what Google had done and publish that design, but they have more systems researchers and more staff, [00:16:00] and we'd be a second system. It's not going to get a lot of attention. But if we take their inspiration and open-source it, that will put us in the conversation in a completely different way.

[16:16] That was the decision that we made as a company: we wanted to actually put Yahoo on the map by building an open-source big data system. We knew we could do it because we'd built several. We started out with the Google papers, and we then actually went through this whole design process of, "How are we going to refine this, make it better, put our own touch on it?" We changed and changed and changed, and we got to a point where we had a completely different design than what the Google papers suggested. Then we looked at that and said, "Well, what's going to happen? We're going to open-source this thing, we want it to be adopted, what's the best guarantee we have of getting it adopted?" It's to conform to a familiar template. These papers are out there, they're a template that everybody understands, so we took the design all the way back down to being very similar.

Another thing we decided really early on was that if we clone these papers- clone is probably not a word that people would like to hear- but if we use these APIs as our inspiration, then one of two things happens. Either our open-source project wins, everyone adopts it, and then there's huge benefit to us because the world is contributing the infrastructure we use, we get to hire people who already know how to use our infrastructure, et cetera. Or we lose, somebody else builds an even better open-source version of these papers. Because somebody will, and then what happens? Then we have a very easy job of porting all of our infrastructure to the dominant paradigm, and again we win. Open-sourcing seemed like a really good bet. Conforming to those papers seemed like a really good bet, [00:18:00] although it was interesting, the sort of nimbyism- no, nimbyism's the wrong word. Not invented here, NIH. Even years after the Hadoop project was just going gangbusters, people would stand up in big town-hall meetings at Yahoo and say, "I can't believe that Yahoo's not innovating, because Hadoop is just a copy of the Google papers. When are you going to do something original?" It's like, well, you know (laughter). Great ideas happen elsewhere- I forget who said that, but it's a valid quote. What's important is not proving that you have a clever idea, what's important is executing.

The choice of open-sourcing a MapReduce infrastructure did tremendous good for Yahoo. In the end, we managed to hire and retain a great team, although Yahoo was going through some challenging times, shall we say. Much better than that, we really put Yahoo on the map in terms of hiring scientists. They built a really world-class science organization that drove a lot of innovation across all of Yahoo. The story gets to Hadoop, actually, in another step, which is, having made this decision, we staffed a team out of the people that had built the last four crawling systems and started building a prototype called Juggernaut. But then at the same time, Raymie Stata, who was at the time the chief architect of search and advertising and later became the CTO of Yahoo, had hired Doug Cutting. Doug Cutting had built this prototype of MapReduce and the Google File System inside the Nutch system. You're probably… you were already using that at this point…

STEFAN:          Oh, yeah, I was working with him since 2003 on that.

ERIC:               That puts you in a very rarefied crowd.

STEFAN:          One of three, yeah.

ERIC:               But yeah, so you were one of the [00:20:00] very few users of Hadoop before it was Hadoop. So Raymie started suggesting that I adopt Hadoop as the foundation for this open-source system we were going to build. This was something that we discussed and debated for about six months before I finally concluded, okay, this makes sense, we're going to do it. It's funny, we should get back to the Java / C++ thing in a second, because that was one of the things, but we were looking at this and going, "Our design is much better than Doug's implementation. We know how to build these systems, we've built four of them, why should we adopt this thing that is really just a prototype?"

STEFAN:          Right.

ERIC:               But then we looked at it, and it's like, "Okay. It's a prototype, but it's a prototype in Apache, and it's built by someone who's built successful open-source projects before, and who can teach us a hell of a lot about that." There was this huge, jarring transition we made where I took the team that we'd built to build such a system, threw out our prototype, and adopted Hadoop, and then started the process of understanding it and refining it.

STEFAN:          So, before we go into more of that history of Hadoop at Yahoo, and I certainly want to know more about the whole C++ and Java discussion, I'm sure that will be fun, we will take a break, have a few more beers…

ERIC:               A few more beers (laughter)

STEFAN:          In between, so we…

ERIC:               So we’ll be [00:21:36] singing when we come back!

STEFAN:          Exactly! Well, that’s the idea, I want to hear all the songs about all the interesting things that happened behind the curtain. We’ll be right back.

ERIC:               All right. [00:22:00]

 
