Big Data & Brews: Pivotal’s Chief Scientist on the Early Days of Hadoop
Milind is one of the early contributors to Hadoop (he’s been working on it since 1.0), so it was a great wrap-up to our conversation hearing about what it was like for him in the early days implementing Hadoop at Yahoo. We also had a chance to discuss his thoughts on Apache Mesos, what he thinks the next big thing in big data will be, and when he first realized that Hadoop was gaining traction.
Pull up a Kingfisher of your own and enjoy. :)
Stefan: Welcome back to Big Data & Brews. Today we have Milind. Cheers.
Stefan: From Pivotal and Yahoo, and I didn’t know you’re a rocket scientist.
Milind: In that way, I’m a molecular dynamicist, whatever they call them, because back in Illinois, when I was doing my PhD, I worked on molecular dynamics applications. I understand the same thing about molecular dynamics as I understand about rockets. Nothing.
Stefan: But you wrote the software for it.
Milind: I wrote the software for it, yeah.
Stefan: In our last episode, we talked a little bit about the fun experience and kind of the beginnings of Hadoop at Yahoo. What is one of the most fun stories in the team, in the early days you think of? What was the biggest bang or the most breakthrough?
Milind: I believe fun things. Mahdev Gourner and Hirang. They were our first support engineers. Every afternoon from 3:00 to 5:00 they would actually sit near the phone in their cubes, and early Hadoop users at Yahoo would actually call them. The first Operations person that we hired, I mean Hadoop was rapidly developing. I think at one point there were three releases in the space of a week, just bug fixing. At that time we did not have processes around Hadoop because no production work was actually being done there, right? Most of these early Hadoop users were what you now call data scientists, exploring data.
Stefan: Real data scientists. I mean today, everybody is it.
Milind: Yeah, real data scientists.
Stefan: No, my grandmother is a data scientist. She did a web-based data scientist training, got a certificate by email and … no, I’m …
Milind: Okay. It’s possible, I don’t know.
Stefan: No, but I mean today, every organization that provides Twitter training is providing data scientist training now. That was like real, PhD scientists, right? [2:56]
Milind: Right, absolutely. They were doing real, large-scale natural language processing, that kind of stuff.
Stefan: Mostly on the web corpus was my understanding.
Milind: Almost all of that work was on the web corpus, because the first data set that I remember, that we loaded, was the million best English language pages, or something like that. From within the web corpus, you probably know the name Arkady Borkovsky. Arkady later went on to become CTO of Yandex Labs. He is primarily responsible for Hadoop streaming. That was the data set that he had generated offline, so we brought that in and a lot of work actually went on that, initially.
Operations, we didn’t have any processes, we just basically, the new release comes out, “Hey, let’s just deploy it, right? Shut off the clusters, all the 600 unit cluster, Kryptonite.” Shut off the clusters …
Stefan: And the phone rings.
Milind: Suddenly phone rings, “What happened to my job? You shut down the cluster!” “Okay, it will be up in another 15 minutes, don’t worry about it.”
Stefan: If we would do that today, there might be more than a phone that’s ringing. [4:06]
Milind: Exactly. Then sanity prevailed, there were people actually deploying the clusters by giving the users warning, saying, “Hey, Wednesday evening we are going to deploy the cluster” all those kinds of things. I went on to start a team within the Hadoop team, called the Grid Solutions Team. Earlier, it used to be called the Utility Computing Group. That did not quite fly, so we changed the name to Grid Computing Group.
I started a team called Grid Solutions Team, which was all these hundreds of new migrants to Hadoop from all over Yahoo!. This was sort of helping them architect their applications, or pull their applications to run on top of Hadoop. Ultimately that team grew to around six people here in the US and 40 people in India, in Bangalore.
It was a really nice team. We got to play with early Hadoop users, we actually got to look at how people are using Hadoop, what kind of applications they are using on Hadoop, that kind of thing. Also, around the upgrade time, that’s when our headache essentially grew. The API’s would suddenly change.
Stefan: That still happens today, by the way. Don’t want to blame anybody.
Milind: We had some open to all clusters, I mean these were not operated according to strict SLA’s or anything like that. Any new release of Hadoop, we’ll first deploy there, literally ask people to play with that and then pull their applications there.
Stefan: Almost like a Facebook deployment model, where you first try …
Milind: Yeah, the bronze, silver, gold kind of model, that’s right. All these emails, all those things basically, most of those, we’ll ignore them. Then suddenly when the production clusters were upgraded, we got phone calls. “Hey, why is my application failing?” “Have you tested your application on these bronze clusters?” “No.” “You didn’t tell us.” Then I would get to look at all these emails. That’s the fun thing that used to happen.
Stefan: What was the most common, as you had those people that migrated from kind of a traditional architectural … and I think it’s very relevant for today’s story, right? Now Hadoop arrives at maybe not a mass market, but the broader market. Back then, do you see kind of the same mistakes that the early adopters at Yahoo! made, today made in whatever, financial services, telcos? Are people still making the same mistakes? [6:44]
Milind: I think it’s on both users of Hadoop, as well as people who are selling Hadoop. The right expectations of management from the IT team is extremely crucial. We are selling Hadoop today and I won’t blame anybody, but almost all vendors are essentially saying, “This is the next thing since sliced bread. All that workload that you had running on there … “
Stefan: It prints money and does dishes too!
Milind: Exactly. “Hey, migrate those to Hadoop. Hadoop is actually great at those things,” right? We need to manage those expectations. That is what happened, I think: after the few early adopters got used to Hadoop, they knew exactly what it could be used for, how to make the best use of it, etc.
Then, I think some time around 2009, there was a mandate from the top saying, “Hey, consider Hadoop first.” A lot of people basically tried to move their use cases to run on Hadoop, without knowing what to expect. That’s where actually there were a lot of issues. Really, in spite of doing all this work, by 2010 we could not guarantee a 10 minute SLA, because what if we got into your 10 minute SLA and your job fails at the end of 9 minutes and you have to restart it all over again?
The second thing was in general, the multi-tenancy aspect. Isolation across the different users, isolation across multiple jobs running on the same cluster. That was the trouble that we faced, moving from Hadoop on Demand onto this single JobTracker and the single capacity scheduler.
Stefan: Right. On Demand really had one sandbox.
Milind: Exactly. The HDFS part of Hadoop on Demand was actually shared, but you could create MapReduce clusters that could be used only by you or your application, and then dismantle them. Moving from there to a single MapReduce cluster across the entire cluster, that basically was a major issue.
Stefan: Does Pivotal have a story around this? [8:51]
Milind: Now obviously even the Hadoop story has changed a lot, with YARN. In general, the multi-tenancy aspect, I think it was in 0.22. something there … actually 0.20.25 or something like that, where a lot of these guard rails went in. One job running in a Hadoop cluster should not be able to trample upon the other job. A lot of people did a lot of work on that. I think we are in much better shape now, but in terms of isolation across multiple applications running on the same clusters, even in the YARN cluster, I think now it is bounded by the container technology that is used.
With containers, I can bind the container to a few specific cores, or I can bound the amount of RAM that it uses, the amount of scratch space it uses, but then the network bandwidth and disk bandwidth, those are still open to other containers.
Stefan: Yeah, and that’s what I guess it’s very critical, right?
Milind: Exactly. If you really want to offer guaranteed real time performance on some of these queries, that’s the thing that matters.
Stefan: Actually, in fact it’s the most critical thing. Well, it’s one of the fundamentals.
Milind: Exactly. That is why I think, just the virtual machines that are getting there … the heavyweight virtual machines, like the VMware kind of things, those have implemented that isolation pretty nicely.
Stefan: Yeah, very low level.
Milind: I am right now experimenting with a hybrid infrastructure, where I look at multiple virtual machines on a single physical machine, treat them all as different YARN nodes. YARN essentially schedules those containers inside of those virtual machines.
Stefan: Okay, so you make VMware a YARN process? [10:49]
Milind: No. VMware is not a YARN process, but instead of having a node manager on a physical machine, the node manager will be on the virtual machine. I could have basically multiple compute virtual machines, a single storage virtual machine running inside and YARN can schedule.
The changes that we have to make in the YARN capacity scheduler is basically to say, “This node group goes to this particular queue or this particular user.”
Stefan: That will be then true multi-tenancy.
Milind: Yeah. That really opens it to multi-tenancy. With Docker and the whole libcontainer rewrite, all of that has happened. I think the container technology is also evolving pretty quickly, which basically means that we’ll soon have the network isolation and everything else.
I think SDN in its totality is going to be maybe two years away from rapid adoption, but once that happens, I think we’ll be in very good shape to tell our multi-tenancy story.
Stefan: I think that will be really interesting then to get Hadoop and kind of the data driven environments into the SaaS companies. The biggest challenge in the SaaS companies at this point is they basically legally can’t do multi-tenancy, because it’s really challenging, and then all the advantages, hardware utilization advantages of Hadoop kind of go over them.
What do you think about things like Mesos? [12:20]
Milind: When the Mesos project was started, I was following it pretty closely. Then we were working on YARN at that time, at Yahoo!. At some point I think it was a marketing clash. Right? I still think that Mesos and YARN can coexist very nicely, Mesos managing the underlying system resources, whereas YARN taking care of the application resources.
Stefan: Mesos would do kind of the I/O piece and the CPU and memory and YARN more high level logic? [12:56]
Milind: Exactly, and since it is written in C, YARN could be just one of the applications. Now YARN itself is becoming sort of a resource manager for all different kinds of application, but really that layer would have been nice, which is physical resource management done by Mesos and YARN essentially utilizing those containers.
I still do that sometimes with a summer project or summer intern or something like that, just to see how they actually can play very nicely together.
Stefan: Do you need more interns?
What’s the most exciting thing that’s coming up? As a visionary that looks into the future, anything on the horizon that you’re really excited about? I mean Storm isn’t really the latest, greatest thing anymore. What’s the next thing beyond that? [13:51]
Milind: In the short term, when we launch this whole Pivotal HD, one of our visions has always been that you get a buffet of computing frameworks to run on the same data. That’s the reason that we want to accumulate all this data in HDFS, whether it is HAWQ connective data or HAWQ trying to access data from HBase or whatever, right?
Why just be limited to that? Because of YARN, now we can have something like GraphLab, which specializes in graph model and graph analytics, have access to the same compute resources and the same data, on top of the same Hadoop cluster.
GraphLab is the recent one that we added, but in order to run GraphLab, the GraphLab runtime is actually OpenMPI. OpenMPI has been used in high performance computing for a long time, so what we managed to do is make YARN the resource manager for OpenMPI.
We’ll be open sourcing it soon, and I’ve been saying “soon” for a long time now, so …
Stefan: Well you work at a small start-up, there’s only one lawyer that has to look at the paperwork, right?
Milind: Right, exactly. That’s the project that we call Hamster.
Stefan: There’s a whole zoo of animals here.
Milind: I can’t take full credit for Hamster, but I can take credit for the name. It stands for Hadoop and MPI on the same cluster. Basically we’ll see a lot of these specialized runtime systems for specialized machine learning algorithms, or any time your communication pattern changes, that’s what you’re going to have. All of them sharing the same data.
Stefan: The big vision is to leave the data in one space and move the buffet of tools to the data?
Stefan: What we did traditionally over the last 30 years or so, we moved the data into the specialized environment. Now we moved the specialized environment to the data. [16:01]
Milind: Although the data as a bytestream is there in HDFS files, we actually need to impose some structure through input formats and record readers and things like that, right? Really, HCatalog is actually a great step in going towards that, because it actually associates a particular record structure with any arbitrary data set, that MapReduce as well as Hive as well as Pig, all of them can use, simultaneously, right?
Stefan: Or HAWQ.
Milind: Or HAWQ. Exactly. That’s the thing, that’s where I was going. It’s that we basically make use of something that we call Pivotal Extension Framework, which essentially is a translation layer between these outside data sources, outside as in not in our native format, and all our different components that are running on top of Hadoop, making that data accessible to them.
One more recent happening that I’m really proud of is basically we introduced the support for Parquet as a native table type in HAWQ. You can basically, when you do a Create Table, you basically say, Create Table type=Parquet.
Now it’s a completely open format, and we have done a lot of work on it, and the reason we went with Parquet is that it is accessible both from Java and C++.
Since HAWQ is written in C, it basically gets native access to Parquet, and the same files are then available via the Parquet input format for the rest of the ecosystem. Right?
Stefan: Now you’re working on this and you’re one of the really, really early guys in the Hadoop space, and you’re working on this. Where was that moment when you had to pinch yourself, “I can’t believe people are doing this with software I wrote so many years ago?”
What’s the most amazing use case you saw? [18:02]
Milind: First was not the use case. Once I started doing Hadoop evangelism outside of Yahoo! … by the way, the first Hadoop tutorial delivered anywhere was in ApacheCon in 2008 or 2009, I am forgetting, but this was in New Orleans.
Stefan: I was there.
Milind: It was delivered by me and it was sponsored by Cloudera, so Christophe Bisciglia, Aaron Kimball, Tom White, all of those actually were …
Stefan: Didn’t we go in the evening? Anyhow …
Milind: The French Quarter thing, let’s push that out. Really, I was at USENIX the year after that. My tutorial proposal got accepted there. Next door there was a Solaris performance tuning tutorial going on with Richard McDougall.
Richard McDougall has written Solaris performance tuning books, all about Solaris. A really great guy. His tutorial had like six people attending them and my tutorial next door had something like 30 people attending. That’s the point where I basically realized …
Stefan: Something is shifting.
Milind: Something is shifting, exactly. USENIX 2009, this was in San Diego. Among the attendees in my tutorial there were three people representing all three different agencies. That was basically, “Okay, what have we done?”
Stefan: Saving the world.
Milind: Saving the world, yeah. Recently actually, my daughter took part in the Synopsys Science Fair, here in South Bay. I went to drop her there and I took a look at what all the kids were doing, from 7th grade to 12th.
There was actually a kid from the 7th grade, who did … what was his title? “Effect of number of computers on computation time.” He basically took a MapReduce job and he basically said, “If I run this on three machines, if I run this on ten machines, if I run this on 20 machines, how much does the computation time change?”
He discovered that it goes down for some time and then it basically goes back up. This was all done using Amazon AWS and Hadoop. I actually was tempted to make him an internship offer right there. I don’t know about underage recruiting or anything like that.
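The "goes down, then goes back up" shape is easy to reproduce with a toy model. The sketch below is only an illustration, not the student's experiment: it assumes the job time is a fixed amount of parallel work divided across machines, plus a coordination cost that grows with every machine added. The function names and the numbers (3600 seconds of work, 4 seconds of overhead per machine) are made up for the example.

```python
def job_time(nodes, work=3600.0, overhead_per_node=4.0):
    """Toy model of total job time in seconds (hypothetical parameters).

    work / nodes          : the parallel part shrinks as machines are added
    overhead_per_node * n : startup and coordination cost grows with cluster size
    """
    return work / nodes + overhead_per_node * nodes


def best_node_count(work=3600.0, overhead_per_node=4.0, max_nodes=100):
    """Return the cluster size with the lowest modeled job time."""
    return min(
        range(1, max_nodes + 1),
        key=lambda n: job_time(n, work, overhead_per_node),
    )


if __name__ == "__main__":
    # Time falls at first, then rises again once overhead dominates.
    for n in (3, 10, 20, 30, 60):
        print(f"{n:3d} machines -> {job_time(n):7.1f} s")
    print("fastest cluster size in this model:", best_node_count())
```

In this model the sweet spot sits where the two terms are equal; past that point each extra machine costs more in coordination than it saves in work.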
Stefan: We looked a little bit into the future. What’s the day-to-day future for you look like? What kind of development environment are you using? What’s the day to day life look like? [21:04]
Milind: I don’t do coding anymore, unfortunately.
Stefan: No? Me neither. Remember the good times when we were coders?
Stefan: You just run it.
Milind: Yeah, absolutely. I sometimes do code reviews and stuff like that, for which I use Eclipse. That’s pretty much it.
Stefan: I only get these days to do git pull, checkout master, clean, actually no, it’s gradle clean, and then I call my service people, “Oh it’s not working!”
Great, thank you very much. What do you think as your last sentence? What is the recommendation, as people are getting their feet wet with Hadoop, what should they look out for? [22:25]
Milind: I’ve been saying this for a long time, right? For people who are running or writing Java MapReduce programs, if you can do it with Hive or Pig, just do it with that. I’ve always considered … This was 2008, something like that. MapReduce has always been like assembly language, always. Always use higher level languages. So what if you lose 5% performance? Don’t care, right?
Now I would basically say, for people that are getting into the Hadoop space, explore it from the top, just use the right tool for the right job. If you can do your use cases with these higher-level tools, just go for it. They actually abstract away so much of the complexity of having to write a bunch of code. Three hundred lines to just do a word count? I mean come on.
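To make the "assembly language" comparison concrete, here is a small sketch in plain Python rather than Hadoop itself. It assumes nothing about the Hadoop API; the function names are hypothetical. The first version spells out the map, shuffle, and reduce phases by hand, the way a low-level MapReduce program has to, and the second gets the same answer from a higher-level building block.

```python
from collections import Counter, defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the grouped values for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count_low_level(lines):
    """Word count with every phase written out explicitly."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


def word_count_high_level(lines):
    """The same result using a higher-level abstraction."""
    return dict(Counter(w for line in lines for w in line.lower().split()))


if __name__ == "__main__":
    sample = ["big data and brews", "Big data"]
    print(word_count_low_level(sample))
    print(word_count_high_level(sample))
```

Both functions return the same counts; the difference is how much plumbing the author has to write and maintain, which is the trade-off Milind is describing between raw MapReduce and tools like Hive or Pig.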
Stefan: Thank you very much for joining. I hope we can have you back soon.
Milind: Absolutely, thank you.