Big Data & Brews: Pivotal Chief Scientist, Milind Bhandarkar, Talks about Life Before Pivotal
Following last week’s snapshot of how Pivotal works, Milind and I shared some of my favorite Indian beer, Kingfisher, and talked about the cool projects he’s had a chance to work on before joining the company. Did you know he worked on the first supercomputer in India and also at the Center for Simulation of Advanced Rockets?
Sneak peek: We also talked about why I think the term “data river” should replace “data lake.”
Milind: All right, sure. I’m Milind Bhandarkar and I’m the Chief Scientist at Pivotal. The beer today is actually from my native place in India. Kingfisher is a huge alcohol empire in India. Now they have a diversified … they have an airline also.
Stefan: I was going to say that. Even that is my favorite. I want to try that airline. I don’t know why the German breweries don’t have airlines.
Milind: Yeah, that’s where it is. There are a bunch of other brewing companies but I think Kingfisher is the most popular in India.
Stefan: It’s the most popular with me. If I go somewhere and they have different choices I would rather go for Kingfisher than for an American beer because it’s a very much lager style beer.
Milind: Yeah, it is, it is.
Stefan: Oh and it’s so good and refreshing.
Milind: I was trying to make a decision whether to bring IPA but I [1:23].
Stefan: Oh you know me long enough.
Milind: IPA also has a connection with India. They had to make it stronger because beer was never manufactured or brewed in India. When the British ruled India they used to get all their beer transported from Britain all the way to India.
Stefan: But they only have miserable beer.
Milind: In transport the fizz goes out. The alcohol content will go down so they had to make it … so India Pale Ale actually is extra hoppy.
Stefan: Extra hoppy, yeah, yeah.
Milind: Exactly, exactly. Is it okay?
Stefan: Well …
Milind: Eight percent alcohol versus 5 percent, I would go with 5 percent.
Stefan: It’s just like a real … you can look through it. Cheers. Thank you for coming. Just really good and refreshing. Tell me a little bit more what you do at Pivotal. [2:22]
Milind: As Chief Scientist, basically I look out 2 years or maybe even 18 months into the future to see which technologies out there are making an impact on the big data space in general. Basically, like a strategist, I build our direction toward those technologies. Just to give you an example, I’ve been looking at Spark for almost 2 years now, since it came out of the RAD Lab and then the AMPLab. I’m really trying to see what the use cases for Spark are in our big data stack and how we can integrate Spark so that it is not just another project running on the Hadoop distribution but is really integrated within the end-to-end data workflows that the distribution gets used for in most of our common use cases. That’s the kind of thing that is my day-to-day job.
Stefan: Pivotal has a kind of a unique situation given the big backing, the big company and a lot of historical technology, the Greenplum technology and of course the Greenplum experience. What are the components in the mix that makes Pivotal so strong in these certain areas and what are the advantages? [3:53]
Milind: Sure. Whenever I go and talk to customers I basically say, “You know our Pivotal Hadoop distribution is just about a year old, but the products in there have a long history. They’re not just 1 year old.” One of the 2 major components of the Pivotal Hadoop Stack today is HAWQ, which is essentially the Greenplum database running on top of HDFS, storing data natively within HDFS. The recent one that we announced is GemFire XD, which used to be called SQLFire. It is sort of an in-memory SQL processing engine, also moving to do persistent storage on top of HDFS.
Stefan: You have the super fast, 10-year super mature SQL engine with Greenplum in there and now you have that kind of maybe a Storm kind of but more mature in memory SQL? [4:47]
Milind: Yeah, exactly. So, in-memory SQL. It actually came out of a company called GemStone, an almost 12-year-old company that was acquired by VMware in 2010. They are in a couple of spaces. The space of in-memory data grids, IMDG, is where they primarily came from. They looked at the storage engine that was built as an IMDG and asked whether they could implement a SQL front end on it. They essentially implemented a SQL front end that stores the data in GemFire. GemFire was the original IMDG, the in-memory data grid engine.
We took that and we basically replaced its persistence layer, which used to use only local disks (which basically means on failure you have to replicate and all these things), with HDFS. Now the use case that is enabled by such an architecture is that you have your high-speed data ingestion coming in, right? That gets streamed into SQLFire. For the first 5 minutes, 10 minutes, as long as the data fits in memory, you can actually query at in-memory speeds-
Stefan: As it comes in?
Milind: As it comes in.
Stefan: Yeah, that’s awesome.
Milind: Then as that data ages that data then basically gets on to the persistent store on to HDFS.
Stefan: You maybe even have multiple layers. You have a super hot layer, you have a warm layer –
Milind: Warm layer and then the cold layer, that’s basically the –
Stefan: The hot layer is in memory and then the warm layer is what? [6:13]
Milind: It’s actually directly in HDFS. In fact, there is actually a local disk layer as well. We haven’t ripped it out. On the local disk layer there is an asynchronous event queue listener listening. So within a few minutes that data gets on to HDFS, which then becomes queryable not only by GemFire itself but also by MapReduce, by HAWQ, by whatever technologies you have, all actually querying the same data. If you see the theme of Pivotal HD, it’s basically bringing all these data sources into a single, now the marketing term is data lake, we call it data dump or data deposit, whatever you want.
Stefan: No, I disagree. It’s not a lake, it’s not an ocean even though we all name this data ocean, right? I would argue it’s a data river because there are springs and there are many small springs. There’s your mobile app, there’s your social marketing, there’s your log files and they’re flowing together and forming maybe rivers and eventually it becomes the Colorado River. That’s the way. Eventually people build water power and this is where you extract the value, here’s where your ROI is. [7:37]
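The hot and cold tiering Milind describes here can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not GemFire XD's API: the class name `TieredStore`, its methods, and the `hot_ttl_seconds` parameter are all hypothetical, a plain dict stands in for HDFS, and the local disk layer is left out. Writes land in memory and are also placed on an event queue; a background listener drains the queue into the cold store, and aged rows are later dropped from memory.

```python
import queue
import threading
import time


class TieredStore:
    """Hypothetical hot/cold store: memory first, async flush to a cold tier."""

    def __init__(self, hot_ttl_seconds, clock=time.monotonic):
        self.hot_ttl_seconds = hot_ttl_seconds
        self.hot = {}    # key -> (ingest_time, value); the in-memory tier
        self.cold = {}   # stand-in for HDFS (an assumption for illustration)
        self._clock = clock
        self._lock = threading.Lock()
        self._events = queue.Queue()
        # The listener thread plays the role of the async event queue listener.
        threading.Thread(target=self._drain, daemon=True).start()

    def ingest(self, key, value):
        """Write to memory and enqueue the row for asynchronous persistence."""
        with self._lock:
            self.hot[key] = (self._clock(), value)
        self._events.put((key, value))

    def _drain(self):
        """Persist queued rows to the cold tier, one event at a time."""
        while True:
            key, value = self._events.get()
            with self._lock:
                self.cold[key] = value
            self._events.task_done()

    def evict_aged(self):
        """Drop rows older than the TTL from memory once they are persisted."""
        self._events.join()  # wait until every queued row reached the cold tier
        now = self._clock()
        with self._lock:
            aged = [k for k, (t, _) in self.hot.items()
                    if now - t >= self.hot_ttl_seconds]
            for k in aged:
                del self.hot[k]

    def query(self, key):
        """Return (value, tier), preferring the in-memory copy."""
        with self._lock:
            if key in self.hot:
                return self.hot[key][1], "hot"
            if key in self.cold:
                return self.cold[key], "cold"
        raise KeyError(key)
```

Because the cold tier is an ordinary shared store, anything else that can read it sees the same rows, which is the point made about MapReduce and HAWQ querying the same data.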
Milind: I like that analogy.
Stefan: I think the interesting thing about this is that you have multiple dimensions to that. You have location. You need to transport the data because it’s somewhere here and eventually the big power is over here, but you also have time with this as well. As the data is generated, how fast do you get it over here? Anyhow, my new … the new term coined by me: data river.
Milind: Okay, no, but it’s interesting you say springs here, right? This is the layer at which data gets ingested. Another part of the whole Pivotal Hadoop Stack is actually Spring XD which is –
Stefan: I didn’t mean that Spring though, I mean a water spring, a data spring, but I’m a big fan of Spring. A fun fact: I looked at JBoss for a while with the whole Hibernate and then Spring integration. In the good old times Spring was the killer of JEE. I wonder what the killer of … maybe Hadoop is the killer of the old JEE … the new, the heavy, a lot of implementing, kind of, maybe, yeah?
Milind: People would like to put it this way: the flexibility that Hadoop actually provides you. I would say the old large data warehousing platform, that’s more like the JEE that the flexible Hadoop is trying to encroach upon or –
Stefan: Every time I say the same thing: I’m missing inversion of control, which was the big breakthrough for Spring. If Hadoop had inversion of control we would already be 5 years ahead because we could plug in storage systems, transport, we could carry … well, we’re kind of getting there. Anyhow …
Milind: Yeah, but the refactoring or the rearchitecting of Hadoop has, I think, slowed down because of the maturity of use cases, especially in the early Web 2.0 companies. I mean, having to change interfaces or improve APIs in Hadoop actually became bogged down in this several-month-long upgrade cycle, like at Yahoo.
Stefan: Yeah, of course because it was inversion of control from the first half where you could just say, “Well, okay…” what you can now do with YARN I think we’re going the right direction where we say, “Oh, I want to run this on MapReduce 2.0, on MapReduce 5.0” and you can just plug them in. [10:02]
Milind: Correct, exactly.
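The inversion of control idea Stefan keeps returning to can be shown with a small Python sketch. Everything here is hypothetical and for illustration only: `Storage`, `Engine`, `Job` and the concrete classes are made-up names, not Hadoop, YARN or Spring APIs. The job does not construct its own storage or processing engine; both are handed in from outside, so either can be swapped without touching the job.

```python
from typing import Dict, Iterable, List, Protocol


class Storage(Protocol):
    """Anything that can hand records to a job (hypothetical interface)."""
    def read(self) -> Iterable[str]: ...


class Engine(Protocol):
    """Anything that can process records (hypothetical interface)."""
    def run(self, records: Iterable[str]) -> Dict[str, int]: ...


class InMemoryStorage:
    def __init__(self, lines: List[str]) -> None:
        self._lines = lines

    def read(self) -> Iterable[str]:
        return list(self._lines)


class WordCountEngine:
    def run(self, records: Iterable[str]) -> Dict[str, int]:
        counts: Dict[str, int] = {}
        for line in records:
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
        return counts


class LineCountEngine:
    def run(self, records: Iterable[str]) -> Dict[str, int]:
        return {"lines": sum(1 for _ in records)}


class Job:
    """The job receives its collaborators; it never instantiates them itself."""

    def __init__(self, storage: Storage, engine: Engine) -> None:
        self._storage = storage
        self._engine = engine

    def execute(self) -> Dict[str, int]:
        return self._engine.run(self._storage.read())


if __name__ == "__main__":
    data = InMemoryStorage(["big data", "big brews"])
    print(Job(data, WordCountEngine()).execute())
    print(Job(data, LineCountEngine()).execute())
```

Swapping `WordCountEngine` for `LineCountEngine` changes what runs without any change to `Job`, which is the kind of pluggability the conversation attributes to YARN.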
Stefan: Tell me a little bit more about what you did before Pivotal. I mean we know each other for a little longer, right? [10:14]
Milind: 10 years, that’s true. I can’t believe it’s 10 years, right?
Stefan: It’s unbelievable, right?
Milind: It’s unbelievable, yeah.
Stefan: Hadoop made me all these grey hairs. I didn’t have grey hairs.
Milind: Same thing happened to me. I mean I actually jokingly say upgrade at Yahoo from Hadoop 0.18 to Hadoop 0.20 made me lose half my hair.
Stefan: I can see that though. That was a big thing.
Milind: That was a big thing. We got rid of Hadoop On Demand and we went to the new capacity scheduler. That literally deserted …
Stefan: You already said you worked at Yahoo, but tell us a little bit more. When did you join Yahoo, for how long, what part did you play there? [10:58]
Milind: Sure. I would divide my career into 2 halves, almost equal halves now. The initial part, I would say from ‘91 to 2005, was mostly about high performance computing. After graduating in India I was fortunate to work on a project that the government of India had launched in their Department of Electronics to build the first indigenous supercomputer in India, because the weather department could not get … because there were export controls on how much the US could export in terms of computing hardware. The sale of Cray … I believe it was X-MP at that time, Cray X-MP was blocked by the US government, you could not sell it to India.
Stefan: You guys could do without that one. I mean that was the best thing you could buy for money but it wasn’t great if you looked at it right now.
Milind: No, I mean if you look at it, right? I mean yeah, that was one of the first very innovative supercomputers and things like that, but if you look at the amount of compute power that it had, it probably is much less than my laptop today.
Stefan: Yep, it’s amazing, right? [12:12]
Milind: It’s amazing. I worked for a couple of years on that project. Our team built a full supercomputer called PARAM in India. At that time, after that project finished, I was like, “Okay, where do I go next?” I came to the United States, University of Illinois at Urbana–Champaign, to do my PhD in parallel computing. I spent 8 years there, too long to do my PhD. In 2002 I graduated and I was getting tired of academics, having to write all these research proposals or getting NSF grants and all these things. I basically just took a complete U-turn and said, 2002, I’m not going to have anything to do with research.
At that time in the Valley there was only one company hiring, which was Siebel Systems, so I joined Siebel to do their application server, load balancing and applications. I spent 1-1/2 years there and joined a startup, PathScale, and we actually built a message passing system on top of InfiniBand, so 1.3 microsecond latency user space to user space.
Stefan: Nice, in 2000?
Milind: This was 2003, 2003-2004.
Milind: As I was winding down that project an ex-colleague of mine, actually he was a grad student at University of Illinois as well Sameer Paranjpye, he was at Yahoo at that time and he basically said, “Hey, we are launching this project with Eric. We are going to revamp the entire search content engine.”
Stefan: Yeah, and it happened to be a few tens of thousands of computers. Do you want to be a part of that? [10:03]
Milind: At that time it was around … let me remember, I think it was around 1000 machines running the web map itself then there were some 3000 machines-
Stefan: That was huge.
Milind: Yeah, that was huge.
Stefan: I’m like, “Oh, I want to have 10 minutes on those machines, what I could do.”
Milind: Then there was an Echelon project. They’re doing indexing and then there was obviously the search engine. All combined there would be something like 6,000 to 7,000 machines that were doing that at that time. Basically I said, “Okay, when do I get my hands on to this machine?” Basically I joined Yahoo.
We started working on a project called Juggernaut. Juggernaut was essentially the Google MapReduce and GFS papers. Why doesn’t Yahoo have an infrastructure like that? [14:46]
Stefan: Eric talked about that a little bit.
Milind: Right, exactly. I was the first one hired from outside of Yahoo into that team. Eric and Sameer were managing those teams. Owen was the first one to join, second one was me.
Stefan: But Owen was already in Yahoo-
Milind: Owen was already working on…
Stefan: He did NASA and Yahoo, right?
Milind: Yeah, he was already working on web map at Yahoo at that time.
Stefan: Hadoop is rocket science.
Milind: It is rocket science, absolutely.
Stefan: Based on Owen.
Milind: Yeah, so Owen moved from NASA. In Illinois I used to work in a center called Center for Simulation of Advanced Rockets.
Milind: The first 2 hires, rocket science. I mean I did not understand rockets. I basically just said, “Tell me what programs to write and I will write those.”
Stefan: I didn’t know that. [15:32]
Milind: Amazing thing, I mean of the first 6 hires into our team, 4 had PhDs: Konstantin Shvachko, Hairong Kuang, me and Owen. You can basically see it was very rocket science-y stuff, it was an advanced science project that we were running in there. It has been an amazing ride. I was there for 5-1/2 years. I contributed some initial work in Hadoop. I think my first patch went in 0.1 or 0.1.1 or something like that. Hadoop Record I/O, the whole serialization system that was … I think it’s still used in Yahoo. That was my doing.
The one that I’m most proud of, yeah, the one … that came from a project called JUTE, by the way. As part of this Juggernaut we actually started building a Juggernaut MapReduce which was primarily Owen’s stuff. JFS was Konstantin Shvachko and Hairong, and I was working on a project called JUTE which was … I like to come up with names so it was Juggernaut Utility Table Environment. That’s what JUTE actually stood for. We basically said, “Okay, let’s take this …” it was actually the full-fledged project. It was supposed to do the whole columnar layout in the JFS, Juggernaut File System.
After we adopted Hadoop in December 2005 we basically said, “Okay, let’s take the serialization piece of JUTE that we have already built and let’s move that under Hadoop as Hadoop Record I/O.” That was my doing.
The one that I’m most proud of is actually some code that I wrote in order to debug an issue in HDFS, which is the Pi Estimator program. The initial version of the Pi Estimator, so you start a Hadoop cluster and basically say, “Okay, let me submit a Pi Estimator example.” That’s my doing.
Stefan: This still runs today, right?
Milind: It is still run today. It has actually changed a lot now from the initial versions because Douglas took a stab at it later and someone else, Nicholas. He also took a stab. I think Nicholas is the one who has completely redone it now.
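For readers who have never run it, the idea behind a pi estimator job can be sketched as a Monte Carlo computation split into map tasks and a single reduce. This is a minimal sketch in plain Python, not the code of the Hadoop example: the function names, the per-task seeding and the use of ordinary function calls in place of a cluster are all assumptions made for illustration. Each map task throws random points into the unit square and counts how many fall inside the quarter circle; the reduce step adds up the counts and scales the ratio by 4.

```python
import random
from typing import List, Tuple


def map_task(seed: int, samples: int) -> Tuple[int, int]:
    """One hypothetical map task: count points inside the quarter circle."""
    rng = random.Random(seed)  # per-task seed keeps the sketch reproducible
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, samples


def reduce_counts(partials: List[Tuple[int, int]]) -> float:
    """The reduce step: combine per-task counts into one estimate of pi."""
    total_inside = sum(inside for inside, _ in partials)
    total_samples = sum(samples for _, samples in partials)
    return 4.0 * total_inside / total_samples


def estimate_pi(num_maps: int, samples_per_map: int) -> float:
    """Run the map tasks one after another, then reduce their outputs."""
    partials = [map_task(seed, samples_per_map) for seed in range(num_maps)]
    return reduce_counts(partials)


if __name__ == "__main__":
    print(estimate_pi(num_maps=8, samples_per_map=20_000))
```

A job like this touches scheduling, task execution and result collection while needing almost no input data, which is one reason such a program is handy for smoke-testing a cluster.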
Stefan: Did someone discover more pi numbers with your program? [17:55]
Milind: Oh no, that was Nicholas. That was running a Hadoop MapReduce program in order to find a particular digit of pi. Using some 1300 or 1500 machines, something like that, he discovered 2 billion digits of pi or something like that. There was a lot of discussion on dev random, which used to be a Yahoo internal mailing list, about whether finding digits of pi is really a good use of them.
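The conversation does not say which formula that program used. One well-known way to compute a particular digit of pi without computing the ones before it is the Bailey–Borwein–Plouffe (BBP) formula, which works in base 16. The sketch below is an illustration of that technique under that assumption, with hypothetical function names; it relies on ordinary floating point, so it is only trustworthy for small positions.

```python
def _bbp_series(j: int, n: int) -> float:
    """Fractional part of sum over k of 16**(n - k) / (8*k + j)."""
    total = 0.0
    # Terms with k <= n: modular exponentiation keeps the numbers small.
    for k in range(n + 1):
        denom = 8 * k + j
        total = (total + pow(16, n - k, denom) / denom) % 1.0
    # Terms with k > n shrink by a factor of 16 each step.
    k = n + 1
    while True:
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        total += term
        k += 1
    return total % 1.0


def pi_hex_digit(n: int) -> str:
    """Hex digit of pi at position n after the point (n = 0 is the first)."""
    x = (4.0 * _bbp_series(1, n)
         - 2.0 * _bbp_series(4, n)
         - _bbp_series(5, n)
         - _bbp_series(6, n)) % 1.0
    return "0123456789ABCDEF"[int(x * 16)]


if __name__ == "__main__":
    # pi in hexadecimal begins 3.243F6A88...
    print("".join(pi_hex_digit(n) for n in range(8)))
```

Because each position is computed independently, the work divides naturally across many machines, which fits the MapReduce setting described in the conversation.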
Stefan: You made it into the books, right?
Milind: Yeah, yeah, exactly, exactly.
Stefan: Great. Well, we will come right back and talk a little bit more about the good fun old times and then about a little more about what you do at Pivotal today. Thank you for joining.
Milind: Thanks, thanks for inviting me.