Big Data & Brews: How Pivotal Works
You’ve probably heard a lot about Pivotal over the past year, but have you really gotten to understand how it works and why it’s different from traditional Hadoop distributions? This week, just one week after their one-year anniversary, I thought we’d share a snapshot segment of how Pivotal works, according to their Chief Scientist, Milind Bhandarkar. Enjoy:
Stefan: How is Pivotal composed? What are the layers and what are the tools … maybe you can lay it out for me.
Milind: Yeah I could, right.
Stefan: Let me put the beer on my side here, so I have it handy.
Milind: Sure, okay. At the bottom layer, of course is HDFS, right.
Stefan: It’s native HDFS as …
Milind: It’s native … it’s absolutely. It’s basically open source Apache HDFS. Of course now there is a whole movement of Hadoop-compatible file systems. There are some from our partner at the … our parents, in particular EMC. We have support not only for open source HDFS, but for other scale-out file systems like Isilon and, soon to be, an object storage system called ViPR. [0:46]
Stefan: Oh cool.
Milind: It’s not really … from the application point of view, it’s …
Stefan: Doesn’t add up yeah.
Milind: Just looks like HDFS. Right, okay that’s basically the thing. Obviously you have the whole MapReduce stack which has Hive … Okay I’m now going into 90 degrees. Hive, Pig, Mahout etc. so that’s one stack. Obviously for all of these like other distributions, we have our own installation and configuration manager and a monitoring and management system that we call as the Pivotal Command Center that manages this entire stack.
Of the two big additions that we made to this stack over the years, the first is called HAWQ. HAWQ, as I said, is the Greenplum database running on top of HDFS. Now as part of … since we took the Greenplum database engine, the execution engine, and put it on top of HDFS, yeah.
Stefan: So this is Greenplum right? This is the …
Milind: This is Greenplum database.
Stefan: The fast Greenplum that is incredibly fast.
Stefan: It has security, that has monitoring, that has all the tool integration.
Stefan: Why wouldn’t I use that? Instead of something like this or one of the many wannabe Hives?
Milind: You should be, that’s basically the thing. Earlier we used to say that okay performance is our main distinguishing factor. Really the whole ecosystem of tools and the connectors from outside of HAWQ [2:20]
Stefan: … on top of that, there’s this running on top of that and that, and Informatica pumps in data. You basically can … you have Greenplum and you can basically say “I replace Greenplum with HAWQ and what I get for free is that it runs on Hadoop.”
Milind: It runs on Hadoop, exactly.
Stefan: My migration from, like, a traditional MPP database with Greenplum to one where I maybe use Hadoop could be absolutely pain free.
Milind: Absolutely pain free.
Stefan: I just basically replace this and I’m done. Then I have the opportunity to do all that magic stuff, but boom it’s there.
Milind: Even with this though, because of HDFS’ semantics and limitations on in-place updates etc., we had to drop a few pieces from Greenplum database to get this working. We used to have a table called the heap table, which would actually organize the tables in a classic B-plus tree. Managing B-plus trees on HDFS is a pain, to say the least.
That is why we have moved to the whole append-only tables. The execution engine essentially remains identical.
Milind: Then we have a project that has been going on for almost 4 years now with UC Berkeley and several other universities for a package of machine learning libraries called MADlib. That runs inside of HAWQ, so HAWQ is basically the execution engine plus a bunch of UDFs, PL/R, all the UDFs written in various languages, plus MADlib, okay. That’s sort of for the analytics use cases.
Stefan: The only thing … if I replace my Greenplum with HAWQ and have Hadoop basically for free in there. The only thing I’m kind of losing or what I have to move to is append only tables. [4:11]
Milind: Yes, that’s right.
Stefan: I mean you will have that anyhow with Hadoop.
Milind: That is correct.
Stefan: All my ETL infrastructure, my analytics infrastructure … everything just works.
Milind: Yeah absolutely.
Stefan: I basically do the trick where someone has all the china on the table and you just pull out the thing and it’s still right there.
Stefan: That’s cool.
Milind: Over the last year, since Greenplum and various projects from VMware also merged into Pivotal, we ported GemFire, which is actually the SQLFire part of GemFire with the SQL query interface. That is rebranded as GemFire XD. What GemFire XD is, is basically an in-memory data store with a SQL query engine. These are the two components that are sort of special in our Pivotal Hadoop distribution, which are not available from elsewhere.
The rest of the things … if you want to just use Pivotal Hadoop distribution for running your MapReduce and Hive, yes you can do that too.
Stefan: I get kind of an SAP HANA that sits on top of freaking fast Greenplum and I get the Hadoop under, and I basically leave everything as is, my whole infrastructure. And instead of writing into Greenplum tables now for free, I get HAWQ tables and HDFS and Parquet, that’s as you said early on? [5:39]
Milind: It is now. With Hadoop 2.0 in the PHD, or Pivotal Hadoop 2.0, it’s Parquet.
Stefan: It’s Parquet and then I can use every MapReduce framework like Mahout to go against… [5:54]
Milind: Yeah absolutely right. There’s only one correction that I would like to make. GemFire XD is actually optimized for gets and puts, unlike SAP HANA. HANA from its beginning was engineered for what they call OLXP, Online Everything Processing. OLTP and OLAP combined, right? It is actually much more efficient, I think, based on all the reports that I heard, to do scan-based workloads in SAP HANA. Whereas the points get sent … The rapid data ingestion as well as gets and puts actually are pretty well optimized in GemFire XD.
Stefan: Well, that’s kind of more critical in that environment anyhow right?
Milind: Exactly, because the scan … For scan-based workloads, when you are scanning large amounts of data, we already have a product which is optimized for that, yeah. That’s what the entire Pivotal HD looks like. Now on the top, we basically are working on a project called Spring XD and Spring XD is … since you have worked with Spring, it’s basically Spring Data, Spring Integration and Spring Batch with plugins for Hadoop and everything else.
Stefan: I write a Spring application and I basically write to Hibernate or to GemFire.
Milind: Yes, exactly.
Milind: That’s basically the whole goal. Now Spring XD is I think at what we call as the release candidate 6. It’s not yet GA, but soon … I think by the end of this year it should be GA. [7:23]
Stefan: Can I just take my Spring application that today writes to an RDBMS, put your framework under it, and I’m done?
Milind: Yeah absolutely.
Stefan: It writes into Hadoop but I also have the performance to pull out because of this?
Milind: Exactly, so both of these have JDBC drivers; both HAWQ as well as GemFire XD have JDBC drivers, right. The data, when it gets retired from GemFire XD, actually lands in HDFS so that it can be ingested back into HAWQ as well. That all …
Stefan: That makes a really strong enterprise application, data-driven enterprise application story where you guys say … “You have your Spring application or you have your environment that’s already there. By the way, we did the tablecloth trick where you just replace it and nobody even notices it.”
Milind: Exactly yeah.
Milind: That’s basically the thing. You will basically see more things here, so GraphLab, Open MPI, all these things, Spark, in the future you will see them there.
Stefan: Come there?
Stefan: In your own words, for the customers who are after Pivotal, what is the reason they’re going for Pivotal?
Milind: I think the main reason that they are going for Pivotal, actually there are three main reasons now. Because also HAWQ, GemFire XD that came out recently. This is the main reason, and the second one is our support pedigree from EMC.
Milind: A lot of our support …
Stefan: You might get that before it’s not..
Milind: Being part of EMC, Greenplum in particular was part of EMC. Actually a lot of support people have come from that support grind of EMC. That whole …
Stefan: You can call them at 2:00 a.m.
Milind: If you call them at 2:00 a.m.
Stefan: They pick up.
Milind: You can call the execs at 2:00 a.m. And they will wake up people, that’s how it works sometimes, yeah absolutely.
Stefan: All right, I want to have that phone number.