One of my favorite parts of our Big Data & Brews series is when our guests use the chalkboard to diagram. This week, let’s take a look back at a couple of those chalk talks for a special “How it Works” episode: featuring Concurrent, MapR, and Pivotal.
Stefan: As you described and what I think is so fascinating, everybody wants to do the in-memory, machine learning, real-time stream… It sounds really attractive, and I’m sure it makes a hell of a resume if you have that background, the check boxes. But I think the big value in a lot of companies is just bringing the data together and getting a 360-degree view of your process, [00:16:00] your customer, your behavior.
Supreet: Yes, absolutely. So if I could use this…
Supreet: Let’s say traditionally you had your enterprise data warehouse and I was doing some batch or bulk analytics. The simplest one is I’m doing a simulation for a fraud scanner. I’ve developed a new rule that says if this, this, and this happens, it’s a fraud. But just to make sure there are not a lot of false positives, and I don’t make all of my customers angry, let me do a…
Stefan: Okay, let’s not talk about credit card companies. With my bank that happens quite frequently. I guess they should use Cascading.
Supreet: Change your bank to mine. (Laughter)
So the aim is the more you can run your simulation on historical data to check off false positives, the better you know what this data is. But at the same time, you’ve detected a new attack pattern and time matters. So the more time it takes for simulation, the more time fraud is happening as well.
So you have an enterprise data warehouse and there is this much data that fits on it. Let’s say it is 3 months of data. It takes maybe 14 hours to run the simulation on that.
Stefan: And meanwhile you’re bleeding money over here.
Supreet: Meanwhile you’re bleeding money because you want to maintain your customer relationship and the trust that you’re not taking a kayak trip to Papua New Guinea… (Laughter)
So the first set of use cases where the big data makes a very interesting play is… This is 3 months and [00:18:00] let’s say this takes 12 hours to do. Move this to Hadoop.
The first thing is what if instead of 3 months you can run it on 12 months? That doesn’t fit in the enterprise data warehouse. But now it can if it moves over here. And on top of that, instead of 12 hours, it takes 3 to 4 minutes to do it.
It’s a very simple query; it’s not machine learning. It is porting your existing queries written in PL/SQL or whatever over to big data.
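To make the simulation Supreet describes concrete, here is a minimal sketch of replaying a candidate fraud rule over historical transactions and counting how often it would have flagged legitimate customers. The `Transaction` fields, the threshold in `rule`, and the sample data are all hypothetical; the conversation does not specify the actual rule or schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    amount: float
    country: str
    home_country: str
    was_fraud: bool  # hypothetical label, known only in hindsight


def rule(tx: Transaction) -> bool:
    """Hypothetical rule: a large purchase made outside the home country."""
    return tx.amount > 1000 and tx.country != tx.home_country


def backtest(candidate_rule, history):
    """Replay a rule over historical transactions and tally the outcomes."""
    counts = {"true_pos": 0, "false_pos": 0, "false_neg": 0, "true_neg": 0}
    for tx in history:
        flagged = candidate_rule(tx)
        if flagged and tx.was_fraud:
            counts["true_pos"] += 1
        elif flagged:
            counts["false_pos"] += 1  # an angry, legitimate customer
        elif tx.was_fraud:
            counts["false_neg"] += 1  # fraud the rule missed
        else:
            counts["true_neg"] += 1
    return counts


# Tiny made-up history; a real run would cover months of transactions.
history = [
    Transaction(2500.0, "PG", "US", True),
    Transaction(1800.0, "FR", "US", False),
    Transaction(40.0, "US", "US", False),
    Transaction(900.0, "US", "US", True),
]
result = backtest(rule, history)
```

The logic stays this simple whether it runs over 3 months or 12 months of history; what changes is how much data the platform can hold and how quickly it can scan it.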
Stefan: That’s a Hadoop environment.
Supreet: Yes, that’s Hadoop.
Stefan: And the color of that data warehouse was red? Or blue? (Laughs) The reason I’m asking is I would expect that a big financial services company with a credit card, that has unlimited credit and could just buy more… What was the technical challenge to not scale it out further? Was it just limit at half?
Stefan: So you just hit the maximum limit on [inaudible 00:19:13].
Supreet: Most of the enterprise data warehouses, despite the investments that are made, very quickly — and this is not specific to the organization that I was in — but in talking to enterprise companies, they end up hitting their peak capacity to support SLAs very quickly. They plan for 4 years but it’s happening within 2 years. So to say that I’m going to run that query on 12 months of historical data instead of 3 months… And this is a very simple query.
The second example is I’m trying to develop a new fraud algorithm and the data, let’s say there is a source system… Oh, that’s wet too.
Stefan: [00:20:00] That’s like in my old school.
Supreet: Right. A wet sponge.
Stefan: It’s the only thing we had in East Germany. (Laughter) And a big ruler that we got a…
Supreet: Okay, the source system. You can call any one of these sensor systems. This is not specific to machine learning, and these days the Internet of Things is a big deal. It’s a lot of data. And a lot of data doesn’t just mean the speed, the pace, but it could mean it’s thousands of attributes available, depending on the context.
Stefan: Depending on the individual user.
Supreet: Yes. Let’s say there’s an event like a swipe and [inaudible 00:20:47] variables. Over here, there’s only limited space. And only the most key variables are kept here. But what if I could keep additional variables direct from my algorithm? Again, you’re not going into machine learning. And it’s not just one.
If I do a join with another data set over here, it becomes really expensive. So the breadth and the depth, just move all these over there. That is the most immediate impact. You can show a quick ROI for that and then take it from there.
Stefan: And what’s the secret sauce behind all of this? You touched a little bit on the fact that you built this from scratch, but where does the magic of MapR start, and where is it pulling it all the way through? [18:32]
Tomer: Let’s use the whiteboard here, and look at what is the MapR distribution, what is the overall architecture. So it all starts with, maybe I’ll move out of the way so it’s easier to see. It all starts with a collection of over 12 different, maybe 15 even now, open-source Apache community projects. So we have things like, we have Hive here, Pig, Flume, we just announced YARN. So many different projects here, and these are just a few of them, there is ZooKeeper and Sqoop and Stinger Tez coming to play. We’re going to run Impala on the platform, that’s part of it as well.
And what we do is we’ve added our own data platform, so underneath here you have the MapR data platform. So this is, and I call it MDP here for short, but this is the MapR data platform. And what this is really is two things. First of all, it’s a distributed filesystem and it’s also a NoSQL database in one. So think of this as, we could call this, that’s going to be hard to see but call this MapRFS here, and MapRDB here, but really it’s a single service. So files and tables are integrated, they are part of the namespace, they are protected in terms of snapshots and DR the exact same way. So it’s one process running on every node, that’s the MapR data platform.
From a file standpoint, it’s fully read/write, so you can – you don’t have the limitations here, you are probably aware of with HDFS. So you can modify files, you can do concurrent reads and writes, then we also expose that through an NFS interface. So if you look at the interfaces for the MapR data platform, clearly we have the, of course, the Hadoop filesystem API. So that’s the Hadoop filesystem API. And this is the same API that’s exposed by HDFS or Amazon’s S3 implementation or Google cloud storage, their integration with Hadoop.
And we also have an NFS interface here, which is the standard storage interface. And customers use that extensively for everything from loading data into the cluster, so they could, for example, maybe export data from – a lot of our customers they’ll export data from Teradata using standard Teradata utilities. Or they use things like R and SAS, and because we expose a standard file interface, just like any network-attached storage, you just mount the MapR cluster, and you get one giant mass, that’s it. [21:10]
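The practical point of a standard file interface is that ordinary tools need no special client. As a minimal sketch, assuming the cluster is mounted somewhere on the local filesystem, plain file I/O is enough to load and read data. The directory layout and file name below are hypothetical, and a temporary directory stands in for the real mount point so the example runs anywhere.

```python
import csv
import tempfile
from pathlib import Path


def export_rows(mount_point, rows):
    """Write rows as CSV under the mount using ordinary file I/O."""
    target = Path(mount_point) / "exports" / "accounts.csv"  # hypothetical path
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["account_id", "balance"])
        writer.writerows(rows)
    return target


def read_rows(path):
    """Read the CSV back exactly as any other local file would be read."""
    with Path(path).open(newline="") as handle:
        return list(csv.reader(handle))


# Stand-in for a real NFS mount point (an assumption for this sketch).
with tempfile.TemporaryDirectory() as mount_point:
    written = export_rows(mount_point, [(1, "10.50"), (2, "99.00")])
    loaded = read_rows(written)
```

Nothing in the code knows or cares that the destination is a cluster, which is the property Tomer is describing.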
Stefan: Wow, that’s cool.
Tomer: And actually before coming here, I was watching a video you did with Eric Nachbar. He described the reasons he was using MapR and he loved the fact…
Stefan: Those integration pieces are so critical for customers. If you go to a customer and say, “Oh great, we have this magic cool storage thing, and by the way you have to write a few Perl scripts and some Java to get data in.” They look at you like, “Uh, okay.” That’s just like 20 years back rather than step forward.
Tomer: It’s analogous to what would you rather use to keep your documents here, will you put them on an FTP server or you put them in network-attached storage? I mean it’s obviously a lot easier when you don’t have to use a tool to get data in and out. And then there is the HBase API, because that’s the standard interface in Hadoop for tables. So we expose that and that’s the way you basically read and write the behaviors.
Tomer: So that’s kind of the overall, we add our management innovation as well. So this is management, we call this our MapR Control System. That’s kind of what it is, and the nice thing here is that there is so much innovation happening here at this layer as well, so you’ll get projects like Spark and Shark and MLlib and YARN and Hive and Pig, and I could go on and on with all these different projects that are either in the distribution or coming this year as fully supported projects in our distribution. And then we add our own innovation that provides value beyond what you’re able to get. And actually broadens the use cases that are possible with this platform.
Stefan: So my understanding too is that the MapR filesystem, you guys completely rewrote also in a different language than the Hadoop distributed filesystem. And I think you guys had some DNA there, so it’s not the first filesystem that your core engineers wrote. [23:12]
Tomer: That’s one of the things we’ve done. So I think we spend over, we invest over 50% of our engineering at this layer, but we’ve certainly done a lot of innovation here, and very early on too. And M.C. Srivas, our CTO and co-founder, he actually came from Google where he was – he was on the Google Search and BigTable teams, driving that from within Google. And Google as we all know is ahead of the game here when it comes to Big Data. So he had that experience of running MapReduce and Big Data at scale, the best place possible. And then before that he was at Spinnaker Networks, which was acquired by NetApp, and their clustered filesystem.
Stefan: How is Pivotal composed? What are the layers and what are the tools … maybe you can put it on me.
Milind: Yeah I could, right.
Stefan: Let me put the beer on my side here, so I have [inaudible 00:00:12].
Milind: Sure, okay. At the bottom layer, of course is HDFS, right.
Stefan: It’s native HDFS as …
Milind: It’s native … it’s absolutely. It’s basically open source Apache HDFS. Of course now there is a whole movement of Hadoop-compatible file systems. There are some from our partner at the … our parents, in particular EMC. We have support not only for open source HDFS, but for other scale-out file systems like Isilon and, soon, an object storage system called ViPR.
Stefan: Oh cool.
Milind: It’s not really … from the application point of view, it’s …
Stefan: Doesn’t add up yeah.
Milind: Just looks like HDFS. Right, okay that’s basically the thing. Obviously you have the whole MapReduce stack which has Hive … Okay, I’m now going into 90 degrees. Hive, Pig, Mahout, etc., so that’s one stack. Obviously for all of these, like other distributions, we have our own installation and configuration manager and a monitoring and management system that we call the Pivotal Command Center that manages this entire stack.
The two big additions to this stack that we did over the years: the first is the thing that we call HAWQ. HAWQ, as I said, is Greenplum database running on top of HDFS. Now as part of … since we took the Greenplum database engine and put it … execution engine and put it on top of HDFS, yeah.
Stefan: So this is Greenplum right? This is the …
Milind: This is Greenplum database.
Stefan: The [inaudible 00:01:50] Greenplum that is incredibly fast.
Stefan: It has security that has monitoring, that has all the tool integration.
Stefan: Why wouldn’t I use [00:02:00] that? Instead of something like this or one of the many [inaudible 00:02:06].
Milind: You should be, that’s basically the thing. Earlier we used to say that okay performance is our main distinguishing factor. Really the whole ecosystem of tools and the connectors from outside of HAWQ [crosstalk 00:02:20]
Stefan: … on top of that, there’s this running on top of that and that, and Informatica [inaudible 00:02:26] in data. You basically can … you have Greenplum and you can basically say “I replace Greenplum with HAWQ and what I get for free is that it runs on Hadoop.”
Milind: It runs on Hadoop, exactly.
Stefan: My migration from like a traditional MPP database with Greenplum that I maybe have to Hadoop could be absolutely pain free.
Milind: Absolutely pain free.
Stefan: I just basically replace this and I’m done. Then I have the opportunity to do all that magic stuff, but boom it’s there.
Milind: Even with this though, because of HDFS’ semantics and limitations on in-place updates etc., we had to drop a few pieces from Greenplum database to get this working. We used to have a table called the heap table which would actually organize the tables in a classic B+ tree. And managing B+ trees on HDFS is a pain, to say the least.
That is why we have moved to the whole append-only tables. The execution engine essentially remains identical.
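The constraint Milind mentions is that HDFS does not allow rewriting bytes in the middle of a file, so structures that update pages in place are awkward to maintain. A generic way to picture append-only storage is a log where every change is a new record and the newest record for a key wins. The sketch below is purely illustrative; the `AppendOnlyTable` class is hypothetical and is not a description of how HAWQ stores or manages its tables.

```python
class AppendOnlyTable:
    """Toy table whose storage is only ever appended to (illustration only)."""

    def __init__(self):
        self._log = []  # stands in for an append-only file

    def put(self, key, value):
        # A change never rewrites an old record; it appends a newer one.
        self._log.append((key, value))

    def delete(self, key):
        # Deletion is recorded as a tombstone rather than erasing data.
        self._log.append((key, None))

    def get(self, key):
        # The most recent record for a key wins.
        for logged_key, value in reversed(self._log):
            if logged_key == key:
                return value
        return None

    def compact(self):
        # Write a fresh log holding only the latest live value per key,
        # analogous to writing a new file instead of editing the old one.
        latest = {}
        for logged_key, value in self._log:
            latest[logged_key] = value
        self._log = [(k, v) for k, v in latest.items() if v is not None]


table = AppendOnlyTable()
table.put("acct-1", 100)
table.put("acct-1", 250)  # appended as a new version, not an in-place edit
latest = table.get("acct-1")
```

The trade-off is visible even in the toy: writes are cheap sequential appends, while reads and cleanup have to cope with superseded records.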
Milind: Then we have a project that has been going on for almost 4 years now with UC Berkeley and several other universities for a package of machine learning libraries called MADlib. That runs inside of HAWQ, so HAWQ is basically the execution engine plus a bunch of UDFs, PL/R, all the UDFs written in various languages, plus MADlib, okay. That’s sort of for the analytics use cases.
Stefan: The only thing … [00:04:00] if I replace my Greenplum with HAWQ and have Hadoop basically for free in there. The only thing I’m kind of losing or what I have to move to is append only tables.
Milind: Yes, that’s right.
Stefan: I mean you will have that anyhow with Hadoop.
Milind: That is correct.
Stefan: All my ETL infrastructure, my analytics infrastructure … everything just works.
Milind: Yeah absolutely.
Stefan: I basically do the trick where someone has all the china on the table and you just pull out the thing and it’s still right there.
Stefan: That’s cool.
Milind: Over the last year, since Greenplum and various projects from VMware also merged into Pivotal, we ported GemFire, which is actually the SQLFire part of GemFire with the SQL query interface. That is rebranded as GemFire XD. What GemFire XD is, is basically an in-memory data store with a SQL query engine. These are the two components that are sort of the special sauce in our Pivotal Hadoop distribution which are not available from elsewhere.
The rest of the things … if you want to just use Pivotal Hadoop distribution for running your MapReduce and Hive, yes you can do that too.
Stefan: I get kind of an SAP HANA that sits on top of freaking fast Greenplum and I get the Hadoop under, and I basically leave everything as is. My whole infrastructure and instead of writing into Greenplum tables now for free, I get HAWQ tables and HDFS and that’s Parquet as you said early on?
Milind: It is now with the Hadoop 2.0 in the PHD or Pivotal Hadoop 2.0 it’s Parquet.
Stefan: It’s Parquet and then I can use every MapReduce framework like Mahout to go against [inaudible 00:05:54].
Milind: Yeah absolutely right. There’s only one correction that I would like to make. GemFire XD is actually optimized [00:06:00] for [inaudible 00:06:01] and puts. Unlike SAP HANA. HANA from its beginning was engineered for what they call OLXP, Online Everything Processing: OLTP and OLAP combined, right? It is actually much more efficient, I think, based on all the reports that I heard, to do scan-based workloads in SAP HANA. Whereas the point gets and puts … The rapid data ingestion as well as point gets and puts are actually pretty well optimized in GemFire XD.
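The distinction Milind draws between point operations and scans can be sketched with a plain in-memory dictionary. This is only an illustration of the two access patterns; the row layout and the helper names `point_get`, `point_put`, and `scan_total` are hypothetical and say nothing about how GemFire XD or SAP HANA are implemented.

```python
# Made-up rows keyed by id; regions alternate so the scan has work to do.
rows = [
    {"id": i, "region": "east" if i % 2 else "west", "amount": i * 10}
    for i in range(1, 7)
]

# A key-value layout: each row is reachable directly by its key.
by_id = {row["id"]: row for row in rows}


def point_get(key):
    """Fetch one row by key without touching any other row."""
    return by_id.get(key)


def point_put(row):
    """Insert or overwrite one row by key."""
    by_id[row["id"]] = row


def scan_total(region):
    """Aggregate over every row: cost grows with the size of the data."""
    return sum(r["amount"] for r in by_id.values() if r["region"] == region)


single = point_get(3)
east_total = scan_total("east")
```

A store tuned for the first two functions serves many small, fast requests, while a store tuned for the third favors reading large volumes in bulk.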
Stefan: What’s kind of more critical in that environment anyhow right?
Milind: Exactly, because the scan … For scan-based workloads, when you are scanning large amounts of data, we already have a product which is optimized for that, yeah. That’s what the entire Pivotal HD looks like. Now on the top, we basically are working on a project called Spring XD and Spring XD is … since you have worked with Spring. It’s basically Spring Data, Spring Integration and Spring Batch with plugins for Hadoop and everything else.
Stefan: I write a Spring application and I basically write to Hibernate or to GemFire.
Milind: Yes, exactly.
Milind: That’s basically the whole goal. Now Spring XD is I think at what we call as the release candidate 6. It’s not yet GA, but soon … I think by the end of this year it should be GA.
Stefan: Can I just take my Spring application that today writes to an RDBMS and put your framework under and I’m done?
Milind: Yeah absolutely.
Stefan: It writes into Hadoop but I also have the performance to pull out because of [inaudible 00:07:34].
Milind: Exactly, so both of these have [inaudible 00:07:36] rivals, both HAWQ as well as GemFire XD have [inaudible 00:07:39] rivals, right. The data, when it gets retired from GemFire XD actually lands in HDFS so that it can be ingested back into HAWQ as well. That all …
Stefan: That makes a really strong enterprise application, data driven enterprise application story where you guys say … You have your Spring application or you have your environment that’s already there. By the way [00:08:00] we did the table cloth trick where you just replace it and nobody even notice it.
Milind: Exactly yeah.
Milind: Basically the thing. You will basically see more things here, so GraphLab, Open MPI, Spark, all these things in future you will see there.
Stefan: Come there?
Stefan: In your own world the customer is after Pivotal, what is the reason they’re going for Pivotal?
Milind: I think the main reason that they are going for Pivotal, actually there are three main reasons now. Because also HAWQ, GemFire XD that came out recently. This is the main reason, and the second one is our support pedigree from EMC.
Milind: A lot of our support …
Stefan: You might [inaudible 00:08:37] before it’s not [crosstalk 00:08:38]
Milind: Being part of EMC, Greenplum in particular was part of EMC. Actually a lot of support people have come from that support grind of EMC. That whole …
Stefan: You can call them at 2:00 a.m.
Milind: If you call them at 2:00 a.m.
Stefan: They pick up.
Milind: You can call the execs at 2:00 a.m. And they will wake up people, that’s how it works sometimes, yeah absolutely.
Stefan: All right, I want to have that phone number.