Tomer Shiran from MapR sat down with me over some Israeli beer (and hummus!) to tell me what’s going on with the company and the cool things that are being done with its technology, like the Aadhaar Project in India.
Stefan: Welcome to Big Data & Brews. Today we have Tomer from MapR.
Tomer: Hi Stefan, how are you?
Stefan: I’m fascinated by all the stuff that’s on the table. Can you introduce yourself and all the goodies you brought today?
Tomer: Sure, absolutely. So my name is Tomer Shiran, I run product management at MapR. We’ve been around for, I guess, five years now, and I joined the company about four and a half years ago. And I’ve known Stefan for quite some time now. I think I’ve been at every Hadoop conference, at least the major ones, I guess except the time when my second daughter was born. Then I had to skip one but, what can you do? But yeah, as we like to do at MapR, I wanted to take it up a notch here. You have these sessions with beer, and this time we have the beer and we have hummus, so the theme here is Israel. [1:10]
Stefan: Oh, I love hummus. I in fact two nights ago made my own hummus.
Tomer: You did?
Stefan: From scratch, not canned beans or something, no, no. You take the dry beans and you cook them, you let them sit, you cook them again, all the good stuff. I’m vegan so you have to take it up a notch. That’s fantastic, I’m a big hummus fan. So what’s the brew we are drinking today, dark lager beer? [1:42]
Tomer: This is actually the most popular beer in Israel. So it’s kind of fitting with hummus.
Stefan: Oh wow, cool. Very nice. Let’s get going on the beer.
Tomer: It’s called Goldstar.
Stefan: Here we go. So what did you do before MapR? [2:00]
Tomer: Before MapR, so I was actually at Microsoft, in product management at Microsoft.
Stefan: Sorry, what was the company called?
Tomer: It’s a company that’s not so involved in the Hadoop space but does a lot of other things and, I guess, recently switched their CEO.
Stefan: I heard about that. So what area did you work in at Microsoft?
Tomer: I was actually in network security.
Tomer: So we were building an enterprise firewall VPN product, which is actually used quite a bit.
Stefan: Yeah, and before Microsoft?
Tomer: At Microsoft I was doing product management for that product, and before that I was doing engineering; I was a software engineer. And then before that I was a researcher at IBM Research in Haifa, in Israel.
Stefan: So how is the Big Data scene over there?
Tomer: In Israel?
Tomer: It’s a good question. I always go back at least once a year just for family reasons and all that, and every time I go there I spend about a week doing customer visits and meeting a lot of companies that are using Hadoop. Some of them are MapR customers now. But yeah, there is a lot of tech in Israel; I think it’s the second most densely populated area in terms of startups after Silicon Valley. So you can imagine there is a lot of Big Data use happening there. There are also OEM opportunities there, where some companies in Israel take the MapR product and make it part of their product, which they then sell to telcos, for example, and other similar examples like that. [3:41]
Stefan: It’s amazing, also for us we see a lot of use cases and companies coming to us and the whole startup scene is just fascinating. Everybody in the world tries to replicate Silicon Valley, and I think the only area that really nailed it is the Tel Aviv area.
Tomer: It is, it’s kind of Tel Aviv, actually that whole area.
Stefan: That’s amazing.
Tomer: Have you been there?
Stefan: I’m going there very soon.
Tomer: Oh you are?
Stefan: I’ve never been there, unfortunately. So cheers.
Stefan: Let’s try a good brew here. [4:14]
Tomer: Just tell me what do you think about Israeli beer?
Stefan: I love it, it’s really good. I don’t like American beers too much, we always think they taste like cooked socks, like the IPAs or something, I don’t know, I don’t get along with them. This is really good. It’s like a lager, it’s not too aggressive, not too hoppy. Good, good stuff.
Tomer: Shall I give you some hummus here?
Stefan: Oh yeah, let’s get started.
Tomer: Let’s get started with some hummus before we get into the complex things.
Stefan: I think I have a hummus addiction to be honest, it’s just so good. Kids, have more hummus, lot of protein, good stuff for you.
Tomer: That is also the best pita you can find in the Bay Area.
Stefan: Yeah? Where do you get this from?
Tomer: It’s a place in Palo Alto called Oren’s Hummus. So that was their commercial, I guess.
Stefan: Oh cool. That’s good stuff. So just traditionally go in there or …
Tomer: Oh, rip and you just tear a little bit off and you just put it in there.
Stefan: Alright. Future guests of Big Data & Brews: if you visit us, you have to keep up with the MapR folks. So let’s talk a little bit about MapR, what’s going on, any cool updates recently? [5:37]
Tomer: We just wrapped up 2013 a couple of months ago. It was a great year, I think the Hadoop market grew and we expanded a lot as a company. We’ve topped 500 paying customers now.
Stefan: Wow, congratulations.
Tomer: So it’s been a lot of work, a lot of excitement, and the number of employees has probably doubled or tripled in the last year.
Tomer: Internationally we have offices now in, I guess in Asia we have Korea, Japan, Singapore, Australia.
Tomer: Yeah, teams actually, MapR teams in all these locations. Australia, India, China now, and then across Europe, Germany, which is your favorite place, right?
Tomer: The UK, France, a team in Sweden now as well. So yeah, a lot of international expansion, it’s been fun.
Stefan: That’s pretty amazing, it’s really international. Is what people are doing in the US different from what they do in Europe, use-case wise, or are there buckets? [6:50]
Tomer: I’d say the US is probably still ahead in terms of the maturity of the customer base, although we have actually a significant number of customers in these other countries. One great use case in Japan is, it’s actually a beverage company, so this is one of the biggest companies, beer and whisky in Japan.
Stefan: Oh nice.
Tomer: They have some pretty cool use cases, so I think you would get the standard kind of marketing use cases that people do with Hadoop. They have all of those, but they also have these really cool vending machines, where they are doing image detection and they have a video camera that’s looking at you and kind of recommending a beverage to you when you walk up to it.
Stefan: Based on what they used before?
Tomer: I think they look at your image and compare it to other people that had similar characteristics, things like that. So it’s a pretty cool use case.
Stefan: That’s so Minority Report where based on your, like what was it, retina to get the advertisement? I guess we live in the future. That’s amazing.
Tomer: Yeah, we do.
Stefan: I just wish the latest Mac OS X version weren’t so buggy if we live in the future. So we see a lot of Hadoop deployments, but what really jumps out is that you guys have just massive deployments. Of all the companies we are working with or know of, that really jumps out. You guys have the hundreds-of-machines deployments. So why is that? Why are you guys kind of the first choice for gigantic Hadoop environments?
Tomer: It’s a great point. I think it’s true. It’s hard to even count how many of the 500 paying customers I mentioned are in the hundreds of servers; there are actually a lot of them. We have the largest deployment in financial services with over 1,000 nodes and …
Stefan: How many petabytes is that, 1,000 nodes?
Tomer: Those are like 12 drives and either two or three terabytes each, so that’s a lot of petabytes. I’m not trying to do the math here, I’m … [laughter]. But really, if you look at how these Hadoop users evolve, the greatest number of Hadoop users are the ones that are in dev and test. There are tens of thousands of those, if not more. And they tend to be one, two, three nodes, so a developer maybe downloading the MapR Sandbox or a VM from one of our competitors and playing around with it.
But it’s very small, and then you get to the first production use case, so typically when someone deploys Hadoop, they deploy the first use case, and that could be 10, 20, maybe 30 nodes. And there is a smaller number of those out there in the wild, but it becomes more interesting; they are doing something meaningful to the business.
And then the real production deployments, that’s where you get into multiple use cases running on the platform. Oftentimes these are hundreds of servers in a cluster, and a lot of the value that MapR brings to the table applies to that production deployment. So what we have always focused on is, let’s make Hadoop production ready and enterprise grade, so people can actually run it in mission-critical use cases.
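As a rough check on the "a lot of petabytes" remark above, the figures Tomer quotes (1,000 nodes, 12 drives per node, two or three terabytes per drive) can simply be multiplied out. This is only a back-of-the-envelope sketch: it assumes decimal units (1 PB = 1,000 TB) and identical nodes, and the replication factor of 3 is an illustrative assumption that is not stated in the conversation.

```python
# Back-of-the-envelope raw capacity for the cluster size quoted in the interview.
NODES = 1000            # "over 1,000 nodes" (treated as exactly 1,000 here)
DRIVES_PER_NODE = 12    # "12 drives"
TB_PER_PB = 1000        # decimal units: 1 PB = 1,000 TB


def raw_capacity_pb(tb_per_drive: int) -> float:
    """Raw (unreplicated) capacity in petabytes for a given drive size."""
    return NODES * DRIVES_PER_NODE * tb_per_drive / TB_PER_PB


low_pb = raw_capacity_pb(2)    # two-terabyte drives
high_pb = raw_capacity_pb(3)   # three-terabyte drives

# ASSUMPTION: three copies of every block. The interview gives no replication factor.
ASSUMED_REPLICATION = 3
usable_low_pb = low_pb / ASSUMED_REPLICATION
usable_high_pb = high_pb / ASSUMED_REPLICATION

print(f"raw: {low_pb:.0f} to {high_pb:.0f} PB")
print(f"usable at {ASSUMED_REPLICATION}x replication: "
      f"{usable_low_pb:.0f} to {usable_high_pb:.0f} PB")
```

Under these assumptions the cluster holds roughly 24 to 36 PB raw, or 8 to 12 PB of unique data if everything is stored three times.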
Stefan: What are some of the use cases you are seeing with the customers, like what is your favorite one?
Tomer: My favorite one is actually not going to be one of the more popular use cases, but it’s the Aadhaar project in India.
Stefan: Can you tell us more about it? I think that’s just a ginormous project. [10:44]
Tomer: Yeah, it’s a really cool one too and it’s really valuable in terms of what it’s doing in that country. So India has over 1 billion people living in the country, it’s something like 1.25 billion people. And one of the challenges there is that about half of the population doesn’t even have an identity. There is no social security number or anything like that, and that prevents these people from opening bank accounts, it prevents them from getting medical care, it prevents them from getting government services, government aid, things like that. It also encourages a lot of fraud in the overall system, right.
So what the Aadhaar project is doing is it’s basically building the world’s largest biometric database, and the idea is to provide every resident of India an identification so they can get government aid and medical services and open a bank account and do commerce and things like that. And it’s lifting a lot of people out of poverty.
Stefan: It’s fantastic.
Tomer: I think it’s up to about 750 million people already in the database, I think it’s about 10 petabytes of data. So for every person you have the person’s photo of the face, you have the ten fingerprints, the two iris scans. So you have all that information for every one of these people, and it’s not just collecting that information and storing it, it’s also enabling every point of service in India to be able to verify that identity, because now you need the bank and every other service provider to be able to check your identity. So that’s a system that needs to respond within 200 milliseconds at very, very high load in terms of transactions per second. So we are really happy that we are powering that from the backend, from a Hadoop and database standpoint. So that’s one of the projects I’m most excited about.
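To put the two Aadhaar figures above side by side, here is a small arithmetic sketch of the average data volume per enrolled person. It takes the approximate numbers exactly as quoted (about 750 million people, about 10 petabytes) and assumes decimal units; whether the 10 PB includes replicas, indexes or logs is not stated, so the result is an average of total stored data per person and should not be read as the size of any one biometric record.

```python
# Average stored data per enrolled person, from the figures quoted in the interview.
PEOPLE = 750_000_000       # "about 750 million people"
TOTAL_PB = 10              # "about 10 petabytes of data"
BYTES_PER_PB = 10 ** 15    # decimal units
BYTES_PER_MB = 10 ** 6

mb_per_person = TOTAL_PB * BYTES_PER_PB / PEOPLE / BYTES_PER_MB

# Captures per person as described: 1 face photo, 10 fingerprints, 2 iris scans.
captures_per_person = 1 + 10 + 2
mb_per_capture = mb_per_person / captures_per_person

print(f"about {mb_per_person:.1f} MB per person")
print(f"about {mb_per_capture:.1f} MB per capture across {captures_per_person} captures")
```

That works out to roughly 13 MB per person, or on the order of 1 MB per capture if the total were spread evenly across the thirteen captures.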
Stefan: Is that HBase, you said database, or was there more magic stuff in between, anything you can talk about from architectural perspective?
Tomer: Sure. If you look at our M7 product, that’s kind of our highest edition, so we have M3 as our free edition, M5 is our enterprise edition and M7 is our enterprise database edition. What it provides is really what I like to call an in-Hadoop database. So we’ve built it into our data platform, which has always been a distributed file system. Now it also has an integrated database in it, which exposes the HBase API but provides much, much higher performance and much more reliable, consistent low latency.
So that’s the technology that a lot of our customers now are using to kind of expand Hadoop into more operational use cases. So we all know traditionally Hadoop has been great at doing things like more analytics. Originally batch but then recently people talking more about interactive use cases. We believe in the fact that Hadoop will expand – needs to expand – to serve, not just the analytics use cases but also the operational use cases. So providing that full platform that does all of those different types of use cases on one platform. [14:00]
Stefan: That’s gigantic, and it’s such a wonderful project. When I envisioned the whole thing, it started as a little open-source project; in my wildest dreams I wouldn’t have thought you could have so much impact with technology. It’s just wonderful. What are the more standard use cases you guys see? That’s a fantastic project, but what is kind of the day-to-day bread and butter for your platform? [14:28]
Tomer: I think advertising and marketing are pretty common use cases across the board, and it …
Stefan: And is that more in the ad companies or is it more kind of the traditional big companies trying to understand their customers or…?
Tomer: It’s actually both, so you look at some of our customers like the Rubicon Project, which is the largest ad exchange in the US in terms of audience reach. And they are doing 90 billion auctions, ad auctions, every day. And each of those auctions is probably a dozen or more bids, so we are talking about trillions of events every month that are processed in the cluster, in their MapR environment, and they predict the prices the auctions are going to reach and all sorts of things like that.
But then if you look at many other customers that we have across telco and retail, these are customers that have tens to hundreds of millions of end users or end customers, and they are doing everything from better ad targeting to churn analysis, all those types of use cases.
Stefan: What kind of product enhancements does that drive for you guys? You touched a little bit on the lower latency requirements, but where do you really see Hadoop as it is today? You said a little bit about expanding into the new real-time-ish production use cases; what other functionality dimensions are driven by those use cases? [16:00]
Tomer: I think you mentioned earlier how a lot of our customers are doing big deployments that are really impactful to their business. When a company wants to do that, it needs a set of enterprise-grade dependability characteristics. So they want true high availability, one that self-heals automatically. They want real, consistent snapshots, they want disaster recovery across data centers. So every Hadoop vendor now says, “We have those things or we’ve added those things, we’ve caught up with MapR.” But there is a difference between building those into the architecture and doing something as a marketing checkbox.
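The "trillions of events every month" claim for the ad exchange follows directly from the two numbers quoted above, and a quick sketch makes the scale concrete. It treats "a dozen or more" bids as exactly 12 (so the result is a lower bound) and assumes a 30-day month; both simplifications are assumptions for illustration.

```python
# Event volume implied by the ad-exchange figures quoted in the interview.
AUCTIONS_PER_DAY = 90_000_000_000   # "90 billion auctions ... every day"
BIDS_PER_AUCTION = 12               # "a dozen or more" -> 12 as a lower bound
DAYS_PER_MONTH = 30                 # ASSUMPTION: 30-day month
SECONDS_PER_DAY = 24 * 60 * 60

bids_per_day = AUCTIONS_PER_DAY * BIDS_PER_AUCTION
bids_per_month = bids_per_day * DAYS_PER_MONTH
auctions_per_second = AUCTIONS_PER_DAY / SECONDS_PER_DAY

print(f"bids per day:        {bids_per_day / 1e12:.2f} trillion")
print(f"bids per month:      {bids_per_month / 1e12:.1f} trillion")
print(f"auctions per second: {auctions_per_second / 1e6:.2f} million (average)")
```

Even at the lower bound this is over a trillion bids per day and an average of about a million auctions per second, which is consistent with the "trillions of events every month" description.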
Stefan: Oh yeah, adding it. It’s the Tower of Pisa and we just put something on the side of it.
Tomer: So, let’s take the example of snapshots. So at MapR we’ve provided snapshots from day one, much like you would see in an enterprise storage system or an enterprise database: the ability to go back in time. Let’s say a user accidentally deleted data, or you had an outage and you wanted to go back to a consistent point in time. So it is something that enterprises expect; you wouldn’t buy a NAS or a database if you couldn’t go back and do point-in-time recovery.
MapR is the only Hadoop distribution that provides that from a Hadoop standpoint. And our competitors have tried to add that to HDFS, and the result is really inconsistent snapshots, or what they sometimes call fuzzy snapshots, but people don’t …
Stefan: That’s a really nice marketing term, by the way.
Tomer: It’s great.
Stefan: It’s a fuzzy snapshot.
Tomer: It’s more or less consistent, it’s sometimes consistent.
Stefan: Let’s hope it is consistent.
Tomer: Let’s hope it is.
Stefan: The whole thing just crashed, let’s hope. [18:00]
Tomer: And as the Hadoop market has matured over the last year and will continue to mature over the next year or two years, people stop buying those arguments. They don’t compromise when they buy a storage system or database. No, they are not going to compromise when they buy a Hadoop environment.
Stefan: And what’s the secret sauce behind all of this? You touched a little bit on how you built this in from scratch, but where does the magic of MapR start and where does it pull all the way through? [18:32]
Tomer: Let’s use the whiteboard here, and look at what is the MapR distribution, what is the overall architecture. So it all starts with, maybe I’ll move out of the way so it’s easier to see. It all starts with a collection of over 12 different, maybe 15 even now, open-source Apache community projects. So we have things like, we have Hive here, Pig, Flume, we just announced YARN. So many different projects here, and these are just a few of them, there is ZooKeeper and Sqoop, and Stinger and Tez coming into play. We’re going to run Impala on the platform, that’s part of it as well.
And what we do is we’ve added our own data platform, so underneath here you have the MapR data platform. I call it MDP here for short. And what this is really is two things: it’s a distributed filesystem and it’s also a NoSQL database in one. So think of this as, that’s going to be hard to see, but call this MapRFS here, and MapRDB here, but really it’s a single service. So files and tables are integrated, they are part of the namespace, they are protected in terms of snapshots and DR the exact same way. So it’s one process running on every node, that’s the MapR data platform.
From a file standpoint, it’s fully read/write, so you don’t have the limitations you are probably aware of with HDFS. So you can modify files, you can do concurrent reads and writes, and we also expose that through an NFS interface. So if you look at the interfaces for the MapR data platform, clearly we have, of course, the Hadoop filesystem API. And this is the same API that’s exposed by HDFS or Amazon’s S3 implementation or Google Cloud Storage, their integration with Hadoop.
And we also have an NFS interface here, which is the standard storage interface. And customers use that extensively for everything from loading data into the cluster; for example, a lot of our customers will export data from Teradata using standard Teradata utilities. Or they use things like R and SAS, and because we expose a standard file interface, just like any network-attached storage, you just mount the MapR cluster, and you get one giant NAS, that’s it. [21:10]
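The practical point of the NFS interface described above is that any tool that can read and write ordinary files can work with the cluster, with no Hadoop client library involved. The sketch below illustrates that idea using only Python’s standard file APIs. The helper names (`export_rows`, `read_rows`), the `exports/customers.csv` path and the sample rows are hypothetical, and a temporary directory stands in for the mount point so the example runs anywhere; on a real system the mount point would be wherever the cluster has been NFS-mounted.

```python
import csv
import tempfile
from pathlib import Path


def export_rows(mount_point: Path, rows: list[list[str]]) -> Path:
    """Write rows as CSV under the mount point using plain file I/O."""
    target = mount_point / "exports" / "customers.csv"   # hypothetical path
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return target


def read_rows(path: Path) -> list[list[str]]:
    """Read the CSV back, again with nothing but standard file APIs."""
    with path.open(newline="") as handle:
        return list(csv.reader(handle))


# ASSUMPTION: a temporary directory stands in for the NFS mount point.
with tempfile.TemporaryDirectory() as tmp:
    mount_point = Path(tmp)
    written = export_rows(mount_point, [["id", "name"], ["1", "Ada"]])
    round_trip = read_rows(written)

print(round_trip)
```

Nothing in the code knows it might be talking to a distributed filesystem, which is the same property that lets export utilities or tools like R and SAS write to a mounted cluster unchanged.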
Stefan: Wow, that’s cool.
Tomer: And actually before coming here, I was watching a video you did with Eric Nachbar. He described the reasons he was using MapR and he loved the fact…
Stefan: Those integration pieces are so critical for customers. If you go to a customer and say, “Oh great, we have this magic cool storage thing, and by the way you have to write a few Perl scripts and some Java to get data in,” they look at you like, “Uh, okay.” That’s like a step 20 years back rather than a step forward.
Tomer: It’s analogous to, what would you rather use to keep your documents: would you put them on an FTP server or would you put them in network-attached storage? I mean it’s obviously a lot easier when you don’t have to use a tool to get data in and out. And then there is the HBase API, because that’s the standard interface in Hadoop for tables. So we expose that, and that’s the way you basically read and write the tables.
Tomer: So that’s kind of the overall picture. We add our management innovation as well. So this is management, we call this our MapR Control System. And the nice thing here is that there is so much innovation happening at this layer as well, so you’ll get projects like Spark and Shark and MLlib and YARN and Hive and Pig, and I could go on and on with all these different projects that are either in the distribution or coming this year as fully supported projects in our distribution. And then we add our own innovation that provides value beyond what you’re otherwise able to get, and actually broadens the use cases that are possible with this platform.
Stefan: So my understanding too is that the MapR filesystem, you guys completely rewrote also in a different language than the Hadoop distributed filesystem. And I think you guys had some DNA there, so it’s not the first filesystem that your core engineers wrote. [23:12]
Tomer: That’s one of the things we’ve done. So I think we invest over 50% of our engineering at this layer, and we’ve certainly done a lot of innovation here, and very early on too. And M.C. Srivas, our CTO and co-founder, actually came from Google, where he was on the Google Search and BigTable teams, driving that from within Google. And Google, as we all know, is ahead of the game here when it comes to Big Data. So he had that experience of running MapReduce and using Big Data at scale, at the best place possible. And then before that he was at Spinnaker Networks, which was acquired by NetApp, and their clustered filesystem.
Tomer: But yeah, this combination is really what we are excited about in terms of expanding the use cases that are possible with Hadoop.
Stefan: I will have a little bit more hummus here, and a drink and we’ll have a little break from Big Data & Brews. See you guys soon.