Datameer Blog

Big Data & Brews: Tomer Shiran of MapR Talks About Hadoop & MapR Innovation

By Stefan Groschupf on March 18, 2014

Tomer and I were able to finish off the fantastic pita and hummus he brought in while we wrapped up our conversation around what MapR is up to and the cool things he sees happening in the Hadoop ecosystem.


Stefan:           Welcome back to Big Data & Brews.  We have awesome hummus and pita, and Tomer from MapR.

Tomer:            And Tomer.  Well, that’s how the hummus got here.

Stefan:           This is awesome.

Tomer:            It is.

Stefan:           Good beer.  Let’s talk a little bit about the Hadoop market: how you guys see it, where it stands, and where the pain points are.  Where does it need to go to be successful, big picture?  We talked a little bit about Spark and Storm.  What are you guys thinking about that?  The “we” stuff.

Tomer:            I think there are a few things happening in the market right now that are really important.  There’s also a long way to go.  This is the beginning of that long run.  You mentioned training for the Ironman, so this is like the beginning of the run.  I think some of the dynamics right now are the expansion in terms of the use cases, right?  We’ve invested a lot and we’ll continue to invest to make Hadoop suitable for analytical and operational use cases.  We’re really excited about YARN.  We just announced adding YARN to our distribution, which opens it up to more processing frameworks.

I think if you look at it, really, what’s happening is what we’ve done in the data platform and what’s also now happening in the rest of the stack is expanding Hadoop to be a lot more than the original batch processing framework.  I think another thing that really needs to happen in the market is it has to be easier.  It has to be a lot easier to build applications.  The companies right now that are adopting Hadoop in a big way are the companies that have the resources, right? Looking at most of our customers, these are the global 2000, right, and the technology companies, the Web 2.0.  That’s the mix of companies right now.

If we’re going to make this something that appeals to the broader mid-market and beyond, then I think we need tools like, for example, Datameer, right, where it’s – [3:03]

Stefan:           Totally.

Tomer:            Well, it has become easier for the user to use and I think that’s an area that will evolve a lot in the next few years.

Stefan:           Yeah.  I think we see two sections, right?  We see the business user and, of course, our product’s there with Datameer.  Then what we also see is that we need higher-level abstractions, like we need a Ruby on Rails for Hadoop, integrated in Datameer of course.  Make it simple, instead of explaining to people what MapReduce is and why you can’t just put an object into a private method in your map function and why it’s consistently … you’re just like, “Oh, well… It doesn’t work the way it used to work,” right?  I think we need to bring this a layer up too, right?  We need the Hibernate, the Ruby on Rails for…

Tomer:            Make it easier for developers and, of course, the analysts.

Stefan:           We see there are two audiences.  There are these data-driven products that people build, that they build companies on, and then we see the really heavy traditional business intelligence community, and that’s the focus where we see it, right?  There’s cool innovation that can come into the overall scheme of NoSQL and all that … What are the exciting new open source projects, maybe still too early, that you like to keep an eye on nowadays?  Where you’re like, “Oh, this is a cool GitHub project.” [4:38]

Tomer:            It’s a great question.  There are really so many of them.  Being responsible for product management at MapR, I get a lot of requests from the field, as you can imagine, right?  It seems like every other day I get a question about some project X that was started probably two days ago in some remote country, and then it’s like, “What do you think about this project?”  I’m like, “Okay.  Well, we’ve got to do some research, because they’re starting on a daily basis.”  We’re excited about a lot of these projects.  Spark and its ecosystem are pretty interesting.  They serve a different set of use cases, or maybe actually some existing use cases in a better way, things like machine learning where you have iterative algorithms.

Stefan:           Yeah.  Shared memory.

Tomer:            That shared memory across the cluster.  The machine learning algorithms can … they can converge a lot faster, right, where they have that type of in-memory system.

Stefan:           The way I think about Hadoop, it’s a virtualization platform.  Before, there was one physical machine and then you had many virtual machines on top of that.  That totally disrupted the data center and how we deploy things.  Hadoop is the opposite: many physical machines, one virtual machine.  So far, we have storage.  We have compute.  Now we’ve got a little bit of memory, though it’s a little early with Spark.  With you guys, there’s a cool file system, like a storage implementation that has all the good stuff you talked about.  For a while there wasn’t a lot of innovation like the kind MapR produced.  Right now, we have Spark.  It’s the memory.  Do you see any kind of really cool stuff coming up there or are you working on something? [6:24]

Tomer:            Yeah.  The way I look at it and … I don’t know if I can …

Stefan:           We do have a sponge right here.  It’s really old school here.  It’s like in my childhood.

Tomer:            This is the kind that doesn’t scratch the pots.  I don’t know if you know that.  If you’ve done any cleaning recently but it is the curved one.  They actually don’t scratch your pans and pots.

Stefan:           Good.  That’s what I do, right?  I work 12 hours a day here and then I literally go home and usually clean up the mess from the night before.  I work late.

Tomer:            My wife would actually be proud that I actually knew that.

Stefan:           You see.  I go to my San Francisco apartment and the only thing it doesn’t have is a dishwasher.  I feel like I’m back in the ’60s so I literally … yeah.  I’m an expert.

Tomer:            It doesn’t get any better than that.  Okay.  What are we going to talk about?  Oh, the whole stack, right?  If you look at the way Hadoop started, originally, you had HDFS and then you have MapReduce on top of this, right?  This was MapReduce and then you had a few things that could translate into MapReduce.  You had things like Hive and you had Pig, a few other things that could translate maybe higher level abstractions or languages.

Stefan:           Cascading, Datameer.

Tomer:            Right, Cascading, Datameer and so forth.  A lot has changed since then, right?  The first thing that MapR did was we made this much stronger, right, much more general purpose, right?  We built the MapR data platform and introduced a fully read-write data platform supporting standard storage interfaces in addition to the Hadoop file system API, right?  It was great.  Hadoop was really designed to enable that type of extensibility from day one, right?  That’s how Amazon and Google plugged their storage systems into Hadoop, right?  That was the standard Hadoop file system interface, right?

Later, what we did is we actually expanded this beyond files.  We expanded this to support not just files but also tables, and made the data platform suitable not just for analytical applications but also for live online applications, what we call the operational applications.  We have quite a few customers who have actually gone from Oracle to MapR. [8:47]

Stefan:           Oh, wow.

Tomer:            They’ve made that transition there.  Another thing that has actually happened, if you look at the MapR distribution as well as other Hadoop distributions, is the introduction of YARN as a more general-purpose resource manager, instead of one tied to MapReduce, right?  That opens up the platform to a broader set of use cases that share the same compute resources, right?  Now, you have MapReduce as one example of that and, of course, Hive and Pig here on top … but you also have things like Spark, right?

You’ll have more of these in the future.  Well, the really nice thing from the MapR standpoint here that’s differentiated is that many of these new applications here, if you bring, for example, an HPC application, they really weren’t designed for the batch characteristics of HDFS, right?  They expect the full read-write system.  They expect the ability to ingest data in real time and they expect all of these characteristics.  That’s what makes this combination really, really powerful.

Stefan:           Yeah.  You don’t need to run, and read write.  You basically already have that.

Tomer:            Yeah, because you have both things on YARN and also things that don’t support YARN that can run on this platform here.  Those things were not designed to talk to HDFS.  There are tens of thousands of applications out there that weren’t designed for that, whether it’s Elasticsearch or Vertica, or many other things that can run on this platform.

Stefan:           Can you give us a few examples of this?  You just said Vertica?  What else is cool beyond the standards that you guys then can run on MapR data platform? [10:36]

Tomer:            It’s a great question.  One example is we announced a partnership with HP around Vertica, and Vertica actually runs natively on the MapR platform by leveraging these capabilities.  It’s a read-write database, right?  It needs a real storage system underneath in order to provide the full range of capabilities that it provides on Hadoop.  Other examples are things like Solr and Elasticsearch, right?  Our customers, if you look at it from an integration standpoint, they use tools like R and SAS, and low import-export utilities from Teradata and from Oracle.  They can use all these tools with MapR in a very easy way because they just mount the cluster, and those tools have existed for 15, 20 years.  Everybody in the company is familiar with how to export from Teradata.  They can just export to a file, and that file just happens to be on this data platform.
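To make the "mount the cluster and export to a file" point concrete, here is a minimal Python sketch.  It assumes the cluster is exposed as an ordinary mounted directory; the mount path, file name, and sample rows are hypothetical and were not mentioned in the conversation.  The only thing it is meant to show is that the exporting tool needs nothing beyond plain file I/O.

```python
import csv
import os
import tempfile

# Hypothetical mount point for the cluster; not a path given in the interview.
CLUSTER_MOUNT = "/mnt/cluster/exports"


def export_rows(rows, directory, filename="orders.csv"):
    """Write rows to a CSV file using plain file I/O and return its path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["order_id", "amount"])  # header row
        writer.writerows(rows)
    return path


def read_rows(path):
    """Read the CSV back as a list of string lists."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


if __name__ == "__main__":
    # Fall back to a temp directory when the assumed mount is not present,
    # so the sketch still runs on a laptop.
    target = CLUSTER_MOUNT if os.path.isdir(CLUSTER_MOUNT) else tempfile.mkdtemp()
    print(export_rows([(1, "19.99"), (2, "5.00")], target))
```

Nothing in the sketch is specific to Hadoop, which is the point being made: a tool that can write to a directory can write to the mounted cluster.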

Stefan:           Yeah, cool.  Let’s talk a little bit about MapR as a company and the history.  You guys are located in the South Bay.  How big are you guys now? [11:44]

Tomer:            We’ve crossed 200 employees.

Stefan:           All the engineering is in the Bay Area or it’s … you said you’d all …

Tomer:            Yeah.  The majority of the engineering is at headquarters here in San Jose.  We’ve actually been taking over suites in the building and just leased our second building, so a lot of fun expansion there.  It’s hard to keep up.  We have a small team in Hyderabad in India.  We have support teams in several different countries including, of course, US headquarters.  Also, we have support staff in India, in the same location as our engineering team there.  We have a support team in Japan as well.  Yeah.  It’s very international.

Stefan:           Putting on the entrepreneur hat here, as a product manager, what are your biggest challenges?  How do you overcome them in such a fast-moving market? [12:42]

Tomer:            Well, there are a few different challenges.  One of the challenges that I think we all face now is just hiring, so I spend a lot of time recruiting, obviously.  In terms of the product, I think the key is to stay focused.  There are a lot of distractions in this space, both from a company standpoint and from different things that are happening outside: what should we do about this, what should we do about that.  At the end of the day, I think the really important thing is to focus on the customers, what they are trying to achieve, and what they need in order to achieve the next set of use cases and be more successful.  That’s really where you have to focus.  That’s why I spend a lot of time with our customers and prospects, just understanding what they’re doing, what their use cases are, what challenges they have and what we need to do.

Stefan:           Usually, to me, if I talk with customers, I love our customers and what they are doing, but they frequently come back and say, “I need a faster horse,” instead of saying, “Oh, I want to get faster from here to here.”  They always have, “Oh, you can do this by using that SourceForge project combined with this GitHub” and then, “Hey, I wrote you a little code, why don’t you put it in the product?”  That kind of thing, right?

Tomer:            Yeah, I think the key is to really understand what their problem is, what they want to be able to accomplish, versus just how they want it solved.  They don’t see everything that you see.  It’s not their core business.  They’re not spending all their time thinking about these different technologies and what can be done.  A lot of times they’ll ask for something specific, but you really need to understand that what they really need is something else, or that if we added something else we’d provide them with that value but also provide a lot more value to tens or hundreds of other companies.

Stefan:           The data probably got bigger, the team got bigger, and what basically goes up is the coordination cost within the team.  So you slow down a little bit from an engineering perspective.  How are you guys really keeping the productivity high as you crazily scale the company? [14:56]

Tomer:            That’s a good question.  We always wish engineering teams would scale like Hadoop does, so that ten people would do twice the amount of work as five people.

Stefan:           Let’s figure out how to do that.

Tomer:            I think the key there, what we’ve done, is we’ve divided it into teams.  We have teams for these various Apache open source community projects.  We have a large engineering investment there, and then we have a separate group that’s responsible for the resource management and MapReduce.  We have another team for the file system, one for the integrated database and another one for management.  We try to keep it in teams and keep very clear and consistent APIs between them so they can develop independently.  Yeah, it’s a challenge like at every other company.

Stefan:           We did something similar.  We have a more integrated product.  We couldn’t split it up.  What we recently did is define that a team is not allowed to get bigger than seven people.  If we have a challenge, like we work on a certain module and it’s getting bigger than what seven people could work on, we have to divide that part of the product into multiple architectural subsystems.  Just a crazy thing.

Tomer:            You don’t lay off the eighth person or something like that, or the weakest of the eight once you get to eight?

Stefan:           If you get the eighth, the person goes into QA.

Tomer:            I see.

Stefan:           No.  It’s really interesting because, as you said, as you get so many more developers, the productivity drops if you don’t do something creative or tremendous, right?  Interestingly enough though, we use our product to analyze our code, our tickets and our log files from the tests.  We have a GitHub integration and then we can really put that all together.  Then we can say … I don’t write code anymore, but if I did they would say, “Hey, Stefan.  You just broke the test suites.  It will cost us $6,000 to run against the Hadoop distribution in the EC2 cloud again.”

One of our biggest expenses is really running all the test suites.  We can actually pull it all the way down to your GitHub commit that then broke the test suite. [17:23]

Tomer:            Pretty advanced.

Stefan:           Fun.  Fun.

Tomer:            I am a big believer in eating your own dog food.  That’s something we were … we were big on it at Microsoft.  We do a lot of it at MapR as well.

Stefan:           You’ll see on the system.

Tomer:            We use MapR to analyze home data.  We use MapR for our bug database.  We have Bugzilla on MapR.

Stefan:           Oh, cool.  Bugzilla?  You wouldn’t use JIRA?

Tomer:            We actually have two different systems; we’re in a transition, I think, but yeah.

Stefan:           Are you just using Git, or what’s your source control system?

Tomer:            Yeah, we’re using Git right now.

Stefan:           That was a big change for me from like … I started my life in CVS.  I didn’t do anything else before that.  SVN and then Git was just like, “Oh, my God, there are like real merges and it does it by itself?”  What does your product management process look like?  You said you get a lot of feature requests, requirements gathering on the customer side.  Then how do you pull this through?  Do you guys have a beta program or evaluation program? [18:32]

Tomer:            We have a product manager for each product that we have and for a few different projects as well.  The way we do it is, as a team, we spend a lot of time in the field listening to customers and talking to them, right?  We take that input, as well as a variety of other sources of input, and we have to produce basically the product road map.  We work with the rest of the team at MapR to come up with our road map for the next 18 months.  We go from there to defining the requirements for the next release and the subsequent releases.  For big releases, we’ll have beta programs and we’ll engage anywhere from 10 to 20 customers, typically on a private beta, to have them work with the product and give us their feedback.  Once we address that, we’ll do our GA.

Stefan:           You don’t just push it out and see if 5,000 people fail with it like other open source projects.

Tomer:            Our approach, I think, is a little bit different from our competitors’, maybe, in that regard.  We really treat Hadoop as an enterprise-class system because, look at these customers.  They’re running things that are mission critical, right?  If it’s down, they’re losing millions of dollars every day.  You can’t just put something out there and have your customers do QA for you.  We’ve invested a lot in the QA team and QA automation, and so forth.  It pays off a great deal.  That’s what I was thinking about with the long run. [20:06]

Stefan:           Any fun projects you’re working on besides your day-to-day job?  Do you still do recreational code writing or any fun stuff beyond your day-to-day job?

Tomer:            My day to day.  I’d say the biggest project that I have right now beyond my day job is teaching my daughter to ski.  I have my 5-year-old and she did actually a bunch of black runs this weekend.

Stefan:           Oh, wow.  You’re going up to Tahoe, or where?

Tomer:            The snow.  You’re going to ask me about the snow conditions there but it hasn’t been a great year.  At the same time, it’s being out in the mountain and having snow and skiing, even if you can’t ski the whole mountain, is awesome.  I’ve been having a lot of fun with that.

Stefan:           Wow.  What’s in the future for you, for MapR, for the Hadoop ecosystem?  What do you think? [20:54]

Tomer:            What do I think?  Well, I think if you look at the trends in the market, IT spending is growing at about 2.5% annually.  Data is growing at about 40% annually, right?  There’s a disruption there that has to happen and Hadoop is that disruption.  MapR is the company that’s bringing Hadoop to the enterprise as well as the Web companies.  With a production-ready distribution, I think we’re in a great position to fuel that disruption.  I think it’s the biggest disruption really since the relational database 30 years ago.
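The gap between those two growth rates compounds quickly.  As a rough illustration, the Python sketch below takes the two rates quoted above (about 2.5% and about 40% a year) and assumes, purely for the arithmetic, that both stay constant and both start from the same baseline of 1.0.  The function names are made up for this example.  Under those assumptions, data has grown a bit under five times as much as spending after five years.

```python
def compound(start, annual_rate, years):
    """Value after compounding `annual_rate` once a year for `years` years."""
    return start * (1 + annual_rate) ** years


def gap_ratio(years, data_rate=0.40, spend_rate=0.025):
    """Ratio of data growth to spending growth after `years` years.

    Both series start at 1.0; the default rates are the approximate
    figures quoted in the interview, assumed constant for illustration.
    """
    return compound(1.0, data_rate, years) / compound(1.0, spend_rate, years)


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(years, round(gap_ratio(years), 2))
```

The exact numbers matter less than the shape: a budget growing in low single digits cannot keep pace with data growing at tens of percent a year unless the cost per unit of data drops sharply.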

Stefan:           Strata was a few weeks ago.  What was that for you guys? [21:37]

Tomer:            I think one of the things I heard actually from someone who I was talking to was that this year at Strata, there were a lot more suits than previous years which is a good sign with the Hadoop market maturing, and the decision makers from our perspective when you look at our customers and the prospects.  The decision makers being at the show and really being more tuned in to this big data revolution.

That’s one thing that was apparent at the conference.  We had a few announcements ourselves.  We announced YARN in the MapR distribution.  We announced the MapR sandbox, which is a really nice, easy-to-use virtual machine that you can download from our website, from MapR.com, to learn Hadoop and get up to speed.  It’s for somebody who’s new to Hadoop and wants to look and learn how to write code, how to run queries, things like that.

There was also a great panel in one of the sessions at Strata with Hadoop users.  Actually, all of them are MapR customers.  It was Climate Corporation, Cisco IT, Solutionary, which is a managed security company, and the Rubicon Project, which is an ad exchange.  They talked about how to achieve production success with Hadoop, which is what we help our customers do.  I think one of the greatest comments I heard there was from Piyush at Cisco IT.  He’s their distinguished engineer and chief architect for big data at Cisco.  He was asked, “Well, how do you make Hadoop successful?”

His response was that you have to get the architecture right upfront, because if you can get the right architecture in place, then the conversations will be, “How do we get value out of this?  How do we increase revenue?  How do we reduce costs?” things like that, versus, “How do we solve this issue with the NameNode?  How do we solve this issue with this open source project?”  Really, an IT-focused discussion as opposed to a business discussion.  Cisco has been using our product for a while now, and it started with simple use cases like offloading the data warehouse.  Then there was a use case that actually increased revenue by 40 million dollars by providing recommendations to their channel partners on which opportunities to engage.  There are now over 12 different use cases from different business groups running on the cluster.

Stefan:           Your pricing model is percent on the ROI? [24:04]

Tomer:            We would have IPOed a long time ago if that was the case.  It’s a standard per-node annual subscription type model.

Stefan:           Yeah, great.  Well, thank you very much for joining Big Data & Brews.  Thank you for the awesome beer and amazing hummus and pitas.  You have to come back and bring more.

Tomer:            We’ll do it again.

Stefan:           Cheers.  See you guys soon.



Stefan Groschupf

Stefan Groschupf is a big data veteran and serial entrepreneur with strong roots in the open source community. He was one of the very few early contributors to Nutch, the open source project that spun out Hadoop, which 10 years later, is considered a 20 billion dollar business. Open source technologies designed and coded by Stefan can be found running in all 20 of the Fortune 20 companies in the world, and innovative open source technologies like Kafka, Storm, Katta and Spark, all rely on technology Stefan designed more than a half decade ago. In 2003, Groschupf was named one of the most innovative Germans under 30 by Stern Magazine. In 2013, Fast Company named Datameer, one of the most innovative companies in the world. Stefan is currently CEO and Chairman of Datameer, the company he co-founded in 2009 after several years of architecting and implementing distributed big data analytic systems for companies like Apple, EMI Music, Hoffmann La Roche, AT&T, the European Union, and others. After two years in the market, Datameer was commercially deployed in more than 30 percent of the Fortune 20. Stefan is a frequent conference speaker, contributor to industry publications and books, holds patents and is advising a set of startups on product, scale and operations. If not working, Stefan is backpacking, sea kayaking, kite boarding or mountain biking. He lives in San Francisco, California.