Datameer Blog

Special “Use Cases” Big Data & Brews: Hortonworks, Pivotal, MapR

By on June 17, 2014

We’ve had a number of guests visit us since we’ve kicked off Big Data & Brews and one thing that I always like to ask them about is what kind of use cases they are seeing with their technology. Tune in below to hear what Ari Zilka of Hortonworks, Milind Bhandarkar of Pivotal, and Tomer Shiran and Ted Dunning of MapR had to say.


Ari Zilka, Hortonworks

Stefan: What are some of the most exciting use cases you see with the customers, especially around those integration scenarios you just mentioned?

Ari: There are some very mundane use cases that are exciting to the customers. Let me start there, like operational data store collapsing. I find most customers have hundreds to thousands, if you’re talking about a big Fortune 500 company. They will literally have thousands of copies of the same data, for compliance reasons usually. Different teams each have different pieces of data, and they want to silo them from each other to have zero regulatory risk.

Our data lake architecture allows us to basically say, “You know what, you may have 3,000 files in a Hadoop data lake but at least it’s just a data lake.” Sometimes it turns out we can collapse the redundancy down to … it’s just really three copies at the lake. Usually you can, because we can do column-level encryption with the columnar store and we can say, “This guy, based on log-in, is allowed to decrypt the column, this guy is not.” We do a lot of format-preserving encryption. You can use the column to do analytics, it’s just not the … like a Social Security Number, it’s not the real one. It’s just a unique one for processing.

The mundane use cases are collapsing out all these redundant copies of data, which saves people a lot of money. The sexy use cases are data science. I just spoke to our Head of Data Science before coming to meet you and he found … for example … I can’t say which customer this is with [00:08:00] but in health care he built a new algorithm in Hadoop. He on-boarded all these structured data sources, forget Hadoop for unstructured data. This is all databases and then some Office-style documents like PDFs of X-rays and things like that from hospitals.

Stefan:Maybe doctor notes, who knows?

Ari: You bring all this data in. You do some natural language processing. You do some pharmacy history data analysis and he found that he could write an algorithm literally, NR, he could write an algorithm that found the relationship between diseases and drugs. These are the drugs people take and these are the diseases they have and these are drug … the time correlation between them, and he started asking questions of this health care provider’s doctor.

He said, “There’s something very strange going on. It seems like anyone who has atherosclerosis, calcification of the arteries … There’s a very strong correlation between them and taking flu medication.” The doctor said, “What are you talking about? That makes no sense.”

They started looking into it and a week later the doctors came back and said, “You know what, the calcification gets really, really bad for someone with the flu. The arteries harden for a short period of time, even worse than when they were hardening in the natural state of that person’s health before, and they will soften again a little bit after the flu passes, but during the flu the arteries lock down and it’s an immune response and the person is at huge risk of heart attack.”

We figured out together with this health care provider that you could actually say, “If someone has atherosclerosis and they come to you with the flu, you need to treat that as an emergency situation, because if you don’t get that flu dealt with they could have a heart attack. More importantly, anyone with atherosclerosis should just get a flu vaccine every year [00:10:00] and not be at risk of catching the flu.”

Then he found something much more interesting, which is that HIV-positive patients were taking prescription-only hyper-fluoridated toothpaste in huge volumes and no doctor could tell him why. It turns out that the drug cocktails to treat HIV weaken the teeth. Dentists stepped in and prescribed hyper-fluoridated toothpaste.

Stefan:Interesting. Isn’t it wonderful what you find out about it … things they don’t.
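
To make the disease-and-drug correlation idea from Ari's story concrete, here is a minimal Python sketch. It is not the algorithm the Hortonworks data scientist wrote (the interview does not describe it in detail); it only computes a simple co-occurrence "lift" score, ignores the time correlation and the natural language processing steps, and runs on invented toy records. The names `patients` and `lift_table` are hypothetical.

```python
from collections import Counter
from itertools import product

# Invented toy records for illustration only: (patient id, diagnoses, drugs).
patients = [
    ("p1", {"atherosclerosis"}, {"flu_medication"}),
    ("p2", {"atherosclerosis"}, {"flu_medication", "statin"}),
    ("p3", {"atherosclerosis"}, {"statin"}),
    ("p4", {"asthma"}, {"inhaler"}),
    ("p5", {"asthma"}, {"inhaler", "flu_medication"}),
    ("p6", {"diabetes"}, {"insulin"}),
]


def lift_table(records):
    """Return {(disease, drug): lift} where
    lift = P(disease and drug) / (P(disease) * P(drug)).
    A lift above 1 means the pair shows up together more often than
    independence would predict."""
    n = len(records)
    disease_counts = Counter()
    drug_counts = Counter()
    pair_counts = Counter()
    for _, diseases, drugs in records:
        disease_counts.update(diseases)
        drug_counts.update(drugs)
        pair_counts.update(product(diseases, drugs))
    return {
        (disease, drug): (count / n)
        / ((disease_counts[disease] / n) * (drug_counts[drug] / n))
        for (disease, drug), count in pair_counts.items()
    }


if __name__ == "__main__":
    # Print the pairs from most to least over-represented.
    for pair, lift in sorted(lift_table(patients).items(), key=lambda kv: -kv[1]):
        print(pair, round(lift, 2))
```

A high score only flags a pair worth asking a doctor about, which is how the story above unfolds; it says nothing about cause.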


Milind Bhandarkar, Pivotal 

Stefan: Now you’re working on this and you’re one of the really, really early guys in the Hadoop space. Where was that moment when you had to pinch yourself, “I can’t believe people are doing this with software I wrote so many years ago?”

What’s the most amazing use case you saw?

Milind: First was not the use case. Once I started doing Hadoop evangelism outside of Yahoo! … by the way, the first Hadoop tutorial delivered anywhere was at ApacheCon in 2008 or 2009, I am forgetting, but this was in New Orleans.

Stefan:I was there.

Milind: It was sponsored by Cloudera. Christophe, Aaron Kimball, Tom White, all of those actually were …

Stefan:Didn’t we go in the evening? Anyhow …

Milind: The French Quarter thing, let’s push that out. Really, I was at USENIX the year after that. My tutorial proposal got accepted there. Next door there was a Solaris performance tuning tutorial going on with Richard McDougall.

Richard McDougall has written Solaris performance tuning books, all about Solaris. A really great guy. His tutorial had like six people attending it and my tutorial next door had something like 30 people attending. That’s the point where I basically realized …

Stefan:Something is shifting.

Milind: Something is shifting, exactly. USENIX 2009, this was in San Diego. Among the attendees in my tutorial there were three people representing all three different. That was basically, “Okay, what have we done?”

Stefan:Saving the world.

Milind: Saving the world, yeah. Recently actually, my daughter took part in the Synopsys Science Fair, here in South Bay. I went to drop her there and I took a look at what all the kids were doing, from 7th grade to 12th.

There was actually a kid from the 7th grade, who did … what was his title? “Effect of number of computers on computation time.” He basically took a MapReduce job and he basically said, “If I run this on three machines, if I run this on ten machines, if I run this on 20 machines, how much does the computation time change?”

He discovered that it goes down for some time and then it basically goes back up. This was all done using Amazon AWS and Hadoop. I actually was tempted to make him an internship offer right there. I don’t know about underage recruiting or anything like that.
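
The "goes down, then goes back up" shape the student found can be reproduced with a toy cost model. The Python sketch below is an illustration under stated assumptions, not his experiment: it assumes the parallel work divides evenly across machines while coordination overhead (startup, shuffle, scheduling) grows linearly with machine count. The function name `job_time` and both constants are made up.

```python
def job_time(machines: int, work: float = 600.0, overhead: float = 1.5) -> float:
    """Toy model of job time in seconds.

    work / machines    : the parallel part shrinks as machines are added
    overhead * machines: coordination cost grows with cluster size
    Both constants are invented for illustration, not measurements.
    """
    return work / machines + overhead * machines


if __name__ == "__main__":
    for n in (1, 3, 10, 20, 40, 80):
        print(f"{n:3d} machines -> {job_time(n):7.1f} s")
    # Find the cluster size with the lowest modelled time.
    best = min(range(1, 101), key=job_time)
    print("fastest at", best, "machines")
```

With these made-up constants the minimum falls at 20 machines; past that point the overhead term outweighs the gain from adding machines.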


Tomer Shiran, MapR 

Stefan: That’s pretty amazing, it’s really international. What a lot of people are doing, is it different, like what people do in the US versus in Europe, use-case-wise, or are there buckets?

Tomer:I’d say the US is probably still ahead in terms of the maturity of the customer base, although we have actually a significant number of customers in these other countries. One great use case in Japan is, it’s actually a beverage company, so this is one of the biggest companies, beer and whisky in Japan.

Stefan:Oh nice.

Tomer: They have some pretty cool use cases. I think you would get the standard kind of marketing use cases that people do with Hadoop. They have all of those, but they also have these really cool vending machines, where they are doing image detection and they have a video camera that’s looking at you and kind of recommending a beverage to you when you walk up.

Stefan:Based on what they used before.

Tomer:I think they look at your image and compare it to other people that had similar characteristics, things like that. So it’s a pretty critical cool use case.

Stefan: That’s so Minority Report, where you basically, based on your, like what was it, retina, get the advertisement. I guess we live in the future. That’s amazing.

Tomer:Yeah, we do.


Tomer Shiran, MapR 

Stefan: What are some of the use cases you are seeing with the customers, like what is your favorite one?

Tomer:My favorite one is actually, it’s not going to be one of the more popular use cases but the Aadhaar project in India actually so.

Stefan: I think that’s just an enormous project.

Tomer: Yeah, it’s a really cool one too and it’s really valuable in terms of what it’s doing in that country. So India has over 1 billion people living in the country, it’s something like 1.25 billion people. And one of the challenges there is that about half of the population doesn’t even have an identity. There is no Social Security Number or anything like that, and that prevents these people from opening bank accounts, it prevents them from getting medical care, it prevents them from getting government services, government aid, things like that. It also encourages a lot of fraud in the overall system, right.

So what the Aadhaar project is doing is it’s basically building the world’s largest biometric database, and the idea is to provide every resident of India an identification so they can get government aid and medical services and open a bank account and do commerce and things like that. And it’s lifting a lot of people out of poverty.

Stefan: It’s fantastic.

Tomer: I think it’s up to about 750 million people already in the database, I think it’s about 10 petabytes of data. So for every person you have the person’s photo of the face, you have the ten fingerprints, the two iris scans. So you have all that information for every one of these people, and it’s not just collecting that information and storing it, it’s also enabling every point of service in India to also be able to verify that identity, because now you need the bank and every other service provider to be able to check your identity. So that’s a system that needs to respond within 200 milliseconds at very, very high load in terms of transactions per second. So we are really happy that we are powering that from the backend, from a Hadoop and database standpoint. So that’s one of the projects I’m most excited about.
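
As a rough sanity check on the figures Tomer quotes, the short Python calculation below divides the stated data volume by the stated number of enrolled people. It assumes decimal petabytes and takes both numbers at face value, although they are given as approximations ("I think"); the variable names are chosen here for illustration.

```python
people_enrolled = 750_000_000        # "about 750 million people"
total_bytes = 10 * 10**15            # "about 10 petabytes", assuming decimal PB
captures_per_person = 1 + 10 + 2     # face photo, ten fingerprints, two iris scans

# Average storage per enrolled person, then per biometric capture.
bytes_per_person = total_bytes / people_enrolled
bytes_per_capture = bytes_per_person / captures_per_person

print(f"{bytes_per_person / 1e6:.1f} MB per person")
print(f"{bytes_per_capture / 1e6:.2f} MB per capture, if all data were biometric captures")
```

This works out to roughly 13 MB per person. The per-capture figure is only an upper-bound style estimate, since the 10 petabytes may also include replication, indexes, or other data not mentioned in the interview.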


Tomer Shiran, MapR 

Tomer:I think advertising and marketing are pretty common use cases across the board, and it …

Stefan:And is that more in the ad companies or is it more kind of the traditional big companies trying to understand their customers or?

Tomer: It’s actually both. You look at some of our customers like the Rubicon Project, which is the largest ad exchange in the US in terms of audience reach. They are doing 90 billion auctions every day. And each of those auctions is probably a dozen or more bids, so we are talking about trillions of events every month that they are processing in the cluster, in their MapR environment, and they predict the prices that the auctions are going to reach and all sorts of things like that.
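
The "trillions of events every month" figure follows from the numbers quoted above. The Python sketch below works through the arithmetic, counting only bids as events and assuming exactly 12 bids per auction (the interview says "a dozen or more", so this is a lower bound) and a 30-day month.

```python
auctions_per_day = 90_000_000_000    # "90 billion auctions every day"
bids_per_auction = 12                # "a dozen or more bids": lower bound
days_per_month = 30                  # assumption
seconds_per_day = 24 * 60 * 60

bid_events_per_day = auctions_per_day * bids_per_auction
bid_events_per_month = bid_events_per_day * days_per_month
# Average rate if the load were spread evenly over the day.
bid_events_per_second = bid_events_per_day // seconds_per_day

print(f"{bid_events_per_month / 1e12:.1f} trillion bid events per month")
print(f"{bid_events_per_second:,} bid events per second on average")
```

Under these assumptions that is 32.4 trillion bid events per month, or an average of 12,500,000 per second; real traffic would peak well above the average.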

But then if you look at many other customers that we have across telco and retail, these are customers that have tens to hundreds of millions of end-users or end customers, and they are doing everything from better ad targeting to churn analysis, all those types of use cases.

Stefan: What kind of product enhancements does that drive for you guys? You touched a little bit on the lower latency requirements, but where do you really see Hadoop as of today? You said it is a little bit expanding into the new real-time-ish production use cases. What other functionality dimensions are driven by those use cases?

Tomer: I think the customers that are doing these things … And I think you mentioned earlier how you see a lot of these, a lot of our customers are doing big deployments that are really impactful to their business. When the company wants to do that, they need a set of enterprise-grade dependability characteristics. So they want true high availability, one that self-heals automatically. They want real, consistent snapshots, they want disaster recovery across data centers. So we have a vendor now who says, we have those things and we’ve added those things, we’ve caught up with MapR. But there is a difference between building those into the architecture and doing something for a checkbox.

Stefan: Is it like, oh yeah, we love pizza and we just put something on the fly.

Tomer: So, let’s take an example of snapshots. At MapR we’ve provided snapshots from day one, much like you would see in an enterprise storage system or an enterprise database, the ability to go back in time. Let’s say a user accidentally deleted data, or you had an outage and you wanted to go back to a consistent point in time. So it is something that enterprises expect. You wouldn’t buy a storage system or a database if you couldn’t go back and do point-in-time recovery.
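
To illustrate what "going back to a consistent point in time" means, here is a minimal in-memory sketch in Python. It is not how MapR or HDFS implement snapshots; the `SnapshotStore` class and its methods are hypothetical, it assumes stored values are immutable (strings here), and it copies the whole key-to-value map at once so that a snapshot reflects one single moment rather than a mix of older and newer states.

```python
from types import MappingProxyType


class SnapshotStore:
    """Toy key-value store with named point-in-time snapshots."""

    def __init__(self):
        self._live = {}
        self._snapshots = {}

    def put(self, key, value):
        self._live[key] = value

    def delete(self, key):
        self._live.pop(key, None)

    def get(self, key):
        return self._live.get(key)

    def snapshot(self, name):
        # Copy the full map in one step and expose it read-only, so later
        # writes to the live data cannot change what the snapshot holds.
        self._snapshots[name] = MappingProxyType(dict(self._live))

    def restore(self, name):
        # Replace the live data with a fresh copy of the snapshot.
        self._live = dict(self._snapshots[name])


if __name__ == "__main__":
    store = SnapshotStore()
    store.put("orders.csv", "version 1")
    store.snapshot("before-cleanup")
    store.delete("orders.csv")          # the accidental delete
    store.restore("before-cleanup")     # go back in time
    print(store.get("orders.csv"))
```

After the restore, the accidentally deleted entry is back and anything written after the snapshot is gone, which is the behaviour described above as point-in-time recovery.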

MapR is the only Hadoop distribution that provides that from a Hadoop standpoint. And our competitors, they’ve tried to add that to HDFS, and the result is really inconsistent snapshots, or what they sometimes call fuzzy snapshots, but people don’t …

Stefan:That’s a really nice marketing term, by the way.

Tomer:It’s great.

Stefan:It’s a fuzzy snapshot.

Tomer:It’s more or less consistent, it’s sometimes consistent.

Stefan:Let’s hope it is consistent.

Tomer:Let’s hope it is.

Stefan:The whole thing just crashed, let’s hope.

Tomer: And as the Hadoop market has matured over the last year and will continue to mature over the next year or two, people stop buying those arguments. They don’t compromise when they buy a storage system or database. No, they are not going to compromise when they buy a Hadoop environment.


Ted Dunning, MapR 

Stefan:Let’s come back to the recommenders. What are the kind of use cases you see people using?

Ted: Recommenders are just amazingly ubiquitous lately. A friend of mine, co-contributor Robin O’Neil, just recently showed me that the new Google Maps is almost entirely based on recommenders. There’s way, way too much stuff to show on any map. There’s a massive amount of stuff. You wouldn’t be able to read it. What it does is, based on what you’ve done lately, what you’ve clicked on, what you’ve typed in, it selects which things it wants to show you. Actually, now in my talks, I show what happens if I search for a restaurant by name [00:10:00] near our office. It shows all these restaurants in the same price range, roughly the same cuisine. Then I search for where the MapR office is. All the restaurants go away and all these high tech offices on the map show up. It knows which sort of thing I’m doing.

In fact, the demo used to work better, because now it knows that I search for restaurants in that neighborhood. So it starts showing.

Stefan: It mixes the technology and the restaurants.

Ted: That’s right. It’s learned already some aspects of what I like. There are many, many things that it does. It will deemphasize roads if you’re a bicyclist or you’re on mass transit, and countervailing approaches can be done, too: at different scales it might show you more roads.

Stefan: Is there data privacy issue with this?

Ted: There are data privacy issues everywhere and people really, really don’t recognize how ubiquitous they are. Google has a pretty darn good track record and they’ve made a lot of effort. Ultimately, just like search histories, those are sensitive and even though they’re not legally sensitive, Google is doing the right thing in treating them as very sensitive and being pretty careful. We’ve used Google as a partner with the Google Compute Engine and I’ve been very impressed. For instance, the disks that you get on the virtual instances are encrypted by default. In fact, I don’t know how to defeat that. If you just do a toy example, the data at rest is encrypted.

Maybe you should do better key management so it’s always changing or whatever, but at least the zero, the simplest case, is done well. So yeah, I think that there are always issues about privacy. There are really subtle things you can do with big data, a whole lot of value, but those things can also be used to invade privacy.

Stefan Groschupf

Stefan Groschupf is a big data veteran and serial entrepreneur with strong roots in the open source community. He was one of the very few early contributors to Nutch, the open source project that spun out Hadoop, which, 10 years later, is considered a 20 billion dollar business. Open source technologies designed and coded by Stefan can be found running in all 20 of the Fortune 20 companies in the world, and innovative open source technologies like Kafka, Storm, Katta and Spark all rely on technology Stefan designed more than a half decade ago. In 2003, Groschupf was named one of the most innovative Germans under 30 by Stern Magazine. In 2013, Fast Company named Datameer one of the most innovative companies in the world. Stefan is currently CEO and Chairman of Datameer, the company he co-founded in 2009 after several years of architecting and implementing distributed big data analytic systems for companies like Apple, EMI Music, Hoffmann La Roche, AT&T, the European Union, and others. After two years in the market, Datameer was commercially deployed in more than 30 percent of the Fortune 20. Stefan is a frequent conference speaker and contributor to industry publications and books, holds patents, and is advising a set of startups on product, scale and operations. When not working, Stefan is backpacking, sea kayaking, kite boarding or mountain biking. He lives in San Francisco, California.