Big Data & Brews: Ari Zilka, CTO of Hortonworks on Larger Hadoop Ecosystem & Vision
Stefan: Welcome back to Big Data & Brews with Ari from Hortonworks.
Ari, we’ve got a community question.
The community question, if I can rephrase it slightly, is: what are you bringing to the table as the new CTO of Hortonworks? What’s really your focus?
Ari: What I bring to the table as Hortonworks’ CTO is the combination of the inbound and outbound focus. I think our CTO-level focus in the past was more on the future road map of the core technology, so working with academia, working with start-ups, working with big established software and hardware vendors on where Hadoop can go over the next 10 years.
I’m focused a lot less on that and a lot more on what are people trying to do with it now and what are gaps, because I see Hadoop as … It’s interesting I met with a customer who called folks like you and me Hadoopers, people who are inside the circle of trust, if you will, and who understand all the terminology and it all makes sense to us. We live and breathe it every day and ideally we’ve used it … for a while in anger.
You have the Hadoopers and then you have mainstream enterprise and when Hadoopers talk Hadoop there’s this death by a thousand paper cuts, things that are just understood. They are painful but they are lived with and accepted. When you show those to the mainstream they say, “That’s unacceptable. I won’t even adopt it.”
Stefan: Like the Writable stuff and the Serializable interface? Or there being no way to override that communication protocol?
Ari: Right, or security is inherited from file-system-level, POSIX-level concepts and constructs, and that’s all you’ve got.
Stefan: Good luck with that.
Ari: You don’t actually have ACLs, you just have user, group, and other-
Stefan: Right [00:02:00], and if you’re in health care and there’s legal requirements for you to do that, who cares?
We talked … we went really deep, so help me a little bit more to paint a picture of your overall ecosystem. What are you guys doing? What’s really cool about your platform?
Ari: From an ecosystem perspective that’s a great question. What we do that we feel is unique is we build a platform for others to build data applications on. We are the data management platform.
I like to think of us as Amazon EC2, if you will. We want to get it right in terms of what services you need to assemble platforms, so we treat ISVs and end users equally. Other platforms say, for example, “Hadoop is fine except for the implementation. It’s too slow, so I will build a faster implementation.” That is purely an IT sale: Hadoop is Hadoop, get what you want from Apache and run it on my new runtime.
Those are folks I compare to a BEA or a JBoss: the difference is only the container, but they all implement the same standard. Then there are folks who say, “I derive from Hadoop core, so I will build a database company or an analytics company on top of Hadoop core technology, and you will consume my packaged product, which happens to use things like HDFS, maybe MapReduce, maybe not, maybe USI …” all the different components from the community project ecosystem to build a particular product like a database.
You essentially have database vendors, you have container optimizations for IT, and then you have Hortonworks, which is trying to build a general-purpose data platform for both ISVs and end users to do analysis and build tools to do analysis on. That [00:04:00] leads to a difference in the ecosystem around us versus everyone else. First of all, we have a lot of the big bucks vendors aligning with us instead of competing with us.
Stefan: That would be Microsoft?
Ari: Microsoft, Teradata, SAP, Red Hat, Rackspace are all already aligned with us. Folks like Oracle and IBM work with us every day because customers pick the combination of their superior databases with our superior Hadoop platform and say, “I want Hadoop from Hortonworks and relational from Oracle and you must work together.” We work together fine.
The products are integrated. You go out to market and you get this view that perhaps they have a particular alignment, like IBM ships its own Hadoop or Oracle ships something inside the Big Data Appliance, and that’s all they support. They support anything, but then you have vendors who have premier integrations, such as Teradata or SAP or Microsoft. Microsoft, Cloud, [inaudible 00:05:00] cloud is built on top of Hortonworks.
SAP HANA bridging to Hadoop is a really sexy type of capability. You have Terracotta-style low-latency fast access to structured data through HANA, but you have Hortonworks working as a data lake underneath HANA, preparing, cleansing, and materializing data into HANA as fast as you need it. Sometimes it’s the other way around and HANA’s preparing data to load into Hadoop for long-term storage. Either way you have this two-way flow of data where HANA’s the low-latency layer, so we don’t have to solve that problem, and Hadoop is the data lake underneath it.
Then you have Teradata, where you have them handling big data at low latency and random access and us handling big data at medium latency with batch rather than random access. We put the two together and, again, we become the data lake for Teradata, slightly differently from HANA and SAP. With Teradata we’re saying, “You’ve built [00:06:00] an ecosystem of tools around Teradata. Keep that in place, but you want to grow your analytics capabilities without growing your entire warehousing footprint.”
Let’s bring those warehouse images onto the Hortonworks data lake for long-term retention and for joining with new, interesting analytics, new data sets, and new workloads. Do the new analytics in the lake and leave the existing infrastructure in place.
Stefan: What are some of the most exciting use cases you see with the customers, especially around those integration scenarios you just mentioned?
Ari: There are some very mundane use cases that are exciting to the customers. Let me start there, like operational data store collapsing. I find most customers have hundreds to thousands, if you’re talking about a big Fortune 500 company. They will literally have thousands of copies of the same data, usually for compliance reasons. Different teams can have different pieces of data and they want to silo them from each other to have zero regulatory risk.
Our data lake architecture allows us to basically say, “You know what, you may have 3,000 files in the Hadoop data lake, but at least it’s just a data lake.” Sometimes it turns out we can collapse the redundancy down to … it’s just really three copies in the lake. Usually you can, because we can do column-level encryption with the columnar store and we can say, “This guy, based on log-in, is allowed to decrypt the column; this guy is not.” We do a lot of format-preserving encryption. You can use the column to do analytics; it’s just not the … like a Social Security number, it’s not the real one. It’s just a unique one for processing.
The mundane use cases are collapsing out all these redundant copies of data, which saves people a lot of money. The sexy use cases are data science. I just spoke to our Head of Data Science before coming to meet you and he found … for example … I can’t say which customer this is with [00:08:00] but in health care he built a new algorithm in Hadoop. He on-boarded all these structured data sources, forget Hadoop for unstructured data. This is all databases and then some Office-style documents like PDFs of X-rays and things like that from hospitals.
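To illustrate the masked-column idea above, here is a minimal Python sketch. It is not Hortonworks’ implementation, and it is not true format-preserving encryption: a real scheme (for example NIST FF1) is reversible with the key and collision-free, while this keyed-hash stand-in is neither. The function names, the `allowed_users` set, and the key are all hypothetical.

```python
import hashlib
import hmac
from typing import Iterable, List, Set

ASCII_DIGITS = "0123456789"


def pseudonymize_digits(value: str, key: bytes) -> str:
    """Swap every digit for a keyed pseudo-random digit, keeping the layout.

    The same input and key always give the same output, so the masked
    column can still be grouped, counted, and joined on.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    used = 0  # how many digest bytes have been consumed so far
    for ch in value:
        if ch in ASCII_DIGITS:
            out.append(str(digest[used % len(digest)] % 10))
            used += 1
        else:
            out.append(ch)  # separators such as "-" pass through unchanged
    return "".join(out)


def read_ssn_column(values: Iterable[str], user: str,
                    allowed_users: Set[str], key: bytes) -> List[str]:
    """Return real values to permitted log-ins and masked values to everyone else."""
    if user in allowed_users:
        return list(values)
    return [pseudonymize_digits(v, key) for v in values]
```

A caller who is not in `allowed_users` still gets a column shaped like `NNN-NN-NNNN`, which is the property that lets one shared copy serve teams with different entitlements.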
Stefan: Maybe doctor notes, who knows?
Ari: You bring all this data in. You do some natural language processing. You do some pharmacy history data analysis, and he found that he could write an algorithm, literally in R, that found the relationship between diseases and drugs. These are the drugs people take, these are the diseases they have, and these are … the time correlation between them, and he started asking questions of this health care provider’s doctor.
He said, “There’s something very strange going on. It seems like anyone who has atherosclerosis, calcification of the arteries … There’s a very strong correlation between them and taking flu medication.” The doctor said, “What are you talking about? That makes no sense.”
They started looking into it and a week later the doctors came back and said, “You know what, the calcification gets really, really bad for someone with the flu. The arteries harden for a short period of time, even worse than when they were hardening in the natural state of that person’s health before, and they will soften again a little bit after the flu passes, but during the flu the arteries lock down and it’s an immune response and the person is at huge risk of heart attack.”
We figured out together with this health care provider that you could actually say, “If someone has atherosclerosis and they come to you with the flu, you need to treat that as an emergency situation, because if you don’t get that flu dealt with they could have a heart attack. More importantly, anyone with atherosclerosis should just get a flu vaccine every year [00:10:00] and not be at risk of catching the flu.”
Then he found something much more interesting, which is that HIV-positive patients were taking prescription-only hyper fluoridated toothpaste in huge volumes and no doctor could tell him why. It turns out that the drug cocktails to treat HIV weaken the teeth. Dentists stepped in and prescribed hyper fluoridated toothpaste.
Stefan: Interesting. Isn’t it wonderful what you find out about it … things they don’t.
What are … Let’s shift … That was great.
Let’s shift gears a little bit here again, so outside of the Hortonworks platform and all the stuff you see, what are the most exciting technologies you’re seeing out there right now?
Ari: The most exciting thing –
Stefan: You know, if you go on GitHub, what is like, “Oh?” What is so … what are you listening or subscribing to on GitHub? Or anything like this … what’s really cool?
Ari: The stuff I’m paying attention to right now is around streaming, so Storm, Samza, Continuity, and –
Ari: Kafka for sure.
I really like these micro-batches and transactional consumption of streaming events. Being able to consume event streams hundreds at a time in a pseudo-transactional fashion, so we’re not paying that silly XA price anymore and people are still getting reasonable reliability in their consumption, or at least stable use of data.
The other thing I really like is machine learning, graph processing. I really think that there is … First and foremost there’s a missing API layer to Hadoop. One of our competitors wrote a blog, I think it was last week, that asserted that there are only two workloads that matter in Hadoop, Spark and SQL, which is just totally wrong. Totally wrong. Hopefully for obvious reasons, but you’ve got to onboard data, you’ve got to manage data, so there’s a whole data [00:12:00] management layer in and of itself that Spark is not for, nor is SQL: classic ETL or ELT type workloads.
Separate from that, there is non-iterative analysis, just batch workloads that you can do joining data sets. I don’t need to do that in Spark, to join a table to another table or do a customer master analysis, for example, let me find the customer across channels, a 360-degree view. What remains for something like Spark is machine learning and graph processing.
I think the world is on the cusp of a breakthrough in scaled-out machine learning. The dirty little secret of Hadoop is machine learning is really done on giant in-memory nodes for the most part. There’s SAS, right, but then SAS historically was done on big in-memory servers, R is done on big in-memory servers.
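A minimal Python sketch of the micro-batch pattern described here, under stated assumptions: `InMemoryLog` is a hypothetical stand-in for a real event log such as a Kafka topic partition, and the consumer advances its committed offset only after a whole batch has been handled. That gives at-least-once delivery without a distributed (XA) transaction; it is not the API of Storm, Samza, or Kafka.

```python
from typing import Callable, Iterable, List


class InMemoryLog:
    """Hypothetical stand-in for an append-only event log with one consumer offset."""

    def __init__(self, events: Iterable[object]) -> None:
        self._events: List[object] = list(events)
        self.committed = 0  # offset of the next event that has not been consumed

    def poll(self, max_events: int) -> List[object]:
        """Return up to max_events starting at the committed offset."""
        return self._events[self.committed:self.committed + max_events]

    def commit(self, count: int) -> None:
        """Mark the next `count` events as consumed."""
        self.committed += count


def consume_in_micro_batches(log: InMemoryLog,
                             handle_batch: Callable[[List[object]], None],
                             batch_size: int = 100) -> int:
    """Process the log in batches, committing only after each batch succeeds.

    If handle_batch raises, the offset is left untouched, so the same batch
    is delivered again on the next run. Handlers therefore need to tolerate
    replays. Returns the number of batches processed.
    """
    batches = 0
    while True:
        batch = log.poll(batch_size)
        if not batch:
            return batches
        handle_batch(batch)     # may raise; nothing has been committed yet
        log.commit(len(batch))  # one commit per batch, not one per event
        batches += 1
```

The trade-off is one commit per few hundred events instead of a two-phase commit per event, at the cost of possible redelivery of a single batch after a failure.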
Stefan: You need the shared memory right?
Stefan: That’s a problem.
Ari: You want to load your whole data set into memory and then iterate across it, because you’re constantly doing things. Let’s say you want to do churn analysis. I lost customers … I’m Amazon and I lost these customers and I retained those customers across a one-year boundary. What is the difference between them? What are the factors? What are the characteristics of lost and retained customers, so that in the future I can look at some of them and say, “He’s trending toward a loss by the end of this year”? Is it his age? Is it his buying patterns? The department he buys in? The number of visits? What do I have to worry about to retain and grow my customer base?
For that churn analysis, you typically will take 500,000 people into a corpus and start examining them with A-versus-B type testing, and you want to do that in a machine learning fashion, in memory, because you’re going to say, “Is it field 1, AGE? Is it field 2, LOCALE? Is it field 3, SOCIO-ECONOMICS? Is it field 4, PAYMENT TYPE? Is it field 5, THIS GUY ALWAYS GETS GROUND, THAT GUY ALWAYS GETS FEDEX?” [00:14:00]
At some point, someone did analysis, for example, that led to the creation of Amazon Prime, that said, “I can have a bunch of people pre-pay for their shipping, fund everyone else’s shipping, and keep these people happy because they get everything in two days even though they pay more for it, you know, and I won’t lose money. I’ll end up making money on the whole thing.” So, that’s a business analysis. That’s a one-time, very heavyweight process.
You really need to be able to do things like that in an iterative way, like, “I want to discover what the relationships are between two groups, and what the right segmentation between groups is.” That’s done in an iterative fashion, typically in memory, because I’m constantly revisiting the same data over and over. I think that something like Spark is starting to crack the nut on, “How do I load all of that volume of data across a cluster’s RAM and then start crunching on the patterns at scale?” That’s where everyone wants to go.
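The field-by-field questioning above can be sketched in a few lines of Python. This is a deliberately crude illustration, not a real machine learning model: it keeps all customer records in memory, revisits them once per field, and ranks fields by how far the churn rate spreads across that field’s values. The record layout (plain dicts with a boolean `lost` key) and the function names are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def churn_rate_by_value(customers: Sequence[dict], field: str) -> Dict[object, float]:
    """Fraction of lost customers for each distinct value of one field."""
    totals: Dict[object, int] = defaultdict(int)
    lost: Dict[object, int] = defaultdict(int)
    for customer in customers:
        value = customer[field]
        totals[value] += 1
        if customer["lost"]:
            lost[value] += 1
    return {value: lost[value] / totals[value] for value in totals}


def rank_fields(customers: Sequence[dict], fields: Sequence[str]) -> List[str]:
    """Order fields from most to least associated with churn.

    The score is the gap between the highest and lowest churn rate among a
    field's values. Every field triggers another full pass over the same
    records, which is why keeping them in memory pays off.
    """
    scored = []
    for field in fields:
        rates = churn_rate_by_value(customers, field)
        scored.append((max(rates.values()) - min(rates.values()), field))
    return [field for _, field in sorted(scored, reverse=True)]
```

With records where every lost customer shipped by ground and payment type is evenly split, `rank_fields(customers, ["payment", "shipping"])` puts `shipping` first. A real analysis would also need significance testing and would look at combinations of fields.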
I really don’t want to do … I talk to my customers about segments of one. I really don’t want to do the classic Nielsen thing where I pick a thousand families and give them a set-top box and say, “The whole country will watch what the thousand people say they will watch, and my brilliant statisticians get it right.” I don’t even want to say, “You know what, forget Nielsen, I’ll go to DirecTV and watch what everyone watches … 30 million people watch. I want to know what everyone watches.”
I want to say, “In my ideal world, sticking with the TV analogy for a second … In my ideal world a commercial becomes a modular time slot in a show, and all the commercials that are relevant to me are computed by Hadoop and loaded into my DVR, and they just play when that slot appears.”
I know you are on Facebook looking at cameras that your friends are using. I know you’re on subaru.com looking at cars. I know you are on bestbuy.com considering different turntables and so I’m just going to play those commercials for you.
Stefan: Based on your retina [00:16:00] ID of re-targeting.
Ari: That’s okay, but based on the account owner at least in the house and maybe the device owner, like this is installed in my teenager’s room, but I can literally get –
Stefan: I was almost going there.
Ari: I can literally get down to a segment of one that’s ideal to me. First of all, a marketer will pay me way more for that eyeball than a generic eyeball.
Stefan: More targeting, right?
Ari: Yes. Secondly, it’s more relevant to me. I tend not to skip commercials if you get the segmentation right for me. I’m like, “Whoa, what is that? I want to hear about that.” It’s possible nowadays to get that right.
The things that excite me are things that unlock the ability to look at volumes of … like groups of data in the billions range, so that we can start to find very discrete segments and target much, much better. In fact, my health care customer says, “You should work more with me than anyone else because we are doing good work. We are saving people’s lives.”
Stefan: For a lower price, I’m sure.
Ari: But then I turned it back on him and said, “Actually, what you’re asking me to do is build the recommendation engine in health care. You’re asking me to tell doctors, people who have this disease should take that care path. People who take this medication should also take that medication.” It’s the exact same science as retail or travel.
Stefan: It’s just a little bit more complex to follow the different formats and the doctor knows safety for them.
Maybe to round this up what is … There’s obviously a lot of noise in the market still and people trying to get their feet wet, what’s the right approach to start with big data in general at Hortonworks? Where do I start? Obviously, maybe not the type of the data scientist that tries to make –
Ari: Everyone starts there though.
Stefan: Yeah, it’s kind of weird, right?
Ari: It is.
Stefan: It sounds interesting, “Science, ohhh.”
What’s … from your experience [00:18:00], what’s the first thing you would recommend?
Ari: There’s a forking to that answer, two pronged paths. If you’re a developer and want to pick up Hadoop, which a lot do … I’ve literally gotten emails from people who I used to work with saying, “I can add value to my resume if I know Hadoop.”
Ari: The answer there is the Hortonworks Sandbox. It is literally a virtual machine image at Hortonworks.com. You go to the sandbox, Hortonworks.com/sandbox, click on that link, download it, and you can start to do machine learning tutor … There’s a bunch of snapped-in tutorials, machine learning, basic SQL –
Stefan: Database is pre-installed?
Ari: Pig. Datameer can actually build a sandbox and we should do that together.
Stefan: We have that with you guys.
Ari: We do.
I thought we did.
People can actually wire up a Hadoop cluster, load some … there is sample data in it … they can bring their own data into it. It’s going to run on their laptop or some kind of desktop-class machine. Then they can wire up things like Datameer and actually start visualizing, and A) prove to themselves that they can wrap their heads around this problem domain, and B) instead of battling with their leadership over whether Hadoop is something we could be doing, show people real value.
All my customers where the developer has brought in Hadoop, where that developer has become a hero it’s because they actually went to EC2 or they went to our Sandbox, stood up a cluster, loaded some safe data into that cluster and showed some value before they called a bunch of people.
The other path, though, is the data lake path. If you want to go at scale … you’ve already convinced yourself, or you don’t care about the individual APIs, programming interface, user interface at a sandbox level, what we see people doing is basically saying, “Let me land a cluster with 10 to 100 nodes into a data center and let me pull some critical data sets onto that cluster and create a lake where people can come to that cluster [00:20:00] and start playing with those data sets.”
Either retain the data set that used to be too expensive to retain, retain it longer, or retain a finer [inaudible 00:20:10] version of it. Sometimes you go from an OLTP data store, which is detailed, to a warehouse and lose all the detail as you do some kind of sampling and/or some kind of process to lower the volume of the data. You may drop columns; most people tend to drop columns when they sample rows.
Here you say, “I have customers with like, 8,000 columns. Can you guys turn it into 15 columns?” So, load the 8,000-column version in Hadoop and let people start exploring it. Hadoop has backup. Hadoop has retention. Hadoop has an archive. Then let some scientists and analysts onto those data sets and typically go for the 360-degree view of the customer or the cross-channel analysis: what’s the customer doing across all of my lines of business, and who are my highest dollar value customers? That is the lowest-hanging fruit. It has turned us into the archive that can then feed the 360 analysis.
Stefan: That’s the most prominent use case we see all the time with our product.
Ari: Interesting. Good to know.
Stefan: Great. Thank you very much. That was really exciting. Thanks for the beer.
Ari: Thank you.
Stefan: Cheers and I hope –
Ari: I drank all of mine.
Stefan: I hope we have you back soon.