Big Data & Brews: Ari Zilka, CTO of Hortonworks, on Project Stinger, YARN and Tez
I had a really interesting discussion with Hortonworks’ CTO Ari Zilka about Project Stinger, ORC files, YARN and Tez. The transcript of our talk is included below the video with a bunch of useful hyperlinks so you can go find out more.
Stefan: Welcome to Big Data and Brews, today with Hortonworks’ CTO Ari Zilka. Hey, welcome.
Ari: Thank you, Stefan.
Stefan: Can you introduce the beer you like here? It says Czech style pilsner. It’s very European.
Ari: It’s a Gordon Biersch Pilsner. I really like Pilsner Urquell from Czech Republic.
Stefan: Now, we’re friends. Good, I love that. It’s actually my …
Ari: At least when it’s on draft in Czech Republic in Prague.
Stefan: Yeah. You travel there a lot?
Ari: Yeah, I love that city.
Stefan: It’s beautiful.
Ari: The Xbox One has this driving game, Forza Motorsport. They have a track which is the streets of Prague. I’ve been playing it and it just makes me want to go back to Prague right now.
Stefan: Do you go over the bridge as well, into the castle?
Stefan: In the game?
Stefan: Okay. Wow. Nice. Yeah, there’s cool technology in Prague, I have a bunch of friends there.
Ari: Mmhmm (affirmative).
Stefan: Can you introduce yourself? What’s your history? What are you doing beside drinking Czech-style beer?
Ari: Sure. Should I actually drink the beer?
Stefan: Yeah, please. I will have a Porter today.
[00:01:18] Ari: Okay. Anchor Steam, which interestingly enough is bottled over in South of Market in San Francisco about two blocks from the company I founded, Terracotta.
Stefan: Cool. That’s a good intro into your history. What did you do before Hortonworks?
Ari: I’m CTO at Hortonworks now. I started out at Hortonworks as Chief Products Officer, transitioned into the CTO recently. Always did part of the CTO role. I don’t know what a Chief Products Officer is, but basically we used to divide CTO into outbound and inbound. I was outbound, focused on customers. Now, I’m both outbound and [00:02:00] inbound focused on product and roadmap and features.
Before Hortonworks, I did the CTO role for Terracotta, the exact same thing, outbound and inbound. When I say outbound, yes, I’ve done talks. I’ve stood alongside Rod Johnson at SpringOne and things like that and got the Duke’s Choice Award from James Gosling at JavaOne for what the team did at Terracotta, but more importantly, I spend most of my time with customers.
I’m always doing what, it’s not a word, but our CEO calls it “solution architecting”. Architecting is not a word but I don’t know what else to call it, basically, solutioning with customers. Give me your problem domain, give me what you thought about, what you’ve researched so far, and we’ll go to a white board first. We’ll sketch it all out. We’ll deal with all of your corner cases, edge cases, complexities, volume, variety, velocity, even though I hate the three Vs and anything marketing spiel like that. Go through all of that stuff, nail it down, then start building proofs of concept, project plans to prove to ourselves this architecture will work. I’ll checkpoint with you. That’s the outbound side of my role.
The inbound side of my role is come back into the organization, engineering and product management representing all that myriad set of use cases and say, “Hive needs to do this next. Nobody is using Pig. Everybody is using Pig.” Things like that.
Stefan: Well, cheers and congratulations on your new role.
Ari: Cheers, thank you.
Stefan: Well, title … Let me double-click on the history. Terracotta was a pretty cool back end for kind of a distributed environment in the Java world [00:03:44]. I happen to know.
Ari: Cool. Actually, as I go around at Hortonworks, I find most people know Terracotta. I just wish more people had written checks for the software.
Stefan: Let’s start a little bit there [00:04:00]. I think you were also a CTO at Walmart?
Ari: Chief Architect …
Stefan: Chief Architect.
Ari: … At Walmart.com.
Stefan: Okay. You’re in the Bay Area quite a while?
Ari: Yeah. I went to …
Stefan: Did you grow up here?
Ari: No, I went to Cal and never left.
Stefan: Well, it’s hard to leave, right?
Ari: Yes, very.
Stefan: What did Terracotta do before we jump into the Big Data? I mean this is Big Data, Terracotta, yeah?
Ari: Terracotta is big fast data is what we used to call it.
Stefan: Yeah, in distributed environment?
Ari: Well, I don’t know if I’m allowed to say it, but I’m not at Terracotta anymore so I don’t care: PayPal, for example. paypal.com is powered by Terracotta. I think that’s 40 terabytes of purchase histories in memory to figure out fraud detection. Without going into detail, it is big data. It’s big in-memory data, that’s why their products now are called BigMemory.
Essentially what Terracotta does is a two-tier application level cache that a developer is in charge of. A developer uses it, wants objects from some data store and doesn’t really want to know when they’re getting them from the data store or when they’re getting them from local memory. Then they want to deal with the fact that their application is actually deployed to multiple instances. They don’t want to deal with consistency across threads and across, I call it space and time.
My data is changing. I need to deal with the freshness of that data, the correctness of that data, and I need to deal with the latency of access to that data. I’ll basically stick a Terracotta server in front of my data store and then I’ll wire all my applications on to Terracotta. This is sort of a misnomer because you’ll read around the database and put data into Terracotta. You won’t read through Terracotta [00:06:00] but you get a cache here. You get a shared cache down here. This is actually scaled out, partitioned and replicated. RAID 0+1 in software, in memory. Then it offloads this guy tremendously. So anything I change here is visible here, is visible here. I have dials or a continuum to be able to set consistency levels, read consistency, read-write consistency, XA compliance and all that kind of stuff.
What this allowed the average application to do is to store terabytes in memory at 1 millisecond latency access time, worst case 10 milliseconds down to Terracotta.
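The two-tier arrangement Ari sketches on the whiteboard (a cache in each application instance, a shared cache behind it, and the data store behind that) might look roughly like the following. This is only an illustrative sketch, not Terracotta's API: `TwoTierCache`, `shared_tier` and `load_from_store` are hypothetical names, the shared tier is a plain dictionary standing in for the server, and the consistency dials he mentions are left out entirely.

```python
class TwoTierCache:
    """One application instance's view: local memory in front of a shared tier."""

    def __init__(self, shared_tier, load_from_store):
        self.local = {}                         # tier 1: this instance's own memory
        self.shared_tier = shared_tier          # tier 2: stand-in for the shared server
        self.load_from_store = load_from_store  # hypothetical data-store loader

    def get(self, key):
        if key in self.local:
            return self.local[key]
        if key not in self.shared_tier:
            # Miss in both tiers: read the data store once and publish the result.
            self.shared_tier[key] = self.load_from_store(key)
        value = self.shared_tier[key]
        self.local[key] = value
        return value


store_reads = []


def load_from_store(key):
    # Records every trip to the (pretend) data store.
    store_reads.append(key)
    return f"row-for-{key}"


shared = {}
app_a = TwoTierCache(shared, load_from_store)
app_b = TwoTierCache(shared, load_from_store)
first = app_a.get("order-1")   # goes all the way to the data store
second = app_b.get("order-1")  # served from the shared tier
```

Two instances asking for the same key cause a single read against the data store, which is the offloading effect described above.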
Stefan: You already said objects, which made me really excited because, as a big friend of SQL, obviously I did a bunch of Spring applications with Hibernate and Ehcache and those kinds of things. You guys are the distributed version of that but you always store Java objects?
Stefan: Okay. How did you deal with the whole serialization? Did you do reflection on … I mean how did you compress the objects?
Ari: There are two incarnations of Terracotta. The first incarnation didn’t go and actually we disassembled applications at the byte code layer and found when byte codes were editing field level values, so we had a zero marshalling system. We weaved ourselves into an application and watched it make changes. You would grab a lock, meaning a synchronized barrier that would start a journal in your local thread. We’d keep track of everything you changed. When you release the lock, we’d flush the change. We were memory-model consistent but transparent with pseudo-no-marshalling.
It didn’t go because most applications weren’t thread-safe. They weren’t actually doing things exactly.
Ari: Yeah. We knew it. We just thought our tools could help them find their thread-safety issues. It was just too hard for them to clean up their apps. We went to a straight [00:08:00] serialization on put, deserialization on get kind of model, with the caveat that you want to not use Java serialization. It’s space-inefficient. It’s time-inefficient.
Secondly, you had an opportunity to optimize that. You could store a deserialized cached form in the application. You didn’t deserialize every time you got. You deserialized the first time you got on this node, this node or that one.
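The "serialize on put, deserialize once per node on first get" idea can be sketched as below. The assumptions are loud ones: `NodeCache` is a hypothetical class, Python's `pickle` stands in for whatever serializer was actually used, and invalidation of stale deserialized copies is ignored.

```python
import pickle


class NodeCache:
    """Per-node cache that keeps the deserialized form after the first get."""

    def __init__(self, shared_bytes):
        self.shared_bytes = shared_bytes  # key -> serialized bytes, shared by all nodes
        self.deserialized = {}            # key -> live object, local to this node
        self.deserialize_count = 0        # how often this node paid the cost

    def put(self, key, obj):
        # Serialize on put so other nodes can read the value.
        self.shared_bytes[key] = pickle.dumps(obj)
        self.deserialized[key] = obj

    def get(self, key):
        if key not in self.deserialized:
            # First get on this node: deserialize once and remember the result.
            self.deserialized[key] = pickle.loads(self.shared_bytes[key])
            self.deserialize_count += 1
        return self.deserialized[key]


shared_bytes = {}
node_a = NodeCache(shared_bytes)
node_b = NodeCache(shared_bytes)
node_a.put("user-7", {"name": "Ada", "visits": 3})
first_read = node_b.get("user-7")
second_read = node_b.get("user-7")
```

The second read on `node_b` returns the already deserialized object, so the node pays the deserialization cost only once per key.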
Stefan: Did you have your own serialization interface then if you didn’t use a Java one?
Ari: No [00:08:34] …
Stefan: You override the serializer?
Stefan: Yeah. I did this once for Hadoop. It wasn’t very popular because, well, back then we discussed with Doug, “Should we have Writables or should we have Serializables?” I was a big fan of, “Hey, Java is serializable. We just have to override the serializer.” Obviously if you just use the Java serialization system, it’s incredibly slow, right?
Ari: Yeah, but it’s incredibly intuitive. Java devs know what to do with transients… [00:09:01].
Stefan: It is. Exactly. Yeah, and then we ended up with Writables and then nobody ever … that was one of the biggest or still is the biggest problem in Hadoop. People are like, “Oh, I have that string Writable object. Let me put that into a local variable and get access to it a little later.” And they’re like, “Why did the string change?” Because obviously the object is recycled all the time. Anyhow, but interesting. Good.
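The object-reuse pitfall Stefan describes can be reproduced in a few lines. The sketch below is a Python analogy rather than Hadoop code: `Text` here is a hypothetical mutable holder standing in for a Hadoop Writable, and `read_records` mimics a reader that recycles one instance for every record.

```python
class Text:
    """Mutable value holder, a stand-in for a reusable string Writable."""

    def __init__(self):
        self.value = ""

    def set(self, value):
        self.value = value


def read_records(lines):
    reused = Text()  # one instance, recycled for every record
    for line in lines:
        reused.set(line)
        yield reused


lines = ["alpha", "beta", "gamma"]

# Bug: keeping references to the recycled object for later use.
kept_references = [record for record in read_records(lines)]
seen_later = [record.value for record in kept_references]

# Fix: copy the value out while the record is still current.
kept_copies = [record.value for record in read_records(lines)]
```

Every entry in `kept_references` is the same object, so reading it later shows only the last record; copying the value out at read time avoids the surprise.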
Ari: What’s interesting about all of this, it’s kind of funny because we were working on Project Stinger at Hortonworks and trying to make Hive much faster. One of the things we came across was the Java system class loader is god-awful slow. For the Hadoop core jars, they’re so big that it takes somewhere between half a second and two and a half seconds to just start up a JVM with Hadoop, proxy classes and everything wired in.
We had teams start to write our own class serializers and we called it [00:10:00] a Hadoop-shared object so we could pass classes around and share them in a cluster, recycle them but load them faster into the system. We got down to like a 40 millisecond JVM startup time with the only catch that you had to override the system class loader, which I told the team, “Hey. I’ve played this game before. [Crosstalk 00:10:18]”
Stefan: Yeah, been there, done that.
Ari: Yeah. No one wants to override the system class loader.
Stefan: Yeah. It’s a little sketchy but on the other hand, you have your Hadoop class and usually in that JVM you don’t want anything else, right? But yeah, I had a lot of fun with class loading when I worked for JBoss. Good old times, right?
Stefan: Overriding and making sure that log4j in different versions, in different EJB jars …
Stefan: Yeah. All that good stuff. Great. Tell me a little bit about your work at Hortonworks, what’s the most exciting project that you guys are working on right now?
Ari: The most exciting stuff we’re working on right now is inside Project Stinger. There’s two exciting things. I’m going to erase this diagram or try to.
Stefan: It’s style if you have multiple layers.
Ari: Inside Project Stinger, there’s two really exciting things we’re doing. One is on the storage and access to data layer. It’s ORC files and vectorization. Super exciting and we’ll help anyone …
Stefan: That’s Owen’s child?
Ari: Actually, no…
Ari: He never wants to admit it but ORC is called Optimized RC, in reality, it’s Owen’s RC, but he’s not arrogant so he doesn’t buy that.
Stefan: No, he’s not. I’ve known Owen for a while.
Stefan: Now Doug Cutting would come in and say, “Look, you know, I can do that much better with…” [00:12:00].
Stefan: Jump on that.
Ari: Sure. At the end of the day, ORC is a format contributed by Microsoft’s super geniuses inside the PDW team. These are actually guys who have spent in some cases 30 plus years in data storage formats for relational workloads. ORC has some things that are just superior to anything else on the Hadoop platform right now.
Ari: Like block level indexes and type-aware indexes. It’s a columnar format. I’m not going to bother to sketch up … maybe I should.
Ari: But I mean if you have a block like this, you really want to store all column one values and then all column two values and then all column three values. Typically, you call this a columnar store. Some advantages we get and most columnar stores, all columnar stores do that. That’s not interesting.
Advantages you get is you can take column one and compress it because now you have more …
Stefan: Most likely better compression.
Ari: Most likely better compression because you have more consistency across value space.
Stefan: Luckily, very frequently Hadoop files are sorted, so even better compression.
Ari: Yeah, exactly. Like you take a web log and this would be IP address. This would be access port, this would be browser type. Browser type is going to be what now? It’s going to be Chrome or what’s it called? Chrome or Safari or Internet Explorer and that’s it.
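To make the compression argument concrete, here is a small sketch that pivots web-log rows into columns and run-length encodes a low-cardinality, sorted column. Run-length encoding is used only as an easy-to-read stand-in for "values in one column compress well"; it is not a claim about the exact encodings ORC applies, and the sample rows and function names are made up.

```python
from itertools import groupby

# Hypothetical web-log rows: (ip address, access port, browser type).
rows = [
    ("10.0.0.1", 80, "Chrome"),
    ("10.0.0.2", 80, "Chrome"),
    ("10.0.0.3", 443, "Chrome"),
    ("10.0.0.4", 443, "Safari"),
    ("10.0.0.5", 443, "Safari"),
]


def to_columns(rows, names):
    """Pivot row tuples into one list of values per column name."""
    return {name: list(values) for name, values in zip(names, zip(*rows))}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows, ["ip", "port", "browser"])
browser_runs = run_length_encode(columns["browser"])
port_runs = run_length_encode(columns["port"])
```

Five browser values collapse into two runs because the column holds few distinct values and the rows happen to be sorted, which is the consistency across the value space that both speakers point to.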
Stefan: No Internet Explorer anymore.
Ari: We’re columnar and we can actually handle columns where the individual record sizes are quite large. We’re type-aware, which means you tell us if this is an int or some kind of numeric value or a long. You tell us that this is a string [00:14:00] and then we start doing things that are type-aware. We are SQL and Java-type compliant, which is superior to anything else, but then we have an index as part of each block.
Stefan: Even now you get the block much faster?
Stefan: How big is the block size?
Ari: It’s configurable but a good block size is like a gigabyte.
Stefan: Okay. Well, then it makes sense to have an index in the file.
Ari: For example a string index would be a dictionary. We dictionary encode strings and we write all the unique strings up in the index. We compress them down and then we write an integer look-up.
Stefan: Yeah, makes sense.
Ari: If you have things like URLs, they repeat a lot. The URLs are actually going to just be URL 1, URL 17, URL 22, URL 33. Integers we’re going to have min, max, average, things like that.
Stefan: Per block?
Ari: Per block. We’re going to have start date, end date.
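A per-block index of the kind Ari describes (a dictionary plus integer look-ups for strings, and min, max and average for numbers) could be built along these lines. This is a sketch of the idea only; the function names and sample data are hypothetical and the layout is not ORC's on-disk format.

```python
def dictionary_encode(strings):
    """Return (unique strings in first-seen order, one integer code per value)."""
    dictionary = []
    positions = {}
    codes = []
    for value in strings:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        codes.append(positions[value])
    return dictionary, codes


def numeric_stats(values):
    """Summary statistics kept in the block's index for a numeric column."""
    return {"min": min(values), "max": max(values), "avg": sum(values) / len(values)}


urls = ["/home", "/cart", "/home", "/home", "/checkout", "/cart"]
dictionary, codes = dictionary_encode(urls)
decoded = [dictionary[code] for code in codes]
status_stats = numeric_stats([200, 404, 200, 500])
```

Repeated URLs are stored once in the dictionary and referenced by small integers, while the numeric summary is what later lets a reader reason about a block without opening it.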
Stefan: Okay. It makes it small and really fast.
Ari: We find it. We do it and then what you can do then is you can basically use the index to skip a block. When you’re doing filters and aggregations, you want to basically say, “There’s a where clause. There’s a query predicate. I can apply the query predicate to the index.”
Stefan: Right. You don’t even need to …
Ari: Right. I don’t even need to hydrate the rows or columns at all.
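Applying the query predicate to the index instead of to the rows might look like the sketch below. The block layout (a dictionary with `min`, `max` and `values`) and the function name are assumptions made for illustration; only the skip decision itself reflects what is described above.

```python
# Hypothetical blocks, each carrying min/max statistics for one numeric column.
blocks = [
    {"min": 1, "max": 99, "values": [1, 42, 99]},
    {"min": 100, "max": 250, "values": [100, 180, 250]},
    {"min": 300, "max": 900, "values": [300, 640, 900]},
]


def scan_greater_than(blocks, threshold):
    """Evaluate `value > threshold`, skipping blocks the index rules out."""
    matches = []
    blocks_read = 0
    for block in blocks:
        if block["max"] <= threshold:
            continue  # nothing in this block can match, so never touch its values
        blocks_read += 1
        matches.extend(value for value in block["values"] if value > threshold)
    return matches, blocks_read


matches, blocks_read = scan_greater_than(blocks, 200)
```

For the predicate `value > 200` the first block is ruled out by its maximum alone, so only two of the three blocks are ever read.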
Stefan: That’s the slow part, the deserialization, the reflection.
Ari: Yeah. This is row one, this is row two, this is row three. Vectorization versus the classic Java loop: with my connection, I get a result set. While result set has next, then result set field one, field two, field three, next result. That while loop kills performance because you’re iterating through a result set and you’re paging data from various large RAM pools into processor L1 cache.
What we’re doing in [00:16:00] vectorization is saying, “Leave this block dehydrated, flattened, unmarshalled as a block. You can look inside the index as much as you need to. When you do pass across the block, vectorize the query predicate.” Turn the query predicate into scalars and then pass it across this as a mask. You’re basically looking at a giant set of ones and zeros.
Stefan: You basically just overlaying …
Ari: You’re saying, “I’m looking for the following pattern, 1011.” It says, “There is 1011.” Then you say, “Oh. Well, I want 1011 this way.” It says, “Okay. I see it right here. It’s row 23.” Then you pull out row 23.
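The "predicate as a mask of ones and zeros" idea can be illustrated with whole-column operations. Plain Python lists are used here so the sketch stays self-contained; an actual vectorized engine would run tight loops over primitive arrays, which this does not attempt to model. All names and data are hypothetical.

```python
def equals_mask(column, scalar):
    """One bit per row: 1 where the column value equals the scalar."""
    return [1 if value == scalar else 0 for value in column]


def greater_than_mask(column, scalar):
    """One bit per row: 1 where the column value exceeds the scalar."""
    return [1 if value > scalar else 0 for value in column]


def and_masks(left, right):
    """Combine two predicate masks row by row."""
    return [a & b for a, b in zip(left, right)]


def matching_rows(mask):
    """Row numbers whose bit is set; only these rows need to be materialized."""
    return [row for row, bit in enumerate(mask) if bit]


browser = ["Chrome", "Safari", "Chrome", "Chrome"]
status = [200, 500, 404, 200]

# WHERE browser = 'Chrome' AND status > 300, evaluated a column at a time.
mask = and_masks(equals_mask(browser, "Chrome"), greater_than_mask(status, 300))
rows = matching_rows(mask)
```

Each predicate is evaluated against a whole column at once, and rows are pulled out only after the combined mask says which ones match.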
Stefan: Then you’ve even done these with [inaudible 00:16:44]?
Ari: I’m moving a gigabyte at a time through the data. Obviously L1 caches are on the order of megabytes. The move of a gigabyte into a megabyte L1 cache or a 16-megabyte L1 cache will take you hundreds of clock cycles or a few thousand clock cycles, which means it will be done in under a second. We literally found a single laptop could manage a terabyte search in under two seconds with vectorization.
The ORC file is tied to the vectorization and the ORC file is tied to a lot of stuff we intend to provide to technologies like Datameer on top of us which is this global level index. If we take these indexes, you can use them now to show people ontologies about their data.
I have a column. I know its name. I know its type and I know its value range. I can bring the dictionary forward into Datameer. I can bring the integer min/max and date/timestamp values into Datameer and that becomes dimension data or ontological data that’s very interesting to the end user. Even though the system doesn’t know what it is, it’s very interesting to the end user.
I can use it to speed up searches because I could skip blocks. [00:18:00] Really what we’re talking about is take that index, centralize it to the entire table space or data set if you will and then proffer that up to anyone who wants it. Now you can build systems that actually service queries without ever looking at a record at all.
By the way, you paid no price to compute it except on ingest. You laid it out per block.
Stefan: That would be my question: how is that impacting write performance?
Ari: You’re packing a gigabyte block. You’re consuming some extra memory and you’re type-aware so you’re marshalling the types on write down into disk. Our write performance for an ORC file needs to get better but right now we have tuned it so it’s faster I think to write an ORC file than an RC file for example.
Stefan: That wasn’t great. It’s always a question of who you are performing for… [00:18:58].
Ari: Well, we’ve benchmarked ORC file writing against anything we can get our hands on. We find we’re competitive with the best things out there. Well, I don’t even want to name specifics but I don’t see a problem. I see a problem in performance relative to an absolute number I’d like us to get to so we can write tens of megabytes a second across a cluster. I want to write hundreds of megabytes a second but no one is writing that fast right now.
Stefan: Let’s zoom a little bit out of this. This was really great and helpful. Let’s zoom a little bit out. That will be part of Stinger.
Ari: This is out in public GA.
Stefan: Okay. Good. Basically everybody can write against that then?
Ari: Yeah. In fact that’s where YARN and Tez come in. I’ll let you zoom out in a second but no one knows about ORC except us and our customers, our paid customers and our open-source followers of Hortonworks platform [00:20:00] as something that’s part of the Hadoop community solution space. What I’d like the whole world to do is since ORC is in the open domain, truly open, in Apache, gifted away, we need to build all the tooling around it. People need to be able to ingest into ORC which is not obvious.
Obviously I don’t want to write one record into ORC. I want to write a whole block into ORC. People need block writers into ORC that buffer up and guarantee delivery, perhaps through Storm and Kafka or things like that, but you need a buffered block writer for ORC. Then you need to be able to assemble tools that consume ORC data efficiently. Not just Hive itself, but what if I want to write my own system that has nothing to do with SQL but is still completely dependent on ORC, vectorization and block-level indexes? I’d like to skip blocks and have query predicates pushed down, project my query on to my data as Datameer or as a custom application. How do I do that? The answer is Tez and YARN and all this other stuff.
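The buffered block writer Ari asks for could start from something as small as the sketch below. `BufferedBlockWriter` and its `write_block` callback are hypothetical; the callback stands in for whatever actually persists a block, and the delivery guarantees he mentions (for example via Storm or Kafka) are not modeled at all.

```python
class BufferedBlockWriter:
    """Accumulates records and hands them to a sink one whole block at a time."""

    def __init__(self, block_size, write_block):
        if block_size < 1:
            raise ValueError("block_size must be at least 1")
        self.block_size = block_size    # records per block (illustrative unit)
        self.write_block = write_block  # hypothetical sink that persists one block
        self.buffer = []

    def write(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.block_size:
            self.flush()

    def flush(self):
        # Emit whatever is buffered as one block; do nothing when empty.
        if self.buffer:
            self.write_block(list(self.buffer))
            self.buffer.clear()

    def close(self):
        # The final, possibly short, block must not be lost.
        self.flush()


written_blocks = []
writer = BufferedBlockWriter(3, written_blocks.append)
for record in range(7):
    writer.write(record)
writer.close()
```

Seven records with a block size of three reach the sink as two full blocks and one short trailing block, so the sink never sees a single-record write unless the stream ends that way.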
Stefan: We do a bunch of the indexing stuff. We already do on our system but not on that low level, right? Since we own data ingestion, we already do all this. I’m sure we have on the next version really cool stuff coming up.
Anyhow, well, cheers for that. Cheers on that.
Ari: Thank you.
Stefan: Let’s take a quick break and then we continue with the next session and talk a little bit more about the Hortonworks ecosystem and what else you guys are doing.
Ari: Okay. Sounds good. [00:22:00].