Datameer Blog

Special “How It Works” Big Data & Brews: On Spark, HBase and Project Stinger

May 27, 2014

We had a few more great chalk talks that I wanted to share in this week’s special “How It Works” Big Data & Brews. Pull up your favorite brew and hear about some of the cool Hadoop open source projects: former Quantifind CTO Erich Nachbar shares how Spark works, Michael Stack explains HBase, and Hortonworks’ Ari Zilka talks about Project Stinger.


Stefan:           Tell me more about Spark. You spend a lot of time with Spark, so help us understand, maybe, since there are obviously a lot of moving pieces in the architecture: why do you like Spark? If you’d like to, there are markers and the chalkboard, so feel free. How did you get to Spark, and what is exciting about it?

Erich:              I met Matei, who is the main contributor or founder of Spark, at one of the Hadoop meetups where he was showcasing his project. He used to be a grad student out of Berkeley, UC Berkeley, and that was his project. At that time it was, or still is, small, somewhere on the order of 15,000 lines of Scala code. He thought, “What if I could give people the option of either processing data similar to Hadoop, where I load something from disk, do the processing and spill it back?”

Or you can say, “Hey, you know what, if you have enough RAM, just cache this particular dataset in memory and then apply these operations in memory.” It works both ways, and what he did to ease the transition is build everything on top of HDFS as the basis, so he can use any input format as the [00:10:00] source.

Stefan:           It’s a Hadoop input format to get E [inaudible 00:10:04].

Erich:              Correct, so they can process [inaudible 00:10:07] files. But the point is that in our case, for example, we run this on MapR, but any Hadoop distribution [inaudible 00:10:15] would actually do, and then on top of it is typically …

Stefan:           Why are you using MapR?

Erich:              Well, if you look at the other distributions, MapR I think has a kick-ass file system. You can NFS mount it on the data scientists’ boxes and just access the data. It’s obviously limited to the speed of the network interface, which is not suitable for large amounts of data, but it’s good enough if the data scientists say, “Hey, I want to just poke at this data, load it in R and then play around with it.”

Stefan:           Sorry, I interrupted you. You run it on HDFS and then the input format.

Erich:              Yeah, and so what you would do is co-locate, in good old Hadoop fashion: the data nodes would be co-located with the Spark workers. What you would have, and this is weird because I’m mixing physical with logical, is you would say this is a data node – run me a Spark – Spark worker – worker in here, and so you would get data locality. With the input format and the IP address, it would actually find out, so you would get the same locality advantages that you would have with Hadoop and …

Stefan:           Is Spark then reading the data directly from [inaudible 00:11:37] of the data node, or how is it getting data locality [inaudible 00:11:42]?

Erich:              Correct you would install them on that same box.

Stefan:           Okay, then you’re accessing HDFS in the same … let me think about this. How would you integrate with the NameNode?

Erich:              The NameNode is always…

Stefan:           Is it bypassing just straight going off in the …

Erich:              Yeah, but the NameNode is really just [00:12:00] … it would find out where the blocks are located and then it would schedule the jobs under…. It works very similarly to what the JobTracker would do.

Stefan:           Yeah, okay.

Erich:              It’s pretty much the same thing.

Stefan:           Then they just implemented a fake JobTracker or …

Erich:              It’s still encapsulated. You would only use the file system; you wouldn’t use any of the JobTrackers at all.

Stefan:           Because I think the [inaudible 00:12:18] itself by Hadoop are not going to be open for this way to access data locality I guess I just take the whole thing.

Erich:              I think you can actually get you can get to it I thought so yeah.

Stefan:           All right.

Erich:              It seems to work, yeah, and you have the SparkMaster; it has a little web UI that …

Stefan:           Then you push your jobs to the SparkMaster and it distributes to all the Spark…

Erich:              The rest works exactly the same. One of the big differences, something they do that is very impressive, is if you look at …

Stefan:           [Inaudible 00:12:52] let’s not forget to drink no, no, no cheers.

Erich:              Cheers.

Stefan:           I get enough ah good.

Erich:              One of the cool things that Spark actually does is, when you’re in Hadoop land, you have a jar that is your job and it gets pushed out to all the nodes, and if you have a larger job that causes a certain overhead…

Stefan:           Yeah.

Erich:              What Spark does is, let’s say you have a map operation, so I have a collection, dot map, because I’m running a map operation on it. Let’s say this is my record, so there’s this color-coded record, and I’d say record times 2. Since this is functional programming, this result would be emitted as the result of that operation and …

Stefan:           Any closures.

Erich:              What it actually does is serialize only the bytecode for that [00:14:00] closure and push that out to the nodes. It was so interactive that, for prototyping purposes, we actually drove our customer front end through running jobs.
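To make the closure idea concrete, here is a minimal sketch in plain Python rather than Spark itself. It assumes a "worker" is just a local function and that the closure has no captured variables; the names `ship_closure` and `run_on_worker` are hypothetical and are not part of any Spark API. Only the function's bytecode is serialized and sent, not a whole job jar.

```python
import marshal
import types


def ship_closure(func):
    """Serialize only the function's bytecode (stand-in for sending it over the wire)."""
    return marshal.dumps(func.__code__)


def run_on_worker(payload, partition):
    """Rebuild the function from its bytecode and map it over a local partition."""
    code = marshal.loads(payload)
    func = types.FunctionType(code, {})  # no globals needed for this simple closure
    return [func(record) for record in partition]


# Hypothetical data: three partitions that would live on three different nodes.
partitions = [[1, 2, 3], [4, 5], [6]]

# The "record times 2" map operation from the chalk talk.
payload = ship_closure(lambda record: record * 2)
doubled = [run_on_worker(payload, partition) for partition in partitions]
```

The payload here is a small byte string, which is the point Erich makes: what gets pushed to each node is tiny compared with a packaged job.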

Stefan:           Okay.

Erich:              Starting a job takes maybe a half a second and even less than that because it’s efficient about like what it actually pushes out to the node then when you have it in memory you really just run whatever it is over in the nodes.

Stefan:           You kept all the data in memory all of the time then?

Erich:              Yeah I mean clearly this is if you have a big CPU problem this is great and if you can afford the RAM.

Stefan:           Right.

Erich:              If you have plenty of …

Stefan:           [Inaudible 00:14:38].

Erich:              Petabytes of data…

Stefan:           It’s all right.

Erich:              Exactly.


Stefan:           For the people who don’t know HBase: at a high level, how does it work? What’s the difference between …

Michael:        He wants me to draw on the board. He’s going to regret it.

Stefan:           I can fill your glass more up, then throwing it together.

Michael:        I don’t know if you’ve read the Bigtable paper. We’re pretty much the same.

Stefan:           Most likely people seem distant.

Michael:        It’s that rare commodity. It’s a well-written [04:00] paper.

Stefan:           It’s actually true.

Michael:        It’s actually understandable.

Stefan:           It’s actually the whole reputation of HBase fast paper was actually written very well.

Michael:        It’s nothing more than a big table, like in your Excel table, and then you have rows. Except this goes to like billions, and then what happens is you take this table and you break it into pieces, and then this piece you put on a server of some kind, a region server. I think this is going to go bad.

Stefan:           No, it’s good. It’s awesome.

Michael:        You can have many of those. Then each one of these regions, there can be many of those. You could have one region, or this could have like hundreds of regions.

Stefan:           Per region server.

Michael:        Per region server.
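The table-to-regions-to-servers picture can be sketched in a few lines of Python. This is an illustration only: the split points, the server names `rs1` and `rs2`, and the `locate` helper are all hypothetical, and the sketch assumes each region covers a contiguous range of sorted row keys, as in the Bigtable design Michael refers to.

```python
from bisect import bisect_right

# Hypothetical split points: region i covers keys from region_start_keys[i]
# up to (but not including) region_start_keys[i + 1].
region_start_keys = ["", "g", "n", "t"]

# Hypothetical assignment: one region server can host many regions.
region_to_server = {0: "rs1", 1: "rs2", 2: "rs1", 3: "rs2"}


def locate(row_key):
    """Return (region index, region server) responsible for a row key."""
    region = bisect_right(region_start_keys, row_key) - 1
    return region, region_to_server[region]


apple = locate("apple")
zebra = locate("zebra")
```

Because the start keys are sorted, finding the region for a row is a binary search, which is what makes random reads and writes of individual rows cheap to route.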

Stefan:           Do you add more data in between, or do you just append, like in an HDFS file system? Can you insert, so to say?

Michael:        I suppose that’s where HBase comes into play. We add the random read-write to …

Stefan:           You can basically update individual rows, and you can add things.

Michael:        You know, like small. HDFS or even MapReduce is usually about doing terabytes, being fluid, streaming through loads of stuff. What we add to the family is the random lookup of little bits.

Stefan:           They say in general a queen or master server can manage all of this?

Michael:        I never heard it called a queen. I think I’m going to call it the queen for now. We have a master process [06:00] that coordinates all the region server processes.

Stefan:           I assume regions are then replicated between multiple servers.

Michael:        The thing is we run actually on HDFS, right? We just write to HDFS.

Stefan:           So HDFS is taking care of the replication?

Michael:        As you know, HDFS does the replication, so when we write, we write to three replicas.


Stefan:         Tell me a little bit about your work at Hortonworks. What’s the most exciting project that you guys are working on right now?

Ari:                  The most exciting stuff we’re working on right now is inside Project Stinger. There’s two exciting things. I’m going to erase this diagram or try to.

Stefan:           It’s style if you have multiple layers.

Ari:                  Inside Project Stinger, there’s two really exciting things we’re doing. One is on the storage and access to data layer. It’s ORC files and vectorization. Super exciting and we’ll help anyone …

Stefan:           That’s Owen’s child?

Ari:                  Actually, no…

Stefan:           No?

Ari:                  He never wants to admit it but ORC is called Optimized RC, in reality, it’s Owen’s RC, but he’s not arrogant so he doesn’t buy that.

Stefan:           No, he’s not. I know Owen since a while.

Ari:                  Yeah. On the storage side, it’s ORC files and vectorization on top of those ORC files. Then on the factoring side, on the architecture side, it’s projects Tez and YARN.

Stefan:           Now Doug Cutting would come in and say, “Look, you know, I can do that much better with…” [00:12:00].

Ari:                  Parquet.

Stefan:           Jump on that.

Ari:                  Sure. At the end of the day, ORC is a format contributed by Microsoft’s super geniuses inside the PDW team. These are actually guys who have spent in some cases 30 plus years in data storage formats for relational workloads. ORC has some things that are just superior to anything else on the Hadoop platform right now.

Stefan:           Like?

Ari:                  Like block level indexes and type-aware indexes. It’s a columnar format. I’m not going to bother to sketch up … maybe I should.

Stefan:           Yeah.

Ari:                  But I mean, if you have a block like this, you really want to store all column one values and then all column two values and then all column three values. Typically, you call this a columnar store. Most columnar stores, all columnar stores, do that. That’s not interesting.

The advantage you get is that you can take column one and compress it because now you have more …

Stefan:           Most likely better compression.

Ari:                  Most likely better compression because you have more consistency across value space.

Stefan:           Luckily, very frequently Hadoop files are sorted, so even better compression.

Ari:                  Yeah, exactly. Like you take a web log and this would be IP address. This would be access port, this would be browser type. Browser type is going to be what now? It’s going to be Chrome or what’s it called? Chrome or Safari or Internet Explorer and that’s it.
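The web log example can be sketched in Python to show what "store all column one values, then all column two values" means and why a low-cardinality column such as browser type compresses well. The rows below are made-up data, `zlib` is only a stand-in compressor, and nothing here reflects the actual ORC encoding.

```python
import zlib

# Made-up web log rows: (ip address, access port, browser type).
browsers = ["Chrome", "Safari", "Internet Explorer"]
rows = [
    (f"10.0.0.{i % 250}", 80 if i % 2 == 0 else 443, browsers[i % 3])
    for i in range(300)
]

# Columnar layout: all values of one column are stored together.
ip_col, port_col, browser_col = (list(column) for column in zip(*rows))


def raw_and_compressed_size(values):
    """Return the byte size of a column before and after zlib compression."""
    raw = "\n".join(str(value) for value in values).encode("utf-8")
    return len(raw), len(zlib.compress(raw))


browser_raw, browser_packed = raw_and_compressed_size(browser_col)
```

Since the browser column only ever holds three distinct strings, the compressor sees long runs of repeated content, which is the "more consistency across value space" Ari describes.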

Stefan:           No Internet Explorer anymore.

Ari:                  We’re columnar and we can actually handle columns where the individual record sizes are quite large. We’re type-aware, which means you tell us if this is an int or some kind of numeric value or long. You tell us that this is a string [00:14:00] and then we start doing things that are type-aware. We are SQL and Java-type compliant, which is superior to anything else, but then we have an index as part of each block.

Stefan:           Even now you get the block much faster?

Ari:                  Yeah.

Stefan:           How big is the block size?

Ari:                  It’s configurable but a good block size is like a gigabyte.

Stefan:           Okay. Well, then it makes sense to have an indexing file.

Ari:                  For example a string index would be a dictionary. We dictionary encode strings and we write all the unique strings up in the index. We compress them down and then we write an integer look-up.

Stefan:           Yeah, makes sense.

Ari:                  If you have things like URLs, they repeat a lot. The URLs are actually going to just be URL 1, URL 17, URL 22, URL 33. Integers we’re going to have min, max, average, things like that.
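Dictionary encoding as described here, unique strings written once plus an integer lookup per row, can be sketched as follows. The function names and the sample URLs are hypothetical, and this is a simplified illustration, not the ORC on-disk format.

```python
def dictionary_encode(values):
    """Return (unique strings in first-seen order, one integer code per value)."""
    dictionary = []
    positions = {}
    codes = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        codes.append(positions[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the integer codes."""
    return [dictionary[code] for code in codes]


# Made-up URL column with heavy repetition.
urls = ["/home", "/cart", "/home", "/home", "/checkout", "/cart"]
dictionary, codes = dictionary_encode(urls)
```

For this input the dictionary holds three strings and the column itself becomes six small integers, which is the "URL 1, URL 17, URL 22" effect from the talk.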

Stefan:           Per block?

Ari:                  Per block. We’re going to have start date, end date.

Stefan:           Okay. It makes it small and really fast.

Ari:                  We find it. We do it and then what you can do then is you can basically use the index to skip a block. When you’re doing filters and aggregations, you want to basically say, “There’s a where clause. There’s a query predicate. I can apply the query predicate to the index.”

Stefan:           Right. You don’t even need to …

Ari:                  Right. I don’t even need to hydrate the rows or columns at all.
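Block skipping with a per-block min/max index can be sketched like this. The block contents, the `build_index` and `filtered_scan` names, and the range predicate are assumptions made for illustration; the real index also carries other statistics.

```python
def build_index(blocks):
    """Compute a (min, max) entry per block, as would happen once at write time."""
    return [(min(block), max(block)) for block in blocks]


def filtered_scan(blocks, index, low, high):
    """Return values in [low, high] and how many blocks actually had to be read."""
    matches = []
    blocks_read = 0
    for block, (block_min, block_max) in zip(blocks, index):
        # Apply the query predicate to the index first: skip blocks that cannot match.
        if block_max < low or block_min > high:
            continue
        blocks_read += 1
        matches.extend(value for value in block if low <= value <= high)
    return matches, blocks_read


# Made-up integer column split into three blocks.
blocks = [[1, 5, 9], [20, 25, 31], [40, 44, 58]]
index = build_index(blocks)
matches, blocks_read = filtered_scan(blocks, index, 22, 45)
```

Here the first block is never opened, because its maximum of 9 is below the lower bound of the filter, so its rows are never deserialized.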

Stefan:           That’s the slow part, the deserialization, the reflection.

Ari:                  Yeah. This is row one, this is row two, this is row three. Vectorization: the classic Java loop is, with my connection, I get a result set. While result set dot has next, then resultset.field1, field2, field3, next result, in a while loop. That while loop kills performance because you’re iterating through a result set and you’re paging data from various large RAM pools into processor L1 cache.

What we’re doing in [00:16:00] vectorization is saying, “Leave this block dehydrated, flattened, unmarshalled as a block. You can look inside the index as much as you need to. When you do pass across the block, vectorize the query predicate.” Turn the query predicate into scalars and then pass it across this as a mask. You’re basically looking at a giant set of ones and zeros.

Stefan:           You’re basically just overlaying …

Ari:                  You’re saying, “I’m looking for the following pattern, 1011.” It says, “There is 1011.” Then you say, “Oh. Well, I want 1011 this way.” It says, “Okay. I see it right here. It’s row 23.” Then you pull out row 23.
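The mask idea can be illustrated with a small Python sketch: the predicate is evaluated one whole column at a time into ones and zeros, the masks are combined, and only then are matching row numbers pulled out. This is a conceptual sketch with made-up columns and hypothetical helper names; pure Python lists do not give the CPU-level benefits of real vectorized execution.

```python
def equals_mask(column, wanted):
    """One pass over a whole column, producing a 1 or 0 per row."""
    return [1 if value == wanted else 0 for value in column]


def and_masks(left, right):
    """Combine two predicate masks row by row."""
    return [a & b for a, b in zip(left, right)]


def matching_rows(mask):
    """Row numbers whose mask bit is set; only these rows need to be materialized."""
    return [row for row, bit in enumerate(mask) if bit]


# Made-up columns from the web log example.
browser_col = ["Chrome", "Safari", "Chrome", "Chrome", "Safari"]
port_col = [80, 443, 443, 80, 80]

# WHERE browser = 'Chrome' AND port = 80
mask = and_masks(equals_mask(browser_col, "Chrome"), equals_mask(port_col, 80))
rows = matching_rows(mask)
```

The per-row while loop disappears: each predicate touches one contiguous column, and rows are only looked up once the combined mask says they match.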

Stefan:           Then you’ve even done these with [inaudible 00:16:44]?

Ari:                  I’m moving a gigabyte at a time through the data. Obviously L1 caches are on the order of megabytes. The move of a gigabyte into a megabyte L1 cache or a 16-megabyte L1 cache will take you hundreds of clock cycles or a few thousand clock cycles, which means it will be done in under a second. We literally found a single laptop could manage a terabyte search in under two seconds with vectorization.

The ORC file is tied to the vectorization and the ORC file is tied to a lot of stuff we intend to provide to technologies like Datameer on top of us which is this global level index. If we take these indexes, you can use them now to show people ontologies about their data.

I have a column. I know its name. I know its type and I know its value range. I can bring the dictionary forward into Datameer. I can bring the integer min/max and date/time-stamp values into Datameer, and that becomes dimension data or ontological data that’s very interesting to the end user. Even though the system doesn’t know what it is, it’s very interesting to the end user.

I can use it to speed up searches because I could skip blocks. [00:18:00] Really what we’re talking about is take that index, centralize it to the entire table space or data set if you will, and then proffer that up to anyone who wants it. Now you can build systems that actually service queries without ever looking at a record at all.

By the way, you paid no price to compute it except on ingest. You laid it out per block.

Stefan:           That would be my question: how is that impacting write performance?

Ari:                  You’re packing a gigabyte block. You’re consuming some extra memory and you’re type-aware so you’re marshalling the types on write down into disk. Our write performance for an ORC file needs to get better but right now we have tuned it so it’s faster I think to write an ORC file than an RC file for example.

Stefan:           That wasn’t great. It’s always a question of who you are performing for… [ 00:18:58].

Ari:                  Well, we’ve benchmarked ORC file writing against anything we can get our hands on. The best things we find out there, we’re competitive with. Well, I don’t even want to name specifics, but I don’t see a problem. I see a problem in performance relative to an absolute number I’d like us to get to. We can write tens of megabytes a second across a cluster. I want to write hundreds of megabytes a second, but no one is writing that fast right now.



Stefan Groschupf

Stefan Groschupf is a big data veteran and serial entrepreneur with strong roots in the open source community. He was one of the very few early contributors to Nutch, the open source project that spun out Hadoop, which, 10 years later, is considered a 20 billion dollar business. Open source technologies designed and coded by Stefan can be found running in all 20 of the Fortune 20 companies in the world, and innovative open source technologies like Kafka, Storm, Katta and Spark all rely on technology Stefan designed more than half a decade ago. In 2003, Groschupf was named one of the most innovative Germans under 30 by Stern Magazine. In 2013, Fast Company named Datameer one of the most innovative companies in the world. Stefan is currently CEO and Chairman of Datameer, the company he co-founded in 2009 after several years of architecting and implementing distributed big data analytic systems for companies like Apple, EMI Music, Hoffmann La Roche, AT&T, the European Union, and others. After two years in the market, Datameer was commercially deployed in more than 30 percent of the Fortune 20. Stefan is a frequent conference speaker and contributor to industry publications and books, holds patents and advises a set of startups on product, scale and operations. When not working, Stefan is backpacking, sea kayaking, kite boarding or mountain biking. He lives in San Francisco, California.