Big Data & Brews: Antonio Piccolboni, Independent Data Scientist, part 2
0:44 Alright, welcome back to Big Data & Brews!
So let’s talk about your open source project. The way I would think about this is I’m just writing R code and saying, now execute this R code, distributed on a Hadoop cluster. Is that the way to think about it, or can you explain more about the overall user interface, the user interaction from my side, as a day-to-day R user?
So I wish it were that simple, but it’s not. The project has several components: one interfaces with HBase, one interfaces with HDFS. For HDFS, the commands are really similar to what you would expect from a regular file system. So you are just moving a file? Move a file. You probably just prefix it with hdfs and that’s it.
But it’s in R?
1:47 It’s in R, exactly. You are at the prompt, the R interpreter, or you can write scripts or modules, whatever your level of R is. But with HBase, you start having a unique database model that people are not used to. If you’re used to RODBC, the SQL interface for R, you think, oh, it’s another database. Well, you’re going to find out that these days there isn’t one database, there are many, and each one has different capabilities. So the HBase interface is pretty much modeled on top of an interface: you can get a range of keys, you can create a table, all the things you can do in HBase, you can do a join. It doesn’t have a big layer of abstraction on top of it that adds secondary indices or other stuff that HBase doesn’t have but that someone could build, conceivably. And when you get to the one package I work on in particular, RMR2, which interfaces with MapReduce, you have to organize your computation around MapReduce. That’s a big surprise for people who like R for a number of reasons, in particular because it has some 5,000 packages, many of which implement the latest and greatest statistical methods. They hope they can just switch it on, magical powers, and now it runs on Hadoop. Unfortunately that doesn’t happen. You can use all of these packages inside the MapReduce functions, that’s great…
Right, but you still have to write MapReduce functions, but you write them in R, is that right?
Absolutely. At Revolution, as a proprietary offering, they provide specific algorithms that they have already ported.
Hadoopinized, maybe? Okay.
3:47 Ported is incorrect though, because the sequence of steps is different, so it’s a new algorithm, but the statistic is the same. So an lm is an lm: the algorithm is different, but the model is the same. Of course, that’s a handful, not 5,000, so it would take years, if not decades, to convert the whole community and say let’s start writing for real, let’s start writing for in-memory laptop.
Right. So let me ask you something. Is it hard for people coming out of the R community to understand and get their heads around some of the limitations that Hadoop imposes, like okay, this is map, this is shuffle, this is reduce? Is there any help that you provide there with some higher abstraction, or do they just have to write their own partitioner, their own map function, their own reduce function, and it’s just that you write them in R instead of Java?
4:46 The first step of RMR2 was to help people do that. On the other hand, it’s based on streaming underneath, and it’s not just streaming, because we serialize every R object transparently. So you put in a matrix as a key, it comes out as a matrix on the other side. You can put in a function as a key, I don’t know why you would do that…
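The point that a computation has to be reorganized around map, shuffle, and reduce can be made concrete with a minimal in-memory sketch. It is written in Python rather than R, it is not the RMR2 API, and the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical helpers invented for illustration. Nothing here is distributed; it only shows the shape of the code a user has to write.

```python
from collections import defaultdict

def map_fn(_key, line):
    # Map: emit one (word, 1) pair per word in the input line.
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: combine all values that were grouped under one key.
    yield word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    # Hypothetical single-process driver standing in for the cluster.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # shuffle: group values by key
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results

lines = [(0, "big data and brews"), (1, "big data on Hadoop")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
```

Any existing library function can be called inside `map_fn` or `reduce_fn`, but the split into the two functions still has to be designed by hand, which is the limitation described above.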
Let me make sure I understand this. So you said you serialize R objects but then you said you use streaming, so you basically use Hadoop streaming to communicate between R and Java MapReduce code?
And then how do you do the serialization? You use streaming there too, or you serialize the R straight into a byte-stream, or how is that working?
I think it’s a hybrid of the two things you said. Hadoop streaming defines a binary format that they’re willing to process, so for a simple data set, you can just use that. For the complicated ones, we just say this is a bunch of raw bytes, and underneath it’s serialized using the internal R machinery. That’s why we can serialize functions. Typedbytes doesn’t have a provision to serialize functions or closures, so we just say okay, these are raw bytes, don’t bother, and they just go through that way.
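The hybrid scheme described here, typed encoding for simple values and opaque raw bytes for everything else, can be sketched as follows. This is a Python illustration under stated assumptions: the tag values and the `encode`/`decode` helpers are hypothetical, the layout is not the actual typedbytes wire format, and `pickle` stands in for R’s internal serialization.

```python
import pickle
import struct

# Hypothetical one-byte type tags; not the real typedbytes codes.
TAG_INT = 0
TAG_STR = 1
TAG_RAW = 2

def encode(obj):
    # Simple values get a typed payload any consumer could read.
    if type(obj) is int:
        tag, payload = TAG_INT, struct.pack(">q", obj)  # assumes a 64-bit integer
    elif type(obj) is str:
        tag, payload = TAG_STR, obj.encode("utf-8")
    else:
        # Everything else travels as opaque raw bytes produced by the
        # language's own serializer; the transport never looks inside.
        tag, payload = TAG_RAW, pickle.dumps(obj)
    # Header: 1-byte tag followed by a 4-byte big-endian payload length.
    return struct.pack(">BI", tag, len(payload)) + payload

def decode(buf):
    tag, length = struct.unpack(">BI", buf[:5])
    payload = buf[5:5 + length]
    if tag == TAG_INT:
        return struct.unpack(">q", payload)[0]
    if tag == TAG_STR:
        return payload.decode("utf-8")
    return pickle.loads(payload)

matrix = [[1.0, 2.0], [3.0, 4.0]]
assert decode(encode(matrix)) == matrix
```

One difference worth noting: Python’s `pickle` stores functions by reference and cannot serialize closures, so the function-as-key case mentioned in the interview depends on R’s own serialization and is not reproduced by this sketch.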
So, that would mean that you could do like image analytics with R on Hadoop as well, right, or are you kind of limited to string, integer, boolean data types?
No, no, you can do everything. You could have the key be an ID in the map call and the value be some raster image. That’s why people find it to be particular, you know, it’s just complex data types, and passing matrices around, particularly doing matrix computation. So you process medium-size matrices and then in the reduce you pull them all together and stuff like that.
6:57 When we talked offline you talked a little bit about performance. Is the whole streaming thing an issue? Is there anything in sight with better R-Java integration to help you in the future, or is that not really the problem because usually you’re doing complex CPU processing and I/O isn’t a problem?
There’s a price to pay when you use very small keys and values, especially for calling out to R serialization. So if you’ve been using specialized R types like a matrix or a function, but at very small sizes, that’s kind of a sore point we’re trying to fix now. That means we need to implement a little more of the serialization machinery ourselves. We have people who are taking some of these use cases that are not performing very well, and we’re trying to fix those. For those that work well, there is a price to pay for using R anyway, but if the disks are not incredibly fast, we can get really close to Java. I used a comparison as a guiding test case for a release and we got within 20% of Java. That’s on EC2, which is slow anyhow. On my laptop, which is all SSDs and 4 cores, we are like 5x slower than Java. The reality is somewhere in between.
8:51 It might not matter, you know, because you just add more hardware to the system. The whole idea is it’s linearly scalable.
Absolutely, but there’s a point where if you use the interpreter the wrong way, you’re going to have a 100x slowdown, and not everybody can afford the hardware to counter that. So we need to be within a 2x, 5x distance, and then the other advantages are going to pay off. For instance, if you’re prototyping quickly, if you’re doing a one-off, if you’re doing exploratory analysis, you don’t want to do that in Java. I can tell you, I’ve done that in the old times.
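One way to see why very small keys and values are costly is that each call to a general-purpose serializer carries a fixed overhead, which dominates when the payload is tiny. The sketch below measures that effect with Python’s `pickle` as a stand-in; it says nothing about the actual R serialization costs or about how RMR2 addresses them, and the record shape is an assumption chosen for illustration.

```python
import pickle

# 1,000 tiny key-value pairs, the case described as a sore point.
records = [(i, i * 2) for i in range(1000)]

# One serializer call per record: every call repeats the fixed framing.
per_record_bytes = sum(len(pickle.dumps(r, protocol=4)) for r in records)

# One serializer call for the whole batch: the framing is paid once.
batched_bytes = len(pickle.dumps(records, protocol=4))

overhead_ratio = per_record_bytes / batched_bytes
print(f"per-record: {per_record_bytes} bytes, batched: {batched_bytes} bytes")
```

The same reasoning applies to call overhead in time as well as bytes, which is why processing records in larger chunks tends to narrow the gap with a compiled implementation.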
Right, and you write a class long like this, for like 2 lines of code that really matter.
Right, which, talking of abstraction, we have a new package that was just released. It’s called plyrmr. Plyr is one of the most downloaded packages for data manipulation in R. We took inspiration from that, and a little bit from SQL, to get something easier. So you don’t select the key, you just say group by; it’s the same thing. But people can stay within their own culture: a group by, oh, I understand this. Instead of map it’s called select or transform, and it has a slightly easier syntax.
And people are like “oh yeah, I’ve seen that before.”
And if you look underneath, it’s like two lines from the other package, which say okay, create a key-value pair, the key is the column. It’s very simple, but we are trying to get another little step of abstraction. Again, it doesn’t do the magical translation of algorithms, but it makes it more accessible.
10:31 Well, that’s the most important thing, right? So what are some of the exciting use cases you can share that people are running on this platform?
Oh, wait a second. With plyrmr, I can write the word count in a tweet. It’s 140 characters.
But it’s too slow, so I’m working on the speed, and as soon as the speed arrives, I will tweet that this is word count in 140 characters.
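The idea that a group by is just a thin layer that makes a column the key of a key-value pair can be sketched like this. The code is Python, not R, and `group_by` and `summarize` are hypothetical helpers, not the plyrmr API; the crop and water data are made up for the example.

```python
from collections import defaultdict

def group_by(rows, column):
    # Underneath, this only creates key-value pairs: the key is the
    # value of the chosen column, the value is the row itself.
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)

def summarize(groups, **aggregates):
    # Apply each named aggregate function to the rows of every group.
    return {
        key: {name: fn(rows) for name, fn in aggregates.items()}
        for key, rows in groups.items()
    }

rows = [
    {"crop": "corn", "water": 3},
    {"crop": "wheat", "water": 2},
    {"crop": "corn", "water": 4},
]
summary = summarize(
    group_by(rows, "crop"),
    total_water=lambda rs: sum(r["water"] for r in rs),
    n=len,
)
```

The user never mentions keys or reducers, yet the grouping step is the shuffle and the aggregate functions play the role of the reduce.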
Nice. I think this is really where everyone is going. It’s about efficiency of engineering resources.
That’s why you have Cascalog, Scala: how readable, manageable, and short can you make code, right? Because in the end, writing code that compiles is not the real problem. The problem is to write code that human beings can read.
I like how you say short; it’s an opportunity for a little plug. I have a microblog about restraint in programming, which is mostly conciseness, if you look at the actual citations, but also restraint in general, like if you decide not to use loops and do everything with functions. When I used to use Perl in my old biotech years, I refused to use for loops, so I call that no for loops.
What’s the URL?
Asceticprogrammer.info, and I thought the ascetic programmer should not drink beer. And there’s a citation of a Paul Graham blog that, I think, says conciseness is power. Essentially we talk about what makes code elegant, what makes it reusable, easy to fix, and it all revolves around conciseness. If you reuse more, if you don’t express yourself twice, if you don’t have boilerplate code, it’s all toward a compression of code. Then bugs are fewer, and all the other advantages seem to be at least correlated with, if not caused by, conciseness, so I think it’s a good point.
I want to come back to the use cases though. What is the most amazing thing that people are running on your platform that you’re aware of that you can share?
Yeah, I know it’s in production at Fortune 100 companies in at least 3 cases, but will they tell me what they’re doing? But there are many verticals: banks, agriculture…
And most likely, like some predictive analytics? Scoring, fraud analytics?
It’s mostly modeling complex phenomena. Like, you know, in agriculture, trying to get the exact amount of water, I’m not saying plant by plant, but maybe square foot by square foot. At this point it’s like big data, or data-intensive agriculture. I used to joke: imagine one day when sensors are cheaper than corn and we can just spray them, they are biodegradable, we can just spray them on the field. And then somebody from Monsanto said, I don’t think we’re very far from that. They can see a lot, and so they can take the thing to RHadoop and they can do something with it.
14:03 Any other use cases? So agriculture, I guess to then predict growth rates, I guess that’s a big business, right? Any financial services?
Yeah financial services use it but they don’t tell me what the use case is.
Is there like image processing?
Not yet. But I want people to do it.
It’s so fascinating what you can do there.
All the graphics devices in R write directly out to the file system, so there are some technical issues in that you need to grab the output. But yes, my idea is to do a big data visualization in R where you can build something that is intrinsically not possible without the infrastructure of R and Hadoop.
Maybe doing a lot of tiles and partition by the tiles.
Or a video, one of the two.
15:03 Switching topics a little bit, what was the most difficult part about starting an open source project in the Hadoop space? Is it getting the community on board, is it getting the right ideas implemented, what did you see as the cultural, social challenge to build an open source project?
Well, you know, R is great because it has a level that is interactive, exploratory, where you basically don’t need to be a programmer. Short of a spreadsheet, it’s the next best thing really, if you’re really starting from nothing. And there are multiple levels: you can write scripts, you can write modules, and really do development like in a real programming language; it’s really full featured. We have a two-million-strong community of users. But those users aren’t ready to do a pull request. Most of them use it mostly interactively, and then there’s another subset that can write decent scripts, and then package developers, at least in the public sphere, are 100-200 people. So when I say it’s 2 million people, don’t imagine this is 2 million people ready to do pull requests on the project, right. So we really need to help them deploy it. I think deploying it is a hard part, and a non-Java piece of code on a Java infrastructure is kind of the hardest part when people are getting started. We need to do a lot of hand holding and convincing administrators that it is alright, it is tried and true and nothing bad is going to happen. So I think being C-based, Fortran-based software in a Java world is a little bit difficult, combined with the fact that users are not necessarily technically sophisticated. Which is not to say they’re stupid; they’re excellent statisticians, domain experts, they’re just not computer scientists. Who could install the thing on a thousand machines with a single keystroke, right? Not everybody can do that.
17:22 Yeah, we see that quite a lot. There are just different audiences. The perception of what a data scientist is is mixed. You have software developers, right, and you push them into “oh, now look into this support vector machine,” and they’re like “well, I’m writing Java software,” and then you have people who are very strong on the mathematical side, and you say “well, why don’t you develop a 1000-node distributed computation platform,” and they’re like “I do different things.”
You know, they say if you build it, they will come. I guess there are lots of people interested in more direct access to data, to make it easier, and if there is a new tool that opens it up to another sector of people, you will make another bunch of people more productive and happier. And there’s a limit, you can’t make it any easier past some point, and then, whatever, you have to do the best you can do.
So outside of RHadoop, what are some other open source projects you see that you’re really excited about, that you maybe think about incorporating, that you think have a great future?
I looked into Node,
I think Reno is something on my to-do list, but I don’t think I’ve had time to actually…
But you have R!
I can try to model it, so, we’ll see. It’s already moving to the server side, it has interesting tools, it’s got a lot of people working on interpreters so that you can write crappy code and it will run fast anyway,
20:42 Well, thanks for stopping by, cheers. So go check out RHadoop, and get your R code running on Hadoop.