This week, independent data scientist Antonio Piccolboni stopped by the studio to chat about his history as a bioinformatician, how he got started with one of the earliest versions of Hadoop at Quantcast, and a little bit about the open source project he’s working on with Revolution Analytics to combine R and Hadoop.
As always, feel free to comment below or join the conversation on Twitter with us (@StefanGroschupf and @Piccolbo) at #BigDataBrews.
0:07 Welcome back to Big Data & Brews today with Antonio Briccolboni, no, no, that’s not right.
Close, close, my mom would recognize me.
Why don’t you introduce yourself and your brew?
OK, my name is Antonio Piccolboni just so that other people recognize me. This is Honey & Sons English Breakfast, a classic tea. I wanted to show not only that I am more sober than all your other guests, but I have to drive right after this talk and you know having a green card and DUIs don’t go together.
You know I have the same problem, people always ask me why don’t you drink and drive? Don’t drink and drive. I have a green card, and it’s a crime, and if you have a crime, you have problems with your green card.
It’s of course a crime for everybody, we’re not encouraging that, but yeah we could be on a plane really quickly, directed to our own countries, but yeah that’s a point.
Yeah it’s really comfortable in California. How long have you been in California?
14 years? Okay, twice as long as I have.
Well you know it will take me a lot longer to build my first company, but.
1:32 So you’re Italian, what brought you to the US?
Well, my friend who lives in Sunnyvale took me around Silicon Valley for the first time, showed me this is Apple, this is HP, and they told me this is like bringing a priest to the Vatican. This is the place to be. You know, Italy doesn’t really have a software industry. One of my professors actually is running the largest software company you’ve never heard of. It’s very local in character, and the opportunities are not that great. The country’s not been doing incredibly well economically, but at that time I didn’t have the ability to see the future. I just saw that there wasn’t a lot going on for people like me and that I should look outside.
2:19 Yeah, but the wine and the beer is much better, right, not in the U.S., wouldn’t you agree?
Wine? Yes. I come from a wine-making area; there are hills around my parents’ house, one hour west of Venice, near Verona. I found a local wine here and I was thinking about bringing that as a different brew, but again, driving with that would be hard.
2:51 We’ll do this next time. Let me set up the tea, and you can tell us a bit about your professional history and experience, and what you’re doing today.
It all started almost randomly. By an interesting turn of events I ended up working in bioinformatics. It was a little bit of a random event, but it was also quite promising at that time, and it’s also what brought me to the States, through different fellowships and whatnot related to that research. So I came on the academic path.
3:33 So our CTO has a bioinformatics background too, and I think the most ‘real’ data scientists actually have a bioinformatics background, because it’s very algorithm-heavy, right? The whole DNA sequencing is obviously a big data problem, really a calculable thing. What was your specialty in the whole bioinformatics thing?
3:58 I think I went for the… I made the choice to help the experimentalists. Initially I just studied a fundamental problem regarding proteins, because that looked good on my PhD dissertation, but then I went to mass spectrometry and then microarrays, which are two of the most powerful technologies in experimental microbiology. The idea was to help people store and process and internalize the massive amounts of data that came out of those two technologies. They’re kind of interesting and general purpose: you can use them to look at different aspects, a different class of molecules, different processes in the body and tissues. On the other hand, if young bioinformaticians are listening, there’s a word of caution, because this technology changed very quickly, so it’s very hard to base a career on a specific experimental technology, because they come and go. The big everlasting themes of bioinformatics research are related to biology, not to the machines that study biology. Evolution, the study of evolution, will be with us forever, I mean for decades at least. Those are the things you should go into. There’s a lot of competition too, those attract really the best and the brightest. But can you imagine, we’re going to have tens of thousands of genomes of different species in the not-so-distant future, and then we try to go back in evolution as far as possible. The more species, the more variety you have, the farther back you can reasonably go. It’s kind of an algorithmic time machine, it’s very fascinating, and so there’s a lot of competition and absolutely the best people are converging on those eternal problems.
6:07 So is that already what brought you to Hadoop and to R, or what are the steps that got you to where you are today?
Well, we had the need of Hadoop. I remember we had this 200-machine cluster already in 2004, which was big for biotech in those times, and what my group owned specifically was just outside the server room. The two things were separate: the software was separate, the hardware was separate, the room was different, and there was a cable in the middle, and the cable was the problem, not the solution. So we would start the computation and the server would die on us, almost deterministically, and we thought, why don’t we mix the two things? There were people alerting me that Hadoop was coming up, but it was a little bit too extreme for biotech to go with 0.16 or something at that time. And I kind of got tired of doing a bit of shoddy computing, because I was in biotech, and so biology was more important than… and a friend of mine from that company went to Quantcast. I have to tell you that…
7:34 Very, very early adopter of Hadoop, yeah?
Very early. So I’ll say the job search, trying to go from biotech to the web industry, is not the easiest step in your career, because they have an idea of what people are doing in biotech. But this friend of mine went to Quantcast, he made a splash there, it was really good, he’ll say we’ll get more of those, and so I joined Quantcast. Ron Bodkin interviewed me, among other people, and I remember they asked me, how much data do you see every year? Sorry, how much data do you see? And I said just 1… 1.5 terabytes of data every year. And they looked at me and they said, we have 4 terabytes of data every day. And I thought, that’s it, I’ll never get the job. But instead they let me in, and I started working on Hadoop, I think it was 0.16, and tried to cobble together some algorithms. We were doing web ratings; that was my project, trying to get better web ratings.
8:51 Nice. Yeah, Ron has his own company now called Think Big Analytics, I think, right? So definitely Quantcast was kind of a breeding area for big data experts, because they were very early adopters, first of Nutch, then Hadoop. I remember I heard this story that the Facebook team was invited to Quantcast, and they were kind of like, “oh, so Hadoop? We’ve never heard about this.” And then Hammerbacher and so on got a little tour from Ron and folks there, and that was kind of the initial step on how Facebook got hooked on the whole Hadoop thing and started Hive and things. Were you involved in that kind of story?
9:40 Not with Facebook. You know, there was a certain attitude at Quantcast that we’re smarter than everybody else. It’s negative to say that, I detect some negative aspects, but they didn’t have any problems changing the guts of Hadoop. And they let you, actually. I made some modifications, not deep, but I was trying to have genetics work better with that, and then I saw eventually somebody picked it up in the community and did it much better than I actually tried to do it. So yeah, it was very experimental, and we were replacing old components. I think that is one of the reasons it was a breeding ground: it wasn’t a black box, it was 0.16, so we could touch everything, so we could make it better even.
10:29 I think they even wrote their own file system, right, or heavily changed it? If I remember back then?
They hired a guy who had done the file system, the name of the file system…
And then you had guys writing KFS and HDFS in parallel for some time, I remember, right? But that must have been in what, 2005, 2006 maybe?
Yeah. In 2007 I left so.
Wow, time is running.
10:59 So, what are you doing today, and tell us a little bit more about your open source project, and your day-to-day involvement. But let’s make sure the brew is good. Actually I will take some soy cream as well. Are you vegan? Or into soy?
Ok. Let’s see. Cheers.
Thanks for coming.
Yeah, that’s a different brew than usual but it’s still good, it will definitely make me wake up a little bit.
Yeah don’t drink too fast.
Haha, yeah I will make too many jokes then, if I get hooked on the caffeine.
11:46 Yeah, so I have my own micro consultancy and I’ve been working a lot with Revolution Analytics, and what I like is that we have an open source project and that we’re combining R and Hadoop. I thought, once I cross the threshold of 40, I’d rather go to something interesting before I get age-discriminated by the Valley. So I put one leg on R and one leg on Hadoop, and if I can sell myself as an expert and have some recognition as an expert on the combination of the two, I can survive age discrimination until retirement ensues. And the other thing with open source: not only is it nice to return to the community, since you use so much open source it’s nice to contribute a little bit, but there’s really an important motive, which is that your code is out there. Next time you need to find a job you just say, go look at what I’ve done. We have real users, the code is in production already, used by other consultancies, in production in top companies, so it’s not a joke, it’s real code, you can see the style. Before, I remember, the code I could share in my portfolio was limited to like six pages because it was all proprietary, and that’s a problem, because what we do is code, and it has to be out there, there’s no two ways about it. You can go and do the interview lottery, but good luck; it’s never really worked for me. I have to tell you, I’ve done it from both sides, and I think it’s a very difficult, low-probability exercise.
13:40 We will do a short break, and we will come back with the next show.