This week, Ron Bodkin and I talked about his time at Quantcast, one of the very first companies to work with Hadoop. We talked a bit about the main challenges he faced while he was there, before Ron explained what he’s up to at Think Big Analytics.
Enjoy the full episode!
Stefan: Welcome to Big Data & Brews, today with Ron Bodkin.
Ron: Nice to be here.
Stefan: From Think Big Analytics. Can you introduce yourself and the lovely brew you’ve brought with you?
Ron: Certainly, Stefan. We started Think Big just about four years ago, always with an exclusive focus on helping enterprise customers take advantage of big data. I came to that because previously I had been VP of Engineering at a pioneer in this space called Quantcast.
At Quantcast, the initial mission was to measure the Internet. Nielsen and comScore were saying the only way you could measure Internet properties and advertising campaigns was with small samples. We proved you could directly measure the Internet. That quickly led us toward using Hadoop; we put it in production in 2006. We went on to build a fast-growing lookalike business, using Hadoop to find millions of people like those who buy goods and services online.
We built our own NoSQL technology, because there was nothing commercially viable. By 2007 we were starting to look at distributed data science. Out of the chance I had to lead the teams that did that, I realized it was a great opportunity to bring the band back together and pull together some of the veterans from a company I was the founding CTO of in the 90s called C-bridge, which did Internet services, helping the enterprise take advantage of that technology trend.
The other founders of Think Big, Rick and Katie, were senior people. Rick was the first senior hire at C-bridge, and we recruited other veterans along the way. We started working with a couple of companies. One was PayPal, which at the time was actually Where, a hyperlocal advertising and place recommendations company that got acquired into PayPal; we helped them build their data platform and data science. Then there was a shared customer, Charles Schwab, where we helped with CRM, using Hadoop to analyze their customer log data.
We were off and running back in 2010, and here we are four years later, over 70 people, having worked with a number of fantastic enterprises. We have really three main focus areas: customer analytics, using clickstream and other datasets to drive effective understanding of the customer and go beyond siloed applications; financial fraud analytics, helping understand fraud and risk in financial scenarios, especially consumer facing; and then high-tech manufacturing, device data, and manufacturing datasets. We bring deep expertise in the engineering and data science required to deliver great outcomes with big data. [2:32]
Stefan: Let’s talk about the brew before we go into the details.
Ron: I’m a big fan of hoppy beers. I love IPAs. I couldn’t resist. There are a few really fun beers. I thought we might get in trouble if I brought you a bottle of Hoppy Endings, so you’ll have to look that one up on the Internet. Hop Stoopid is another great choice that also comes in a convenient shareable bottle. It’s from a local brewery, Lagunitas. We were talking beforehand about how it’s served in the German style. I picked it up this morning and drove around to meetings, so it’s now served at what I would call cowboy cold. You might call it German. The proper German temperature. [3:13]
Stefan: Don’t cool beer too much, because the flavor goes away.
Ron: It’s a little bit like white wine. We were tasting white wines and were struck by how serving them at a warmer temperature gives a much nicer taste. You really can actually taste it, instead of the cold suppressing the flavor.
Stefan: I’m usually not a big fan of hoppy beers, but at least you can see through this one. That’s good.
Ron: What do you think?
Stefan: It’s good. It’s a smooth hoppiness. It’s not the kind that, like, hits you in the face. You know, I-
Ron: It’s not like Torpedo IPA, a slap in the face.
Stefan: And the aftertaste… It’s good. I like it.
Ron: Glad you like it.
Ron: Cheers. As we were talking about the history of Think Big: very early on, you and I met each other, I think at a Hadoop meetup back in 2010, when Datameer was inside a shoebox and Think Big didn’t yet have an office. I remember us getting together and starting to talk about the opportunities around big data and what customers needed with Hadoop.
Stefan: Yeah. You talked a little bit about Quantcast. I really want to dive into that. If I remember correctly, the first cluster you guys deployed was actually still Nutch, and then you extracted MapReduce out of there. It was before it was even called Hadoop. [4:45]
Ron: Right. MapReduce and HDFS were definitely still part of the Nutch project when we deployed it in production in 2006. The seminal moment came through Paul Sutter, who was the co-founder and president. He had worked at AltaVista, so he understood how you built a large-scale custom application to deal with data at this scale. It had been recommended to him, by Michael Ovsiannikov, who’d worked with him before, that he read the MapReduce paper from Google.
The thing that was really compelling about that paper was that the number of programs people had checked into source control at Google using MapReduce just grew exponentially; it totally worked as a general-purpose programming paradigm that solved a lot of interesting problems. And even though it was a simpler programming model, that meant you could actually build a common system to support it. That convinced us it was worth going with MapReduce instead of a custom system to do distributed sort merge for audience measurement.
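The canonical illustration of that paradigm is word count, which also comes up later in the conversation. A minimal sketch of the map, shuffle, and reduce phases in plain Python (purely illustrative, not Quantcast’s Java implementation, and with the framework’s distributed machinery collapsed into three functions) looks like this:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data and brews", "big data at scale"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["big"] == 2, counts["data"] == 2
```

The simplicity is the point Ron is making: because every job fits this fixed map/shuffle/reduce shape, one common system can schedule, distribute, and recover all of them.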
The result was, we discovered that the Nutch project had what became Hadoop inside of it, and we ran with it. As part of that, we had to do a lot of our own engineering work around Hadoop. Unlike Yahoo, which was another early adopter and could afford thousands of nodes and cared mostly about reliability, we had to make it work on a startup’s budget, so we had to tune the efficiency and tune the sorting algorithm. We did a lot to evolve it, made our own modifications to make it work well for our needs, and added our own programming layer on top.
Stefan: Didn’t you guys, at some point, even run a second file system in parallel? [6:28]
Ron: That’s right.
Stefan: K-something, what was it? I forgot.
Ron: It was KFS, which evolved into what’s now QFS, the Quantcast File System. The big reason for running two file systems was concern about the possibility of data being lost through bugs. By having a second file system in the cluster that was write once, then read only, we could back up critical datasets in a way that they weren’t likely to be damaged.
If there ever were a bug in the file system code that corrupted data, it ensured we’d have a copy on another file system and not lose it permanently. Obviously HDFS has matured a lot since then, so the concern about a bug in HDFS eating data is much smaller than it would have been in 2006 or 2007. Things have evolved.
Stefan: My favorite was always when you saw those postings on the mailing list: “Is there an undelete for HDFS?” Because someone ran a remove on / in the HDFS client: “Oh, I put a space after the slash.” [7:37]
Ron: Oh yeah. That’s the big reason why we had a write-once file system. The number one cause of data deletion in the history of Quantcast was user error, people inadvertently, recursively deleting things, right? As you say, having a backup is important, but if your backup is a mirror, where you just mirror the delete command into the backup, it doesn’t help you, right?
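The distinction Ron draws, that a mirrored backup faithfully replays the fatal delete while a write-once store refuses it, can be sketched in a few lines of Python. The classes are hypothetical and purely illustrative; they stand in for the two backup file systems Quantcast ran:

```python
class MirroredBackup:
    """Naive mirror: replays every command, including deletes."""
    def __init__(self):
        self.files = {}
    def write(self, path, data):
        self.files[path] = data
    def delete(self, path):
        self.files.pop(path, None)  # the user's mistake is faithfully mirrored

class WriteOnceBackup:
    """Write-once store: a file can be written once, never changed or deleted."""
    def __init__(self):
        self.files = {}
    def write(self, path, data):
        if path in self.files:
            raise PermissionError(f"{path} is write-once")
        self.files[path] = data
    def delete(self, path):
        raise PermissionError("deletes are not permitted on this file system")

mirror, archive = MirroredBackup(), WriteOnceBackup()
for backup in (mirror, archive):
    backup.write("/data/critical.log", b"audience measurements")

# A user accidentally issues a recursive delete: the mirror loses the data,
# but the write-once archive rejects the operation and keeps its copy.
mirror.delete("/data/critical.log")
try:
    archive.delete("/data/critical.log")
except PermissionError:
    pass
```

The design choice is that safety comes from the backup’s semantics, not from operator discipline: no command path exists that can destroy already-written data.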
Stefan: No. That’s really early. How much data, how big was your infrastructure at that point? What you guys had back then, I remember, was just mind-boggling, big, ginormous. Today we’re all like, “Yeah, you know,” it’s not that big anymore, but back then it was quite impressive. [8:25]
Ron: Quantcast went through a series of clusters. In the early years we upgraded to a new physical cluster probably every 15 months. The clusters grew not only in the number of machines but in the networking and scale of the equipment. The biggest cluster we actually kept for a longer period of time and kept extending; it started out with several petabytes of capacity.
Stefan: And that’s 2007.
Ron: We put that one in place, it went live, if I remember right, in 2008. Earlier than that we’d had a smaller cluster. But yeah, overall we grew it. We thought about it in terms of the number of cores, and we actually did our own scheduling, where we had lanes, collections of separate sets of nodes that would be scheduled independently. We found that trying to have jobs coexist on the same machines didn’t work very well for us, so we split it out into a set of virtual clusters on the same hardware.
There we had a few thousand machines, each with reasonably dense storage. Back then, two-terabyte drives weren’t even common; there were just terabyte drives. So we ended up with several petabytes of storage and pretty significant capacity. We had the ability to routinely sort 100 terabytes of data.
Stefan: What was the biggest challenge, especially in the early days? I think at some point you guys drifted quite far off the Nutch trunk, as we called it in SVN. What were the main things you fixed in your version? [10:20]
Ron: Early on, we did a number of releases where we resynced with what became Hadoop. We’d forklift our changes in. We put in hooks to give us better visibility into what was going on in the jobs, and we built a higher-level programming API to make it easier to write MapReduce jobs. Still in Java, but more of a data-binding approach. We invested in a second file system and its integration, which we talked about.
The other piece we put a lot of effort into was tuning efficiency, especially sort efficiency. We had big jobs that routinely sorted large amounts of data. Even today, the sorter in Hadoop is not optimized the way it could be. I don’t think I’m allowed to talk about exactly what we did, but we made a number of modifications and improvements to drive better sort performance.
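Sorting data far larger than memory, the problem Ron describes, is classically done as an external sort merge: sort chunks that fit in memory, spill them as sorted runs, then stream-merge the runs. This is a generic textbook sketch of that idea in Python, not Quantcast’s proprietary modifications (which Ron says he can’t disclose); in-memory lists stand in for on-disk spill files:

```python
import heapq

def external_sort(records, chunk_size):
    """Sort a stream too large for memory: sort fixed-size chunks
    ("spill files"), then lazily merge the sorted runs."""
    runs = []
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            runs.append(sorted(chunk))  # spill a sorted run
            chunk = []
    if chunk:
        runs.append(sorted(chunk))
    # heapq.merge streams the runs, holding only one record per run in memory
    return list(heapq.merge(*runs))

data = [5, 1, 9, 3, 7, 2, 8, 6, 4]
assert external_sort(data, chunk_size=3) == sorted(data)
```

Hadoop’s map output follows the same pattern at scale, which is why sort and spill tuning had such a direct impact on Quantcast’s job efficiency.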
Stefan: There’s this fun little story, where this really small startup came to you guys. I think it was called Facebook back then?
Ron: That’s right.
Stefan: I think Peter Thiel introduced the Facebook data team, Hammerbacher and co., to you. They came by, they did a visit, and it was like, “What is this thing here? This is incredibly slow, it’s in Java, we’re all in C++.” What happened there? [11:45]
Ron: They had tried out Hadoop and found that, running it out of the box, they were getting worse performance on a cluster of machines than they could get doing the same thing, word count and so forth, with a single program. We showed them what we had done to make Hadoop work well and get meaningful scalability and great results. It was pretty important in their deciding to embrace Hadoop. They were considering an MPP database alternative at the time, and we convinced them that embracing Hadoop and contributing to it would be a better choice.
Stefan: I think now, they’re by far the biggest Hadoop user out there? [12:23]
Ron: Yeah, they have the biggest Hadoop cluster. We’ve been doing some work with them on that cluster.
Stefan: So it came all the way back to you?
Ron: That’s right. We’ve been able to help them out.
Stefan: You showed them Hadoop and you still have to help them with Hadoop. [12:37]
Ron: That’s right. The thing that’s so exciting is that as organizations take advantage of Hadoop and big data, there’s so much value to be created. There’s immense opportunity, companies see so many ways they can use analytics and drive more results. Even a company like Facebook, that has a very deep set of capabilities, still looks for outside help because they have so much they want to do. There’s such a need for expert help like we can provide.
Stefan: That leads us to your customers. What is it that you really do with a customer? What I heard you saying is that you build big data applications. Where does it start, where does it end, and what are the challenges as you go on that journey with customers? [13:22]
Ron: We want to help customers be successful through the life cycle of big data adoption. That often starts with a strategy and road map, helping bring the technology and business teams together. What are the low-hanging-fruit first opportunities for starting to get value out of this technology? What’s the sequence for building out technological capabilities to get to a goal architecture? How do you develop the organization and create the right center of excellence and competency?
We help customers plan for success, then execute in an implementation program where we integrate best-in-class platforms and tools. We bring some of our own packaged components that make it faster and easier to get results, and really assemble solutions that can start to give real business insight.
We see customers needing help in a range of areas. Typically we see a lot of organizations that have tried things out, done a lot of proofs of concept with various technologies, but have not yet… There are so many options out there, and sometimes vendors position technologies that aren’t best for everything as fully general solutions. We help customers understand what the real best uses are, how you integrate these things, and how you take an agile approach of deploying capability in a test-and-learn fashion.
Stefan: Is the profile of the companies and organizations you work with more fast-moving startups, or big companies trying to get their feet wet with big data? [15:00]
Ron: We work with a range. I’d say that we do a lot of work with larger established organizations that have enterprise assets. They often have a lot more to think about in adopting big data. It’s not tabula rasa, they have to think about governance, regulatory constraints, and how to integrate with existing data processing analytic systems. Yet, at the same time, they recognize their industries are being impacted by the deluge of data, the ability to be smarter.
Startup competitors are using big data to change the conversation, to build engagement with customers, to build new offerings. Established players in industries are needing to respond by investing in big data solutions of their own.
Stefan: In those larger organizations, is it more of a technology challenge, especially maybe around integration, from what I just heard? Or is it frequently more of a social challenge? You talked about the center of excellence. Where do you really see some of the roadblocks? [16:04]
Ron: I think there are challenges in both. Often, organizations make the mistake of assuming this technology is a small, incremental change to what they’re currently doing. The statement that something is a paradigm shift, some technology that’s going to change everything, is so overplayed.
Not surprisingly, most organizations, most IT departments, are skeptical of that and assume, “That’s fine, we’ll send somebody to training. A couple weeks later, they’ll all be experts and we’ll be able to build our own big data solutions.”
Stefan: Then they fail.
Ron: And then they fail. Or maybe they rely on a legacy services firm that abandoned the ability to do software engineering a long time ago and now configures packages, but tells them it knows how to use big data. They put together a whiteboard and a PowerPoint, but when it comes to actually implementing the system, they don’t know the patterns; they don’t know how to make it successful.
Stefan: Hadoop, as an open-source platform, seems to be very complicated. The technology idea was very simple, but then there’s all the configuration, all the knobs you can turn. Is that a challenge a lot of people are running into? [17:19]
Ron: Yeah. I think there’s a range of complexity: everything from how you work with data in the environment, how you build applications that are robust, and what the patterns for data modeling are, to how you administer, operate, and manage a production application, and how you integrate with existing systems. All of these are new patterns, and they’re complicated. It’s fair to say that enterprise data infrastructure has always been fairly complicated.
If you think about a really well-established database like Oracle, you still expect that any organization running Oracle has a whole cadre of experts, database administrators, dedicated to the care and feeding of that system, keeping it up and reliable. There’s a lot of complexity in managing data systems, and that’s a technology that’s had an unbelievable amount of investment, both in terms of human hours from its consumers and customers, and in investment by one of the largest software companies in the world.
Stefan: Where do you guys start, and where do the companies selling Hadoop distributions end? [18:32]
Ron: I think there’s a very different emphasis in our business models. We emphasize developing strategic applications for customers that create unique differentiation for them. How do you blend together datasets? Break down silos? Start to drive business outcomes? The Hadoop distribution vendors, in contrast, are focused on driving a platform business and subscription revenue around it. As one of our customers said, we like that there’s a bit of overlap in services between our distribution vendor and Think Big, because if there weren’t an overlap there’d be a gap.
We want to make sure we’ve got things covered, but we like to use our distribution vendor for days of expert services, to certify things, for basic cluster configuration. When it comes to actually doing the heavy lifting of integration and application architecture, we want a vendor-neutral expert who knows how to assemble the right pieces and deliver outcomes. [19:30]
Stefan: Great. Thank you very much.
Ron: Thank you.
Stefan: See you in the next episode, and let’s have a little bit more-
Ron: Hop Stoopid.
Stefan: Hop Stoopid, jeez, that’s good.