Stefan's Blog

Big Data Musings From Datameer's CEO

Big Data & Brews: Cloudera on the Secret to Success & Scaling Hadoop

By Stefan on July 22, 2014


Over Sierra Nevada Pale Ales, Mike Olson and I dove into what he saw as the secret to success: a combination of the quality of the people you work with and the luck of standing in the middle of a market that is exploding around you.

Mike also diagrams the initial expectations for Hadoop and how Jeff Hammerbacher and Amr Awadallah successfully argued for the multi-engine architecture that let it scale.

Watch the full video below:

TRANSCRIPT:

Stefan:           But you did a fantastic job, I mean Cloudera has such a big market. What’s the secret to success? If you maybe look in the rearview mirror at night: you know, this is where we really hit the nail on the head.

Mike:              I’ll say a couple things. First of all, yeah, we’ll take credit for doing a bunch of stuff right. And I’ll even talk about what some of those things are. I will say that, and this has been true for all of my career, two things factor hugely.

One is just the quality of people you’re going to attract. How well can you hire, how talented are the people you are able to bring into the organization? If you concentrate really hard on that, that really helps, because then you are flexible and smart as a company in ways that you couldn’t otherwise be. Look, not for nothing, it absolutely helps to be standing in the middle of a market that’s exploding around you.

Stefan:           Sure.

Mike:              You know, we were, I think, smart, but also pretty lucky, that big data broke for us, broke for you, in the way that it did, and at the time that it did.

I think the key insight we had in 2008 and into 2009 was, first of all, that traditional enterprises were going to have big data problems and that Hadoop was going to be the right platform for them. You’ll remember that back then, in the consumer internet, everyone knew this technology. But if you went to a bank or a hospital or an insurance company, no one had even heard of big data. That name hadn’t happened yet. Recognizing that was important.

The really key insight, and I actually credit this jointly to Amr and to Jeff Hammerbacher, who had run these systems at scale at Yahoo and at Facebook, was this: Hadoop is a big, accommodating storage system, and it’s got this processing engine called MapReduce. That was what Google invented: scale-out processing that you could send to the data, spread across lots and lots of servers.
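A quick aside for readers who have never written one: a MapReduce job is just a pair of small functions that Hadoop ships out to the servers holding the data blocks, instead of dragging the data to the code. Here is a minimal sketch in Python, in the Hadoop Streaming style; the word-count job and file names are illustrative assumptions, not anything from the interview.

```python
#!/usr/bin/env python
# Minimal word count in the Hadoop Streaming style. Hadoop runs the
# "map" phase on the nodes that store the input blocks, sorts the
# emitted key/value pairs, and feeds them grouped-by-key to "reduce".
# Test locally with:
#   cat input.txt | python wc.py map | sort | python wc.py reduce
import sys
from itertools import groupby

def mapper():
    # Emit a (word, 1) pair for every word that arrives on stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # After the sort, all counts for a given word arrive together.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(n) for _, n in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```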

Everybody in the database world looked at MapReduce, and I did too when I was there, and thought: no queries, no transactions. There’s a lot of stuff that’s just wrong about it. Very powerful, transformative, but it doesn’t look like a database. The key failure, I think, of those of us in the relational industry, and Google published this paper back in 2004, was assuming that there was somehow a law of physics that said that was all you could do.

It turns out you can build other engines that run on all that hardware in a scale-out way. And so we’ve developed and released something called Impala, and early on we embraced the HBase NoSQL database as yet another one of the boxes that sits on top of all that data. These all share those servers, get at the data, and give users multiple ways to attack this huge amount of data. If the data is a petabyte, you’re not going to move it anywhere. The only way you can process it is if you’ve got ways to send the right kind of processing to that data, in different engines. That’s the way.

So very early on, Jeff and Amr were already arguing for this architecture. I think we’ve done a good job of inventing that SQL technology and then driving it out into open source, and of taking advantage of the work the broader community does in inventing new engines of that sort as well.

Stefan:           You hit on a really interesting one here, close to my heart. Let’s double-click on that technology. Wasn’t the promise, and also a big message of Cloudera very early on, the whole idea of late-binding schema? Some people call it Hadumping: bring the data in first, and then later decide what kind of analytics you want to do. But if you think of Hive or Impala or HAWQ, Pig, SQL, isn’t that kind of going in a different direction than schema on read? Because you have to build your schema, you have to predefine the kind of questions you want to eventually answer.

Mike:              Right.

Stefan:           What’s your take on that?

Mike:              You’re exactly right. And I think it’s an excellent point. The deal with HDFS, and some would say HBase, but basically with this scaled-out storage architecture, is that you can land anything at all in there. You don’t need to know today what kind of sensors you’re going to have in your building ten years from now. So you don’t have to design a system that knows about those datatypes in advance. You just land bits in this thing and you interpret them later, when you want to work with them. At write time you don’t need to declare your schema. At analysis time you can impose it, because you know what kind of data you are working with.
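Here is the write-time/read-time split Mike describes, shrunk to a toy Python example; the file name and sensor fields are made up for illustration. Nothing is declared when the bytes land; a schema only appears when somebody reads them.

```python
# Schema-on-read in miniature: land raw bytes now, interpret them later.
# The file name and sensor fields are illustrative assumptions.
import csv

raw = b"2014-07-22,lobby,21.5\n2014-07-22,roof,30.1\n"

# Write time: no schema declared -- just bytes in a file (think HDFS).
with open("sensor.dat", "wb") as f:
    f.write(raw)

# Read time, possibly years later: impose whatever schema the analysis needs.
schema = ("reading_day", "location", "temp_c")
with open("sensor.dat", newline="") as f:
    for row in csv.reader(f):
        record = dict(zip(schema, row))
        record["temp_c"] = float(record["temp_c"])  # the bits get meaning here
        print(record)
```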

Stefan:           But I would argue that if you load your raw data in and then you want to use Impala or Hive, you have to move the data into the schema of Hive or Impala. Right? No? Did I miss something here?

Mike:              This engine absolutely needs a schema down here. Remember I said you can store anything you want. It doesn’t have to be schema-free and unstructured. This storage layer is perfectly happy to store all of the integers and all of the salaries and all of the birthdays that you want it to, right? And it turns out there are a huge number of workloads where the data is really scaled out, where the schema is known in advance, or you can evolve it over time, and where SQL is the way you want to get at it. Not every workload, right?
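In practice, this is what answers Stefan’s question about moving data: with Hive’s external tables, the data never moves. The table definition is just metadata laid over files already sitting in HDFS. A hedged sketch using the PyHive client; the host, path, and columns are assumptions for illustration.

```python
# Laying a schema over files already in HDFS -- the bytes never move.
# Hedged sketch: host, path, and columns are illustrative assumptions.
from pyhive import hive

conn = hive.connect(host="hiveserver.example.com", port=10000)
cur = conn.cursor()

# EXTERNAL means Hive records only metadata; the raw files stay where
# they were landed, and dropping the table would not delete them.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sensor_readings (
        reading_day STRING,
        location STRING,
        temp_c DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/raw/sensor/'
""")

# The same files are now queryable with plain SQL, in place.
cur.execute("SELECT location, AVG(temp_c) FROM sensor_readings GROUP BY location")
print(cur.fetchall())
```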

Not every workload, and that’s why MapReduce, for example, is still such a powerful piece of this system. In fact, MapReduce is a good tool for taking complex, unstructured datatypes and turning them into the tabular types that a SQL engine expects. But it’s absolutely the case that this forces you to use a schema, and it also then allows you to use traditional tools, and the skills already in the organization, to work with that data. Not that this is the right way to get at big data, but it’s critical that it be one of the ways to get at big data.
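A concrete version of that ETL role, in the same Hadoop Streaming style as the earlier sketch: a map-only job that turns messy log lines into the tab-separated rows a SQL engine such as Hive or Impala expects. The log format and fields are made up for illustration.

```python
#!/usr/bin/env python
# Map-only job in the Hadoop Streaming style: parse raw web-server log
# lines into clean tab-separated columns for a SQL engine downstream.
# Matches made-up lines like:
#   1.2.3.4 - [22/Jul/2014:10:00:00] "GET /index.html" 200
import re
import sys

LINE = re.compile(r'(\S+) - \[([^\]]+)\] "GET (\S+)" (\d{3})')

for raw in sys.stdin:
    m = LINE.match(raw)
    if not m:
        continue  # skip malformed lines instead of failing the job
    ip, timestamp, path, status = m.groups()
    # One row per line: ip, time, path, status -- ready for a table.
    print("\t".join((ip, timestamp, path, status)))
```

An external table like the one sketched above could then be pointed at the job’s output directory, which is exactly the division of labor Mike describes.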
