Pretty much everyone has an opinion about Spark versus MapReduce right now. What’s really cool, though, is that Monte was around when MapReduce was first presented, and he has kept a close eye on Spark’s maturation over the last few years. In other words, he’s been close to both technologies for quite some time. Watch episode 3 to get his perspective.
Andrew: I want to go back to your discussion of the Hadoop stack.
Andrew: I don’t know, these days, the word Hadoop comes up, and almost as quickly the word ‘Spark’ comes up. Spark, of course, is another Apache open source project with big data applications. First question: Would you consider Spark part of the Hadoop stack? Or do you see it … Is this an either/or, or is this a both?
Monte: Yeah. To answer that, I’ve had the opportunity to watch Spark mature from the early days. In fact, Mike Franklin, who was the head of the AMPLab at Berkeley where Spark was created and is now the department chair, is on our board of advisors, as a matter of fact.
Andrew: Okay, okay.
Monte: Mike’s been a great counselor to the company, and I’ve watched Spark since the beginning. To be technically accurate, Spark exists independent of Hadoop. Having said that, though-
Andrew: Market realities, though.
Monte: Market realities and technical realities yield … It works best in concert with Hadoop. The way I like to think about it is that Hadoop really provided us a few fundamental things in its early days: an analogue of the Google File System, which is the Hadoop file system (HDFS); an analogue of Bigtable, which is HBase; and an analogue of MapReduce.
MapReduce was the first time, as a computer scientist, for me … I remember when I was first exposed to MapReduce. It was at an advisory board meeting for the dean of computer science at Carnegie Mellon University; I sit on his advisory board. He presented the Google paper on MapReduce and said, “This is going to change everything, and all of our research is going to revolve around this big data architecture.” The beauty of Hadoop at that time was that it made distributed computing accessible to the average computer scientist.
Before that, you had to have some specialty, a Master’s or a PhD in distributed systems, to do anything with tens, hundreds, or thousands of computers simultaneously. There are so many issues to deal with: locking, system synchronization, clock synchronization, deadlock, failover, and all the rest. MapReduce was an abstraction that let programmers get at that.
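The map-and-reduce decomposition Monte is describing can be sketched in plain Python. This is a toy word count, not Hadoop’s actual API; the function names and the tiny input are invented for illustration, and everything runs in one process just to show the shape of the abstraction:

```python
from collections import defaultdict

# Map phase: each input record (a line of text) emits (word, 1) pairs.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# Shuffle: group the intermediate pairs by key. In real Hadoop the
# framework does this for you, across machines.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine all the values seen for each key.
def reduce_fn(key, values):
    return (key, sum(values))

lines = ["spark and hadoop", "hadoop and mapreduce"]
intermediate = [pair for line in lines for pair in map_fn(line)]
result = dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())
# result == {"spark": 1, "and": 2, "hadoop": 2, "mapreduce": 1}
```

The programmer writes only `map_fn` and `reduce_fn`; the framework handles distribution, shuffling, and failure recovery — which is exactly why it lowered the barrier, and also why forcing every algorithm into this shape feels abstract.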
Andrew: The stuff that we were calling grid computing, or high performance computing-
Monte: Back then.
Andrew: Or cluster computing-
Monte: Got to be an expert.
Andrew: Relatively high barrier to entry, kind of a priesthood.
Monte: Relatively high barrier, right.
Andrew: Hadoop lowered that barrier.
Monte: It democratized distributed computing, right? All of a sudden, everybody could run on thousands of machines. And not only that, with the simultaneous emergence of the cloud with Amazon, you didn’t have to go procure or provision those machines, right? You could go light up a job, run it, and then shut it down. It was a fantastic convergence, and this is great, but-
Andrew: Timing was good.
Monte: Timing was great, but MapReduce is a fairly abstract concept. Taking any complex algorithm or task and breaking it into a mapping function and a reduction function is really pretty abstract; you have to be pretty mathematical to think about it. So there were limits on the skill sets of the people who could use Hadoop.
Andrew: Yeah. Ostensibly, any Java programmer could approach this thing, but in reality, it’s not so much about the programming language as the algorithmic-
Andrew: Yeah, mindset.
Monte: The mindset. Now Spark comes along with a different perspective. It was invented in the context of distributed computing, but it takes a set-theoretic view of manipulating data. What do I mean by that? I mean that it thinks in terms of sets of data, and applying functions, transformations, and operators to that data to produce other sets, and then other sets. That’s a very natural way for programmers to think, and especially database people.
Andrew: I was going to say, SQL, result sets, and sets sound like they have some affinity.
Monte: There’s a great deal of affinity here. Now, Spark started before its SQL layer; at first it was still a programmatic API.
Monte: But still, that programmatic API was far less esoteric than map and reduce functions. It was, “Okay, let’s apply a projection, and let’s filter that according to some predicate so I have this smaller set, and maybe I want to join two sets together.” It was very intuitive, right? It took off, and not only that, it was highly performant, for a variety of reasons. Spark is an in-memory architecture; it doesn’t require you to write interim results to disk, whereas MapReduce did, and it also-
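The projection-filter-join style Monte lists can be sketched without Spark at all, using plain Python over small in-memory sets. The `users` and `orders` data here is invented for illustration; Spark’s real API expresses the same shapes as transformations over distributed datasets rather than local lists:

```python
# Toy, invented data: (user_id, name) and (user_id, amount) tuples.
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, 250), (2, 75), (3, 30)]

# Filter by a predicate: keep only orders over 50.
big_orders = [(uid, amt) for uid, amt in orders if amt > 50]

# Join the two sets on user id.
joined = [(name, amt)
          for uid, name in users
          for order_uid, amt in big_orders
          if uid == order_uid]

# Projection: keep only the names.
names = sorted({name for name, _ in joined})
# names == ["ann", "bob"]
```

Each step takes a set and yields a smaller or reshaped set — the “sets of data flowing into other sets” view, which maps naturally onto SQL thinking.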
Andrew: That said, it can spill to disk if it needs to.
Andrew: Which gives it a leg up on-
Andrew: A lot of pure in-memory systems, which-
Monte: That’s exactly right.
Andrew: Which break once you hit the memory limit.
Monte: So you don’t need to be all in-memory, but it can take advantage of the memory. And it has a nice pipelining architecture, too, meaning that as you go through a sequence of transformations, the second and third and fourth transformations don’t have to wait for the first one to finish, because it pipelines its operations as results start flowing through. Whereas with MapReduce, you do the mapping, you wait for that to be done, all of the results come over to the reducers, and then the reducers run. So fundamentally-
Andrew: You can daisy chain those things together, but they’re synchronized, one has to finish before the next one kicks off.
Andrew: Whereas, the way you’re describing Spark, it sounds more iterative.
Monte: I guess what it is, it sort of flows interim results down the chain of operations as they become available.
Monte: That’s called pipelining. It’s very effective, and I would say a more performant way to operate. That’s why you hear about Spark being 100 times more performant than Hive, Hive being the SQL layer on top of the traditional MapReduce engine.
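The pipelining Monte contrasts with MapReduce’s stage-at-a-time behavior can be illustrated with Python generators: each stage pulls records from the previous one as they become available, rather than waiting for the whole upstream stage to finish. This is only an analogy for pipelined execution, not Spark’s actual internals; the stage names are made up:

```python
# Toy pipeline of three stages chained as generators. Records flow
# through one at a time; no stage materializes its full output first.
def read(records):
    for r in records:
        yield r

def double(stream):          # second stage starts as soon as the
    for r in stream:         # first record arrives
        yield r * 2

def keep_large(stream):      # third stage filters on the fly
    for r in stream:
        if r > 4:
            yield r

pipeline = keep_large(double(read([1, 2, 3, 4])))
results = list(pipeline)
# results == [6, 8]
```

In the MapReduce model, by contrast, `double` would have to finish over the entire input and write its output before `keep_large` could begin — which is where much of the performance gap comes from.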
Monte: Spark is fantastic. We’ve been watching it for a long time.