Today we’re featuring Jack Norris, MapR’s Senior Vice President of Data and Applications. He’ll provide us with his perspective on the big data and Hadoop world.
This is part of our podcast series on big data thought leaders.
Transcript, lightly edited for clarity:
Andrew: You may hear about it being over-hyped, but big data is still big news. Today we’re talking with Jack Norris, MapR’s Senior Vice President of Data and Applications about what’s happening in the Spark versus Hadoop space, the future of big data and what’s happening with enterprise. I’m Andrew Brust with Datameer, and this is the Big Data Perspective.
Jack, first of all, thank you for joining us, it’s been a little while since we’ve talked.
Jack: Yeah, no, thank you. Really happy to be here.
Spark vs. Hadoop, or Both?
Andrew: We’ve known each other for a while and we’ve usually had the chance to talk about the industry in general in different capacities, so here’s one more capacity. As we’re looking ahead to the balance of 2017, we were just hoping to get your take on a few different things.
I’ll start with something that maybe a year ago would have been super controversial, but it’s still pretty important now even if it doesn’t have quite the shock value. IBM’s new Watson Data Platform is an interesting thing in the big data space because it’s not based on Hadoop; rather, it’s based on Spark. Knowing that MapR is kind of invested in both platforms, I’m wondering what you think about Spark versus Hadoop, and whether that is a real debate or a red herring. If it is a debate, which side’s winning?
Jack: Well, it’s a really interesting question, I’m glad we’re starting here. I think the market is starting to become much more intelligent about these big data technologies and approaching them. There’s a lot written about Spark versus Hadoop, and Spark provides a lot of ease of use for the developers and handles streaming analytics really well. Hadoop continues to be the preferred solution if you’ve got incredibly large workloads that go beyond the memory capacity of your cluster. There’s different perspectives. I think the biggest issue with this Hadoop versus Spark perspective however, is to look at big data in terms of two layers, in terms of the compute and in terms of the underlying data layer.
When you look at that perspective, it’s a really interesting comparison because you’ve got the Hadoop distributed file system on one side, which is a batch write-once data layer, and you’ve got Spark, which has no persistent data layer. The reason that we’re so interested in it from MapR is what we’re seeing is the need for a real scale out, real-time, enterprise-grade data layer that can take and handle multiple workloads, multiple machine learning, NoSQL, deep file analytic processes that are brought on top of it.
Not only to support a broad set of applications, but to do that in a way that you can do operations and analytics together because it’s ultimately, “How do you impact the business? How do you bring the intelligence and the insight of data to bear in these business functions?” I think whether it’s from a Spark perspective or a Hadoop perspective, that’s ultimately what organizations are looking to do.
The Future for Big Data From Different Sources
Andrew: Yeah, I like your take and I don’t know if you said it explicitly but it definitely seems like all these different open-source projects and some of the more commercial technologies that are layered on top of them tend to work together pretty cooperatively. As a part-time member of the press myself, I think I can stipulate that sometimes the press likes to make things into a competition although ultimately, it may end up being a teaming up. Maybe along those lines I could segue into the next question, not quoting a journalist but quoting a research firm. Constellation Research specifically is estimating that by 2020, which is not far away now, at least 60 percent of the data that organizations consider to be “mission critical” will live outside the four walls of the enterprise.
What do you think this means for big data in the enterprise? And since you just mentioned the notion of having analytics, workloads and operational ones side-by-side, if you think that adds a dimension to this answer, by all means feel free to include that as well.
Jack: Yeah. I think it’s spot on, I think 60 percent might even be low, right? If you look at increasingly the importance of IoT data, machine-generated information, and sensors, and social media and all the sort of data that goes into better understanding and impacting a customer experience, driving a more efficient product and service delivery, and better understanding and mitigating risk and security exposures, that requires data from a host of different sources and being able to harness those effectively. We tend to talk about big data and marvel at it after it’s been collected and look at the volume and the variety that’s there. The reality is, it’s created one event at a time. You have to have a capability that addresses the data at that moment of creation and collection, and handle analytics and the processing of the data across a distributed basis.
It’s not a question of whether it’s done on premise within your four walls, whether it’s done in the cloud, or whether it’s done at the device level. The answer is it needs to be done on a coordinated basis across all of those. When we talk about convergence, it’s with this eye towards this distributed capability, this inter-cloud processing that includes on premise. We’ve got customers today that are approaching that. I’ve also seen data that says that by 2020, 90 percent of the data will be held in next-gen applications. If you look at just the short four, almost three years to go in terms of 2020, organizations should be well on their way to handling this distributed processing, and to being able to take advantage of it with next-gen technologies.
It’s also about a collapsing. We tend to look at separate technologies and you need to bring all of that together so that there’s a blurring of the lines between a file operation and a database operation and a streaming operation, because when you’re looking at operating in that short window of time, it’s got to have the intelligence and the benefit of a broad range of processing as part of that, whether you’re trying to determine what product to recommend while a webpage is loading, or whether a transaction is fraudulent as the credit card’s being swiped.
How Will Big Data Affect the Job Landscape?
Andrew: Yeah. You’re doing what I thought you might do which is you’d have an answer to one question that would tempt me to drill down on just that one for half an hour. We’ll have to invite you back to do that but what I will take away from what you’re saying is that data is not this thing unto itself, it is a series of serialized recordings of business where it happens, of phenomena where they happen. It literally is everywhere unless you’ve got a very old-fashioned boring business where everything happens in a very very controlled environment and doesn’t sprawl at all. That kind of variety and that need, to use your word, converge, although I changed your noun into a verb, but the necessity to coalesce all of that seems to be maybe the most important part of it. It sounds like that’s what we’re getting at there.
That too I think is a good segue into a subsequent question, and this is about the pervasiveness of big data in a growing number of industries. Maybe even industries that we don’t think of first and foremost as technological, like agriculture or manufacturing; insurance probably we do, but I’ll bring that one up as well. As more industries are adopting big data and analytics, really, this is a little bit of an ambiguous question and I’ll admit that this is somewhat intentional. How do you think this might shift the job landscape? There’s a couple different ways you can interpret that and if you want some prescriptive advice I’ll help you there but I’d rather leave it open to your interpretation.
Jack: Well what this is getting to is that we are in the middle of I think one of the biggest shifts in the data center, a huge paradigm shift in terms of how we look at big data. I’m glad this whole industry is labeled big data and not a technology. I think early on, we tended to focus on a single technology as the proxy for big data but the reality is this is shifting how we look at data, how we treat data, even how we start with data. It’s not about the application dictating how the data should be stored and placed into a specific queue per schema. This is much broader. In terms of “How does this affect different industries?”, it’s game changing, it’s a huge, huge shift. I think for some it’s very daunting, right? As more and more intelligence gets built into the process, what is the role of the typical worker? What is the role of the middle manager?
I think with some there’s a lot of trepidation there. I think it’s similar to other major revolutions. If you look at the industrial revolution I think there was a lot of trepidation on the part of farm workers, “What does this mean?” Looking back it was a huge driver of wealth, of additional leisure time, just very transformational. I think this is similar. If you’re looking at existing jobs, you might say, “Wow, there’s a lot of trepidation there because it’s going to change everything.” Yes, but I think looking back, we’re going to say, “How did we do things without this leverage of data and this very fast, intelligent response that’s built-in in a myriad of ways?”
I think from a worker’s standpoint it’s going to change things. I think it requires a big shift in our focus on how we think about data, how we learn, even the education process. We’ve invested heavily in free, on-demand training to try to get people to understand this technology. I think it brings far-reaching changes. Look, in our own lifetime there’s no longer a big premium on retention and knowledge buildup.
There’s much more of a premium on, can you search effectively and can you pull data from multiple sources and can you determine what’s a good source of data, and what’s a bad source of data, or what’s real news and what’s fake news? Ultimately that’s a data problem too. This question alone is probably a topic we could go on for hours. I don’t think we should underestimate the impact of data.
Andrew: Yeah, no. I certainly agree. You mentioned retention and I wasn’t sure if you meant of personnel or of data, and now I’m thinking, “Well maybe that’s not even that important of a question because it really kind of applies perhaps to both.” I remember when I started my career, going back to the stone ages of the late 80s even. Back then we were talking about paperless offices and office automation and there was a huge amount of trepidation then that there would be scores of job losses. That didn’t happen. We ended up keeping everyone employed; we just had them doing different stuff and ultimately getting more done. I would definitely say there’s more pessimism this time about what technology may do to certain job holders. Maybe I’m naive, but I tend to think that the same trepidation and then silver lining, that same kind of equation will play out this time as well.
Jack: I will say, I think the people listening to this who are in our industry are in an excellent position, and probably in a better position than many others in the economy, because it will be a big shift.
Too Many New Technologies—What to Do?
Andrew: I think ultimately if we can factor stuff out of procedures and make them automated we will. But then that leads to more stuff that we could do that we couldn’t even approach before, and that will tend to require human action and human involvement. Again, maybe I’m naive but I tend to be optimistic that way.
All right, those were pretty grandiose questions. The next one honestly is a little bit smaller in scope but I think it’s important especially as we’re having this conversation with you, because obviously, MapR has been well-known for really fusing open-source technology and its own commercial innovations.
Clearly, you guys have strong respect for both sides and probably even more so for the way that greater power can be emancipated if you put them together. Sticking on the open-source side, is there an open-source technology or project that’s relatively recent that you’re especially excited about? Whether you guys are involved in it or not, and hopefully the answer’s yes, if so, what is it and what makes you excited?
Jack: I guess my first reaction to this question is that, I know you probably want me to name drop a technology that people might not be aware of that they need to look at. The reality is there are just all these open source projects that continue to arrive. Ted Dunning, who’s Chief Application Architect here at MapR and is VP of software incubator at Apache Software Foundation, is constantly talking about new technology. There are two new technologies that have arrived this week. It’s hard to keep track of everything. I’m going to pull back a little bit and I think that we need to move beyond the focus on, “What’s the next technology,” and the separate silos that exist in the organization today.
I think that more attention needs to be focused on ultimately what is the platform approach for an organization and how do you drive agility? Because some of the technologies do a tremendous amount but actually make it possible for companies to be less responsive and have to prep data in a certain way. It’s more about that platform and that data agility. I think some of the container technologies like Docker and Kubernetes are really promising, just all of the flexibility that that provides in terms of where you process data and how you take advantage.
I think moving well beyond that to support stateful applications that drive analytics as part of those containerized apps is going to be transformational. We’ve got customers that are there now, and we’ve got many other customers that are looking at what their path is. For 2017, I think that’s the exciting area that will continue to pay dividends because every new technology that arrives then can be leveraged in a very flexible and agile manner.
Andrew: If it’s okay, let me pursue that a little bit because you hit on something that’s a little bit of a pet peeve of mine to be honest. You didn’t call it out explicitly but I think you were still thinking about it. That is sort of the notion of fragmentation versus integration. You’re right, at the beginning we were talking about individual projects and products that we were in kind of a siloed reality to some extent. We still are but it’s moderating and it’s mitigating. What about the value of providing a platform, providing a suite that takes a lot of these atoms of innovation and puts them together in whole molecules that people can use? Is that a MapR credo to some extent? Is it a market need? Do you get cranky about it the way I do? That there’s not enough of it, what’s your experience in all of that?
Jack: Yeah, I think … Look, it’s about driving these innovations and increasingly it’s not just about convenience because you can collapse things together. It’s really about removing latencies and delays. To take analytics away from the back room, from the historical perspective in terms of what happened to the business, and integrating it so it’s actually impacting the business as it happens. Latency is the foe. That requires a very flexible, robust platform that can treat data from many different sources as first-class citizens and provide a lot of processing flexibility together on that platform.
Then it comes down to, how do you best bring the different tools and utilities and processing to it, and there I think it’s … what are the open APIs, how do you make sure that that platform supports this broad, open ecosystem? That’s where our approach has been just adamant to make sure that there are industry standards. If there’s an industry standard that’s not present and prevalent, how do we drive that? The Open JSON Application Interface, OJAI, is a great example of how an open API and industry standard helped drive the document database. I think it’s a little bit underappreciated, how important that is so you don’t have fixed data models that require a lot of delay and setup before you can take advantage of the data.
Again, a big group of topics that are bundled within that, but I think that starting with the data first and then bringing the processing afterwards, machine learning, whether it’s analysis or whether it’s legacy applications that exist today and making sure that they can run alongside the latest and greatest.
Andrew: I don’t know if this was intentional but you mentioned Ted Dunning and he’s involved now in this project called Arrow, which is all about making sure that there is a data-first approach, in this case, to representing columnar data in memory, instead of having 11 different open source projects that each create their own little standard for doing that. There’s cooperation and interaction there so that they’re working to the same standard, and it’s to get rid of precisely the latencies that I think you were talking about.
Jack: That’s a great example, yeah. That’s why I like doing these with you.
Jack Norris’s Big Data Predictions for 2017
Andrew: It’s easy when I get to learn from others. Well, so you’ve already spoken to this being a theme in 2017 and maybe we can point ourselves toward the finish line if you had ideas about other predictions for the short term, for this year, for 2017? Depending on what kind of swagger you feel like you’re working with today, whether you want to take a gander at what we might even see, let’s say, three, four, or five years from now?
Jack: We talked a little bit about data first, I think that’s really going to fuel a lot of the activities in 2017. I think it’s where it’s much more about the business value, it’s not, “Let’s collect it and throw it in a lake and then come analyze it later.” It’s more, “How are you actually driving business results with the data?” Much more of a focus on the flow of the data lake and the real-time arrival of that data. I think that’s a big area that involves streams, it involves more than machine learning, I think that’ll be a big part of 2017. We talked about Ted, Ted’s done a lot on machine learning and some of these straightforward algorithms coupled with large data sources can have some incredible commercial impacts and effects.
That’s a different trend than historically has been done in the AI, machine learning, data science world, where it’s who can have the most complex model and the ones that have the most academic cred, and here we’re talking about some of those that are more straightforward that are able to get incredible results on a repeatable, fast basis. I think we’ll start to hear more about that and those approaches with machine learning.
I think the turning point is going to be clear in 2017 for a lot of the historical approaches with respect to how we govern the data, how we looked at lineage, how we kept the metadata completely separate and curated. There was a lot of time between when data was collected and when it was ready for analysis. That’s where the pressure point’s going to be in 2017 with a new, almost ruthless focus on how do you eliminate those delays and how do you eliminate that prep work to drive these new business processes and these new integrated analytic and operational applications?
Andrew: Absolutely. There too I think you’ve alluded to something really important, which is that when a technology is in its infancy, we tend to impress each other by how hard it is and by how we wrestled it and honed it to produce a great result. Where it really starts getting a multiplier of power and impact is when we talk about how easy it is. Not trivial, not to the point where there’s no value in it obviously, but to the point where putting it to use is not the hard part. We can almost take that for granted and then build on top of that.
Jack: I think there’s a window opening up for those organizations that might not have started first but have an opening now to move in, take advantage of some of the latest innovations, take advantage of this convergence. Being able to leapfrog and then maintain a lead going forward. That window’s not going to be open for very long. The longer organizations delay, the more they’re going to be behind and trying to catch up and that’s the speed of this industry, the speed of this technology, catching up is hard to do.
Andrew: All right, so then 2017 is an opportunity in that respect and that’s probably a great place for us to conclude. Jack, thank you very much. I hope we can invite you back five more times to talk about each one of these in detail and isolation but since that probably won’t happen, I’ll thank you very much for covering such a broad scope of stuff in one conversation today. Hopefully we will have subsequent chances to talk more.
Jack: Oh, well thank you very much Andrew, it’s a real pleasure to be on the show.
Andrew: Thank you. Thank you all for listening.