Today we’re featuring Scott Gnau, Hortonworks’ Chief Technology Officer. He discusses new applications of big data, how we should preserve data, cloud investment and much more. Start listening to start learning, or read the transcript below.
Transcript, lightly edited for clarity:
Andrew: What’s current in big data? Today we’re delving into the buzz with Scott Gnau, CTO at Hortonworks, about everything from data as cultural heritage to big data in the cloud. I’m Andrew Brust of Datameer, and this is the Big Data Perspective. Scott, welcome, thank you for joining us.
Scott: Hey, thanks for having me.
Andrew: Thanks again for being here. To get started, I thought I’d ask you, people tend to get very starry-eyed about what big data technology can do — anything from curing cancer or otherwise saving lives. Then on the other end of it, people tend to get very pragmatic and talk about how big data can improve concepts or increase efficiency. Two ends of the spectrum, from the very aspirational to the very buttoned-down pragmatic and mundane side of things. Maybe in the middle, you have a favorite application of big data that you’d be comfortable telling us about.
Scott: Big data, the data itself, the technology stack, the new way of thinking, the new way of collaborating around data and analytics, is really enabling some very interesting use cases that frankly weren’t possible six months ago, a year ago, two years ago. It’s really this confluence that we’re finding a lot of really interesting things.
Frankly in many cases, we weren’t able to capture and store the data before and so we’re just now, because of some of the new technologies and platforms, and the continuation of Moore’s Law, and compute and data storage capacities increasing, we’re actually now able to store stuff that used to get thrown away.
We can store it in its native format, and then we can apply tools and analytic technologies to that data to go find things that we didn’t know about before. That’s what’s creating this buzz, anywhere from saving lives, and yes, curing cancer, which is ultimately something that we all look forward to having happen. When you just think about how technology is enabling a much more refined look at a much larger set of data points, and much more accurate cancer treatments, to take just one example, it’s really exciting.
It goes across any industry and the thing that excites me the most is the stuff that we’re going to find next, and how we’re going to build on that by taking advantage of this new technology stack. It’s being able to combine lots and lots of data from sensors and edge devices and wearable devices and our cars, and collect that and apply those analytics. It’s not just about the technology, but it’s also how the technology is getting deployed as a collaborative kind of thing.
We’ve moved away from mainframes. Obviously a long time ago, it was kind of central processors, and certainly mainframes still exist, but we’ve moved away from mainframes as the center of the universe. The next thing was client server. I would argue that in this new open source world, user-centric kind of communities are emerging, and so it’s a combination of having the data, having some really cool breakthrough big data tools, but also the collaborative environment that exists that’s very different than where we’ve been as an industry.
Andrew: Interesting. Not only interesting, but you provided a nice segue into the next question. I’m going to have to think about what you said with regards to collaboration and see if I can bring that back in the conversation. But in the meantime, one of the things you mentioned is how we now have the capability to retain a ton of data that before, we just couldn’t.
The economics of storage were such that retention of that data was just not feasible, and now it is. Ironically, now there’s getting to be a concern, not around the capability to retain the data, but policy around what gets retained and what doesn’t. There’s buzz going around about data as cultural heritage, and maybe government data, open data, is a big part of that.
There has been some concern, for example, without getting terribly political, that certain data around climate change might or might not be archived and kept in perpetuity. My question then is, what do you feel is our responsibility as citizens or as an industry, towards preserving data, and what do you think should be done to live up to that responsibility?
Scott: One of the predictions that I’ve talked about for some time is that data is really becoming everyone’s product. Whether you’re an automobile manufacturer, whether you’re a chip manufacturer or whether you’re a consumer products company, your product is not just your product. Your product is also, and this is certainly of value to your business, your data, the data that your product collects when it’s being used. It can be used for understanding warranty claims, for understanding how features get used and improving the product, for understanding how customers interact with the products and so on. All of those things become really important assets.
If you take it one step further into the public domain, yeah, the data that we collectively create is our cultural footprint. It’s important and I think will become increasingly important. Just think about the discovery of the Dead Sea Scrolls. That’s data from a really long time ago, and it’s precious. Unpacking it has led to lots of interesting insights. Those historical artifacts are data, and if we fast forward to today, the data that we create becomes our historical artifacts and footprints, and so I think it will become increasingly important.
I think there’s been a lot of goodness in terms of a lot of public data that is available freely that’s been created. It would be great for us as a society to figure out ways to enable more of that. The flip side of course is understanding how to protect privacy, how to protect individual rights in that context. I think that there is still some technology and frankly, some policy to be developed around that as well, and that will be increasingly important over the next decade.
Andrew: You know it’s interesting too, you mentioned Dead Sea Scrolls, so I guess we have to think about some of the data being in other languages, even languages like Aramaic that aren’t in use anymore. Yeah, the whole notion of ethics is interesting. Once upon a time, I was a developer and when I moved into the enterprise sphere, I learned that when we built applications to do maintenance on data, we didn’t put a delete function. At first, that just kind of confused me. Then later, I gained an appreciation for how data just never ought to be deleted.
That was then, in a transactional setting where the data volumes were much smaller. And now, although we have the ability to store so much of it, that doesn’t mean we want to because the notion of being a pack rat gets really serious when you’re talking about the high volumes. It definitely leaves a lot of food for thought.
I guess a lot of that data gets stored in the cloud. Then it sort of becomes a mess that’s stored somewhere else, if it is a mess. Of course, that has cost as well, but that kind of segues into the whole question of doing big data in the cloud. In the hype-driven environment that our industry lives in, we hear a lot about that. But at least according to one or two sources, the portion of the worldwide analytics market that’s in the cloud is only about 17 percent by spending right now.
Meanwhile, we know Hortonworks is in the cloud in a number of respects, and Microsoft’s Hadoop offering is actually based on Hortonworks’ data platform, so clearly you guys are cloud believers. What do you think needs to happen for companies to really start investing in the public cloud specifically, and to get beyond just having one or two kind of skunkworks projects there, but really having it be a mainstream choice?
Scott: Well, I think that is happening now. The world has been going to the cloud for some time, and to your point, and I think the studies reflect it, data-centric, analytic-centric applications and workloads have been slow to adopt relative to the rest of the market. I think that happens for a whole bunch of different reasons. Certainly one is that some of the early cloud adoption really happened in places where there was an economic incentive to go do it. For applications that are very light in data but heavy in processing, those kinds of things, there’s an economic advantage in moving to a cloud because while they’re processing-intensive, they might only use a small portion of a whole server. So why buy the whole server? Let me just buy the slice that I need.
Even with the profit premium for the cloud provider, there’s an economic incentive to go do it, and then you get all the ease of use, and all the other advantages as well. A lot of things that went quickly kind of fit that footprint. When you think about data, there are a couple of things. Data has mass, it has gravity and data movement can be expensive, so that’s one big thing.
The second thing is if you think about your company’s data and the analytics on your company’s data being highly valuable, being highly proprietary, being highly differentiating, I think there were, in early days, some concerns about security and privacy of leveraging public cloud technology. By the way, I don’t know that they were necessarily technologically valid, but there were those, and still are some of those lingering concerns that cause, “Hey, do I really want to put this out there? Because I really want to protect it.”
Like I say, I think technologically, that’s a little bit untrue, and we’re starting to see people realize yeah, there’s safety in numbers, and we can depend on this as a platform. Then you do get into the notion of data gravity and the expense and the latency involved in moving data from place to place. I’ll take that and turn it back around. I think part of the reason that you’ll see rapid adoption of cloud for data and analytics is around that gravity, where there’s a lot of data being created at the edge in the IoT space that will actually be created and live first in the cloud. In that regard, that will be kind of the first choice, because that’s where it lives already.
Why move it if I can simply do what I need to do, and kind of play it where it lies? I think as we move into the IoT era, you’ll start to see more of that. In fact, I think you’ll see very much unlike what’s happened in the IT industry for the last 40 years, centralizing and converging. I think you’ll see a lot of this “play it where it lies”, where you’ll end up with data footprints in multiple places. That’s really why, certainly at Hortonworks, we talk about connected data platforms, being able to connect the data, being able to push applications around this grid of data where data lives, whether it be in the cloud or on-prem or in multiple cloud footprints, is going to be differentiating in the future because it’ll just take too much time and be very costly to move all of those hundreds of petabytes or exabytes of data all over the place.
“Play it where it lies” becomes an important thing and like I said, in the IoT space, a lot of that data is going to be created and live in the cloud. I think that’s where combining with what I said earlier, that people are starting to have more trust and more confidence in the security and privacy implementations of public cloud. Attitudes are changing, but also the data proximity is changing as well, and that’s why you’ll start to see a bigger uptick in cloud. I don’t think it’ll be exclusively cloud, I think it’ll be multiple clouds and multiple on-prem footprints for many very large customers for some time, getting back to the whole notion of the gravity of data.
Andrew: Yeah, that makes sense. I think what you’re saying is there’s gravity, which means there’s also a certain amount of inertia. If that inertia biased people towards on-prem before, it may actually bias people towards the cloud in the future because so much data is going to originate there anyway, or originate at edge devices and then coalesce in the cloud, etc. That makes sense, and it segues nicely into some discussion of predictions.
I don’t know if you know this, but I actually write about big data for ZDNet, and I put together a piece at the end of 2016 that compiled predictions for 2017 from a number of figures in the industry and a number of companies. It’s quite the exercise in editing and compilation. I’ve seen everything from all kinds of almost platitudes now around artificial intelligence and even the notion of automation and its impact on employment, things of that nature. I could go into a lot more detail, but again I want to keep things kind of time efficient here.
I have a really kind of tricky, maybe counter-intuitive question for you, which is, I imagine you’ve reviewed a lot of predictions, maybe even formulated some yourself. Have you come across predictions that you actually feel strongly are incorrect, that won’t take place, or at least won’t take place this year, that you find far off the mark? Are there things that you’ve seen predicted that you really feel, for example, won’t be happening in the next five years or so? If so, which ones leap to mind?
Scott: I think this gets back to what I was mentioning in that last question, and that is I see many experts and vendors talking about converged systems and the rise of converged systems. I think things are actually going in exactly the opposite direction. That’s why again, when we brand the Hortonworks data platform, we talk about connected data platforms. We think the notion of being connected is much more important than the notion of being converged, because we do believe there’s going to be so much data created at the edge, where there’s very little economic value to bringing it together to a centralized system. If you can actually run the analytics at the edge, it’s much more efficient, and it’s going to be much more scalable over time. Think about edge devices: the smartphone that I carry has the equivalent of a Cray supercomputer from 20 years ago.
The edge is getting a lot of capacity to do processing as well, so why not take advantage of that? I really don’t believe in this whole notion of convergence. For many years it’s been about, “Hey, let’s get everything into a central ERP system and a stack, we’re going to take cost down, we’ll be highly efficient,” all that kind of stuff. This new world is the opposite of that. Figuring out ways to have portable and collapsible applications that can run at the edge, that can run all over the grid of data that’s going to exist in this new footprint, I think that’s where it’s at. I tend to be completely the opposite of the converged predictions.
Andrew: Okay, fair enough. Yes, on this series, we talk to a number of folks, including those who might feel the convergence is taking place. It’s always good to hear from all sides, if you will.
Scott: Along those lines, by the way, there may be some micro convergence inside of this connected platform. Naturally, think about the evolution of big data. Some of the early Hadoop stuff was very single-application, very batch oriented. Now, with Hadoop 2.X, arguably you have multi-tenancy, multi-applications, so there’s convergence of physical platforms. I call that micro convergence because that’s kind of inside of a broader data fabric that’s growing exponentially.
Andrew: Sure, no that makes sense. If I wanted to be kumbaya about it, I could say you guys are in some ways saying the same things, just that workflows are able to overlay into a converged platform. But at the same time, the diversity of data and the sources and the locations and the grid that you mentioned a couple times, especially when we were talking about the cloud, kind of underscores the fact that we’re certainly not going to get everything in one neat little repository, that’s just history.
Let me take you more onto the positive side of predictions though. You’ve obviously spent a ton of your career not just in the big data world, but in the data warehousing world before, having been at Teradata. At this point, what are your predictions for the year 2017, based on trends that you’ve seen come and go? Maybe you’re even seeing some patterns in the big data world that you already saw play out in the BI world and in the data warehousing world. Even if you haven’t, I’m just really eager to hear a couple of pieces that are really prominent for you in the upcoming year.
Scott: I think one of the big things is this notion of data in motion that I and we have been talking about a little bit. We hear about Hadoop and big data and all that kind of stuff, and they’re all really great. Then there’s also a lot of buzz in the marketplace about, “Well, what about streaming and streaming analytics, and all that kind of stuff?” I think there’s this bigger play out there that’s going to mature soon, and it’s happening in real time. I lump that into this notion of data in motion, being driven by edge devices, being driven by IoT. I don’t think the world is just about streaming or streaming analytics or CEP, or any of those individual technologies. It’s really about how do you manage moving data, data that hasn’t landed anywhere yet, but it’s just moving through a pipeline, through a workflow across a sensor network?
There’s a combination of things that are important there, including data flow management, encryption and protection and provenance of data that is in motion, making sure that if you get a signal in from a sensor somewhere like an airbag deployed in an automobile, is it real? Because if it is, obviously there’s some very immediate and important action that needs to be taken. Is there provenance back to guarantee that those signals are correct and there’s not noise in the system? There’s a whole category of new stuff going on in that space as it relates to just managing the flow of data in motion.
Inside of managing the flow I think obviously, streaming, stream analytics, simple event processing, complex event processing, windowing and all of those kinds of functions become important. I do believe that we’ll see a lot of maturity in and around this whole space of data in motion, and kind of a consolidation of technologies and a consolidation of processes being driven by all of the data that’s just flying around us all the time, and trying to manage and harness it more effectively.
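To make the windowing idea mentioned above concrete, here is a minimal sketch. It is an illustration only, not something described in the interview and not Hortonworks code: the event format, the sample sensor names and the `tumbling_window_counts` function are all assumptions. It groups timestamped sensor events into fixed, non-overlapping ("tumbling") windows and counts events per sensor in each window.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds=60):
    """Count events per sensor in fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, sensor_id) tuples
    (a hypothetical format chosen for this sketch).
    Returns a dict keyed by (window_start_seconds, sensor_id).
    """
    counts = defaultdict(int)
    for timestamp, sensor_id in events:
        # Round the timestamp down to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, sensor_id)] += 1
    return dict(counts)


# Hypothetical sample events: (seconds since start, sensor name).
events = [(0, "airbag"), (10, "airbag"), (61, "airbag"), (65, "brake")]
result = tumbling_window_counts(events)
```

A production stream processor would also have to handle late or out-of-order events and emit results as each window closes; this sketch ignores both and simply processes a finite list.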
Andrew: Just as a follow-up there, do you foresee it manifesting in terms of just the industry having greater sophistication around it and developing tooling and features and capabilities? Or maybe, and I’m leading you here a little bit, but so be it, the application of stuff we already do, machine learning and automated application of that, to help manage that complexity through technology so that the poor data engineers, data scientists and especially business users aren’t so overwhelmed with it. Is it going to be about tooling, or is it going to be about kind of algorithmic handling of that complexity?
Scott: Yeah, I mean as with anything, I think it’s more the former near term, and more the latter longer term. I think it’s kind of a realization of, “Gee, this is a whole new problem set to go solve,” and so what’s the first thing that we’re able to do? We can go tool it, measure it, monitor it, understand it, build the flows, manage it and then we can become more sophisticated over time. By the way, eating our own dog food, right? We’ve got these really cool analytics and machine learning algorithms that we can run in the Hadoop space. Once we set those flows up and have the tooling, certainly we can leverage that to go to the next level.
I think this whole notion of streaming is hot, and then there are the vendors pushing specific technologies and so on. Streaming is hot, but I think it’s a part of a bigger context that I call “data in motion”, which is a broader space. I think that the market, and certainly our customers, are demanding a little bit more in that space, as they look out at how they can go build next-generation kinds of applications. I think the really cool thing comes as a derivative of that: not only can we now manage data in motion and understand it and apply tools to it, but then can we take the analytics that we build that were traditionally for data at rest and push them out into that network?
The analytics and the models that we build on petabytes and exabytes of historical data that are very accurate, can we start pushing them out in the network to the data as it’s moving? As such, we can be more proactive about the offers that we make to consumers. We can be more proactive and have lower latency on the actions that get taken from signals that are processed.
Andrew: I don’t know whether to feel uplifted or a little cynical. You’ve kind of, actually I think quite astutely, reduced a lot of product cycles into first we get sensitive to it and we develop tooling and management around it, and then we try and get more automated about it. I think you’re right. Hopefully that’s an uplifting message. It shows that, as an industry, we’re being self-aware about things, and also trying not to just sort of go from one problem to the next, but maybe take it to more of a meta-level and sort of corral a bunch of problems together into one greater management challenge.
You’ve talked about data in motion, you’ve talked about the grid, you’ve talked about kind of the multiple locations and sources of data, the fact that there’s gravity, the fact that there’s still the need for portability. I think you’ve given us today a really nice overview of all the different ways this same bigger question manifests. Hopefully in 2017, that’s something as an industry we can take on, and I’m just willing to bet that’s something that Hortonworks is going to focus on. Thank you so much for being here. I think that was a really good note to end on, and I will wish you and Hortonworks a great 2017.
Scott: Thanks very much. Thanks for having us, and yeah, I think based on the comment, I would say it should be uplifting. I think we as an industry are very data-driven, so let’s tool it and let’s build advanced algorithms around it. Have a great 2017.
Andrew: Thanks, and thanks everyone for being here.