Today we’re featuring Scott Gnau, Hortonworks’ Chief Technology Officer. He discusses new applications of big data, how we should preserve data, cloud investment and much more. Start listening to start learning, or read the transcript below.
Transcript, lightly edited for clarity:
Andrew: What’s current in big data? Today we’re delving into the buzz with Scott Gnau, CTO at Hortonworks, about everything from data as cultural heritage to big data in the cloud. I’m Andrew Brust of Datameer, and this is the Big Data Perspective. Scott, welcome, thank you for joining us.
Scott: Hey, thanks for having me.
New Big Data Uses
Andrew: Thanks again for being here. To get started, I thought I’d ask you: people tend to get very starry-eyed about what big data technology can do, anything from curing cancer to otherwise saving lives. Then on the other end, people tend to get very pragmatic and talk about how big data can improve processes or increase efficiency. Two ends of the spectrum, from the very aspirational to the buttoned-down, pragmatic and mundane. Maybe in the middle, you have a favorite application of big data that you’d be comfortable telling us about.
Scott: Big data, the data itself, the technology stack, the new way of thinking, the new way of collaborating around data and analytics, is really enabling some very interesting use cases that frankly weren’t possible six months ago, a year ago, two years ago. It’s really in this confluence that we’re finding a lot of really interesting things.
Frankly, in many cases we weren’t able to capture and store the data before. Now, because of some of the new technologies and platforms, and the continuation of Moore’s Law driving compute and data storage capacities up, we’re actually able to store stuff that used to get thrown away.
We can store it in its native format, and then we can apply tools and analytic technologies to that data to go find things that we didn’t know about before. That’s what’s creating this buzz, anywhere from saving lives to, yes, curing cancer, which is ultimately something we all look forward to having happen. When you just think about how technology is enabling a much more refined look at a much larger set of data points, and much more accurate cancer treatments, which is just one example, it’s really exciting.
It goes across any industry and the thing that excites me the most is the stuff that we’re going to find next, and how we’re going to build on that by taking advantage of this new technology stack. It’s being able to combine lots and lots of data from sensors and edge devices and wearable devices and our cars, and collect that and apply those analytics. It’s not just about the technology, but it’s also how the technology is getting deployed as a collaborative kind of thing.
We’ve moved away from mainframes. Obviously a long time ago, it was kind of central processors, and certainly mainframes still exist, but we’ve moved away from mainframes as the center of the universe. The next thing was client-server. I would argue that in this new open source world, user-centric kinds of communities are emerging, and so it’s a combination of having the data, having some really cool breakthrough big data tools, but also the collaborative environment that exists, which is very different from where we’ve been as an industry.
How Should We Preserve Data?
Andrew: Interesting. Not only interesting, but you provided a nice segue into the next question. I’m going to have to think about what you said with regards to collaboration and see if I can bring that back in the conversation. But in the meantime, one of the things you mentioned is how we now have the capability to retain a ton of data that before, we just couldn’t.
The economics of storage were such that retention of that data was just not feasible, and now it is. Ironically, now there’s getting to be a concern, not around the capability to retain the data, but policy around what gets retained and what doesn’t. There’s buzz going around about data as cultural heritage, and maybe government data, open data, is a big part of that.
There has been some concern, for example, without getting terribly political, that certain data around climate change might or might not be archived and kept in perpetuity. My question then is: what do you feel is our responsibility, as citizens or as an industry, toward preserving data, and what do you think should be done to live up to that responsibility?
Scott: One of the predictions that I’ve talked about for some time is that data is really becoming everyone’s product. Whether you’re an automobile manufacturer, a chip manufacturer or a consumer products company, your product is not just your product. Your product, and certainly a source of value to your business, is also your data: the data that your product collects when it’s being used. It can be used for understanding warranty claims, for understanding how features get used so you can improve the product, for understanding how customers interact with the products, and so on. All of those things become really important assets.
If you take it one step further into the more public domain, yeah, the data that we collectively create is our cultural footprint. It’s important, and I think it will become increasingly important. Just think about the discovery of the Dead Sea Scrolls. That’s data from a really long time ago, and it’s precious; unpacking it has led to lots of interesting insights. Those historical artifacts are data, and when we fast-forward to today, the data that we create becomes our historical artifacts and footprints, so I think it will only grow in importance.
I think a lot of good has come from the public data that’s been created and made freely available. It would be great for us as a society to figure out ways to enable more of that. The flip side, of course, is understanding how to protect privacy and individual rights in that context. I think there is still some technology, and frankly some policy, to be developed around that as well, and that will be increasingly important over the next decade.
How Will Companies Start Investing in the Cloud?
Andrew: You know, it’s interesting too, you mentioned the Dead Sea Scrolls, so I guess we have to think about some of the data being in other languages, even languages like Aramaic that aren’t in use anymore. Yeah, the whole notion of ethics is interesting. Once upon a time, I was a developer, and when I moved into the enterprise sphere, I learned that when we built applications to do maintenance on data, we didn’t put in a delete function. At first, that just kind of confused me. Then later, I gained an appreciation for how data just never ought to be deleted.
That was then, in a transactional setting where data volumes were much smaller. Now, although we have the ability to store so much of it, that doesn’t mean we want to, because the notion of being a pack rat gets really serious at these high volumes. It definitely leaves a lot of food for thought.
I guess a lot of that data gets stored in the cloud. Then it sort of becomes a mess that’s stored somewhere else, if it is a mess. Of course, that has cost as well, but that kind of segues into the whole question of doing big data in the cloud. In the hype-driven environment that our industry lives in, we hear a lot about that. But at least according to one or two sources, the portion of the worldwide analytics market that’s in the cloud is only about 17 percent by spending right now.
Meanwhile, we know Hortonworks is in the cloud in a number of respects, and Microsoft’s Hadoop offering is actually based on Hortonworks’ data platform, so clearly you guys are cloud believers. What do you think needs to happen for companies to really start investing in the public cloud specifically, and to get beyond just having one or two kind of skunkworks projects there, but really having it be a mainstream choice?
Scott: Well, I think that is happening now. The world has been moving to the cloud for some time, and to your point, and as the studies I think are representing, data-centric, analytics-centric applications and workloads have been slow to adopt relative to the rest of the market. I think that happens for a whole bunch of different reasons. Certainly one is that some of the early cloud adoption happened in places where there was an economic incentive to go do it. Applications that are very light in data but heavy in processing, those kinds of things, have an economic advantage in moving to the cloud because, while they’re processing-intensive, they might only use a small portion of a whole server. So why buy the whole server? Let me just buy the slice that I need.
Even with the cloud provider’s profit premium, there’s an economic incentive to go do it, and then you get all the ease of use and all the other advantages as well. A lot of the things that moved quickly kind of fit that footprint. When you think about data, there are a couple of things. Data has mass, it has gravity, and data movement can be expensive, so that’s one big thing.
The second thing is, if you think about your company’s data, and the analytics on your company’s data, as highly valuable, highly proprietary and highly differentiating, I think there were, in the early days, some concerns about the security and privacy of leveraging public cloud technology. By the way, I don’t know that they were necessarily technologically valid, but there were, and still are, some of those lingering concerns that prompt, “Hey, do I really want to put this out there? Because I really want to protect it.”
Like I say, I think technologically that’s a little bit untrue, and we’re starting to see people realize, yeah, there’s safety in numbers, and we can depend on this as a platform. Then you do get into the notion of data gravity, and the expense and latency involved in moving data from place to place. I’ll take that and turn it back around. I think part of the reason you’ll see rapid adoption of cloud for data and analytics is that gravity: there’s a lot of data being created at the edge in the IoT space that will actually be created and live first in the cloud. In that regard, the cloud will be kind of the first choice, because that’s where the data lives already.
Why move it if I can simply do what I need to do and kind of play it where it lies? I think as we move into the IoT era, you’ll start to see more of that. In fact, very much unlike what’s happened in the IT industry for the last 40 years, centralizing and converging, I think you’ll see a lot of this “play it where it lies,” where you’ll end up with data footprints in multiple places. That’s really why, certainly at Hortonworks, we talk about connected data platforms. Being able to connect the data, and to push applications around this grid of data where the data lives, whether in the cloud, on-prem or in multiple cloud footprints, is going to be differentiating in the future, because it will just take too much time and cost too much to move hundreds of petabytes or exabytes of data all over the place.
“Play it where it lies” becomes an important thing, and like I said, in the IoT space a lot of that data is going to be created and live in the cloud. Combine that with what I said earlier, that people are starting to have more trust and more confidence in the security and privacy implementations of the public cloud. Attitudes are changing, but the data’s proximity is changing as well, and that’s why you’ll start to see a bigger uptick in cloud. I don’t think it’ll be exclusively cloud; I think it’ll be multiple clouds and multiple on-prem footprints for many very large customers for some time, which gets back to the whole notion of the gravity of data.