Big Data & Brews: What’s in Store for the Future of Big Data?
One of my favorite questions to ask my guests is what they think will be the next big technology in big data. These are some of the smartest minds in the industry and it’s always interesting to hear their thoughts on the topic.
I thought it would be cool to compile a few snapshots of our past guests, including Hortonworks founding CEO and CTO, Eric Baldeschwieler, data science consultant Antonio Piccolboni and Justin Borgman, CEO of Hadapt.
Tune in below and leave some comments to let me know your thoughts.
Eric Baldeschwieler, Founding CEO & CTO, Hortonworks
Stefan: What’s the future? You know? Where do we go? I mean, you talked a little bit about Storm and Spark.
Eric: Mm-hmm (affirmative).
Stefan: Obviously, but where do you see Hadoop maybe five years from now even?
Eric: Well, there’s kind of two … I’m spending a lot of time right now thinking about the future of Hadoop, and there’s two megatrends that I’m really noodling on. There’s a whole list of features that I could give you …
Eric: … but that’s probably another talk. One megatrend is how are Cloud and Hadoop going to converge? I think that’s … there’s a 20-minute segment right there.
Eric: I think that’s really interesting. If you look at it, Amazon and Google are two mature proprietary systems that show the two ways it could go. Amazon is a Cloud first, and people are having a lot of success running Hadoop on it. Google built an HPC infrastructure with a real focus on supporting things like MapReduce and that had an HDFS-like storage infrastructure first, and now they do Cloud-like things on top of it, right? They run all their services in, effectively, a Hadoop-like system. Or in at least an HPC scheduler-like system.
So, how are these OpenStack, or how are these Open Source ecosystems going to converge: OpenStack, Hadoop, and all of the various projects in there? I think that’s really wide open.
Stefan: Mm-hmm (affirmative).
Eric: Right? I mean, right now neither project does what the other set of projects need, but IT managers don’t want both.
Eric: Right? They want one common place to store all the data, and one common way to compute all the data. One common way to allocate resources to projects.
Stefan: Right. They want a plug in the wall, where they just put in … this is my storage and compute, and it’s a utility.
Eric: Exactly. So they think that that thing is going to be called [00:02:00] OpenStack, but Hadoop is actually getting deployed in a lot more places and at a lot more scale.
Stefan: Than OpenStack.
Eric: Than OpenStack, so how’s that story going to end?
Eric: I have no idea.
Eric: There’s a lot of speculation you can do there. The other real megatrend is when we started Hortonworks, we talked about how important it was that the community not fragment. That there be one distribution of Hadoop.
Eric: That’s a noble goal, but someone was following me around at a conference the other day and saying, “Admit it! Hadoop, the Hadoop community is fragmented. The Hadoop community is fragmented.” We got into this long argument and ultimately I said, “Well, so what?”
Eric: Right? I think, yes, in some ways the Hadoop community, we can argue about how much it’s this way, and how long it’s going to last, but I think the Hadoop community is kind of going into a Unix decade.
Eric: If you look at the Unix ecosystem, the Unix APIs came out pretty early. There was the AT&T Unix version and then there was the Berkeley Unix version, and then there was every vendor’s Unix version, and one can argue that this was a terrible thing. That Unix evolved much more slowly than it might have if there had been one.
Stefan: Right. Well, it’s an evolution.
Eric: Yeah, you can argue that, too, and that everybody was slowed down because, as a vendor, if you wanted to write an application for Unix, you had to write it for everyone. You could look at it that way and you could look at the SQL ecosystem and say the same thing. Wouldn’t it be terrific if all the SQLs were the same because then all the people that write SQL apps would have less work to do?
Or, you could turn around and say, “Well, wait a second, look at those huge ecosystems, right?” If you look at the Unix ecosystem, Unix went from an unknown thing to the default …
Stefan: Multi-billion market and, you know, a lot of technology and innovation are in different areas.
Eric: … and the defaults are the ecosystems on which the systems’ infrastructures are built during that “Unix decade.”
Stefan: Right.
Eric: So I think Hadoop’s going to see the same thing. I don’t know. I’m, of course, a big fan of Apache Hadoop and hope that everybody does continue to base all of their work on that, but whether or not they do, the APIs of Hadoop are being supported by more and more vendors, and more and more products, and more and more distros, be they pure or not pure, all the time and, as a result, I think what’s really interesting, over the next few years, is what are people going to do with Hadoop?
Eric: Right? What is that ecosystem that’s forming above Hadoop? If that does really well, that just drives more of all the Hadoops, and that creates more and more opportunity.
Eric: So yeah, that’s very exciting to watch and see.
Antonio Piccolboni, Data Science Consultant
Stefan: So outside of RHadoop, maybe as a final question, what are other open source projects that you see that you’re really excited about or that you’re maybe consulted in corporate that you think have a great future?
Stefan: Yeah?
Antonio: I think when I… I got a..
Antonio: I looked into node-
Stefan: Okay, node.js, right? Yes.
Antonio: I think Reno’s something [I hate too 00:18:57] much to do this, but I don’t think I have time to… yeah… to…
Antonio: Yeah. If you look at the trajectory, Java started on appliances, I mean before it was a language for applets.
Stefan: Mm-hmm (affirmative).
Antonio: Then I don’t know why they moved it… they needed a language for applets at Sun. It became a language for applets. It completely, totally bombed it, failed it, drastic. I don’t think I have seen an applet in five years or something.
Stefan: Oh, oh.
Stefan: But you have R.
Antonio: I can try to mould it.
Stefan: Oh yeah.
Antonio: Two examples. We’ll see. It’s already moving on the server-side. It has interesting tools. It has a lot of people working on the interpreters that you can write crappy code and it’s gonna run fast anyway. So I think…
Stefan: Yeah, that makes sense.
Antonio: It is fast enough for us.
Justin Borgman, CEO, Hadapt
Stefan: Anyhow, Justin, what’s coming next in the market? What do you think? What’s the future in big data? Obviously it’s SQL, sounds like.
Stefan: But what do you think are some of the trends that you’re keeping a close eye on? Is it in memory stuff? Do we need to do more on the file system area? Is it more kind of the access? What do you think’s the next thing?
Justin: Yeah, great question. I think, certainly, all of those are interesting areas, and memory certainly has its place. I think some of the things that we’re most interested in keeping an eye on is this notion of you sort of have Hadoop. Again, I’ll draw. My bad handwriting.
Stefan: Oh, you didn’t see mine yet.
Justin: You’ve got HDFS in here, and very often people are using this as a landing environment, a data reservoir, a data pick-your-favorite-word, right?
Justin: Ultimately, they’re doing some ETL, and they use Hadoop as effectively fancy ETL, and then they push it into a database, a traditional database, maybe that’s Teradata or what-have-you. We’re constantly focused on what is increasingly driving people to skip this step and leave that data in here, and what’s missing to prevent that, because that’s the ultimate future we believe in. Certainly, we founded the company on that, of doing all of your analytics in one place in this data reservoir.
Some interesting things there we think are making ETL a thing of the past, to a certain degree.
Stefan: Oh, we’re on the same page here.
Justin: Yeah, exactly. That’s one area we keep an eye on things that we invest in from an IP perspective to try to make that easier, make that more accessible. Certainly, our vision is effectively bringing database technology into Hadoop. That’s sort of what Hadapt is all about. Continuing to watch that, also watching what the rest of the ecosystem is doing from a maturity around resource management with YARN, but also security.
Increasingly, as Hadoop goes production in major enterprise customers, these are the kinds of things that aren’t always the sexy things, that engineers are like, “Oh, this is great,” but you have to build it anyways. It’s also something we’ve been paid to do.
Stefan: Yeah, go to?
Justin: We see a lot.
Stefan: Yeah. To come back to that observation, [00:08:00], because we see that too, Hadoop becomes the melting pot for all the data, but then if it’s valuable inside, people are still pushing it into databases. Why do you think this is? Is it because you have a BI system sitting on top of the database that’s working well, that’s maybe overly integrated, or are people just feeling more comfortable having data clean and structured?
Because you could argue, “Hey, I just land all the data in there, and instead of cleaning it and transforming the data, I just do a view of the data, kind of the view concept from databases where my view is the cleaning of the data.” The advantage will be if I change my mind about what clean data means, I would just change the view, right? Aren’t we at the point that we have enough storage and compute to handle everything as a view?
Justin: Right. I certainly believe that’s coming, and hopefully not too far away, but I think the challenge, still, for a lot of people is that these existing legacy platforms have so much functionality already built into them that Hadoop hasn’t been able to duplicate, yet. To your point, BI tools is one common way that people want to interact with that data, and these tools don’t work perfectly seamlessly with Hadoop yet today. Things like doing deletes or updates, for that matter.
Stefan: And you guys do that, deletes and updates?
Justin: Not yet, but very shortly. That’s something we’re working on, and again something that we think we’ll be able to implement much more quickly, given the DBMS component of our architecture, than open source vendors that are trying to reinvent the wheel, to your point earlier.
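Editor’s note: the “view as cleaning” idea Stefan raises with Justin can be sketched in a few lines. This is only an illustration, not anything Hadapt or Datameer shipped: it uses Python’s built-in sqlite3 module as a stand-in for a SQL layer over landed data, and the table, column, and function names (raw_events, clean_events, build_demo, redefine_clean, totals) are made up for the example. The raw rows are landed once and never rewritten; “clean” is just a view definition that can be swapped out later.

```python
import sqlite3


def build_demo():
    """Land messy rows as-is, then define 'clean' as a view over them."""
    conn = sqlite3.connect(":memory:")
    # Raw landing table: everything stays as untyped text, nothing is fixed up.
    conn.execute("CREATE TABLE raw_events (username TEXT, amount TEXT)")
    conn.executemany(
        "INSERT INTO raw_events VALUES (?, ?)",
        [(" Alice ", "10.5"), ("bob", "n/a"), ("ALICE", "4.5")],
    )
    # First definition of clean: normalize names, drop rows whose amount
    # does not start with a digit.
    conn.execute(
        """
        CREATE VIEW clean_events AS
        SELECT lower(trim(username)) AS username,
               CAST(amount AS REAL) AS amount
        FROM raw_events
        WHERE amount GLOB '[0-9]*'
        """
    )
    return conn


def redefine_clean(conn):
    """Change what 'clean' means without touching the raw rows."""
    conn.execute("DROP VIEW IF EXISTS clean_events")
    # Second definition: keep every row, treat unparseable amounts as 0.
    conn.execute(
        """
        CREATE VIEW clean_events AS
        SELECT lower(trim(username)) AS username,
               CASE WHEN amount GLOB '[0-9]*'
                    THEN CAST(amount AS REAL) ELSE 0.0 END AS amount
        FROM raw_events
        """
    )


def totals(conn):
    """Aggregate over the current clean view."""
    return conn.execute(
        "SELECT username, SUM(amount) FROM clean_events "
        "GROUP BY username ORDER BY username"
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo()
    print(totals(conn))
    redefine_clean(conn)
    print(totals(conn))
```

The trade-off Justin points to still applies: a view recomputes the cleaning on every query, so this only works when storage and compute are cheap enough to skip materializing a cleaned copy.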