Big Data & Brews: The Future of Big Data with Pivotal, Twitter, Think Big Analytics, MapR
Continuing on from last week’s segment, I wanted to add a few more snapshots from guests like Milind Bhandarkar, chief scientist at Pivotal, Oscar Boykin, data scientist at Twitter, Ron Bodkin, co-founder and CEO of Think Big Analytics and Tomer Shiran, product manager at MapR.
Ron brought up an interesting point that I wanted to call out. He said that we’ll see more big data capabilities moving to the cloud in the next few years as some of the current limitations and challenges are resolved and cultural disadvantages and skepticism are mitigated. He also said that while it will clearly take a while, over the next decade big data will have a bigger impact on economic growth than the wave of client server computing and workflow automation had in the ’90s, and will create a tremendous amount of value for humankind. Value for humankind is something we are always aiming to drive here at Datameer, and it’s always great to hear that other people in the big data space have the same goal.
Tune in below for the full segment!
Milind Bhandarkar, Chief Scientist, Pivotal
Stefan: What’s the most exciting thing that’s coming up? As a visionary that looks into the future, anything on the horizon that you’re really excited about? I mean Storm isn’t really the latest, greatest thing anymore. What’s the next thing beyond that?
Milind: In the short term, when we launch this whole Pivotal HD, one of our visions has always been that you get a buffet of computing frameworks to run on the same data. That’s the reason that we want to accumulate all this data in HDFS, whether it is HAWQ connective data or HAWQ trying to access data from HBase or whatever, right?
Why just be limited to that? Because of YARN, now we can have something like GraphLab, which specializes in graph model and graph analytics, have access to the same compute resources and the same data, on top of the same Hadoop cluster.
GraphLab is the recent one that we added, but in order to run GraphLab, the GraphLab runtime is actually OpenMPI. OpenMPI has been used in high performance computing for a long time, so what we managed to do is make YARN the resource manager for OpenMPI.
We’ll be open sourcing it soon, and I’ve been saying “soon” for a long time now, so …
Stefan: Well you’re a small start-up, there’s only one lawyer that has to look at the paperwork, right?
Milind: Right, exactly. That’s the project that we call Hamster.
Stefan: There’s a whole zoo of animals here.
Milind: I can’t take full credit for Hamster, but I can take credit for the name. It stands for Hadoop and MPI on the same cluster. Basically we’ll see a lot of these specialized runtime systems for specialized machine learning algorithms, or any time your communication pattern changes, that’s what you’re going to have. All of them sharing the same data.
Stefan: The big vision is to leave the data in one space and move the buffet of tools to the data?
Stefan: What we did traditionally over the last 30 years or so, we moved the data into the specialized environment. Now we moved the specialized environment to the data.
Milind: Although the data as a byte stream is there in HDFS files, we actually need to impose some structure to input format and record readers and things like that, right? Really, HCatalog is actually a great step in going towards that, because it actually associates a particular record structure with any arbitrary data set, that both MapReduce as well as Hive as well as Pig, all of them can use, simultaneously, right?
Stefan: Or HAWQ.
Milind: Or HAWQ. Exactly. That’s the thing, that’s where I was going. It’s that we basically make use of something that we call Pivotal Extension Framework, which essentially is a translation layer between these outside data sources. Outside as in, not in our native format, and make it accessible to all our different components that are running on top of Hadoop.
One more recent happening that I’m really proud of is basically we introduced the support for Parquet as a native table type in HAWQ. You can basically, when you do a CreateTable, you basically say, CreateTable type=Parquet.
Now it’s a completely open format, and Cloudera has done a lot of work on it, and the reason we went with Parquet is that it is accessible both from Java and C++.
Since HAWQ is written in C, that basically gets into the native access for Parquet, and the same files are then available via Parquet input format, for the rest of the ecosystem. Right?
Oscar Boykin, Software Engineer, Twitter
Stefan: We’ll put it on my Kindle for a next airplane ride. If you look into the future, what’s next?
Oscar: I think that’s a good question. I don’t know.
Stefan: Drinking more beer.
Oscar: I’m really excited about people getting good at programming large numbers of computers, and I think that we’re still kind of figuring … we’re still in early days. I mean, three years ago, people weren’t very good, so I think that’s a big deal. I’m much more knowledgeable about probability theory than statistics, so if you don’t really know much about statistics, you’re always like, “Well, those two sound like the same thing.”
So I’ve only recently at Twitter really been getting more into machine learning, but one observation I have as someone who really cares about abstractions is that machine learning doesn’t seem like we’re to the point where we’re really programming with machine learning. I’d like to see that happen. I’d like to be somehow a part of that where … like, right now, if you want to make a new model, you just, “Okay, I guess I’ll reuse all this code, but I’m going to start from zero. I heard from somebody that I should use logistic regression and maybe I should use these features.” But you don’t have any libraries.
Imagine … like, you learned a lot after you were born, I imagine. You seem quite clever, so … but you have all these circuits that most humans came with, so we’re not doing any of that that I can really see on a large scale with machine learning, where I can go, “You have got some great featurization libraries. I’d like to use your featurization libraries and then add some more libraries, and then share like … how do we featurize?” And then you’ve already trained this model that can featurize a lot, can produce these very highly refined features, and now I want to take those and plug those into my next thing. I don’t see us doing that yet. When we start doing that, I think recommendations systems can get really, really, really good.
Stefan: Great. And I think recommendations systems are the future.
Oscar: Yeah, definitely. Absolutely.
Ron Bodkin, Founder, Think Big Analytics
Stefan: When you have a really good perspective flying above all those different vendors and companies, and a lot of experience with diverse customer [set], where is this all going in five to ten years? Where will we be?
Ron: That’s a great question. From our perspective we’re in the early stages of the journey towards big data. The whole industry will come along. The economy’s going to really change in meaningful ways as more companies become data-driven, as you get start-ups that are using big data as a weapon to change value [chains].
I think over time what you’re going to see is that the data platform, the analytics platform will be integrated in to drive strategic outcomes in many companies; that they’re going to use data science as a fundamental way of both thinking about strategy, how to resolve.
How do you come up with experiments that test and learn about what’s going to work as well as process execution, having the right data, breaking down silos in front of people that are acting in a process and the right level of automation to drive response as events come in that you can use machine learned models that are continuously being improved?
From the technology standpoint, that’s going to mean you’re going to have rich platforms based on open source that are used for both real time response and for the analytic core. You’re going to continue to have evolution. One of the things that’s going to be an interesting X factor that will hit over the next five, ten years is changing storage dynamics.
As you start to see things like solid state memory that functions a lot like DRAM, but retains data when power is off; so, use of more solid state storage, along with spinning disks, is going to be really interesting.
Even smaller things like the fact that increasingly really large disks are not being designed to have access to data be as easy; that they’re being designed in complex ways that are not as efficient. You’re getting some bifurcation even in spinning disk.
The changes in the underlying architecture are going to be interesting. In a space where you’ve got just massive innovation, you’re going to see a lot of different ideas flourish around virtualization, and OpenStack, and cloud. You’ll see more big data capabilities moving to the cloud over the next few years.
There’ll be … some of the current challenges and limitations in cloud will be resolved. Some of the current cultural disadvantages of cloud, skepticism about cloud will be mitigated. Just like 10 years ago to say you’d put customer data in the cloud would be considered heresy. Now salesforce.com is ubiquitous.
Stefan: Multi-billion dollar … yeah, um-hum (affirmative).
Ron: That’s another trend you’re going to see is definitely a lot of factors converging. What won’t change, what’s clear is that it’s going to take a while. There’s a lot to learn about really driving innovation and changing culture to deliver value for big data. Over the next decade, it’s going to have probably a bigger impact on economic growth than the wave of client server computing and workflow automation had in the ’90s. It’s going to create a tremendous amount of value for humankind.
Tomer Shiran, Vice President Product Management, MapR
Stefan: Wow. What’s in the future for you, for MapR, for the Hadoop ecosystem? What do you think?
Tomer: What do I think? Well, I think if you look at the trends in the market, IT spending is growing at about 2.5% annually. Data is growing at about 40% annually, right? There’s a disruption there that has to happen and Hadoop is that disruption. MapR is the company that’s bringing Hadoop to the enterprise as well as the web companies. With a production ready distribution, I think we’re in a great position to fuel that disruption. I think it’s the biggest disruption really since the relational database 30 years ago.
Stefan: Strata was a few weeks ago. What was that for you guys?
Tomer: I think one of the things I heard actually from someone who I was talking to was that this year at Strata, there were a lot more suits than previous years which is a good sign with the Hadoop market maturing, and the decision makers from our perspective when you look at our customers and the prospects. The decision makers being at the show and really being more tuned in to this big data revolution.
That’s one thing that was apparent at the conference. We had a few announcements ourselves. We announced YARN in the MapR distribution. We announced the MapR sandbox which is a really nice, easy-to-use virtual machine that you can download from our website, from MapR.com, and learn Hadoop, get up to speed. It’s for somebody who’s new to Hadoop and wants to look and learn how to write code, how to run queries, things like that.
There was also a great panel in one of the sessions at Strata for Hadoop users. Actually, all of them are MapR customers. It was Climate Corporation, Cisco IT, Solutionary which is a managed security company, and the Rubicon Project which is an ad exchange. They talked about how to achieve production success with Hadoop which is what we help our customers do. I think one of the greatest comments I heard there was from Piyush at Cisco IT. He’s their distinguished engineer and chief architect for big data at Cisco. He was asked, “Well, how do you make Hadoop successful?”
His response was that you have to get the architecture right up front because if you can get the right architecture in place, then the conversations will be, “How do we get value out of this? How do we increase revenue? How do we reduce costs?” things like that versus, “How do we solve this issue with the NameNode? How do we solve this issue with this open source project?” Really, an IT focused discussion as opposed to a business discussion. Cisco has been using our product for a while now and it started with simple use cases like offloading the data warehouse. Then there was a use case that actually increased revenue by 40 million dollars by providing recommendations to their channel partners on which opportunities to engage. There are now over 12 different use cases and different business groups running on the cluster.
Stefan: Your pricing model is percent on the ROI or?
Tomer: We would have IPOed a long time ago if that was the case. It’s a standard per-node, annual subscription type model.