The Big Data Perspective, With Tony Baer [Podcast]

2017 is here, and it’s your year to become an expert on the big data world. Throughout January, we’re bringing you thought leaders of all types in our brand-new podcast, The Big Data Perspective. Be sure to subscribe to our blog to get updates as soon as they’re published!

For our first episode, we’re thrilled to have Tony Baer as our guest. Principal analyst at Ovum, blogger at ZDNet and all-around big data expert, Tony’s here to share his own view on the big data world. Listen to the podcast below, or skim through the transcript if you’d prefer.

Transcript, lightly edited for clarity:

Andrew Brust: Data startup fundraising. The technology hype cycle. Google’s new machine learning offering. What conclusions can we draw from recent events? Today we’re talking with Tony Baer, Ovum analyst and my fellow ZDNet blogger, about the big data climate and the forecast for the future. I’m Andrew Brust of Datameer, and this is the Big Data Perspective.

Tony, how are you?

Tony Baer: Not too bad. Thanks, Andrew, for having me for this reflection on big data.

Thoughts on Hadoop Versus Spark, or Hadoop Plus Spark

Andrew Brust: We’ve got a little time to ask a few questions, and see if we can get you to be clairvoyant about the year ahead. Let’s start, if we could, kinda referencing something that you wrote about back in October. The title of your piece was, “Have We Reached Peak Hadoop?” Your argument was that the question isn’t really about Spark versus Hadoop, although that came up in there in terms of what you were writing about, but more the cloud versus Hadoop. Clearly though, the Hadoop versus Spark debate isn’t dead. Given how many open-source products are being added to the Hadoop ecosystem, where do you think that’s going to take Hadoop versus Spark, or Hadoop plus Spark, maybe, into this new year?

Tony Baer: I’ve always thought, and I’ve been on record on this for at least the past 18 months, that Hadoop versus Spark really is a false dichotomy, in that one does not necessarily replace the other. It’s kinda almost like a Venn diagram, which is that Spark can run on Hadoop, or it can run on essentially the equivalent of what I would call bare metal, which would be just a simple cluster or a very lightweight cluster manager like Mesos.

Andrew Brust: It’s hard to talk about Venn diagrams when we’re audio only, but how much of an intersection between the two circles appears in this Venn diagram that’s in your mind’s eye?

Tony Baer: There’s a lot of intersection, that’s the whole point. That’s why I’m talking about a Venn diagram, as opposed to, let’s say, two types of circles that you’re seeing at an optometrist when you’re getting fitted for contacts. The reason for that is that Spark is a compute engine, and Hadoop is a data platform. That’s a huge difference there. Number one is that Spark is not meant for managing data, it’s meant for performing certain types of very complex compute on data, and it’s incredibly versatile, and it’s very clever and it’s very developer friendly.

It has just an awful lot of great things going for it. Lots of ecosystem support, a lot of excitement in the open source community, lots of contributors, just a rich library or rich portfolio of libraries of different functions that you can run. Spark’s pretty amazing, but it doesn’t manage data, it doesn’t persist data, and the fact is that it is lacking sometimes. For instance, you take a platform like Hadoop, it can run Spark, it can also perform interactive SQL, it can also perform search.

Spark is not the only workload out there. It’s a workload that’s basically very well suited for data science and data engineering, and for solving very complex problems, but it’s just part of a larger palette of options that you have for basically looking for answers in data. As I said, Spark versus Hadoop is a false dichotomy.

On the other hand, what I was saying was, what is this really boiling down to? It’s really becoming … I don’t want to call it Hadoop versus cloud, but it could be perceived as that, which is that Hadoop is a very complex platform. Yes, over the years it has become packaged to basically make it more of a whole, a single entity. But there are still lots of dials and knobs that you have to turn, and configurations. The idea is that if you go up on a cloud in a managed service, although it doesn’t put everything in a black box, it can put a lot more in the black box.

The other thing is that it provides a lot of other advantages which I would probably call agility. For example, someone might think, “Well, I don’t know if I’m gonna need a lot of compute nodes today, but I might need them tomorrow.”

The nice thing about cloud is you can just buy what you need when you need it, which is very different from setting up a cluster on premises for kind of more what-if types of scenarios. Therefore, the cloud gives you the chance to run a lot more workloads that you might not otherwise be able to do on premises, because it’s just not economical. Where I’m seeing Spark as a dedicated service is in the cloud. The reason for that is that all the management features that you would otherwise need, all the housekeeping features that you would need to run a Spark cluster and essentially reinvent the wheel, well, the cloud provider already provides that for you.

But sometimes all you want to do is run a Spark service, and there’s a lot of validation in doing that because it can be a great sandbox for data scientists. Remember, the essence of science is testing out lots of hypotheses, and testing, and testing, and testing. You don’t want to basically occupy an entire cluster with just a whole bunch of recursive tests. That’s basically what dedicated Spark services are great for.

On the other hand, if you still want to run diverse workloads and serve a broad constituency of end users, using Hadoop in the cloud is also, I see, becoming a growing trend. In fact, we forecast that within the next 24 months, over half of new workloads for Hadoop will be in the cloud. I would say it’s an even greater share than that for Spark today, but at least for Hadoop, I’d say by 2019, the majority of new workloads will be in the cloud.

Andrew Brust: All right. I think I’m even seeing that some of the distro companies are trying to make it as easy as possible to set up temporary clusters, even on premises. There’s this whole notion of being able to script up a cluster, run a workload and take it down. It’s almost like bringing the cloud back on premises. It definitely does seem like the notion of an ephemeral Hadoop or Spark cluster is there, and the as-a-service model seems to be pretty compelling.

How Big Data Startups Can Stay Relevant

Andrew Brust: Let’s change our focus, our gaze a little bit here, and talk about new companies and consolidation. There’s sort of both going on now. There’s been a lot of recent fundraising. Interana, Paxata had a new round. There’s a newer company from France called Dataiku. We’re seeing consolidation: Platfora was acquired not too long ago by Workday, and DataScale by DataStax. With all of that churn and volatility, what do you think it will take for big data companies, independent ones that aren’t part of the big four mega vendors, to stay relevant in the market?

Tony Baer: Well, number one, there will be a continuing trend towards consolidation. The market, in the long run, basically cannot support a thousand different companies doing very narrow functions. That being said, this is a continual process as new areas come to light, so we have new areas of innovation, we have startups. It’s not that niche companies are going to disappear, per se. What we’re seeing with the VCs is that there has been this ebb and flow. It’s natural to investment. And in fact, in 2015 we did not see as many new startups.

I think we’re starting to see some slight pickup in activity there now. But I think the dominant trend has been that VCs are doubling down on their investments, that they want to make sure that those companies that they have funded in early rounds are either going to find successful exits, or find exits like Platfora’s. Or that they get bulked up to become more viable, in other words, sharpening their product focus, sharpening their go-to-market, defining what they are, differentiating themselves more sharply.

I think for a company to survive this round, they really have to basically double down on who they are and who they serve and also start to look at acquiring some additional functions that are related. For instance, like the Paxatas of the world. Paxata’s done some amazing things to essentially scale this whole data prep process, it’s a unique differentiator for them. I think they’re kind of the exception that proves the rule, because we’ve seen otherwise a lot of commoditization of data prep, for instance, in the BI tools. Even Tableau, now, is introducing some data prep capabilities.

What has to happen for the Paxatas of the world, if they want to become the next-generation Informatica, and I mean Informatica as an example of an independent company that has remained viable, is that they have to broaden their focus. To not just do data prep, but to also start doing things like dedup, and to start doing forms of, like, emergent master data, so we can start to deduce what the golden records are. The advantage that the Paxatas of the world have at this point is that they have a chance to basically write the script anew, so that, unlike, let’s say, the traditional approach to data integration, which is very deterministic and top-down, they use machine learning so that this process becomes a lot more fluid, a lot more flexible.

Basically, I think the moral of the story here is that, as a startup, in the long run you can’t survive as a company being just a function. You have to basically have an area that you own, and that area may mean that you need to either acquire or develop some functional adjacency.

Andrew Brust: Makes total sense, and actually there was another recent acquisition that completely corroborates what you said, which is Syncsort’s acquisition of Trillium, so that they’re bringing data quality into their ETL and data prep. It’s almost verbatim what you just said. You also mentioned machine learning, which wittingly or unwittingly, was a great segue into my next question, which is about machine learning.

You’ve written that it’s going to become increasingly a default capability under the hood, rather than perhaps a standalone gee-whiz kinda thing, but meanwhile Google Cloud has their new machine learning offering, you’ve paid some attention to that. If we can get out of the world of all the data geeks, and just get into the world of mainstream business, and companies that use databases, and data technologies to get their job done, where do you see machine learning finally settling at some equilibrium into that world? You think it’ll stay in buzzword territory, or do you think we can finally normalize and productionalize this stuff?

Tony Baer: I think it’s going to be normalized and productionalized. It’s just going to become an assumed part of what we expect software, a business solution, to do. Essentially what it’s doing is it’s not replacing people, but it’s providing what I will call, let’s say, assist of assist. Just like how data prep and machine learning are helping business end users, in other words, not just DBAs or database architects, to figure out, “Okay, what data should go together?” The system gives them a helping hand.

There are just a lot of clear parallels in just very mundane business processes. “How should we best segment our customers? Who are the ones that we should be concentrating certain forms of promotional activity on? How should we optimize our supply chain? Where should we look for the most promising new sources of compounds for the next miracle pharmaceutical drug, cures for cancer?” for instance.

I think if you look at the practical world around us, and you look at the business problems that we deal with every day … I would dare you or anybody to name any issues, or any business problems that we deal with, that would not be able to benefit from some form of machine learning or artificial intelligence. What’s kind of interesting is that to some extent, this reality is already here. A firm called Narrative Science did a survey. They basically surveyed businesses, like, “Are you using machine learning, artificial intelligence?” Of course, most said no, and then they said, “Well, by the way, are you using certain types of routines that help to do, let’s say, next best offer?” The answers were invariably yes.

I think in many cases, the future’s already here, it’s just going to become more prevalent.

Andrew Brust: Maybe people are even using AI and data mining technology perhaps without even realizing it, because at least in some tech verticals like digital marketing, or eCommerce, it’s already built in to the point where it’s such an embedded feature people may not even discern it as machine learning, per se.

What Interests People About Big Data Right Now?

Andrew Brust: All right, so, again we’re segueing well, because we’re talking about what actual customers actually want, and I’d like to drill down on that a little bit, since you’re an analyst in this field, and you’re not just working with vendors, you’re working with customers. What’s the State of the Union, as it were, in terms of whether customers are still trying to figure out big data conceptually, or if they’ve got that, and they’ve now moved on to more specific criteria for what they need, and what’s going to give them ROI? Where along the maturity cycle are customers at this point, and what do you think in 2017 they’re going to demand?

Tony Baer: Well, the answer’s going to be a big it-depends question. I think it depends on which region of the world you’re talking about. I was just having a discussion, actually, with a good colleague of mine, you probably know Mark Madsen, a consultant who’s also very heavily involved with the Strata conferences. He just came back from Strata in Singapore, and I was comparing notes with him. He basically said that he really felt that they were still trying to figure it out over there.

Andrew Brust: Yeah, I’ve heard the same.

Tony Baer: I would say that here, where we’re at, is that I think big data is starting to get into that part of the continuum where we’re just thinking about data now, and I think that’s a good thing.

Andrew Brust: Are we dropping the word big, is that what you’re saying, it’s not big data anymore, it’s just data?

Tony Baer: In terms of what organizations need, businesses need, it’s going to be data. I think businesses are still calling it big data today because there still is a new shiny thing associated with it. I think the realization is that we’re not just working with traditional structured data, we’re working with a number of variable structures.

Sure, IoT has become a buzzword, and it’s a Wild West out there, but name me, let’s say, a logistics company that’s not using data from its GPS devices, or sensors in the engines of its trucks, to figure out, one, “Where should we route the trucks?” And by the way, “How’s our gas mileage?” Or, “When do we take this truck in for maintenance?” Basically, I think IoT has in that sense supplanted big data in terms of, “Here’s the new shiny thing that we need.” I think it reinforces that what we need to do is use data.

Andrew Brust: Absolutely. What’s interesting, also, is that you keep stumbling on things that are new and timely. Pentaho recently announced that they’ve got connectivity to the MQTT protocol, which is one of the two dominant protocols in the IoT world. And maybe, at this point, it’s emerging as the dominant protocol. Again, you’re hitting on all the buzzy things. IoT definitely seems to be driving a lot of big data demand.

Big Data and the European Market

Andrew Brust: All right, I think you kind of talked about Asia Pacific, and you talked about North America. Do you have a different sense of what’s going on in Europe, especially since Ovum has its headquarters there? Are they closer to us, closer to Asia? I don’t mean geographically, I mean in terms of their big data maturity, or are they right at some midpoint?

Tony Baer: That sort of reminds me of an answer to a question I asked the gentleman who was sitting next to me on my flight from SFO to Newark yesterday. He was on the first leg of a trip that was going, not only to Mumbai, but to some other city in India. I was asking him, “Well, coming from the West Coast, which is the faster or the closer way to get to India?” He said, “From the West Coast, it just doesn’t matter, it’s far both ways.” In a way, I think that answer kind of applies to where Europe is with regard to big data compared to Asia or North America. There are certain areas, obviously areas around London, Amsterdam, actually, surprisingly, Berlin, Paris, and I think even up in Stockholm, where you really do see a lot of hotbeds of innovation. If you just look at the meetup group activity there, it’s very strong.

Are they at the same place where we’re at here? Obviously not, and I’m sure there are gonna be some folks who get very angry when I mention this, but you take a look at the Flink groups, and basically Flink started as a project based in Berlin. But for fortune, but for time, we could be talking about Flink instead of Spark, but Spark was probably about two years ahead. I think that kind of sums up essentially where Europe is. It’s not to say that Europe is behind in innovation, but I think in general, North America still tends to be where things happen first. However, don’t discount what’s coming out of Europe.

Andrew Brust: Of course, Tableau bought a German company, which by all signs is what’s enabling them to add the data prep capabilities that you mentioned earlier.

Tony Baer: Right.

Andrew Brust: Listen, we’re kind of heading towards the end of our time, which means you have an awesome opportunity to tell us what you think in general’s gonna happen this year, and into the future. My colleagues at Datameer were hoping you could forecast a decade into the future. I think if we could get half of that we’d be lucky.

Tony Baer: Oh man.

Andrew Brust: Tell us what you see in the near term, and then in the next three, five years, or even more.

Predictions on the Future of Big Data

Tony Baer: Well, okay. I don’t have a great track record in trying to predict the more distant future. It wasn’t that long ago, we were talking about SOA and web services. Given that, are you going to listen to something coming out of my lips there? Given all those qualifications …

Number one, short term, I think, we’ve gone on record in our research, in our Trends to Watch report for 2017, and it is available, you can get a free copy, it’s outside the paywall. Machine learning’s clearly the biggest disruptor to big data analytics in 2017, mainly because that’s where we see the big upshot of activity. It’s just an explosion, basically, of libraries and algorithms, and they’re becoming more accessible through these cloud services.

Another thing that I also see, is in the jobs world. Yesterday, going from Santa Clara to SFO airport, my Uber driver was a computer science college student down in Santa Clara, actually, and he was tapping me for advice. Actually, one of the pieces of research that I had picked up on in putting together this 2017 report, is that we keep talking about demand for data engineers. If you go on, and you look at demand for data engineers versus data scientists, over the last four years you actually see that the demand for data scientists seems to be relatively flat, which is kind of surprising. But there’s a huge demand for data engineers, who are the folks who basically provision those clusters, figure out how to lay out data, and figure out also what types of models and algorithms will work, because you can be a data scientist and come up with a great algorithm, but is it going to actually work the way the data is laid out? Data engineers are going to provide the Scotch Tape, to make this all happen.

I told this guy, I said, “Hey, go into data engineering. Number one, it’s not as sexy a title, but it’s really needed, and by the way, becoming a data engineer does not preclude you from becoming a data scientist.” If you have that creative spark in you, or I should say pun intended, a data engineer could become one. The thing is that I see a continuing demand for data engineers. I see, also, demand for tooling that connects, and solutions that connect data engineers to data scientists, and in turn, for data scientists to be able to collaborate more with the business so that these great ideas don’t stay bottled up in their heads.

I also see that IoT is pushing real-time streaming analytics to the front burner. I say that as someone who came out of that whole middleware space, the poor schlump who basically got stuck covering complex event processing, which was basically a solution looking for a problem, or a technology looking for a problem. IoT has really pushed that to the forefront, in that, even though it may be useful to analyze this data historically, the big value is basically dealing with it in real time.

As I mentioned before, I think in coming years, the cloud is going to be sharpening what I call Hadoop-Spark coopetition, which we talked about before.

When we look in the longer term, and this is where basically my track record is pretty pathetic, I would say that machine learning and artificial intelligence are going to become more embedded in business solutions. And yes, deep learning will eventually get its day. I have a colleague at Ovum who’s very much into deep learning, and I keep telling him, “Michael, you’re at least five to seven years away, the stuff you’re talking about is at least five to seven years away.” I say that basically at some point, deep learning is going to also, in turn, become more embedded in a number of types of solutions that we are using. I think right now we still need to think about this whole human-machine interface. I think there’re a lot of deeper social issues that need to be dealt with before we really mainstream deep learning.

Obviously, within the next decade, cloud is just gonna kill. I don’t mean kill in terms of slay, I mean kill in terms of, it’s just going to become the default model for deployment for big data, and probably for business solutions in general. And I think that will happen maybe for one reason, and one reason alone, which is that threats are just mutating like crazy. The only ones that are going to be really capable of keeping up with that are the ones that do infrastructure for a living, and it’s not going to be enterprises. They’re not going to win at that cat-and-mouse game.

As I said, basically looking out, say five years, definitely more sort of embedding of machine learning, and definitely cloud becoming the default option. Those are the two I would probably pin myself down for. Yes, there are lots of prognostications about the Internet of Things, I’m not even going to touch that.

Andrew Brust: All right, that leaves us both with food for thought. I thank you for that, Tony, and we thank you for your time. This has been a great conversation.

Tony Baer: It’s been enjoyable, thanks for having me.
