
Datameer Blog

Big Data & Brews: On Open-source Big Data Development Tools

By Stefan Groschupf on May 13, 2014

In the last episode, Ron Bodkin talked about his history and role at Think Big Analytics. This week, we talked about his belief that “fundamentally open source is the foundation of the platforms that will be successful in big data,” and that “…a lot of what’s exciting about big data is the economics of open source.”

We talked about YARN, Spark, Kafka, Solr, Flume, and much more. Enjoy:


Stefan:           Welcome back to Big Data & Brews.

We have Ron Bodkin from Think Big Analytics. Cheers.

Ron:                Cheers.

Stefan:           We talked about your history and what Think Big is doing but let’s talk a little bit about technology.

Ron:                Sure.

Stefan:           That’s your area of expertise. As you build those big data applications, what kind of tools are you guys using?

Ron:                I think there's a couple of things. One is that we're big believers that fundamentally open source is the foundation of the platforms that will be successful in big data, that a lot of what's exciting about big data is the economics of open source. Hadoop is an open source ecosystem, and a thriving, successful one. If you didn't believe that before, the recent announcement of a billion dollars of funding in leading Hadoop companies is a pretty strong statement that Hadoop is going to be the center of an important open source ecosystem.

We kind of look at it as there being two very different parts of the big data applications stack. There's the analytic core, which is centered around Hadoop but has interesting pieces in orbit around it, and then you've got more of the real-time architecture, which is edge serving, web serving, responding to events. There you'll typically have a pattern where, at scale, you've got large numbers of data centers distributed around the globe, so you need technology you can run locally in each of those.

Stefan:           Then you model? [2:24]

Ron:                You'll typically have various forms of … This is where eventually consistent NoSQL databases like Cassandra come into their strength, with eventual consistency and models for replicating the data. There are interesting patterns and interplays between the analytic core and those systems of engagement at the edge. I think it's a little more of a shooting match as to what's going to win in the market on the edge, with Cassandra and MongoDB as a couple of the leading NoSQL databases.
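The eventual consistency and replication models mentioned here can be sketched in miniature. This is a toy model, not Cassandra's actual API — all names are illustrative: each replica keeps timestamped values, reads may briefly disagree, and last-write-wins reconciliation converges them.

```python
class Replica:
    """A toy replica that stores a (timestamp, value) pair per key."""
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        # Last-write-wins: keep whichever write has the newer timestamp.
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

def anti_entropy(replicas):
    """Gossip every key between replicas until all hold the newest write."""
    for key in {k for r in replicas for k in r.data}:
        newest = max(r.data[key] for r in replicas if key in r.data)
        for r in replicas:
            r.write(key, newest[1], newest[0])

# Writes land on different replicas at different times...
a, b = Replica(), Replica()
a.write("profile:42", "v1", ts=1)
b.write("profile:42", "v2", ts=2)
# ...so reads may disagree until replication catches up.
stale = a.read("profile:42")      # still the old value on replica a
anti_entropy([a, b])
converged = a.read("profile:42")  # newest value everywhere after repair
```

The point of the sketch is the trade Ron describes: readers may see stale data for a while, but every replica converges without coordination on the write path.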

Whereas, I think the Hadoop ecosystem is absolutely the winner, and anything that's successful in the analytic core gets pulled into Hadoop. A great example of that would be the Berkeley stack: the folks at UC Berkeley created some innovative technologies like Spark, a lower latency, better caching approach for doing MapReduce-type computations. It's great for machine learning. They raised $14 million from Andreessen Horowitz to start Databricks, but they're now developing relationships with all the major Hadoop distributions. It's being included. It's being wrapped up with YARN.

It's a great example of something being pulled into what's considered Hadoop. It's innovative, so the Hadoop of today and the Hadoop of next year are very different from the Nutch-era Hadoop we were talking about earlier, but it's still a community evolving and building its capabilities. [2:56]

Stefan:           You talked a little bit about Mongo and Cassandra. Are you guys using both of them? [3:51]

Ron:                We do.

Stefan:           Where's kind of the line — this use case points more in this direction, these kinds of requirements more in that direction? [3:58]

Ron:                We use a number of NoSQL databases. We use a lot of HBase, more typically in analytic use cases than for that edge serving deployed out to many data centers. I'd say that we've done a lot more with Cassandra because we tend to focus on big data applications, which have large scale, and we found that it's a more reliable, better-scaling system.

The architecture of Mongo makes it probably better suited for more department-level NoSQL applications, so we see it as more promising in scenarios where someone's doing something like a MySQL-type application. MongoDB has a very compelling API for developers and it's really easy to get up and running. But we don't see nearly as many people scaling it up to big data proportions, in large part because it's tricky to get the memory management right — the way it allocates objects doesn't make it easy to get good caching locality. Typically, when people are using Mongo in big data applications, they have to throw Memcached or the equivalent in front of it.
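"Throwing Memcached in front of it" is the classic cache-aside read path. A minimal sketch, with plain dicts standing in for Memcached and the database (all names here are illustrative):

```python
cache = {}                               # stands in for Memcached
database = {"user:1": {"name": "Ada"}}   # stands in for MongoDB
misses = 0                               # how often we had to hit the database

def get(key):
    """Cache-aside read: check the cache, fall back to the DB, populate on miss."""
    global misses
    if key in cache:
        return cache[key]
    misses += 1
    value = database.get(key)
    if value is not None:
        cache[key] = value  # warm the cache for the next reader
    return value

first = get("user:1")   # miss: hits the database and warms the cache
second = get("user:1")  # hit: served from the cache
```

The design cost Ron is pointing at is exactly this extra tier: a second system to deploy, keep consistent on writes, and operate, which databases with good built-in caching let you skip.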

Stefan:           Is Memcached kind of the standard caching thing even in front of Cassandra or others? [5:09]

Ron:                A lot more … We'll see people that are using Cassandra and relying on it to do a good job of caching without having to build a separate caching tier. We don't see it nearly as commonly that people build a separate caching tier in front of a Cassandra database as they do in front of Mongo or a traditional relational database.

Stefan:           What I'm hearing is computation, enrichment, cleaning the data, maybe even building some predictions, done in Hadoop or related technologies, and then the results pushed into something like Cassandra that serves their websites or an app?

Ron:                It's a great question. We kind of think of it this way: in the real-time edge system you can be using any number of response engines — it could be old-fashioned Apache, or Node, or Nginx. There's a bunch of options for the serving layer here, but behind that you'll typically have a little bit of relational for ACID consistency. Behind these systems there's usually some high value data you don't want to lose — committed transactions, orders … Then you'll have –

Stefan:           Just so I understand this correctly, this will be your online store, right? [6:30]

Ron:                You make a purchase. A transaction gets committed here before you render the page. NoSQL is being used to scale out things here like profiles and recommendations, and the interplay is that you'll have machine-learned models being held here that say something about the user's interests — or, in response to this incoming device event, whether this is likely to be a failure I need to escalate: we need to replace this thing right away.

Usually we'll have a hybrid where you'll have a simplified ability to update model scoring on the edge. You'll have some code running — in the simplest form just running inside your serving engine — that reads the state from NoSQL, updates the model and then writes an update out here. The event stream also gets pushed back out to your Hadoop cluster, where you're doing the long-term computation.
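That edge-scoring loop — read state from NoSQL, fold the event into the score, write it back — is, in its simplest form, an incremental update like this. The store, key names, and exponentially weighted update rule are all illustrative assumptions, not a specific customer design:

```python
nosql = {"user:7": {"affinity": 0.2}}  # dict standing in for a Cassandra/HBase profile row

def score_event(user_key, event_weight, alpha=0.3):
    """Read the profile, blend the new event into the score, write it back.

    An exponentially weighted update is simple enough to run inline in the
    serving engine, while the full model gets refreshed in batch on Hadoop.
    """
    profile = nosql.get(user_key, {"affinity": 0.0})
    profile["affinity"] = (1 - alpha) * profile["affinity"] + alpha * event_weight
    nosql[user_key] = profile
    return profile["affinity"]

# The user just viewed a highly relevant page: nudge the score toward 1.0.
updated = score_event("user:7", event_weight=1.0)
```

The split Ron describes follows from this: the edge only needs cheap per-event arithmetic, while the heavy recomputation of the underlying model happens offline.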

These models get refreshed. I think it was Nathan Marz's Lambda Architecture that taught me about that: computed views. You're computing views here, and ideally these NoSQL databases will have a short window of data that's computed transiently in real time — they just looked at this page, so I need to immediately update my model — but the long-term propensity is being refreshed from a consistent view in the batch world. Nathan would have two versions of NoSQL: a scratch read-write NoSQL and a read-only version that gets bulk exports. In practice we just don't see a lot of people investing in a separate NoSQL database. They use one, they rely on it, and it handles both purposes.
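The merged-view idea described here — a batch-computed view plus a short real-time window, combined at read time — can be sketched as follows. Page names and counts are made up for illustration:

```python
from collections import Counter

# Batch layer: counts recomputed from the master dataset (complete but stale).
batch_view = Counter({"page/a": 100, "page/b": 40})

# Speed layer: counts for events that arrived after the last batch run.
realtime_view = Counter()

def record_event(page):
    """Speed layer: absorb an event the batch view hasn't seen yet."""
    realtime_view[page] += 1

def query(page):
    """Serving layer: merge batch and real-time views at read time."""
    return batch_view[page] + realtime_view[page]

record_event("page/a")
record_event("page/a")
total = query("page/a")  # batch count plus the two fresh events
```

When the next batch run completes, the batch view absorbs those events and the real-time window is discarded — which is why the batch side can stay a simple, consistent recomputation.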

There are cases where people want to do things like … sometimes people have search applications, so Elasticsearch, or Solr, or in the commercial world Splunk, as engines for doing distributed search. That's typically a different access pattern that most of the NoSQL databases don't handle very well. Sometimes people will try to do that in Mongo because it's a little better at it, but more often you have a separate search engine — and that's your background, so I know you know that space really well.

Far be it from me to lecture you on search engines. But the other thing that you'll see is sometimes people doing streaming computations. What we see is that you can get pretty far with simple logic distributed across a farm of web servers, reading and writing state from NoSQL, before you need streaming engines like Storm. You're typically looking at those when you've got computations that are sufficiently complex that you can't just do them quickly in a single process — you really have to scale them across the network. [8:23]

They have to be more advanced — you tend to have things like more sophisticated models. We've seen it in two places. We've seen it at web-scale startups: Quantcast adopted streaming with Storm to make it easy to have live updates to profiles as you're looking at a page, and some of the originators of the technology like LinkedIn. We've done this for AdTech customers that wanted real-time updates to counters of what's going on in every page or every campaign. And then on Wall Street, where people had streaming for market tracking. So that's on the edge.

In the analytic core there's a range of things. Typically people will, of course, start by ingesting the data. You might use Kafka or Flume or any of those kinds of technologies.

Stefan:           Are you a big Kafka fan or a Flume fan? [10:04]

Ron:                I'd say that we think Kafka has a very elegant architecture as well as design, so a solution designed around Kafka is much more reliable — you don't have nearly the same concerns around late-arriving data. However, we still see that because commercial support in all the distributions is predominantly for Flume, many companies out there aren't interested in relying on community support for a technology like Kafka. For them the occasional delay in data arrival, or data loss, from Flume is acceptable, and so they prefer having commercial support to having the elegant architecture.
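Part of the reliability Ron attributes to Kafka comes from its commit-log design: messages are appended and retained, and consumers track their own offsets, so a late or restarted consumer can replay rather than lose data. A toy in-memory version of that idea — not Kafka's actual API, just the shape of it:

```python
class CommitLog:
    """Append-only log: messages are only ever appended, never removed on read."""
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1  # the offset assigned to this message

    def read_from(self, offset):
        """Consumers pull from any offset they choose, so a crashed
        consumer resumes from its last committed offset without loss."""
        return self.messages[offset:]

log = CommitLog()
for event in ["click", "view", "purchase"]:
    log.append(event)

# A consumer that had committed offset 1 before crashing replays the rest.
resumed = log.read_from(1)
```

Contrast this with a push-based pipe: once the broker controls delivery and discards on send, a slow or dead consumer has nothing to replay from.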

Stefan:           Let me think about … Let me make sure I heard what you just said. I'm okay to lose data if I can pay money to get commercial support, over not losing data but not having commercial support. [10:55]

Ron:                Something like that.

Stefan:           Just checking.

Ron:                That’s right, but you know –

Stefan:           Great Salesforce. Congratulations.

Ron:                We don’t sell either.

Stefan:           It’s fine.

Ron:                But you know what I'd say is that most of the time, when people are putting together big data architectures and streaming events back, they're building systems where they don't need to capture every last event. If it's 99.9% accurate, you're able to build models and respond, and it meets their needs — whereas losing data for high value things is not acceptable, that –

Stefan:           Right, that's why you have your ACID databases, right? [11:39]

Ron:                Your ACID databases — and if you have a high value, low volume stream, you can use more traditional messaging approaches that are guaranteed: once it's committed it won't lose data. There are approaches for that super high value data. It tends to be smaller in scale.

Stefan:           Usually.

Ron:                That's the case. Kafka is more elegant, and we like it, but most customers we find are favoring commercial support. I guess the argument for why they want commercial support is that they don't want data loss through their own mistakes and problems operating something like Kafka, with nowhere to turn.

Stefan:           You pre-process recommendations, maybe for every single person — I guess that's what the talk about views was about.

Ron:                Yes, the models… you would compute.

Stefan:           Then you put them into something like Cassandra, and you show recommendations here, and you mix this all up with the data stream feedback, and maybe some search, and a kind of traditional relational database — like a grand e-commerce store. Sounds like you've built this quite a few times; do you have some numbers on what kind of sales uplift you can get with an architecture like this, a recommendation engine? I guess it's one of the classic use cases. [12:54]

Ron:                Yes — recommendation engines, or predicting failure from device data. I'd say we've seen great metrics in different use cases, but it's not a case of saying, "Well, you know, we've done 20 e-commerce sites of this type and this is exactly what you get," right? We have done look-alike models for mobile advertising. We've done commerce recommendations. We've done recommendations looking at whether, when an event comes in for a device, it might break or not.

The other thing worth noting is that this represents a more sophisticated architecture — this loop between the offline and online data. Many enterprises, rather than this, are starting with analytics use cases rather than closed-loop model driving. They're ingesting data from their online systems, but also from various high value relational systems that may be fed by an online system somewhere along the way, but they don't have a closed loop. They're taking that raw data, doing the science to explore it, building views to make it easier to access the data — organizing it, de-normalizing it, building summary aggregates. Traditionally, the next step was to dump that into a relational database — a classic Teradata or Exadata — that users hit with MicroStrategy or whatever the BI tool. What we're starting to see instead is that there are alternatives. We've built an accelerator to make it easy to push this into HBase, so you can have pre-computed aggregates and an API to quickly get at them from a JavaScript dashboard, so people have deep drill-down and low latency access without having to deal with a relational database.
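The pre-computed-aggregates pattern described here — a batch job writes rollups keyed for point lookups, a thin API serves them to a dashboard — can be sketched with a dict keyed by dimension tuples standing in for the keyed store. The fields and figures are invented for illustration:

```python
from collections import defaultdict

# Raw fact rows, as a batch job would read them from HDFS.
events = [
    {"country": "US", "day": "2014-05-01", "revenue": 10.0},
    {"country": "US", "day": "2014-05-01", "revenue": 5.0},
    {"country": "DE", "day": "2014-05-01", "revenue": 7.5},
]

# Batch step: precompute aggregates keyed by (country, day), analogous to
# writing rows under a composite rowkey in a store like HBase.
aggregates = defaultdict(float)
for e in events:
    aggregates[(e["country"], e["day"])] += e["revenue"]

def lookup(country, day):
    """Serving step: a single keyed get — no scan, no SQL engine in the path."""
    return aggregates.get((country, day), 0.0)

us_total = lookup("US", "2014-05-01")
```

The latency win comes from moving all the aggregation work to batch time, so the dashboard's read path is a constant-time key lookup.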

The other thing people are doing is they’re starting to play with these emerging SQL engines. I know you love SQL, so you’ll appreciate this.

Stefan:           I do.

Ron:                But roughly, tools like Impala or Presto are trying to take a lot of what people have done in relational and say, "Well, look, we're going to graft an MPP database on top of Hadoop." There is a tension in that a lot of energy goes into trying to make the old relational world work on top of the new world.

Stefan:           Sure, and then it comes back to what you said before. You send someone to a training and it's like, "Oh, that looks like the old stuff." [15:31]

Ron:                Right.

Stefan:           Maybe.

Ron:                For sure. I mean, people are very interested in that, but it's a disruptive technology, and like most disruptive technologies the best uses are solving problems you couldn't solve the old way, not taking the old approach and running it in the new world. I think there's more to analytics in Hadoop than grafting an MPP database on top of HDFS.

Stefan:           Which is the stupidest thing you can ever do, right? It's a streaming file system.

Anyhow, we've had a lot of beers and a lot of conversations about this. Let's double-click on this area, on Hadoop. What kind of technologies are you guys using there? Do you just go straight back to writing MapReduce jobs, or … What's your favorite higher-level abstraction — Pig, Hive, Cascading? Maybe help us understand the difference: where is one more favorable? [16:21]

Ron:                Sure. Definitely, I think one of the strengths of Hadoop is that it's a polyglot environment — you can use a variety of tools. We love HCatalog as a way of having a common set of abstractions so that you can use different tools for different purposes. We do a lot with the higher level languages, Hive and Pig, often writing jobs with UDFs. We find those to be very productive and easily maintainable and, with HCatalog, good abstractions.

Stefan:           But you can't write a UDF for Pig that you'll reuse in Hive, can you?

Ron:                No.

Stefan:           So you’re going to use –

Ron:                You write a different UDF for different purposes. You write some common code and wrap it differently in the two. Beyond those two we definitely do some straight MapReduce, some Cascading; we've done little bits of other languages like Scalding. There's a variety of options there. Part of it is we tend to select the technologies that are maintainable and useful for our customers.

On the data science side we do a lot with Python streaming and, for prototyping, RHadoop, when we're doing distributed data science. We think Spark is going to be interesting. We're starting to see a lot of interest, and it's starting to be feasible. Traditionally, before YARN, running Spark meant running yet another cluster, but with YARN, being able to actually run it on the same cluster is going to be interesting.
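"Python streaming" here refers to Hadoop Streaming, where the mapper and reducer are plain scripts reading stdin and writing stdout. The core logic can be expressed as two small functions — a word-count shape, with the stdin/stdout plumbing omitted so the sketch is self-contained:

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair per token, as a streaming
    mapper would print tab-separated key-value lines to stdout."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    """Reduce phase: sum counts per key. Hadoop delivers the pairs
    grouped by key; sorting here simulates its shuffle step."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

counts = dict(reducer(mapper(["big data big brews", "big data"])))
```

The appeal for data science work is exactly this: any language that can read and write text lines plugs into the same map/shuffle/reduce contract, no JVM code required.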

Stefan:           YARN. What do you think about YARN versus MapReduce [18:00]?

Ron:                I wrote an article on InfoQ when YARN was announced a few years ago, asking why we need YARN. As a matter of fact, I think having a resource manager for Hadoop is critical, and I think the community has lined up around YARN in a fundamental way, so that while there are nice elements to the way the alternatives were designed, they're not the standard. I think YARN is going to provide tremendous value: you really did need to go beyond straight MapReduce as the only approach for writing distributed applications for a rich analytic system like Hadoop with a distributed file system.

We're going to see YARN now in all the mainstream distributions. We really think it will be dominant by early next year — that people will be running YARN in production for a range of applications.

Stefan:           From an outlook perspective, what is the technology you’re most excited or what’s the technology that’s still missing? [18:57]

Ron:                Good question. I think there's such a range of areas to expand. It's still very nascent to do some of the more advanced analytics for distributed data science in Hadoop, and it's far too complicated. The typical pattern for most problems is that people use a tool like Hive or Pig to pull, out of a massive data set, a small enough chunk of data that they can work with on a single machine in R or Python or what have you, or maybe SAS. Then, if they have to scale it up to a large number of machines, it's a significant custom effort to translate that into a scalable algorithm.
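That sample-then-model workflow looks roughly like this in Python, with a deterministic hash-based sample standing in for the Hive or Pig extraction query (the dataset, sampling rate, and "model" are all invented for illustration):

```python
import hashlib

def in_sample(user_id, rate=0.1):
    """Deterministic ~10% sample by hashing the key — the same idea as a
    HiveQL WHERE clause over a hash of the user id, so reruns pick the
    same users."""
    h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    return (h % 100) < rate * 100

# "Massive" dataset: (user_id, spend) rows; keep only the sampled slice.
rows = [(uid, uid * 1.5) for uid in range(1000)]
sample = [(uid, spend) for uid, spend in rows if in_sample(uid)]

# Single-machine modeling step: just a mean here, where R/Python/SAS
# would fit an actual model on the sampled chunk.
avg_spend = sum(spend for _, spend in sample) / len(sample)
```

The scaling pain Ron describes starts when this last step must leave one machine: the sampling query parallelizes trivially, but the modeling code usually has to be redesigned as a distributed algorithm.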

The emergence of scalable machine learning for Hadoop is going to be really exciting, but that also gets back to … We see that there are stages of adoption: customers usually start their big data journey with cost-cutting and scale. They shift ETL or mainframe workloads, or their old system isn't scaling, and rather than buying yet another legacy system they move to Hadoop.

The next thing we see is agile analytics, where it's really about being able to start to explore data in a raw format and ask questions — without investing all the traditional waterfall effort to plumb a field through to a BI tool — and start to understand it.

Stefan:           That’s powered by Datameer, right?

Ron:                Absolutely. That's why they buy Datameer, and that's where they start to invest in data science skills — to have people come work with the data, handle more complex formats, get insight out of it. Then we start to see more of an optimization stage, where they're starting to use predictive analytics to drive closed loops: start to make better business decisions, drive automated responses like recommendation engines. Then finally, it's really transformation — becoming a data-driven organization, really relying on data and experimentation, with analytics as a core part of how they operate their business.

Most enterprises are somewhere here and here, in the proof of concept stage. It's when you get here that you start to get into really significant data science and distributed machine learning needs. The pioneers, the web-scale companies — Quantcast, and LinkedIn, and Google (Google doesn't use Hadoop, but they're in that stage) — have really transformed themselves through big data. The companies at this stage and later tend to have a real appetite for the data science tools, but it's not mainstream yet.

I think right around here you start to see a lot more appetite for blending some of the capabilities that are good about data governance and meta-data management in a traditional database world with what’s good about an agile, less structured world in Hadoop. We don’t think anybody is working on that yet. There was one startup that started working on that and discovered that they didn’t think there was enough of a market to build their business, so they pivoted and they’re now working on making it possible to take Legacy databases and have them run inside of Hadoop.

I don't know if that's a great business model either. There are certainly more customers that would buy it if you could sell it, but that's a little bit like saying we're going to sell faster horses for carts rather than compete against cars, right? [22:08]

Stefan:           Great. Thank you so much for coming by and thanks for the beer.

Ron:                My pleasure. Cheers.

Stefan:           Cheers.


Stefan Groschupf

Stefan Groschupf is a big data veteran and serial entrepreneur with strong roots in the open source community. He was one of the very few early contributors to Nutch, the open source project that spun out Hadoop, which, 10 years later, is considered a 20 billion dollar business. Open source technologies designed and coded by Stefan can be found running in all 20 of the Fortune 20 companies in the world, and innovative open source technologies like Kafka, Storm, Katta and Spark all rely on technology Stefan designed more than half a decade ago. In 2003, Groschupf was named one of the most innovative Germans under 30 by Stern Magazine. In 2013, Fast Company named Datameer one of the most innovative companies in the world. Stefan is currently CEO and Chairman of Datameer, the company he co-founded in 2009 after several years of architecting and implementing distributed big data analytic systems for companies like Apple, EMI Music, Hoffmann-La Roche, AT&T, the European Union, and others. After two years in the market, Datameer was commercially deployed in more than 30 percent of the Fortune 20. Stefan is a frequent conference speaker, contributor to industry publications and books, holds patents and advises a set of startups on product, scale and operations. When not working, Stefan is backpacking, sea kayaking, kite boarding or mountain biking. He lives in San Francisco, California.