Big Data & Brews: Concurrent and Cascading
This was a really exciting Big Data & Brews for me because I’m such a fan of Cascading. This week I talked to Concurrent’s VP of Field Engineering, Supreet Oberoi about the back-end of the technology over an IPA (which was pretty good if I do say so myself) and some great use-cases.
Stefan: Welcome to Big Data and Brews with my guest today, Supreet Oberoi. Please introduce yourself and the brew you brought.
Supreet: Okay. My name is Supreet Oberoi. I am the vice president of field engineering at Concurrent. Prior to that I was the big data technical evangelist for American Express, where I helped develop the application infrastructure for the Hadoop environment. I developed use cases that we will be talking about.
The brew that I brought is one of my favorite beers, an India Pale Ale. Clearly I couldn’t wait for Stefan to come here soon enough. [0:50]
Stefan: Okay, well, let’s get going. If that’s okay I’ll just take one. Let’s see, IPA. I like the darker…Doggone good. Thank you very much. It’s nice and cold. Cheers. Oh, that’s good. It’s not too hoppy.
Supreet: It’s hoppy.
Stefan: A little bit. But it’s not super-crazy. It’s not like I’m chewing on hops or something.
Stefan: It has a little kick to it. Maybe not like lime, but a little high flavor.
Supreet: It does have a little bit of an aftertaste. I feel like I’m talking about wine. But that’s great.
Stefan: They actually put beer in your bottle when they raise you as a little child in Germany. It’s actually true. In Bavaria there’s this thing called “Bier Nuckel,” what it’s called in Germany. What they basically do with the little kids — not any more and don’t try it at home — is they take a glove and put a little bit of beer in it. And then they let the little kids suck on it, and they fall asleep, obviously. But they used to do that a long time ago just as a way to, when the kids scream, to get them to chill and let them sleep.
Stefan: I think it’s prohibited now. (Laughs)
Stefan: So don’t try this at home. It’s called a “Bier Nuckel” and it’s a Bavarian thing from a few hundred years ago.
Supreet: It’s tradition.
Stefan: What random information I have! Let’s talk about big data a little bit.
Stefan: What does Concurrent do?
Supreet: Concurrent is the company behind Cascading, which is a middleware meant for Java developers who are developing data driven applications. Their aim is instead of the Java developers learning how MapReduce works, and writing the application focused on those constructs, they focus it on things that matter most to them such as where is the data, what kind of filter joins functions that you can do on that, and provide an abstraction from the storage of the analytics below that. The advantages are that it not only improves the developer productivity, but it also makes the applications future-proof in the sense that if today it is MapReduce, and tomorrow it stays just like any middleware, we have the ability to replace the mechanisms to whatever fits the users’ cases the best.
In addition to that we also produce a commercial product called Driven. Driven is an application that provides visibility into your application. It improves develop per productivity. Down the road it’s going to help improve operational visibility into how applications are developing. And it provides some governance and compliance capabilities as well.
Stefan: Cascading has been around, I would say, forever in big data. It has been in the big data world forever. It was there before Pig, before Hive, before all of the other stuff. You guys seem to have a really big user community. What are some of the highlights of people using Cascading and what are the use cases? [4:32]
Supreet: Sure. That’s the fantastic thing I’m discovering about Cascading. The technology spans verticals. It is being used in pharma, in consumer such as AirBnB, Twitter, and Etsy. It is being used or adopted in financial services as well.
Typically the use cases come when you need to develop complex, data-driven applications, and those applications have to be taken into production scenarios. So the needs that are required during long-string data exploration, data visualization, and during ad hoc analysis, are very different than the needs when you’re taking big data applications to production with the SLAs, doing capacity planning or being able to do root cause analysis for job setup of the production. That’s where it really shines in its value.
The other place that it really shines is the domain specific languages that are being developed on top of that. There’s been a session on Cascalog…
Stefan: Let’s double-click on this. I’m not sure everybody really knows about it. First of all, the way I would describe Cascading at a techie cocktail party is it’s the Hibernate for Hadoop.
Supreet: It is exactly. It is the Hibernate for Hadoop.
Stefan: So you just have core objects and you put them together in the data processing pipeline, and you have sources and things, and it’s all really cool reusable objects. You don’t even care about all the stuff that’s going on underneath. What is this domain-specific language thing on top of that? Maybe you can why people are doing that, how it is integrated, and what are the benefits? [6:35]
Supreet: Sure. By the way, that was a great analogy and I’m going to use it.
Stefan: Chris would totally slap my fingers for undervaluing Cascading and calling it the Hibernate for big data. Actually, it’s Hibernate-plus for big data. What I really love about Cascading is this whole concept — let’s be honest. Hadoop is a big, gray elephant. If you think about the architecture of Hadoop, there’s nothing in the version of control, it’s all hard-wired. The overall idea in the beginning was to write a really close, controlled system that runs at Yahoo. As it became more popular, people came along and said we want to have encrypted messaging between the servers. And the folks that control the code said why would you do that?
Maybe because I work at financial services.
Well, no, you can’t do that.
What I like about Cascading so much is everything is pluggable. You just override at meta or you override a class or you inject something. I seem to be a big fan, eh?
Stefan: You just override what you want and plug it together. It’s like a Lego style system.
Supreet: Exactly. Especially coming from a financial enterprise IT background, I see two distinct value propositions coming out. One is what you just said, because it’s Lego-like, because it’s based on a Java-based programming API. There is a lot more flexibility to develop complex applications, a recommend engine and gene sequencing algorithms. That is definitely one.
And the second one is, being in enterprise IT and coming to Strata conference, you used to give us the biggest hype up. We’re here, the vendors are there, we have the next best thing, and they’re looking at it like, that piece of technology will work perfectly for our graph system, and this is great for simulation. And too bad, we just spent a significant amount of effort and time building it on MapReduce. It’s impossible to plug and replace it. So by providing that future-proofing on the data applications, that carries a significant value as well.
Stefan: I’m not sure if you guys already do this, but maybe in the future you can write Cascading applications that are much easier to write than MapReduce, and run them on MapReduce, on Tez, on Spark, in memory, on a local machine, on a cluster, you don’t care.
Supreet: Exactly. As of today you can run Cascading applications in local mode as well as in MapReduce mode. The next version, Cascading 3, the query plan itself which decides how the Hibernate or whatever gets deconstructed into how it’s going to be executed in the native environment, that’s being made pluggable as well. So the aim is we’re going to come out with support for Spark and Tez too.
Stefan: Being so involved with Cascading, we got totally distracted about the domain-specific languages. Why, how, and what are the benefits there? [9:59]
Supreet: Domain-specific languages, the main aim is, again as you said, that it’s like Lego blocks. If I’m building a house, can I provide a higher level of construct, like doors and windows? So a great example is that we have developed a SQL and CSQL compliant interface on top of Cascading. And for use cases where we’re doing a distributed federated query, we can use that SQL interface, it can connect to any JDBC system and run that query.
Stefan: So you have some application, you have GDBC, you have ANSI SQL, and you have Cascading and then you have Hadoop.
Stefan: Why would you ever use anything else then? (Laughter) Like, all those SQL somethings, they don’t have JDBC. Oh, I think they have.
Supreet: They could.
Stefan: I don’t want to call out vendors.
Supreet: The use case is a little different. This is something I saw in my previous background. Today if I need to do analytics in Hadoop, I need to move all the data from an enterprise data warehouse into Hadoop. And although it can be an extremely simple thing to do technically, in terms of the enterprise data warehouse already working at peak capacity and doing that extract, it’s going to compromise the SLAs with existing applications, so data is not accessible to access in bulk.
And now this provides a path to say I’m going to run my query in Hadoop and only the data that is going to be required for that system comes down. In my opinion, that not only solves a big SLA problem, it also solves some organizational issues about who is allowed and who owns the data too. [12:02]
Stefan: You’re not! (Laughs)
Supreet: Big data, no! (Laughs)
Stefan: You talked a little about SQL on top of Cascading as one domain-specific language, but there are others such as Scala.
Supreet: Yes, Scala.
Stefan: Scala and Lisp kind of… what is that?
Supreet: Clojure, Cascalog.
Stefan: Jesus, all those names, I’m giving up on all those acronyms. But Oscar was here talking about this. What’s your experience of that level of domain-specific languages? It seems to be very popular.
Supreet: It seems to be very popular. Each of these DSLs provides their own value proposition. So Cascalog, as you said, is very Lisp-like, and is incredibly compact and crisp. And that’s the advantage for people who become familiar with it. They can write very compact and crisp code.
We all know the benefits of Scala. Scala has been adopted by many powerhouses. So for example at Twitter, Twitter has been the one contributing to that DSL. They’ve been using Cascading for most of their productions apps. [13:20]
Stefan: Twitter is a big Cascading user. They use Cascalog for their more machine oriented…
Supreet: And Scala.
Stefan: Scala? The Scala DSL as well as Cascalog?
Stefan: Cascalog, okay. And that seems to lead more to machine-learning use cases?
Supreet: Yes. Especially since Cascading has a very interesting play with the lingual. It fulfills a very interesting story on scoring your model… Sorry, on training your models in R. When it’s time to score it, you take it to Hadoop and use the Cascading constructs to score your models.
So there is another contribution to the ecosystem called Pattern. It has implemented five machine-learning algorithms to score your models. The way it works is you use R or Datameer to generate the PMML model, export it, and then use Cascading to score it in the Hadoop infrastructure.
Stefan: Is that something really important in the financial services area?
Supreet: Again, it depends. It is going through an evolution. Right now the set of use cases would be the low hanging fruit. There is some low hanging fruit in the recommendation engines. But most of the immediate impact is coming not from something straight to machine learning or jumping straight to streaming or real-time analysis, but to doing some very mundane work.
Stefan: Yeah, I call it shake-and-bake analytics.
Supreet: Shake and bake analytics?
Stefan: Seriously. The problem is data complexity, bringing the data together. You put flour in there, a little water, a few eggs, a little yeast maybe, and then out comes the cake. So a lot of our customers really struggle to just get their data together.
As you described and what I think is so fascinating, everybody wants to do the in-memory, machine learning, real-time stream… It sounds really attractive, and I’m sure it makes a hell of a resume if you have that background, the check boxes. But I think the big value in a lot of companies is just bringing the data together and getting a 360-degree view of your process, your customer, your behavior. [16:08]
Supreet: Yes, absolutely. So if I could use this…
Supreet: Let’s say traditionally you had your enterprise data warehouse and I was doing some batch of bulk analytics. The simplest one is I’m doing a simulation for a fraud scanner. I’ve developed a new rule that says if this, this, and this happens, it’s a fraud. But just to make sure there are not a lot of false negatives, and I don’t make all of my customers angry, let me do a…
Stefan: Okay, let’s not talk about credit card companies. With my bank that happens quite frequently. I guess they should use Cascading.
Supreet: Change your bank to mine. (Laughter)
So the aim is the more you can run your simulation on historical data to check off false positives, the better you know what this data is. But at the same time, you’ve detected a new attack pattern and time matters. So the more time it takes for simulation, the more time fraud is happening as well.
So you have an enterprise data warehouse and there is this much data that fits on it. Let’s say it is 3 months of data. It takes maybe 14 hours to run the simulation on that.
Stefan: And meanwhile you’re bleeding money over here.
Supreet: Meanwhile you’re bleeding money because you want to maintain your customer relationship and the trust that you’re not taking a kayak trip to Papua New Guinea… (Laughter)
So the first set of use cases where the big data makes a very interesting play is… This is 3 months and let’s say this takes 12 hours to do. Move this to Hadoop.
The first thing is what if instead of 3 months you can run it on 12 months? That doesn’t fit in the enterprise data warehouse. But now it can if it moves over here. And on top of that, instead of 12 hours, it takes 3 to 4 minutes to do it.
It’s a very simple query; it’s not machine learning. It is porting your existing queries written in PL SQL or whatever over to big data.
Stefan: That’s a Hadoop environment.
Supreet: Yes, that’s Hadoop.
Stefan: And the color of that data warehouse was red? Or blue? (Laughs) The reason I’m asking is I would expect that a big financial services company with a credit card, that has unlimited credit and could just buy more… What was the technical challenge to not scale it out further? Was it just limit at half?
Stefan: So you just hit the maximum limit on hardware.
Supreet: Most of the enterprise data warehouses, despite the investments that are made, very quickly — and this is not specific to the organization that I was in — but in talking to enterprise companies, they end up hitting their peak capacity to support SLAs very quickly. They plan for 4 years but it’s happening within 2 years. So to say that I’m going to run that query on 12 months of historical data instead of 3 months… And this is a very simple query.
The second example is I’m trying to develop a new fraud algorithm and the data, let’s say there is a source system… Oh, that’s wet too.
Stefan: That’s like in my old school.
Supreet: Right. A wet sponge.
Stefan: It’s the only thing we had in East Germany. (Laughter) And a big ruler that we got a…
Supreet: Okay, the source system. You can call any one of these sensor systems. This is not specific to machine learning, and these days the Internet of Things is a big deal. It’s a lot of data. And a lot of data doesn’t just mean the speed, the pace, but it could mean there’s thousands of attributes available, depending on the context. [20:47]
Stefan: Depending on the individual user.
Supreet: Yes. Let’s say there’s an event like a swipe and native plus derived variables. Over here, there’s only limited space. And only the most key variables are kept here. But what if I could keep additional variables direct from my algorithm? Again, you’re not going into machine learning. And it’s not just one.
If I do a join with another data set over here, it becomes really expensive. So the breadth and the depth, just move all these over there. That is the most immediate impact. You can show a quick ROI for that and then take it from there.
Stefan: Well, we will come back soon. We’ll have a few more drinks and then we will talk a little about financial services and big data.