Wrapping up my conversation with Supreet, I wanted to dig deeper into his time as a Big Data evangelist at American Express before joining Concurrent. He has some great insights about how regulated industries are using Hadoop and big data analytics:
Stefan: Welcome back to Big Data & Brews with Supreet from Concurrent, Inc., the maker of Cascading. Cheers.
Stefan: You need a new one?
Supreet: I guess I do.
Stefan: I have your magic stack here. Thank you for bringing that.
Supreet: My pleasure.
Stefan: Let me open this for you. There you go.
Supreet: Thank you.
Stefan: Recycle the other one? We’re good at data and at recycling. We recycle code; we recycle beer. Let’s talk a little bit about your experience in the financial services industry. I think it’s really fascinating because you’ve worked on big data deployments at a scale that some people just dream about.
How did you guys start? How did you make this all work and where did you end up? I think you have hundreds of machines now, petabytes of data moving around, a lot of really cool blinking all night long in the big data center. Where did it start? Did it start with a couple, three machines? Did it start from the top, or did it start from the ground as a grassroots movement? [2:10]
Supreet: Definitely from the top. An initiative requiring such a massive redirection of energy and expenses had not just sponsorship but strong championship all the way from the top. Companies are realizing that, even though they are not in the technology business, even if they’re in the services business (financial transactions, hospitality, whatever), technology is the key differentiator that’s going to help them, and big data is one of the key players.
There was a strong championship. Talking about the dimensions of scale: definitely the machines and the data, but the other two dimensions that had to be captured in scale were the number of programmers or analysts doing analytics and the number of use cases. How many applications are there? How do you develop a methodology for engaging not just one or two use cases but dozens of use cases across fraud, marketing, and operational risk, and that too in a regulated environment? How do you do things at scale where there’s a lot of regulation? Those were some of the challenges that were there.
Stefan: Where are you guys now? Where’s your ex-employer now? Hundreds, thousands of machines, millions? [3:45]
Supreet: Can’t say, but …
Supreet: It’s big.
Stefan: Boy, it’s big.
Supreet: From the things that have been publicly announced, it is the biggest in a price class, the biggest cluster. It broke the TeraSort record as well, so really state-of-the-art machines.
Stefan: How do I think about a credit card company that runs a TeraSort benchmark? Are you just like, “Hey, guys, let’s have a beer on the weekend. Let’s run a benchmark on all machines?” [4:21]
Supreet: The culture is definitely different than the hacker culture, where such things come up.
Stefan: You guys have amazing, smart people there.
Supreet: Yes, and I would say that has to be the case for any enterprise company that seeks differentiation by becoming a data-driven company. After all, why are we doing all this big data analytics? It is to develop smarter applications, to help serve your core competency better. There is no alternative for any company to being a technically savvy company. In that company, there were president-level people taking three days off to attend Java programming classes.
Stefan: That’s awesome.
Supreet: I’d say that culture of coding being cool has arguably finally broken the Silicon Valley barrier, and it is becoming more mainstream.
Stefan: Nice, I should submit my resume there, maybe. I think what I hear is that when you have senior-level exec sponsorship, and you really commit to the investment in hardware, in the data, but also into the people, then you will find success. It needs to be all in.
Supreet: It needs to be all that and then some. For example, what does investment in people mean? If you end up training everybody to go from SQL, and some from R and SAS, to programming Java and MapReduce, that’s not going to happen. You can’t execute. Exposing them to a platform, like what we did with the Datameer platform, helps them quickly visualize and discover. One of the other things was, again, going back to this: if a model was developed on one data set, now in the Hadoop platform you have the capability to consolidate data across all information centers.
There’s not one enterprise data warehouse. There are, depending on the size, multiple. Now, they have this and this, and they’re all together. What is the quickest way for me to start exploring the data? What if instead of doing a join with this, I do a join with this and a join with this and see if I get some more interesting insights? I can do better modeling on that behavior.
Giving them tools to quickly visualize, explore, prototype, and do those kinds of things was a very big factor in enabling us to scale and in taking this beyond a niche, early-adopter platform. It went from an early-adopter platform to an enterprise platform in less than twelve months. Those were some of the things we had to do, and Datameer played an extremely big part in that.
Stefan: Thank you. Do you want to say for which company you work? I appreciate that, thank you. What are some of those challenges that you think are specific to financial services that you guys had to overcome? Was there, like, encryption or, I guess, a lot of personally identifiable information? What if someone in financial services thinks, ‘I want to go for this Hadoop thing’? What are the things that you think are important to keep in mind? [8:04]
Supreet: Yes. I would say not specific to financial services but specific to regulated industries. Each time a doctor gets involved or each time money gets exchanged: those are the two places, whether it’s real estate or something else. The big challenge is, let’s say, if I was developing a model in the SAS platform. In the SAS platform, I can look at a model in a very easy-to-understand language, I can look at all the variables in the system, and I can say, “Well, this variable looks like zip code. From zip code, I can derive customer demographics. Tell me all the models that it’s been used in.”
Those kinds of questions, today, cannot be answered with absolute certainty in the big data ecosystem. That’s one of the value propositions, again, to make a pitch for Cascading: by using a common fabric that captures all the metadata and the signals associated with a query, one could address those questions. But it’s not a matter of encrypting. For example, I could be allowed to use a zip code to improve customer satisfaction. I could be allowed to use age to improve a use case that involves serving the customer better, but I definitely cannot use those two variables to determine whether I should lend to you or not. That makes it an even more complex challenge.
Stefan: If I’m twelve years old.
Supreet: I still want to make you happy, but maybe not a credit card.
Stefan: Is there a legal age in the U.S. when you’re allowed to get a credit card? I wonder.
Supreet: Is there?
Stefan: In Germany, there is. In Germany, you’re not allowed to get a credit card before age sixteen, I think.
Supreet: Even if it’s a dependent of … ?
Stefan: Yeah, I think so. There’s something. But, hey, you can drink over 16.
Supreet: Then, those suckers or whatever you say, they start at six months.
Stefan: What are the biggest challenges that you had in your previous job with the Hadoop platform? Where did you say, “Boy, I wish, really, Hadoop would do a better job here?” [10:23]
Supreet: In terms of the Hadoop platform, one of the constant screaming matches I used to witness was this: there’s a job, it’s been developed, it works fine. Let’s say it has X number of hours of SLA; it needs to finish in twelve hours. It’s based on, let’s say (again, I’m making up a use case), a marketing campaign. It’s deployed in production. The marketing campaign takes off. There are a lot of people joining it, and some part of the equation becomes very time-consuming to process, so the SLAs start getting missed. The loud discussion that happens is, “Where’s the problem?” The application developers would say, “I need more infrastructure.” The infrastructure people will say, “Well …
Stefan: “Your code sucks.”
Supreet: “… your code sucks.” There is no objective way to instrument. Unless it’s built into the platform, natively the only visibility I have into MapReduce is how much time a job took to run. That is a challenge that needs to be managed better. We talked about the governance and the compliance. The third one is, ultimately, in a regulated industry, it goes through a regulator. Regulators like to be shown or proven that you didn’t use unfavorable or illegal signals to make a decision, but when you start using machine-learning algorithms … He looked like that.
Stefan: Hey, it’s the algorithm.
Supreet: He paid his bill, so he must pay his bill, too. Why’d he look like that? I don’t know. The machine took it out, right?
Stefan: It was that machine, Dover 23.
Supreet: They said, “It will work out.” They’ll pay the bills. I’m not too sure if it’s a technical answer or a legal answer. There’s a lot of benefit that can come from more advanced techniques (I’m not a data scientist), whether it’s health care, financial, or any of the regulated cases. With machine learning it is not easy to prove a hypothesis, and those are the places where it’s being held back, even though it can improve the outcome in many cases.
Stefan: What other technologies do you see coming up that you’re really excited about in the ecosystem besides Cascading? [13:03]
Supreet: Ones from Concurrent and Datameer.
Stefan: Like, general themes, right? The streaming or in-memory or kind of the graph optimization. I don’t know. What are the things that you’re really excited about?
Supreet: The things that really excite me, I would almost move one level higher. We have been focused, if I go to Strata, very much at a system level. It’s yet another way to do analytics. I think we are within six to eighteen months of starting to see either new vendors or existing vendors using big data to make existing technologies smarter, for example, CRM. CRM has a big play. For example, I bet on a prospect closing: 80%, in my typical CRM application. There’s a lot of intelligence that can be developed in that, so going one level higher and looking at existing CRM applications, ERP applications, and making them smarter through big data analytics and seeing them in a packaged form, that is the place I’m most excited about. Over the last two years, I got a chance to evaluate multiple of what I call machine-learning-in-a-box vendors.
The pitch was, “I have this algorithm. It’s the best algorithm ever,” right? But algorithms are so dependent on the use case and the data set that they can’t stand by themselves. I’m sure they’re the best algorithms, but apply them to sales automation, to a marketing campaign, and that’s where the real lift will come. That’s the part I’m really looking forward to.
Stefan: The data-driven applications?
Supreet: Data-driven applications and, from there, a data-driven business that comes out.
Stefan: Let’s switch topics here a little bit. What’s your role at Concurrent?
Supreet: My role in Concurrent is … Field engineering encompasses activities, of course, the technical evangelization, supporting deployments of customers, definitely presales, supporting and integrating with partners as well, and helping third-party applications use the Cascading and the Driven code, which is a commercial product, to be successful. I’m really excited about that role.
Stefan: What’s, maybe from the ones you can talk about, the most fun use case, the most fun companies? It’s known that Twitter and Etsy are using Cascading heavily, and I think Prismatic is using Cascalog for something, but where did you see the technology used and you were like, ‘Wow, this is awesome what you guys are doing’? [16:01]
Supreet: Definitely the scale aspect of it: when I’m seeing multiple thousands of nodes running production jobs, not missing the SLAs, there are multiple reasons for that, not just one. One of them is that they’re built on a platform that makes the jobs very predictable, very deterministic. That really amazed me. Then, looking a few months out into the future, not just for Driven: Cascading captures a lot of signals, and those signals are now being exposed through the Driven platform. I really believe that a lot of the challenges we discussed today can be addressed through those signals that are being captured, including some of the problems my previous world had. I’m pretty excited about that.
Stefan: Maybe even the middle layer, where Cascading learns from Cascading to do a better Cascading job in the future? [17:05]
Supreet: Yes, exactly.
Stefan: From all the metrics you collect in Driven, and then make smart decisions moving forward?
Supreet: Exactly. Make smart decisions moving forward. Even starting from something simple like data lineage. If you’re an analyst …
Stefan: Yeah, you have to report that, right?
Supreet: You have to report that, and you’re not allowed to use a variable, but you know 70% of your bonus is based on getting a lift. You might just change your variable a little bit here and there, mask it, and then end up using … We’re ages away.
Stefan: I’ve never worked in financial services.
Supreet: Neither did I. That’s what regulators want to see.
Stefan: We have to give in, then.
Supreet: With regulators, in that world, you are assumed guilty until proven innocent. All kinds of accusations are thrown at you, and you have to prove otherwise. By using this Cascading fabric, you can trace the lineage of any data element from the beginning to the end. That’s a very powerful value as well.
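The lineage point above can be illustrated with a minimal sketch. It is not Cascading or Driven code; the field names, the `LINEAGE` and `FORBIDDEN` tables, and both helper functions are hypothetical, made up only to show how tracing a derived variable back to its raw sources lets you check whether a restricted variable (such as zip code or age) was used for a purpose where it is not allowed, even after it has been renamed or masked along the way.

```python
from typing import Dict, List, Set

# Hypothetical lineage: each derived field maps to the fields it was computed from.
LINEAGE: Dict[str, List[str]] = {
    "geo_segment": ["zip_code"],
    "risk_feature_7": ["geo_segment", "payment_history"],
    "satisfaction_score": ["age", "support_tickets"],
}

# Hypothetical policy: purposes for which a raw field may NOT be used.
FORBIDDEN: Dict[str, Set[str]] = {
    "zip_code": {"lending"},
    "age": {"lending"},
}


def source_fields(field: str) -> Set[str]:
    """Walk the lineage graph back to the raw (underived) fields."""
    seen: Set[str] = set()
    sources: Set[str] = set()
    stack = [field]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parents = LINEAGE.get(current)
        if parents:
            stack.extend(parents)  # derived field: keep walking upstream
        else:
            sources.add(current)   # no recorded parents: treat as raw input
    return sources


def violations(model_inputs: List[str], purpose: str) -> Set[str]:
    """Return raw fields that feed the model but are forbidden for this purpose."""
    found: Set[str] = set()
    for model_input in model_inputs:
        for src in source_fields(model_input):
            if purpose in FORBIDDEN.get(src, set()):
                found.add(src)
    return found
```

Under these assumptions, `violations(["risk_feature_7"], "lending")` reports `zip_code`, because the innocuous-looking feature is derived from it two steps upstream, while the same check for a customer-satisfaction purpose reports nothing.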
Stefan: I think it’s fascinating being on the other side. Welcome to our side, to the side of technology that most …
Supreet: I have some fond memories of the friends I have on that side, but it’s great to be back on the right side.
Stefan: Tell me, how was it? What was your observation? I’m sure it must be absolutely fascinating, in the position you were in at that really big credit card company. I’m sure you saw a million technology vendors. What are some of the most interesting observations you had? [18:37]
Supreet: Not only did I see a lot of vendors, but I had been in sales calls before, being part of the entrepreneur play, and I thought I knew what a good sales call was. The lessons that I learned were that, number one, the decision to engage or not is made by the time you hit the second or third slide. There are certain checks that are going on mentally if they have the laptops open. Again, sorry for giving out your tribal secrets, but it’s also to save their time. The first check is, are you credible enough? Credibility comes not only from the background, but have you solved the problem before? Do you even understand what our problem is? Getting that pulse. That has to be hit very quickly. Secondly, are you smart enough? Again, that’s a double-edged sword. Not only do you have to show that you’re competent, but you also have to show you’re hungry and humble to work and give a solution. That call that’s happening, it’s not for a product, it’s for a solution.
Stefan: Right, and you thought of that as an entrepreneur, as a technology person.
Supreet: That’s right. A lot of times, by the time people end up raising the money and doing great in school, they walk in with a certain amount of hubris. They might do a check on the smartness thing. They might do a check on great technology that was produced at some Google-like place, and now it’s coming out but it’s a veto. It’s like is that technology in the team?
Unlike traditional analytics companies, enterprise IT organizations are now very open to working with entrepreneurs, but those kinds of things are very, very valid. A great example of where a sales call goes extremely bad: a company comes with machine learning in a box. The question is, ‘Well, why do you believe that your algorithms are better than the algorithms we have?’ The answer came, very matter-of-fact: ‘We have better scientists.’ Great.
Stefan: “We are smarter than you.”
Supreet: “We are smarter than you,” in a sales call.
Stefan: Was that the moment where 50% of all people dropped out of the WebEx?
Supreet: That’s right, like hang up, hang up, hang up! Yes, so that’s a big takeaway. Come in with a value proposition, solution, humility, and it’s a great opportunity to become great companies. Datameer, in my experience, did that for two years, which is why I was such a big champion of yours.
Stefan: Thank you very much for your support.
Supreet: It was a win-win.
Stefan: I’m very happy you went to Concurrent, my favorite, seriously, my favorite other big data company. Your favorite technology book in the last twelve months, what was that? [21:57]
Supreet: I still hit the Don Knuth. It’s a classic, but I still hit it once in a while. Most recently, I’ve been reading some booklets coming out from Ted Dunning. He does a fantastic job evangelizing the potential of machine learning, and I really like that, too. My favorite book for the next three months will be something on Scala, I’m sure.
Stefan: Okay, so Scala’s the new language for you?
Supreet: I think so. It is. Not only because of any marquee customers of Concurrent, but we started hearing about Scala being used outside technology powerhouses as well. Many of these companies (not technology companies but IT companies, IT organizations) had gotten a little behind in keeping their edges sharp. They’re trying to leapfrog the whole Java world by going straight there, too.
Stefan: To Scala.
Supreet: I don’t know. It’s definitely very interesting, and I would like to.
Stefan: I have this belief. You only have so many types you can do in your life as a software developer when you’re typing so much, and Scala is really good for, the main three, because you have to write the code, but then on the other side the domain-specific languages they go a little bit too far. You don’t write much code anymore, but I think you can but it’s a little bit too much. [23:34]
Supreet: Too much, yeah.
Stefan: Scala seems to be really getting a lot of traction. A couple of years ago, a lot of companies had scalability or availability problems with very random Scala bugs, but we seem to have gotten over the bump there.
Supreet: It seems to have, again, crossed that threshold of being a niche technology as well. It has reached that tipping point where there’s a lot of belief in mainstream technology vendors as well, that that may be the way to go.
Stefan: Great. Thank you very much.
Supreet: Thank you.
Stefan: Thanks for coming by.
Supreet: My pleasure.
Stefan: People, check out Cascading, check out Scala, and come back for our next Big Data & Brews. Cheers.