We’re getting deeper into the summer months in San Francisco and I think a great way to cool down is to grab your favorite brew, pull up a chair, and check out a few more real-world use cases that my friends at Hadapt, Zementis, Twitter and Concurrent have seen with their technology. Enjoy!
Justin Borgman, Hadapt
Stefan:What are the use cases that people are implementing with your … Where are you seeing your technology being very strong? SQL, structured data. The classical use case for [inaudible] and Hadoop, maybe help me to contrast that. Why would I …
Justin:Yeah. If we have an eraser.
Stefan:No we basically just do that. Yeah. We should note to self get an eraser.
Justin:All I was going to say here is the way we look at it, there’s certainly the spectrum of structured and unstructured data.
Justin:SQL on Hadoop is largely I think about the structured data and how can you sort of do these structured SQL traditional workloads on structured data. Certainly that is where we began, where we’ve evolved is by allowing enterprise users to actually query this entire spectrum. In addition to SQL, we added full-text search, that kind of works more on this end of the spectrum.
Then we just introduced a feature in version two of our software back in September that we call Schemaless SQL, which is basically the ability to query semi-structured data like JSON or XML, key-value data.
Justin:You can query that directly.
Stefan:Is that more of an XPath query thing, done for JSON, or is that still straight SQL?
Justin:It’s straight SQL. What we actually do is, don’t ask me too many questions on how this works. I’ll have to get one of the engineers involved.
Stefan:We’ll just call them in.
Justin:Basically, it’s all magic. Basically what we do is we take the key-value pairs in the JSON data. We materialize them in a tabular format. The keys become treated like columns. You can continue to query JSON, XML, et cetera using SQL, even as that schema is changing. You don’t have to change your ETL process. There’s no real ETL process per se. It’s sort of happening automatically in a way.
Also you don’t have to change the schema of your database. We’re sort of automatically materializing that. You may change your JSON, add a new attribute to your JSON file. Now that’s immediately queryable automatically. We did that because, to answer your question about use cases, we found a lot of customers that want to do analytics on JSON and XML data, whether that’s clickstream data or some kind of event log data or data coming from a key-value store like a Mongo or HBase or what have you.
This was kind of an easier way to allow them to query that directly. I can talk about one customer example. This is a customer actually in Boston that I think has a pretty cool business. It’s called Objective Logistics. What they do is they take point-of-sale data from restaurants and bars. They take the receipt data, the check data. They basically use that data, and analyze that data, to determine the best wait staff in the restaurant or the bar.
Who’s selling the most, what sort of margin items are they selling (some items are higher margin than others), what kind of tip were they getting. They stack rank these folks. They do a couple of things with that. First of all, they actually reveal it to everyone in the restaurant. There’s sort of some inherent competition.
You want to be [inaudible] exactly. That’s exactly right. Secondly, the people that are at the top actually get rewarded by choosing their shifts first. If you work in a shift-based business, especially in the restaurant industry, it’s important. Exactly. That’s a case where all that data is stored as JSON. It’s coming from these point-of-sale systems. It’s changing all the time: every restaurant’s a little bit different, every point-of-sale system’s a little bit different.
Being able to have that flexibility is important.
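A minimal sketch of the idea Justin describes above: JSON keys are materialized as columns, so plain SQL keeps working when a new attribute shows up. It uses Python's standard sqlite3 module purely as a stand-in. The receipts, the field names, and the materialize helper are all hypothetical, and nothing here reflects how Hadapt actually implements Schemaless SQL.

```python
import json
import sqlite3

# Hypothetical point-of-sale receipts; the third one adds a new "tip" key.
receipts = [
    '{"server": "ana", "item": "burger", "total": 12.0}',
    '{"server": "ben", "item": "wine", "total": 30.0}',
    '{"server": "ana", "item": "steak", "total": 40.0, "tip": 8.0}',
]

def materialize(conn, table, json_lines):
    """Load JSON objects into a table, adding a column for each new key."""
    columns = []
    for line in json_lines:
        record = json.loads(line)
        for key in record:
            if key not in columns:
                if not columns:
                    # The first key seen creates the table.
                    conn.execute(f'CREATE TABLE "{table}" ("{key}")')
                else:
                    # Later keys widen the table without reloading any data.
                    conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{key}"')
                columns.append(key)
        names = ", ".join(f'"{k}"' for k in record)
        marks = ", ".join("?" for _ in record)
        conn.execute(
            f'INSERT INTO "{table}" ({names}) VALUES ({marks})',
            list(record.values()),
        )

conn = sqlite3.connect(":memory:")
materialize(conn, "receipts", receipts)

# Stack rank servers by total sales; rows without a "tip" key read as NULL.
ranking = conn.execute(
    "SELECT server, SUM(total) AS sales, SUM(tip) AS tips "
    "FROM receipts GROUP BY server ORDER BY sales DESC"
).fetchall()
print(ranking)
```

The query at the end never had to change when the "tip" attribute appeared, which is the flexibility being described for point-of-sale feeds that differ from restaurant to restaurant.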
Stefan:Cool, that’s interesting.
Stefan:What kind of industries are you guys seeing? We see a lot of financial services, telco, retail, and the new things, kind of optimizing production systems, lead production based on big data. Do you guys have hot spots in certain verticals, where you see “oh, we have 90% of our customers in the retail business or in the” …
Justin:Yeah. I would say the top three for us are sort of what I’ll call Internet, which is a lot of SaaS-based businesses doing funnel analysis and that sort of thing. Retail is certainly one as well. Then financial services around security use cases, fraud protection, that sort of thing.
Stefan:Okay. Interesting. Okay. Cool. What are the deployment sizes you guys see for your technology?
Justin:It ranges quite a bit actually. Everything from five or six nodes, or even in EC2. We see a lot of our customers deploy this in Amazon.
Stefan:We don’t see that at all.
Justin:Interesting. Yeah. I wouldn’t say … I’d say the majority of … The majority might still be actually on premise. We do see certainly within …
Stefan:As a trend.
Justin:Yeah. As a trend. More and more people using Amazon. Then deployment size was the question. All the way up to a couple hundred nodes so far. The interesting thing is because we price on a per node basis, the customers very smartly decide to build beefier nodes in some of those on-premise situations.
Stefan:Interesting. Well, we price on data.
Justin:Data ingestion or …
Justin:Kind of like a [inaudible]
Stefan:Yeah, exactly. Where it’s more about we don’t care how many users you have. We don’t care if you have 10 or 500 machines. We … The beauty is we see really high ROIs around even like a 10 [inaudible]. It doesn’t … The success of your Hadoop or big data strategy isn’t related to the size of your Hadoop cluster. Funnily enough, the engineers are always measuring themselves by the size of the new cluster.
Michael Zeller, Zementis (Part 1)
Stefan:Let’s talk a little bit about the product. What is the product doing? Where are the models? Where do you need other tools? Where has it integrated with things? But let’s say I’m … I don’t want to say I’m a data scientist because I don’t believe in that. Anyhow, let’s say I’m a business analyst, and I want to do some, I don’t know, loan scoring or maybe, is that a good use case for it. What’s kind of the “Hello world” for …
Michael:”Hello world” for …
Michael: Click with Fraud detection.
Stefan:Yeah, okay. Why don’t we … I happen to have a hundred trillion credit card transactions. How do I deal with that? Help me. Where are your tools sitting? Where are others coming in? How do I make this happen?
Michael:Yeah. There’s always the two worlds that we face. One is the data scientist, or the machine learning expert, that really builds advanced models that predict risk, right? Fraud, predict upsell and cross-sell opportunities.
Stefan:How do they build this? Do they write code, or … ?
Michael:In various different tools typically. We have a great selection of commercial or open-source tools, commercial like SAS, SPSS, KXEN, open-source like R, KNIME, RapidMiner. A lot of tools for data mining scientists to build models. But once you have that model, that’s where we start really with our tools is, how do you operationalize them? How do you really run them through a scalable hundred million, billion transactions?
Stefan:A RapidMiner or an SPSS is more kind of a design environment, where I build something and then where do I run it. So you maybe are the application server for predictive models?
Michael:You could think about that. Yeah. Think of it as the scoring engine, the application server, the deployment shell, so to speak, where you plug in one or many models to apply them in whatever your most favorite data mining operational environment is, actually.
Stefan:How do they communicate? If I build something in SPSS or RapidMiner, do I give you some Python code or what’s the …
Michael:Yeah. What we use is an open standard called PMML, the Predictive Model Markup Language. The idea really is..
Stefan:So it’s like XML?
Michael:It’s an XML representation of the model, and the idea is not to move code. Not to move custom code from the data science desktop into your operational IT environment. That’s very complex, very hard to keep track of, and then to QA. The idea is to have a common standard, PMML, an XML representation of your model, that I can first of all export from all data mining tools, commercial or open-source, and then consume in our engine, and then deploy on various different platforms.
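To make "an XML representation of the model" concrete, here is a small sketch that reads a hand-written, simplified PMML-style regression model and scores records with it, using only Python's standard library. The field names and coefficients are invented, the document is not schema-validated, and the load_linear_model and score helpers are hypothetical. A real scoring engine handles far more of the standard than this toy does.

```python
import xml.etree.ElementTree as ET

# Simplified PMML-style document for a linear risk score (illustrative only).
PMML_DOC = """
<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <DataDictionary numberOfFields="3">
    <DataField name="amount" optype="continuous" dataType="double"/>
    <DataField name="foreign" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="amount"/>
      <MiningField name="foreign"/>
      <MiningField name="risk" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="amount" coefficient="0.25"/>
      <NumericPredictor name="foreign" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"p": "http://www.dmg.org/PMML-4_2"}

def load_linear_model(pmml_text):
    """Extract the intercept and coefficients from the regression table."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        node.get("name"): float(node.get("coefficient"))
        for node in table.findall("p:NumericPredictor", NS)
    }
    return intercept, coefficients

def score(model, record):
    """Apply the linear model to one record (a dict of field values)."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())

model = load_linear_model(PMML_DOC)
transactions = [
    {"amount": 10.0, "foreign": 0.0},
    {"amount": 100.0, "foreign": 1.0},
]
scores = [score(model, t) for t in transactions]
print(scores)
```

The point of the design is that only the XML document moves from the modeling tool to the operational side; the scoring code stays the same no matter which tool exported the model.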
Stefan:Just so I get this right. So I build something in RapidMiner or SPSS or whatever might be the next product, to PMML. Now I have a hundred trillion records over here in a database on a Hadoop system. So what do I do now?
Michael:Now you apply that model on your data.
Stefan:Your engine is where this is executed?
Stefan:You integrate your engine into Hadoop? Or what are the different areas there?
Michael:Hadoop is one option. The idea really is to have many different choices for your platforms. Hadoop is one. Through Datameer, through Hive, in database, your classical data warehouse. We run on top of Netezza, on Greenplum, on Teradata, and Sybase IQ. It really depends on the customer’s current IT infrastructure to make it easy to deploy. You pick the right environment. We have real-time environments that are kind of stand alone, that can run in the cloud. We also have a cloud-based solution on Amazon.
Stefan:Are people using this? We see people not so happily pushing data to the cloud. What’s your experience?
Michael:It depends. There’s quite a few use cases. But if you’re looking at…
Stefan:Are the banks moving credit card transactions into the cloud?
Michael:Not really. That industry … Whenever you have a regulatory environment, the cloud most often is not the choice. But for proof of concepts, for smaller consulting firms, for marketing applications, there are many, many use cases that you can really use in the cloud. Where it’s not personally identifiable information and it’s a pass-through. You don’t necessarily have to store your data in the cloud. But we also have clients that have their data already in the cloud, so there are no questions asked.
But regulatory environments, or the government in particular, stay in-house of course. I think it’s really about simplicity there and the choice of the customer to deploy predictive models wherever they need to go. Not having to worry about it.
Stefan:You said predictive models. Does that basically mean you’re really limited to models you learn or you build and then you execute? Right? There’s no … Is there classification or …
Michael:There’s classification, there’s …
Stefan:But not clustering?
Michael:There’s clustering capabilities. You can apply cluster models, classification models. It’s really across the mainstream, what I would call predictive models, from linear, logistic regression to the machine learning algorithms like neural networks, support vector machines, all the way to ensemble models, what was used for example in the Netflix Prize. Again the most complex models, and the more complex they are, the easier it is to move them from the data mining side just to your operational environment, using a standard, not custom, code.
Michael Zeller, Zementis (Part 2)
Stefan:Let’s come a little bit to the use cases. What do you see as the main use case in your customer base? Maybe let’s even move one step back. What are the verticals you guys seeing and are there specific use cases in those verticals?
Michael:Verticals is … This is a platform application, same as your data warehouse, your Hadoop system would be, across all industry verticals. That’s what our customers span, from consumer electronics to telecom, financial industry to government.
Stefan:It’s very diverse.
Michael:It’s very diverse. It’s really a true platform application. Sometimes difficult to be able to serve all these different verticals, because you have to speak the language to a certain extent. But if you think about predictive models, they’re the same mathematical models, the same algorithms, that you use everywhere. The scoring of one model in marketing is the same as a similar algorithm in the financial industry. That makes it really a very universal application. That’s the fun part of it. You can play anywhere you want to go.
Stefan:Are there any main … where you say, here, this is just blueprinted, we do this all the time. Besides, you sounded like, credit card …
Michael:Yeah. Fraud detection in general has been the marquee use case of advanced predictive analytics. I think the financial industry is ahead of the curve in adopting predictive analytics. But we see a lot of marketing applications, cross-sell, upsell, churn prediction in the telecom industry. Those are, I think, the classical mainstream use cases.
You see a lot of machine predictive maintenance and quality control use cases now that sensors are becoming so cheap. I think we’re at this very interesting point in time where processing power becomes very affordable, cheap. Storage, so massive amounts of data, you can keep around. You can move around. You can work with it. The algorithms are well-known. It’s really us creating business value out of that.
It’s really a very interesting point in time where things that, what we did in research 10, 20 years ago, very hard, very complicated on dedicated clusters, now you can do really standard based and very easy on everybody’s desktop.
Stefan:I want to drill down more of those use cases, because they’re so much fun, right? You said predictive services. What would be the example? You predict when my car needs to get new brakes, or … ?
Michael:Yeah. You could…
Stefan:What’s the most fun one you can share?
Michael:For example, we could look at vibration signatures of rotating equipment. Let’s pick one …
Michael:Rotating equipment, like vacuum pumps. Something that rotates very, very quickly and …
Stefan:Wow, that’s interesting.
Michael:Vacuum pumps, for example, are used in many semiconductor processes. There’s lots and lots of vacuum pumps. Having those fail is very bad for the overall process. You would like to know in advance when those elements may be about to fail.
Similar exciting use case in the energy grid. You’d like to look at all your transformers in the field, look at voltages, currents, temperatures. They all have sensors in it. You can detect when there’s something wrong with the equipment, and not necessarily send a maintenance crew out there on a quarterly interval just to check on it. Really be more..
Michael:More dedicated. So we can do more with less resources.
Stefan:Is that, that you would learn the model then? How is that … So you’ll execute the model, but how are people coming up with those dimensions? I got this sense or this sense or that temperature, that. Is that where the data scientists come into play or …
Michael:Exactly. That’s the offline process. You always have a learning, data exploration, process, where you look at your data, you look at outliers, you look at different variables. What do you have. What you can do. Then you build your model. Kind of the scientific aspect of the process.
Our products almost sit at the tail end of that, where it becomes a repeatable process. Now you have that model, but you want to execute it every time you get the sensor reading. In real-time, potentially. Or you have a nightly batch run where you run through a hundred million transactions again. It’s where repeatability is very important, and being able to crunch and apply those models over and over again.
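The split Michael describes, an offline learning step followed by a model that is applied over and over, can be sketched roughly as follows. A simple z-score threshold stands in for whatever model a data scientist would really build. The vibration readings and all function names are made up for illustration.

```python
from statistics import mean, stdev

def fit_baseline(healthy_readings):
    """Offline step: learn what normal vibration looks like."""
    return {"mean": mean(healthy_readings), "std": stdev(healthy_readings)}

def score_reading(model, reading, threshold=3.0):
    """Repeatable step: flag a reading far from the learned baseline."""
    z = abs(reading - model["mean"]) / model["std"]
    return z > threshold

def stream_alerts(model, readings):
    """Real-time style: score each reading as it arrives."""
    for reading in readings:
        if score_reading(model, reading):
            yield reading

# Hypothetical vibration amplitudes from a pump that is known to be healthy.
healthy = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95]
model = fit_baseline(healthy)

# Batch style: score a whole night's worth of readings in one pass.
nightly = [1.0, 0.98, 1.6, 1.02]
flags = [score_reading(model, r) for r in nightly]
alerts = list(stream_alerts(model, nightly))
print(flags, alerts)
```

The same score_reading function serves both the streaming and the batch path, which is the repeatability being emphasized: the model is built once and then only executed.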
Oscar Boykin, Twitter
Stefan:What are the use cases … what are the things that … you touched a little bit on log file processing, but are there any cool implementations, lock pop that you can share, Scalding that you use at Twitter? What is it that you’re using it for?
Oscar:Well, it’s easier to program at a larger scale, so it means that in some cases, people just build bigger things. Chris Evers at eBay has a GitHub thing that will do matrix factorization.
Matrix factorization is that we have this matrix … often, we imagine the matrices that are tall and skinny, and maybe the width of them … so they’re skinny, so maybe they have 10K columns, but maybe they have a hundred million rows. So you can kind of think of users and movies, if you want to make a movie recommendation, and so we want to say that this matrix here is approximately this other matrix that’s really skinny and then has this other thing like that that’s a product, right?
So if we were to go and multiply this out, this matrix here has the same width as that one, and that has the same height as that one, but it has some hidden dimension here that’s R rows wide. Maybe that R has two hundred dimensions. And these are the kind of … like, you can kind of think of like, each movie has one of two hundred characteristics, and for each movie in our database, they each point a little bit in the romance direction or whatever, and then, each user is interested in various different amounts, and we observe this one over here.
So when we showed you … let’s go … I’m going to mix my metaphors … we’re going to go away from movies and go back to tweets. When we showed you a tweet from Britney Spears, did you click the “favorite” on it? Yes, you did. Or over here … and then a re-tweet or whatever. Now we want to reconstruct an estimation of this matrix from the product of these two. So that’s usually useful for machine learning. So going backwards—this regression backwards in generating these—allows you to make really good recommendations, because anywhere where I haven’t seen it but the prediction is that it would be really good, maybe we should show, give it a try.
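As a rough illustration of the factorization Oscar sketches on the whiteboard, the snippet below approximates a sparse user-by-tweet engagement matrix as the product of two skinny matrices with a small hidden dimension, fitted by plain stochastic gradient descent. This is a toy in pure Python and not the Scalding implementation he refers to. The data, the rank, and the learning-rate settings are all assumptions.

```python
import random

def factorize(observed, n_rows, n_cols, rank=2, epochs=500,
              lr=0.05, reg=0.01, seed=0):
    """Fit U (n_rows x rank) and V (n_cols x rank) so U[i].V[j] ~ observed[i, j]."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    for _ in range(epochs):
        for (i, j), value in observed.items():
            err = value - predict(U, V, i, j)
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                # Gradient step on squared error with L2 regularization.
                U[i][k] += lr * (err * v - reg * u)
                V[j][k] += lr * (err * u - reg * v)
    return U, V

def predict(U, V, i, j):
    """Estimated engagement of user i with tweet j."""
    return sum(a * b for a, b in zip(U[i], V[j]))

def mse(observed, U, V):
    """Mean squared error over the entries we actually observed."""
    errors = [(value - predict(U, V, i, j)) ** 2
              for (i, j), value in observed.items()]
    return sum(errors) / len(errors)

# Hypothetical observations: (user, tweet) -> 1.0 if favorited, else 0.0.
observed = {
    (0, 0): 1.0, (0, 1): 1.0,
    (1, 0): 1.0, (1, 2): 0.0,
    (2, 1): 0.0, (2, 2): 1.0,
}

U0, V0 = factorize(observed, 3, 3, epochs=0)   # untrained starting point
U, V = factorize(observed, 3, 3)
error_before = mse(observed, U0, V0)
error_after = mse(observed, U, V)

# Estimate for a pair we never observed: user 0 and tweet 2.
guess = predict(U, V, 0, 2)
print(error_before, error_after, guess)
```

The rows of U and V are the hidden characteristics; they can be used directly for recommendations or passed on as generated features to another model.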
Stefan:So that’s more kind of a feature reduction … ?
Oscar:Yeah, you can kind of think of it as a feature generation, and on this, you can use that to train a model. I mean, you could just force it, but it’s better to use it as feature generation, yeah. I would feed it to like a logistic regression or something.
Stefan:Well then, getting this running on Spark will be fun, right?
Oscar:Yeah, absolutely, yes. That would be a lot faster. So that was one example that Chris did on Scalding. We do a lot of stuff of … you know the standard kind of … rollups of … data cubing? We have the people come in; they see some tweets; we make some recommendations; and they engage with those recommendations or not. And now we want to look and make graphs: Are we doing well on iPhone? Are we doing well on Android? Are we doing well in the U.S.? Are we doing well with men? With women?
You can kind of cube it out in all these dimensions, and so, as an item comes in, we’re going to put it in this large bucket of sets, and then crunch them down in Scalding. And keeping track of the types there is really nice because it’s really easy to get that kind of thing wrong. You can do anything with anything; everything is universal, so it’s kind of … I mean, at the end of the day, comfort is going to be the answer. When people … I mean, go to Twitter, do a search for Scalding (I’m sure I’m biased) but I see people like, “Oh, I just tried Scalding. It’s so great. I’m never going to use anything else.” So I think there’s a comfort feature. It’s a fun library to use.
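The rollup Oscar describes can be sketched as follows: each event is counted under every combination of its dimensions, with "*" standing for "any value", so questions like "how are we doing on iPhone?" become simple lookups. This is a plain Python illustration, not Scalding code. The dimension names and the events are invented.

```python
from collections import Counter
from itertools import product

DIMENSIONS = ("platform", "country", "gender")

def cube_keys(event):
    """Yield every rollup key for one event, using '*' for 'any value'."""
    choices = [(event[d], "*") for d in DIMENSIONS]
    return product(*choices)

def rollup(events):
    """Count impressions and engagements under every dimension combination."""
    shown = Counter()
    engaged = Counter()
    for event in events:
        for key in cube_keys(event):
            shown[key] += 1
            engaged[key] += event["engaged"]
    return shown, engaged

# Hypothetical recommendation events: 1 means the user engaged.
events = [
    {"platform": "iphone", "country": "US", "gender": "f", "engaged": 1},
    {"platform": "android", "country": "US", "gender": "m", "engaged": 0},
    {"platform": "iphone", "country": "BR", "gender": "m", "engaged": 1},
    {"platform": "iphone", "country": "US", "gender": "m", "engaged": 0},
]

shown, engaged = rollup(events)
iphone = ("iphone", "*", "*")
print(engaged[iphone], "of", shown[iphone], "iPhone recommendations engaged")
```

With three dimensions each event lands in eight buckets, which is why this kind of job grows quickly and benefits from running on a cluster.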
Supreet Oberoi, Concurrent (Part 1)
Stefan:Cascading has been around, I would say, forever in big data. It has been in the big data world forever. It was there before Pig, before Hive, before all of the other stuff. You guys seem to have a really big user community. What are some of the highlights of people using Cascading and what are the use cases?
Supreet:Sure. That’s the fantastic thing I’m discovering about Cascading. The technology spans verticals. It is being used in pharma, in consumer companies such as Airbnb, Twitter, and Etsy. It is being used or adopted in financial services as well.
Typically the use cases come when you need to develop complex, data-driven applications, and those applications have to be taken into production scenarios. So the needs that are required during long-string data exploration, data visualization, and during ad hoc analysis, are very different than the needs when you’re taking big data applications to production with the SLAs, doing capacity planning or being able to do root cause analysis for job setup of the production. That’s where it really shines in its value.
The other place that it really shines is the domain specific languages that are being developed on top of that. There’s been a session on Cascalog…
Supreet Oberoi, Concurrent (Part 2)
Stefan:What’s the, maybe from the ones you can talk about, the most fun use case, the most fun companies? It’s known that Twitter and Etsy are using Cascading heavily, and I think Prismatic is using Cascalog for something, but where did you see the technology used and you were like, ‘Wow, this is awesome what you guys are doing.’
Supreet:Definitely the scale aspect of it, when I’m seeing multiple thousands of nodes running production jobs, not missing the SLAs. There have been multiple reasons, not just one. One of them being that they’re built on a platform that makes the jobs very predictable, very deterministic. That really amazed me. Then looking a few months out in the future, not just for Driven, but Cascading captures a lot of signals, and those signals are now being exposed through the Driven platform. I really believe that a lot of challenges that we discussed today can be addressed through those signals that are being captured, and address some of the problems which my previous worlds had. I’m pretty excited about that.