Big Data & Brews: Michael Zeller, CEO of Zementis, Part 1
As I said in this episode with Michael Zeller, the CEO of Zementis: I love my job.
0:08 Welcome to today’s episode of Big Data & Brews, today with Mike. Mike, can you introduce yourself and your brew?
I’m Mike Zeller, CEO of Zementis. We focus on operational predictive analytics, and my brew here is Franziskaner Hefeweizen, very German, Bavarian.
Are you German?
I am German, so this is the German episode of Big Data & Brews, I think, so we’ll try a summer-like beer, a Hefeweizen, which is a yeast-like beer.
Which is a meal almost, right?
It’s a meal, and you have to be careful how you pour it into your glass, that’s the first test, so, we’ll try that one
Well let me open it, so you grew up in Bavaria?
I did, yes.
Well I didn’t, so I have no Hefeweizen skills whatsoever, so I will learn today.
OK we’ll try it.
1:05 In fact I don’t usually drink Hefeweizen, so I will look at you, and try to make the same thing, and of course we don’t have the right glasses, I apologize, we will fix that next time.
The important thing with Hefeweizen is that you tilt your glass, and that you pour very slowly, otherwise all you get is a glass full of foam, and you don’t want that. And then at the end, before you pour everything, you kind of swish a little bit so you can get the yeast at the bottom. So this is unfiltered, this is the yeast, and you need that for your beer to make it complete.
There’s a whole technique to that, you know usually I just open and drink.
No, there’s a, it takes its time, and then you pour it in, and we’ll finish it next time. With the right glass, you’d get everything. This is a little tricky but it’s pretty good for now.
I could drink off of it.
No, no no, that’s a no no. Hefeweizen you always have to drink from a glass.
Yeah, I love my job. I just drink beer and talk to people.
That’s a good start, yeah.
So give us a little bit on your background, and then on Zementis and more on your technology but let’s start with your background.
Yeah so my background, don’t hold it against me, but I’m a physicist, so I studied machine learning I used it in robotics, vision processing, and studying machine learning algorithms and how the brain works.
2:59 By the way, I think the most amazing architects and computer scientists never studied computer science, so if you go around they all did bioinformatics, or biology. Our CTO is a biologist, and a good friend of mine, Hans Dockter, who does Gradle, he’s a physicist, so I think that’s a good sign.
Yeah I think coming from different backgrounds gives you the ability to combine different technologies. Interdisciplinary research is important so here we have machine learning, you know how does biology do it, how does our brain do it, translated into math, algorithms, and what we do with Zementis is really deployment of predictive models. It’s kind of back to my roots, making it easier to operationalize predictive analytics. So, taking those advanced models, deploying them very easily, kind of mainstream operational systems like in a business process, in a database system, in Hadoop, really making it super easy for anyone to utilize those models. I think that’s the key aspect of big data, big data brews, right? So the key aspect of big data is using it to make better decisions, and you know predictive analytics really is a great example for that.
Before we go more into your technology, give us a little more about the history of your company, how you guys got together, when did you start it, did you start it in Germany?
No, we started it here in San Diego actually.
Oh well twist my arm, San Diego!
It’s not the worst place in the world and it’s actually a very good place for predictive analytics. It’s kind of a hotbed of analytics companies, so we started there with a core team that came from different backgrounds, software engineering, machine learning…
A few surfers…
I don’t think we have surfers…
Really, I would have thought surfers in San Diego
Yeah but we’re too serious of a crowd, haha. So we actually started in 2004, so quite some time ago with an exclusive footprint in the financial industry, which for a while wasn’t the best industry to work in, but we’ve since diversified quite a bit. Coming from that background we built a scoring engine, so to speak, that allows you to deploy predictive models.
And did you work in the financial service industry before?
I did, yes, yeah good question it was actually a consulting engagement I had in the financial industry, I was consulting for different companies, and it really presented an opportunity to introduce them to advanced machine learning algorithms, so that’s kind of the ignition of the company where it allowed us to start out with clients from day one which was a very fortunate opportunity.
6:01 I think a lot of great companies started out this way, like Qliktech, or others where they first really worked in the industry and they had all of this experience and they basically productionalized their learnings, and they didn’t have to hold their fingers in the wind and see what kind of product to build because they kind of did it a few times.
Yeah we kind of felt the pain, so we were initially building predictive models and then we had the need to deploy them and it was always very painful, all this custom code, a lot of work involved, all these one-offs, and that really gave us the idea for the product that we have today.
So let’s talk a little bit about the product, what is the product doing, what are the borders, where do you need other tools, where is it integrated with other things. Let’s say I’m a… I don’t want to say I’m a data scientist because I don’t believe in it, let’s say I’m a business analyst and I want to do some loan scoring, or maybe, is that a good use case? What is kind of the hello world for predictive?
The hello world for predictive? Fraud detection…
Yeah? Okay. Let’s score… I happen to have 100 trillion credit card transactions. How do I deal with that? So help me, where is your tool sitting, where are others coming in, how do I make that happen?
Okay so there’s always the two worlds we face. One is the data scientist, or the machine learning expert, that really builds advanced models that predict risk, predict fraud, predict up-sell and cross-sell opportunities.
And how do they build this, do they write code?
In various different tools typically, so we have a great selection of commercial or open source tools, commercial like SAS, SPSS, KXEN, open source like R, KNIME, RapidMiner, a lot of tools for data mining scientists to build models. But once you have that model, that’s where we start really with our tools. It’s how do you operationalize them, how do you really run them, scalably, through 100 million, a billion transactions.
8:13 So RapidMiner or SPSS is more kind of a design environment, where I build something, and then where do I run it? So you are maybe the application server for predictive models?
You can think of it like that, yeah, think of it, the scoring engine, the application server, the deployment shell so to speak where you plug in one or many models to apply them in whatever your most favorite operational environment is actually.
So how do they communicate? If I build something in SPSS or RapidMiner, do I give you some Python code, or what’s the…
Yeah so what we use is an open standard called PMML, the Predictive Model Markup Language.
Sounds like XML,
It’s an XML representation of your model and the idea is not to move code. So not to move custom code from the data scientist’s desktop to the operational environment. It’s very complex, very hard to keep track of and then to QA, so the idea is to have a common standard, PMML, an XML representation of your model that I can first of all export from all data mining tools, commercial or open source, and then consume in our engine and deploy on various different platforms.
So just so I get this right, so I build something in RapidMiner or SPSS or whatever it may be, then export it to PMML, and now I have 100 trillion records over here in a database, on a Hadoop system, and then now what do I do?
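To make the "XML representation of your model" idea concrete, here is a minimal Python sketch. The PMML document below is hand-written for illustration (it was not exported from any real tool), the field names `amount`, `txn_per_hour`, and `risk` are made up, and the tiny loader only understands a plain linear `RegressionModel`. It is not the Zementis engine, just a picture of how a consumer can read a model's parameters from XML instead of receiving custom code.

```python
import xml.etree.ElementTree as ET

# Hand-written, illustrative PMML document (an assumption, not a real export).
PMML_DOC = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="Toy fraud-risk regression"/>
  <DataDictionary numberOfFields="3">
    <DataField name="amount" optype="continuous" dataType="double"/>
    <DataField name="txn_per_hour" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="amount"/>
      <MiningField name="txn_per_hour"/>
      <MiningField name="risk" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="amount" coefficient="0.25"/>
      <NumericPredictor name="txn_per_hour" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

# Namespace used by the document above.
NS = {"pmml": "http://www.dmg.org/PMML-4_1"}


def load_linear_model(pmml_text):
    """Read the intercept and per-field coefficients from the PMML text."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(model, record):
    """Apply the linear model to one record (a dict of field name -> value)."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_linear_model(PMML_DOC)
print(score(model, {"amount": 4.0, "txn_per_hour": 1.0}))
```

The point of the sketch is that the scoring side never sees the data scientist's code, only the exported parameters.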
Now you apply that model on your data,
And your engine is where this is executed?
And you integrate your engine into Hadoop or, what are the different areas there?
Yeah Hadoop is one option. The idea is really to have many different choices for your platform. So Hadoop is one, through Datameer, through Hive. In database, your classical data warehouse, we run on top of Netezza, on Greenplum, on Teradata, and Sybase IQ, and so it really depends on the customer’s current IT infrastructure to make it easy to deploy. And you pick the right environment. So we have real-time environments, they’re kind of standalone, they can run in the cloud, we also have a cloud-based solution on Amazon…
Are people really using this? We see people not so happily pushing data to the cloud, what’s your experience?
It depends, there’s quite a few use cases, but if you’re looking at…
Are the banks moving the credit card transactions into the cloud?
Not really, that industry… whenever you have a regulatory environment, the cloud most often is not the choice. But for proofs of concept, for smaller consulting firms, for marketing applications there are many, many use cases that you can use in the cloud, where it’s not personally identifiable information, and it’s a pass tool, you don’t necessarily have to store your data in the cloud, but we also have clients that already have their data in the cloud so no questions are asked. But regulatory environments, or government, keep that in house of course. But it’s really about simplicity and the choice of the customer to deploy predictive models wherever they need to go, so they don’t have to worry about it.
11:47 So you said predictive models so that basically means you’re really limited to models you learn you build and then you execute. Right so there’s no… Is there classification?
But not clustering?
There’s clustering capabilities, yeah.
You can apply cluster models, classification models, it’s really across the mainstream of what I would call predictive models, from linear and logistic regression to the machine learning algorithms like neural networks, support vector machines, all the way to ensemble models, which were used for example in the Netflix Prize. So again the most complex models, and the more complex they are, the easier it is to move them from the data mining scientist to your operational environment using the standard and not custom code.
And right that would be PMML and Zementis running that. So your product is written in C++ then to be super fast?
No, actually our product is written in Java.
Oh wow I like Java, that’s good
There’s not much of a difference there and the benefit with Java you get is you don’t have to recompile on every platform and you can run across different target platforms and that’s really enabled us to spread across our partnerships very quickly.
And what would you say if I was the bearded tech guy who said well but Java is super slow.
It’s not as slow as it used to be, it is much faster, and I think the differences are marginal and the benefits you get from not being tied to very specific compiled code on a specific platform outweighs the small increase in speed you could get.
Yeah hardware isn’t that expensive anymore
So let’s come to the use cases, what do you see as the main use cases in your customer base. So maybe let’s even step back, what are the verticals you’re seeing? And then are there specific use cases in those verticals?
Yeah so verticals, this is a platform application same as your data warehouse, as your Hadoop system would be across all industry verticals and that’s what our customers span from consumer electronics to telecom, financial industry to government…
So it’s very diverse.
It’s very diverse, it’s really a true platform application. It’s sometimes difficult to be able to serve all of these different verticals, because you have to speak the language to a certain extent, but if you think about predictive models, they’re the same mathematical models, the same algorithms you use everywhere, so the scoring of one model in marketing is the same as a similar algorithm in the financial industry. So that makes it really a very universal application and that’s the fun part of it, you can play anywhere you want to go.
14:49 And are there any main, where you say, here this is blue-printed, yeah we do this all the time? Beside you sounded like credit card?
Yeah fraud detection in general has been kind of the marquee use case of advanced predictive analytics. I think the financial industry is ahead of the curve in adopting predictive, but we see a lot of marketing analytics, cross-sell, up-sell, churn prediction in the telecom industry, so those are the classical mainstream use cases. You see a lot of machine predictive maintenance and quality control now that sensors are becoming so cheap. And I think we’re at this very interesting point in time where processing power becomes very affordable, cheap, storage, so massive amounts of data you can keep around, you can move around, you can work with it. The algorithms are well known, so it’s really us creating business value out of that. I mean it’s really a very interesting point in time where things that we did in research 10-20 years ago, very hard, on dedicated clusters, are now standards-based, very easy, and you can do them on everybody’s desktop.
Yeah. I want to spend more time on the use cases because they’re so much fun. You said predictive services, what would be the example, you would predict when my car would need new brakes? What’s the most fun one you can share?
Um for example we could look at vibration signatures of rotating equipment…
Yeah rotating equipment like vacuum pumps, for example, are used in many semiconductor processes, so having those fail is bad for the overall process, so you’d like to know in advance when those elements might be about to fail. Similar use case in the energy grid. You’d like to look at all of your transformers in the field, look at voltages, currents, temperatures, they all have sensors in them so you can detect when something is wrong with the equipment, and not necessarily send a maintenance crew out there on a quarterly interval just to check on it. It’s really about being more predictive and more dedicated so we can do more with fewer resources.
So is that about you would learn the model, or, how is that, so you execute the model, but how are people coming up with these dimensions, this sensor, this sensor, that temperature is that where the data scientists come into play?
Exactly, that’s kind of the offline process. You always have a learning, a data exploration process where you look at your data, your outliers, you look at different variables, what you have, what you can do, and then you build your model. Kind of the scientific aspect of the process, and then our products almost sit at the tail end of that where it becomes a repeatable process, where you have that model but you want to execute it every time you get a sensor reading, in real-time potentially. Or you have a nightly batch model where you run through 100 million transactions. Again it’s where repeatability is very important and being able to crunch, to apply those models over and over again.
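As a rough illustration of that "build once, apply over and over" split, the Python sketch below applies one fixed scoring function either to a single sensor reading as it arrives or to a whole nightly batch. The logistic form, the weights, and the field names `vibration_rms` and `temperature_c` are hypothetical, chosen only to make the example runnable; a real model would come out of the offline data exploration step described above.

```python
import math

# Hypothetical parameters, standing in for a model built offline.
WEIGHTS = {"vibration_rms": 2.0, "temperature_c": 0.125}
BIAS = -12.0


def failure_probability(reading):
    """Score one sensor reading with a fixed logistic model."""
    z = BIAS + sum(w * reading[name] for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def score_batch(readings):
    """Apply the same model to many readings, e.g. in a nightly batch run."""
    return [failure_probability(r) for r in readings]


# Real-time style: score a single reading as it arrives.
latest = {"vibration_rms": 1.0, "temperature_c": 80.0}
print(failure_probability(latest))

# Batch style: score a list of stored readings with the same function.
history = [
    {"vibration_rms": 0.5, "temperature_c": 60.0},
    {"vibration_rms": 3.0, "temperature_c": 95.0},
]
print(score_batch(history))
```

Nothing about the model changes between the two modes; only the way records are fed to it differs.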
18:29 So how is Zementis then deployed, what’s the infrastructure, what are the moving pieces, where is it running on? Maybe you can make a little drawing, how is it all working together?
Yeah let me show you that part. If you start with a data scientist, we often have R, or SPSS or SAS models, so here we would have, I can do a little stick figure maybe for the data scientists..
He needs a crown because it’s the new king in the big data world, right?
And the big brain so this is really the art form, exploring data, building models, this is the most complex part of it. You know you have data, you’re trying to explore it, you’re going to build a certain model and the other extreme you have different database platforms so you have database platforms, maybe Hadoop next to it, or you have something like realtime or cloud, let’s do cloud maybe, just different platforms.
And what different databases do you guys support? Or run in, or run off…
So typically your data warehouse appliances that are massively parallel that are similar to Hadoop but more structured
Yeah like Netezza, Teradata, Teradata Aster, Greenplum, Sybase IQ, and
And you guys integrate with all of them?
Yes we integrate with all of them.
Oh that’s really cool.
So the fun part is that here you can leverage existing infrastructure, like on massively parallel executions you have many nodes on a database, on a data warehouse, you have many nodes on your Hadoop platform where we integrate with Datameer of course, or Hive, and in the cloud you can have many different servers serving requests, or this is more a real-time application, so you have different target platforms that you can deploy your models to. So the glue in between is really the standard that we use, the PMML, I mean that’s the key aspect of it. PMML allows you…
20:54 Is that an ISO standard? Or what is that?
It’s an industry standard, so it’s really developed by vendors, it’s driven by vendors. R is of course an open source tool, but SPSS, SAS, and those guys, SPSS now of course being a part of IBM. But the idea is to move models you develop on the left side very seamlessly over to the right side and not be locked in to a single vendor, not be locked in to a single platform, so you can take your choices here, find the best data scientist you can find,
Free of tool choice, right,
Right, free of tool choice, or, pick, build your decision tree in R, you build your neural network in SPSS, take your choices, and then combine all of those models onto your operational platforms.
Is it common then that people run multiple models after each other, sort of stream the data through different models, is that a use case?
Not necessarily, it’s an option, but you’d usually run them in parallel, say where model performance would matter, so you can compare. But most often you have dedicated models for different tasks, or ensemble models that work together…
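The difference between running models side by side to compare them and combining them into an ensemble can be sketched in a few lines of Python. Both "models" here are hypothetical stand-in functions with made-up thresholds, and simple averaging is only one assumed way of combining scores; nothing here reflects how any particular product does it.

```python
from statistics import mean


# Two hypothetical stand-in models scoring the same record.
def tree_model(record):
    """A one-split decision rule, standing in for a decision tree."""
    return 0.9 if record["amount"] > 1000 else 0.1


def linear_model(record):
    """A capped linear score, standing in for a regression model."""
    return min(1.0, record["amount"] / 2000)


MODELS = {"tree": tree_model, "linear": linear_model}


def score_all(record):
    """Run every model on the same record so the scores can be compared."""
    return {name: model(record) for name, model in MODELS.items()}


def ensemble_score(record):
    """Combine the individual scores by simple averaging."""
    return mean(list(score_all(record).values()))


record = {"amount": 1500}
print(score_all(record))
print(ensemble_score(record))
```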
So we just talked to a customer and their biggest problem is like how do you really manage those models. I guess since PMML is really XML do you check them into a versioning system or are there any cool good tools to kind of manage the lifecycle of predictive models, is that something you do?
We don’t manage the lifecycle per se, we allow you to deploy one or many models in our engines that sit on the different host platforms, but what you can do since this is XML you can basically check into your source code control and manage it just in the same way, and often that’s what clients choose to do because they already have code control, source control and you have many tools on the data mining scientist side, and the commonality is really the underlying IT process where you manage the deployment of models as part of your infrastructure.
And then when I run my data through the model that is executed in the Zementis engine then I basically generate data, right, I get a score or like a decision okay then so I can deal with it myself, put it in another table, update a record, or write a new sequence file, I guess.
Right so you would write it into your database into your Hadoop file system or you would integrate it into your business process through a web services call really so the idea is to go from data to a smarter decision as easy as possible as fast as possible.
23:39 So that was an interesting one, integrating it into your business process, so would I use Zementis, say I sign up for a home insurance or home loan or something you know maybe healthcare insurance, would my data be sent to Zementis in the cloud or the application server and then come back like “ah, approved” or “not approved”?
Yeah so a client is integrated that way, but for sensitive information it’s typically deployed in house, as part of their custom solution, you know, their most trusted data center, so it depends on the use case scenario.
So it’s not just batch operation, it’s also interactive, where I sign up on the website and I instantly get kind of like…
Yeah and the cool part is that you don’t have to know this when you build the model. I can score everybody I have in my 100 million all at once or interactively, so you have either big data or small data, and in the case of small data, you know, latency is often important, so fast data. So instead of doing 100 million massively parallel you just do one singular transaction in just 10 seconds.
Okay so basically I can use whatever platform I want to on both sides, on design and execution, and that is helpful,