My chat with Informatica’s Anil Chakravarthy touched on the subject of database schemas, ETL and dynamic mapping. With the growing number and complexity of data sources, Anil argues that a purely static schema has only limited use and that flexibility is critical. He also points out that technology doesn’t have to provide the perfect answer, but it should save time, which, to me, is the most valuable asset.
Enjoy the next episode of Big Data & Brews!
Stefan: What’s your perspective? As you said, there’s a growing number of data sources, and more insights take shape as you enrich the data. Is it really hard now to define the static schema the way we used to?
Anil: Yeah, absolutely. Let me actually, just because it’s good conversation, I’ll start with the other extreme because the schema discussion usually goes from either …
Stefan: Black to white.
Anil: Yeah. Either everything is fixed or nothing is fixed. As you mentioned earlier, when I was at Symantec, one of the businesses, product groups, that I ran was data loss prevention, the DLP business. There, there is no schema. It’s basically, how do you handle unstructured data, especially over email? Somebody might be sending social security numbers, etc. What do you do in those cases? DLP became a very successful category by just having, essentially, regular expressions that look for certain data.
Has that been enough? Clearly not, because you look at what’s going on in the world of breaches etc. It’s necessary, but not sufficient. That’s what has shaped our world view. You don’t want to insist on schema everywhere. There will be many, many types of data where you can do perfectly good processing without schema. That is not sufficient by itself. Even in the world of security that we’ve been talking about now, like we just talked about, you need to understand metadata. You need to understand what is valuable data. You cannot combine it with other schemaless… You might, for example, have SharePoint documents where you’ll never get any schema, but they still contain valuable information. In order to protect that data and process it, you need to be able to do it in a manner that doesn’t depend on schema.
We don’t have a religious view, if you will. I believe that it’s actually a spectrum, depending on what you’re doing and depending on how accurate it needs to be. For many purposes, it may be that if you have heuristic methods that can work with schema-less data, you can still derive valuable uses and insights, but there may be other ones. For example, if you’re doing a banking transaction, that is absolutely unacceptable. If you are doing a search for closing of data sets that you’re just looking for, some human-assisted activity, that may be perfectly acceptable. The way we view it is: can we understand the use case and map it to that?
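To illustrate the kind of schema-free, regular-expression scanning Anil describes for DLP, here is a minimal Python sketch. The pattern and the `find_ssns` helper are assumptions made for this example, not how any particular DLP product works; it only matches numbers written in the common `123-45-6789` form and will miss other formats, which is the "necessary, but not sufficient" point.

```python
import re
from typing import List

# Assumed pattern: a US social security number written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssns(text: str) -> List[str]:
    """Return every substring of `text` that looks like an SSN."""
    return SSN_PATTERN.findall(text)


# Hypothetical email body; no schema is needed to scan it.
email_body = "Hi, my SSN is 123-45-6789, please update the record."
print(find_ssns(email_body))
```

A scanner like this needs no knowledge of where the text came from, but it also has no idea which matches are actually valuable data, which is where metadata comes in.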
Stefan: The way we see the world is really the schema-on-read perspective, where the challenge that we observe with our customers is that their environment is changing so quickly that the traditional (pick your favorite: Oracle, DB2, Teradata) locked-in, static schema approach doesn’t work anymore.
Stefan: I couldn’t agree more with you that especially in banking, or pick your favorite area, you need a schema, but the idea for us is that we virtualize the schema. I believe data is always in motion, so you basically say, “Okay, here are my data-generating sources, the data is flowing, and eventually here I have to create maybe a loan risk number. How do I get from here, with all those different data sources (as you said, maybe structured, maybe unstructured, maybe time-series data), to here?” The idea we really implement in our products is we say, well, Hadoop is Moore’s Law on steroids because you have as many machines in a cluster as you want. Why would we pre-optimize? Why do we create those kinds of things?
Of course, we still do under the hood, but we cut out that slow, human-driven, IT-driven process where you spend months and months and months trying to model the perfect schema. I’d rather say, okay, well, this is the schema you need. We create this as a view on the data. The cool thing is that you can create as many views on the data as you want. Cleaning the data is a view, aggregating is a view on the cleaning view, and then whatever your predictions are, they are just another view on that view, on that view. If you decide, actually I missed a data role here, you just adjust it down here and it flows all the way through, because everything is a lens on the lens on the lens.
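The "lens on the lens" idea can be sketched in a few lines of Python. This is only an illustration under assumed names (`raw_events`, `clean_view`, `totals_view`, `risk_view` and the threshold are all hypothetical), not the product's implementation: each view is computed on read from the view beneath it, so a change at the bottom flows through every layer.

```python
# Hypothetical raw data: amounts arrive as untyped strings.
raw_events = [
    {"customer": "a", "amount": "100"},
    {"customer": "a", "amount": "n/a"},
    {"customer": "b", "amount": "250"},
]


def clean_view():
    """View 1: keep rows with a numeric amount and cast it to float."""
    return [
        {"customer": r["customer"], "amount": float(r["amount"])}
        for r in raw_events
        if r["amount"].isdigit()
    ]


def totals_view():
    """View 2: aggregate the cleaned rows into a total per customer."""
    totals = {}
    for r in clean_view():
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals


def risk_view(threshold=200.0):
    """View 3: a toy 'prediction' layered on the aggregate view."""
    return {customer: total > threshold for customer, total in totals_view().items()}


print(risk_view())  # computed on read from the current raw data

# Adjust the bottom layer; every view above picks it up on the next read.
raw_events.append({"customer": "a", "amount": "150"})
print(risk_view())
```

Nothing is materialized up front here, which trades extra compute on every read for not having to model the schema in advance.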
Anil: That’s right, that’s right. It makes a lot of sense. By the way, actually it may … I don’t know if it’s surprising, because we’ve been in the world of ETL for a long time. It’s very interesting how, even in such a, if you will, traditional world, these concepts are absolutely making their way. I’ll give you an example. In our next version of a product, we’re introducing dynamic mapping.
A lot of customers are like, “Hey look, we use a lot of that for things like application consolidation. I’m simplifying my application architecture and moving data from five different sources to there. I don’t want to recall that mapping five different times.” It may actually be, “I don’t want to recall it for every table in those data sets, so can you introduce the concept of … Probably the most popular concept?” I think your point is well taken that a purely static schema has only limited use. I think the flexibility is critical.
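One way to picture a mapping that is defined once and reused across sources is a set of rules keyed on column-name patterns rather than on specific tables. The Python sketch below is a guess at the general idea only; it is not Informatica's dynamic mapping feature, and the rule format, `apply_mapping`, and the sample tables are all hypothetical.

```python
# Each rule pairs a column-name test with a transformation.
# Assumed conventions: "*_id" columns hold integers, "*_name" columns hold names.
RULES = [
    (lambda col: col.endswith("_id"), int),
    (lambda col: col.endswith("_name"), lambda value: value.strip().title()),
]


def apply_mapping(rows, rules=RULES):
    """Apply every matching rule to every column, whatever the table layout."""
    mapped = []
    for row in rows:
        new_row = {}
        for col, value in row.items():
            for matches, transform in rules:
                if matches(col):
                    value = transform(value)
            new_row[col] = value
        mapped.append(new_row)
    return mapped


# Two hypothetical sources with different columns share the same mapping.
crm_rows = [{"cust_id": "7", "cust_name": "  ada lovelace "}]
billing_rows = [{"account_id": "42", "holder_name": "grace hopper", "balance": "10.5"}]

print(apply_mapping(crm_rows))
print(apply_mapping(billing_rows))
```

Columns that match no rule, such as `balance` above, pass through unchanged, so adding a sixth source does not require writing a sixth mapping.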
Stefan: What we’re observing in the market is certainly that time is becoming the biggest obstacle for everything. Everything needs to be more agile, with faster turnaround times, because the perception of IT is that it’s a cost center.
Stefan: The reason it costs us is the people and their time. Productivity gains are worth it even if you have to pay heavily in, maybe, CPU, which is now so incredibly cheap that it’s not a problem anymore.
Anil: Correct, correct.
Stefan: I think what’s happened in the last couple of decades is that it’s shifted around.
Anil: That’s right.
Stefan: Computers are so cheap now that we can just throw them at every problem, whereas human beings, rightly so, became much more expensive than the hardware, which I’m pretty happy about. We’re really focusing on how simple we can make the creation of the schema-on-read approach because yes, you have diverse data sets and you somehow have to bring them together just to make a line chart.
Anil: That’s right.
Stefan: You have to have an X and a Y.
Anil: Exactly, exactly. That’s exactly right. At Informatica, obviously, we pride ourselves on the developer community we have, meaning we have over a hundred thousand developers worldwide who use Informatica tools. That’s their primary tool that they use. From that perspective, what we have recognized is that these technologies are very useful to enhance their productivity. Ultimately, the technology doesn’t have to provide the perfect answer. If it gets close enough, it can save the developer a lot of time. I completely agree with that point.