Why You’re Thinking About Data Prep All Wrong


In a previous post on Data Informed, “Data Prep Tools Struggle to Keep Pace With Changing Face of Big Data,” I argued that the current functionality of standalone self-service data preparation tools has misled the market. Data preparation has two basic purposes, data repair (or cleansing) and data transformation, and these tools have led people to believe that the first matters more than the second.

Of course, a look back at how we got here, technologically speaking, explains why we are where we are today. (I won’t rehash this here; read the full post on Data Informed for the background if you’re interested.) But just because we can trace the problem back to its root doesn’t mean we, or you, should settle for a mere evolution in functionality, especially in the era of big data.

That being said, beyond being misled to think cleansing is more important than transformation, there’s likely an even bigger reason why you might be thinking about data prep all wrong: all the hype around the still-emerging standalone data prep tools in the first place. VCs keep funding them, tech pubs keep writing about them, and all the major analyst firms and lots of the independent firms are writing reports about them, grading them and lending them credibility. But to the analysts’ credit, most are saying standalone isn’t sustainable. Data prep is not a market in and of itself; at the end of the day, it’s a feature. Let’s dig in there.

Data Prep is a Feature, Not a Market

If you’re looking to take full advantage of big data, to use it to discover the insights that will translate into dollars saved or dollars earned, it’s important to look beyond the feature list of any one tool and understand the role of data preparation in the larger analytic workflow.


Keep in mind, of course, that the word “preparation” really is a misnomer. It implies that data preparation is an early step that you complete once and then move on from. The fact that these self-service data preparation tools are standalone, not baked into a larger end-to-end workflow, perpetuates the problem. It encourages the belief that data cleansing is one stop in a linear workflow: once you’re done preparing the data, you shuttle it off to another tool for analysis, and maybe to yet another tool for visualization. Wrong. Wrong. Wrong.

In big data discovery, data preparation can, and does, take place at any point in an iterative data discovery process. Analysis may reveal flaws in the data, or you might find that you need to add new data. Either outcome dictates new requirements for how the data should be shaped, interpreted, enhanced or cleansed.
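To make that loop concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not how any particular product works: the order records, field names and helper functions (`prepare`, `parse_amount`, `normalize_region`, `average_by_region`) are all hypothetical. A first analysis pass exposes a flaw in the data, which sends us back to add a preparation step before analyzing again.

```python
from statistics import mean

# Hypothetical raw records; field names and values are invented for illustration.
raw_orders = [
    {"region": "east", "amount": "120.50"},
    {"region": "East ", "amount": "80"},
    {"region": "west", "amount": "n/a"},
    {"region": "west", "amount": "200"},
]


def prepare(records, steps):
    """Apply each preparation step, in order, to every record."""
    prepared = []
    for record in records:
        row = dict(record)
        for step in steps:
            row = step(row)
            if row is None:  # a step may drop a record entirely
                break
        if row is not None:
            prepared.append(row)
    return prepared


def parse_amount(row):
    """Cleansing: convert the amount to a float, dropping unparseable rows."""
    try:
        return {**row, "amount": float(row["amount"])}
    except ValueError:
        return None


def normalize_region(row):
    """Transformation added only after analysis exposed 'east' vs 'East '."""
    return {**row, "region": row["region"].strip().lower()}


def average_by_region(records):
    """Analysis: average order amount per region."""
    groups = {}
    for row in records:
        groups.setdefault(row["region"], []).append(row["amount"])
    return {region: mean(amounts) for region, amounts in groups.items()}


# Pass 1: cleanse only, then analyze.
first_pass = average_by_region(prepare(raw_orders, [parse_amount]))

# The first result lists "east" and "East " as separate regions,
# so we return to preparation, add a step, and analyze again.
second_pass = average_by_region(
    prepare(raw_orders, [parse_amount, normalize_region])
)
```

In this sketch the second preparation step could not have been specified up front; the need for it only became visible after the first analysis, which is the point of treating prep as part of an iterative workflow rather than a one-time stop.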

The diversity and volume of data available today are unprecedented, creating data preparation requirements for big data analytics that go beyond the scope of standard self-service data prep. The range of transformations you can perform on these vast datasets is also large and differs markedly from the relatively simple, formal process of extract, transform and load (ETL).



The market is accepting self-service data preparation as a standalone tool for now, but beware as you plan your big data initiatives and investments. Even if data prep is your biggest pain point at the moment, making that one part of the process a little easier is not enough; ultimately, you need it to perform within a larger analytics workflow.

