Data Preparation

What is Data Preparation and Feature Engineering?

Data preparation is the process of cleaning, structuring, and enriching raw data, including unstructured or big data. The results are consumable data assets used for business analysis projects.

In the data science community, data preparation is often called feature engineering. The two terms are frequently used interchangeably, but feature engineering relies more heavily on domain-specific knowledge than the standard data prep process does. Feature engineering creates “features” for specific machine learning algorithms, while data prep readies data for mass consumption.
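
To make the distinction concrete, here is a minimal Python sketch. It assumes a small set of hypothetical order records (the field names and values are invented for illustration): the first function does generic data prep, and the second derives per-customer features that a machine learning model might consume.

```python
from datetime import date

# Hypothetical raw order records; field names and values are illustrative only.
RAW_ORDERS = [
    {"customer": " alice ", "amount": "120.50", "order_date": "2023-01-10"},
    {"customer": "Alice", "amount": "80.00", "order_date": "2023-03-05"},
    {"customer": "bob", "amount": "", "order_date": "2023-02-01"},
    {"customer": "Bob", "amount": "45.25", "order_date": "2023-02-20"},
]


def prepare(rows):
    """Data preparation: normalize names, parse types, drop incomplete rows."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip rows with a missing amount
        cleaned.append({
            "customer": row["customer"].strip().lower(),
            "amount": float(row["amount"]),
            "order_date": date.fromisoformat(row["order_date"]),
        })
    return cleaned


def engineer_features(cleaned, as_of):
    """Feature engineering: derive per-customer inputs for a model."""
    by_customer = {}
    for row in cleaned:
        by_customer.setdefault(row["customer"], []).append(row)
    features = {}
    for customer, orders in by_customer.items():
        total = sum(o["amount"] for o in orders)
        last = max(o["order_date"] for o in orders)
        features[customer] = {
            "order_count": len(orders),
            "avg_order_value": total / len(orders),
            "days_since_last_order": (as_of - last).days,
        }
    return features


cleaned = prepare(RAW_ORDERS)
features = engineer_features(cleaned, as_of=date(2023, 4, 1))
```

The cleaned rows could serve any report or dashboard, whereas features such as `days_since_last_order` reflect a domain judgment about what matters for a particular model.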

Data preparation and feature engineering are among the most time-consuming and important processes in data mining. Correctly prepared data improves the accuracy of the outcomes. However, the work itself tends to be routine and tedious.

Data Preparation Tools

There is a host of tools on the market today that provide data preparation capabilities. They are typically applications meant to streamline and operationalize the data preparation process. These tools are usually found in centralized IT departments, used by data engineers, and designed to batch and schedule data pipelines rather than to explore and discover new analytics assets.

Stand-alone data prep products, such as Datameer X, Datameer Spectrum, and Alteryx, form the foundation of this software market. These applications are designed to transform complex data into consumable datasets for analytics and then create data pipelines that produce those datasets consistently.

The Relationship Between Datameer Spotlight and Data Prep Tools

Data preparation tools are well suited to IT teams that need to make centralized, complex data consumable on a scheduled basis via data pipelines. Spotlight is well suited to exploring and discovering new analytics assets within the lines of business, not only from those centralized data pipelines but also from data that resides everywhere else.

Datameer Spotlight allows analytics teams to find, create, collaborate, and then publish trusted analytics assets in complex hybrid landscapes. Spotlight provides unified access across analytics silos, increases the use of analytics assets, and furthers data knowledge.

Spotlight is built for ad-hoc analytics and includes key data prep capabilities, so analytics professionals can quickly enrich most assets themselves rather than relying on the centralized data pipelines and procedures used in data preparation tools. With Spotlight, professionals can directly:

  • Profile Data: Users select individual datasets and view the profiled data with column names, sample rows of data, and column metrics.
  • Personalize Data: Users personalize their data by applying any number of analytical operations. Spotlight simplifies all the common data prep procedures, including blending, extracting, filtering, replacing, and splitting. Spotlight also includes more advanced capabilities for power users, including SQL programming and JSON transformations.
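
For readers unfamiliar with these operations, the following plain-Python sketch shows what profiling, replacing, splitting, filtering, and blending look like on a tiny dataset. It is not Spotlight's interface or API; the datasets, column names, and the `profile` helper are hypothetical and exist only to illustrate the concepts.

```python
from collections import Counter

# Hypothetical sample datasets; column names are illustrative only.
customers = [
    {"id": 1, "name": "Ada Lovelace", "region": "emea"},
    {"id": 2, "name": "Grace Hopper", "region": "NA"},
    {"id": 3, "name": "Alan Turing", "region": None},
]
orders = [
    {"customer_id": 1, "total": 250.0},
    {"customer_id": 2, "total": 90.0},
    {"customer_id": 1, "total": 40.0},
]


def profile(rows):
    """Return per-column metrics: non-null count, distinct count, top value."""
    metrics = {}
    for column in rows[0]:
        values = [r[column] for r in rows if r[column] is not None]
        top = Counter(values).most_common(1)
        metrics[column] = {
            "non_null": len(values),
            "distinct": len(set(values)),
            "most_common": top[0][0] if top else None,
        }
    return metrics


# Replace: fill missing region codes and normalize the rest to upper case.
replaced = [{**r, "region": (r["region"] or "unknown").upper()} for r in customers]

# Split: break the full name into first-name and last-name columns.
split = []
for r in replaced:
    first, last = r["name"].split(" ", 1)
    split.append({**r, "first_name": first, "last_name": last})

# Filter: keep only rows with a known region.
filtered = [r for r in split if r["region"] != "UNKNOWN"]

# Blend: join each customer to the sum of their order totals by customer id.
totals = Counter()
for o in orders:
    totals[o["customer_id"]] += o["total"]
blended = [{**r, "order_total": totals[r["id"]]} for r in filtered]
```

Calling `profile(customers)` before any changes reveals the missing region value, which is what motivates the replace and filter steps that follow.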

How Spotlight Works with Data Prep Tools

Spotlight builds trust in analytics assets through a community of experts. It works interactively with data prep outputs through virtual queries, allowing analysts to discover, access, and use those datasets with ease. Data preparation tools can continue to be used for data engineering purposes, producing data pipelines and robust datasets for the enterprise. End users build on the hard work of the data engineering team by tagging, publishing, and sharing these datasets in real time, which promotes greater use of these assets, all in a SaaS solution.

Benefits From Integration

A cooperative environment between Datameer Spotlight and data preparation tools provides customers with many benefits:

  • Data engineering can still use traditional data prep tools for batching and scheduling data pipelines, while Spotlight allows analysts to use all those pipelines easily.
  • Spotlight provides a way for analysts to discover, consume, and build knowledge around any data, and to give real-time feedback within Spotlight on the outputs from data preparation applications.
  • Data downtime and costs are minimized: data engineering teams can use Datameer Spotlight to understand who consumes data prep outputs, how often, and when, and then optimize and prioritize those workstreams and data pipelines.
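
As a rough illustration of the last point, the sketch below summarizes who reads each dataset, how often, and when it was last read. The access-log format, dataset names, and `summarize_usage` function are assumptions made for this example and do not represent how Spotlight records or exposes usage.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical access-log entries: (user, dataset, ISO timestamp).
ACCESS_LOG = [
    ("ana", "sales_pipeline", "2023-05-01T09:15:00"),
    ("ben", "sales_pipeline", "2023-05-01T10:30:00"),
    ("ana", "sales_pipeline", "2023-05-03T14:00:00"),
    ("ana", "churn_pipeline", "2023-05-02T08:45:00"),
]


def summarize_usage(log):
    """Summarize who read each dataset, how often, and when it was last read."""
    summary = defaultdict(lambda: {"users": set(), "reads": 0, "last_read": None})
    for user, dataset, stamp in log:
        when = datetime.fromisoformat(stamp)
        entry = summary[dataset]
        entry["users"].add(user)
        entry["reads"] += 1
        if entry["last_read"] is None or when > entry["last_read"]:
            entry["last_read"] = when
    return dict(summary)


usage = summarize_usage(ACCESS_LOG)
# Rank datasets by read count so the busiest pipelines get attention first.
ranked = sorted(usage, key=lambda name: usage[name]["reads"], reverse=True)
```

A ranking like this gives a data engineering team a simple basis for deciding which pipelines to optimize first and which rarely used outputs might be retired.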