
Data Preparation & Pipelines for Data Science using Datameer

Learn how you can use advanced data preparation and exploration to explore your data to determine fit, rapidly shape your data for AI and ML engines, and deploy data pipelines to create a cooperative workflow with AI and ML tools.


About The Data Preparation & Pipelines for Data Science Book

Some significant factors are limiting the success of AI in the enterprise, though, because it can be hard to integrate into operational processes. Part of the reason is that AI tools are, to varying degrees, primitive and segregated from mainstream analytics tools. In reality, AI should be connected to all the other work you do with data; it can’t sit in an isolated zone unto itself. Data Science is not an island.

If you’re going to do machine learning work right, you’re going to need well-honed data sets on which to build your models. It is not just about cleaning the data. It is about finding more data to increase accuracy and discover data that may be more relevant to the problem at hand.

The Gaps in Today’s Workflow

Using disjointed tools for various data science tasks can slow down the analytics cycle and create extra tasks.


Creating a Cooperative Workflow

See how data preparation and exploration and data science modeling and execution can be combined into a more agile data science workflow.


Data Exploration Role

See how large-scale data exploration enables your data science teams to find the right data for the task at hand and create more accurate models.


A Real-World Example

Walk through a real-world example of a cooperative data science workflow using Datameer for data preparation, exploration, and pipelines, and SparkML for data science.


Our architecture enables a far more efficient – and repeatable – machine learning workflow. Data starts in Datameer, where it is explored and shaped, allowing for code-free preparation up front. Datameer simplifies exploring the data for relevance and eases feature engineering, so you create the most appropriately shaped data set the first time around.

Datameer’s output is exported to Spark, where it is used to build, train, and test an ML model. Finally, ML model test results are round-tripped back into the exploration platform for detailed model validation. If the model’s accuracy is not satisfactory, the data can be further refined and the entire process repeated, creating a virtuous cycle.

Our workflow, then, has three phases: the preparatory workflow; model design, training, and testing; and model performance validation.


Notebooks are a great place to build and test models, and focusing them on this task greatly reduces coding effort. You’ll get the fastest workflow with the greatest reusability by combining the preparation and exploration platform with the machine learning platform. In this section, we would like to show you how that can be done. First, we will go through a real-world example with a well-known public data set: anonymized U.S. census data on personal income. Then, we will describe how we might examine the data in Datameer and build a machine learning model on Apache Spark with Python code in a notebook.

We’ll also run some data through the model to test it, and we’ll bring a data set back into Datameer that includes that data, the model’s predictions, and the actual values for the column it tried to predict.
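Once that result set comes back, the validation step boils down to comparing predicted and actual values row by row and annotating each row before re-import. A minimal sketch of that comparison, with hypothetical field names not taken from the ebook:

```python
# Hypothetical round-trip validation: compare the model's predictions
# against actual values and flag each row. Field names are illustrative.
import csv
import io

scored_rows = [
    {"age": "39", "actual_income": "<=50K", "predicted_income": "<=50K"},
    {"age": "50", "actual_income": ">50K", "predicted_income": "<=50K"},
    {"age": "38", "actual_income": ">50K", "predicted_income": ">50K"},
]

# Flag each row as a hit or a miss, then compute overall accuracy.
for row in scored_rows:
    row["correct"] = row["actual_income"] == row["predicted_income"]

accuracy = sum(r["correct"] for r in scored_rows) / len(scored_rows)
print(f"accuracy: {accuracy:.2f}")  # 2 of 3 rows correct

# Write the annotated result set as CSV, ready to import back into
# the data preparation platform for detailed inspection.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=scored_rows[0].keys())
writer.writeheader()
writer.writerows(scored_rows)
```

Keeping the actuals, predictions, and a per-row correctness flag in one data set is what lets the exploration platform slice the misses and suggest where the next round of data refinement should focus.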

Get the Data Preparation & Pipelines for Data Science Ebook
