Datameer Blog post
Datameer and IBM Cloud Private for Data
by John Morrell on Apr 06, 2018
Many CEOs see Artificial Intelligence (AI) and Machine Learning (ML) as key to gaining competitive advantage in their respective markets. A 2017 survey of Fortune 500 CEOs found that 81% of respondents listed AI/Machine Learning as a “critical area of investment”, ranking just behind Cloud and Mobile Computing (91% and 86% respectively).
Yet for all the hype surrounding AI and machine learning, projects remain in the science lab, and few companies use the technology effectively. Another recent report by MIT Sloan, Reshaping Business with Artificial Intelligence, showed that while 85% of the 3,000 business leaders surveyed believed AI would enable competitive advantage, it remains a missed opportunity, with only 20% of respondents actually taking advantage of it.
The report goes on to discuss the challenging role of data in the AI process, stating:
No amount of algorithmic sophistication will overcome a lack of data.
Data collection and preparation are typically the most time-consuming activities in developing an AI-based application, much more so than selecting and tuning a model.
Companies sometimes erroneously believe that they already have access to the data they need to exploit AI.
Even if the organization owns the data it needs, fragmentation across multiple systems can hinder the process of training AI algorithms.
IBM Cloud Private for Data
In an effort to help companies eliminate these barriers and drive broader business-level adoption, last Friday IBM announced its Cloud Private for Data offering to help data science and analytics initiatives flourish. It is a comprehensive offering described as:
An integrated data science, data engineering and app building platform designed to help companies uncover previously unobtainable insights from their data and make richer data-driven decisions.
Cloud Private for Data runs on Kubernetes open-source container software, enabling it to be deployed in minutes for dramatically faster time to value. A microservices architecture supports all aspects of managing data, creating insightful analytics, and deploying these analytics to the business.
IBM Cloud Private for Data includes key capabilities from IBM’s Data Science Experience, Information Analyzer, Information Governance Catalog, DataStage, Db2 and Db2 Warehouse, integrated into a cohesive platform to streamline analytic cycles. The solution lets clients quickly discover insights from their core business data while keeping that data in a protected, controlled environment.
The Role of Datameer
IBM has partnered with Datameer to be the data pipelining platform for IBM Cloud Private for Data, providing comprehensive data preparation, exploration and pipeline operationalization capabilities for the platform. Datameer will be an integrated part of the solution, offering a seamless customer and user experience that streamlines data preparation within the platform.
In support of the announcement, Datameer CEO Christian Rodatus is quoted, saying:
Two of the biggest challenges for data scientists is cleansing and shaping data, and operationalizing their insights to deliver value to business. The direction IBM is headed with IBM Cloud Private for Data is aligned with Datameer’s strategy and will enable companies to more quickly prepare data for machine learning and AI projects and operationalize these across their organizations.
Data Pipeline Challenges for Data Science
As with analytics in general, AI and data science projects often get bogged down in preparing data for models. By an often-cited estimate, 80% of an analyst’s or data scientist’s time is spent preparing and shaping data for analysis and analytic modeling.
In addition, as noted in one of my recent blog posts on the Five Rules of Data Exploration, data science teams need to add diversity to their data by blending more sources, and must go through a painstaking process of data exploration and feature engineering to ensure the right data is being used and to eliminate potential bias in the results. Without the right tools, this can further lengthen already long cycles.
There are four key challenges to creating data pipelines for data science and AI:
- Agile creation of data pipelines to drive faster cycles and time to insight
- Minimizing the copying and movement of data for efficient resource utilization
- Maintaining strong security and governance for data privacy and regulatory compliance
- Operationalizing and automating data pipelines for a constant flow of new insights
Let’s look at each of these in detail, and how Datameer works with IBM Cloud Private for Data to lower these barriers.
Agile Data Pipeline Creation
The goal of data preparation for data-science-driven analytics is to dramatically reduce the 80% of a data scientist’s time spent preparing and shaping data. Doing so shortens analytic cycles and speeds time to value.
It all starts with an easy-to-use yet powerful interface for cleansing, transforming, blending and shaping data, which Datameer delivers within IBM Cloud Private for Data. With Datameer, data engineers, data scientists, and analysts can:
- Interactively apply functions to the data and visually profile how the shape and form of the data changes,
- Apply over 270 powerful functions, allowing even the most complex data to be quickly and easily tamed,
- Blend a wide variety of data sources, regardless of complexity, to add greater diversity to analytic datasets and reduce potential bias,
- Use Smart Analytics and other statistical functions to quickly perform feature engineering to find the most important attributes of the data,
- Visually explore datasets across any attribute, metric and value to rapidly find interesting patterns within the data to use in the modeling process.
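To make the blend-and-profile loop described above more concrete, here is a minimal sketch in plain Python. It is not Datameer's interface or API. The sample records and the `profile` and `blend` helpers are hypothetical, chosen only to illustrate how profiling a dataset before and after a blend shows the way its shape changes.

```python
from collections import Counter

# Hypothetical sample sources; in practice these would come from separate systems.
customers = [
    {"customer_id": 1, "region": "EMEA"},
    {"customer_id": 2, "region": "APAC"},
    {"customer_id": 3, "region": None},
]
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 45.5},
]


def profile(rows):
    """Summarize row count plus null and distinct counts for each column."""
    columns = sorted({key for row in rows for key in row})
    summary = {"rows": len(rows), "columns": {}}
    for col in columns:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        summary["columns"][col] = {
            "nulls": len(values) - len(non_null),
            "distinct": len(Counter(non_null)),
        }
    return summary


def blend(left, right, key):
    """Inner-join two lists of records on a shared key (hypothetical helper)."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in index.get(left_row[key], [])
    ]


blended = blend(customers, orders, "customer_id")
before = profile(customers)
after = profile(blended)
print(before)
print(after)
```

Comparing `before` and `after` shows the kind of change a visual profile would surface: in this sample, the customer with no orders drops out of the inner join, taking the only null `region` with it. Whether that is desirable depends on the analysis, which is why profiling at each step matters.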
Scalable Execution Without Data Duplication
The tremendous volumes of data used in data science projects can be extremely taxing on system, storage, and network resources. A cobbled-together architecture of multiple disconnected tools taxes those resources further by copying and moving data between components.
The integration of Datameer with IBM Cloud Private for Data enables your data pipelines to follow data gravity and process data without copying or duplication. Datameer provides data links, schema-on-read, and comprehensive data retention policies that process data efficiently to produce the final, analytics-ready results. The integration with the data management and object store services within IBM Cloud Private for Data further reduces resource-wasting data copying and movement.
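As a rough illustration of the schema-on-read idea mentioned above, the following plain-Python sketch leaves the raw data untouched and applies types only as records are read. It does not show how Datameer implements the feature; `RAW`, `SCHEMA` and `read_with_schema` are hypothetical names used for illustration.

```python
import csv
import io

# Hypothetical raw data, left exactly as it landed; nothing is converted up front.
RAW = "id,amount,ts\n1,19.99,2018-04-01\n2,n/a,2018-04-02\n3,5.00,2018-04-03\n"

# Hypothetical schema, applied only when the data is read.
SCHEMA = {"id": int, "amount": float, "ts": str}


def read_with_schema(source, schema):
    """Lazily yield typed records; values that fail to parse become None."""
    for raw_row in csv.DictReader(source):
        typed = {}
        for field, cast in schema.items():
            try:
                typed[field] = cast(raw_row[field])
            except (KeyError, ValueError):
                typed[field] = None
        yield typed


rows = list(read_with_schema(io.StringIO(RAW), SCHEMA))
total = sum(r["amount"] for r in rows if r["amount"] is not None)
print(rows)
print(total)
```

Because the schema lives outside the data, it can be changed and the same source re-read without rewriting or duplicating the underlying records.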
Governance and Security
More and more, AI-enriched analytic pipelines are used to feed models at the core of understanding operations, risk and customer behavior. These processes often require regulatory compliance to prove how risk is understood and calculated (BCBS 239, etc.), show effective detection of financial crimes (AML, etc.) or protect consumer data and rights (EU GDPR, etc.). The cost of poor governance is high, with substantial fines being levied.
Datameer offers a rich suite of security and governance features that provide the control and auditing needed to maintain regulatory compliance, including:
- Data encryption in-flight and at-rest
- Secure views with field obfuscation
- Role-based security
- End-to-end lineage tracking
- Detailed job and user auditing
Datameer integrates with the security controls and mechanisms of IBM Cloud Private for Data for comprehensive, end-to-end security. It also integrates with the IBM InfoSphere Information Governance Catalog for complete artifact cataloging, usage auditing, and lineage tracking.
The combination of Datameer and IBM enables fully secure, AI-enriched pipelines, full governance of the use of data and analytics, and the ability to easily track how the data and models are used for compliance auditing and reporting.
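For readers who want a feel for how a secure view with field obfuscation, role-based access and access auditing fit together, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not Datameer's or IBM's implementation: the role names, the `ROLE_CLEAR_FIELDS` mapping, and the `obfuscate` and `secure_view` helpers are all hypothetical.

```python
import hashlib

# Hypothetical role-to-clear-fields mapping; any field not listed is obfuscated.
ROLE_CLEAR_FIELDS = {
    "compliance_officer": {"customer_id", "email", "balance"},
    "analyst": {"customer_id", "balance"},
}

audit_log = []  # each access is recorded as (user, role, obfuscated fields)


def obfuscate(value):
    """Replace a value with a short, stable SHA-256 token."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def secure_view(record, user, role):
    """Return a copy of the record with fields outside the role's grant obfuscated."""
    clear = ROLE_CLEAR_FIELDS.get(role, set())
    view = {k: (v if k in clear else obfuscate(v)) for k, v in record.items()}
    hidden = sorted(k for k in record if k not in clear)
    audit_log.append((user, role, hidden))
    return view


record = {"customer_id": 42, "email": "jane@example.com", "balance": 10.0}
analyst_view = secure_view(record, "ana", "analyst")
print(analyst_view)
print(audit_log)
```

The analyst sees the balance but only a token in place of the email, and the access is logged. A plain unsalted hash is used here only to keep the sketch short; a production system would more likely use keyed hashing or tokenization, since low-entropy values can be recovered from unsalted hashes by guessing.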
Operationalizing and Automating Data Pipelines
I highlighted earlier the need to follow data gravity for AI data pipelines. The same rationale holds true for operationalizing your data pipelines. To deliver the scale needed for AI data pipelines, the platform requires a powerful execution framework and automation tools.
Datameer’s efficient management of data combines with its scalable Smart Execution technology to enable AI data pipelines that execute at tremendous scale. Smart Execution delivers parallel optimization of the entire job workload (ingest, preparation and advanced algorithms) to leverage the available compute and storage power offered in the IBM platform.
Datameer also provides flexible job automation and scheduling tools that run jobs to generate fresh results, and auditing services that track how the jobs are run and how the data is used. This enables data pipelines that continuously feed the downstream processes and people that consume the AI-enriched insights, while maintaining full governance and control.
Are You Ready?
If AI is to be the true game-changer business teams believe it can be, removing the barriers to adoption is the key to realizing business value from AI faster. A comprehensive, integrated platform that brings together best-of-breed components for the entire data and analytic lifecycle eliminates many of the critical barriers to success. IBM Cloud Private for Data, with Datameer as an integral piece, delivers the platform you need to streamline your data and analytic processes for AI.