Datameer Blog post
DataOps: New Term, Similar Concepts
by John Morrell on Jun 21, 2018
Componentization, containers and the cloud have all converged to usher in a new era focused on “Ops.” It started with DevOps, which according to Wikipedia is defined as:
“a software engineering culture and practice that aims at unifying software development (Dev) and software operation (Ops).”
And while DevOps initially focused on software engineering processes to construct, test and release software, more of the recent attention has been on the “Ops” part – automating the operational deployment aspects to make rollouts faster, smoother and more reliable.
The cloud has created an even greater need for DevOps tools. In the move to cloud architectures, software stacks have become even more disaggregated, with developers using finer-grained tools – often called “primitives” – each designed and optimized for a very specific purpose in the stack.
Software teams are challenged to orchestrate processes that link together the execution and data exchange between each primitive in the stack in order to get an end-to-end flow. In the past, this orchestration was performed with scripting. The more modern approach is to use a segment of DevOps tools specifically designed for this orchestration.
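To make the orchestration idea concrete, here is a minimal sketch in Python. It assumes a hypothetical three-step flow (`extract`, `clean`, `report`) and a hypothetical helper, `run_pipeline`, that runs each step only after the steps it depends on and hands their results downstream. Real orchestration tools add scheduling, retries and monitoring on top of this basic pattern; none of the names below come from a specific product.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(steps, dependencies):
    """Run each step after the steps it depends on, passing results along.

    steps: dict mapping step name -> function taking a dict of upstream results
    dependencies: dict mapping step name -> set of upstream step names
    """
    results = {}
    # static_order() yields a step only after all of its dependencies
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = steps[name](upstream)
    return results


# Hypothetical three-step flow: extract -> clean -> report
steps = {
    "extract": lambda up: [" 3", "5 ", "x", "7"],
    "clean": lambda up: [int(v) for v in up["extract"] if v.strip().isdigit()],
    "report": lambda up: sum(up["clean"]),
}
dependencies = {"extract": set(), "clean": {"extract"}, "report": {"clean"}}

results = run_pipeline(steps, dependencies)
print(results["report"])
```

Declaring the dependencies as data, rather than hard-coding the call order in a script, is what lets an orchestrator reorder, parallelize or rerun individual steps.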
DataOps: the New Sheriff in Town
Now, apply the same DevOps concept to the world of data and analytics, and suddenly you get DataOps, a subject that is getting increasing attention in big data analytics and analytics in the cloud. Going back to Wikipedia, DataOps is defined as:
“an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics.”
DataOps aims to bring the same benefits delivered by DevOps – speed, continuous agility, reliability, scalability – to data analytics processes. DataOps is typically broken into four distinct aspects:
- Data engineering
- Data processing
- Data management
- Data security and privacy
Let’s explore each of these.
Data Engineering
The first part of DataOps is engineering the data flows that turn raw data into information that analyst and business teams can consume. The tools involved need to focus on increasing the productivity of the data engineer, who is the primary curator of data for the organization.
This requires a tool set that gives data engineers multiple ways to model the integration, transformation, curation and organization of datasets for downstream consumption. The tools should cater to the different skills and approaches of the personas who will create the data workflows:
- Data engineers like to use SQL and are proficient in transforming data via SQL syntax
- Data or Business Analysts like less technical, spreadsheet-style transformations and interactive experimentation
- Data Scientists need additional facilities for feature engineering and advanced organization
- All personas need visual data exploration at scale to dig further into the data
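As an illustration of the SQL-style transformation mentioned above, the following sketch uses Python's built-in `sqlite3` module to curate a raw table into a dataset ready for downstream consumption. The table names (`raw_orders`, `customer_revenue`), columns and sample rows are invented for the example and do not describe any particular platform.

```python
import sqlite3

# In-memory database standing in for a raw data source (hypothetical data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        ("acme", 120.0, "complete"),
        ("acme", 80.0, "complete"),
        ("globex", 50.0, "cancelled"),
        ("globex", 200.0, "complete"),
    ],
)

# Curate: keep completed orders only and aggregate per customer
conn.execute(
    """
    CREATE TABLE customer_revenue AS
    SELECT customer, SUM(amount) AS revenue, COUNT(*) AS orders
    FROM raw_orders
    WHERE status = 'complete'
    GROUP BY customer
    """
)

curated = conn.execute(
    "SELECT customer, revenue, orders FROM customer_revenue ORDER BY customer"
).fetchall()
for row in curated:
    print(row)
```

An analyst working in a spreadsheet-style tool would express the same filter and aggregation interactively; the SQL form is simply the one a data engineer is likely to reach for first.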
Data Processing
Once data workflows have been orchestrated, the data itself needs to be processed. The actual processing will take the original raw data and transform it into the desired result sets that can be consumed by the downstream analytics.
In the world of big data, this means deciphering complex data formats and processing data at scale to produce downstream results. There are two critical components of this:
- Optimizing the processing – since the workflows can be extremely complex and work on extremely large datasets, the data processing component requires a “smart” optimizer that knows how to distribute the workloads, process as close to the data as possible, and choose the most efficient path to execute the jobs at hand.
- Securely processing – we will discuss data security in a bit, but there is also the notion of processing security whereby the jobs that are executed must be run in a manner aligned with the operational security of the underlying systems. In the world of Hadoop, this requires “secure impersonation.”
Data Management
As the DataOps processes are running, they are consuming, transforming and producing a variety of different datasets along the way. The end result should be a repository of datasets that analytics teams can consume. Some of this data is just in transitory form, while other pieces need to be persisted.
Managing the data manually inside DataOps processes can be difficult and complex, especially given the compliance and governance processes that are often involved. Therefore, a key requirement for a DataOps platform is to abstract the data management complexity through:
- Automated data partitioning that will use the storage resources effectively and enable more efficient and faster processing of the data
- Advanced, automated data retention capabilities that manage versions of the resulting datasets based on different policies
- Complete auditing services that can track how data under management is used and consumed to facilitate governance and compliance
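A policy-driven retention rule of the kind listed above might look like the following sketch. The policy itself (keep the two newest versions, plus anything created in the last 30 days) and the names `DatasetVersion` and `apply_retention` are assumptions made for illustration, not a description of how any specific platform implements retention.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DatasetVersion:
    name: str
    version: int
    created: date


def apply_retention(versions, today, keep_latest=2, max_age_days=30):
    """Split versions into (kept, expired).

    A version is kept if it is among the `keep_latest` newest versions
    or is no older than `max_age_days`; everything else is expired.
    """
    ordered = sorted(versions, key=lambda v: v.version, reverse=True)
    cutoff = today - timedelta(days=max_age_days)
    kept, expired = [], []
    for rank, v in enumerate(ordered):
        if rank < keep_latest or v.created >= cutoff:
            kept.append(v)
        else:
            expired.append(v)
    return kept, expired


# Hypothetical version history for a single dataset
today = date(2018, 6, 21)
versions = [
    DatasetVersion("sales", 1, date(2018, 1, 5)),
    DatasetVersion("sales", 2, date(2018, 3, 10)),
    DatasetVersion("sales", 3, date(2018, 6, 1)),
    DatasetVersion("sales", 4, date(2018, 6, 20)),
]
kept, expired = apply_retention(versions, today)
print([v.version for v in kept], [v.version for v in expired])
```

Returning the expired versions instead of deleting them directly leaves room for an audit step to record what was removed and why.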
Data Security and Privacy
More and more analytics are using Personally Identifiable Information (PII), which falls under regulatory compliance and control, and may also use sensitive information about the company operations. All of this data must be properly secured and guaranteed to remain private.
We mentioned earlier that auditing services are used to understand how data under management is used, which is also an important part of data security. But DataOps platforms also provide additional security features such as:
- Granular role-based security down to individual artifacts and datasets to control who has access to the data and how it is used
- Built-in encryption and obfuscation of data to ensure PII cannot be seen by the various constituents that use or consume the data
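The two features above can be sketched together: a role policy decides which columns a user may see in clear text, and PII columns outside that set are replaced with an opaque token. Everything here is hypothetical, including the roles, the column names and the `view_for_role` helper. Note also that a salted hash is pseudonymization rather than encryption, so a production system would pair it with proper key management.

```python
import hashlib

# Hypothetical policy: which roles may see which columns in clear text
CLEAR_TEXT_COLUMNS = {
    "data_engineer": {"customer_id", "email", "region", "spend"},
    "analyst": {"customer_id", "region", "spend"},
}
PII_COLUMNS = {"email"}


def obfuscate(value, salt="demo-salt"):
    """Replace a value with a short, stable token derived from SHA-256."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return "pii_" + digest[:12]


def view_for_role(record, role):
    """Return a copy of the record as the given role is allowed to see it."""
    if role not in CLEAR_TEXT_COLUMNS:
        raise PermissionError(f"role {role!r} has no access to this dataset")
    allowed = CLEAR_TEXT_COLUMNS[role]
    visible = {}
    for column, value in record.items():
        if column in allowed:
            visible[column] = value
        elif column in PII_COLUMNS:
            # PII the role may not read is tokenized rather than exposed
            visible[column] = obfuscate(value)
        # columns that are neither allowed nor PII are dropped entirely
    return visible


record = {"customer_id": 42, "email": "jane@example.com", "region": "EMEA", "spend": 310.5}
analyst_view = view_for_role(record, "analyst")
engineer_view = view_for_role(record, "data_engineer")
print(analyst_view)
```

Because the token is stable for a given input, analysts can still join or count on the obfuscated column without ever seeing the underlying value.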
Single Platform to Empower Multiple Aspects & Personas
As one can see, if data engineering teams used lower-level primitives to create data pipelines and managed them via different execution frameworks, they would face the extra task of programming and integrating the various DataOps items mentioned above: data engineering, processing, management and security. The result is an extremely complex custom framework for automating DataOps that hinders the speed, agility, reliability and scalability objectives.
A complete, end-to-end analytic data management platform that covers all aspects of DataOps delivers the key benefits and simplifies the job of each key persona: data engineers, business analysts and IT teams. With such a platform, DataOps processes create “managed self-service” – self-service consumption for the analysts combined with well-managed processes for the IT and data engineering teams. This becomes a win-win for both sides, and enables an agile, enterprise-wide data strategy that speeds overall time to insight.
To learn more about how Datameer enables an agile, scalable DataOps environment, visit our website at www.datameer.com/product/why-datameer, and explore our key data preparation, governance and operationalization capabilities.