When it comes to the health of your data, the problems go beyond the question “did a data pipeline run and deliver its payload?” Data observability incorporates additional questions such as:
- Did the data arrive on time?
- Did all the data arrive?
- Where was the data delivered?
- Was the data in the right format?
- How did the data come into the final format?
- Is the data at risk in any way?
- What is the degree of data quality?
- How useful and complete is the data?
Answering these questions provides a complete view of the health of your data and data pipelines. It also allows your organization to measure the effectiveness and operational use of its data. Let’s explore each of these in more detail.
Timeliness
Delivering data on a timely basis ensures that analysts and business teams are working from fresh data to make their decisions and can see trends in as near to real time as possible. To ensure timeliness, DataOps teams need to automate and run data pipelines as often as the infrastructure allows, and monitor for their clean execution.
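A freshness check like this can be sketched as comparing a dataset's last-updated timestamp against an agreed SLA window. The `is_fresh` helper and the one-hour SLA below are illustrative assumptions, not part of any particular tool:

```python
from datetime import datetime, timezone, timedelta

def is_fresh(last_updated: datetime, sla: timedelta, now=None) -> bool:
    """Return True if the dataset was updated within the SLA window."""
    now = now or datetime.now(timezone.utc)
    return (now - last_updated) <= sla

# Hypothetical run: a dataset refreshed 30 minutes ago passes a 1-hour SLA,
# one refreshed 2 hours ago fails it.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
assert is_fresh(now - timedelta(minutes=30), timedelta(hours=1), now)
assert not is_fresh(now - timedelta(hours=2), timedelta(hours=1), now)
```

A scheduler would run such a check after each expected pipeline completion and alert when it returns `False`.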
Data Volume
Erratic data volumes produced by data pipelines can indicate that the pipelines are broken and can create unforeseen holes in the resulting analytics. DataOps teams need to monitor not only overall data volume but also checkpoints at multiple stages within the pipeline, so they can drill down and identify where a pipeline is broken.
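The checkpoint idea can be sketched as comparing per-stage row counts against a baseline; the stage names, counts, and 20% tolerance here are hypothetical:

```python
def volume_anomalies(checkpoints: dict, baseline: dict,
                     tolerance: float = 0.2) -> list:
    """Return checkpoint names whose row counts deviate from the
    baseline by more than the tolerance fraction."""
    flagged = []
    for name, count in checkpoints.items():
        expected = baseline.get(name)
        if expected and abs(count - expected) / expected > tolerance:
            flagged.append(name)
    return flagged

# Hypothetical counts: volume looks fine at ingest but collapses mid-pipeline.
baseline = {"ingest": 1000, "transform": 990, "load": 990}
today = {"ingest": 1010, "transform": 400, "load": 400}
print(volume_anomalies(today, baseline))  # ['transform', 'load']
```

Because the ingest checkpoint is healthy while later stages are flagged, the breakage is localized to the transform step rather than the source.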
Data Delivery
Data pipelines can have multiple delivery points for both finished and intermediate datasets, and analysts can also extend pipelines to produce derivative datasets. DataOps teams need to monitor whether datasets are being properly delivered to their destinations, and what those destinations are, to ensure proper use of the data.
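A minimal destination audit, sketched here with made-up destination names, compares where datasets were expected to land against where they actually landed, flagging both missed deliveries and unknown destinations:

```python
def destination_audit(expected: set, delivered: set) -> dict:
    """Flag expected destinations that received nothing ('missing')
    and destinations that received data unexpectedly ('unexpected')."""
    return {
        "missing": expected - delivered,
        "unexpected": delivered - expected,
    }

# Hypothetical destinations for one dataset.
expected = {"warehouse.analytics.orders", "s3://exports/orders/"}
delivered = {"warehouse.analytics.orders", "warehouse.sandbox.orders_copy"}
print(destination_audit(expected, delivered))
```

The "unexpected" bucket is what surfaces improper use: data showing up somewhere it was never meant to go.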
Data Format & Schema
A data pipeline with multiple sources and destinations will work with, and deliver, data in different formats. DataOps teams need to monitor for format and schema changes, keep those changes from breaking pipelines, and adjust pipeline logic as needed.
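Schema-change detection can be sketched as diffing an expected column-to-type mapping against what actually arrived; the column names and types below are illustrative:

```python
def schema_diff(expected: dict, observed: dict) -> dict:
    """Compare an expected column->type mapping against what arrived."""
    return {
        "missing": sorted(set(expected) - set(observed)),
        "unexpected": sorted(set(observed) - set(expected)),
        "type_changed": sorted(
            col for col in expected.keys() & observed.keys()
            if expected[col] != observed[col]
        ),
    }

# Hypothetical schemas: 'amount' silently became a string, 'notes' appeared.
expected = {"id": "int", "email": "str", "amount": "float"}
observed = {"id": "int", "email": "str", "amount": "str", "notes": "str"}
print(schema_diff(expected, observed))
```

A non-empty diff is the trigger to quarantine the load and adjust pipeline logic, rather than letting the change break downstream steps.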
Data Lineage
The end-to-end lineage of a data pipeline is important for many reasons, including data governance, regulatory compliance, and building trust in the data. DataOps teams need to maintain and publish a complete, detailed data lineage that tracks every source, transformation, and destination.
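As a minimal sketch, lineage can be modeled as a graph where each dataset records the transformation that produced it and its upstream inputs; the dataset and transformation names here are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class LineageNode:
    """One dataset in the lineage graph, with the transformation that
    produced it and the datasets it was derived from."""
    name: str
    transformation: str = "source"
    inputs: list = field(default_factory=list)

    def ancestry(self) -> set:
        """Every upstream dataset name, for audits and impact analysis."""
        names = set()
        for parent in self.inputs:
            names.add(parent.name)
            names |= parent.ancestry()
        return names

orders = LineageNode("raw_orders")
customers = LineageNode("raw_customers")
joined = LineageNode("orders_enriched", "join on customer_id", [orders, customers])
daily = LineageNode("daily_revenue", "aggregate by day", [joined])
print(sorted(daily.ancestry()))
```

Walking the ancestry answers both governance questions ("what feeds this report?") and impact questions ("what breaks if this source changes?").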
Data Risk
Data risk takes into account the risk of exposing data from a security, privacy, or regulatory-control standpoint. While data privacy teams may own this process overall, DataOps teams should continuously monitor, assess, and govern the risk within their data pipelines.
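One simple form of in-pipeline risk monitoring is scanning rows for values that look like personal data. The patterns below are deliberately crude illustrations; a real deployment would rely on a vetted classification tool covering far more identifier types:

```python
import re

# Hypothetical, simplified detectors for two kinds of PII.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_for_pii(rows: list) -> dict:
    """Return a mapping of column name -> set of PII kinds detected."""
    findings = {}
    for row in rows:
        for column, value in row.items():
            for kind, pattern in PII_PATTERNS.items():
                if isinstance(value, str) and pattern.search(value):
                    findings.setdefault(column, set()).add(kind)
    return findings

rows = [{"comment": "reach me at jane@example.com", "total": "42.00"}]
print(scan_for_pii(rows))  # {'comment': {'email'}}
```

A hit in a column not approved to carry personal data is exactly the kind of risk signal DataOps should escalate to the privacy team.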
Data Quality & Consistency
Incomplete and inconsistent data creates holes in the resulting analytics, leading to suboptimal decisions and low business trust in the data. DataOps teams need to constantly measure and monitor data quality and completeness, and be able to drill down to identify and fix problems.
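One measurable slice of quality is per-column completeness, the fraction of rows with a non-null value. The rows, columns, and 90% threshold below are illustrative assumptions:

```python
def quality_report(rows: list, required: list) -> dict:
    """Per-column completeness: fraction of rows with a non-null value."""
    total = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) is not None) / total
        for col in required
    }

# Hypothetical sample with gaps in 'email' and 'country'.
rows = [
    {"id": 1, "email": "a@x.com", "country": "US"},
    {"id": 2, "email": None, "country": "US"},
    {"id": 3, "email": "c@x.com", "country": None},
]
report = quality_report(rows, ["id", "email", "country"])

# Drill down: any column under an agreed threshold gets investigated.
low = {col: score for col, score in report.items() if score < 0.9}
print(low)
```

Tracking these scores per pipeline run turns "low trust in the data" into a concrete, monitorable number.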
Data Usefulness & Completeness
Just as poor data quality can hinder data use and trust, greater data completeness can improve the accuracy and context of decisions. DataOps teams need to monitor the completeness of the data and collaborate with analytics and business teams to maximize its usefulness and completeness.