Data Pipeline Optimization : Why Self-Sufficiency Accelerates Insight
- John Morrell
- March 5, 2018
In their original forms, data pipelines were used to enable information flow between structured systems – operational systems, data warehouses, and data marts. IT departments completely controlled these pipelines. Business analysts and teams would submit requests and, eventually (meaning many months later), they would see the trickle-down of new data. The resulting delays slowed decision-making, making the approach suitable only for long-running initiatives.
Then along came the data lake with the promise of using more data and getting results faster. The underlying Hadoop technology offered many of the key underpinnings needed to speed the process. But the complexity of the data and the programming models required IT to remain the gatekeeper. IT was able to use new techniques such as schema-on-read and tools such as data preparation to speed the process. But, alas, they were still the gatekeepers and, therefore, the bottleneck. The main question facing the data lake’s evolution is: How do we remove this bottleneck and free the data to be consumed?
Where’s the Data Pipeline Bottleneck?
The typical analytic cycle involves several key steps, including:
- Preparation, Curation & Enrichment
- Exploration & Refinement
But, as the figure depicts, when the middle stage – exploration and refinement – is split across different tools, an iterative back-and-forth hand-off can arise that slows the process:
- First, the data engineer prepares and refines a dataset to the needs of the analyst.
- The analyst then explores and analyzes the data, only to find it does not reveal the answer, and asks the data engineer to try again.
- After a few iterations, the analyst gets a refined dataset that does reveal the answer, finally helping to close the cycle.
This repetitive process of passing the data back and forth to get the proper refinement continues to hinder the analytic cycles, even in the world of data lakes.
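The cost of that repetition can be made concrete with a small sketch. The turnaround times below are purely illustrative assumptions (not measurements from the article); the point is the multiplier, not the numbers.

```python
# Hypothetical model of the hand-off cycle described above.
# All timings are illustrative assumptions, not measured values.

ENGINEER_TURNAROUND_DAYS = 3   # each refinement request re-enters the IT queue
ANALYST_REVIEW_DAYS = 1        # time for the analyst to evaluate each dataset

def handoff_cycle(iterations_needed: int) -> int:
    """Total elapsed days when every refinement round-trips through IT."""
    return iterations_needed * (ENGINEER_TURNAROUND_DAYS + ANALYST_REVIEW_DAYS)

def self_sufficient_cycle(iterations_needed: int) -> int:
    """Elapsed days when IT prepares the data once and the analyst iterates alone."""
    return ENGINEER_TURNAROUND_DAYS + iterations_needed * ANALYST_REVIEW_DAYS

print(handoff_cycle(5))          # 20 days of back-and-forth
print(self_sufficient_cycle(5))  # 8 days with a single hand-off
```

Under these toy assumptions, five refinement rounds take 20 days when each one round-trips through IT, versus 8 days when the analyst can iterate independently after a single hand-off.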
There is a Better Data Pipeline Approach
There is an often-overused term in the software industry (especially the BI market): self-service. We like to believe we produce software that allows business teams to do their work completely independently without the need for IT. But the reality is that self-service rarely happens.
A faster, more modern approach to big data pipelines is a process where IT and the business analysts share responsibility for the end product. We call this a cooperative curation process where:
- The IT teams do what they do best: ingest, integrate, and blend data into usable datasets.
- The business teams then explore the data blended by the IT teams to identify and consume refined datasets that meet the specific needs of their analysis.
In this case, the business teams become what I like to call “self-sufficient,” meaning the IT teams do just enough to let the business teams finish the job independently. Each group uses its unique skill set to do what it does best, and the two work together to speed the creation of the pipeline.
This cooperative process eliminates the last-mile bottlenecks in getting just the right data to the analyst. There are no longer repetitive back-and-forth exchanges that slow things down, but rather a smooth process, clear responsibilities, and well-defined hand-off points.
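The split of responsibilities can be sketched in a few lines of pure Python. The dataset names, fields, and `blend` function are hypothetical illustrations of the cooperative curation idea, not any particular product's API.

```python
# A minimal sketch of cooperative curation: IT blends, the analyst refines.
# Dataset names, fields, and functions are hypothetical illustrations.

# --- IT stage: ingest and blend source systems into one usable dataset ---
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 120.0},
    {"order_id": 2, "customer_id": 11, "amount": 45.5},
    {"order_id": 3, "customer_id": 10, "amount": 300.0},
    {"order_id": 4, "customer_id": 12, "amount": 80.0},
]
customer_regions = {10: "West", 11: "East", 12: "West"}

def blend(orders, regions):
    """IT's half of the pipeline: join operational sources into one dataset."""
    return [dict(o, region=regions.get(o["customer_id"])) for o in orders]

# --- Analyst stage: refine the blended data with no further IT hand-off ---
curated = blend(orders, customer_regions)
refined = [o for o in curated if o["region"] == "West" and o["amount"] > 100]
print([o["order_id"] for o in refined])  # -> [1, 3]
```

The hand-off point is the `curated` dataset: IT owns everything above it, and the analyst is free to rewrite the filter on the last line as many times as needed without re-entering the IT queue.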
What’s the Secret?
The secret to enabling this faster, jointly owned data pipeline process is:
- Integrate data refinement into the underlying data preparation and exploration platform, eliminating the need to move the data to other tools or locations to refine it.
- Offer familiar, easy-to-use metaphors that meet the two audiences’ needs: a data-centric, functional metaphor for the data engineer, and a visual metaphor for the business analyst, with data shared between the two.
- Provide a powerful free-form exploration infrastructure that allows an analyst to explore billions of records across any attribute, metric, and value, with sub-second response times.
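The last point rests on storage layouts that make filtering on any attribute cheap. A toy sketch of the idea, using a column-oriented layout (the data and names here are illustrative assumptions, not how any specific engine achieves its sub-second response times):

```python
# Toy sketch of column-oriented storage, the kind of layout that makes
# free-form filtering on any attribute fast. Data is illustrative only.
columns = {
    "region": ["West", "East", "West", "South"],
    "metric": [120, 45, 300, 80],
}

def explore(columns, attribute, predicate):
    """Return the row indices whose value for `attribute` satisfies `predicate`.

    Scanning a single contiguous column touches far less data than
    scanning whole rows, which is why columnar engines handle ad hoc
    filters on arbitrary attributes efficiently.
    """
    return [i for i, value in enumerate(columns[attribute]) if predicate(value)]

print(explore(columns, "metric", lambda v: v > 100))       # -> [0, 2]
print(explore(columns, "region", lambda v: v == "West"))   # -> [0, 2]
```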
This data pipelining process enables business analysts to consume data faster, using it when they need it and on their own terms. By enabling self-sufficiency in the data pipeline, organizations can greatly alleviate process bottlenecks to get to the data they need.
This truly enables an agile process to produce actionable datasets. The business analyst is free to “fail-fast” – explore in any direction, and if they don’t find the answer, back out and try another direction. They fail fast until they get just the right data they need to solve the problem.
Conclusion: Create Information Production Lines
With competitive markets becoming more cutthroat than ever before, data is no longer a luxury but rather a competitive necessity. To succeed at digital transformation, companies must produce the critical information assets that drive their business. The faster and more efficient the information asset production line is, the more streamlined the business is, and the more effectively it operates.
This modern, cooperative approach to creating data pipelines yields a faster, more efficient process that produces the valuable information assets that fuel your digitally tuned business. To read about how Datameer can help, please visit our Visual Explorer landing page and learn more about cooperative curation processes and data pipelines. And for more information on optimizing your data pipeline, be sure to download the Eckerson Data Pipeline whitepaper for an in-depth look at this modern data architecture.