Datameer Blog post
Data Pipeline Optimization: Why Self-Sufficiency Accelerates Insight
by John Morrell on Mar 05, 2018
The modern data pipeline has become an invaluable asset for many companies, allowing them to make well-reasoned business decisions fueled by the most up-to-date data possible. However, in today’s competitive business environment, collecting and compiling data is simply not enough to remain relevant. Proactive companies now strive to be “data driven” across many functions of the organization, using vast quantities of data gathered from a wide range of sources to better define their business goals and operations.
In their original form, data pipelines enabled the flow of information between structured systems: from operational systems to data warehouses to data marts. These pipelines were completely controlled by IT departments. Business analysts and teams would submit requests and, eventually (often many months later), see a trickle of new data. The resulting delays slowed decision-making, making the approach suitable only for long-running initiatives.
Then along came the data lake, with the promise of using more data and getting results faster. The underlying Hadoop technology offered a number of key underpinnings to speed the process. But the complexity of the data and the programming models meant IT had to remain the gatekeeper. IT adopted new techniques such as schema-on-read, and tools such as data preparation, to speed the process. But, alas, IT was still the gatekeeper and, therefore, the bottleneck. The main question facing the evolution of the data lake is: how do we remove this bottleneck and free the data to be consumed?
Where’s the Data Pipeline Bottleneck?
The typical analytic cycle involves several key steps, including:
- Preparation, Curation & Enrichment
- Exploration & Refinement
But, as the figure depicts, when the middle stage – exploration and refinement – is split across different tools, an iterative back-and-forth hand-off arises, slowing the process:
- First, the data engineer prepares and refines a dataset to meet the analyst’s needs.
- The analyst then explores and analyzes the data, only to find it does not reveal the answer, and asks the data engineer to try again.
- After a few iterations, the analyst gets a refined dataset that does reveal the answer, finally helping to close the cycle.
This repetitive process of passing the data back and forth to get the proper refinement is what continues to hinder the analytic cycles, even in the world of data lakes.
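The cost of those hand-offs compounds with every iteration. As a rough, purely illustrative sketch (the turnaround figures below are assumptions, not measurements), compare total elapsed time when each refinement requires an IT round trip versus when the analyst can iterate on their own:

```python
# Illustrative only: hypothetical turnaround costs per refinement iteration.
HANDOFF_DAYS = 5.0      # assumed days per engineer request/response cycle
SELF_SERVE_DAYS = 0.1   # assumed days for the analyst to refine on their own

def total_turnaround(iterations: int, cost_per_iteration: float) -> float:
    """Total elapsed days to reach the right dataset after N refinements."""
    return iterations * cost_per_iteration

iterations = 4  # e.g. four rounds of "try again" before the data fits
print(total_turnaround(iterations, HANDOFF_DAYS))    # prints 20.0
print(total_turnaround(iterations, SELF_SERVE_DAYS)) # prints 0.4
```

The point is not the specific numbers but the multiplier: every extra hand-off in the loop scales the whole cycle by the slowest participant’s turnaround time.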
There is a Better Data Pipeline Approach
There is an often-overused term in the software industry (especially the BI market): self-service. We like to believe we produce software that allows business teams to do their work completely on their own without the need for IT. But the reality is that self-service rarely happens.
A faster, more modern approach to big data pipelines is a process where IT and the business analysts share responsibility for the end product. We call this a cooperative curation process where:
- The IT teams do what they do best: ingest, integrate and blend data into usable datasets
- The business teams then explore the data blended by the IT teams to identify refined datasets that meet the specific needs of their analysis, and consume this data
In this case, the business teams become what I like to call “self-sufficient”: the IT teams do just enough to let the business teams finish the job on their own. Each group applies its unique skill set, and the two work together to speed the creation of the pipeline.
This cooperative process eliminates the last-mile bottlenecks in getting just the right data to the analyst. There are no longer repetitive back-and-forth exchanges that slow things down, but rather a smooth process with clear responsibilities and well-defined hand-off points.
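The division of labor can be sketched in a few lines of code. In this hypothetical example (the datasets, field names, and functions are illustrative, not a Datameer API), IT owns the blending step that joins raw sources into one usable dataset, and the analyst refines it self-sufficiently, with no further hand-off back to engineering:

```python
# --- IT's job: ingest, integrate, and blend raw sources (illustrative data) ---
crm = [{"cust_id": 1, "region": "EMEA"},
       {"cust_id": 2, "region": "AMER"}]
orders = [{"cust_id": 1, "revenue": 1200.0},
          {"cust_id": 2, "revenue": 450.0},
          {"cust_id": 1, "revenue": 300.0}]

def blend(crm_rows, order_rows):
    """Join orders to customer attributes on cust_id (the IT-owned step)."""
    regions = {row["cust_id"]: row["region"] for row in crm_rows}
    return [{**order, "region": regions[order["cust_id"]]} for order in order_rows]

blended = blend(crm, orders)  # the well-defined hand-off point

# --- Analyst's job: explore and refine, self-sufficiently ---
def refine(rows, region):
    """Filter and aggregate the blended data for one analysis question."""
    return sum(r["revenue"] for r in rows if r["region"] == region)

print(refine(blended, "EMEA"))  # prints 1500.0
```

The analyst can re-run `refine` with different filters as often as needed; only a change to the blended dataset itself would require going back to IT.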
What’s the Secret?
The secret to enabling this faster, jointly owned data pipeline process is:
- Integrate data refinement into the underlying data preparation and exploration platform, eliminating the need to move the data to other tools or locations to refine it
- Offer familiar, easy to use metaphors that meet the needs of the two audiences: A data-centric, functional metaphor for the data engineer, and a visual metaphor for the business analyst, with data shared between the two
- Provide a powerful free-form exploration infrastructure that allows an analyst to explore billions of records across any attribute, metric and value, with sub-second response times
This data pipelining process enables business analysts to consume data faster, using it when they need it and on their own terms. By enabling self-sufficiency in the data pipeline, organizations can greatly alleviate the process bottlenecks that stand between analysts and the data they need.
This truly enables an agile process for producing actionable datasets. The business analyst is free to “fail-fast”: explore in any direction and, if the answer isn’t there, back out and try another direction. They fail fast until they get just the right data to solve the problem.
Conclusion: Create Information Production Lines
With competitive markets more cutthroat than ever, data is no longer a luxury but a competitive necessity. To succeed at digital transformation, companies must produce the critical information assets that drive their business. The faster and more efficient that information production line is, the more streamlined and effective the business becomes.
This modern, cooperative approach to creating data pipelines yields a faster, more efficient process that produces the valuable information assets fueling your digitally tuned business. To read about how Datameer can help, please visit our Visual Explorer landing page and learn more about cooperative curation processes and data pipelines. And for more information on optimizing your data pipeline, be sure to download the Eckerson Data Pipeline whitepaper for an in-depth look at this modern data architecture.