Data Pipeline Optimization: Why Self-Sufficiency Accelerates Insight

  • John Morrell
  • March 5, 2018

In their original form, data pipelines enabled information flow between structured systems – operational systems, data warehouses, and data marts. IT departments completely controlled these pipelines. Business analysts and teams would submit requests and, eventually (meaning many months later), they would see new data trickle down. The resulting delays slowed decision-making and made the approach suitable only for long-running initiatives.

Then along came the data lake with the promise of using more data and getting results faster. The underlying Hadoop technology offered many key underpinnings to speed the process. But the complexity of the data and the programming models required IT to remain the gatekeeper. IT was able to use new techniques such as schema-on-read and tools such as data preparation to speed the process. But, alas, they were still the gatekeepers and, therefore, the bottleneck. The main question facing the data lake’s evolution is: How do we remove this bottleneck and free the data to be consumed?

Where’s the Data Pipeline Bottleneck?

The typical analytic cycle involves 5 key steps:

  • Integration
  • Preparation, Curation & Enrichment
  • Exploration & Refinement
  • Analysis
  • Visualization

Older Style Data Pipeline Approach

But, as the figure depicts, when the middle stage (exploration and refinement) is split across different tools, an iterative back-and-forth hand-off can arise, slowing the process:

  • First, the data engineer prepares and refines a dataset to the needs of the analyst.
  • The analyst then explores and analyzes the data to find it does not reveal the answer and asks the data engineer to try again.
  • After a few iterations, the analyst gets a refined dataset that does reveal the answer, finally helping to close the cycle.

This repetitive process of passing the data back and forth to get the proper refinement continues to hinder the analytic cycles, even in the world of data lakes.

There is a Better Data Pipeline Approach

There is an often-overused term in the software industry (especially the BI market): self-service. We like to believe we produce software that allows business teams to do their work completely independently without the need for IT. But the reality is that self-service rarely happens.

A faster, more modern approach to big data pipelines is a process where IT and the business analysts share responsibility for the end product. We call this a cooperative curation process where:

  • The IT teams do what they do best: ingest, integrate, and blend data into usable datasets.
  • The business teams then explore the data blended by the IT teams, identify refined datasets that meet the specific needs of their analysis, and consume that data.

In this case, the business teams become what I like to call “self-sufficient,” meaning the IT teams do just enough to let the business teams finish the job independently. Each group uses its unique skill set to do what it does best, and the two work together to speed the creation of a pipeline.

Newer Style Data Pipeline Approach

This cooperative process eliminates the last-mile bottlenecks in getting just the right data for the analyst. The repetitive back-and-forth exchanges that slow things down give way to a smooth process with clear responsibilities and well-defined hand-off points.
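As a rough illustration of that hand-off, here is a minimal Python sketch. Everything in it is hypothetical: the `orders` and `customers` records, the `blend` function standing in for the IT-owned step, and the `refine` function standing in for the analyst-owned step. It is not a real product API, only one way to picture where responsibility changes hands.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]

# Hypothetical raw sources that the IT team would ingest.
orders: List[Record] = [
    {"order_id": 1, "customer_id": "c1", "amount": 120.0},
    {"order_id": 2, "customer_id": "c2", "amount": 35.5},
    {"order_id": 3, "customer_id": "c1", "amount": 60.0},
]
customers: List[Record] = [
    {"customer_id": "c1", "region": "EMEA"},
    {"customer_id": "c2", "region": "APAC"},
]


def blend(orders: List[Record], customers: List[Record]) -> List[Record]:
    """IT-owned step: join the two sources into one usable dataset."""
    by_id = {c["customer_id"]: c for c in customers}
    return [
        {**order, **by_id[order["customer_id"]]}
        for order in orders
        if order["customer_id"] in by_id
    ]


def refine(blended: List[Record], keep: Callable[[Record], bool]) -> List[Record]:
    """Analyst-owned step: narrow the blended data without a new IT request."""
    return [row for row in blended if keep(row)]


# The blended dataset is the hand-off point between the two teams.
blended = blend(orders, customers)

# The analyst can now try any refinement independently.
emea_large = refine(blended, lambda r: r["region"] == "EMEA" and r["amount"] >= 100)
print(emea_large)
```

The point of the split is that changing the analyst's filter never requires rerunning or renegotiating the `blend` step.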

What’s the Secret?

The secret to enabling this faster, jointly owned data pipeline process comes down to three things:

  • Integrate data refinement into the underlying data preparation and exploration platform, eliminating the need to move the data to other tools or locations to refine it.
  • Offer familiar, easy-to-use metaphors that meet the two audiences’ needs: a data-centric, functional metaphor for the data engineer and a visual metaphor for the business analyst, with data shared between the two.
  • Provide a powerful free-form exploration infrastructure that allows an analyst to explore billions of records across any attribute, metric, and value, with sub-second response times.

This data pipelining process enables business analysts to consume data faster, using it when they need it and on their own terms. By enabling self-sufficiency in the data pipeline, organizations can greatly alleviate process bottlenecks to get to the data they need.

This truly enables an agile process to produce actionable datasets. The business analyst is free to “fail fast”: explore in any direction and, if they don’t find the answer, back out and try another direction. They fail fast until they get just the right data they need to solve the problem.
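The fail-fast loop can be pictured with a small Python sketch. The sample `rows`, the `totals_by` and `explore` helpers, and the `shows_gap` criterion are all invented for illustration; a real exploration platform would run this kind of loop interactively and at far larger scale.

```python
from collections import defaultdict

# Hypothetical blended dataset handed to the analyst.
rows = [
    {"region": "EMEA", "channel": "web", "amount": 120.0},
    {"region": "EMEA", "channel": "store", "amount": 80.0},
    {"region": "APAC", "channel": "web", "amount": 95.0},
    {"region": "APAC", "channel": "store", "amount": 105.0},
]


def totals_by(rows, attribute):
    """Sum the amount for each distinct value of one attribute."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[attribute]] += row["amount"]
    return dict(totals)


def explore(rows, attributes, is_answer):
    """Try each direction in turn; back out as soon as one fails."""
    for attribute in attributes:
        totals = totals_by(rows, attribute)
        if is_answer(totals):
            return attribute, totals
    return None  # no direction revealed the answer


def shows_gap(totals):
    """Assumed question: does any grouping differ by at least 20?"""
    return max(totals.values()) - min(totals.values()) >= 20


result = explore(rows, ["region", "channel"], shows_gap)
print(result)
```

Here the first direction (`region`) shows no difference and is dropped, while the second (`channel`) does. Neither attempt required a round trip to the data engineer.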

Conclusion: Create Information Production Lines

With markets more cutthroat than ever before, data is no longer a luxury but a competitive necessity. To succeed at digital transformation, companies must produce the critical information assets that drive their business. The faster and more efficient the information asset production line is, the more streamlined the business is, and the more effectively it operates.

This modern, cooperative approach to creating data pipelines creates a more efficient and speedier process that produces the valuable information assets that fuel your digitally tuned business. To read about how Datameer can help, please visit our Visual Explorer landing page and learn more about cooperative curation processes and data pipelines. And for more information on optimizing your data pipeline, be sure to download the Eckerson Data Pipeline whitepaper for an in-depth look at this modern data architecture.
