ETL++: Reinvigorating the Data Integration Market
- John Morrell
- April 12, 2021
(This article first appeared on Medium on April 6, 2021.)
The definition of “++” means incremental. It is used in various computer languages to increment variables. It was first used to describe new software solutions with C++. C++ added a number of innovations on C, including:
C++ blended the control of C with ease, re-use, and flexibility. If you wanted to stay higher-level you could. If you needed to dip down and get more control, you could as well.
ETL tools are traditionally the domain of data engineers. It’s their job to extract, transform, and load the data into a pre-defined schema that will service operational BI and reporting – typically residing in a data warehouse. Analysts will get the data in its’ final format at the end of the ETL data pipeline.
Traditional ETL tools are good at (a) extracting from operational data sources, (b) transforming the data – which is typically cleansing, joining, normalizing, and mapping the data, and then (c) loading the data warehouses. The newer cloud ETL tools work very much the same way, except they use an ELT model – extract then load and transform in the cloud data warehouse.
ETL++ takes classic ETL to an entirely new level. ETL++ provides all the classic capabilities one would expect in ETL – easy extraction, mapping, and loading, with enterprise scalability, performance, security, and governance.
Like C++ added re-use, flexibility, and abstraction to C, ETL++ does the same for ETL. Data engineers can take complex data and transform it into a flexible and reusable form that analysts and data scientists can use.
The ++ in ETL++ stands for the ability to incrementally do much more to the data, with even greater ease of use. ETL++ offers the ability for data analysts and data scientists to get involved and perform their own incremental transformation of the data to suit their specific needs. It relieves the burden on the data teams to create new pipelines for every analytical need and provides self-service to analytics teams.
So what is the incremental innovation, the ++, that ETL++ provides? Let’s explore what defines ETL++.
Easy to Use Interface
While many ETL tools have modernized their user experience to be more graphical and drag-and-drop, they still have an old-world data flow style interface with complex interfaces for each component. In addition, they require users to define and work with rigid schemas.
These data flow style UIs can make defining data pipelines complicated, make it extremely difficult to understand the logic and resulting data, and makes reuse problematic. Although graphical, these UIs are not suitable for non- or less-technical users.
The ETL++ user experience is around data discovery and exploration and a non-workflow style UI that facilitates non-technical users while also giving data-savvy engineers the power they need. The ETL++ UI is:
- Spreadsheet-style, allowing users to see and profile the data at each step,
- Interactive, showing the impact of each function and change
- Drag-and-drop and wizard-driven, requiring no technical knowledge
- Schema-on-read, automatically deliver new schemas as the data is shaped
- Able to easily show logic and data lineage
Excel is one of the most popular tools with analysts and data scientists. With this existing knowledge-base of how to use Excel, a spreadsheet-style UI eliminates any learning curve for users.
Flexibility and Reuse
As previously mentioned, one of the areas C++ improved upon C was providing an object-oriented abstraction layer that facilitated component reuse. ETL++ provides similar reuse of data pipelines and logic, along with resulting datasets.
Data engineers, analysts, or data scientists can define data pipelines that can create one or more datasets along the way. A data pipeline consists of multiple components which contain logic and data. In Datameer Spectrum, these reusable components are workbooks and worksheets.
Entire data pipelines or individual components (worksheets) can be shared with the team either to customize that pipeline or to extend it with their own additional offshoot pipelines, depending upon their access rights. The extension pipelines can be attached to any of the components within the originating pipelines.
Great use of this is for data engineers to create baseline data pipelines with canonical datasets, and then have analysts or data scientists build personalized offshoots of those. Analysts can combine multiple datasets, enrich the data, slice it in different ways, or more to shape the data to the specific needs of their analysis.
Let’s use an example. We will start with the data engineer defining a data pipeline that produces a canonical Customer Activity dataset. The data engineer might blend together data from the CRM system, website, sales system, and data warehouse, cleanse it, and enrich it with calculated columns to produce the Customer Activity dataset. Once tested and approved, the data steward can then give the necessary parties – analysts and data scientists – access to the output dataset and viewing access to the pipeline. This can facilitate re-use and extensibility downstream by the rest of the community.
The reuse and extensibility of ETL++ create a collaborative process by which data engineers can create base data pipelines and datasets, cleansing and normalizing data into canonical, usable datasets. Analysts can then create customized extensions to produce datasets suitable for the needs of different forms of analysis. The base logic can remain intact, while the analyst extends the pipeline with their own custom logic.
Extensions added by the analyst will be to combine, reshape and slice data in different ways. This requires a rich library of easily applicable functions that can:
- Join and union data in many different ways
- Sort and filter data to better organize it
- Pivot, group, and aggregate data to slice it in advanced ways
- Work with time and date fields to create Windows
- Extract meaning from text fields for text mining or to add context
- Form advanced lists
- Apply math, trigonometry, or statistical formulas to enrich data
Continuing with our example, a marketing analyst wants to check on the impact of their marketing activities on customer behavior and activity. The analyst has access to the canonical dataset, can see the logic behind it, and now has the ability to create their own extensions.
The analyst might extend the original data pipeline by blending marketing campaign and geo-location data with the original dataset, extract keyword data from campaign messages, then aggregate and bin data along various dimensions. The analyst has then created their own offshoot dataset that can analyze the impact of various campaigns and messaging on activity and sales by geo-location.
Data Scientist Functions
The same collaborative process discussed above also applies to data science. Data scientists can re-shape and organize data to their needs with data pipeline extensions while using cleansed, dependable, and trusted datasets supplied by data engineering.
Data scientists will use many of the same functions as analysts to shape and organize their data. But data science requires the function library to include specialized functions that:
- Encode data to feed AI and machine learning models
- Offer advanced ways to slice data and aggregate data
- Apply math, trigonometry, or statistical formulas to enrich data
- Can apply advanced algorithms to enrich and further shape data
And let’s not leave the data scientist out of our example, who perhaps wants to build a model that understands where customers are in the buying cycle. The marketing analyst blended together marketing campaign and geo-location in their pipeline, so the data scientist can use this intermediate dataset in the analyst’s pipeline and create their own extension to feed the model.
Re-using the market analyst’s dataset, the data scientist can:
- Do feature exploration by interactively applying algorithms and creating pivot tables, then filter down the data to the influential fields on results,
- Aggregate and bin the data along the features to count various activities (i.e. number of website visits) to account for the level of activity in the model,
- Further enrich the data with calculated columns,
- Create encoded fields from various features that will feed the machine learning model,
- Slice the data into testing, training, and feed datasets.
The data scientist now has the datasets they need and an extension to the data pipeline to consistently generate the data.
The data integration and ETL market is a long and rich history. Over time the market’s innovations have focused on the needs of the enterprise around performance, scalability, governance, and more. But besides the move to the cloud, the ETL market has seen little innovation and been stagnant in recent years.
ETL++ is the next evolution in the data integration market and offers a number of innovations. It brings in a new range of users with a new user experience, facilitates re-use and extensibility, and allows analysts and data scientists to shape and organize data to their specific needs.
Ready to see ETL++ in action? Book a personalized demo of Datameer Spectrum.