ETL, ELT and Data Integration

What is Data Integration?

Data Integration is the process of combining data from multiple sources into a unified view for downstream applications or analytics.  It allows data teams to centralize data, maximizing query performance against the consolidated data store while minimizing the impact on back-end data sources.

Traditionally, data integration has involved three key processing components: extracting data from the sources, transforming it into the unified view, and loading the unified data into the destination.  In some cases, all three operations are contained in the same physical process, while in others, the steps are separated and run individually.  The operations can also come in different orders, the most common being ETL and ELT.

Data integration is often used for analytics purposes, sending the resulting data to analytical databases, data warehouses, or data marts, and organizing the data in specific analytic formats (star schemas, multi-dimensional, etc.).  But newer purposes for data integration include enriching data within an application, such as a CRM, and bringing together all the data needed for a new application, such as a customer or supply chain 360.

What is ETL?

ETL stands for extract, transform, and load, one type of process used for data integration.  Each of the three steps is distinct and can be complicated.  But they are interrelated and come together into a complete process to deliver a unified set of data.

Here is how the steps work:

  • Extraction pulls the data from the source.  For databases, extraction typically involves SQL.  For more complex applications, one might need to code against an API.  Higher-level ETL tools may abstract the extraction step into point-and-click operations.
  • Transformation is the process of shaping the data into the final form needed.  This often includes cleansing (which may also occur during extraction), blending, enriching, aggregating, and organizing the data.
  • Loading puts the data into the destination data store: a data warehouse, data mart, analytical database, or OLAP engine.  Some of the final organization of the data may be coded in this step.
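The three steps above can be sketched in a few lines of Python.  This is a minimal illustration, not a production pipeline: the in-memory SQLite databases are hypothetical stand-ins for a real source system and warehouse, and the table names and sample rows are invented for the example.

```python
import sqlite3

# Hypothetical in-memory databases standing in for a real source and warehouse.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

# Setup: seed the source with some raw order rows (for illustration only).
source.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, " East ", 100.0), (2, "WEST", 250.0), (3, "east", 50.0)])

# Extract: pull the raw rows out of the source with SQL.
rows = source.execute("SELECT id, region, amount FROM orders").fetchall()

# Transform: cleanse (trim and normalize region names) and aggregate by region.
totals = {}
for _id, region, amount in rows:
    region = region.strip().lower()
    totals[region] = totals.get(region, 0.0) + amount

# Load: write the unified, analysis-ready rows into the destination store.
warehouse.execute("CREATE TABLE sales_by_region (region TEXT, total REAL)")
warehouse.executemany("INSERT INTO sales_by_region VALUES (?, ?)", totals.items())

print(sorted(warehouse.execute("SELECT * FROM sales_by_region").fetchall()))
# → [('east', 150.0), ('west', 250.0)]
```

Note that the transformation happens in the pipeline itself, between extraction and loading; only the finished, cleaned data ever reaches the destination.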

ETL processes can be pieced together in ETL tools, hand-coded, or a combination thereof.  An ETL process can run as a single synchronous process, or the steps can be separated and run individually.  In the latter case, there will often be an intermediate data store used to manage the in-flight data.

ETL processes can also range from simple – extracting from a single source and transforming it into an analytical form – to complex – bringing together many sources to transform and feed into one or more sophisticated analytical schemas (e.g., star schema).  Organizations may have ETL processes that feed each other.  A complex process may feed a wide range of data into a data warehouse, and then a simpler process may move data from the data warehouse to a departmental OLAP cube.

What is ELT?

ELT, or extract, load, and transform, is a newer data integration model that emerged with data lakes and cloud analytics.  Data is extracted from the source and loaded into the destination still in its original, raw form.  The raw data is then transformed within the destination into a second form that is ready for analytics.

In practice, we see two primary use cases:

  • Data lakes – Raw data is first extracted AND loaded into the data lake, either from traditional sources or newer complex sources (e.g., JSON, log files).  Once the data is in the lake, a data preparation process transforms it into its final form, either leaving it on the data lake for use (e.g., in Hive) or sending it to downstream BI or analytical destinations.
  • Cloud analytics – Since most SaaS applications have complex APIs, getting data out for analysis is difficult.  Some ELT tools provide connectors to SaaS applications, making it easy to extract and load SaaS data into a cloud data warehouse.  Once in the cloud data warehouse, the data is transformed into an analytics-ready format in a separate schema.

ELT processes on data lakes can be sophisticated, as they often deal with complex data formats, multiple data sources, and a great deal of preparation.  Cloud analytics ELT from SaaS applications tends to be simpler, taking data from one or two SaaS sources, loading it, then transforming it into a denormalized form.
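The key difference from ETL is the order of operations: raw data lands in the destination first, and the transformation runs inside it.  A minimal sketch, again using an in-memory SQLite database as a hypothetical stand-in for a cloud data warehouse (the table names and sample rows are invented):

```python
import sqlite3

# Hypothetical cloud data warehouse, stood in for by an in-memory SQLite DB.
cdw = sqlite3.connect(":memory:")

# Extract + Load: raw rows go in exactly as they arrive, messy casing,
# whitespace, and string-typed numbers included.
cdw.execute("CREATE TABLE raw_orders (id TEXT, region TEXT, amount TEXT)")
cdw.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", [
    ("1", " East ", "100.0"),
    ("2", "WEST",   "250.0"),
    ("3", "east",   "50.0"),
])

# Transform: run inside the destination's own SQL engine, producing an
# analytics-ready table (in a real CDW this would live in a separate schema).
cdw.execute("""
    CREATE TABLE orders_clean AS
    SELECT CAST(id AS INTEGER)  AS id,
           LOWER(TRIM(region))  AS region,
           CAST(amount AS REAL) AS amount
    FROM raw_orders
""")

print(cdw.execute(
    "SELECT region, SUM(amount) FROM orders_clean GROUP BY region ORDER BY region"
).fetchall())
# → [('east', 150.0), ('west', 250.0)]
```

Compare this with the ETL sketch: the cleaning logic is the same, but here it is expressed as SQL executed in the destination after loading, rather than in the pipeline before loading.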


What are the Advantages of Each?

ETL is the more traditional method of data integration; it has well-known best practices and approaches, and skills are easy to find.  A good number of tools on the market support traditional ETL approaches.

However, developing ETL processes can be slow, cumbersome, and costly.  Since it requires specialized tools and skills, dedicated teams are needed.  These teams will typically have project backlogs, forcing analytics teams to wait.  Some of the legacy ETL products also carry a very high price tag.

ELT is more modern and embraces the agile methodologies of today.  Teams can iterate quickly on projects, and often analysts can create their ELT pipelines using an ever-growing number of tools that eliminate coding.

However, ELT has its challenges.  When Hadoop and Spark emerged, there was a shortage of skills, driving up data engineering costs.  There was a similarly complex problem in working with SaaS APIs.  Tools such as Datameer Spotlight emerged to abstract the complexity of Hadoop, Spark, and SaaS APIs, allowing easy, code-free creation of ELT pipelines.

There is a second gotcha with ELT.  Tools that simply extract, load into a cloud data warehouse (CDW), and do transformations there can unknowingly raise the cost of running the CDW.  Therefore, it is essential to look for tools that support hybrid ETL/ELT processes and eliminate wasted CDW compute costs.
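One way to see the cost effect is to compare how much data each approach asks the warehouse to store and scan.  The sketch below uses invented figures purely for illustration; a hybrid pipeline that pre-aggregates before loading hands the CDW far fewer rows than pure ELT, which loads everything raw and re-aggregates on every query.

```python
# Hypothetical raw event data: 100,000 rows for a single region.
raw_rows = [{"region": "east", "amount": i % 7} for i in range(100_000)]

# Pure ELT: all raw rows are loaded into the CDW, which then aggregates
# them (and pays the compute cost) every time the query runs.
rows_loaded_elt = len(raw_rows)

# Hybrid ETL/ELT: the pipeline aggregates before loading, so the CDW
# receives only one pre-aggregated row per region.
totals = {}
for row in raw_rows:
    totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]
rows_loaded_hybrid = len(totals)

print(rows_loaded_elt, rows_loaded_hybrid)  # → 100000 1
```

The right split is workload-dependent: raw detail in the lake or warehouse is valuable when analysts need drill-down, but aggregations that are always recomputed the same way are cheaper to do once, upstream.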

What are Some of the ETL and ELT Tools?

There are two different approaches to both ETL and ELT – hand-coding versus tools.  Many organizations still have legacy ETL and ELT code, and teams that use traditional coding methods.  But there are a large number of ETL and ELT data integration tools on the market.  They fall into these buckets:

  • Traditional data integration suites – these are platforms that started as ETL tools and evolved into a fully-featured data integration suite with various supporting tools.  This group includes specialty data integration vendors (Informatica, Talend, etc.) and larger data platform vendors (Microsoft, Oracle, IBM, etc.).
  • Data integration tools in BI Suites – some BI vendors include ETL tools as part of their suites.  This group comprises SAS, SAP, Alteryx, Information Builders, and Hitachi Vantara.
  • Cloud platform data integration tools – each of the major cloud vendors includes one or more data integration tools as part of its suite.  This group contains AWS Glue, Microsoft Azure Data Factory, and Google Cloud Data Fusion.
  • Modern ETL/ELT platforms – several modern specialty ETL/ELT tools with a strong focus on easy creation of data pipelines and data preparation, working on-premises or in the cloud.  This group includes Datameer Spectrum as well as Trifacta.
  • Cloud data integration tools – these are newer cloud ETL and ELT tools that focus on getting data from SaaS applications to CDWs.  This group contains Snaplogic, Matillion, and Fivetran.

What is Datameer Spectrum?

Datameer Spectrum combines the full power and reliability of fully-featured ETL with wizard-driven simplicity to turn your data from raw form into analysis-ready form in minutes, without a single line of code.  Once ready, Spectrum’s complete operationalization and governance features enable reliable, automated, and secure data pipelines to ensure a consistent data flow.

Spectrum provides a hybrid ETL and ELT platform with the flexibility to support whichever forms of data integration your organization needs.  All data pipelines run on the same enterprise-grade operationalization platform to deliver the security, governance, and scale you need.

Spectrum is cloud-native on all three major cloud platforms (AWS, Azure, GCP) and carries with it the elasticity and cost economics you would expect from the cloud.  Spectrum also bridges any data sources you have regardless of type, format, and location (cloud or on-premises).

Why Datameer Spectrum for ETL and ELT?

Datameer Spectrum provides you with the best ETL, ELT, and cloud data integration functionality in a single package, at modern cloud-based pricing levels.  It gives you the same level of reliability and enterprise-grade functionality as high-end traditional ETL data integration tools, with the ease, re-use, and flexibility of modern cloud ETL tools, at affordable consumption-based pricing.

For simpler ELT needs, Spectrum lets analysts create and manage their own data pipelines without coding and operate them efficiently and reliably.  There is easy wizard-based data extraction from over 100 sources, including SaaS applications and cloud services, point-and-click transformations, and managed data loading into destinations.

For more sophisticated ETL or data integration needs, Spectrum offers the same easy code-free UI with over 300 powerful functions supporting even the most complex data transformations in a single click and offers catalog features and reuse for data engineering productivity.  Underneath is an enterprise-grade operationalization engine with end-to-end security, governance, reliability, and scale that our many G500 customers have come to expect.

Want to learn more?  Visit our Datameer Spectrum pages to dig deeper.  Or see for yourself with a personalized demo.