The data modeling world is constantly evolving as new technologies are introduced. First, it was data warehousing, then MPP data warehouses, followed by Hadoop and data lakes, and now we are in the cloud data warehouse era.
We have seen different data modeling options starting with the Inmon and Kimball models and evolving into the “data flattening” models in data lakes. We have also seen the data transformation process evolve from ETL to ELT.
Data modeling is the process of defining data content and structure for a specific purpose. The data modeling process is general purpose and is used for both applications and analytics.
Data modeling typically happens in three stages. First, a conceptual model is created that defines the high-level data requirements. This is then translated into a logical model, which adds more detail and defines a destination-independent logical structure. Finally, the logical model is translated into a physical model within a data platform such as a cloud data warehouse like Snowflake.
The two most prominent data modeling approaches for analytics came from data warehouse innovators Bill Inmon and Ralph Kimball. Inmon advocated a subject-oriented approach to data modeling organized into subject areas, while the Kimball approach focused on organizing data into fact tables and dimension tables in what became known as a star schema. Each approach added needed structure to data warehouses and models, but that structure also required lengthy change management processes.
Data lakes brought about another data modeling technique that was more agile. Data transformation could take advantage of schema-on-read and the highly parallel processing of Hadoop. Large volumes of raw data were loaded into the data lake, and then data engineers would transform the data into a subject-specific structure for each use case. The resulting structure was typically a wide, flat table with all the data needed for specific analytics use cases.
In both cases, data was processed with an ETL model – extract, transform, and load. In the case of data warehouses, data was transformed going into the warehouse, and for data lakes, it was transformed directly on the data lake.
The advent and popularity of cloud data warehouses have altered the way we think about data transformation and modeling. The new ELT model – extract, load, and transform – loads raw source data into the CDW first, where it is then transformed into its final form for your analytics.
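A minimal sketch of the ELT flow, using Python's built-in sqlite3 module as a stand-in for the cloud data warehouse; the table names, columns, and values are hypothetical:

```python
import sqlite3

# sqlite3 stands in for the CDW here; all names and data are illustrative.
conn = sqlite3.connect(":memory:")

# E + L: extract raw rows from a source and load them untransformed,
# exactly as they arrive (note the amount is still a string).
raw_orders = [(1, "2023-01-05", "49.99"), (2, "2023-01-06", "15.00")]
conn.execute("CREATE TABLE raw_orders (id INTEGER, order_date TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# T: transform inside the warehouse with SQL, after loading.
conn.execute("""
    CREATE TABLE orders AS
    SELECT id, DATE(order_date) AS order_date, CAST(amount AS REAL) AS amount
    FROM raw_orders
""")
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]  # total order amount
```

The key point is the ordering: the raw table lands first, and typed, analytics-ready tables are derived from it inside the warehouse rather than in a separate ETL server.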
This allowed organizations to take advantage of the inexpensive, scalable compute and storage services of the CDW, and it created agility by separating the data loading and data transformation workloads, with data engineers performing the former and data analysts the latter. Organizations could create any number of subject-specific analytical data models optimized for their own needs, using modern organization techniques such as Snowflake virtual data warehouses.
The ELT model also allowed organizations to share the data modeling and transformation workload. Initially, data modeling and transformation were the specialty of the ETL developer. On data lakes, a new role emerged – the data engineer – and took over the task. In cloud data warehouses, the EL and the T are separated: the data engineer is responsible for the EL, while responsibility for the T – where much of the data modeling happens – is shared between the data engineer and the analytics community.
It is arguable whether the EL part of the ELT process is technically considered data modeling – the process of getting the original, raw data into your cloud data warehouse. At a minimum, the EL process has a big impact on the overall data model and structure within your cloud data warehouse.
In the modern data landscape, data comes from a wide variety of sources. The fastest-growing sources of data for analytics are SaaS applications and cloud services. These sources have extremely complex data structures and APIs. Modern EL or “data loader” tools, such as Fivetran, Stitch, and others, focus on eliminating this complexity and replicating “objects” from these sources into your cloud data warehouse.
Thus, the initial data model your team will work from is a set of tables in the cloud data warehouse that look like the objects from your data sources, are grouped similarly, and contain the same fields. But because the data was designed for use within the SaaS application or cloud service, it can be cryptic and hard for a general-purpose data analyst to understand.
There is one very important data transformation step that needs to be applied to the data on its way into the cloud data warehouse. If any of the data is private or sensitive, it needs to be anonymized or masked. This is critical to maintain the privacy of the data and ensure regulatory compliance.
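As a sketch of what masking might look like in practice, the snippet below replaces an email address with a salted, irreversible token before the record lands in an analyst-facing table. The field names, salt, and token length are illustrative assumptions, not a prescribed scheme:

```python
import hashlib

def mask_email(email: str, salt: str = "per-environment-secret") -> str:
    """Return a stable, irreversible token for an email address.

    The salt is a hypothetical per-environment secret; the same input
    always yields the same token, so joins on the masked value still work.
    """
    digest = hashlib.sha256((salt + email.lower()).encode("utf-8")).hexdigest()
    return digest[:16]  # shortened token for readability

# Mask the sensitive field on the way into the warehouse.
row = {"customer_id": 42, "email": "jane.doe@example.com"}
row["email"] = mask_email(row["email"])
```

Hashing with a salt (rather than dropping the column) preserves the ability to join and count distinct customers without ever exposing the raw value.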
After the raw data is loaded, the data engineering team may perform a first pass of data cleansing. While data cleansing can cover a number of different techniques, they can be loosely grouped into two categories: general, standardized cleansing and use case-specific cleansing. In this first step, data engineers apply general, standardized cleansing to (a) find and correct missing or invalid values, (b) reformat incorrectly formatted fields, and (c) extract individual fields from complex, multi-part columns.
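The three cleansing steps above can be sketched in a few lines of Python; the records, field names, and rules below are illustrative, not a real schema:

```python
# Hypothetical raw records with the three kinds of problems named above.
raw = [
    {"name": "  Jane Doe ", "phone": "555-0100", "full_address": "1 Main St, Springfield, IL"},
    {"name": None,          "phone": "5550101",  "full_address": "9 Oak Ave, Portland, OR"},
]

def clean(record: dict) -> dict:
    # (a) find and correct missing or invalid values
    name = (record["name"] or "UNKNOWN").strip()
    # (b) reformat incorrectly formatted fields (normalize phone to NNN-NNNN)
    digits = record["phone"].replace("-", "")
    phone = f"{digits[:3]}-{digits[3:]}"
    # (c) extract individual fields from a complex, multi-part column
    street, city, state = [p.strip() for p in record["full_address"].split(",")]
    return {"name": name, "phone": phone, "street": street, "city": city, "state": state}

cleaned = [clean(r) for r in raw]
```

In a real CDW this logic would typically be expressed as SQL transformations, but the shape of the work is the same.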
Once the data is in the CDW and has gone through the first pass of data transformation, the data engineering team can transform the raw data into canonical data models that represent specific subjects. Examples of these would be data models representing customers, contacts, leads, opportunities, activities, and more.
The primary rationale for canonical data models is to create shared, reusable components for multiple use cases. Along with this come added benefits:
The data engineering team will gather requirements from the various business and analytics teams to build the canonical data models. These data models will typically be supersets of the requirements to maximize reuse and consumption. The data models will also continuously evolve as new requirements or data sources come into play.
Since the raw data that came from the data sources is often normalized (in some cases mildly normalized and others highly normalized), the canonical data models will typically blend (JOIN, UNION, etc.) data from multiple objects to create a rich and complete set of fields to represent the subject. In addition, the canonical data models may also have some data enrichment to calculate new fields for standardized use in different use cases.
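As a sketch of blending plus enrichment, the snippet below builds a hypothetical canonical "customer" model by joining two normalized source tables and deriving a reusable full_name field, again using sqlite3 as a stand-in for the CDW:

```python
import sqlite3

# Illustrative normalized source tables, as a loader might replicate them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (account_id INTEGER, company TEXT);
    CREATE TABLE contacts (contact_id INTEGER, account_id INTEGER,
                           first_name TEXT, last_name TEXT);
    INSERT INTO accounts VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO contacts VALUES (10, 1, 'Jane', 'Doe'), (11, 2, 'John', 'Roe');
""")

# The canonical model JOINs both objects into one rich subject table and
# enriches it with a derived field downstream use cases can share.
conn.execute("""
    CREATE TABLE customer AS
    SELECT a.account_id, a.company, c.contact_id,
           c.first_name || ' ' || c.last_name AS full_name
    FROM accounts a JOIN contacts c ON c.account_id = a.account_id
""")
rows = conn.execute(
    "SELECT company, full_name FROM customer ORDER BY account_id"
).fetchall()
```

Computing the enriched field once, in the canonical model, is what keeps different use cases from recomputing it inconsistently.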
The final step in data modeling is to create a dataset or datasets that are specific to the analytics use case. For modern data modeling in a cloud data warehouse, this task is typically done by the data analyst. Why? It boils down to roles and skills:
Data analysts may have varying technical skills but would prefer to spend more time on what they are good at – analysis – and less on coding data transformations. This is where a low-code or no-code data transformation UI becomes important, eliminating the need for analysts to write complex SQL code and Python scripts.
Use case data modeling and transformation will typically involve:
The optimal, final form of the data model is a single flattened structure – a very large, wide table. This, along with materialization, eliminates the need for expensive JOINs each time an analytic query runs.
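A minimal sketch of that idea: pay for the JOIN once, when the wide table is materialized, so every later query scans a single table. The fact/dimension tables and values are hypothetical, with sqlite3 again standing in for the CDW:

```python
import sqlite3

# Illustrative fact and dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (sale_id INTEGER, product_id INTEGER, amount REAL);
    CREATE TABLE dim_product (product_id INTEGER, category TEXT);
    INSERT INTO fact_sales VALUES (1, 100, 20.0), (2, 101, 5.0);
    INSERT INTO dim_product VALUES (100, 'Hardware'), (101, 'Software');
""")

# JOIN once, at build time, into a materialized wide table...
conn.execute("""
    CREATE TABLE sales_wide AS
    SELECT s.sale_id, s.amount, p.category
    FROM fact_sales s JOIN dim_product p USING (product_id)
""")

# ...so each analytic query reads one flat table, with no JOIN at query time.
hw = conn.execute(
    "SELECT SUM(amount) FROM sales_wide WHERE category = 'Hardware'"
).fetchone()[0]
```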
Datameer is a powerful SaaS data transformation platform that runs in Snowflake – your modern, scalable cloud data warehouse – and together they provide a highly scalable and flexible environment to transform your data into meaningful analytics. With Datameer, you can:
Datameer provides a number of key benefits for your modern data stack and cloud analytics, including:
Modern data stacks using cloud data warehouses and ELT processes have created the need for modernized data modeling within the data stack. A highly modular approach to data modeling and transformation is required, as is a highly collaborative process between data engineering and analytics teams where each can best use their skills and knowledge.
Are data modeling and transformation keeping you from modernizing your analytics? Are your data and analytics engineering costs skyrocketing? Datameer has the data modeling and transformation solution to speed your analytics processes and keep your data engineering costs low. It offers versatility and completeness to facilitate easy data modeling and transformation for the steps detailed above.