What is Data Modeling and How Do You Do It?


The data modeling world is constantly evolving as new technologies are introduced.  First, it was data warehousing, then MPP data warehouses, followed by Hadoop and data lakes, and now we are in the cloud data warehouse era.

We have seen different data modeling options starting with the Inmon and Kimball models and evolving into the “data flattening” models in data lakes.  We have also seen the data transformation process evolve from ETL to ELT.

What is Data Modeling?


Data modeling is the process of defining data content and structure for a specific purpose.  The data modeling process is general purpose and is used for both applications and analytics.

Data modeling often happens in three stages.  First, a conceptual model is created that defines the high-level data requirements.  This is then translated into a logical model, which adds more detail and defines a destination-independent logical structure.  Finally, the logical model is translated into a physical model within a specific data source, such as a cloud data warehouse like Snowflake.
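For example, the physical model that comes out of this process might be expressed as DDL in the warehouse.  A minimal sketch, with hypothetical table and column names:

    -- Illustrative physical model: a customer table as it might be
    -- defined in a cloud data warehouse such as Snowflake.
    -- (Names and types are hypothetical, not from a real schema.)
    CREATE TABLE customer (
        customer_id   NUMBER        NOT NULL,
        full_name     VARCHAR(200),
        email         VARCHAR(254),
        signup_date   DATE,
        region        VARCHAR(50),
        PRIMARY KEY (customer_id)
    );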

Data Modeling for Analytics


The two most prominent data modeling approaches for analytics came from data warehouse innovators Bill Inmon and Ralph Kimball.  Inmon advocated a subject-oriented approach to data modeling organized into subject areas, while the Kimball approach focused on organizing data into fact tables and dimension tables in what became known as a star schema.  Each approach added needed structure to data warehouses and models, but that structure also required lengthy change management processes.
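As a sketch of the Kimball approach, a minimal star schema for sales might pair one fact table with a couple of dimension tables (all names and columns here are hypothetical):

    -- A minimal, hypothetical star schema: one fact table whose
    -- foreign keys point at the surrounding dimension tables.
    CREATE TABLE dim_date (
        date_key   NUMBER PRIMARY KEY,
        full_date  DATE,
        year       NUMBER,
        month      NUMBER
    );

    CREATE TABLE dim_product (
        product_key  NUMBER PRIMARY KEY,
        product_name VARCHAR(200),
        category     VARCHAR(100)
    );

    CREATE TABLE fact_sales (
        date_key     NUMBER REFERENCES dim_date (date_key),
        product_key  NUMBER REFERENCES dim_product (product_key),
        quantity     NUMBER,
        sales_amount NUMBER(12,2)
    );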

Data lakes brought about another data modeling technique that was more agile.  Data transformation could take advantage of schema-on-read and the highly parallel processing of Hadoop.  Large volumes of raw data were loaded into the data lake, and then data engineers would transform the data into a subject-specific structure for each use case.  The resulting structure was typically a wide, flat table with all the data needed for specific analytics use cases.

In both cases, data was processed with an ETL model – extract, transform, and load. In the case of data warehouses, data was transformed going into the warehouse, and for data lakes, it was transformed directly on the data lake.

The Cloud Data Warehouse Era


The advent and popularity of cloud data warehouses (CDWs) have altered the way we think about data transformation and modeling.  The new ELT (extract, load, and transform) model extracts and loads raw source data into the CDW, where it is then transformed into its final form for your analytics.
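In practice, the EL and the T can each be a single statement run inside the warehouse.  A minimal sketch in Snowflake SQL, where the stage, landing table, and columns are invented for illustration:

    -- EL: bulk-load raw source files into a landing table
    -- (assumes a stage named raw_stage and table raw_orders exist).
    COPY INTO raw_orders
      FROM @raw_stage/orders/
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

    -- T: transform the raw data into an analytics-ready table,
    -- using the warehouse's own compute.
    CREATE OR REPLACE TABLE orders_clean AS
    SELECT
        order_id,
        TO_DATE(order_date)        AS order_date,
        UPPER(TRIM(country_code))  AS country_code,
        amount::NUMBER(12,2)       AS amount
    FROM raw_orders
    WHERE order_id IS NOT NULL;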

This allowed organizations to take advantage of the inexpensive, scalable compute and storage services of the CDW.  It also created agility by separating the data loading and data transformation processes and workloads, with data engineers performing the former and data analysts the latter.  Organizations could create any number of subject-specific analytical data models optimized for their own needs, using modern organization techniques such as Snowflake virtual warehouses.

The ELT model also allowed organizations to share the data modeling and transformation workload.  Initially, data modeling and transformation was a specialty of the ETL developer.  On data lakes, a new role emerged – the data engineer – and they took over the task.  In cloud data warehouses, the EL and the T are separated: the data engineer is responsible for the EL, while the T is shared between the data engineer and the analytics community, and most of the data modeling happens in the T.


Step 1: Getting the Data Into the CDW

It is arguable whether the EL part of the ELT process – getting the original, raw data into your cloud data warehouse – technically counts as data modeling.  At a minimum, the EL process has a big impact on the overall data model and structure within your cloud data warehouse.

In the modern data landscape, data comes from a wide variety of sources.  The fastest-growing sources of data for analytics are SaaS applications and cloud services, which have extremely complex data structures and APIs.  The modern EL or “data loader” tools, such as Fivetran, Stitch, and others, focus on eliminating this complexity and replicating “objects” from these sources into your cloud data warehouse.

Thus, the initial data model your team will work from is a set of tables in the cloud data warehouse that look like objects from your data sources, are grouped similarly, and contain all the same fields.  But because the data was designed for use within the SaaS application or cloud service, it can be cryptic and hard for a general-purpose data analyst to understand.

There is one very important data transformation step that needs to be applied to the data on its way into the cloud data warehouse.  If any of the data is private or sensitive, it needs to be anonymized or masked.  This is critical to maintain the privacy of the data and ensure regulatory compliance.
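As a simple illustration, sensitive columns can be hashed or redacted during that load-time transform.  This is only a sketch with hypothetical table and column names; production deployments would more likely use the warehouse's built-in masking policies:

    -- Hypothetical example: anonymize PII before exposing the table.
    CREATE OR REPLACE TABLE contacts_masked AS
    SELECT
        contact_id,
        SHA2(email, 256)                    AS email_hash,    -- irreversible hash
        CONCAT('XXX-XXX-', RIGHT(phone, 4)) AS phone_masked,  -- partial redaction
        state,                                                -- coarse fields kept as-is
        created_at
    FROM raw_contacts;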

After the raw data is loaded, the data engineering team may apply a first pass of data cleansing.  While data cleansing can cover a number of different techniques, they can be loosely grouped into two categories: general, standardized cleansing and use-case-specific cleansing.  In this first step, data engineers could apply general, standardized cleansing to (a) find and correct missing or invalid values, (b) transform incorrectly formatted fields, and (c) extract individual fields from complex, multi-faceted columns.
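A sketch of what that first cleansing pass could look like in SQL, with one expression per technique (table and column names are hypothetical):

    CREATE OR REPLACE TABLE leads_clean AS
    SELECT
        lead_id,
        -- (a) correct missing or invalid values
        COALESCE(NULLIF(TRIM(company), ''), 'UNKNOWN') AS company,
        -- (b) fix incorrectly formatted fields
        TO_DATE(created_at_raw, 'MM/DD/YYYY')          AS created_at,
        -- (c) extract individual fields from a complex column
        SPLIT_PART(full_name, ' ', 1)                  AS first_name,
        SPLIT_PART(full_name, ' ', 2)                  AS last_name
    FROM raw_leads;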

Step 2: Canonical Data Modeling

Once the data is in the CDW and has gone through the first pass of data transformation, the data engineering team can transform the raw data into canonical data models that represent specific subjects.  Examples of these would be data models representing customers, contacts, leads, opportunities, activities, and more.

The primary rationale for canonical data models is to create shared, reusable components for multiple use cases.  This brings added benefits:

  • The creation of a single version of the truth for each subject and field within that subject,
  • Providing shared and standardized definitions and documentation about the data for each subject,
  • Transparency into the data models and how they are built, which builds trust within the analytics community.

The data engineering team will gather requirements from the various business and analytics teams to build the canonical data models.  These data models will typically be supersets of the requirements to maximize reuse and consumption.  The data models will also continuously evolve as new requirements or data sources come into play.

Since the raw data that came from the data sources is often normalized (in some cases mildly, in others highly), the canonical data models will typically blend (JOIN, UNION, etc.) data from multiple objects to create a rich and complete set of fields to represent the subject.  In addition, the canonical data models may also include some data enrichment to calculate new fields for standardized use across different use cases.
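For illustration, a canonical customer model might blend hypothetical CRM and billing objects and enrich the result with a calculated field:

    -- Hypothetical canonical "customer" model: JOIN related objects
    -- into one rich subject view and enrich it with a derived field.
    CREATE OR REPLACE VIEW customer_canonical AS
    SELECT
        a.account_id,
        a.account_name,
        c.primary_contact_email,
        b.lifetime_revenue,
        DATEDIFF('day', a.created_at, CURRENT_DATE()) AS account_age_days  -- enrichment
    FROM crm_accounts   a
    LEFT JOIN crm_contacts   c ON c.account_id = a.account_id
    LEFT JOIN billing_totals b ON b.account_id = a.account_id;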


Step 3: Use Case Data Modeling

The final step in data modeling is to create a dataset or datasets that are specific to the analytics use case.  For modern data modeling in a cloud data warehouse, this task is typically done by the data analyst.  Why?  It boils down to roles and skills:

  • Data engineers tend to know more about the data itself – where it resides, how it is structured and formatted, and how to get it – and less about how the business uses the data.  This makes their ideal role getting the data into the CDW and performing first-pass data modeling.
  • Data analysts know less about the raw data but have a complete understanding of how the business would use the data and how it would be incorporated into analytics.  This makes use case data modeling and transformation their ideal role.

Data analysts may have varying technical skills but would prefer to spend more time on what they are good at – analysis – and less on coding data transformation.  This is where a low-code or no-code data transformation UI becomes important, eliminating the need for analysts to write complex SQL code or Python scripts.

Use case data modeling and transformation will typically involve the following (a SQL sketch of these steps appears after the list):

  • Data cleansing that is specific for the use case, such as identifying and fixing outliers or deduping records,
  • Data shaping and reduction such as sorting and organizing the data, eliminating unneeded fields, or narrowing the scope of the data to time periods or specific dimensions, and
  • Data enrichment to add new calculated fields specific to the analysis or uploading local files specific to the use case, such as external or department-specific data.
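Putting these together, a use case model built on the hypothetical canonical customer model above might look like this sketch, where the dedupe rule, time scope, and calculated field are all invented for illustration:

    -- A hypothetical use case model: dedupe, narrow, and enrich
    -- the canonical data for one specific analysis.
    CREATE OR REPLACE TABLE pipeline_analysis AS
    SELECT
        account_id,
        account_name,
        lifetime_revenue,
        lifetime_revenue / NULLIF(account_age_days, 0) AS revenue_per_day  -- enrichment
    FROM customer_canonical
    WHERE account_age_days <= 730            -- shaping: narrow to ~2 years
    QUALIFY ROW_NUMBER() OVER (              -- cleansing: dedupe accounts
        PARTITION BY account_id ORDER BY lifetime_revenue DESC
    ) = 1;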

The optimal, final form of the data model will be a single flattened data structure – a very large, wide table.  This, along with materialization, eliminates the need to perform expensive JOINs every time an analysis query runs.
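To make the materialization point concrete, here is a hedged sketch reusing the hypothetical canonical model above: creating a table from the view executes the JOINs once, and downstream queries then scan a single wide table:

    -- Materialize the flattened, wide table once; the JOINs behind
    -- the canonical view run here, not at analysis time.
    CREATE OR REPLACE TABLE customer_flat AS
    SELECT * FROM customer_canonical;

    -- Analysis queries now hit a single wide table, with no JOINs.
    SELECT account_name, lifetime_revenue
    FROM customer_flat
    ORDER BY lifetime_revenue DESC
    LIMIT 10;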

Datameer SaaS Data Transformation

Datameer is a powerful SaaS data transformation platform that runs in Snowflake – your modern, scalable cloud data warehouse – providing a highly scalable and flexible environment to transform your data into meaningful analytics.  With Datameer, you can:

  • Allow your non-technical analytics team members to work with your complex data without the need to write code using Datameer’s no-code and low-code data transformation interfaces,
  • Collaborate among technical and non-technical team members to build data models, and the data transformation flows that fulfill those models, each using their skills and knowledge,
  • Fully enrich analytics datasets to add even more flavor to your analysis using the diverse array of graphical formulas and functions,
  • Generate rich documentation and add user-supplied attributes, comments, tags, and more to share searchable knowledge about your data across the entire analytics community,
  • Use the catalog-like documentation features to crowd-source your data governance processes for greater data democratization and data literacy,
  • Maintain full audit trails of how data is transformed and used by the community to further enable your governance and compliance processes,
  • Deploy and execute data transformation models directly in Snowflake to gain the scalability you need over your large volumes of data while keeping compute and storage costs low.

Datameer provides a number of key benefits for your modern data stack and cloud analytics, including:

  • Creating a highly efficient data stack that reduces your data and analytics engineering costs,
  • Allowing you to share the data transformation workload across your broader data and analytics team,
  • Fostering collaboration among the data and analytics team to produce faster, error-free projects,
  • Efficiently using your Snowflake analytics engine for cost-effective data transformation processing,
  • Enabling you to crowd-source your data governance for more effective and efficient governance processes, and
  • Improving data literacy to expand knowledge and effective use of your data.

Wrap Up

Modern data stacks using cloud data warehouses and ELT processes have created the need for modernized data modeling within the data stack.  A highly modular approach to data modeling and transformation is required, as is a highly collaborative process between data engineering and analytics teams where each can best use their skills and knowledge.

Are data modeling and transformation keeping you from modernizing your analytics?  Are your data and analytics engineering costs skyrocketing?  Datameer has the data modeling and transformation solution to speed your analytics processes and keep your data engineering costs low.  It offers versatility and completeness to facilitate easy data modeling and transformation for the steps detailed above.

Learn more by visiting our website or by scheduling a personalized demo.

See How Quickly Datameer Can Transform Your Data in Snowflake.

Learn More