How to Transform Your Data in Snowflake: Part 1
- John Morrell
- October 11, 2021
The data modeling world is constantly evolving as new technologies are introduced. First, it was data warehousing, then MPP data warehouses, followed by Hadoop and data lakes. Now we are in the cloud data warehouse era.
The advent and popularity of cloud data warehouses, such as Snowflake, have altered the way we think about data transformation and modeling. The new ELT (extract, load, and transform) process extracts and loads raw source data into Snowflake, where it is then transformed into its final form for your analytics.
This allows organizations to take advantage of Snowflake's inexpensive, scalable compute and storage services. It also creates agility by separating the data loading and data transformation processes and workloads, with data engineers performing the former and data analysts the latter. Organizations can create any number of subject-specific analytical data models, each optimized for its own needs, and can use modern organization techniques such as Snowflake virtual data warehouses.
The ELT model also allows organizations to share the data modeling and transformation workload. A new role has emerged over the past few years – the data engineer – whose primary responsibility is creating and implementing data pipelines and data models, part of which is data transformation. In Snowflake, the EL and the T are separated: the data engineer is responsible for the EL, while responsibility for the T is shared between the data engineer and the analytics teams.
Step 1: Getting the Data Into the CDW
In the modern data landscape, data comes from a wide variety of sources. The fastest-growing sources of analytics data are SaaS applications and cloud services. These sources have extremely complex data structures and APIs. Modern EL or “data loader” tools, such as Fivetran, Stitch, and others, focus on eliminating this complexity and replicating “objects” from these sources into Snowflake.
Thus, the initial data model your team will work from is a set of tables in the cloud data warehouse that look like objects from your data sources, are grouped similarly, and contain all the same fields. But because the data is still in a form similar to the SaaS application or cloud service objects, it can be cryptic and difficult for a data analyst to understand.
There is one very important data transformation step that needs to be applied to the data on its way into Snowflake. If any of the data is private or sensitive, it needs to be anonymized or masked. This is critical to maintain the privacy of the data and ensure regulatory compliance.
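As a sketch of this in-flight anonymization step, the snippet below pseudonymizes an email (a deterministic hash preserves join-ability across tables) and irreversibly masks a Social Security number. The field names, salt, and masking rules are illustrative assumptions, not anything prescribed by Snowflake or a specific loader tool; Snowflake also offers its own masking policies.

```python
import hashlib

def mask_email(email: str, salt: str = "pipeline-salt") -> str:
    """Pseudonymize an email: a deterministic hash keeps join-ability
    across tables without exposing the raw address."""
    digest = hashlib.sha256((salt + email.lower()).encode()).hexdigest()
    return f"user_{digest[:12]}@masked.invalid"

def redact_ssn(ssn: str) -> str:
    """Irreversibly mask all but the last four digits."""
    digits = "".join(ch for ch in ssn if ch.isdigit())
    return "***-**-" + digits[-4:]

# Hypothetical record on its way into the warehouse
row = {"email": "Jane.Doe@example.com", "ssn": "123-45-6789"}
masked = {"email": mask_email(row["email"]), "ssn": redact_ssn(row["ssn"])}
```

Because the hash is deterministic, the same person loads to the same masked value in every table, so downstream JOINs still work.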
After the raw data is loaded, the data engineering team may apply a first pass at data cleansing. In this first step, data engineers could apply general, standardized cleansing to (a) find and correct missing or invalid values, (b) reformat incorrectly formatted fields, and (c) extract individual fields from complex, multi-faceted columns.
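The three cleansing tasks above can be sketched as one pass over a record. The field names and formats here (a US-style date, a "City, ST" location column) are hypothetical examples, not from the article:

```python
import re
from datetime import datetime

def cleanse(record: dict) -> dict:
    out = dict(record)
    # (a) fill a missing or invalid value with an explicit default
    if not out.get("country"):
        out["country"] = "UNKNOWN"
    # (b) normalize an incorrectly formatted date (MM/DD/YYYY -> ISO 8601)
    try:
        out["signup_date"] = (
            datetime.strptime(out["signup_date"], "%m/%d/%Y").date().isoformat()
        )
    except (KeyError, TypeError, ValueError):
        out["signup_date"] = None
    # (c) extract individual fields from a multi-faceted "City, ST" column
    m = re.match(r"(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})", out.get("location", ""))
    out["city"] = m.group("city") if m else None
    out["state"] = m.group("state") if m else None
    return out

raw = {"country": "", "signup_date": "03/07/2021", "location": "Austin, TX"}
clean = cleanse(raw)
```

In practice this logic would run as SQL or a pipeline step inside Snowflake rather than in application code; the point is that the cleansing rules are generic and standardized, not use-case specific.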
Step 2: Canonical Data Modeling
Once the data is in the CDW and has gone through the first pass of data transformation, the data engineering team can transform the raw data into canonical data models that represent specific subjects. Examples of these would be data models representing customers, contacts, leads, opportunities, activities, and more.
The primary rationale for canonical data models is to create shared, reusable components for multiple use cases. Along with this come added benefits:
- The creation of a single version of the truth for each subject and field within that subject,
- Providing shared and standardized definitions and documentation about the data for each subject,
- Transparency into the data models and how they are constructed, which builds trust in the analytics community.
The data engineering team will gather requirements from the various business and analytics teams to build the canonical data models. These data models will typically be supersets of the requirements to maximize reuse and consumption. The data models will also continuously evolve as new requirements or data sources come into play.
Since the raw data that came from the data sources is often normalized (in some cases mildly normalized and others highly normalized), the canonical data models will typically blend (JOIN, UNION, etc.) data from multiple objects to create a rich and complete set of fields to represent the subject. In addition, the canonical data models may also have some data enrichment to calculate new fields for standardized use in different use cases.
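A minimal sketch of this blending-and-enrichment step is below. The `raw_accounts` and `raw_contacts` tables, their columns, and the derived `full_name` field are all illustrative assumptions; the example runs against SQLite purely so it is self-contained, and the same JOIN-plus-derived-column pattern applies in Snowflake SQL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Hypothetical raw tables, shaped the way a data loader might land them
cur.executescript("""
CREATE TABLE raw_accounts (account_id INTEGER, name TEXT);
CREATE TABLE raw_contacts (contact_id INTEGER, account_id INTEGER,
                           email TEXT, first_name TEXT, last_name TEXT);
INSERT INTO raw_accounts VALUES (1, 'Acme Corp');
INSERT INTO raw_contacts VALUES (10, 1, 'pat@acme.test', 'Pat', 'Lee');
""")
# Canonical "contacts" model: blend objects with a JOIN, and enrich
# with a derived full_name field for standardized reuse downstream
cur.execute("""
CREATE VIEW canonical_contacts AS
SELECT c.contact_id,
       a.name AS account_name,
       c.email,
       c.first_name || ' ' || c.last_name AS full_name
FROM raw_contacts c
JOIN raw_accounts a ON a.account_id = c.account_id
""")
rows = cur.execute(
    "SELECT account_name, full_name FROM canonical_contacts"
).fetchall()
```

Exposing the canonical model as a view (or a materialized table) is what makes it a shared, documented component rather than logic copied into each use case.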
Step 3: Use Case Data Modeling
The final step in data modeling is to create datasets that are specific to the analytics use case. For modern data modeling in Snowflake, this task is typically done by the data analyst. Why? It boils down to roles and skills:
- Data engineers tend to know more about the data itself – where it resides, how it is structured and formatted, and how to get it – and less about how the business uses the data. Their ideal role is therefore getting the data into Snowflake and performing first-pass data modeling.
- Data analysts know less about the raw data but have a complete understanding of how the business would use the data and how it would be incorporated into analytics. Their ideal role is therefore use case data modeling and transformation.
Data analysts may have varying technical skills but would prefer to spend more time on what they are good at – analysis – and less on coding data transformation. This is where a low-code or no-code data transformation UI becomes important, eliminating the need for analysts to write complex SQL code or Python scripts.
Use case data modeling and transformation will typically involve:
- Data cleansing that is specific for the use case, such as identifying and fixing outliers or deduping records,
- Data shaping and reduction such as sorting and organizing the data, eliminating unneeded fields, or narrowing the scope of the data to time periods or specific dimensions, and
- Data enrichment to add new calculated fields specific to the analysis or uploading local files specific to the use case, such as external or department-specific data.
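One of the use-case cleansing tasks listed above, deduping records, can be sketched as a simple first-seen-wins pass. The key field (`email`) and record shape are hypothetical; in Snowflake this would typically be expressed in SQL instead:

```python
def dedupe(records, key=("email",)):
    """Keep the first record seen for each key tuple; a simple
    use-case-specific dedup pass (field names are illustrative)."""
    seen = set()
    out = []
    for r in records:
        k = tuple(str(r[f]).lower() for f in key)
        if k not in seen:
            seen.add(k)
            out.append(r)
    return out

rows = [
    {"email": "a@x.test", "amount": 10},
    {"email": "A@x.test", "amount": 99},  # duplicate, differs only in case
    {"email": "b@x.test", "amount": 5},
]
deduped = dedupe(rows)
```

Which record "wins" a dedup (first seen, most recent, most complete) is itself a use-case decision, which is why this step sits with the analyst rather than in the canonical model.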
The optimal, final form of the data model will be a single flattened data structure – a very large, wide table. This, along with materialization, eliminates the need to perform expensive JOINs every time an analytical query runs.
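The flatten-and-materialize step can be sketched as a `CREATE TABLE ... AS SELECT` that pays the JOIN cost once. The `orders`/`customers` tables are invented for illustration, and SQLite stands in for Snowflake so the example is self-contained; Snowflake supports the same CTAS pattern.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Hypothetical canonical tables feeding a use-case model
cur.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
CREATE TABLE customers (customer_id INTEGER, region TEXT, segment TEXT);
INSERT INTO orders VALUES (100, 1, 250.0), (101, 1, 75.5);
INSERT INTO customers VALUES (1, 'EMEA', 'Enterprise');
""")
# Materialize the JOIN once into a single wide, flat table so
# analytical queries scan one structure instead of re-joining
cur.execute("""
CREATE TABLE orders_wide AS
SELECT o.order_id, o.amount, c.customer_id, c.region, c.segment
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
""")
rows = cur.execute(
    "SELECT order_id, region FROM orders_wide ORDER BY order_id"
).fetchall()
```

The trade-off is storage and refresh cost for query speed: the wide table must be rebuilt or incrementally updated when its sources change.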
In part 2 of this series, we will explain how to optimize your data transformation in Snowflake.
Modern data stacks using cloud data warehouses and ELT processes have created the need for modernized data modeling within the data stack. A highly modular approach to data modeling and transformation is required, as is a highly collaborative process between data engineering and analytics teams where each can best use their skills and knowledge.
Datameer provides data transformation solutions that enable data engineers and analysts to transform and model data directly in Snowflake to solve complex analytical projects. Are you interested in learning more about transforming data in Snowflake? Read our complete how-to guides: how to transform your data for analytics, and how to optimize your data transformation for Snowflake.