How to Transform Your Data in Snowflake: Part 2
- John Morrell
- October 11, 2021
In How to Transform Data in Snowflake: Part 1, we discussed the three primary steps you would take to transform your data in Snowflake. This is the T in your ELT process.
In part 2, let’s examine how you can optimize your data models and transformations in your Snowflake data warehouse. The keys to optimizing data transformation for Snowflake come down to three areas:
- Making effective use of virtual data warehouses
- How you execute your data transformation queries
- Employing specific data transformation techniques
Let’s explore each of these.
Virtual Data Warehouses
Virtual data warehouses are a primary means by which you scale out your data and analytics in Snowflake. Virtual data warehouse instances also help optimize the workload for each analytics project. Snowflake recommends you have a different virtual data warehouse for each use case.
Snowflake also recommends that users separate “loading” and “execution” into separate virtual data warehouses, where loading is an area external data is loaded and execution are areas for user query and reporting. This separation is recommended because of the different workloads for data loading and user queries.
Data transformation queries also carry a different workload. And, how you deploy your data transformation queries also has a large impact on workload. As a general rule, you should separate your data transformation views/queries into a separate database away from your raw data. This way, you have the flexibility to use distinct virtual data warehouses – one for your EL and one for your T – and lets you optimize those virtual data warehouses to the compute needs of each and minimize your costs.
Data Transformation Execution
How your data transformation queries are executed also plays a large role in optimizing data transformation in Snowflake. In general, this boils down to a basic decision: do I materialize the data transformation views or run them “at query-time.”
Materialized views are another way to improve query performance for workloads composed of common, repeated query patterns. And data transformation models are typically designed specifically for this purpose – to create a dataset that supports a common, repeated query pattern.
Snowflake commonly shares a best practice that one should use materialized views to improve performance for queries that are frequent and repeated. Using storage is always cheaper than using compute. Materialized views will add a small amount to your storage bills but storage costs are so small and you’ll save much more in your compute bills and query response time. Hence, if your models are to support repeatable query patterns, then materialize them.
One option is to place the data transformation views in the use case database. This option works well when (a) the views are unique to the use case and (b) the views have unique workload or compute requirements. The other option is to have a single database for all your data transformation views. This option works well when you have data transformation views that are shared across different use cases and have common workload characteristics.
In practice, you’ll find you will mix these two approaches. Your data transformation tool, such as Datameer, can maintain a complete repository of data transformation models regardless of where they physically reside.
Data Transformation Techniques
There are different techniques for optimizing your data transformation views/queries. These will impact the performance. They boil down to four recommendations:
- Reduce complexity where you can. If you can, combine multiple queries into one. Simpler is always better.
- Materialize JOINs. JOINs are very costly operations in cloud data warehouses. If a data transformation view JOINs multiple tables, materialize the view.
- Discard data you don’t need. Perform projection in your data transformation models to eliminate fields you don’t need which reduces query time costs.
- Use materialized pre-aggregation. If there are common aggregation values and dimensions that are often used, then create materialized aggregated views.
Data transformation queries typically work with a large volume of your raw, loaded data from your various data sources. How you optimize your data transformation queries is critical to keeping your Snowflake compute and storage optimized.
Datameer provides data transformation solutions that enable data engineers and analysts to transform and model data directly in Snowflake to solve complex analytical projects. Are you interested in learning more about transforming data in Snowflake? Read our complete how-to guides, how to transform your data for analytics and how to optimize your data transformation for Snowflake.