How to use subqueries and table expressions in Snowflake
- How-Tos FAQs
- November 22, 2021
In this tutorial, you will learn how to use subqueries and common table expressions in Snowflake. We will also discuss where and when to use each. Lastly, we will touch on some advanced topics.
A good analogy for a subquery or a common table expression is the parentheses in a simple math equation. According to the order of operations in mathematics, anything in parentheses is calculated first and then the resulting output is fed into the rest of the equation.
Subqueries and common table expressions behave the exact same way. They are queries (also noted by parentheses) that are run before the rest of the query, whose result is stored as a temporary table or single column or value and feeds into the rest of the query as part of a FROM or WHERE clause.
Let’s take a look at subqueries and common table expressions in Snowflake.
Subqueries in Snowflake can be defined by wrapping parentheses around a query and putting that query, with the parentheses around it, in a FROM or WHERE statement of the overall query.
An example of using a WHERE subquery as a filter, using Snowflake Sample Data:
select c_custkey, c_acctbal from tpch_sf1.customer where c_acctbal > (select avg(c_acctbal) from tpch_sf1.customer) order by c_acctbal desc;
The subquery in the overall query above calculates the average account balance of all customers as a single value from the customers table and then that value is used in a WHERE clause. The subquery is in bold below:
where c_acctbal > (select avg(c_acctbal) from tpch_sf1.customer)
In the example above the subquery returns a single value. We can also return several values.
Another example of using a subquery in a WHERE clause:
select sum(c_acctbal) as total_balance from tpch_sf1.customer where c_custkey IN (select o_custkey from tpch_sf1.orders where o_orderpriority = '1-URGENT');
In the example above the subquery returns all customers that have ever had an urgent order and then we calculate the total balance for all of those customers.
One thing to note: we did not specify distinct customers. This is optional and the result of the overall query will be the same, even if customers had multiple urgent orders.
In the last example we queried 1 column of values in the subquery and fed that into the overall query. Subqueries can also be used just like tables in a FROM clause to pull entire table results.
An example of using a subquery in a FROM clause:
select c_custkey, c_acctbal, urgent_orders.first_order_date as first_urgent_order, urgent_orders.order_count as urgent_order_count from tpch_sf1.customer join (select o_custkey, min(o_orderdate) as first_order_date, count(*) as order_count from tpch_sf1.orders where o_orderpriority = '1-URGENT' group by 1 ) as urgent_orders on urgent_orders.o_custkey = customer.c_custkey;
The subquery example above queries a customer id, calculates the first urgent order date for every customer and counts the total number of urgent orders. The result of that query is used in the FROM clause as a table and joined to the customer table where we can get the customer’s account balance.
In this last example, the SQL code works perfectly fine, but it does become a little hard to read. And if we had a more complicated subquery, it would be hard to see what the overall query is doing as the subquery is nested within the middle of the overall query.
That’s where common table expressions come in handy.
Common Table Expression (CTE)
A subquery and a CTE have different syntax in SQL code, but for most use cases behave the exact same way. A CTE uses a WITH clause at the beginning of your SQL script to define and name the cte, which can then be used as just like a temporary view or table later on in the script.
Let’s rewrite our last subquery example as a CTE:
with urgent_orders as ( select o_custkey, min(o_orderdate) as first_order_date, count(*) as order_count from tpch_sf1.orders where o_orderpriority = '1-URGENT' group by 1 ) select c_custkey, c_acctbal, urgent_orders.first_order_date as first_urgent_order, urgent_orders.order_count as urgent_order_count from tpch_sf1.customer join urgent_orders on urgent_orders.o_custkey = customer.c_custkey;
This query returns the same result as the previous example, but now the urgent orders subquery is split out on it’s own up top and referenced in the from clause in the overall query below. This makes the code much more readable and easier to walk through the code in the order that it will be executed.
Subqueries and common table expressions can also be chained together. To understand this, the analogy is parentheses inside parentheses in a math equation. For Excel users, a good analogy would be a nested IF() statement in Excel. Your SQL code can get pretty messy when you have a subquery with several nested levels. And that’s where common table expressions really shine.
with urgent_orders as ( select o_custkey, min(o_orderdate) as first_order_date, count(*) as order_count from tpch_sf1.orders where o_orderpriority = '1-URGENT' group by 1 ), /*** note the comma here ***/ first_urgent_days as ( select orders.o_custkey, sum(o_totalprice) as total_value from orders join urgent_orders on orders.o_orderdate = urgent_orders.first_order_date and orders.o_custkey = urgent_orders.o_custkey group by 1 ) select c_custkey, c_acctbal, urgent_orders.first_order_date as first_urgent_order, first_urgent_days.total_value as first_urgent_day_value, urgent_orders.order_count as total_urgent_order_count from tpch_sf1.customer join urgent_orders on urgent_orders.o_custkey = customer.c_custkey join first_urgent_days on first_urgent_days.o_custkey = customer.c_custkey;
The above example adds the first_urgent_days CTE that references the urgent_orders CTE which is defined (and executed) before first_urgent_days. This can be extremely helpful when you have multiple steps in a data transformation process that you want to split up into smaller queries for easier maintainability and future development.
Important note: CTEs do not have to chain on each other as in the example above. You can also define multiple, completely unrelated CTEs to be used in the overall query.
- Correlated subqueries are subqueries which reference the outer query. This type of subquery is helpful and may have better performance in row-oriented databases, but in column-oriented databases like Snowflake, correlated subqueries have limited support, due to the underlying architecture. Row-by-row calculations or lookups, often the use case for correlated subqueries, perform worse in column-oriented databases like Snowflake than calculating all rows at once and returning the entire result set for the overall query to reference.
- Recursive CTEs are CTE that reference themselves and can recursively create hierarchical data. This is out of the scope of this article, but you can read about recursive CTEs in Snowflakes docs. They are very helpful for querying or creating hierarchical data.
Datameer can complete all the data transformation we’ve done above, and all without SQL. Each transformation we’ve done can be created as a no-code recipe, or a series of operations.
One nice thing about Datameer is the transformation steps are laid out graphically so that it’s not necessary to know SQL to walk through the transformation recipe with any of a project’s stakeholders.
Datameer also makes maintenance easier. You don’t have to sift through a lot of code if you want to add a transformation to the recipe or modify an existing transformation.
Check out: Add identity or autoincrement to table in Snowflake