The Golden Snowflake Warehouse Metrics Part 1 – Warehouse States
- Datameer, Inc.
- December 1, 2022
TL;DR: Our Snowflake costs exploded, and we wanted to find out why. Based on several Snowflake warehouse metrics, we identified warehouses that were more idle than busy or were rarely suspended. By bundling the load and killing orphaned tasks, we reduced idling time as much as possible (in favor of increasing suspended time), reducing unnecessary credit burn.
In Snowsight, we observed a constantly growing credit burn in our account over the last 12 months. The usage statistics already gave us the warehouses ordered by credit consumption, but this was not enough to answer our questions. This blog post focuses on the question: Why does the LOAD warehouse consume more credits than expected?
We looked into the WAREHOUSE_LOAD_HISTORY and observed a higher load in core times and a generally small load overall time. Using QUERY_HISTORY, we fetched queries from small but constant load periods.
1) We found an orphan task named test1234 with the scheduling * * * * *. Once, this task was used to “trial and error” some Snowflake functions. The task was executed every minute and prevented the LOAD warehouse from suspending – each time with only some milliseconds of load and almost 60 seconds of idle time. What a waste of credits.
Additionally, we analyzed which users are using the LOAD warehouse within a typical workday using the following query.
SELECT TIME_SLICE(end_time::timestamp_ntz, 1, 'hour') as slice, user_name, COUNT(*) AS count FROM IMMORTAL_QUERY_HISTORY WHERE end_time >= TO_TIMESTAMP_LTZ('2022-11-16T16:00:00+01:00') AND end_time <= TO_TIMESTAMP_LTZ('2022-11-17T16:00:00+01:00') GROUP BY slice, user_name ORDER BY count DESC;
2) We identified various users, like our automated testing users, that should not be using the LOAD warehouse.
3) Further, the users performing our loading processes ran more often than the consuming data pipeline consumed the data. We will come back to this in one of the following blog posts.
We decreased the credit burn for the LOAD warehouse significantly with the following actions:
1) Deleted task test1234
2) Associated our automated testing users with other warehouses
3) Bundled our loading processes to run once a day (which is sufficient for the consumers)
We came up with the question, what are good indicators for unwanted credit burn, and how can we identify “badly” utilized warehouses in the future?
Results and Conclusion
We needed a dashboard that tells us whether warehouses are behaving unexpectedly. This dashboard would display several warehouse metrics per warehouse, which we use as key performance indicators (KPIs) to assess our warehouse utilization.
These warehouse metrics are given by the percentages of the three warehouse stats: suspended time, idle time, and busy time. As an additional warehouse metric, we determine the average number of credits per hour that we pay for the respective warehouse. All four warehouse metrics give us a feel for the cost of busy and idle time and how much money is saved by suspended time.
|Warehouse||Credits/hour||Busy (%)||Idle (%)||Suspended (%)|
As depicted in the table above, we measured our four warehouse metrics for several warehouses. As you can see, the LOAD warehouse produces nearly the maximum cost (warehouse size X-Small), and that ~94% of these costs are produced during idle time.
After performing the described actions, we measured our warehouse metrics again, as depicted in the pie charts above. The warehouse is now suspended most of the time, which reduced the cost of this warehouse to 1/6th. See the development of the metrics over some days below.
Interestingly, when looking at the busy-idle-ratio after the changes, 75% of these costs are still produced by idle time. Unfortunately, we see a similar situation for other warehouses. A discovery that deserves a sequel.
We distinguish the following three warehouse states: suspended, idle, and busy. Suspended warehouses are inactive and do not cause costs; idle warehouses are active, but there are zero workloads, and busy warehouses are processing at least one query. Suspended warehouses do not cause costs, but idle and busy warehouses equally cause costs, and we want to avoid costs due to warehouses in an idle state.
A warehouse can change the state from suspended to busy for the processing of queries. After completing the last query, the warehouse stays idle for the defined auto-suspend time and switches to suspended when no new queries arrive.
We measure the proportion of the three warehouse states in percent. This is how we can interpret these:
Suspended: There has been no load for a while, allowing your warehouse to suspend and thus save cost.
Idle: Your warehouse is idling most of the time? This produces the same costs as if executing queries. You may consider lowering the auto-suspend time or to redistribute the load to lower the proportion of the idle time.
Busy: Your warehouse is not very busy? You may consider migrating the load from this warehouse to another warehouse or migrating a load from another warehouse to this warehouse.
With these four warehouse metrics, we now have KPIs to judge undesirable warehouse load, but the application of these metrics is not limited to detecting rogue tasks. We will use these and other warehouse metrics to assess the availability of our warehouses, measure the warehouse performance, and plan the redistribution of load from warehouse to warehouse.
Stay tuned for upcoming blog posts!
Contact us today to give feedback, or get in touch with us!