The Golden Warehouse Metrics Part II – Governance

  • Martin Nettling
  • December 27, 2022

TL;DR: We observed decreased performance of our Snowflake warehouses and wanted to implement data governance. Based on Snowflake’s built-in QUERY_HISTORY, we compute the warehouse metrics, feasibility, availability, concurrency, and response time. We assessed potential reasons for such drops in performance, and we were able to evaluate options to react.

Snowflake Governance:

The Problem

Recently, we organized a bug bash in our company to put Datameer through its paces. Occasionally, some Snowflake calls showed unexpectedly long waiting times and timeouts, which led to undesirable behavior of our software. 

snowflake data governance

AVG_QUEUED_LOAD and AVG_RUNNING from the WAREHOUSE_LOAD_HISTORY from the LOAD_WH over 90 minutes.

The WAREHOUSE_LOAD_HISTORY showed queued queries indicating an overloaded warehouse. Consequently, we revised how our app communicates with Snowflake to avoid undesirable behavior in the future. Nevertheless, warehouses can overload, and we wanted to quantify such events enabling us to choose an appropriate warehouse size or even consider a multi-cluster warehouse. A well-directed decision can save money and keep our Snowflake users happy simultaneously. 

When exploring the WAREHOUSE_LOAD_HISTORY , we also observed blocked queries in our account, another indicator of degraded performance. We looked into the QUERY_HISTORY to understand who was affected. Our investigations revealed that the blocked queries came from testing our QA team. We wanted to know if we needed to change some testing approaches to ensure that blocked queries do not defer or hamper our QA processes.

snowflake data governance

AVG_BLOCKED from the WAREHOUSE_LOAD_HISTORY from the QA_ONLY_WH over a month.

In this blog post, we measure the impact of degraded performance on query level to enable data-driven decisions on questions like: 

  • How often and due to which activities did we face degraded performance in the past?
  • Were query response times still fast enough for users of our Datameer app? 
  • What warehouse size is appropriate for a certain scenario?

Results and Conclusion

We defined the warehouse metrics availability, feasibility, concurrency, and response time. All four warehouse metrics tell us to which degree a warehouse is overloaded or inhibited by blocked queries. We use these metrics as key performance indicators (KPIs) for degraded performance due to warehouse overload and blocked queries.  

We surveyed events of degraded performance in the past, which gave us valuable information about the impact of various activities in the company. Based on that knowledge, we could estimate the impact of future endeavors to avoid degraded performance for a minimum of credit burn.

snowflake data governance

Availability, concurrency, and response time (s) within 90 minutes of the LOAD_WH warehouse. Before and after the bug-bash we see business as usual. Still, during bug-bash, the availability drops from 100% to 25%, the concurrency increases from 50% to more than 900% (i.e.,> 70 concurrent queries), and the response times increase from 45 ms to 270 s.

During the bug-bash, we used the LOAD_WH warehouse (size was X-Small ), and the availability dropped to 25%, indicating that only every fourth query could be processed at a time. Hence, the warehouse accumulated more than 50 queued queries, and the mean response time increased to nearly five minutes, which is way too high for an interactive usage of our product. We conclude that we need at least a Medium sized warehouse for such a scenario the next time.

snowflake data governance

Feasibility and response time (ms) within one month for the QA_ONLY_WH warehouse.

The blocked queries observed for the QA_ONLY_WH warehouse result in a minimum feasibility of 97% and response times between 55 ms and 400 ms. We conclude that blocked queries occurred rarely and that concurrent updates had a short overlap resulting in a small impact. We conclude that blocked queries are not problematic and that we do not have to take action here.

Solution

We distinguish degraded performance due to warehouse overload and degraded performance due to blocked queries. A warehouse becomes overloaded when it receives more jobs than it can process, which results in queued queries. Query blocking can also be considered some kind of queuing, but the root cause is different. Query execution is blocked when another query locks an object for an update. A query that attempts to update the same object simultaneously is blocked until the other query releases the lock (see simple example below).

Exemplary warehouse workload with blocked and queued queries.

We want to separately assess degraded performance due to an overloaded warehouse from degraded performance due to blocked queries. The issue of blocked queries is not necessarily specific to a single warehouse, and we wanted to measure to which degree queries in different warehouses are affected by that. Also, the way of solving this is different. In case of an overloaded warehouse, we can increase the warehouse size or re-distribute the workload to other warehouses; for blocked queries, we have to adjust the way our updating processes are working instead.

We utilize the warehouse metrics availability, feasibility, concurrency, and response time to gauge the degraded performance of our warehouse and Snowflake data governance as follows.

The availability metric captures to which degree a warehouse is affected by queued queries (i.e., warehouse repair, provisioning, and overload), and values range from zero (everything is queued) to one hundred (no queuing at all). The availability should be close to 100%. Otherwise, your applications might crash, e.g., because certain routines might time out after 30s.

The feasibility metric captures to which degree a warehouse is affected by blocked queries, and values range from zero (everything is blocked) to one hundred (no blocking at all). If this metric is below 100%, you may want to check which processes are updating a common Snowflake resource and whether a better strategy can lead to block-free update cycles.

The concurrency metric reflects the number of concurrent queries of a warehouse relative to the MAX_CONCURRENCY_LEVEL warehouse parameter, and values range from zero (no queries) through one hundred (optimal warehouse utilization) to open end (warehouse overload with poor response times).

The response time metric specifies the mean overall time for query completion and depends not only on queuing or blocking of queries but also on the execution time. The response time should be quick enough so that you can work with Snowflake interactively and your consumers do not face timeouts.

Outlook

We wanted to manage our warehouse, and by setting up our Snowflake data governance, using the warehouse metrics availability, feasibility, concurrency, and response time, we have KPIs to assess the impact of degraded performance due to warehouse overload and blocked queries. We can anticipate the impact of future activities based on the impact of events from the past. We will introduce additional warehouse metrics for further use cases, such as decreased warehouse cost.

Stay tuned for upcoming blog posts!

Contact us today to give feedback, or get in touch with us!

Authors: Martin Nettling, André Rouél, Hendrik Treutler 

Related Posts

Top 5 Snowflake tools for Analysts- talend

Top 5 Snowflake Tools for Analysts

  • Ndz Anthony
  • February 26, 2024