Optimizing Analytics for the Hybrid Cloud
- John Morrell
- June 26, 2020
There is a multitude of techniques for optimizing analytics for the hybrid cloud world we live in. Optimizing your data and analytics in the cloud is critical as costs for computing, storage, networking, and software services can easily spiral out of control. To keep your CFO from pounding down your door when the monthly cloud bill comes due, let’s explore some of these techniques.
Let’s set one fundamental upfront: we are assuming a hybrid cloud landscape. The vast majority of enterprises that have moved or are moving to the cloud are operating in a hybrid landscape, bridging their on-premises data and analytics with cloud services. A recent paper by TDWI showed the propensity of enterprises’ desire to manage data in a hybrid cloud environment.
The second fundamental we will work from is that we are trying to optimize two assets – the amount of data managed (stored) and the amount of compute resource utilized. While network consumption can also be a factor, most cloud environments don’t charge for transmitting data into their cloud, only going out. We will assume a one-way flow of data into the cloud.
Lastly – and this may be a given – we understand that there will need to be some form of pre-processing of the data. This will be to organize, prepare, and curate the data for cloud analytics.
Understand What Data is Necessary
The first technique to optimizing your cloud analytics is to figure out what data – or more specifically, which datasets and part of those datasets – is essential to produce the analytic results the business teams require. This is done via discovery, collaboration, and knowledge-sharing amongst the data and analytics team to determine which data best fits the problem at hand.
The largest cost component of cloud analytics is the compute cost for queries based on how much data needs to be scanned and crunched. Optimizing the data will reduce ad-hoc and operational query compute costs by eliminating querying across unneeded data. It will also reduce data storage fees, although that is a much lower cost component.
Optimize Data Movement
Our first technique helped us figure out what data is actually needed, which optimizes data movement. But just because the data is needed doesn’t necessarily mean it needs to be moved to the cloud.
If datasets are used to answer ad-hoc, in-the-moment analytic questions, you may be better served to virtualize the data rather than move it into the cloud. Or if the current question is ad-hoc but eventually requires operational reporting, virtualize the data for the time being, then set up optimized data pipelines into the cloud when the operational reporting is needed.
This will again optimize the compute and storage costs of your cloud data warehouse. It also reduces the security and governance risks of wholesale data movement into the cloud. Also, use a data preparation and pipeline tool that feeds directly into your cloud data warehouse (CDW) and don’t waste compute and storage costs of landing data in cloud storage then pumping into your CDW.
Refine Data Before Moving It into the Cloud
Make sure you curate – prepare, transform, refine, etc. – your data first before sending it into your cloud data warehouse. This allows you to reduce the dataset, eliminating unneeded or redundant columns, dirty unusable rows, or rows representing data out of bounds for the analysis.
Some CDWs tout the fact that you can send raw data and use the CDW to prepare it. However, these data preparation capabilities are in no way self-service, requiring SQL-coding skills, creating coded preparation jobs, and having minimal transformation capabilities.
Curating your data with a self-service data preparation tool first will speed your analytics cycles and eliminate the need for costly custom coded data pipelines. It will also eliminate the extra CDW compute and storage costs otherwise needed for data preparation and for doing inline transformations at query time.
De-normalize and Pre-aggregate
Some of the most expensive operations in a CDW from a compute cost standpoints are JOINs and aggregations. Complex queries with JOINS and aggregations can dramatically raise cloud analytics costs. Creating materialized views can reduce the run-time query costs but requires query compute time each time the view is materialized and adds to data storage.
Several customers have used techniques to flatten out or de-normalize the data, enrich it, and pre-aggregate it along known dimensions in the data preparation pipeline. This will make the data more valuable to the downstream analysts and reduce the compute costs for the CDW.
Use Elastic Tools/Platforms
We encounter a greater variance of analytics workloads in the cloud, with an equal mix between ad-hoc analytics and operational BI and reporting. The exploratory nature of ad-hoc analytics often brings more complex queries requiring more cloud resources and can mean workloads will vary.
Solve this problem by using tools and platforms that are elastic and priced by the workload. When workloads are higher, the platform will scale-out resources only to what is needed, and when lower, scale them down. This optimizes compute costs and only see increases when you have query bursts.
Keep your CFO from banging down your door and keep your budget in line by optimizing your cloud analytics using the techniques we’ve mentioned here:
- Understanding what data is necessary
- Optimizing data movement
- Refining data before moving it into the cloud
- De-normalizing, enriching, and pre-aggregating data in the pipelines
- Using elastic tools and platforms