Optimizing Analytics for the Hybrid Cloud

  • John Morrell
  • June 26, 2020

There are many techniques for optimizing analytics in the hybrid cloud world we live in. Optimizing your data and analytics in the cloud is critical because costs for compute, storage, networking, and software services can easily spiral out of control. To keep your CFO from pounding down your door when the monthly cloud bill comes due, let’s explore some of these techniques.

Fundamentals

Let’s set one fundamental upfront: we are assuming a hybrid cloud landscape. The vast majority of enterprises that have moved or are moving to the cloud operate in a hybrid landscape, bridging their on-premises data and analytics with cloud services. A recent paper by TDWI showed enterprises’ propensity to manage data in a hybrid cloud environment.

The second fundamental we will work from is that we are trying to optimize two assets: the amount of data managed (stored) and the amount of compute resources used. While network consumption can also be a factor, most cloud environments don’t charge for data transferred into their cloud, only for data going out. We will assume a one-way flow of data into the cloud.

Lastly – and this may be a given – we assume the data will need some form of pre-processing to organize, prepare, and curate it for cloud analytics.

Understand What Data is Necessary

The first technique for optimizing your cloud analytics is to figure out what data – or more specifically, which datasets and which parts of those datasets – is essential to produce the analytic results the business teams require. This is done via discovery, collaboration, and knowledge-sharing among the data and analytics team to determine which data best fits the problem at hand.

The largest cost component of cloud analytics is the compute cost for queries based on how much data needs to be scanned and crunched.  Optimizing the data will reduce ad-hoc and operational query compute costs by eliminating querying across unneeded data.  It will also reduce data storage fees, although that is a much lower cost component.
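To make the effect of scanning less data concrete, here is a rough sketch of how column and row pruning changes the volume a query has to read. It assumes a columnar engine whose cost tracks bytes scanned; the table layout, column widths, and the `scanned_bytes` helper are all hypothetical, and real billing models vary by vendor.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    avg_bytes_per_row: int  # hypothetical average width, for illustration only


def scanned_bytes(columns, row_count, needed, row_fraction=1.0):
    """Estimate bytes read when only `needed` columns and a fraction of rows are scanned."""
    width = sum(c.avg_bytes_per_row for c in columns if c.name in needed)
    return int(width * row_count * row_fraction)


# Hypothetical orders table: a few narrow columns plus two wide ones nobody queries.
columns = [
    Column("order_id", 8),
    Column("customer_id", 8),
    Column("order_date", 4),
    Column("amount", 8),
    Column("notes", 200),
    Column("raw_payload", 772),
]
row_count = 1_000_000

# Scanning everything versus only the columns and date range the analysis needs.
full = scanned_bytes(columns, row_count, {c.name for c in columns})
pruned = scanned_bytes(
    columns, row_count, {"customer_id", "order_date", "amount"}, row_fraction=0.25
)

print(f"full scan:   {full:,} bytes")
print(f"pruned scan: {pruned:,} bytes")
print(f"reduction:   {1 - pruned / full:.1%}")
```

The numbers are made up, but the shape of the result is the point: dropping wide, unused columns and out-of-scope rows shrinks every subsequent query, not just the stored footprint.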

Optimize Data Movement

Our first technique helped us figure out what data is actually needed, which optimizes data movement.  But just because the data is needed doesn’t necessarily mean it needs to be moved to the cloud.

If datasets are used to answer ad-hoc, in-the-moment analytic questions, you may be better served by virtualizing the data rather than moving it into the cloud. Or, if the current question is ad-hoc but will eventually require operational reporting, virtualize the data for the time being, then set up optimized data pipelines into the cloud when the operational reporting is needed.

This will again optimize the compute and storage costs of your cloud data warehouse (CDW). It also reduces the security and governance risks of wholesale data movement into the cloud. In addition, use a data preparation and pipeline tool that feeds directly into your CDW, so you don’t waste compute and storage on landing data in cloud storage and then pumping it into your CDW.

Refine Data Before Moving It into the Cloud

Make sure you curate – prepare, transform, refine, etc. – your data first before sending it into your cloud data warehouse.  This allows you to reduce the dataset, eliminating unneeded or redundant columns, dirty unusable rows, or rows representing data out of bounds for the analysis.
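As a minimal sketch of this kind of refinement, the snippet below trims a raw extract before it is loaded. The column names, the date window, and the `refine` function are hypothetical, and a real pipeline would usually do this in a data preparation tool rather than hand-written code.

```python
from datetime import date

# Hypothetical choices: the columns the analysis needs and its date window.
NEEDED_COLUMNS = ("customer_id", "order_date", "amount")
WINDOW_START, WINDOW_END = date(2020, 1, 1), date(2020, 6, 30)


def refine(rows):
    """Keep only needed columns, drop unparseable rows, drop rows outside the window."""
    refined = []
    for row in rows:
        try:
            record = {
                "customer_id": row["customer_id"].strip(),
                "order_date": date.fromisoformat(row["order_date"]),
                "amount": float(row["amount"]),
            }
        except (KeyError, ValueError, TypeError, AttributeError):
            continue  # dirty, unusable row
        if not record["customer_id"]:
            continue  # missing key value
        if not (WINDOW_START <= record["order_date"] <= WINDOW_END):
            continue  # out of bounds for the analysis
        refined.append(record)
    return refined


raw_rows = [
    {"customer_id": "C1", "order_date": "2020-03-01", "amount": "19.99", "notes": "gift"},
    {"customer_id": "C2", "order_date": "not-a-date", "amount": "5.00", "notes": ""},
    {"customer_id": "C3", "order_date": "2019-12-31", "amount": "7.50", "notes": ""},
    {"customer_id": "C4", "order_date": "2020-06-30", "amount": "42", "notes": ""},
]

refined = refine(raw_rows)
print(f"{len(raw_rows)} raw rows -> {len(refined)} refined rows")
```

Only the rows for C1 and C4 survive, and the unused `notes` column never reaches the warehouse.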

Some CDWs tout the fact that you can send raw data and use the CDW to prepare it. However, these data preparation capabilities are in no way self-service: they require SQL coding skills, rely on hand-coded preparation jobs, and offer minimal transformation capabilities.

Curating your data with a self-service data preparation tool first will speed your analytics cycles and eliminate the need for costly custom-coded data pipelines. It will also eliminate the extra CDW compute and storage costs otherwise needed for data preparation and for inline transformations at query time.

De-normalize and Pre-aggregate

Some of the most expensive operations in a CDW from a compute cost standpoint are JOINs and aggregations. Complex queries with JOINs and aggregations can dramatically raise cloud analytics costs. Creating materialized views can reduce the run-time query costs, but it requires query compute time each time the view is materialized and adds to data storage.

Several customers have used techniques to flatten out or de-normalize the data, enrich it, and pre-aggregate it along known dimensions in the data preparation pipeline.  This will make the data more valuable to the downstream analysts and reduce the compute costs for the CDW.
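The idea can be illustrated with a small sketch that joins a fact table to a lookup table once, then rolls the result up along known dimensions before loading. The tables, the `denormalize` and `pre_aggregate` helpers, and the chosen dimensions are hypothetical examples, not a description of any particular product's pipeline.

```python
from collections import defaultdict

# Hypothetical normalized inputs: a customer lookup and an orders fact table.
customers = {
    "C1": {"region": "EMEA"},
    "C2": {"region": "AMER"},
}
orders = [
    {"customer_id": "C1", "month": "2020-03", "amount": 100.0},
    {"customer_id": "C1", "month": "2020-03", "amount": 50.0},
    {"customer_id": "C2", "month": "2020-03", "amount": 25.0},
    {"customer_id": "C2", "month": "2020-04", "amount": 75.0},
]


def denormalize(orders, customers):
    """Flatten: attach customer attributes to each order so no JOIN is needed later."""
    flat = []
    for order in orders:
        customer = customers.get(order["customer_id"])
        if customer is not None:
            flat.append({**order, **customer})
    return flat


def pre_aggregate(flat_rows, dimensions, measure):
    """Roll up the measure along the given dimensions, one output row per combination."""
    totals = defaultdict(lambda: {"row_count": 0, "total": 0.0})
    for row in flat_rows:
        key = tuple(row[d] for d in dimensions)
        totals[key]["row_count"] += 1
        totals[key]["total"] += row[measure]
    return [
        {**dict(zip(dimensions, key)), "row_count": v["row_count"], "total": v["total"]}
        for key, v in sorted(totals.items())
    ]


summary = pre_aggregate(denormalize(orders, customers), ("region", "month"), "amount")
for row in summary:
    print(row)
```

Queries against `summary` read three pre-computed rows instead of joining and aggregating four order rows each time; at warehouse scale the same trade applies, at the cost of fixing the dimensions in advance.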

Use Elastic Tools/Platforms

We encounter a greater variance of analytics workloads in the cloud, with an equal mix of ad-hoc analytics and operational BI and reporting. The exploratory nature of ad-hoc analytics often brings more complex queries that require more cloud resources, which means workloads will vary.

Solve this problem by using tools and platforms that are elastic and priced by the workload. When workloads are higher, the platform will scale out resources only to what is needed, and when they are lower, it will scale them back down. This optimizes compute costs, which increase only when you have query bursts.
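A back-of-the-envelope sketch shows why this matters for bursty workloads. It compares node-hours for a cluster sized for the peak against one that resizes every hour; the load profile, the per-node capacity, and the `nodes_needed` and `node_hours` helpers are hypothetical, and real platforms scale with their own granularity and pricing.

```python
import math


def nodes_needed(concurrent_queries, queries_per_node, min_nodes=1):
    """Smallest node count that can serve the given concurrency."""
    return max(min_nodes, math.ceil(concurrent_queries / queries_per_node))


def node_hours(hourly_load, queries_per_node, elastic):
    """Total node-hours: per-hour sizing if elastic, peak sizing for every hour if not."""
    per_hour = [nodes_needed(q, queries_per_node) for q in hourly_load]
    if elastic:
        return sum(per_hour)
    return max(per_hour) * len(hourly_load)


# Hypothetical eight-hour profile with one ad-hoc query burst in hour four.
hourly_load = [4, 4, 6, 40, 8, 4, 4, 4]
queries_per_node = 8

fixed = node_hours(hourly_load, queries_per_node, elastic=False)
scaled = node_hours(hourly_load, queries_per_node, elastic=True)

print(f"fixed (sized for peak): {fixed} node-hours")
print(f"elastic:                {scaled} node-hours")
```

With this made-up profile, the fixed cluster pays for peak capacity all day, while the elastic one pays for extra nodes only during the burst.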

Wrap Up

Keep your CFO from banging down your door and keep your budget in line by optimizing your cloud analytics using the techniques we’ve mentioned here:

  • Understanding what data is necessary
  • Optimizing data movement
  • Refining data before moving it into the cloud
  • De-normalizing, enriching, and pre-aggregating data in the pipelines
  • Using elastic tools and platforms

How do we help? Datameer X and Datameer Spotlight, together with a cloud data warehouse, are a strong combination for deploying and migrating analytics workloads to the cloud. Learn more from this earlier blog article to see our recipe for a successful migration to cloud analytics using Datameer X and Datameer Spotlight. Better yet, get started right away using our free trial.
