Cloud Platforms for Analytics: The House Brand Ain’t Always Enough

Andrew Brust
May 3, 2018

In this post, we’ll explore the data processing and management components available to you on the Amazon Web Services (AWS) platform. We’ll discuss what’s possible with each of them and review the options for tying them together. Then we’ll contrast this with using an end-to-end platform that sits atop these components and leverages the best of them, integrating them seamlessly in the background.

A Bit of Background

An important thing to keep in mind is that the public cloud gives you building blocks – Amazon calls them “primitives” – that deliver the functionality and the customizability necessary to build very sophisticated solutions. But what the cloud doesn’t give you is the fit and finish, the white glove treatment, or the turnkey experience.

The cloud provides a suite of products that customers can integrate, but it doesn’t provide a ready-to-run solution. Each of these products is often best of breed in its own niche, designed and optimized for a specific purpose.

In doing this, the cloud providers create a challenge for data teams – not an unsolvable one, but a challenge nonetheless. How you address this challenge can make the difference between project success and failure. We will now acquaint you with the problem and its underpinnings and set you on the right path for a sensible solution that mitigates risk and frustration and prevents failure.

Stars in Alignment, for the Cloud

For context, it’s important to understand the current cloud adoption imperative, as it brings into focus the motivation for building successful cloud analytics solutions.

Many industry trends have combined to create the market demand we see today for public cloud solutions. Among them:

Paying for what you need: the cloud works on a combination of elastic resource deployment and utility-based pricing. Rather than having to lay out significant capital funds to acquire technology infrastructure sized for your heaviest intermittent workloads, the cloud lets you use operating expense funds to pay for just the resources you need. This applies to both computing and storage resources.
Externally-borne data: an increasing amount of enterprise customers’ data originates off-premises. This data needs to be collected and consolidated into a single location, and that location needn’t necessarily be on-premises. By extension, the analytics infrastructure and software that will process this data needn’t necessarily run on-premises, either. Cloud storage can be an ideal place to land the data, and cloud platforms may be the best place to run the processing and analytics on that data.
Rapid obsolescence cycles: innovation in hardware infrastructure, whether around storage, memory, or processing (CPU or GPU), proceeds rapidly. Hardware purchased now will obsolesce in less than a year. This makes ownership and physical installation of such infrastructure by the customer unattractive. As rapid upgrades are desirable – and sometimes necessary for competitive reasons – renting (in the cloud) is better than owning (on-premises).

Taken together, these factors seem like a perfect storm – in a good way. The cloud has never been readier to run analytics workloads, and customers have never been readier to run analytics workloads in the cloud.

But as ideal as this may seem, it raises pressure and expectations around implementing complex technology in cloud environments that are still relatively immature. That’s a perfect storm of its own – in a bad way. The risk of project failure is acute.

Analytics in the cloud isn’t a flawed strategy, but relying exclusively on first-party analytics components is often a recipe for disappointment, if not disaster. In the next section, we’ll explore what those components are and what integrations AWS provides between them. In subsequent sections, we’ll point out why they’re usually not sufficient on their own.

What’s in the Cloud Analytics Stack?

Let’s now level-set and define the major components of a public cloud analytics stack. Since all major public cloud providers have a dizzying array of products, we will point only to the most salient ones on offer, in the spirit of not overwhelming you. The list below shows these major components, described in general terms, with the relevant AWS product in parentheses.

Object storage (Simple Storage Service – S3)
SQL over object storage (Athena)
NoSQL database management (DynamoDB)
Relational database management (Relational Database Service – RDS)
Data warehouse (Redshift)
Hadoop and Spark cluster services (Elastic MapReduce – EMR)
Data transformation/ETL (Data Pipeline, Glue)
Streaming data processing (Kinesis)
Business Intelligence and Data Visualization (QuickSight)

The above list has a total of nine components – and a total of 10 AWS products. That may seem like a lot – but this is just a minimalist list. For example, products like Amazon Elasticsearch Service (search-based analytics), Amazon Neptune (NoSQL graph database), and Amazon SageMaker (machine learning) have been omitted from the list.

Integrating These on Amazon

Amazon services are integrated through a collection of what might be called “bilateral interfaces.” In other words, rather than each service integrating with every other one, Amazon has implemented specific integration pairs, with some services being more commonly integrated than others.

Most services, for instance, can work natively with Amazon S3. This is central to Amazon’s strategy of encouraging customers to use S3 as their “data lake.” Among the S3 integrations:

Elastic MapReduce can reference s3://-based URLs in almost any context where it would do so with hdfs://-based URLs to resources in the Hadoop distributed file system
Elastic MapReduce components Hive, Spark SQL, and Impala can each create external tables from files stored in S3
Other systems, like DynamoDB, Redshift, and Aurora, have built-in data import/load facilities, which can load data directly from files in S3 buckets right into their respective stores
Amazon Kinesis Firehose can load streaming data directly into S3

Other integration pairs exist as well. For example:

Amazon Kinesis Firehose can load streaming data directly into Redshift
Redshift’s COPY command can load data directly from DynamoDB
QuickSight can ingest data from S3, Redshift, RDS, Aurora, Athena, and EMR
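
To make two of these pairs concrete, here is a minimal Python sketch that only builds the text of Redshift COPY statements for an S3 source and a DynamoDB source. The helper names, table names, bucket path, and IAM role ARN are hypothetical placeholders, and the statements are not executed here; running them against a cluster would need a SQL client, which is outside the scope of this sketch.

```python
def copy_from_s3(table, s3_uri, iam_role_arn):
    """Build a COPY statement that loads CSV files found under an S3 prefix."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI")
    # Values are interpolated without escaping: illustration only,
    # not safe for untrusted input.
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' FORMAT AS CSV;"
    )


def copy_from_dynamodb(table, dynamodb_table, iam_role_arn, read_ratio=50):
    """Build a COPY statement that loads rows from a DynamoDB table.

    READRATIO is the percentage of the DynamoDB table's provisioned
    read throughput that the load is allowed to consume.
    """
    if not 1 <= read_ratio <= 200:
        raise ValueError("read_ratio must be between 1 and 200")
    return (
        f"COPY {table} FROM 'dynamodb://{dynamodb_table}' "
        f"IAM_ROLE '{iam_role_arn}' READRATIO {read_ratio};"
    )


# Hypothetical names, used only to show the shape of the statements.
ROLE = "arn:aws:iam::123456789012:role/LoadRole"
print(copy_from_s3("sales", "s3://my-bucket/sales/", ROLE))
print(copy_from_dynamodb("sales", "SalesEvents", ROLE))
```

The point of the sketch is that the same command covers both sources, with only the source URI scheme and a few options changing. That is what a “blessed” integration pair looks like in practice.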

The above lists are not comprehensive, but they illustrate a pattern: most AWS analytics services integrate with S3, and many others integrate with Redshift and DynamoDB. These are the three most “blessed” services in the analytics stack and the three most integrated with. But what about other permutations?
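
One way to reason about these pairs is as a small directed graph of source and target services. The sketch below encodes only the pairs mentioned in this post (it is not a complete map of AWS integrations), and the `is_native` and `missing_pairs` helpers are hypothetical names chosen for illustration.

```python
# Directed (source, target) pairs taken from the examples in this post only.
NATIVE_PAIRS = {
    ("S3", "EMR"),
    ("S3", "DynamoDB"),
    ("S3", "Redshift"),
    ("S3", "Aurora"),
    ("Kinesis Firehose", "S3"),
    ("Kinesis Firehose", "Redshift"),
    ("DynamoDB", "Redshift"),
    ("S3", "QuickSight"),
    ("Redshift", "QuickSight"),
    ("RDS", "QuickSight"),
    ("Aurora", "QuickSight"),
    ("Athena", "QuickSight"),
    ("EMR", "QuickSight"),
}


def is_native(source, target):
    """Return True if the post lists a direct source-to-target integration."""
    return (source, target) in NATIVE_PAIRS


def missing_pairs(services):
    """List every ordered pair among `services` with no listed integration."""
    return sorted(
        (s, t)
        for s in services
        for t in services
        if s != t and not is_native(s, t)
    )


print(is_native("DynamoDB", "Redshift"))  # a listed pair
print(missing_pairs(["DynamoDB", "Aurora", "Redshift"]))
```

Even with three services, most ordered pairs come back as missing. The number of possible pairs grows roughly with the square of the number of services, while the number of native integrations does not.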

For example, what if you wanted to load data from Kinesis Streams into a table in DynamoDB or Aurora? Or, more simply, what if you wanted to replicate data from DynamoDB into Aurora in real time? Such pairings are possible, but they require a lot of work. In the latter case, customers must compose a connection themselves using Kinesis Firehose and AWS’s serverless compute service, Lambda. This is not for the faint of heart.
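
To give a sense of the glue code involved, here is a minimal Python sketch of a Lambda-style handler for the first case, Kinesis Streams into DynamoDB. It assumes each stream record carries a JSON document, which is an assumption about the data rather than anything Kinesis requires. The `make_handler` factory and the injected `put_item` callable are hypothetical; in a real deployment the callable might wrap a DynamoDB write such as boto3’s `Table.put_item`, but that wiring, along with retries, batching, and error handling, is deliberately left out.

```python
import base64
import json
from decimal import Decimal


def decode_kinesis_records(event):
    """Yield the decoded JSON payload of each record in a Kinesis-triggered event.

    Kinesis delivers each record's data base64-encoded under
    event["Records"][i]["kinesis"]["data"].
    """
    for record in event.get("Records", []):
        raw = base64.b64decode(record["kinesis"]["data"])
        # Parse floats as Decimal, since DynamoDB clients commonly
        # reject Python floats for numeric attributes.
        yield json.loads(raw, parse_float=Decimal)


def make_handler(put_item):
    """Return a handler that writes every decoded record via `put_item`.

    `put_item` is a hypothetical injected callable that takes one dict.
    """
    def handler(event, context):
        written = 0
        for item in decode_kinesis_records(event):
            put_item(item)
            written += 1
        return {"written": written}

    return handler


# Local usage with an in-memory list standing in for the DynamoDB table.
sink = []
handler = make_handler(sink.append)
payload = base64.b64encode(json.dumps({"id": "a1", "value": 3}).encode()).decode()
print(handler({"Records": [{"kinesis": {"data": payload}}]}, None))
```

Even this stripped-down version leaves the hard parts (ordering, duplicate delivery, schema mapping, failure recovery) to the customer, which is why such pairings demand real engineering effort.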

Next week, we’ll look at options and approaches for leveraging these components with third-party solutions for an agile, end-to-end experience.
