ETL and Data Pipelines on AWS

Cloud ETL Challenges

Deploying or migrating analytics in the cloud has many challenges. Technical issues include feature differences between on-premises and cloud data platforms, security, and governance. Non-technical issues include the learning curve for new cloud platform skills and effectively managing and controlling costs.

The majority of cloud analytics projects to date involve data already in the cloud from SaaS applications (Salesforce, Marketo, etc.), cloud services (Google Analytics and AdWords, other marketing services, etc.), or raw data already landing in a cloud data lake. The risk of moving on-premises data into the cloud has forced enterprises to limit cloud analytics projects, especially in regulated industries where data privacy is critical.

Cloud-only ETL tools recognized this data source pattern, along with the shift to a new processing model, ELT (in which data is loaded first and transformed inside the warehouse), and focused exclusively on streamlining simple data integration in cloud data warehouses.

As organizations look to deploy or migrate enterprise-grade analytics workloads in the cloud, several additional challenges come into play:

Securely working with on-premises data sources and data that needs to be integrated into the cloud
Handling the scale and performance needs of integrating larger on-premises sources into the cloud
Implementing more sophisticated transformations to deal with the greater complexity and diversity of data
Effectively governing the greater volume and diversity of datasets used for different purposes in cloud analytics
Supporting a greater variety of use cases requiring multiple, different forms of data transformation
Effectively managing the cost of both cloud infrastructure consumption and personnel

ETL Options on AWS

For ETL and data integration on AWS, there are four primary options:

Subscribe to EMR/Spark or Databricks and hand-code your ETL on top of Spark
Subscribe to and use AWS’s data integration tool, AWS Glue
License a cloud-only data integration platform such as Fivetran or Matillion
Use a modern third-party data integration platform that is native on AWS, such as Datameer

GigaOm recently released a report, Analytics in the Cloud: Minimize Pain & Maximize Success, exploring many of the challenges and solutions encountered in a cloud data journey. In this report, GigaOm discussed many of the shortcomings of the ETL tools offered by the cloud vendors and the reasons why a third-party product makes more sense.

While hand-coding data pipelines may seem like the simplest way to start, this approach does not scale, and it takes a great deal of time to code and deploy each data pipeline. Many companies using Spark and Databricks for their data pipelines encounter data engineering costs that are spiraling out of control.

While they improve ease of use and connectivity to cloud-born data, cloud-only data integration tools have very limited data transformation and platform capabilities. To learn more about cloud-only data integration platforms, please read our write-ups on Fivetran and Matillion.

AWS Glue

AWS Glue is the ETL tool offered by Amazon Web Services. Glue is a serverless platform and toolset that can extract data from various sources, transform it in different ways (enrich, cleanse, combine, and normalize), and load and organize data in destination databases, data warehouses, and data lakes.
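The extract, transform, and load stages described above can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not Glue's API: the sample data, the customer lookup, and the function names (extract, transform, load) are all hypothetical, and an in-memory list stands in for the destination.

```python
import csv
import io

# Hypothetical raw input: messy whitespace, a missing amount, mixed-case codes.
RAW_ORDERS = """order_id,customer_id,amount,country
1,c1, 19.99 ,us
2,c2,,DE
3,c1,5.00,US
"""

# Hypothetical reference data used to enrich each order.
CUSTOMERS = {"c1": "Acme Corp", "c2": "Globex"}


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows, customers):
    """Transform: cleanse, combine/enrich, and normalize each row."""
    out = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            # Cleanse: skip rows that are missing a required value.
            continue
        out.append({
            "order_id": int(row["order_id"]),
            # Combine/enrich: look up the customer name from reference data.
            "customer": customers.get(row["customer_id"], "unknown"),
            "amount": float(amount),
            # Normalize: consistent upper-case country codes.
            "country": row["country"].strip().upper(),
        })
    return out


def load(rows, destination):
    """Load: append transformed rows to the destination; return the count."""
    destination.extend(rows)
    return len(rows)


warehouse = []  # stand-in for a database, data warehouse, or data lake
loaded = load(transform(extract(RAW_ORDERS), CUSTOMERS), warehouse)
```

In a real pipeline, each stage would read from and write to actual systems, but the separation of concerns stays the same.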

Glue allows ETL developers to define data pipelines via a visual interface or coding. Glue also contains a catalog of data flows and resulting datasets. Glue Studio enables administrators to run and monitor ETL data flows.

Glue Studio is a more traditional ETL tool with a visual job editor and data flow style user interface. While Glue Studio provides a high-level graphical way to define a flow, it has an extremely limited set of transformations. Anything beyond simple transformations such as filters, joins, and mappings requires users to write code or SQL. Glue Studio also has a minimal set of connectors, working only with data sources and destinations running on AWS.

Glue DataBrew is a separate but related product offered by AWS for data preparation. With the DataBrew interface, you can interactively examine, profile, clean, and transform raw data. DataBrew makes up for Glue's transformation limitations with a larger library of transformations. As with Glue, the connectors on DataBrew are limited but do expand beyond AWS sources to traditional databases such as Oracle or MySQL running on AWS.

It is important to recognize that Glue and Glue DataBrew are entirely separate products. Glue is for ETL data pipelines, while DataBrew is for data preparation. The only way to combine the two is for Glue to perform extract and load (perhaps into Redshift), then have separate DataBrew preparation jobs transform the data inside Redshift.
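The split described above, where one step lands raw data and a separate step transforms it inside the warehouse, is the ELT pattern. A minimal sketch follows, assuming an in-memory SQLite database as a stand-in for Redshift. The table names and sample rows are hypothetical, and none of this uses Glue or DataBrew APIs.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse (an assumption for illustration).
conn = sqlite3.connect(":memory:")

# Step 1, extract and load: land the raw rows untouched in a staging table.
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, " 19.99 ", "us"), (2, "", "DE"), (3, "5.00", "US")],
)

# Step 2, transform: a separate job that runs as SQL inside the warehouse.
conn.execute(
    """
    CREATE TABLE clean_orders AS
    SELECT order_id,
           CAST(TRIM(amount) AS REAL) AS amount,
           UPPER(TRIM(country)) AS country
    FROM raw_orders
    WHERE TRIM(amount) <> ''
    """
)

# Downstream consumers query only the cleaned table.
totals = conn.execute(
    "SELECT country, ROUND(SUM(amount), 2) FROM clean_orders "
    "GROUP BY country ORDER BY country"
).fetchall()
```

Because the two steps are defined and executed separately, the load logic and the transformation logic must be kept in sync by hand, which is the coordination cost the text describes.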

AWS Glue and Glue DataBrew have several limitations:

A tiny set of data connectors focused on AWS-owned sources, databases running on AWS, and files from S3 buckets
The inability to securely connect to and work with on-premises data sources
Disjointed data integration jobs when more sophisticated transformations are required, with logic and job execution split between two tools
The potential for inconsistent security policies and security vulnerabilities between Glue and Glue DataBrew
Very limited data governance features, mostly around security (encryption) and cataloging (via Glue Catalog)

As we compare our list of data challenges for enterprise-grade analytics workloads in the cloud with AWS Glue’s limitations, it is clear Glue will not meet the enterprise-level requirements for ETL and data integration in the cloud on AWS.

Also, from our examination of cloud-only data integration solutions such as Fivetran and Matillion, these products do not meet many of our enterprise-level requirements either.

Datameer

Datameer is a fully-featured ETL++ data integration platform with a broad range of capabilities for extracting, exploring, integrating, preparing, delivering, and governing data for scalable, secure data pipelines. Datameer supports analyst and data scientist self-service data preparation and data engineering use cases, enabling a single hub for all data preparation across an enterprise. Data pipelines can span across various approaches and needs, including ETL, ELT, data preparation, and data science.

Its point-and-click simplicity makes it easy for analysts, data scientists, and even non-programmers to create data integration pipelines of any level of sophistication, allowing you to make your data analytics-ready 10 to 20 times faster at a fraction of the cost.

Once integration dataflows are ready, Datameer’s enterprise-grade operationalization, security, and governance features enable reliable, automated, and secure data pipelines to ensure a consistent data flow. Datameer has extensive features to support your hybrid-cloud data landscape. It is cloud-native on all three major cloud platforms (AWS, Azure, GCP) and carries the elasticity and cost economics you would expect in the cloud. Datameer can bring together any data sources you have regardless of type, format, and location (cloud or on-premises).

Why Datameer on AWS?

Code-free, Cloud Simplicity

An entirely graphical user experience combines wizard-led data extraction, an award-winning spreadsheet-like UI, and an extensive array of drag-and-drop functions for fast, easy, code-free creation of data pipelines.

Powerful ETL++

Over 300 point-and-click functions allow you to transform your data in more ways than you can imagine – cleanse, blend, shape, enrich, organize, and group – to tame even the most complex data and make it useful and complete.

Scalable & Automated

Advanced elastic compute frameworks optimize resource allocation, leverage parallel processing, and scale to your data processing needs. Built-in automation and job execution tools fulfill all your DataOps needs.

Secure and Governed

The most in-depth set of governance and security features meets the demands of highly regulated industries and ensures data privacy and compliance. It includes authentication, authorization, LDAP/Active Directory and SAML support, obfuscation and encryption, end-to-end data lineage, complete audit trails, and more to meet enterprise needs.

Connect to Anything

A broad suite of over 200 data connectors with wizard-driven connection simplicity makes it easy to extract data from any source you have. A large number of destinations, including data warehouses, data marts, analytical databases, and BI tools, simplifies delivery.

Cloud-native on AWS

Datameer was built for AWS, with deep integration that includes native elastic Spark compute clusters, S3 for storage, native high-speed connectors to AWS data sources and targets, IAM security integration, encryption integration with KMS, and integration with Glue catalog.

Datameer: Answering the Challenges

Works with On-premises Sources

Datameer provides high-performance connectors to all your on-premises data sources (databases, data lakes, data warehouses, apps) and uses secure protocols and end-to-end encryption to ensure data security.

Cloud-Scale and Performance

Datameer combines high-performance connectors, an elastic Spark-based compute engine, and an advanced Smart Execution™ optimizer to ensure your data pipeline operations get the scale and performance necessary.

Sophisticated Transformations

Datameer’s extensive array of over 300 easy-to-apply functions allows users to create data pipelines of any level of sophistication in minutes. Do anything to your data – cleanse, blend, transform, shape, enrich, organize, and group – to make it analytics-ready.

Effective Governance

Datameer contains an extensive array of security and governance features, trusted by some of the largest enterprises in the world, to enable effective data governance over many datasets and ensure data privacy.

Multi-Use Case

Datameer’s large array of functions and data source connectors allows you to apply it to many different problems – operational BI, ad-hoc analytics, data science, regulatory reporting, and more – giving you a single hub to manage and deploy all your data pipelines.

Manage Costs

Unlike cloud-only data integration tools that only support ELT integration, Datameer supports multiple data pipeline models, eliminating extra cloud data warehouse compute and storage costs and keeping your cloud bills under control.

Conclusion

Creating new or migrating existing analytics workloads on AWS has numerous challenges, with delivering, managing, and governing the underlying data a critical one. This makes choosing your ETL and data pipeline platform a very important decision.

For delivering data for analytics on AWS, you have multiple options and tools to choose from, including hand-coding, cloud-only ELT tools, AWS’s own tool, Glue, and modern hybrid-cloud platforms.

Datameer is a leader among the modern hybrid-cloud ETL and data pipeline platforms. Datameer’s modern, code-free tools and advanced platform capabilities blend the best of traditional data integration platforms (transformations, scale, security, governance) with cloud simplicity and ease of use. It answers all the critical challenges for cloud ETL and data integration.

See for yourself by scheduling a personalized demo of Datameer.
