The Simplest Road to a Modern Data Stack with Snowflake
- John Morrell
- June 14, 2021
The first building block of a cloud data stack is Snowflake. Your analytics engine and/or cloud data warehouse is always the core component around which your data stack revolves.
The shift to cloud analytics and cloud data warehouses was supposed to simplify and modernize the data stack for analytics. Yet, many cloud journeys have done quite the opposite – the data stack has gotten more complex and expensive. In the end, this drives up data engineering costs.
When building a data stack for Snowflake, customers have a myriad of options. Many new tools in new categories can fulfill a specialized role in your data stack. But this can also create a disintegrated data stack that increases the overall complexity of creating, operationalizing, and managing your data pipelines.
Beyond your Snowflake analytics engine at the end of your data stack, there are myriad specialty tools that can be inserted into your data stack, including:
- Data ingest and loading
- Data transformation
- Data orchestration
- Data catalogs and metadata management
- Data security and governance
- Data observability
The plethora of choices for each of these tools can be overwhelming, as the data engineering ecosystem diagram (courtesy of lakeFS) below suggests. Too many tools often create a disintegrated architecture, increasing the overall complexity of creating, operationalizing, and managing your data pipelines.
In some cases, these specialty tools may work with one or two tools in an adjacent category but do not work effectively with the rest of the ecosystem. With Snowflake’s increasing presence and growth in the market, most products will work well with it. But beyond Snowflake, piecing together the rest of your data stack requires at least five additional products.
Many of the companies on the chart above are small startups that have very innovative ideas for their category, but their tools often lack maturity and true enterprise capabilities. There are also open-source options that require some level of customization and/or coding to build into your data stack. In both cases, this makes building an “enterprise” data stack an even more daunting task.
What Overall Options Do I Have?
How to build your analytics stack for Snowflake comes down to a classic decision we’ve seen repeated in many categories over the past 20 years – build versus buy. In this case, the choice is between building and integrating a combination of four or five products into your data stack, or buying one platform that encompasses most, if not all, of the capabilities you need and potentially adding another product for certain specialized needs.
The tradeoffs in this decision are:
- Time – the amount of time it takes to deploy the overall data stack architecture and potentially added time around the building and deployment of data pipelines
- Skills – whether you will need to find talent with specialized skills, which can be very expensive and hard to find
- Cost – the labor cost to build, maintain, and manage the data stack plus software licensing costs
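The cost tradeoff above can be written down as simple arithmetic. The following is a minimal sketch, assuming first-year cost is build labor plus licensing; the `StackOption` class and all of its field names are hypothetical and exist only for illustration, and no real pricing is implied.

```python
from dataclasses import dataclass


@dataclass
class StackOption:
    """Hypothetical description of one way to assemble a data stack."""
    name: str
    build_months: float               # time to deploy the stack
    engineers: int                    # people needed during the build
    monthly_cost_per_engineer: float  # fully loaded labor cost
    annual_license_cost: float        # summed across all licensed products

    def labor_cost(self) -> float:
        # Labor to build the stack: months x people x monthly rate.
        return self.build_months * self.engineers * self.monthly_cost_per_engineer

    def first_year_cost(self) -> float:
        # Cost as defined in the list above: labor plus software licensing.
        return self.labor_cost() + self.annual_license_cost
```

To compare options, create one `StackOption` per candidate architecture using your own estimates and compare the `first_year_cost()` results. Ongoing maintenance labor is left out of this sketch and would need its own term.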
Building a many-product data stack for Snowflake can become a complex endeavor. While simply being in the cloud does offer some advantages, building a data stack will require:
- A substantial, many-month engineering project
- Expensive talent with skills in these specialized tools
- The added cost of programmers to build and maintain the data stack
- Licensing multiple software products
The main advantage of this approach is flexibility – the potential ability to plug different products into the data stack, which provides a level of future-proofing. In addition, if you use open source tools, there could be more widespread availability of skills in these tools.
The simplest initial multi-product data stack for Snowflake contains:
- A data ingest or loading tool such as Fivetran or Stitch Data,
- A data transformation tool such as dbt, and
- A data orchestration tool such as Airflow.
This simple multi-product architecture may suffice for basic cloud analytics, such as loading data from SaaS applications (CRM, marketing automation, finance, etc.) and analyzing it. This is a common use case, especially for small- to mid-sized companies. But it still requires heavy data engineering effort. Operationalizing and managing data pipelines across multiple products is a complex ordeal that requires custom coding and/or scripting, adding time to the data pipeline lifecycle.
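To give a feel for the custom glue involved, here is a minimal sketch of the kind of script that sequences steps across separate products. It uses only the Python standard library; the step functions are hypothetical stand-ins and do not call Fivetran, Stitch Data, dbt, or Airflow.

```python
from graphlib import TopologicalSorter


# Hypothetical stand-ins for calls into three separate products.
def ingest(log):
    log.append("ingest: load SaaS data into the warehouse")


def transform(log):
    log.append("transform: build analytics models")


def analyze(log):
    log.append("analyze: refresh reporting queries")


# Each step maps to the set of steps that must finish before it runs.
DEPENDENCIES = {
    "ingest": set(),
    "transform": {"ingest"},
    "analyze": {"transform"},
}
STEPS = {"ingest": ingest, "transform": transform, "analyze": analyze}


def run_pipeline():
    """Run every step in dependency order and return the run log."""
    log = []
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        STEPS[name](log)
    return log


if __name__ == "__main__":
    for line in run_pipeline():
        print(line)
```

Even this toy version has no retries, scheduling, alerting, or credential handling. Each of those would be more code to write and maintain when the steps live in different products.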
At a minimum, what this simple architecture lacks is:
- Enterprise capabilities, including security, governance, and the ability to integrate with secure, on-premises data sources, and
- DataOps to operationalize and automate the data pipelines at scale.
Enterprise-class organizations, which have greater scale and stronger security, governance, and regulatory control needs, either cannot use such an architecture or must spend great effort and cost integrating additional products and/or adding custom extensions. And as mid-size organizations grow, they will need to continually throw more resources at their data stack to integrate additional governance capabilities, and they will see their data stack become even more complex.
These pieced-together data stacks are a major reason why many companies see a continual rise in their data engineering costs. As demand increases, organizations are faced with two options – throw more resources at the problem, or make the architecture more complex. With either option, data engineering costs will grow rapidly. In addition, the time to complete new data pipelines and the project wait queue will remain long.
A second, more viable option, especially for enterprises or rapidly growing mid-size companies trying to stay ahead of the curve, is to license a single ETL++ platform that contains most, if not all, of the capabilities you need from your data stack. This approach will require little, if any, integration effort and cost and can speed data engineering efforts and reduce costs.
An integrated data pipeline platform and toolset should include the following capabilities:
- Easy data ingest and loading
- Graphical, code-free data transformation
- End-to-end data orchestration
- Discovery and metadata management
- Enterprise data security and governance
- Automated DataOps, including data observability
The right platform can also offer integrations with other point products, such as enterprise data catalog and governance tools, to give you the flexibility of integrating an overall enterprise architecture and/or adding new capabilities over time. Your platform should also offer multiple data integration models/architectures – ETL and ELT – so you can choose the best option.
The buy option facilitates a robust data stack from the outset, and:
- Eliminates upfront time and effort to integrate multiple components in your data stack,
- Facilitates faster data pipeline definition and deployment,
- Increases data engineering productivity and lowers costs,
- Provides enterprise security and governance to reduce data privacy and regulatory compliance risks,
- Supports scalable DataOps to ensure a reliable and complete flow of data, and
- Offers lower overall licensing costs (versus licensing multiple products).
Datameer Spectrum as a Buy Option
Datameer Spectrum is a fully featured ETL++ data integration platform with a broad range of capabilities for extracting, exploring, integrating, preparing, delivering, and governing data for scalable, secure data pipelines. Spectrum supports analyst and data scientist self-service data preparation and data engineering use cases, enabling a single hub for all data preparation across an enterprise. Data pipelines can span various approaches and needs, including ETL, ELT, data preparation, and data science.
Spectrum’s no-code data orchestration and transformation make it easy for analysts and data scientists, and even non-programmers, to create data integration pipelines of any level of sophistication. The large array of over 300 off-the-shelf functions enables you to transform, cleanse, shape, organize, and enrich data in any way imaginable, and 200+ connectors let you work with any data source you may have. Once integration dataflows are ready, Spectrum’s enterprise-grade operationalization, security, and governance features enable reliable, automated, and secure data pipelines to ensure a consistent data flow.
Datameer Spectrum provides you with a robust, all-in-one “buy option” for Snowflake. Spectrum gives you the ability to have a one-product data pipeline platform that feeds your Snowflake cloud data warehouse and analytics without all the integration time, headaches, and costs.
Spectrum contains all six of the integrated data stack capabilities mentioned above – easy data ingest and loading, code-free data transformation, end-to-end data orchestration, data discovery and metadata management, robust data security and governance, and automated DataOps. It provides all the data stack capabilities an enterprise or growing mid-size organization will need when deploying scalable cloud analytics on Snowflake.
Although building a data stack of multiple point products may be appealing from a flexibility standpoint, the time, effort, and cost of building, maintaining, and operating such a data stack can quickly escalate. For many organizations, a far better option is to buy – acquire a single robust data pipeline platform that meets all of today’s needs while also providing the capabilities you will require in the future.