It’s a very common theme among enterprise I.T. customers:
— “We have too many tools already”
— “We have initiatives to reduce vendor footprint”
— “How do your tools compare to / contrast with / complement my existing tools?”
— “My staff is busy; we don’t have time to become experts in yet another tool”
Enterprise technology users have been purchasing and deploying support tools for decades. Yet, as Hadoop and big data grow in popularity, and Hadoop tools increasingly take center stage, these statements and questions come up more and more often.
First, it’s important to level-set, and establish the big data tools landscape. Most major corporations have a rich set of tools in their current enterprise data warehouse environment. The “big data” repositories in this environment are traditionally SQL-based data warehouse platforms such as Teradata, Netezza, Oracle, and IBM. This is where the terabyte-class, query-intensive data has been living for many years now.
There are three major technology stacks that support these large EDW platforms:
— ETL tools like Informatica and Ab Initio are used to extract the source data, clean it up, and structure it for landing into the data warehouse platforms
— Business intelligence tools like Tableau, QlikTech, Microstrategy, Cognos and Business Objects provide data visualization and reporting
— Analytics tools like SAS, SPSS, and homegrown or third-party “R” functions perform more intensive analytics, such as scoring and customer lifetime value. These tools also enable our customers’ data scientists to develop unique, high-value analytics models
These technologies have made up most enterprise data warehouse environments for some time. They are high-value assets to their corporate owners, and will likely continue in their existing roles. However, these technologies are all challenged by the growth of big data in recent years. Today’s big data landscape includes multiple disparate, unstructured data types, rapid rates of data change, and very large amounts of this data – what the industry refers to as variety, velocity, and volume. The challenges include:
— Much of the new, interesting data – Web logs, machine data, free-form text in social media and emails, etc. – is only semi-structured, or unstructured. Data warehouses, and the ETL and BI tools that support them, were all designed on the assumption of a pre-defined, SQL-based schema. The new big data is ill-suited to landing in a pre-defined database schema without a lot of work
— The new data can arrive in very large volumes, often into the hundreds of terabytes or petabytes. These volumes can rack up very large costs against the traditional data warehouse technologies
— Applying complex analytics against these very large datasets can strain even the most powerful SQL-based data warehouse systems
Hadoop evolved in response to these challenges. It is an open-source software framework consisting of two major components – the HDFS file system and the MapReduce programming model. The core concepts are simple and compelling:
— HDFS can run on a large cluster of relatively inexpensive compute nodes, so compute power and storage capacity scale out simply by adding nodes.
— HDFS is a file system, not a database. It can ingest files of any format. Data sizes are limited only by cluster size.
— The MapReduce programming model provides a programmatic way of processing and generating very large data sets in a parallel, distributed fashion across the HDFS cluster.
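The model behind that last point can be sketched in a few lines of plain Python, with no Hadoop cluster required. This is an illustrative simulation only: the mapper, reducer, and `run_job` driver below are invented for this example, and the in-memory sort stands in for the shuffle/sort phase that Hadoop performs across the cluster between the map and reduce stages.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: sum all the counts observed for one word.
    yield (word, sum(counts))

def run_job(lines):
    # Simulate Hadoop's shuffle/sort: gather every pair from every mapper
    # and sort by key, so each reducer sees one key's values together.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    results = {}
    for word, group in groupby(pairs, key=itemgetter(0)):
        for key, total in reducer(word, (count for _, count in group)):
            results[key] = total
    return results

print(run_job(["big data big cluster", "data lake"]))
# → {'big': 2, 'cluster': 1, 'data': 2, 'lake': 1}
```

On a real cluster, the mapper and reducer run as independent tasks on many nodes at once; because neither function shares state beyond its key group, Hadoop can parallelize them freely, which is the source of the scalability described above.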
Built atop this HDFS / MapReduce core is a wide variety of components and languages that make up the Hadoop ecosystem. These components enable data ingestion, RDBMS connectivity, job scheduling, and many other functions. Hadoop distribution vendors, and the Apache community, continue to make advances to Hadoop.
OK, so Hadoop addresses the problems of volume, variety, and velocity. The challenge then becomes one of Hadoop complexity. The MapReduce model, and the large family of Hadoop components – Sqoop, Flume, Hive, Pig, etc. – are themselves fairly complex. Manual programming of this barnyard of tools can create very high labor costs for both development and support.
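To give a feel for that complexity: even a simple relational join – one line of SQL in the EDW world – becomes a hand-built "reduce-side join" when expressed in raw MapReduce terms. The sketch below simulates the pattern in plain Python; the record layouts and function names are invented for illustration, and the in-memory sort again stands in for Hadoop's cluster-wide shuffle.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical inputs: customer records and order records keyed by customer id.
customers = [(1, "acme"), (2, "globex")]
orders = [(1, 250.0), (1, 75.0), (2, 30.0)]

def map_customers(rec):
    cid, name = rec
    yield (cid, ("C", name))      # tag each record with its source table

def map_orders(rec):
    cid, amount = rec
    yield (cid, ("O", amount))

def reduce_join(cid, tagged_values):
    # Separate the customer record from its orders, then pair them up –
    # the bookkeeping SQL's JOIN keyword performs implicitly.
    vals = list(tagged_values)
    names = [v for tag, v in vals if tag == "C"]
    amounts = [v for tag, v in vals if tag == "O"]
    for name in names:
        for amount in amounts:
            yield (cid, name, amount)

# Simulated shuffle: merge both map outputs and group them by key.
pairs = sorted(
    [kv for rec in customers for kv in map_customers(rec)]
    + [kv for rec in orders for kv in map_orders(rec)],
    key=itemgetter(0),
)
joined = [row
          for cid, grp in groupby(pairs, key=itemgetter(0))
          for row in reduce_join(cid, (v for _, v in grp))]
print(joined)
# → [(1, 'acme', 250.0), (1, 'acme', 75.0), (2, 'globex', 30.0)]
```

Tools like Hive and Pig exist precisely to generate this boilerplate from higher-level queries, but each adds its own language and operational surface to learn – which is the labor-cost problem described above.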
Datameer was developed to address these issues. Datameer’s founders are Hadoop experts, having built many large, complex Hadoop systems for major corporate customers. The Datameer product was developed to simplify the big data analytics environment into a single application, on top of the powerful Hadoop platform. Datameer was also developed to run natively on all distributions of Hadoop and leverage the scale and compute power of the Hadoop cluster. It is the ONLY end-to-end big data analytics application for Hadoop, designed to make big data simple for everyone. We deliberately designed our product as a single integrated package, as easy to use as a spreadsheet.
Datameer combines self-service data integration, analytics and visualization functionality in a way that provides the fastest time to insights. In the traditional SQL-based EDW world, these tasks would be handled by the multiple, disparate technology stacks named above.
So, there are five key takeaways to know about Datameer in the big data environment:
1. Datameer was purpose-built for Hadoop, from the ground up. Datameer responds to users’ clicks and drags, and generates MapReduce code
2. Datameer runs natively on the Hadoop cluster and leverages all of the power of the cluster
3. Datameer is as simple to use as a spreadsheet
4. Datameer continues to support the evolution of Hadoop, integrating new capabilities like YARN and Tez.
5. By putting Datameer in charge of your Hadoop “Data Lake,” you can continue to leverage your existing SQL-based EDW infrastructure in some new and exciting ways