SQL interfaces for Hadoop have gotten a lot of buzz over the past year because companies are anxious to utilize current skill sets to leverage Hadoop rather than hiring new talent or re-training. However, there are multiple considerations a decision maker must keep in mind when evaluating an interface to Hadoop. In their recent report “Choosing Your SQL Access Strategy for Hadoop”, Gartner analysts Nick Heudecker and Merv Adrian do a good job outlining the differences among the SQL offerings, Hive-based SQL, and direct SQL execution, and offers guidance for those evaluating the space. Adding to the points made in the report, at Datameer, we have seen customers also need to consider the following issues when choosing an interface for Hadoop: (1) definition of schema (2) memory and resource management (3) types of users.
Definition of Schema SQL interfaces require a schema-on-write approach, which comes with a few drawbacks. First, the organizations need to create a data model that IT designs today based on the knowledge of yesterday. These data models limit analysis to already known questions, which is adequate for reporting, but is limiting for data discovery. When the business has a new question or a new data source, the model has to be modified, and this process can take months. Alternatively, the schema-on-read approach that Hadoop is highly optimized for enables users to discover insights without having to define the data model ahead of time. Users can discover insights without restricting their analysis of data through pre-defined models. One of the biggest values of Hadoop is to uniquely enable schema-on-read through a linear, scalable storage and compute engine. Implementing a pre-defined schema is therefore counterproductive. For example, the issues with Hive and external databases are well documented. See here for limitations with Hive. See here for the tuning required for performance. Recently, these issues were such a problem for a large mass media company and a large online travel company that they stopped using Hive and the traditional Business Intelligence approach all together and are now evaluating Datameer to utilize the schema-on-read approach.
Memory and Resource Management In certain cases, direct SQL interfaces work well, like when query runtimes are short enough that node failures during the query are unlikely. However, there are scenarios when SQL interfaces have limitations. For example, in some SQL interfaces, joins can fail to complete because there is insufficient memory. In these interfaces, during a join, data from the second, third (and so on) sets to be joined is loaded into memory. If the SQL interface chooses an inefficient join order or join mechanism, the query could exceed the total memory available. Other issues include lack of resource management across multiple runtimes (e.g. not running in YARN). Some Datameer customers have told us they’ve used Datameer instead because of the join failures and the lack of failover management.
Type of Users Most organizations have a mix of business and technical users who interact with the data. We have seen a number of Datameer customers use both SQL interfaces and Datameer to perform their analyses. Business users will use Datameer because they are used to using spreadsheets. IT users may use SQL interfaces for specific tasks that they’re more comfortable performing in SQL, for example segmenting data into specific groups. Then IT users will prefer to use Datameer for grouptopn and groupcustombin, groupby functions, joining and unioning data, partitioning and identifying outliers.
Conclusion Datameer believes in an open platform approach and that the open source community will improve the infrastructure. Above all we consider schema-on-read the biggest innovation in data analytics in 20 years, and that was enabled by Hadoop. Implementing schema-on-write on Hadoop would limit that innovation.
Agree? Disagree? Let me know your thoughts in the comments below or on Twitter at @KarenHsuMar and @Datameer.