As someone who interacts with business and IT professionals every day around big data, analytics and Hadoop, I have a lot of interesting conversations about various companies’ challenges within this space. In those conversations, I’ve come to realize that while they might be in different verticals or have an assortment of use cases, three questions come up over and over, so I thought I’d take the time to address them here:
What does big data analytics do that my existing BI software doesn’t?
Ask any BI analyst or user – there is a tremendous amount of market noise about big data analytics being the answer to everything.
But of course, it’s not. Big data analytics isn’t about throwing away your existing BI; it’s about the new use cases that big data analytics brings to the table. With traditional BI, essentially all of the data is highly structured, usually transactional, and resides within a database, data warehouse or OLAP cube somewhere. With some small variations, all traditional BI uses the same process: business users determine what questions they need answered, and then the IT folks find the data sources and build a data schema that answers those questions. Examples of traditional BI use cases include monthly sales or product reports, product profitability analysis and customer survey results.
Big data analytics is all about new use cases. Big data analytics and Hadoop are more flexible than traditional BI. Hadoop can store any data, no matter the source or whether it’s structured or unstructured. This enables creative discovery, correlation and analysis across all of the available data without the limits of a data schema. The business, armed with these rich new data sources, can explore questions around brand sentiment, product strategy and asset utilization that require structured data together with weblogs, social media and other unstructured data. These analyses provide insights into how to offer prospects a more personalized user experience, better understand customer behavior, and make more relevant recommendations to customers based on what they’ve purchased in the past.
In short, the gist of big data analytics is to let users loose on all their data, no matter where it came from, in order to see a more complete picture.
Will Hadoop replace my Data Warehouse?
I get this question a lot. The answer is: not unless you want it to. At least not for traditional BI queries. The dream of a central repository for all of one’s data has been around for years. Data warehousing technology was thought to be the ideal central repository, but companies find that maintaining a rigid data schema, and reworking it whenever a new data source is added, is simply too time-consuming for today’s fast-paced business climate. The end result is a static data store that soon becomes out of date. And, of course, existing data warehouses are being challenged by growing varieties and volumes of data. One prospective customer I spoke with recently summed it up best: “Our data warehouse is where data goes to die.”
Hadoop addresses the challenges around data variety, storing semi-structured and unstructured data such as web and application logs and social media in their raw form, as well as structured data. In cases where a company’s existing data warehousing infrastructure is overwhelmed by huge volumes of structured data, Hadoop can also serve as a lower-cost alternative to a proprietary database.
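To make the “raw form” point concrete, here is a minimal Python sketch of the schema-on-read idea: log lines are stored exactly as they arrived, and structure is imposed only at analysis time. The log format, field names and data below are hypothetical, purely for illustration.

```python
import re

# Raw weblog lines, stored as-is with no schema imposed at load time.
# (Hypothetical Apache-style access-log format.)
raw_logs = [
    '10.0.0.1 - - [12/Mar/2014:10:01:22] "GET /product/42 HTTP/1.1" 200',
    '10.0.0.2 - - [12/Mar/2014:10:01:25] "GET /cart HTTP/1.1" 404',
]

# Structure is imposed only when we analyze: "schema on read."
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+)'
)

def parse(line):
    """Turn one raw line into a dict of named fields, or None if malformed."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

records = [r for r in map(parse, raw_logs) if r]

# Ad-hoc analysis over the parsed records: hits per URL path.
hits_by_path = {}
for r in records:
    hits_by_path[r["path"]] = hits_by_path.get(r["path"], 0) + 1

print(hits_by_path)  # → {'/product/42': 1, '/cart': 1}
```

If a new question arises tomorrow (say, error rates by client IP), you change the analysis code, not the stored data; nothing has to be re-loaded into a new schema.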
Hadoop is best viewed as a technology complementary to a data warehouse. A data warehouse is usually the better option for reporting on moderate amounts of transactional data, and it can also serve as a data source for Hadoop. Once the data is in Hadoop, Datameer can analyze any subset of it, or all of it, structured or unstructured. Hadoop is most complementary when analyzing semi-structured and unstructured data along with structured transaction or catalog data. Datameer’s spreadsheet-like UI makes those correlations and analyses very easy.
What about Hive? And isn’t Hive free?
Companies exploring the big data and Hadoop ecosystem often look at Hive. Hive is a data warehouse infrastructure built on top of Hadoop that allows for querying and analysis of structured data. For companies that have specialized engineering talent that understands SQL (Structured Query Language), and that don’t have semi-structured or unstructured data components, Hive can appear to be a viable fit.
However, many of our customers looked into Hive and found its limitations. Time to insight is slow because joining multiple data sets and maintaining data schemas for analytics require complex query code. Hive doesn’t provide tools for ETL, analysis or visualization, so additional technologies and expertise are needed to tie those components together. With Datameer and its spreadsheet UI, no SQL programming is needed.
Hive, like the structured data stores used in traditional BI, requires tables and schemas that are then queried via a SQL-like language. This approach carries the same limitations as many existing systems: the questions that can be explored are limited to those covered by data in the Hive schema, rather than the full raw data that can be analyzed with Datameer. As I mentioned earlier, one of the strengths of Hadoop is the ability to do ad-hoc analysis directly on all the raw data and then correlate back to transactional data within a database. Forcing data into a schema with Hive negates the flexibility that Hadoop provides.
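The kind of ad-hoc correlation described above can be sketched in plain Python: unstructured text is scanned directly, with no predefined schema, and the results are joined back to structured catalog data. The product names, posts and prices here are invented for illustration.

```python
# Hypothetical raw social-media posts, analyzed directly with no schema.
posts = [
    "loving my new Acme X1 phone!",
    "the Acme X1 battery is terrible",
    "just ordered an Acme Z3",
]

# Structured catalog data, as it might come from a transactional database.
catalog = {
    "Acme X1": {"sku": "AX1", "price": 299},
    "Acme Z3": {"sku": "AZ3", "price": 499},
}

# Ad-hoc correlation: count raw-text mentions per product, then join
# the counts back to the structured catalog records.
mentions = {name: sum(name in p for p in posts) for name in catalog}
report = {
    name: {"mentions": n, "price": catalog[name]["price"]}
    for name, n in mentions.items()
}

print(report)
# → {'Acme X1': {'mentions': 2, 'price': 299},
#    'Acme Z3': {'mentions': 1, 'price': 499}}
```

The point is that the question (“which products are people talking about, and what do they cost?”) was posed after the raw posts were collected; no schema had to anticipate it.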
Every day, we see companies using Datameer and Hadoop to get fast insights into customer behavior, company processes and business performance across all of their data. BI and traditional data warehouses will continue to play a role going forward, but it’s the new use cases that are driving all the excitement around big data.