**This post was originally published in the October 2015 special Big Data issue of CIO Review**
Businesses are no longer pondering what big data is; instead they are asking what big data is doing for them, and then acting on it. We’re seeing enterprises and Internet-born companies alike generate breakthrough insights by bringing together all of their structured and unstructured data. With smart data insights they are increasing operational efficiency, saving costs, better serving customers and discovering new revenue streams with data-driven products and services.
There’s no doubt big data analytics translates into big competitive advantages. However, while data unleashes unlimited possibilities, it also introduces new risks and challenges that the enterprise must tackle. As data-driven approaches make the transition to the mainstream, governance around big data analytics has quickly become a top concern for businesses.
Data Democratization and the Move to Self-Service Tools
It’s clear that the adoption of self-service data analytics tools is underway, yet with it come challenges that enterprises must overcome, specifically around data governance. Research firm Gartner predicted that by 2017 the majority of business users and analysts will have access to self-service tools to prepare data for analysis. With self-service tools, companies no longer have to rely on teams of data analysts to extract data insights and make smarter business decisions. Instead, they can bypass that role and put data directly into the hands of everyone across an organization.
However, with data in the hands of every business user, governance becomes paramount. Organizations are asking: “Can I find that needle in the haystack? Is my data auditable? Where’s the data coming from?” These concerns are only heightened by the move to self-service, which decentralizes big data. To mitigate them, businesses must have complete transparency into their data pipelines. Once data governance concerns are adequately addressed, businesses won’t be left trying to choose between self-service big data analytics and robust, governable big data architectures. Instead, they will have access to easy-to-use self-service tools with enterprise-grade data governance controls.
As more organizations make the shift to becoming data driven, how can they squelch their data governance qualms and choose the analytics solutions that best suit their needs with confidence?
Bringing Order to Hadoop’s Wild West
The world of big data, which includes Hadoop, needs to take data governance more seriously in order to become ready for enterprise-grade deployment. Unfortunately, Hadoop, the dominant big data technology, does not natively offer the appropriate tool sets to diligently audit for compliance with internal and external regulations, ensure data quality and address other governance concerns.
The notion of tracking changes in data and controlling access to data in a granular fashion, where certain users have access to certain subsets of it, is an assumed functionality in the world of the enterprise data warehouse. The same does not hold true in the Hadoop world.
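To make the idea concrete, here is a minimal sketch in Python of what "granular access plus change tracking" means in practice. It is purely illustrative: the `GovernedTable` class, the role names and the column names are hypothetical, and nothing here reflects how any particular data warehouse or Hadoop component implements these controls.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class GovernedTable:
    """Toy in-memory table with column-level access rules and an audit log."""
    rows: list
    # Hypothetical policy: role name -> set of columns that role may see or change.
    policy: dict
    audit_log: list = field(default_factory=list)

    def _record(self, user, action, detail):
        # Every read, change and denied attempt is appended to the audit trail.
        self.audit_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "detail": detail,
        })

    def read(self, user, role):
        # Users only ever receive the columns their role is authorized for.
        allowed = self.policy.get(role, set())
        self._record(user, "read", sorted(allowed))
        return [{k: v for k, v in row.items() if k in allowed} for row in self.rows]

    def update(self, user, role, index, column, value):
        # Changes are refused for unauthorized columns and logged either way.
        if column not in self.policy.get(role, set()):
            self._record(user, "denied_update", column)
            raise PermissionError(f"{role} may not modify {column}")
        old = self.rows[index][column]
        self.rows[index][column] = value
        self._record(user, "update", {"column": column, "old": old, "new": value})


# Example usage with made-up data and roles.
table = GovernedTable(
    rows=[{"customer": "acme", "region": "west", "revenue": 1200}],
    policy={
        "analyst": {"customer", "region"},
        "finance": {"customer", "region", "revenue"},
    },
)
print(table.read("dana", "analyst"))   # the revenue column is filtered out
table.update("lee", "finance", 0, "revenue", 1500)
print([entry["action"] for entry in table.audit_log])
```

An enterprise data warehouse provides the equivalent of both behaviors out of the box; the point of the sketch is only to show how much is implied by the phrase "assumed functionality."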
Hadoop has been seen as the Wild West in which vendors have been developing different products for the ecosystem without really thinking about data governance and sophisticated security protocols. Business users continue to spread their work between different tools in the Hadoop ecosystem, such as Apache Hive, Apache Pig, MapReduce and higher-level platforms built on top of Hadoop. So even as some governance features are added to individual components, the need for an overarching governance system is clear.
How to Crack the Governance Code
While big data analytics is enabling significant new use cases, it’s also becoming increasingly complex. Business users need an easier way to navigate data pipelines that have been developed by multiple departments and participants and involve multiple data sources. As these types of use cases occur more frequently, it’s imperative that data quality and consistency, data policies and standards, data security and privacy, regulatory compliance, and retention and archiving are recognized as must-have capabilities for enterprises across the board.
Relevant issues include tracing data sources and lineage, auditing changes made to data and data management policies, and ensuring users see only the data for which they are authorized. Most organizations already struggle with implementing these concepts, making data governance a significant challenge. When taking a look at data governance concerns, enterprises of all sizes should address the following areas:
With comprehensive big data governance in place, businesses can responsibly head toward a data-driven future. Addressing these data governance concerns removes a serious barrier to implementing big data initiatives in the enterprise and expanding existing initiatives, even when highly sensitive data is involved. While the Hadoop ecosystem continues to evolve and data governance standards and frameworks emerge, businesses must keep in mind what data governance capabilities they need right now. With these concepts in mind, businesses can kick off and expand their big data initiatives with the confidence that comes from having robust governance strategies in place.