
Datameer Blog

Big Data Governance: How to Crack the Code

By Stefan Groschupf on December 3, 2015

**This post was originally published in the October 2015 special Big Data issue of CIO Review.**

Businesses are no longer pondering what big data is; instead they are asking what big data is doing for them, and then acting on it. We’re seeing enterprises and Internet-born companies alike generate breakthrough insights by bringing together all of their structured and unstructured data. With smart data insights they are increasing operational efficiency, saving costs, better serving customers and discovering new revenue streams with data-driven products and services.

There’s no doubt big data analytics translates into big competitive advantages. However, while data unleashes unlimited possibilities, it also introduces new risks and challenges that the enterprise must tackle. As data-driven approaches make the transition to the mainstream, governance around big data analytics has quickly become a top concern for businesses.

Data Democratization and the Move to Self-Service Tools

It’s clear that the adoption of self-service data analytics tools is underway, yet with it come challenges enterprises must overcome, specifically around data governance. Research firm Gartner predicted that by 2017 the majority of business users and analysts would have access to self-service tools to prepare data for analysis. With self-service tools, companies no longer have to rely on teams of data analysts to extract data insights and make smarter business decisions. Instead, they can bypass those teams and put data directly into the hands of everyone across the organization.

However, with data in the hands of every business user, governance becomes paramount. Organizations are asking: “Can I find that needle in the haystack? Is my data auditable? Where is the data coming from?” These concerns are only amplified by the move to self-service, which decentralizes big data. To mitigate them, businesses must have complete transparency into their data pipelines. Once data governance concerns are adequately addressed, businesses won’t be forced to choose between self-service big data analytics and robust, governable big data architectures. Instead, they can have easy-to-use self-service tools with enterprise-grade data governance controls.

As more organizations make the shift to becoming data driven, how can they squelch their data governance qualms and confidently choose the analytics solutions that best suit their needs?

Bringing Order to Hadoop’s Wild West

The world of big data, which includes Hadoop, needs to take data governance more seriously in order to become ready for enterprise-grade deployment. Unfortunately, Hadoop, the dominant big data technology, does not natively offer the appropriate tool sets to diligently audit for compliance with internal and external regulations, ensure data quality and address other governance concerns.

The notion of tracking changes in data and controlling access to data in a granular fashion, where certain users have access to certain subsets of it, is an assumed functionality in the world of the enterprise data warehouse. The same does not hold true in the Hadoop world.

Hadoop has been seen as the Wild West in which vendors have been developing different products for the ecosystem without really thinking about data governance and sophisticated security protocols. Business users continue to spread their work between different tools in the Hadoop ecosystem, such as Apache Hive, Apache Pig, MapReduce and higher-level platforms built on top of Hadoop. So even as some governance features are added to individual components, the need for an overarching governance system is clear.

How to Crack the Governance Code

While big data analytics is enabling significant new use cases, it’s also becoming increasingly complex. Business users need an easier way to navigate data pipelines that have been developed by multiple departments and participants and that involve multiple data sources. As these use cases occur more frequently, it’s imperative that data quality and consistency, data policies and standards, data security and privacy, regulatory compliance, and retention and archiving are recognized as must-have capabilities for enterprises across the board.

Relevant issues include tracing data sources and lineage, auditing changes made to data and to data management policies, and ensuring users see only the data they are authorized to access. Most organizations already struggle to implement these concepts, making data governance a significant challenge. When assessing data governance concerns, enterprises of all sizes should address the following areas:

  • Being regulation-compliant without locking users out of their data: Enterprise-grade security and fine-grained data access policies are the first line of defense against risk. For IT, the goal is to implement policies that manage risk appropriately while still meeting business needs. Role-based access control is crucial, letting IT control which users can perform which tasks. For example, you can give bulk ingest abilities to IT staff only, while still allowing analysts to upload their own files on an ad hoc basis (a minimal sketch of this idea appears after this list).
  • Upholding data quality and consistency so data is accurate: Data quality and consistency are imperative for ultimately extracting value from big data. If at any point in the data pipeline there is a question about data validity, the value of the resulting insights is in question. Users need to be able to check and remediate issues like dirty, inconsistent or invalid data at any stage of a complex analytics pipeline. The tools must also provide transparency into every change, from the original dataset all the way through to the final visualization, so businesses can trace insights back to the source if needed. As quality issues are remediated, the fixes are logged so they can be audited later; downstream analyses are safeguarded from dirty data, and erroneous results are avoided (the second sketch after this list shows one way such change logging might work).
  • Enabling retention and archiving: Industries that are fast to adopt big data analytics, such as healthcare and financial services, are governed by both internal and external regulations that dictate the rules around retention and archiving of records. To comply, businesses need the ability to apply flexible retention rules so that each imported dataset’s retention policy can be configured individually. With such rules in place, it is easy to automatically keep data permanently or purge records older than a specific time window, and in the event of an audit the business has ready access to all of its data (the final sketch after this list illustrates per-dataset retention rules).
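
To make the role-based access control idea concrete, here is a minimal sketch in Python. The role names, permissions and the check_permission helper are all hypothetical, not part of any particular product; the point is simply that policy lives in one place and every action is checked against it.

```python
# A minimal sketch of role-based access control for a data platform.
# Role names, permissions, and check_permission are hypothetical;
# they illustrate the idea, not any specific product's API.

ROLE_PERMISSIONS = {
    "it_admin": {"bulk_ingest", "upload_file", "manage_connections"},
    "analyst":  {"upload_file", "run_analysis"},
    "viewer":   {"view_dashboards"},
}

def check_permission(role: str, action: str) -> bool:
    """Return True if the given role is allowed to perform the action."""
    return action in ROLE_PERMISSIONS.get(role, set())

# IT staff can run bulk ingest jobs; analysts can only upload ad hoc files.
assert check_permission("it_admin", "bulk_ingest")
assert not check_permission("analyst", "bulk_ingest")
assert check_permission("analyst", "upload_file")
```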
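
The audit-trail idea from the data quality bullet can be sketched the same way. The AuditedDataset class below is hypothetical: it assumes a simple list-of-dicts dataset and records who changed what and when, so a final result can be traced back to the raw source.

```python
# A minimal sketch of transformation-level lineage. Every change is
# recorded with who made it and when, so an auditor can trace a final
# result back to the original source. AuditedDataset is illustrative.

from datetime import datetime, timezone

class AuditedDataset:
    def __init__(self, records, source):
        self.records = records
        # The lineage log starts with the origin of the data.
        self.lineage = [{"step": "ingest", "source": source,
                         "at": datetime.now(timezone.utc).isoformat()}]

    def transform(self, description, user, fn):
        """Apply a transformation and append an audit entry for it."""
        self.records = [fn(r) for r in self.records]
        self.lineage.append({"step": description, "by": user,
                             "at": datetime.now(timezone.utc).isoformat()})
        return self

raw = [{"amount": " 42 "}, {"amount": "17"}]
ds = AuditedDataset(raw, source="s3://bucket/transactions.csv")
ds.transform("strip whitespace from amount", "analyst_jane",
             lambda r: {**r, "amount": r["amount"].strip()})
ds.transform("cast amount to int", "analyst_jane",
             lambda r: {**r, "amount": int(r["amount"])})

for entry in ds.lineage:  # the full, auditable history of every change
    print(entry)
```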
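
Finally, a minimal sketch of per-dataset retention rules, assuming each imported dataset maps to its own policy. The policy names, record layout and apply_retention helper are illustrative only.

```python
# A minimal sketch of per-dataset retention: each dataset carries its
# own policy, and records older than the configured window are purged
# automatically (or kept forever). All names here are hypothetical.

from datetime import datetime, timedelta, timezone

RETENTION_POLICIES = {
    "clinical_records": {"keep_forever": True},           # e.g. regulatory hold
    "web_clickstream":  {"max_age": timedelta(days=90)},  # purge after 90 days
}

def apply_retention(dataset_name, records, now=None):
    """Drop records older than the dataset's retention window."""
    now = now or datetime.now(timezone.utc)
    policy = RETENTION_POLICIES[dataset_name]
    if policy.get("keep_forever"):
        return records
    cutoff = now - policy["max_age"]
    return [r for r in records if r["imported_at"] >= cutoff]

records = [
    {"id": 1, "imported_at": datetime.now(timezone.utc) - timedelta(days=200)},
    {"id": 2, "imported_at": datetime.now(timezone.utc) - timedelta(days=10)},
]
print(apply_retention("web_clickstream", records))  # only the recent record survives
```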

With comprehensive big data governance in place, businesses can responsibly head toward a data-driven future. Addressing these data governance concerns removes a serious barrier to implementing big data initiatives in the enterprise and expanding existing ones, even when highly sensitive data is involved. While the Hadoop ecosystem continues to evolve and data governance standards and frameworks emerge, businesses must keep in mind what data governance capabilities they need right now. Armed with these concepts, businesses can kick off and expand their big data initiatives with the confidence that comes from having robust governance strategies in place.

 




Stefan Groschupf

Stefan Groschupf is a big data veteran and serial entrepreneur with strong roots in the open source community. He was one of the very few early contributors to Nutch, the open source project that spun out Hadoop, which, 10 years later, is considered a 20-billion-dollar business. Open source technologies designed and coded by Stefan can be found running in all of the Fortune 20 companies, and innovative open source technologies like Kafka, Storm, Katta and Spark all rely on technology Stefan designed more than half a decade ago. In 2003, Groschupf was named one of the most innovative Germans under 30 by Stern Magazine. In 2013, Fast Company named Datameer one of the most innovative companies in the world. Stefan is currently CEO and Chairman of Datameer, the company he co-founded in 2009 after several years of architecting and implementing distributed big data analytic systems for companies like Apple, EMI Music, Hoffmann La Roche, AT&T, the European Union and others. After two years in the market, Datameer was commercially deployed in more than 30 percent of the Fortune 20. Stefan is a frequent conference speaker, contributes to industry publications and books, holds patents and advises a set of startups on product, scale and operations. When not working, Stefan is backpacking, sea kayaking, kite boarding or mountain biking. He lives in San Francisco, California.
