We examined several different approaches and architectures for governance. Choose the approach that fits the unique needs of your organization, data, analytics, and business teams. With Datameer, you can mix and match these models to fit the individual needs of departments or business units.
In the world of big data and analytics, “governance” has become a buzzword. The notion of governing data is laudable. Data is strategic, important, and, if mishandled, potentially compromising. It certainly needs to be protected.
Your data needs a custodial layer to make that happen, and governance would seem to provide that. Its name makes that almost self-explanatory.
But although governance does encompass that layer, it extends well beyond it. In its best implementations, data governance does more than establish a defensive regime around data: it creates an environment that makes data highly available, trustworthy, and easily discoverable. Good data governance encourages people in the organization to explore, query, and contribute data, and it supports digitalization efforts and the adoption of data-driven practices.
Governance of any data is of paramount importance.
Lineage and Impact Analysis; Audit; Security; Data Quality; Compliance; Certification; Master Data Management; Data Cataloging.
Authentication; Secure Impersonation With Kerberos; Roles, Access Control and Permissions; Obfuscation; Lineage; Data Management; Auditing (Event Bus).
Drivers Behind Governance; What You Need to Govern; Introduction; Data-centric; User-centric; Reusability-centric; Department-centric; Lifecycle-centric
Datameer-centric; Data Lake-centric; Enterprise-centric.
Scope, People, Process, and Fit.
As previously noted, governance covers many key aspects of how you want to operate your big data analytics. These aspects include:
Data security: This is important, but it is not the only aspect of governance. For example, when you look at your big data analytics, you need to define how to lock down your data, provide secure views of it, and ensure the proper access controls are in place for both the system and the data.
Optimization: Governance should also help the team optimize its infrastructure to run effectively. Optimizing big data analytics involves creating the right structure and letting team members operate and optimize what they know best.
Self-service: A well-aligned governance strategy enables the degree of self-service you want to provide. Controls that are too tight stifle self-service; controls that are too loose introduce risk.
Sharing and Reusability: With all the data involved in big data analytics, sharing and reusability bring greater economies of scale. Governance needs to implement the right structure for findability and the right blend of controls for reuse and sharing.
Operationalization: Governance plays an important role in how your analytics are put to work. This involves a clean structure and process for promoting analytics to regularly running jobs. You want to be confident that each job runs cleanly, produces the right results, and delivers them to the proper business teams.
The first reference architecture focuses on Datameer, with much of the governance work performed using Datameer administration features for folder organization structure, user management, and role-based security. There are two critical integration points with external services:
LDAP or Active Directory, possibly combined with SAML, for authentication
Secure Impersonation, optionally with Kerberos integration
When LDAP or Active Directory is used solely for authentication, with roles still defined inside Datameer and applied to artifacts, secure impersonation ensures that jobs run on the Hadoop cluster with the same privileges as the Datameer user, keeping cluster-side security closely aligned. If the cluster is secured using Kerberos, that integration should be configured as well.
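The division of responsibilities described above can be illustrated with a small conceptual sketch in Python. It is not Datameer's API: every name in it (`User`, `Artifact`, `authenticate`, `can_access`, `submit_job`, and the `svc_analytics` service account) is hypothetical, and the in-memory `directory` dictionary merely stands in for an LDAP or Active Directory lookup. The sketch assumes that the external directory only verifies identity, that roles are held inside the analytics tool and attached to artifacts, and that a job is submitted by a service account but runs as the end user.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class User:
    name: str
    roles: frozenset  # roles defined inside the analytics tool, not in the directory


@dataclass
class Artifact:
    path: str
    allowed_roles: set = field(default_factory=set)


def authenticate(username, password, directory):
    """Stand-in for an LDAP/Active Directory check: verifies identity only."""
    return directory.get(username) == password


def can_access(user, artifact):
    """Role-based check: the user needs at least one role granted on the artifact."""
    return bool(user.roles & artifact.allowed_roles)


def submit_job(user, artifact, service_account="svc_analytics"):
    """Return the identities involved in a cluster job.

    With impersonation, the (hypothetical) service account submits the job on
    behalf of the end user, so cluster-side permissions apply to that user.
    """
    if not can_access(user, artifact):
        raise PermissionError(f"{user.name} may not use {artifact.path}")
    return {"submitted_by": service_account, "run_as": user.name}


# Illustrative data only.
directory = {"alice": "s3cret"}
alice = User("alice", frozenset({"analyst"}))
bob = User("bob", frozenset({"marketing"}))
report = Artifact("/finance/q1_report", {"analyst", "finance_admin"})

if authenticate("alice", "s3cret", directory):
    print(submit_job(alice, report))
```

Keeping the access check separate from authentication mirrors the split in the text: the directory answers "who is this?", while roles inside the tool answer "what may they touch?". Running the job as the user rather than as the service account is what keeps cluster permissions aligned with those roles.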