Datameer Blog post
Follow These Three Steps to Optimize Your Data Lake
by Erin Hitchcock on Mar 05, 2018
Data today is quickly growing in volume, variability and complexity. This has left organizations with the challenge of harnessing all this data – however broad in variety or large in volume – to derive more value and insights from it.
Traditional enterprise data warehouses have problems dealing with the complexity of the data and the flexibility required for today’s range of analytic questions. To meet these demands, data lakes were conceived.
Your data lake is quickly becoming the answer to effectively managing a large volume and variety of data. The question now is: How do you provide the flexibility, speed and accessibility to fuel analytics that truly drive business results?
1. Curate and Govern Data Sets to Democratize Data
One of the things that sets the data lake apart from the traditional data warehouse is its ability to support all data types, not just the structured data. In essence, the data lake becomes the catch-all repository for your data.
However, data lakes can quickly become swamps if data isn’t curated from its raw form into something useful for the analysts. A data lake implementation requires a sound plan for optimizing data with curation and applying governance so that it can be made consumable by any analyst with the need to use it.
When curating and governing data, remember:
Adding context adds value
In order to create relevance for information sourced from non-traditional data sources – devices, mobile application log files, web server logs, sensors, social media activity, and more – context must be added to the data. This entails blending in the different sets of reference data and applying enrichment through functions and algorithms.
Once this occurs, you get a complete perspective on what the data means. Without these processes, value is lost, because it is difficult to tell how these figures relate to the customer, the product, revenues and other data that brings them together.
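To make the idea of adding context concrete, here is a minimal Python sketch of blending raw events with reference data and applying a simple enrichment function. The customer table, the event fields and the `slow_ms` threshold are all hypothetical, made up for illustration rather than taken from any particular product.

```python
# Hypothetical reference data: customer attributes keyed by customer ID.
CUSTOMERS = {
    "c-100": {"name": "Acme Corp", "segment": "enterprise"},
    "c-200": {"name": "Bolt LLC", "segment": "smb"},
}

# Hypothetical raw events, e.g. already parsed from a web server log.
RAW_EVENTS = [
    {"customer_id": "c-100", "path": "/pricing", "ms": 120},
    {"customer_id": "c-200", "path": "/docs", "ms": 2400},
    {"customer_id": "c-999", "path": "/docs", "ms": 90},
]


def enrich(event, customers, slow_ms=1000):
    """Blend one raw event with reference data and add a derived field."""
    ref = customers.get(event["customer_id"], {})
    return {
        **event,
        # Context from the reference data; unknown customers are flagged.
        "customer_name": ref.get("name", "unknown"),
        "segment": ref.get("segment", "unknown"),
        # Simple enrichment: mark responses at or above the assumed threshold.
        "is_slow": event["ms"] >= slow_ms,
    }


enriched = [enrich(e, CUSTOMERS) for e in RAW_EVENTS]
for row in enriched:
    print(row)
```

With the customer name and segment attached, a bare response time in a log line becomes something an analyst can relate back to a customer and a line of business.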
Data governance isn’t single-threaded
The advent of the data lake created a large repository that spans multiple areas and use cases across the enterprise, blended from a variety of information sources. This requires a model of data stewardship that can make important decisions about the data, including what it means, where it should be used, how accurate it should be and what rules should be followed in its usage.
As opposed to trying to single-thread the responsibility through a small group, data governance on the data lake needs to be a shared responsibility performed by a committee. This helps guarantee that data gets used in the right way and that users get quality data when they need it.
Data curation is more than preparation
Data curation plays a vital role in helping data lakes deliver on their promise. If applied properly, data curation utilizes a bottom-up approach to turn any raw data into information that produces useful analytics and can easily be consumed.
Data preparation traditionally involves cleansing and transformation along with integration and blending. Data curation takes the data to another level, helping to shape and organize it effectively for the analysis task at hand, enrich it with analytic and algorithmic functions and publish the data so it is easily retrievable for those in the organization that need it.
2. Deliver Strong Security and Governance Without Strangling
As the data lake consumes more data, security and governance become a bigger concern. Sensitive data about customers, as well as corporate information involving transactions and risk, needs to be more closely held and understood.
While it may seem simpler to just lock things down to ensure that security breaches are kept at bay, this may also do a disservice to your analytic efforts. The key is to find the right balance that allows security policies to work alongside your data lake adoption so that your organization doesn’t lose out on the value that big data analytics provides.
This approach should revolve around two main areas: Strong Security and Deep Governance. In ensuring strong security complemented by effective governance, there are several systems and practices that should be carefully evaluated:
- Role-based Security: This refers to the restriction of user access to data depending on the specific role associated with that user.
- Enterprise Security Integration: Data lakes also need to be integrated with the rest of the enterprise’s IT investments, including existing security mechanisms.
- Encryption and Obfuscation: A fundamental part of protecting sensitive information is encrypting data, both while at rest and over-the-wire. Certain fields may also need to be obfuscated from those who handle the data to keep them from misusing it.
- Usage and Behavior Auditing: An effective and detailed auditing process can help you better understand not just what the users are doing, but also how they behave within the system.
- Data Retention Policies: Guidelines on what and whose data is retained, and for how long, should comply with regulations like the GDPR, which disallows the retention of personal data for longer than necessary.
- Full Lineage: Data lineage helps you keep track of how and where the data is being used from the source to the final data sets, and should serve as a valid reference to counter questions on data veracity.
- Enterprise Governance Integration: Data lake governance policies should tie in together with the company’s existing investments in governance platforms for an enterprise-wide governance framework that guarantees secure access to quality data.
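A few of the practices above (role-based security, obfuscation and retention policies) can be sketched in a handful of lines of Python. This is only an illustration under stated assumptions: the role names, the column policy, the `mask` helper and the 365-day retention window are all hypothetical, and a real data lake would delegate these checks to its security and governance platform.

```python
import hashlib
from datetime import date, timedelta

# Hypothetical policy: which columns each role may see at all.
ROLE_COLUMNS = {
    "data_engineer": {"customer_id", "email", "amount", "created"},
    "business_analyst": {"customer_id", "amount", "created"},
}

# Hypothetical policy: columns a role sees only in obfuscated form.
MASKED_COLUMNS = {"business_analyst": {"customer_id"}}


def mask(value):
    """Replace a value with a short hash so it stays joinable but unreadable.

    Assumption for illustration only: an unsalted hash is not strong
    protection for low-entropy values.
    """
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def view_for(role, record):
    """Return only the columns the role may see, masking where required."""
    allowed = ROLE_COLUMNS.get(role, set())
    masked = MASKED_COLUMNS.get(role, set())
    return {
        key: (mask(value) if key in masked else value)
        for key, value in record.items()
        if key in allowed
    }


def is_expired(record, today, max_age_days=365):
    """Flag records older than the assumed retention window."""
    return today - record["created"] > timedelta(days=max_age_days)


record = {
    "customer_id": "c-100",
    "email": "a@example.com",
    "amount": 42.5,
    "created": date(2017, 1, 15),
}

analyst_view = view_for("business_analyst", record)
print(analyst_view)
print(is_expired(record, today=date(2018, 3, 5)))
```

Keeping the policy in plain data structures, separate from the code that enforces it, makes it easier to review and to align with existing enterprise security rules.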
3. Facilitate Consumption From the Data Lake
Data lakes are evolving to the point where they need to support multiple usage scenarios. Being able to take a dip right into the lake gives data users the opportunity to try a lot of things with the data. But, in order to make the data lake accessible, it’s important to keep in mind the different users that the data lake serves.
These users can range from IT teams that operate the data lake, to data engineers who manage data through the lifecycle, to business analysts who do the last mile of data curation and consume datasets, and finally, data scientists who want to use the data lake to create datasets that serve their advanced analytic needs.
The core data wranglers are the power users – the data engineers and/or scientists, who know the most about how to find the data, curate it and then bring it all together for the downstream analytics. The data scientist is also a consumer, incorporating curated data into his or her predictive or deep learning initiatives.
As a consumer and user, the business analyst pores over the organized information to study for patterns and trends, identify potential problems, create solutions to current challenges and recognize opportunities. Behind all these processes are the IT operators who run the system and make sure that all users are able to effectively perform the different jobs and tasks associated with the data lake. So how do these various personas consume data from the data lake?
Depending on what type of analysis the enterprise is after and what type of presentation is most ideal, there are three different methods of consuming data from the lake:
- Ad-Hoc: Create data sets as-needed and push the resulting curated data into a “database” style mechanism so the analyst can perform ad-hoc querying.
- Delivered: Create data pipelines that produce curated datasets and push the information to the analyst’s favorite discovery and visualization tools like Tableau or Power BI.
- Direct: Allow users to perform their exploration directly on the data lake – to study the data in a fashion that allows them to look at the larger data set.
Each approach has its own set of pros and cons.
The preferred approach for consuming data is determined by how the analyst wants to work with the data and what questions they are trying to answer. In any case, what’s important is that organizations use each of these methods for their strengths.
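As one illustration of the Ad-Hoc method, the sketch below pushes a small curated data set into an in-memory SQLite database and runs a query against it. The table name, columns and sample rows are hypothetical, and SQLite simply stands in for whatever "database" style mechanism an organization actually uses.

```python
import sqlite3

# Hypothetical curated rows: (customer segment, page path, response time in ms).
curated = [
    ("enterprise", "/pricing", 120),
    ("smb", "/docs", 2400),
    ("smb", "/pricing", 300),
]

# Push the curated data set into a queryable store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (segment TEXT, path TEXT, ms INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", curated)

# The analyst can now ask ad-hoc questions with plain SQL.
rows = conn.execute(
    "SELECT segment, COUNT(*), AVG(ms) "
    "FROM page_views GROUP BY segment ORDER BY segment"
).fetchall()
print(rows)
```

The Delivered and Direct methods differ mainly in where the curated data ends up: in a visualization tool's own data source, or left in place on the lake for exploration.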
It’s all about finding the right technology that helps you achieve faster time-to-analytics and derive the most value out of your data lake – and a well-managed data lake should support ALL THREE methods to facilitate adoption by any type of analyst with any combination of skills.
Erin Hitchcock is the Public Relations and Analyst Relations Manager at Datameer. In this role, she works diligently alongside thought leaders to spread the word about big data and data engineering technologies.