What’s With Data Lakes? Five Questions, Answered

  • John Morrell
  • February 26, 2018

What is a data lake? A data lake is essentially a huge collection of all the data your company collects about its customers, operations, transactions and more. Think of all of your company’s data sources as individual pools of data. Today you need to hop from pool to pool to find the fish you seek. With a data lake, information flows in from all of these pools, so you can find your fish in one place and spot patterns and trends across all the fish.

1. How Is Data in Data Lakes Organized and Managed?


Data lakes use a flat, schema-less organization structure. Data is left in its natural form, leaving you with a collection of data in many different formats. A common best practice is to add unique identifiers and metadata tags so you can hunt down the data you need and use it. The open-ended nature of a data lake allows analysts to actively explore data in the lake without any predefined requirements, giving them the ability to discover answers in new ways.
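To make the identifier-and-tag practice concrete, here is a minimal sketch in Python. It is an in-memory stand-in, not a real lake: the `LakeEntry` and `LakeCatalog` names, the tag values and the sample payloads are all hypothetical, chosen only to show raw data kept in its original format while still being findable by tag.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeEntry:
    """One raw object in the lake, stored as-is alongside its metadata."""
    raw: bytes                                   # payload in its original format
    fmt: str                                     # e.g. "json", "csv", "log"
    tags: set = field(default_factory=set)       # free-form metadata tags
    entry_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class LakeCatalog:
    """Flat, in-memory stand-in for a lake's metadata catalog (hypothetical)."""

    def __init__(self):
        self._entries = {}

    def ingest(self, raw, fmt, tags):
        """Store the payload untouched and return its unique identifier."""
        entry = LakeEntry(raw=raw, fmt=fmt, tags=set(tags))
        self._entries[entry.entry_id] = entry
        return entry.entry_id

    def find(self, *tags):
        """Return every entry that carries all of the requested tags."""
        wanted = set(tags)
        return [e for e in self._entries.values() if wanted <= e.tags]


catalog = LakeCatalog()
catalog.ingest(b'{"customer": 42, "total": 19.99}', "json", ["sales", "2018"])
catalog.ingest(b"42,Jane Doe,jane@example.com", "csv", ["customers"])
catalog.ingest(b"2018-02-26 GET /checkout 200", "log", ["web", "sales"])

sales_entries = catalog.find("sales")
```

Nothing about the payloads is normalized on the way in. The tags are the only thing that lets an analyst pull the JSON order and the web log line together when asking a sales question.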

Data lakes are often associated with Hadoop. You can create a data lake outside of Hadoop, but it would face numerous challenges. Hadoop is the predominant architecture for data lakes because:

  1. The schema-less structure and schema-on-read capabilities provide the data flexibility a data lake requires
  2. The scalable, extensible CPU and storage power allow the data lake to grow in a linear fashion
  3. The ability to use commodity hardware gives a Hadoop-based data lake a tremendous economic advantage over other approaches

Putting a data lake on Hadoop provides a central location from which all the data and associated metadata can be managed, lowering the cost of administration.
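The schema-on-read idea in the first point can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Hadoop code: the `read_with_schema` helper, the field names and the sample records are hypothetical. The point it shows is that the raw data is stored once, untouched, and each question brings its own schema at read time.

```python
import csv
import io
import json

# Raw records kept exactly as they arrived; no schema was enforced on write.
raw_records = [
    ("json", '{"customer_id": 42, "total": "19.99", "channel": "web"}'),
    ("csv", "customer_id,total\n7,5.00\n"),
]


def read_with_schema(fmt, payload, schema):
    """Parse a raw payload and project it onto `schema` at read time.

    `schema` maps field name -> type constructor; missing fields become None.
    """
    if fmt == "json":
        rows = [json.loads(payload)]
    elif fmt == "csv":
        rows = list(csv.DictReader(io.StringIO(payload)))
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return [
        {name: (cast(row[name]) if name in row else None)
         for name, cast in schema.items()}
        for row in rows
    ]


# Two different questions, two different schemas, same untouched raw data.
revenue_schema = {"customer_id": int, "total": float}
channel_schema = {"customer_id": int, "channel": str}

revenue = [r for fmt, p in raw_records
           for r in read_with_schema(fmt, p, revenue_schema)]
channels = [r for fmt, p in raw_records
            for r in read_with_schema(fmt, p, channel_schema)]
```

The CSV record has no `channel` field, so the second read returns `None` for it rather than failing. In a schema-on-read setting, gaps like this surface when the question is asked, not when the data is loaded.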

2. What’s the Difference Between a Data Lake and Data Warehouse?

While a data warehouse can also be a large collection of data, it is highly organized and structured. In a data warehouse, data doesn’t arrive in its original form, but is instead transformed and loaded into the structure predefined by the warehouse.

This highly structured approach means that a data warehouse is often tuned to solve a specific set of problems. The structure and organization make it easy to query for those problems, but practically impossible to query for others.

A data lake, on the other hand, can be applied to a large number and wide variety of problems. Believe it or not, this is because of the lack of structure and organization in a data lake. The lack of a pre-defined schema gives a data lake more versatility and flexibility.
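The trade-off between the two approaches can be illustrated with a small Python sketch. The schema, the loader functions and the sample events below are hypothetical. The warehouse-style loader transforms each event into a predefined structure, while the lake-style loader keeps the event as it arrived.

```python
# Hypothetical predefined warehouse schema: column name -> type.
WAREHOUSE_SCHEMA = {"customer_id": int, "total": float}


def load_into_warehouse(event):
    """Schema-on-write: transform on the way in, keep only predefined columns."""
    return {col: cast(event[col]) for col, cast in WAREHOUSE_SCHEMA.items()}


def load_into_lake(event):
    """Lake-style ingest: keep a copy of the event untouched."""
    return dict(event)


events = [
    {"customer_id": "42", "total": "19.99", "device": "mobile"},
    {"customer_id": "7", "total": "5.00", "device": "desktop"},
]

warehouse = [load_into_warehouse(e) for e in events]
lake = [load_into_lake(e) for e in events]

# A question the warehouse schema anticipated: total revenue.
revenue = sum(row["total"] for row in warehouse)

# A question it did not anticipate: revenue by device.
# The warehouse rows no longer carry "device"; the lake rows still do.
by_device = {}
for e in lake:
    by_device[e["device"]] = by_device.get(e["device"], 0.0) + float(e["total"])
```

The warehouse answers the revenue question with no parsing at query time, which is the benefit of its structure. The device question can only be answered from the lake, because the `device` field was dropped when the warehouse rows were shaped.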

3. Where Can You Use a Data Lake?

Data warehouses evolved because they answered the highly structured, everyday questions that analysts asked. These typically revolved around the transactional aspects of the business and allowed an analyst to drill down on the specific dimensions defined. Think of it as being able to walk into different sections of a warehouse, go down certain aisles and look on specific shelves.

Data lakes are meant to solve problems that are not as structured and require “discovering” the answer from the data. Analysts may know what question they need answered, but not what combination of data and analysis will reveal the answer. This requires iterative exploration and application of different, often more complex analytic functions to reveal the true answer.

4. Is Your Analytic Platform Designed for a Data Lake?

Just as data warehouses were organized to solve highly structured problems, the BI tools that accompanied the warehouse were designed to work with that structure. They were often designed to allow the analyst to “slice and dice” the data along the structure provided in the warehouse (dimensions and measures).

In the same manner, analytic platforms that solve problems on data lakes need to embrace the lake's versatility and loose structure. While the underlying Hadoop technology provides the versatility a data lake needs, many existing analytic platforms are not designed to take advantage of it, leaving many companies struggling to get real value out of their data lakes.

Analytic platforms built natively for Hadoop are designed to use the varying data types, structures and formats found in a data lake. Native analytic platforms for Hadoop embrace the schema-less structure and schema-on-read capabilities built into Hadoop. This provides an analyst workbench that can answer a much greater array of questions, discover new hidden patterns in the data and offer highly granular yet actionable insights.

5. How Can You Gain More Value From Your Data Lake?

Native analytic platforms for Hadoop will help you get the most value out of your data lake. But how do you learn the best practices for creating a data lake?