
Datameer Blog

What’s With Data Lakes? Five Questions, Answered.

By John Morrell on February 15, 2017
What is a Data Lake?

Big data can be an intimidating field and, as with most industries, it comes with some jargon. Today we’re going to dig into the term “data lake”: what it means as a general industry term and how it’s used in everyday practice.

What is a data lake? A data lake is essentially a huge collection of all the data your company collects about its customers, operations, transactions and more. Think of all of your company’s data sources as individual pools of data. Without a data lake, you need to hop from pool to pool to find the fish you seek. With a data lake, information flows in from all of these pools, so you can find your fish in one place and spot patterns and trends across all the fish.

1. How Is Data in Data Lakes Organized and Managed?

Data lakes use a flat, schema-less organizational structure. Data is left in its natural form, leaving you with a collection of data in many different formats. A common best practice is to add unique identifiers and metadata tags so you can hunt down the data you need and use it. The open-ended nature of a data lake allows analysts to actively explore data in the lake without any predefined requirements, giving them the ability to discover answers in new ways.
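To make the idea of unique identifiers and tags concrete, here is a minimal sketch of a tagged, schema-less store. Everything in it is hypothetical and for illustration only: the `LakeEntry` and `DataLakeCatalog` names, the tag keys (`source`, `format`) and the sample payloads are assumptions, not part of Hadoop or any real product.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeEntry:
    """One raw object in the lake plus the tags used to find it later."""
    payload: bytes   # data kept in its original form
    tags: dict       # free-form tags (hypothetical keys)
    entry_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class DataLakeCatalog:
    """Hypothetical in-memory catalog: a flat store with no predefined schema."""

    def __init__(self):
        self._entries = {}

    def put(self, payload, **tags):
        """Store the payload untouched and return its unique identifier."""
        entry = LakeEntry(payload=payload, tags=tags)
        self._entries[entry.entry_id] = entry
        return entry.entry_id

    def find(self, **wanted):
        """Return entries whose tags match every requested key/value pair."""
        return [
            e for e in self._entries.values()
            if all(e.tags.get(k) == v for k, v in wanted.items())
        ]


catalog = DataLakeCatalog()
catalog.put(b'{"customer": 42, "total": 19.99}', source="web_orders", format="json")
catalog.put(b"42,2017-02-15,login", source="clickstream", format="csv")
catalog.put(b'{"customer": 7, "total": 5.00}', source="store_orders", format="json")

json_entries = catalog.find(format="json")
print(len(json_entries))
```

Nothing about the payloads is validated or reshaped on the way in. The tags are the only thing that makes an entry findable, which is why tagging is worth treating as a habit rather than an afterthought.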

Data lakes are often associated with Hadoop. You can create a data lake outside of Hadoop, but it would face numerous challenges. Hadoop is the predominant architecture for data lakes because:

  1. The schema-less structure and schema-on-read capabilities provide the data flexibility a data lake requires
  2. The scalable, extensible CPU and storage power allow the data lake to scale linearly as it grows
  3. The ability to use commodity hardware gives a Hadoop-based data lake a tremendous economic advantage over other approaches

Putting a data lake on Hadoop provides a central location from which all the data and associated metadata can be managed, lowering the cost of administration.
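If “schema-on-read” sounds abstract, the following plain-Python sketch may help. It does not use Hadoop at all; it only illustrates the principle that raw records are stored untouched and a structure is imposed at the moment someone reads them. The record contents, field names and the `read_as_purchase` helper are assumptions made up for this example.

```python
import csv
import io
import json

# Raw records stored as-is, in whatever format each source produced.
raw_records = [
    ("json", '{"customer_id": 42, "amount": "19.99", "channel": "web"}'),
    ("csv", "7,5.00,store"),
    ("json", '{"customer_id": 42, "amount": "3.50"}'),
]


def read_as_purchase(fmt, raw):
    """Apply a schema at read time: interpret the raw text as a purchase."""
    if fmt == "json":
        doc = json.loads(raw)
        return {"customer_id": int(doc["customer_id"]), "amount": float(doc["amount"])}
    if fmt == "csv":
        row = next(csv.reader(io.StringIO(raw)))
        return {"customer_id": int(row[0]), "amount": float(row[1])}
    raise ValueError(f"unsupported format: {fmt}")


# The schema lives in the reader, not in the storage layer.
purchases = [read_as_purchase(fmt, raw) for fmt, raw in raw_records]

# Total spend per customer, computed from the freshly interpreted records.
totals = {}
for p in purchases:
    totals[p["customer_id"]] = totals.get(p["customer_id"], 0.0) + p["amount"]

print(totals)
```

A different analyst could write a different reader over the same raw records, for example one that cares about the `channel` field, without anyone reloading the data.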

2. What’s the Difference Between a Data Lake and Data Warehouse?

While a data warehouse can also be a large collection of data, it is highly organized and structured. In a data warehouse, data doesn’t arrive in its original form, but is instead transformed and loaded into the structure predefined by the warehouse.

This highly structured approach means that a data warehouse is often tuned to solve a specific set of problems. The structure and organization make it easy to query for those problems, but practically impossible to query for others.

A data lake, on the other hand, can be applied to a large number and wide variety of problems. Believe it or not, this is because of the lack of structure and organization in a data lake. The lack of a pre-defined schema gives a data lake more versatility and flexibility.
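Here is a small sketch of that trade-off, assuming a toy set of order events. The warehouse side uses Python’s built-in `sqlite3` module as a stand-in for a real warehouse, and the event fields (`order_id`, `amount`, `coupon`) are invented for illustration.

```python
import json
import sqlite3

raw_events = [
    '{"order_id": 1, "amount": 20.0, "coupon": "SPRING"}',
    '{"order_id": 2, "amount": 35.5}',
    '{"order_id": 3, "amount": 12.0, "coupon": "SPRING"}',
]

# Warehouse style: the schema is fixed up front and data is transformed to fit.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
for line in raw_events:
    doc = json.loads(line)
    # Anything the schema did not anticipate (here, "coupon") is dropped on load.
    conn.execute("INSERT INTO orders VALUES (?, ?)", (doc["order_id"], doc["amount"]))

# The question the schema was designed for is a one-line query.
warehouse_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# The warehouse table has no column that could answer a coupon question.
warehouse_columns = [row[1] for row in conn.execute("PRAGMA table_info(orders)")]

# Lake style: the raw events were kept, so the new question can still be asked.
coupon_orders = [
    doc["order_id"]
    for doc in (json.loads(line) for line in raw_events)
    if "coupon" in doc
]

print(warehouse_total, warehouse_columns, coupon_orders)
```

The warehouse answers its planned question quickly and cleanly. The raw events are messier to work with, but they still hold the detail needed for a question nobody thought of when the table was designed.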

3. Where Can You Use a Data Lake?

Data warehouses evolved because they answered the highly structured, everyday questions that analysts asked. These typically revolved around the transactional aspects of the business and allowed an analyst to drill down on the specific dimensions defined. Think about it as being able to walk into different sections of a warehouse, down certain aisles and look on specific shelves.

Data lakes are meant to solve problems that are not as structured and require “discovering” the answer from the data. Analysts may know what question they need answered, but not what combination of data and analysis will reveal the answer. This requires iterative exploration and application of different, often more complex analytic functions to reveal the true answer.

4. Is Your Analytic Platform Designed for a Data Lake?

Just as data warehouses were organized to solve highly structured problems, the BI tools that accompanied the warehouse were designed to work with that structure. They were often designed to allow the analyst to “slice and dice” the data along the structure provided in the warehouse (dimensions and measures).
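For readers who have not used a BI tool, “slice and dice” along dimensions and measures boils down to grouping rows by a predefined attribute and aggregating a predefined number. A minimal sketch, with made-up sales rows and a hypothetical `slice_by` helper:

```python
from collections import defaultdict

# Each row has dimensions (region, product) and one measure (revenue).
sales = [
    {"region": "East", "product": "A", "revenue": 100.0},
    {"region": "East", "product": "B", "revenue": 50.0},
    {"region": "West", "product": "A", "revenue": 70.0},
    {"region": "West", "product": "B", "revenue": 30.0},
]


def slice_by(rows, dimension, measure):
    """Sum one measure for each distinct value of one dimension."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)


by_region = slice_by(sales, "region", "revenue")
by_product = slice_by(sales, "product", "revenue")

print(by_region)
print(by_product)
```

This works well as long as the question fits the dimensions and measures that were defined ahead of time, which is exactly the structure a warehouse provides.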

In the same manner, analytic platforms to solve problems on data lakes need to equally embrace the versatility and loose structure. While the underlying Hadoop technology provides the versatility a data lake needs, many existing analytic platforms are not designed to take advantage of this versatility, leaving many companies struggling to get real value out of their data lakes.

Analytic platforms built natively for Hadoop are designed to use the varying data types, structures and formats found in a data lake. Native analytic platforms for Hadoop embrace the schema-less structure and schema-on-read capabilities built into Hadoop. This provides an analyst workbench that can answer a much greater array of questions, discover new hidden patterns in the data and offer highly granular yet actionable insights.

5. How Can You Gain More Value From Your Data Lake?

Native analytic platforms for Hadoop will help you get the most value out of your data lake. But how do you choose an analytic platform that will provide the versatility and flexibility required? Download your free ebook, the Big Data Analytics Buyer’s Guide, if you’d like to learn more about why big data analytics matters, what to look for in a big data analytics platform and how your users can be truly effective.



John Morrell

John Morrell is Sr. Director of Product Marketing at Datameer.
