Best Practices for Data Lakes

Datameer, Inc.
February 27, 2018

Why Data Lakes Require Discipline to Be Effective

Here’s the thing about data lakes: They’re actually a reaction to an older construct called a data warehouse or a data mart. Data warehouses and data marts are very formal: data has to conform to agreed-upon schemas and clear a very high barrier before it is included. Because of that, they can be impediments to getting analysis done.

Data lakes are a reaction to that. Now we can have a place for data to go where the barrier to entry is intentionally lower. The universe of data in a data lake can be more inclusive and more comprehensive, and that allows a lot more analysis to get done.

However, just because the barrier to entry is lower doesn’t mean it should be all the way down to the ground. We still need a sense of discipline and a sense of structure. What data lakes really need to be is a place with less formality that is still structured and navigable. That can be a difficult balance. I think it’s one the industry is still figuring out, to be honest.

What Are Some Best Practices for Structuring Data Lakes?

Your number one goal in architecting and structuring a data lake is this: someone from your organization who shares your corporate culture (but doesn’t necessarily have expertise in data per se) should still find the structure of the data lake, and the names and contents of the data sets stored within it, to be fairly self-explanatory.

That way, a self-service kind of approach can work. What you want to avoid is something that’s such a cluster of undocumented stuff that nobody really feels competent to go near it.

You want it to be a place where people can get a certain amount of instant gratification when they’re looking for something. For that, the structure needs to be fairly obvious so they can find it, the names need to be fairly obvious, and ideally the schema of each data set is well-documented so they can get what they’re looking for.
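To make "well-documented" a bit more concrete, here is a minimal Python sketch of a catalog entry and a check for documentation gaps. The `DatasetEntry` class, its fields, and the `documentation_gaps` function are hypothetical names invented for illustration; they are not part of any particular catalog product, and a real catalog would track much more.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical catalog record for one data set in the lake."""
    name: str
    description: str = ""
    # Maps column name -> human-readable description of that column.
    columns: dict[str, str] = field(default_factory=dict)


def documentation_gaps(entry: DatasetEntry) -> list[str]:
    """Return a list of reasons this entry is not yet self-explanatory."""
    gaps = []
    if not entry.description.strip():
        gaps.append("missing data set description")
    if not entry.columns:
        gaps.append("no columns documented")
    for column, doc in entry.columns.items():
        if not doc.strip():
            gaps.append(f"column '{column}' has no description")
    return gaps
```

An entry with an empty gap list is one a newcomer has a fair chance of understanding without asking around; anything else is a to-do list for whoever owns the data set.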

The more that happens, the more they’ll come back and look for the next thing. The more adoption grows, the greater the return on investment from that data lake.

How to Prevent Your Data Lake Architecture From Becoming a Data Swamp

Let’s think about this balance between making things structured and making them open and inclusive. What it really comes down to is how well the data sets are organized and taxonomized, and how well we really know what’s inside of them.

I hate to oversimplify, but just think about the principles involved in managing files on your own computer. If you have an organized system of folders, a good hierarchy of folders, you’re using good readable names for your folder structure and for your files, and you’re being consistent about it, that’s going to enable you to get through your files. You’ll be able to find the file you want much more easily than if you’re just throwing everything willy-nilly into one big folder and naming things in an inconsistent manner and not always so descriptively.

Again, at the risk of oversimplifying, that’s the kind of thing we have to be wary of in a data lake. You want to catalog your data. You want to curate your data. You want to have a good sense of what’s where. Another cliché that might be instructive here is a place for everything and everything in its place.
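The folder analogy can be sketched as a simple naming check. This is only an illustration under assumptions of my own: the `<zone>/<domain>/<dataset>/<file>` layout, the zone names `raw` and `curated`, and the lowercase snake_case rule are a hypothetical convention, not a standard. The point is that whatever convention you pick can be written down and checked.

```python
import re

# Hypothetical convention: <zone>/<domain>/<dataset>/<file>
ZONES = {"raw", "curated"}
# Each name must be lowercase snake_case and start with a letter.
SEGMENT = re.compile(r"^[a-z][a-z0-9_]*$")


def follows_convention(path: str) -> bool:
    """Check one path against the assumed four-level naming convention."""
    parts = path.strip("/").split("/")
    if len(parts) != 4:
        return False
    zone, domain, dataset, filename = parts
    stem = filename.rsplit(".", 1)[0]  # drop the file extension, if any
    names_ok = all(SEGMENT.match(p) for p in (domain, dataset, stem))
    return zone in ZONES and names_ok


def misplaced(paths: list[str]) -> list[str]:
    """Return the paths that are not 'in their place'."""
    return [p for p in paths if not follows_convention(p)]
```

For example, `raw/sales/orders/orders_2018_02_27.csv` passes, while something like `Misc/New Folder/final FINAL v2.xlsx` gets flagged. Running a check like this regularly is one cheap way to notice a swamp forming.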

The limitation with data lakes is that it’s up to the customer and the user to impose that discipline on themselves. We’re seeing the emergence of tools that can help make shorter work of that discipline, but the motivation for it really needs to come from the customer. That’s a tricky thing.