Datameer Blog post
BI vs. Big Data and Data Warehouses vs. Data Lakes
by Erin Hitchcock on Mar 05, 2018
Hear from Andrew Brust, blogger at ZDNet and Datameer’s Advisor for Marketing and Innovation, or skim through the transcripts to learn more about the distinctions between BI and big data, and the real differences between a data warehouse and a data lake.
What are the legitimate distinctions between BI and Big Data?
Transcript: With BI and big data, often the destination, the goal, is the same, which is to have summarized results of much more detailed data so that we can get to real insights and find the real trends, basically extract the information from the data itself. The big difference is that, in the case of BI, the source data that we’re using is at the granular level of a transaction. So in an e-commerce situation, you could take the word “transaction” very literally. We’re talking about, let’s say, a purchase. In the case of some other type of business, maybe it’s a medical scenario and we’re looking at specific events in that domain, which could be the contraction of a disease, something along those lines. But in the case of big data, we’re looking at all the twists and turns that led up to that transaction, that led up to that event. In the case of e-commerce, we’d be looking at all the clickstream data. We’d be looking at all the decisions and all the products that the buyer looked at before making a purchasing decision. And that is a very different kind of source data, so we get a lot more nuance.
What also happens is that the structure of the data can be a lot looser in the big data case. We don’t have to be taking data that’s neatly in rows and columns. It’s almost the difference between having source data written on index cards, where each line is clearly labeled the same way from card to card, in the BI case, versus taking data from notes, conversations, and lots of things that are much less structured in the case of big data.
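To make the contrast concrete, here is a minimal Python sketch, not taken from the interview. The sample records, field names, and the two helper functions are all hypothetical. It summarizes uniform, transaction-level rows (the BI case) and then summarizes loosely structured clickstream events that led up to those purchases (the big data case).

```python
import json
from collections import Counter, defaultdict

# Hypothetical BI-style source: one uniform row per completed purchase.
transactions = [
    {"order_id": 1, "customer": "a", "product": "lamp", "amount": 40.0},
    {"order_id": 2, "customer": "b", "product": "desk", "amount": 150.0},
    {"order_id": 3, "customer": "a", "product": "desk", "amount": 150.0},
]

# Hypothetical big-data-style source: raw clickstream events as JSON lines.
# The fields vary from event to event (referrer, query, product).
clickstream_lines = [
    '{"customer": "a", "event": "view", "product": "lamp"}',
    '{"customer": "a", "event": "view", "product": "chair", "referrer": "ad"}',
    '{"customer": "a", "event": "purchase", "product": "lamp"}',
    '{"customer": "b", "event": "view", "product": "desk"}',
    '{"customer": "b", "event": "search", "query": "standing desk"}',
    '{"customer": "b", "event": "purchase", "product": "desk"}',
]


def revenue_by_product(rows):
    """Summarize transaction rows: total revenue per product."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["product"]] += row["amount"]
    return dict(totals)


def views_before_purchase(lines):
    """Count the product views each customer made before a first purchase."""
    views = Counter()
    purchased = set()
    for line in lines:
        event = json.loads(line)
        customer = event["customer"]
        if customer in purchased:
            continue  # only look at the path leading up to the first purchase
        if event["event"] == "purchase":
            purchased.add(customer)
        elif event["event"] == "view":
            views[customer] += 1
    return dict(views)


print(revenue_by_product(transactions))
print(views_before_purchase(clickstream_lines))
```

Both functions end in a summary, which mirrors the shared goal described above; what differs is the grain and regularity of the input.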
What’s the real difference between a data warehouse and data lake? And what’s a data swamp?
Transcript: Similar to the way BI and big data differ in their source data, both the granularity and the structure of it, a data lake and a data warehouse share some of those same differences. In the case of a data warehouse, you’re using a relational database. Things are very structured. You have tables. You have rows. You have columns. You have schema. That schema is legislated well in advance, and it’s very easily tracked because of that. That leads to less flexibility when it comes time to do the analysis. In the case of a data lake, we have the ability to include more data. We have the ability to bring it in without having to legislate all the structure of it in advance. That gives us a lot more latitude to do different kinds of analyses.
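The difference between legislating structure in advance and applying it later can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular product: an in-memory SQLite table stands in for the warehouse, a plain list of raw JSON strings stands in for the lake, and the table name, fields, and helper functions are hypothetical.

```python
import json
import sqlite3

# Warehouse stand-in: the schema is declared before any data is loaded.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases ("
    "customer TEXT NOT NULL, product TEXT NOT NULL, amount REAL NOT NULL)"
)


def load_into_warehouse(record):
    """Insert a record; return False if it does not fit the declared schema."""
    try:
        conn.execute(
            "INSERT INTO purchases VALUES (?, ?, ?)",
            (record.get("customer"), record.get("product"), record.get("amount")),
        )
        return True
    except sqlite3.IntegrityError:
        # A missing field becomes NULL, which the NOT NULL columns reject.
        return False


def read_lake(lake, fields):
    """Lake stand-in: parse raw records and pick fields only at read time."""
    for raw in lake:
        record = json.loads(raw)
        yield {name: record.get(name) for name in fields}


raw_records = [
    '{"customer": "a", "product": "lamp", "amount": 40.0}',
    '{"customer": "b", "event": "search", "query": "standing desk"}',
]

# The warehouse accepts only the record that matches its schema.
accepted = [load_into_warehouse(json.loads(raw)) for raw in raw_records]

# The lake keeps everything as-is; structure is chosen per analysis.
lake = list(raw_records)
queries = [r["query"] for r in read_lake(lake, ["customer", "query"]) if r["query"]]

print(accepted)
print(queries)
```

The search event never reaches the warehouse table because no column was planned for it, while the lake still holds it for an analysis nobody anticipated when the data arrived.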
On the other hand, because that scenario is more permissive, it can lead to an abuse whereby the data lake really ends up being kind of a parking lot where we throw different data sets without much regard at all for how they’re structured. That leads to what people call a data swamp, where really it’s not that we have a more nuanced data repository; we just have a collection point for all kinds of miscellaneous data that’s not especially well governed and is not especially usable because of that. What’s really important is to have that data governance in place so that we can have the extra flexibility that a data lake gives us over a data warehouse without it becoming something that is really under the umbrella of anarchy and, therefore, isn’t very usable in an enterprise scenario.
For more background, see our article “5 Questions about Data Lakes Answered.”
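Governance covers far more than any short example can show, but one small piece of it can be sketched: refusing to land a data set in the lake unless it arrives with basic descriptive metadata. The sketch below is hypothetical throughout. The `LakeCatalog` class, the required metadata keys, and the paths are illustrative assumptions, not part of any real catalog tool.

```python
from dataclasses import dataclass, field

# Hypothetical minimum metadata every data set must carry.
REQUIRED_METADATA = ("owner", "description", "source")


@dataclass
class LakeCatalog:
    """Toy catalog: records which data sets are in the lake and who owns them."""

    entries: dict = field(default_factory=dict)

    def register(self, path, metadata):
        """Record a data set, rejecting it if required metadata is missing."""
        missing = [key for key in REQUIRED_METADATA if not metadata.get(key)]
        if missing:
            raise ValueError(f"{path}: missing metadata {missing}")
        self.entries[path] = dict(metadata)


catalog = LakeCatalog()

# A documented data set is accepted; its structure is still not fixed.
catalog.register(
    "raw/clickstream/2018-03-05.json",
    {"owner": "web-team", "description": "site click events", "source": "web logs"},
)

# An undocumented dump is turned away instead of silently piling up.
try:
    catalog.register("tmp/export_final_v2.csv", {"description": "misc"})
    rejected = False
except ValueError:
    rejected = True

print(sorted(catalog.entries), rejected)
```

The check says nothing about how the data is structured, so the lake keeps its flexibility; it only ensures that someone can later find out what each data set is and who is responsible for it.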
Are Big Data and Business Intelligence vastly different or largely the same?
Transcript: Business intelligence and big data need to be coordinated, need to be used together. They’re not the same thing, but they have a lot of the same goals. A lot of the distinctions between the two tend to be arbitrary. That creates an unfortunate situation where you have two different communities of practitioners, two different communities of vendors, and there really ought to be a lot more unity. We have to respect the differences between the two in order to integrate them. But integrating them is absolutely important.
Erin Hitchcock is the Public Relations and Analyst Relations Manager at Datameer. In this role, she works diligently alongside thought leaders to spread the word about big data and data engineering technologies.