Data Exploration vs. Data Discovery
- John Morrell
- March 5, 2018
While it served to help answer business questions faster, data discovery at the desktop and departmental level had one limiting requirement – you needed to know what data to use to find the answer you were looking for. This limitation often left analysts waiting for the availability of datasets, and could possibly lead them down the wrong path if the data they used did not reveal the optimal answer or even any answer at all.
Answering Questions on Big Data
As the big data market emerged, the term “data discovery” was appropriately applied to the process of trying to discover answers to questions buried in big data. And, since big data is really big (hence the name!), in order to discover answers, you need to know where to look.
Big data is also used to answer questions in areas you have never tapped before. If you use the chart to the left as a guide, you can see that “discovery” is represented in the lower left (Traditional EDW). This illustrates a scenario where you know what specific questions you’re asking and the area to look for the answers. Big data is located in the upper right, where you are trying to “explore” new areas (things you don’t know) and questions you haven’t even asked yet.
Exploration vs. Discovery: An Old World Example
In the 15th through 18th centuries, there were many voyages of exploration and discovery. Some you could characterize as exploration and some as discovery. Let’s look at two examples that show the difference.
Christopher Columbus was on a voyage of discovery. He knew exactly what question he wanted to answer – I want to get to the East Indies – and knew what direction or area to look – sailing directly west. Now, he did find a different answer, discovering the Americas, but his mission was one of discovery.
Captain James Cook set out on a different mission – to explore the Pacific. He was trying to explore new areas to find answers to a broad suite of questions. As he explored, he would identify specific areas that showed promise. Then, he would transition into discovery mode to answer specific questions relevant to that area.
Big Data Complications
In analysis, your goal is always to eventually come up with answers. But, in the big data world, the analysis process is complicated by five key factors:
- Familiarity – with big data, analysts may be stretching their boundaries both in terms of the data they are using and the domain of analysis
- Data volume – there may be significant amounts of data available to the analyst, making it difficult to know where to start
- Where to look – the analyst is not necessarily sure what data or combination of datasets will reveal the best answer
- More than past performance – big data use cases require uncovering patterns that can be applied to the future, not simply seeing trends of the past
- Results-oriented – the analysis is trying to find significant events or interactions which lead to certain types of outcomes
Breaking the Analysis Process in Two
With these complications in mind, a best-practice approach is to break the analysis process into two distinct steps:
- Exploration – after data has been prepared, you “explore” the data to see what parts of it will help reveal the answers you seek. You can also explore various hypotheses. One could also think of it as a data refinement or narrowing process.
- Discovery – once you know what data helps you find the answer, you dig deep into the data to identify the specific items that reveal the answers and find ways to show those answers to the business teams.
In the big data world, exploration is incredibly important for two reasons:
- While you might have a broader goal in mind, you don’t necessarily have a highly specific question you are asking – at least not yet. For example, you might be looking for reasons why customers churn (broad goal) but don’t know what specific areas will tell you why they churn (specific questions).
- The scope of the dataset(s) is large (many rows), wide (many attributes) and deep (many distinct values). Simply trying to discover answers is like looking for a needle in a haystack. Exploration lets you find the parts of the data that are relevant so you are looking for that needle in a handful of hay.
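To make "large, wide and deep" concrete, here is a minimal Python sketch that measures those three properties for a small set of records. The `profile` function and the `customers` sample data are hypothetical names invented for illustration, and a real big data platform would compute these figures at a very different scale.

```python
from collections import defaultdict


def profile(rows):
    """Summarize a list of dict records by length, width and depth.

    length: number of rows
    width:  number of distinct attributes seen across all rows
    depth:  number of distinct values per attribute
    """
    distinct = defaultdict(set)
    for row in rows:
        for attr, value in row.items():
            distinct[attr].add(value)
    return {
        "length": len(rows),
        "width": len(distinct),
        "depth": {attr: len(values) for attr, values in distinct.items()},
    }


# Hypothetical customer records, invented for this example.
customers = [
    {"plan": "basic", "region": "east", "churned": True},
    {"plan": "basic", "region": "west", "churned": True},
    {"plan": "pro", "region": "east", "churned": False},
    {"plan": "pro", "region": "west", "churned": False},
]

summary = profile(customers)
print(summary)
```

A profile like this is only a starting point: it tells you how much hay there is, not where the needle is.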
Once you have explored and refined the data, you can begin the data discovery phase. This is where you seek the patterns that answer highly specific questions. You look at specific trends, sequences of events, time-series analysis, clusters and more. Once you’ve “answered” the question, then you can visualize it and show it to the business.
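The two steps can be sketched with the churn example. The Python below is an illustration under stated assumptions, not a prescribed method: `rank_attributes` stands in for exploration by ranking attributes on how much the churn rate varies across their values (a deliberately simple heuristic chosen for this sketch), and `churn_rate_by` stands in for discovery by answering a specific question about one attribute. All names and the sample data are hypothetical.

```python
from collections import defaultdict


def churn_rate_by(rows, attr):
    """Discovery: churn rate for each value of a single attribute."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for row in rows:
        totals[row[attr]] += 1
        churned[row[attr]] += int(row["churned"])
    return {value: churned[value] / totals[value] for value in totals}


def rank_attributes(rows, attrs):
    """Exploration: order attributes by the spread (max minus min)
    of churn rate across their values, widest spread first."""
    spread = {}
    for attr in attrs:
        rates = churn_rate_by(rows, attr).values()
        spread[attr] = max(rates) - min(rates)
    return sorted(spread, key=spread.get, reverse=True)


# Hypothetical customer records, invented for this example.
customers = [
    {"plan": "basic", "region": "east", "churned": True},
    {"plan": "basic", "region": "west", "churned": True},
    {"plan": "pro", "region": "east", "churned": False},
    {"plan": "pro", "region": "west", "churned": False},
]

# Exploration narrows the haystack to the most promising attribute.
ranking = rank_attributes(customers, ["plan", "region"])

# Discovery then answers a specific question about that attribute.
answer = churn_rate_by(customers, ranking[0])
print(ranking, answer)
```

In this toy data, churn is identical across regions but differs sharply by plan, so exploration points to `plan` and discovery drills into it.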
Key Requirements for Big Data Exploration
Now that you understand why big data exploration matters and how it differs from discovery, what critical capabilities do you need to look for?
- Look at all the data – Exploration is about looking for something new and unknown. This requires the analysts to have no limitations on the size of their datasets.
- Look across, not just down – Most discovery tools focus on letting you drill into data. Exploration requires the ability to move sideways through your data as well.
- Explore anything – Most tools give you pre-fixed exploration paths on certain dimensions and metrics. Successful exploration requires the ability to explore any set of rows, attributes, metrics and values – and the relationship between these.
- Fail fast – When exploring you may have to examine many different paths until you find the right set of data. You need the ability to explore paths quickly, fail fast, and move on until you find exactly what you need.
- Interactive – If you are going to fail fast, you need to move at the speed of thought. To enable this, your exploration platform needs to rip through billions of records with sub-second response time.
- Tight link with preparation and blending – As you explore the data you may find dirty sections or the need to enrich the data with mathematical functions. Exploration needs to be tightly integrated with data prep capabilities.
- Finish the curation process – Once you’ve explored the data and found what you need, what’s the next step? Producing a usable dataset with the single click of a button that seamlessly transitions you to the discovery step.
Data exploration is a critical part of the analysis cycle for big data due to the tremendous length, width and depth of the datasets, and the need to understand unknown data, domains and questions. Without direct exploration of big data inside the analytic process, analysts could use the wrong data and lead themselves to poor or suboptimal conclusions.
Make data exploration a central part of a cooperative data curation process that brings together your two domain experts – data analysts and business analysts. Let them work together to tame and shape your big data to find not just answers to new questions, but also the optimal answers.