Datameer Blog post
Visual Exploration For Data Preparation
by Adam Wealand on Jun 01, 2019
Data visualization is the practice of taking data values and converting them into a graphic. Most people see, or prefer to see, data represented as a graphic. It is the common language that captures how data values are constructed into a story. Typically, the simplest and most aesthetically pleasing data visualizations require the most time and effort to create.
Most of the data visualization we see is the end result of a long analysis process, of which exploration is a key part. We see these results as an infographic in the media or as a chart in a business presentation. However, there is substantial value in performing data visualization as part of the data preparation process. Visual data exploration is a fantastic, yet underutilized, way of finding patterns in data at the exploratory stage.
Datameer makes it easy to visually explore extremely large datasets. Loaded with both Infographics and Visual Explorer functionality, Datameer is the world’s first solution for interactive visual data exploration that bridges the last mile between analysts and the data lake. As an analyst, you can create a multitude of charts and tables to explore your datasets. I will now explain some of the easiest ways to get started using these charts.
Types and Aesthetics Of “Big Data”
You may think of data as numbers, but continuous and discrete numerical values are only two of the several types of data we encounter, particularly when working with “big data.” Data can also come in the form of discrete categories, dates or times, and text. When data is numerical, we call it quantitative; when it is categorical, we call it qualitative. Some data visualization aesthetics can represent both continuous and discrete data (size, line width, color), while others can usually only represent discrete data (shape, line type).
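To make these types concrete, here is a minimal Python sketch that guesses the type of a column from its values. The function name classify_column, the max_categories threshold, and the sample columns are all illustrative assumptions; the rule that separates categories from free text is a rough heuristic, not how any particular tool (Datameer included) decides.

```python
from datetime import date, datetime

def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def classify_column(values, max_categories=10):
    """Return a rough label for the kind of data held in a list of values."""
    if not values:
        return "empty"
    if all(is_number(v) for v in values):
        # Whole numbers only -> discrete; any float -> continuous
        if all(isinstance(v, int) for v in values):
            return "discrete numerical"
        return "continuous numerical"
    if all(isinstance(v, (date, datetime)) for v in values):
        return "date/time"
    if all(isinstance(v, str) for v in values):
        # Heuristic (assumption): few distinct strings look like categories
        if len(set(values)) <= max_categories:
            return "categorical"
        return "text"
    return "mixed"

# Hypothetical sample columns, for illustration only
columns = {
    "units_sold": [3, 12, 7],
    "price": [9.99, 24.5, 3.0],
    "order_date": [date(2019, 6, 1), date(2019, 6, 2)],
    "state": ["CA", "NY", "CA"],
}
for name, values in columns.items():
    print(name, "->", classify_column(values))
```

The label matters because it narrows the sensible chart choices: quantitative columns suit histograms and trend lines, while qualitative columns suit bar charts and proportions.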
Common Visualizations For Data Exploration
You may be overwhelmed by the different ways data can be visualized. Don’t be! The key is to start simple. There are a few charts that are commonly used to visualize different types of data at the exploratory stage. The four most common aspects of data to explore visually are amounts, distributions, proportions, and trends.
Most often, your dataset exploration will include numerical values shown for categories and you will want to know the magnitude of these. Think of how you might want to visualize the total sales volume of different products in your product line. We can consider this data as “amounts” of something. The most common approach to visualizing amounts, particularly when exploring data, is using simple bar charts. These can be vertical or horizontal bars – vertical is most common but horizontal is typically used when it is more aesthetically pleasing. For example, you would likely want to use a horizontal bar chart for very long category names.
The order in which the bars are arranged matters more than their orientation. Bar charts should always be arranged meaningfully according to context. This makes bar charts less confusing and more intuitive compared to charts where bars are arbitrarily arranged. At the very least, order bar charts by either ascending or descending values. Datameer’s Infographics will do this automatically after you drag and drop to set up the chart, and users may also define their own parameters.
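As a rough illustration of ordering bars by value, the following Python sketch builds a horizontal text bar chart sorted in descending order. The function sorted_bars and the sales figures are hypothetical; the sketch assumes a non-empty dictionary of positive amounts and is not a description of how Datameer orders its charts.

```python
def sorted_bars(amounts, descending=True, width=30):
    """Return text bar-chart lines ordered by value.

    Assumes `amounts` is a non-empty dict of category -> positive number.
    """
    ordered = sorted(amounts.items(), key=lambda item: item[1], reverse=descending)
    largest = max(amounts.values())
    label_width = max(len(name) for name in amounts)
    lines = []
    for name, value in ordered:
        # Scale each bar relative to the largest value
        bar = "#" * round(width * value / largest)
        lines.append(f"{name.ljust(label_width)} | {bar} {value}")
    return lines

# Hypothetical sales volume per product
sales = {"Mug": 45, "Gift card": 120, "T-shirt": 80}
for line in sorted_bars(sales):
    print(line)
```

Horizontal bars are used here because the category labels sit comfortably to the left of each bar, which is the same reason they suit long category names.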
Another common, perhaps the most common, situation while exploring data is a desire to understand how a particular variable is distributed in a dataset. This is done by grouping the values of that variable into meaningful “bins,” which results in a chart that looks quite similar to a bar chart. It differs from a bar chart in that the height of each bar corresponds to the count of values in a bin and the width of each bar corresponds to the width of that bin; the bin widths can also vary. The most common visualization of distributions is a histogram. Datameer automatically displays a histogram for every column of data in your dataset to provide an understanding of the distribution of variables in that column. Bins in this example are retail products; you can quickly see that gift cards are a popular product in this sample data:
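For readers who want to see the binning step itself, here is a minimal Python sketch that counts values per bin. The function name histogram, the bin edges, and the order values are assumptions made for illustration; each bin includes its lower edge and excludes its upper edge, except the last bin, which includes both.

```python
def histogram(values, bin_edges):
    """Count how many values fall into each bin defined by consecutive edges.

    Bins may have different widths. Values outside the edges are ignored.
    """
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for value in values:
        for i in range(len(counts)):
            low, high = bin_edges[i], bin_edges[i + 1]
            # The final bin is closed on the right so the maximum edge is counted
            if low <= value < high or (i == last and value == high):
                counts[i] += 1
                break
    return counts

# Hypothetical order values and unequal-width bins
order_values = [3, 7, 12, 18, 25, 25, 40, 50]
edges = [0, 10, 20, 50]
for i, count in enumerate(histogram(order_values, edges)):
    print(f"{edges[i]}-{edges[i + 1]}: {'#' * count} ({count})")
```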
Another situation you may encounter is a desire to understand the composition of your dataset. We do this by visually exploring the proportions of categorical data. This is commonly expressed as a pie chart. This graphic, which is divided into slices to illustrate numerical proportion, works best with about six categories or fewer. Stacked bar charts are similar to pie charts, but the proportions are stacked on a bar rather than arranged in a circle. Stacked bars are useful for more categories, as well as for multiple sets or time series of proportions. Stacked bar charts let you see the proportions more clearly and are more useful for visual exploration than pie charts.
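Whichever chart you choose, the underlying calculation is the same: count the records in each category and divide by the total. A minimal Python sketch follows, in which the function name proportions and the sample state values are hypothetical.

```python
from collections import Counter

def proportions(categories):
    """Return each category's share of the total, largest share first."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.most_common()}

# Hypothetical records, one state per record
states = ["CA", "NY", "CA", "TX"]
for state, share in proportions(states).items():
    print(f"{state}: {share:.0%}")
```

These shares are what a pie chart draws as slices and what a stacked bar draws as segments of a single bar.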
The following is an example of a stacked bar chart in Datameer’s Visual Explorer. We can easily see the proportion of records by US State:
In our final scenario, you want to understand the order of your dataset. Your data must have an ordered structure, such as a time series, to do this, so this scenario is less common than the ones discussed previously. That said, the ability to visualize how variables change over time is incredibly powerful. These trends are conveyed in line graphs or scatterplots. Line graphs are essentially scatterplots with lines connecting the dots, which are the observations in the time series. Academics favor scatterplots over line graphs because the connecting lines only guide the eye along the trend and do not represent actual data points. You can use either to explore your data, depending on your preferences.
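Before plotting a trend, the observations need to be put in time order. The short Python sketch below sorts hypothetical (date, value) pairs and computes the change between consecutive observations, which is what the connecting line in a line graph implies. The function names time_ordered and changes, and the sample figures, are assumptions for illustration.

```python
from datetime import date

def time_ordered(observations):
    """Sort (date, value) pairs from earliest to latest."""
    return sorted(observations, key=lambda obs: obs[0])

def changes(observations):
    """Return (date, change since the previous observation) pairs."""
    ordered = time_ordered(observations)
    return [
        (later[0], later[1] - earlier[1])
        for earlier, later in zip(ordered, ordered[1:])
    ]

# Hypothetical monthly sales, deliberately out of order
monthly_sales = [
    (date(2019, 3, 1), 130),
    (date(2019, 1, 1), 100),
    (date(2019, 2, 1), 120),
]
for when, delta in changes(monthly_sales):
    print(when.isoformat(), f"{delta:+d}")
```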
There are many more visualizations than just those discussed here. The key takeaway is to start simple with visual data exploration. Build complexity on top of a foundational understanding of the data you have.
The Next Step
Rather than bring the data lake to your analysts, bring your analysts to the data lake. For more details, check out this on-demand TDWI Webinar to learn how data exploration can speed time to insight. In it, we discuss:
- The role of visual data exploration in analytics cycles and where it can help streamline processes
- Key attributes of visual data exploration tools and their integration into data preparation
- How business analysts and users can apply visual data exploration to find interesting new patterns and create refined data sets unique to their analytics problems