I recently gave this talk at BigData TechCon, and thought I’d share it here. Imagine a situation where you want to import data into your Hadoop environment, a process in which you see each record once, involving a Map/Reduce run. Since you see each record anyways, wouldn’t it be nice to take advantage of that situation, and collect some statistics that help you better understand the properties of the data you import?

Obvious candidates are statistics like count, sum, average, and cardinality for each attribute (column). There are streaming solutions available and adaptation for the Map/Reduce paradigm is mostly straight forward. But what about the relationship *between* two attributes?

In this blog post I will describe how the strength of the relationship between two columns of the data can be computed using Map/Reduce on a data stream.

**Measure**

A measure that naturally comes to mind when thinking about the relationship between two attributes is *Pearson Correlation*. Though this is a popular and commonly used measure, it can only be applied on two attributes when both are of numerical type.

A measure that allows to look at pairs of attributes of any type is *Mutual Information*. With that measure, the attributes can be numeric and categoric. I will briefly explain this measure in the following paragraphs. You can find a more in-depth explanation in this Wikipedia article.

Mutual Information is based on another measure – *Entropy*. The idea of Entropy can be described as measuring to which degree a bag of items (the values of a column) is distributed uniformly. It is computed by the formula shown below and the illustration shows the Entropy for three different bags of items.

Formula 1: Entropy

Image 1: Entropy Examples

A column of our input data can be interpreted as a random variable, represented by X in the above formula. As can be seen the Entropy grows when the items are more mixed up. In an extreme case when all items are the same, the Entropy is equal to zero.

Extending the concept of Entropy from one random variable to two leads to the concept of *Conditional Entropy*. A way to describe that concept is: given the value of one attribute, how much uncertainty remains about the value of the other attribute? The Conditional Entropy can be computed by the formula shown below.

Formula 2: Conditional Entropy

It is the weighted average of the Entropies of the conditional distribution. Formula 2 shows how to compute the Entropy of random variable Y given the values of random variable X.

This measure makes a statement about the relationship between two attributes (X and Y) but it is actually not symmetric, meaning H(X|Y) is not necessarily equal to H(Y|X). To derive a single (symmetric) measure for the relationship between two attributes, we alter the concept to: how much is the uncertainty of X reduced due to the knowledge of Y? This can be computed by the formula given below, which is the Mutual Information.

Formula 3: Mutual Information

As can be seen the only ingredients needed are the joint and the marginal distributions. This means, all we need to compute the Mutual Information is the two dimensional histogram representing the data of the two attributes.

Note that it is also practical to normalize the Mutual Information, e.g. by dividing it by the square rooted product of both attributes’ Entropy. This scales the Mutual Information between zero and one and also penalizes attributes with large cardinality.

**Two-Dimensional Histograms**

There are three possible cases we need to consider when creating a two-dimensional histogram from a data stream:

1. Both attributes are numeric

2. Both attributes are categoric

3. One attribute is numeric and the other one is categoric

In case (2) an obvious data structure is a nested map where the keys are the attribute values and the values are the counts. In case (3) we create a data structure where each bin of the numeric dimension contains a map with the values and the counts from the categoric dimension. Creating and updating the numeric dimension is a simplified version of case (1), which I will describe in the following paragraphs.

In order to create the two dimensional histogram in case (1), seeing each record only once, one needs to answer two questions:

- How to choose the bin break values?
- How to merge histograms in a Map/Reduce environment?

It turns out that choosing quadtrees as a data structure (initialized with a unit-square) is a great way to address these challenges. A quadtree can store the bin break values in the nodes and the count values in the leaf nodes. See the image below for an illustration of the quadtree data structure.

Image 2: Quadtree

As a new two dimensional record arrives in the stream, the quadtree updates as follows:

- The tree is traversed to determine the bin where to increase the count for the new record.
- If the record is outside the bounds of the outer bin breaks, new bins (nodes) are introduced by doubling the outer break widths until the record lies within the tree.
- If the count in a bin exceeds a threshold, that bin is split into four, by halving the bins for each dimension. The threshold is exceeded when the count of a bin is greater than a certain percentage of the total count that has arrived with the stream.

Merging two quadtrees is then straight forward, as the unit square allows to easily align two quadtrees. The merge is then simply done by adding up the proportional counts in the bins. The image below shows an example on how two quadtrees can be merged.

Image 3: Merge of two Quadtrees

The two input quadtrees (left) are aligned on their unit-square (orange) and then merged into the resulting quadtree (right).

There are two options for deriving the final histograms from the merged quadtrees: creating an equal width or an equal frequency histogram. In the first case the smalles bin width is used to create an equal width histogram from the quadtree. See the image below (left) for illustration.

Image 4: Converting the quadtree to an equal width (left) and equal frequency (right) histogram

Conceptually the tree gets overlayed with a grid of equal width bins that can then easily be filled with the proportional count values from the original tree.

In case of an equal frequency histogram, in a first step the marginal distributions based on equal bin width, on both dimensions, are determined, using the smallest bin width from the original quadtree. This results in two equal width, one dimensional histograms, one for each dimension. The bins of these histograms are then moved such that the equal width histograms become equal frequency histograms, having equal count values in each bin. This results in a new grid of equal frequency bin breaks that can be overlayed on the original tree (see Image 4 (right)). The new bins are then filled with the proportional counts from the original tree. Note that this is an equal frequency histogram with regards to the *marginal* distributions, this does not mean that the single values in the bins are of equal frequency.

Now that we can create two versions of a two dimensional histogram, the question is which is the better one?

**Equal Frequency versus Equal Width**

The actual goal is to compute the Mutual Information for two input columns. When using equal width histograms, in an extreme case it can happen that a few outliers in the data cause the bins to span a wide range, what in turn can lead to a case where one outlier lies within one bin and all the other data lies within a single other bin. See the image below (top left) for illustration.

Image 5: Normalized Mutual Information values on independent (top) vs. corelated (bottom) distribution, using equal width (left) vs. equal frequency (right) histograms

In this example the data of the two attributes was generated independently, so the true value of the Mutual Information should be very close to zero. Since we use an equal width histogram, all data (except the outlier) ends up in the same bin. This leads to a (normalized) Mutual Information value that is much greater than it should be. Using an equal frequency histogram (top right) overcomes this challenge by better representing the true distribution of the data. The Mutual Information value makes much more sense. The two bottom charts in the above image illustrate the case where the data was generated with strong correlation, thus the true Mutual Information is a value much greater than zero. Again, it can be seen that the equal frequency histogram (bottom right) is better suited to reflect the true distribution, rather than the equal width histogram (bottom left).

Feel free to download the R script that I used to create the histograms and to compute the Mutual Information as shown in the above illustrations. You can use it for experimentation to get a better feeling and understanding how different distributions affect the Mutual Information value. I’ve also embedded the slides from my presentation at Big Data TechCon below.