“Smart data discovery has the potential to expand access to sophisticated interactive analysis and insights to business consumers and nontraditional BI users — the approximately 70 percent of users in organizations that currently do not use BI tools or have statistical backgrounds,” said Rita Sallam, research vice president at Gartner.
That’s exactly what we aim to do with Datameer: close the gap between those who should be able to work with their data, and those who currently can. Following is an early look at just one of many examples of how we’re working to push the envelope with Smart Data Discovery, making what was previously a complex analysis accessible to all. Today, let’s geek out on Multiple Correlation Coefficients.
When we analyze data in an attempt to make predictions, we are often looking for links between the different variables (columns/features/attributes/dimensions) to better understand relationships and how the data is shaped. In Datameer, you can quickly and easily use the Column Dependencies feature in our Smart Analytics module to automatically compute how strongly each possible pair of these variables is related, simply by telling it which columns of your data you’d like it to analyze.
You can set Column Dependencies to leverage the Pearson Correlation measure to determine the strength of those relationships. However, this only gives you the correlation of each individual pair of variables. But what if you could compute how any possible combination of multiple independent variables is related to a dependent variable? This technique is called Multiple Correlation and can be especially useful in the context of feature selection or multiple regression.
In the following illustration, the picture on the left shows each pair-wise combination of an independent variable (blue) with a dependent one (red), which is what you can easily do with the Datameer Column Dependencies component. The image on the right shows combinations of multiple independent variables compared with the dependent one, which is what we will focus on for now.
This raises two questions of interest:
To answer these questions, we must first understand how multiple correlation works mathematically.
Now wait a minute, you’re probably saying. Isn’t the point of this post that you shouldn’t *have* to know how something like multiple correlation works, mathematically speaking? You’re absolutely right. But, we want to “show our work” anyway for those of you who are interested in taking a look under the hood. For those of you who don’t necessarily need to know how it works and just want to be able to use it in Datameer — skip ahead to the last section — “Get The Multiple Correlations Application”.
The first step in the calculation of multiple correlation is to compute all pairwise correlations of the variables. To do this we use Pearson’s correlation coefficient, r_xy, which is calculated using the following formula:
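In standard notation, with sample means x̄ and ȳ over n observations:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```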
We then calculate the coefficient of determination, denoted R². First we construct the correlation matrix, a symmetric matrix of all pairwise correlation coefficients between the independent variables, as shown below:
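For independent variables x₁ through xₙ, the matrix has ones on the diagonal and the pairwise coefficients everywhere else:

```latex
R_{xx} =
\begin{pmatrix}
1           & r_{x_1 x_2} & \cdots & r_{x_1 x_n} \\
r_{x_2 x_1} & 1           & \cdots & r_{x_2 x_n} \\
\vdots      & \vdots      & \ddots & \vdots      \\
r_{x_n x_1} & r_{x_n x_2} & \cdots & 1
\end{pmatrix}
```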
Next, we construct a vector, c, which contains the correlation coefficients between each independent variable and the dependent variable, as shown below:
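With the dependent variable y, the vector is:

```latex
c =
\begin{pmatrix}
r_{x_1 y} \\
r_{x_2 y} \\
\vdots    \\
r_{x_n y}
\end{pmatrix}
```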
The coefficient of determination is then calculated using the formula below, where c^T is the transpose of the vector c and R_xx^-1 is the inverse of the correlation matrix:
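In compact form:

```latex
R^2 = c^{T} \, R_{xx}^{-1} \, c
```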
One problem with this R² value is that it will increase every time we introduce a new independent variable, regardless of whether that variable is an effective predictor. To correct for this, we require one more formula. We now calculate the adjusted R² value
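The adjusted value is:

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right) \frac{p - 1}{p - N - 1}
```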
where R² is our original coefficient of determination, p is the sample size, and N is the number of independent variables. One quirk of this formula is that it can produce negative values; these are interpreted as zero, and all positive values are square-rooted (just as the square root of R² gives R) to obtain the multiple correlation coefficient.
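If you want to experiment with the math outside of Datameer, the whole derivation fits into a short, dependency-free Python sketch. The function names here are our own illustrations, not part of any Datameer API; following the notation above, p is the sample size and N is the number of independent variables:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r_xy of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def invert(m):
    """Invert a small square matrix via Gauss-Jordan elimination with pivoting."""
    n = len(m)
    # augment with the identity matrix
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]

def multiple_correlation(X, y):
    """Multiple correlation coefficient of predictor columns X against column y."""
    N = len(X)   # number of independent variables
    p = len(y)   # sample size
    Rxx = [[pearson(X[i], X[j]) for j in range(N)] for i in range(N)]
    c = [pearson(col, y) for col in X]
    inv = invert(Rxx)
    # R^2 = c^T * Rxx^-1 * c
    r2 = sum(c[i] * inv[i][j] * c[j] for i in range(N) for j in range(N))
    adjusted = 1 - (1 - r2) * (p - 1) / (p - N - 1)
    return math.sqrt(adjusted) if adjusted > 0 else 0.0
```

As a sanity check: when y is an exact linear combination of the predictors, the function returns a coefficient of 1 (up to floating-point error).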
How does this work in Datameer? Currently, it requires bringing together a few advanced techniques, but spoiler alert, we’ve made an app for you to make it even easier. Once again, read on to see what we’ve created for you in the app or, skip ahead to the “Get The Multiple Correlations Application” section to go ahead and get started.
As a result, we will have identified the subset of independent variables that, as a group, has the strongest correlation with the dependent variable.
Note: we use a randomly generated dataset as our practice input. It is designed so that three of the five independent variables have a high correlation with the dependent variable, while the other two are truly random and thus have little to no correlation (except by chance!). Note that the random values are re-generated with each run, so the results will vary slightly. The screenshot below shows a few records of that input data. In this example we have five independent variables (X1, X2, X3, X4, X5) and the dependent variable Y.
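A practice dataset with the same shape can be sketched in a few lines of Python; the coefficients and noise level below are our own illustrative choices, not the ones the app uses:

```python
import random

n = 1000  # number of records; values are re-generated on every run

# three variables designed to correlate with Y ...
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]
# ... and two that are truly random with respect to Y
x4 = [random.gauss(0, 1) for _ in range(n)]
x5 = [random.gauss(0, 1) for _ in range(n)]

# Y depends only on X1, X2 and X3, plus some noise
y = [2 * a - 1.5 * b + c + random.gauss(0, 0.5)
     for a, b, c in zip(x1, x2, x3)]
```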
In this first step, we generate the Pearson’s correlation coefficients for every possible pair of variables. This is a vital step: in order to understand how a group of independent variables affects a dependent variable, we must also understand the possible relationships they have with each other.
Generating all the Pearson correlation values can easily be done using the Smart Analytics Column Dependencies component in Datameer. The result sheet shows the correlation value for each pair of input columns. It also outputs the p-value, which is not relevant for what follows.
In order to find the best predictor set, we want to exhaustively test every possible combination of independent variables. To achieve this, we apply the GROUPCOMBIN function to Column 1 from the column dependencies sheet, thus generating an individual list for every single combination. A GROUPROWNUMBER function is then used to assign every combination its own unique ID.
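Conceptually, these two functions enumerate subsets and number them. A hypothetical plain-Python equivalent (using the five sample column names) looks like this:

```python
from itertools import combinations

variables = ["X1", "X2", "X3", "X4", "X5"]

# GROUPCOMBIN equivalent: every combination of at least two variables
combos = [combo
          for size in range(2, len(variables) + 1)
          for combo in combinations(variables, size)]

# GROUPROWNUMBER equivalent: give each combination its own unique ID
combos_with_ids = list(enumerate(combos, start=1))

print(len(combos))  # 2^5 - 5 - 1 = 26 combinations for five variables
```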
We then expand these lists with their combination IDs, so that every instance of an independent variable has its own row, alongside its ID. This is so every combination can be assigned the relevant independent variable-pairs, along with the correlation coefficients.
The resulting sheet is a join, where all combinations are associated with the necessary independent variable-pairs, and each pair is displayed with its correlation coefficient. Later on in the workbook, we also introduce the coefficients between the independent and the dependent variables. Once all this information is brought together (essentially a re-combination of Column Dependencies output), we are able to calculate the multiple correlation coefficient using our custom function.
At this point we have a sheet that contains, for each possible combination, a number of records (along with their combination IDs) – one record per pair of independent variables of that particular combination. In the screenshot below you can see, for example, that combination 11 (X1,X2,X3,X4) contains all six possible pairs of independent variables: (X1,X2), (X1,X3), (X1,X4), (X2,X3), (X2,X4), (X3,X4). The records also contain the respective pair-wise correlation coefficients between the independent variables, as well as between each independent variable and the dependent variable.
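The pair expansion for a single combination is just a 2-element subset enumeration; in Python, for example:

```python
from itertools import combinations

combo = ("X1", "X2", "X3", "X4")       # one combination of independent variables
pairs = list(combinations(combo, 2))   # all pairs within that combination

print(pairs)
# [('X1', 'X2'), ('X1', 'X3'), ('X1', 'X4'), ('X2', 'X3'), ('X2', 'X4'), ('X3', 'X4')]
```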
When we treat each combination as a group (with a GROUPBY on the CombinationID), we can feed all that information into a custom group function (GROUP_R_SQUARED) to compute the R² for each group (i.e., each combination). The function itself applies the linear algebra described earlier in the theory section. Note that the function does this in memory for one single group, which is reasonable since the size of a single group is small. The function then simply outputs the R² value per combination:
As noted above, we want to compute the adjusted R² value, which takes the number of records and the number of variables (of a combination) into account. We join this information into our intermediate R² result and compute the adjusted value with the formula stated in the theory section. In a final step, to compute the MCC value we output 0 if the adjusted R² is negative; otherwise we take the square root.
With the described approach, the number of independent variables is capped at 16. This is because the GROUPCOMBIN function only generates combinations for up to 16 different values. When N is the number of independent variables, we can compute the number of resulting combinations via:
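Summing the binomial coefficients over all subset sizes from two to N (single variables and the empty set are excluded) gives:

```latex
\sum_{k=2}^{N} \binom{N}{k} = 2^{N} - N - 1
```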
For N=16 this is 65,519 combinations. Note that Datameer automatically parallelizes the computation of these thousands of R² coefficients through our custom function.
You may wonder why all pairs of a single combination of independent variables (one single group) will always fit into memory. Imagine the number of independent variables is 16. How many pairs do all 16 variables produce? We can compute this via:
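The pair count is simply the number of 2-element subsets of N variables:

```latex
\binom{N}{2} = \frac{N(N-1)}{2}
```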
For all 16 independent variables this results in 120 possible pairs. This means our custom group function will at most have to deal with 120 records at a time. The largest object this function will build (from these records) in memory is the matrix R_xx (see the theory section), which contains at most 16 × 16 = 256 elements.
As promised, we did the heavy lifting and packaged all of the above into a single app, so you can run this analysis without having to build it out yourself. Ready to give it a shot? To install, simply follow these steps:
We hope you enjoy it!