Numerical data can often be modeled as a number of independent (predictive) variables (aka columns/features/attributes) along with one dependent (response) variable. In a recent post about Multiple Correlation, you learned how to identify the independent variables that are most relevant for the response variable.
For example, in a data set with eight numeric variables describing properties of a vehicle, Multiple Correlation showed that the four variables acceleration, distance, horsepower and weight contain the best information for predicting the values of mpg (miles per gallon).
Multiple Regression is a technique where you now use these variables to learn a model that enables you to predict the value of the response variable for a new record where you only know the values of the independent variables (but the value of mpg is unknown).
I will briefly explain the mathematical background of how to learn such a multiple regression model, walk through the details of how this can be implemented in Datameer on big data with a set of custom linear algebra functions, and show how the derived model can be used in Datameer to make predictions on new data. By the end of this post, you’ll know how to scale both learning the model and making predictions on potentially big data (billions of records).
Now wait a minute, you’re probably saying. Isn’t the point of this post that you shouldn’t *have* to know how something like multiple regression works, mathematically speaking? You’re absolutely right. But, we want to “show our work” anyway for those of you who are interested in taking a look under the hood. For those of you who don’t necessarily need to know how it works and just want to be able to use it in Datameer — skip ahead to the last section — “Get The Multiple Regression Application”.
Multiple Linear Regression attempts to fit a series of independent variables (each denoted as X) and a dependent variable (Y) into a linear model.
This means we want to find the best way to describe the Y variable as a linear combination of the X variables. Using matrix algebra, we can describe this problem as a general linear system:

yᵢ = β₀ + β₁·xᵢ₁ + β₂·xᵢ₂ + … + βₚ·xᵢₚ + εᵢ  (for each record i = 1, …, n)
Written in shorter form, the equation becomes:

Y(n×1) = X(n×p) β(p×1) + ε(n×1)
The large letters are the matrices and the small letters describe the dimensions of each term. We are solving for the β vector. After some transformations (multiplying both sides by Xᵀ and then by the inverse of XᵀX), this can be expressed as:

β = (XᵀX)⁻¹ XᵀY
β is a vector, and from it we can take the required values to construct the desired equation.
This equation can then be used to make predictions on data where the values of Y are unknown.
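Outside of Datameer, this normal-equation solution can be sketched in a few lines of NumPy. The toy data set below (one independent variable plus an intercept column) is purely illustrative and not the post's vehicle data:

```python
import numpy as np

# Hypothetical toy data: a leading column of ones so that beta[0]
# becomes the intercept, plus one independent variable.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
Y = np.array([1.0, 3.0, 5.0])  # dependent variable

# Normal-equation solution: beta = (X^T X)^-1 X^T Y
beta = np.linalg.inv(X.T @ X) @ (X.T @ Y)
# beta[0] is the intercept, beta[1] the factor; here Y = 1 + 2x exactly.
```

With this data the solver recovers the intercept 1 and factor 2 exactly, since the points lie on a line.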
As input data in this example, we use an illustrative data set of 384 records describing the properties distance, horsepower, weight, acceleration and mpg (miles per gallon) of a vehicle. Mpg represents the dependent variable (Y). We also introduce a constant intercept column (always equal to 1), whose coefficient becomes the intercept β0.
To compute the values of the β vector in Datameer on big data, we utilize the formula derived above:

β = (XᵀX)⁻¹ XᵀY
We essentially break it up into three steps.
The first step is to compute (XᵀX)⁻¹, the inverse of the transpose product of the independent-variable input data. We do this in Datameer by applying the custom function GROUP_MATRIX_TRANSPOSE_PRODUCT to all independent variables and directly applying the custom function MATRIX_INVERSE to the result:
This results in a list of lists representing a 5 by 5 matrix, with each inner list being a row of that matrix. This is how the custom function formats its output as a matrix representation. Note that this scales on big data – the custom functions can deal with arbitrarily many rows of data in X. We call this result of the left part of the above formula the “betaCoefficientInverse”.
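In NumPy terms, this first step can be sketched as follows on a hypothetical toy X (2 columns instead of the post's 5; variable names are illustrative):

```python
import numpy as np

# Toy stand-in for the independent-variable columns (intercept first).
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])

# GROUP_MATRIX_TRANSPOSE_PRODUCT corresponds to X^T X,
# MATRIX_INVERSE to the matrix inverse of that product.
xtx = X.T @ X                              # 2x2 here, 5x5 in the post
beta_coefficient_inverse = np.linalg.inv(xtx)

# Like the custom function's output: a list of lists, one inner list per row.
as_list_of_lists = beta_coefficient_inverse.tolist()
```

The list-of-lists representation mirrors how the Datameer custom functions serialize a matrix.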
The next step is to compute the product of the transposed input data with the Y vector (XᵀY).
Similar to the step above, we create a group-by sheet and apply the custom function GROUP_MATRIX_TRANSPOSE_VECTOR_PRODUCT, which returns a one-column matrix with five entries:
We call this result matrix “untransformedBeta”. Note that each inner list in this result matrix is a single-element (single-column) row vector.
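This second step looks like the following in NumPy, again on the same hypothetical toy data:

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
Y = np.array([[1.0], [3.0], [5.0]])  # dependent variable as a column

# GROUP_MATRIX_TRANSPOSE_VECTOR_PRODUCT corresponds to X^T Y:
# a one-column matrix where each inner list is a single-element row.
untransformed_beta = (X.T @ Y).tolist()
```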
In a final step, we join these two intermediate result matrices into one sheet and apply the custom function MATRIX_PRODUCT to compute the final result:
This final result contains the regression model, consisting of five entries – the intercept and the four factors that can now directly be used to make predictions on new data.
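Multiplying the two intermediates completes the formula β = (XᵀX)⁻¹ XᵀY. A NumPy sketch of this last step, reusing the toy intermediate results from above:

```python
import numpy as np

# Intermediate results from the two previous steps (toy dimensions):
beta_coefficient_inverse = np.array([[ 5.0 / 6.0, -0.5],
                                     [-0.5,        0.5]])  # (X^T X)^-1
untransformed_beta = np.array([[9.0], [13.0]])             # X^T Y

# MATRIX_PRODUCT corresponds to plain matrix multiplication;
# the product is the beta vector: intercept first, then the factors.
beta = beta_coefficient_inverse @ untransformed_beta
```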
To use the model, we join it with each row of the new data set we want to score, in order to predict the value of Y (mpg in our running example) for each record. The prediction itself simply applies the linear equation described in the math section above:
Since the model is represented as a list of lists, we apply the LISTELEMENT function to retrieve the intercept and the factor values.
The exact formula in Datameer is:
Note that in this example, we applied the model to the same input data it was learned from. The prediction itself does not use the known values of mpg; however, we copied the known value of mpg into the result sheet, enabling you to compare the predicted values with the actual values. Unsurprisingly, the performance of the model on its own training data looks quite convincing. In a real-world application, you would choose a setup with a hold-out set or cross-validation to determine the actual model performance, which is something that can be conveniently done with Datameer as well.
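The per-record prediction can be sketched in plain Python; retrieving `model[i][0]` below plays the role of Datameer's LISTELEMENT on the list-of-lists model (the toy model values are illustrative):

```python
# The model is a list of lists: [[intercept], [factor_1], ..., [factor_n]].
model = [[1.0], [2.0]]  # hypothetical toy model: Y = 1 + 2 * x

def predict(record, model):
    # intercept plus the sum of factor * feature products,
    # i.e. the linear equation from the math section
    intercept = model[0][0]
    return intercept + sum(m[0] * x for m, x in zip(model[1:], record))

prediction = predict([3.0], model)  # -> 7.0
```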
As promised, we did the heavy lifting and turned all of the above into a single app, so you can achieve these steps without having to build them out yourself. Ready to give it a shot? To install, simply follow these steps: