Multiple Linear Regression Example
This post outlines the process of generating a multiple linear regression model using LINEST
, characterizing its performance, and then using it to make a prediction.
The Data
The data utilized are the same as the data in binary classification worked example. Also available is profit and loss information for each customer, instead of the binary classification (default/no default) data used previously. As before, there are 200 total data points, with the same first five listed below.
Age | Years at Employer | Years at Address | Income | Credit Card Debt | Automobile Debt | Profit/Loss |
---|---|---|---|---|---|---|
26 | 0.35 | 0.24 | $12,181 | ($2,057) | ($3,696) | $3206.43 |
28 | 4.47 | 0.37 | $38,536 | ($6,970) | ($3,018) | $2940.32 |
37 | 9.30 | 0.02 | $30,602 | ($2,892) | ($1,674) | ($1023.98) |
28 | 8.43 | 1.06 | $15,588 | ($38) | ($1,758) | $2944.66 |
39 | 5.40 | 0.74 | $27,599 | ($776) | ($4,374) | $738.41 |
Building the Model
As with the binary classification model, the first step of this analysis is to standardize both the dependent and independent variables. For the dependent parameters, standardization occurs using the mean and standard deviation of the training data.
Age | Years at Employer | Years at Address | Income | Credit Card Debt | Automobile Debt | Profit/Loss | |
---|---|---|---|---|---|---|---|
mean | 34.67 | 8.61 | 0.78 | 48439.96 | -3202.11 | -6378.07 | 1906 |
standard deviation | 8.20 | 6.77 | 0.62 | 47862.35 | 3892.03 | 7472.44 | 5755.91 |
Standardization completed, the results appear as follows.
Age | Years at Employer | Years at Address | Income | Credit Card Debt | Automobile Debt | Profit/Loss |
---|---|---|---|---|---|---|
-0.26 | 0.11 | -0.78 | -0.22 | -0.01 | 0.21 | 0.23 |
-0.01 | 0.50 | 1.14 | 0.36 | -3.18 | -1.51 | 0.18 |
0.37 | 0.57 | -1.12 | 0.26 | -2.11 | -0.21 | -0.51 |
-0.73 | -1.07 | 1.71 | -0.60 | 0.51 | 0.53 | 0.18 |
-0.25 | -0.17 | -0.88 | -0.49 | 0.53 | 0.80 | -0.20 |
The next step is to run the LINEST
function. This produces the following results.
Auto Debt | CC Debt | Income | Years at Add | Years at Emp | Age | |
---|---|---|---|---|---|---|
-0.005 | -0.055 | 0.639 | -0.066 | 0.187 | 0.008 | 0.000 |
0.058 | 0.062 | 0.073 | 0.043 | 0.058 | 0.051 | 0.042 |
0.655 | 0.598 | #N/A | #N/A | #N/A | #N/A | #N/A |
61.2 | 193 | #N/A | #N/A | #N/A | #N/A | #N/A |
131.1 | 68.9 | #N/A | #N/A | #N/A | #N/A | #N/A |
Recall that the interpretation of the output of LINEST
is as follows.
Auto Debt | CC Debt | Income | Years at Add | Years at Emp | Age | |
---|---|---|---|---|---|---|
beta(6)*x(6) | beta(5)*x(5) | beta(4)*x(4) | beta(3)*x(3) | beta(2)*x(2) | beta(1)*x(1) | alpha |
SDev of error | SDev of error | SDev of error | SDev of error | SDev of error | SDev of error | SDev of error |
R^2 | SDev(y) | |||||
F-statistic | Degrees of Freedom | |||||
“Regression Sum of Squares” | “Residual Sum of Squares” |
The model, therefore, is the following, where all parameters are standardized.
$$\text{Profit/Loss} = -0.005(\text{Auto Debt}) - 0.055(\text{CC Debt}) + $$
$$0.639(\text{Income}) - 0.066(\text{Years at Add}) + $$
$$0.187(\text{Years at Emp}) + 0.008(\text{Age})$$
Benchmarking Performance
The standard deviation of the model error is calculated as the root-sum-square of the residuals. The residuals, as expected, are calculated as
$$\text{profits}_{model}-\text{profits}_{actual}$$
$$\sigma_e=0.675$$
The standard deviation of model error, expressed in dollars, is given by the standard deviation of model error on the test set multiplied by the standard deviation of the Profit & Loss from the training data.
$$residual_{test}=(\sigma_{e,test})(\sigma_{profits,training})$$
$$=(0.675)(5755.91)=3885.24$$
$$residual_{training}=(\sigma_{e,training})(\sigma_{profits,training})$$
$$=(0.587)(5755.91)=3378.72$$
The increase in the residual from the training data to the test data will give some indication as to the degree of overfitting the model has.
$$\frac{residual_{test}}{residual_{training}}=\frac{3885.24}{3378.72}=1.15$$
A less than 20% increase in the residual indicates minimal model over-fitting.
With the test set residual, it is possible to calculate things like the 90% confidence interval for the test set.
$$\text{NORMSINV(.95)}(residual)$$
$$(1.645)(3885.24)=\pm$6391$$
Applying the Model
Suppose we are given the following information.
Age | Years at Employer | Years at Address | Income | Credit Card Debt | Automobile Debt | |
---|---|---|---|---|---|---|
Raw | 42 | 12.44 | 0.9 | 121400 | -34228 | -23411 |
First the data are standardized using the training mean and standard deviation.
Age | Years at Employer | Years at Address | Income | Credit Card Debt | Automobile Debt | |
---|---|---|---|---|---|---|
Standardized | 0.89 | 0.56 | 0.19 | 1.52 | -7.97 | -2.28 |
Then, each individual z-score is multiplied by the appropriate $\beta$ coefficient from the original LINEST
calculation and the result is summed. The sum is the point forecast of profitability, expressed as a standardized output. The specific value is 1.525.
To determine the actual point estimate, this value is multiplied by the standard deviation of profits (5755.91) and the mean profit (1905.51) is added.
The result is $10,683.61.
From here, it is possible calculate profitability ranges using the standard deviation of errors on standardized outputs of .675. As an example, the range of profitability with 50% confidence is given by
$$(residual)(\text{NORMSINV(.25)})(\sigma_{profits})=\pm$2620.55$$
$$$10683\pm2620=[$8063,$13304]$$
Excel-Linest-Function.xlsx
Data_for_Final_Project.xlsx
AUC_Calculator-and-Review-of-AUC-Curve.xlsx
Excel-Linest-Function.xlsx
Linear-Regression-Forecasting-Excel-Document.xlsx
Some other content is taken from my notes on other aspects of the Coursera course.