# A Worked Example of Multiple Linear Regression

This post outlines the process of generating a multiple linear regression model using `LINEST`

, characterizing its performance, and then using it to make a prediction.

## The Data

The data utilized are the same as the data in binary classification worked example. Also available is profit and loss information for each customer, instead of the binary classification (default/no default) data used previously. As before, there are 200 total data points, with the same first five listed below.

Age | Years at Employer | Years at Address | Income | Credit Card Debt | Automobile Debt | Profit/Loss |
---|---|---|---|---|---|---|

26 | 0.35 | 0.24 | $12,181 | ($2,057) | ($3,696) | $3206.43 |

28 | 4.47 | 0.37 | $38,536 | ($6,970) | ($3,018) | $2940.32 |

37 | 9.30 | 0.02 | $30,602 | ($2,892) | ($1,674) | ($1023.98) |

28 | 8.43 | 1.06 | $15,588 | ($38) | ($1,758) | $2944.66 |

39 | 5.40 | 0.74 | $27,599 | ($776) | ($4,374) | $738.41 |

## Building the Model

As with the binary classification model, the first step of this analysis is to standardize both the dependent and independent variables. For the dependent parameters, standardization occurs using the mean and standard deviation of the training data.

Age | Years at Employer | Years at Address | Income | Credit Card Debt | Automobile Debt | Profit/Loss | |
---|---|---|---|---|---|---|---|

mean | 34.67 | 8.61 | 0.78 | 48439.96 | -3202.11 | -6378.07 | 1906 |

standard deviation | 8.20 | 6.77 | 0.62 | 47862.35 | 3892.03 | 7472.44 | 5755.91 |

Standardization completed, the results appear as follows.

Age | Years at Employer | Years at Address | Income | Credit Card Debt | Automobile Debt | Profit/Loss |
---|---|---|---|---|---|---|

-0.26 | 0.11 | -0.78 | -0.22 | -0.01 | 0.21 | 0.23 |

-0.01 | 0.50 | 1.14 | 0.36 | -3.18 | -1.51 | 0.18 |

0.37 | 0.57 | -1.12 | 0.26 | -2.11 | -0.21 | -0.51 |

-0.73 | -1.07 | 1.71 | -0.60 | 0.51 | 0.53 | 0.18 |

-0.25 | -0.17 | -0.88 | -0.49 | 0.53 | 0.80 | -0.20 |

The next step is to run the `LINEST`

function. This produces the following results.

Auto Debt | CC Debt | Income | Years at Add | Years at Emp | Age | |
---|---|---|---|---|---|---|

-0.005 | -0.055 | 0.639 | -0.066 | 0.187 | 0.008 | 0.000 |

0.058 | 0.062 | 0.073 | 0.043 | 0.058 | 0.051 | 0.042 |

0.655 | 0.598 | #N/A | #N/A | #N/A | #N/A | #N/A |

61.2 | 193 | #N/A | #N/A | #N/A | #N/A | #N/A |

131.1 | 68.9 | #N/A | #N/A | #N/A | #N/A | #N/A |

Recall that the interpretation of the output of `LINEST`

is as follows.

Auto Debt | CC Debt | Income | Years at Add | Years at Emp | Age | |
---|---|---|---|---|---|---|

beta(6)*x(6) | beta(5)*x(5) | beta(4)*x(4) | beta(3)*x(3) | beta(2)*x(2) | beta(1)*x(1) | alpha |

SDev of error | SDev of error | SDev of error | SDev of error | SDev of error | SDev of error | SDev of error |

R^2 | SDev(y) | |||||

F-statistic | Degrees of Freedom | |||||

“Regression Sum of Squares” | “Residual Sum of Squares” |

The model, therefore, is the following, where all parameters are standardized.

## Benchmarking Performance

The standard deviation of the model error is calculated as the root-sum-square of the residuals. The residuals, as expected, are calculated as

The standard deviation of model error, expressed in dollars, is given by the standard deviation of model error on the test set multiplied by the standard deviation of the Profit & Loss from the training data.

The increase in the residual from the training data to the test data will give some indication as to the degree of overfitting the model has.

A less than 20% increase in the residual indicates minimal model over-fitting.

With the test set residual, it is possible to calculate things like the 90% confidence interval for the test set.

## Applying the Model

Suppose we are given the following information.

Age | Years at Employer | Years at Address | Income | Credit Card Debt | Automobile Debt | |
---|---|---|---|---|---|---|

Raw | 42 | 12.44 | 0.9 | 121400 | -34228 | -23411 |

First the data are standardized using the training mean and standard deviation.

Age | Years at Employer | Years at Address | Income | Credit Card Debt | Automobile Debt | |
---|---|---|---|---|---|---|

Standardized | 0.89 | 0.56 | 0.19 | 1.52 | -7.97 | -2.28 |

Then, each individual z-score is multiplied by the appropriate $\beta$ coefficient from the original `LINEST`

calculation and the result is summed. The sum is the point forecast of profitability, expressed as a standardized output. The specific value is 1.525.

To determine the actual point estimate, this value is multiplied by the standard deviation of profits (5755.91) and the mean profit (1905.51) is added.

The result is $10,683.61.

From here, it is possible calculate profitability ranges using the standard deviation of errors on standardized outputs of .675. As an example, the range of profitability with 50% confidence is given by

Some content from this note was taken from the spreadsheets listed below. They are distributed as part of the Mastering Data Analysis in Excel course on coursera.org, and licensed by Daniel Egger under CC BY-NC 4.0.

*Excel-Linest-Function.xlsx**Data_for_Final_Project.xlsx**AUC_Calculator-and-Review-of-AUC-Curve.xlsx**Excel-Linest-Function.xlsx**Linear-Regression-Forecasting-Excel-Document.xlsx*

Some other content is taken from my notes on the Coursera course “Mastering Data Analysis in Excel.” It is sponsored by Duke University and the course content is presented by Professor Daniel Egger.