Multiple Linear Regression Example

This post outlines the process of generating a multiple linear regression model using LINEST, characterizing its performance, and then using it to make a prediction.

The Data

The data utilized are the same as the data in binary classification worked example. Also available is profit and loss information for each customer, instead of the binary classification (default/no default) data used previously. As before, there are 200 total data points, with the same first five listed below.

Age  Years at Employer Years at Address Income Credit Card Debt  Automobile Debt Profit/Loss
26 0.35 0.24 $12,181 ($2,057) ($3,696) $3206.43
28 4.47 0.37 $38,536 ($6,970) ($3,018) $2940.32
37 9.30 0.02 $30,602 ($2,892) ($1,674) ($1023.98)
28 8.43 1.06 $15,588 ($38) ($1,758) $2944.66
39 5.40 0.74 $27,599 ($776) ($4,374) $738.41

Building the Model

As with the binary classification model, the first step of this analysis is to standardize both the dependent and independent variables. For the dependent parameters, standardization occurs using the mean and standard deviation of the training data.

Age Years at Employer Years at Address Income Credit Card Debt Automobile Debt Profit/Loss
mean 34.67 8.61 0.78 48439.96 -3202.11 -6378.07 1906
standard deviation 8.20 6.77 0.62 47862.35 3892.03 7472.44 5755.91

Standardization completed, the results appear as follows.

Age Years at Employer Years at Address Income Credit Card Debt Automobile Debt Profit/Loss
-0.26 0.11 -0.78 -0.22 -0.01 0.21 0.23
-0.01 0.50 1.14 0.36 -3.18 -1.51 0.18
0.37 0.57 -1.12 0.26 -2.11 -0.21 -0.51
-0.73 -1.07 1.71 -0.60 0.51 0.53 0.18
-0.25 -0.17 -0.88 -0.49 0.53 0.80 -0.20

The next step is to run the LINEST function. This produces the following results.

Auto Debt CC Debt Income Years at Add Years at Emp Age
-0.005 -0.055 0.639 -0.066 0.187 0.008 0.000
0.058 0.062 0.073 0.043 0.058 0.051 0.042
0.655 0.598 #N/A #N/A #N/A #N/A #N/A
61.2 193 #N/A #N/A #N/A #N/A #N/A
131.1 68.9 #N/A #N/A #N/A #N/A #N/A

Recall that the interpretation of the output of LINEST is as follows.

Auto Debt CC Debt Income Years at Add Years at Emp Age
beta(6)*x(6) beta(5)*x(5) beta(4)*x(4) beta(3)*x(3) beta(2)*x(2) beta(1)*x(1) alpha
SDev of error SDev of error SDev of error SDev of error SDev of error SDev of error SDev of error
R^2 SDev(y)
F-statistic Degrees of Freedom
“Regression Sum of Squares” “Residual Sum of Squares”

The model, therefore, is the following, where all parameters are standardized.

$$\text{Profit/Loss} = -0.005(\text{Auto Debt}) - 0.055(\text{CC Debt}) + $$

$$0.639(\text{Income}) - 0.066(\text{Years at Add}) + $$

$$0.187(\text{Years at Emp}) + 0.008(\text{Age})$$


Benchmarking Performance

The standard deviation of the model error is calculated as the root-sum-square of the residuals. The residuals, as expected, are calculated as

$$\text{profits}_{model}-\text{profits}_{actual}$$

$$\sigma_e=0.675$$

The standard deviation of model error, expressed in dollars, is given by the standard deviation of model error on the test set multiplied by the standard deviation of the Profit & Loss from the training data.

$$residual_{test}=(\sigma_{e,test})(\sigma_{profits,training})$$

$$=(0.675)(5755.91)=3885.24$$

$$residual_{training}=(\sigma_{e,training})(\sigma_{profits,training})$$

$$=(0.587)(5755.91)=3378.72$$

The increase in the residual from the training data to the test data will give some indication as to the degree of overfitting the model has.

$$\frac{residual_{test}}{residual_{training}}=\frac{3885.24}{3378.72}=1.15$$

A less than 20% increase in the residual indicates minimal model over-fitting.

With the test set residual, it is possible to calculate things like the 90% confidence interval for the test set.

$$\text{NORMSINV(.95)}(residual)$$

$$(1.645)(3885.24)=\pm$6391$$

Applying the Model

Suppose we are given the following information.

Age Years at Employer Years at Address Income Credit Card Debt Automobile Debt
Raw 42 12.44 0.9 121400 -34228 -23411

First the data are standardized using the training mean and standard deviation.

Age Years at Employer Years at Address Income Credit Card Debt Automobile Debt
Standardized 0.89 0.56 0.19 1.52 -7.97 -2.28

Then, each individual z-score is multiplied by the appropriate $\beta$ coefficient from the original LINEST calculation and the result is summed. The sum is the point forecast of profitability, expressed as a standardized output. The specific value is 1.525.

To determine the actual point estimate, this value is multiplied by the standard deviation of profits (5755.91) and the mean profit (1905.51) is added.

The result is $10,683.61.

From here, it is possible calculate profitability ranges using the standard deviation of errors on standardized outputs of .675. As an example, the range of profitability with 50% confidence is given by

$$(residual)(\text{NORMSINV(.25)})(\sigma_{profits})=\pm$2620.55$$

$$$10683\pm2620=[$8063,$13304]$$