Multiple Linear Regression Example

22 Jul 2018

This post outlines the process of generating a multiple linear regression model using LINEST, characterizing its performance, and then using it to make a prediction.

The Data

The data utilized are the same as the data in binary classification worked example. Also available is profit and loss information for each customer, instead of the binary classification (default/no default) data used previously. As before, there are 200 total data points, with the same first five listed below.

Age	Years at Employer	Years at Address	Income	Credit Card Debt	Automobile Debt	Profit/Loss
26	0.35	0.24	$12,181	($2,057)	($3,696)	$3206.43
28	4.47	0.37	$38,536	($6,970)	($3,018)	$2940.32
37	9.30	0.02	$30,602	($2,892)	($1,674)	($1023.98)
28	8.43	1.06	$15,588	($38)	($1,758)	$2944.66
39	5.40	0.74	$27,599	($776)	($4,374)	$738.41

Building the Model

As with the binary classification model, the first step of this analysis is to standardize both the dependent and independent variables. For the dependent parameters, standardization occurs using the mean and standard deviation of the training data.

	Age	Years at Employer	Years at Address	Income	Credit Card Debt	Automobile Debt	Profit/Loss
mean	34.67	8.61	0.78	48439.96	-3202.11	-6378.07	1906
standard deviation	8.20	6.77	0.62	47862.35	3892.03	7472.44	5755.91

Standardization completed, the results appear as follows.

Age	Years at Employer	Years at Address	Income	Credit Card Debt	Automobile Debt	Profit/Loss
-0.26	0.11	-0.78	-0.22	-0.01	0.21	0.23
-0.01	0.50	1.14	0.36	-3.18	-1.51	0.18
0.37	0.57	-1.12	0.26	-2.11	-0.21	-0.51
-0.73	-1.07	1.71	-0.60	0.51	0.53	0.18
-0.25	-0.17	-0.88	-0.49	0.53	0.80	-0.20

The next step is to run the LINEST function. This produces the following results.

Auto Debt	CC Debt	Income	Years at Add	Years at Emp	Age
-0.005	-0.055	0.639	-0.066	0.187	0.008	0.000
0.058	0.062	0.073	0.043	0.058	0.051	0.042
0.655	0.598	#N/A	#N/A	#N/A	#N/A	#N/A
61.2	193	#N/A	#N/A	#N/A	#N/A	#N/A
131.1	68.9	#N/A	#N/A	#N/A	#N/A	#N/A

Recall that the interpretation of the output of LINEST is as follows.

Auto Debt	CC Debt	Income	Years at Add	Years at Emp	Age
beta(6)*x(6)	beta(5)*x(5)	beta(4)*x(4)	beta(3)*x(3)	beta(2)*x(2)	beta(1)*x(1)	alpha
SDev of error	SDev of error	SDev of error	SDev of error	SDev of error	SDev of error	SDev of error
R^2	SDev(y)
F-statistic	Degrees of Freedom
“Regression Sum of Squares”	“Residual Sum of Squares”

The model, therefore, is the following, where all parameters are standardized.

$$\text{Profit/Loss} = -0.005(\text{Auto Debt}) - 0.055(\text{CC Debt}) + $$

$$0.639(\text{Income}) - 0.066(\text{Years at Add}) + $$

$$0.187(\text{Years at Emp}) + 0.008(\text{Age})$$

Benchmarking Performance

The standard deviation of the model error is calculated as the root-sum-square of the residuals. The residuals, as expected, are calculated as

$$\text{profits}_{model}-\text{profits}_{actual}$$

$$\sigma_e=0.675$$

The standard deviation of model error, expressed in dollars, is given by the standard deviation of model error on the test set multiplied by the standard deviation of the Profit & Loss from the training data.

$$residual_{test}=(\sigma_{e,test})(\sigma_{profits,training})$$

$$=(0.675)(5755.91)=3885.24$$

$$residual_{training}=(\sigma_{e,training})(\sigma_{profits,training})$$

$$=(0.587)(5755.91)=3378.72$$

The increase in the residual from the training data to the test data will give some indication as to the degree of overfitting the model has.

$$\frac{residual_{test}}{residual_{training}}=\frac{3885.24}{3378.72}=1.15$$

A less than 20% increase in the residual indicates minimal model over-fitting.

With the test set residual, it is possible to calculate things like the 90% confidence interval for the test set.

$$\text{NORMSINV(.95)}(residual)$$

$$(1.645)(3885.24)=\pm$6391$$

Applying the Model

Suppose we are given the following information.

	Age	Years at Employer	Years at Address	Income	Credit Card Debt	Automobile Debt
Raw	42	12.44	0.9	121400	-34228	-23411

First the data are standardized using the training mean and standard deviation.

	Age	Years at Employer	Years at Address	Income	Credit Card Debt	Automobile Debt
Standardized	0.89	0.56	0.19	1.52	-7.97	-2.28

Then, each individual z-score is multiplied by the appropriate $\beta$ coefficient from the original LINEST calculation and the result is summed. The sum is the point forecast of profitability, expressed as a standardized output. The specific value is 1.525.

To determine the actual point estimate, this value is multiplied by the standard deviation of profits (5755.91) and the mean profit (1905.51) is added.

The result is $10,683.61.

From here, it is possible calculate profitability ranges using the standard deviation of errors on standardized outputs of .675. As an example, the range of profitability with 50% confidence is given by

$$(residual)(\text{NORMSINV(.25)})(\sigma_{profits})=\pm$2620.55$$

$$$10683\pm2620=[$8063,$13304]$$

Some content in these notes is taken from the spreadsheets listed below. They are accessible as part of the Mastering Data Analysis in Excel course on coursera.org, and licensed by Daniel Egger under CC BY-NC 4.0.

Excel-Linest-Function.xlsx
Data_for_Final_Project.xlsx
AUC_Calculator-and-Review-of-AUC-Curve.xlsx
Excel-Linest-Function.xlsx
Linear-Regression-Forecasting-Excel-Document.xlsx

Some other content is taken from my notes on other aspects of the Coursera course.