Engineering a Metric for a Binary Classifier

13 Jul 2018

Consider the following practical example of binary classification. For a single day, a hypothetical bank issues credit cards to everyone who applies, and monitors the performance of the resulting loans. The dataset the class is provided includes two sets of 200 fictitious customers, one of which is “training” data and the other of which is “testing” data. The first few rows of this data are presented below.

As part of the normal process of issuing loans, for each customer, the hypothetical bank captures the information shown below. A 1 in the Outcome column indicates that the customer defaulted.

Age	Years at Employer	Years at Address	Income	Credit Card Debt	Automobile Debt	Outcome
32.53	9.39	0.30	$37,844	-$3,247	-$4,795	0
34.58	11.97	1.49	$65,765	-$15,598	-$17,632	1
37.70	12.46	0.09	$61,002	-$11,402	-$7,910	1

Engineering a Composite Metric

The AUC for each of these individual parameters, using the training data, is tabulated below.

Age	Years at Employer	Years at Address	Income	Credit Card Debt	Automobile Debt
0.64	0.72	0.58	0.63	0.60	0.62

It is reasonable to assume that the ratio of total debt to income would be predictive of a customer’s ability to repay loans. This is calculated as follows.

$\frac{D e b t}{I n c o m e} = \frac{| C r e d i t C a r d D e b t + A u t o D e b t |}{I n c o m e}$

This debt-to-income metric is a much stronger predictor of whether a customer will default than any of the individual metrics. Using the training data, it results in an AUC of 0.74.

Note, still, that Years at Employer is still nearly as strong a predictor on its own. It seems appropriate to combine these two metrics to see if, combined, they might produce something stronger overall. The values must be standardized before being combined. Years at Employer is negatively correlated with likelihood of default, whereas debt-to-income is positively correlated. Thus, they are combined as follows, where std indicates the data are standardized.

$M e t r i c = s t d (\frac{D e b t}{I n c o m e}) - s t d (Y e a r s a t E m p l o y e r)$

Or:

$M e t r i c = s t d (\frac{| C r e d i t C a r d D e b t + A u t o D e b t |}{I n c o m e}) - s t d (Y e a r s a t E m p l o y e r)$

Using the training data, this composite metric results in an AUC of 0.82.

The data are developed piecemeal, below. First, the raw data are used to calculate the $\frac{d e b t}{i n c}$ ratio.

Income	CC Debt	Auto Debt	Debt/Inc Ratio	Years at Employer
$37,844	-$3,247	-$4,795	21%	9.39
$65,765	-$15,598	-$17,632	51%	11.97
$61,002	-$11,402	-$7,910	32%	12.46

Then, the $\frac{d e b t}{i n c}$ and $Y e a r s a t E m p l o y e r$ are standardized and the results subtracted. Finally, that composite metric is itself standardized.

Std Debt/Inc	Std Years at Employer	Composite Metric	Std Composite Metric
0.04	0.11	-0.08	-0.05
2.00	0.50	1.51	1.03
0.74	0.57	0.17	0.11

Performance of the Metric

When tested using a set of testing data, this classification scheme produces the results below.

A: p(+), incidence of + condition
B: p(-), incidence of - condition
C: p(test POS), incidence of POS classification
D: p(test NEG), incidence of NEG classification
E: p(test POS, +), true positives
F: p(test NEG, +), false negatives
G: p(test POS, -), false positives
H: p(test NEG, -), true negatives

With parameters expressed as probabilities, the matrix is as shown below.

$true positive count = p (t e s t P O S, +) = E = .150$

$true positive rate = p (t e s t P O S | +) = \frac{E}{A} = \frac{.150}{.250} = .600$

$positive predictive value = p (+ | t e s t P O S) = \frac{E}{C} = \frac{.150}{.315} = .480$

Some content in these notes is taken from the spreadsheets listed below. They are accessible as part of the Mastering Data Analysis in Excel course on coursera.org, and licensed by Daniel Egger under CC BY-NC 4.0.

Binary-Performance-Metrics.xlsx
Review-of-AUC-for-ROC-Curve.xlsx
Forecasting-Soldier-Performance.xlsx
AUC_Calculator-and-Review-of-AUC-Curve.xlsx
Data_Final-Project.xlsx
Information-Gain-Calculator.xlsx

Some other content is taken from my notes on other aspects of the aforementioned Coursera course.