# Supervised Learning on Census Dataset

In this project, I employ several supervised learning algorithms to model individuals’ income using data collected from the 1994 US Census. The goal of the project is to construct a model that accurately predicts whether an individual makes more than $50,000.

The result of the project is a model that is **87.1%** accurate (75.4% F-score) at predicting whether an individual makes more than $50,000 based on 13 features. Training the identical model on only the 5 most important features yields a model that is **85.9%** accurate (72.8% F-score).

This is page **1** of **3** for the project, where I **explore and preprocess the data**.

The specific tasks break down as follows:

**Explore the Data**

a. **Plot Histograms** for Continuous (Numerical) Feature Distributions

b. **Plot Bar Charts** for Discrete (Categorical) Feature Distributions

**Preprocess the Data**

a. Split into **Features and Labels**

b. **Transform Skewed** Continuous Features

c. **Normalize Continuous** Features

d. **One-Hot Encode Categorical** Features

e. **Shuffle and Split** the Data

**Write** the Processed Data to File

# Exploring the Data

The dataset for this project was donated by Ron Kohavi and Barry Becker, after originally being published in Ron Kohavi's article “Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid”. The data I investigate in this project is a lightly modified version of the original dataset, which is available from the UCI Machine Learning Repository. Specifically, the ‘fnlwgt’ feature and all records with missing or ill-formatted entries have been removed.

```
import numpy as np
import pandas as pd
from time import time
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
c = pd.read_csv('data/census.csv')
```

The outcome (label) is contained in the final column, `income`.

```
n_records = c.shape[0]
n_greater_50k = c[c['income']=='>50K'].shape[0]
n_at_most_50k = c[c['income']=='<=50K'].shape[0]
greater_50k_percent = n_greater_50k/n_records
print(' Total records: {}'.format(n_records))
print(' Individuals with income >$50K: {}'.format(n_greater_50k))
print('Individuals with income <=$50K: {}'.format(n_at_most_50k))
print(' %age with income >$50K: {:.1%}'.format(greater_50k_percent))
```

```
Total records: 45222
Individuals with income >$50K: 11208
Individuals with income <=$50K: 34014
%age with income >$50K: 24.8%
```
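The same class balance can also be read off in a single call. The sketch below is only an illustration: it uses a small made-up `income` series (an assumption, not the census data) and `value_counts(normalize=True)` to get the share of each class.

```python
import pandas as pd

# Hypothetical toy labels standing in for the `income` column.
income = pd.Series(['<=50K', '>50K', '<=50K', '<=50K'], name='income')

# Absolute counts and relative shares per class.
counts = income.value_counts()
shares = income.value_counts(normalize=True)

print(counts['>50K'], '{:.1%}'.format(shares['>50K']))
```

With roughly a quarter of the real records in the `>50K` class, the labels are imbalanced, which is worth keeping in mind when reading accuracy figures.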

Among the features, there are:

- 5 **continuous**, numerical features and
- 8 **discrete**, categorical features

`c.info()`

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45222 entries, 0 to 45221
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 45222 non-null int64
1 workclass 45222 non-null object
2 education_level 45222 non-null object
3 education-num 45222 non-null float64
4 marital-status 45222 non-null object
5 occupation 45222 non-null object
6 relationship 45222 non-null object
7 race 45222 non-null object
8 sex 45222 non-null object
9 capital-gain 45222 non-null float64
10 capital-loss 45222 non-null float64
11 hours-per-week 45222 non-null float64
12 native-country 45222 non-null object
13 income 45222 non-null object
dtypes: float64(4), int64(1), object(9)
memory usage: 4.8+ MB
```
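The split into continuous and categorical features can be derived from the dtypes rather than counted by hand. This is a minimal sketch on a hypothetical three-row frame (the column values are made up); on the real data the same two `select_dtypes` calls would be applied to `c.drop(columns=['income'])`.

```python
import pandas as pd

# Hypothetical frame with the same kinds of columns as the census data.
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'capital-gain': [2174.0, 0.0, 0.0],
    'workclass': ['State-gov', 'Private', 'Private'],
    'income': ['<=50K', '<=50K', '>50K'],
})

# Drop the label, then separate numeric columns from everything else.
features = toy.drop(columns=['income'])
continuous = features.select_dtypes(include='number').columns.tolist()
categorical = features.select_dtypes(exclude='number').columns.tolist()

print(continuous)
print(categorical)
```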

## Plot Histograms for Continuous Feature Distributions

```
def plot_continuous_distributions(field, bins=None, xlim_list=None, xtick_list=None):
    """Draw a box plot above a histogram for one continuous column of `c`."""
    fig, (b, h) = plt.subplots(ncols=1, nrows=2,
                               sharex=True,
                               gridspec_kw={'height_ratios':[1,3]},
                               figsize=(12,3))
    # Top panel: horizontal box plot
    b.boxplot(c[field],
              widths=0.6,
              vert=False)
    b.set_title(field,
                fontsize=14)
    b.set_yticks([])
    b.set_xticks(xtick_list)
    # Bottom panel: histogram on the shared x-axis
    h.hist(c[field],
           bins=bins,
           align='left')
    h.set_xlim(xlim_list)
    h.set_xticks(xtick_list)
    # Hide spines and tick marks on both panels
    for ax in [b, h]:
        for spine in ax.spines.values():
            spine.set_visible(False)
        ax.tick_params(
            axis='x',
            bottom=False)
        ax.tick_params(
            axis='y',
            left=False,
            right=False)
```

```
# (field, histogram bins, x-axis limits, x-axis ticks)
params = [('age',np.arange(15,95,5),[10,95],np.arange(10,100,10)),
          ('education-num',np.arange(0,17,1),[0,17],np.arange(2,18,2)),
          ('capital-gain',np.arange(0,100000,5000),[-5000,105000],
           np.arange(0,100000,10000)),
          ('capital-loss',np.arange(0,4500,250),[-250,5000],np.arange(0,4500,500)),
          ('hours-per-week',np.arange(0,100,5),[-5,105],np.arange(0,110,10))]
for field, bins, xlim_list, x_tick_list in params:
    plot_continuous_distributions(field, bins, xlim_list, x_tick_list)
```

## Plot Bar Charts for Discrete Feature Distributions

```
def plot_discrete_distributions(field):
    """Draw a bar chart of category counts for one discrete column of `c`."""
    # Explicit column names keep this working across pandas versions
    summary = (c[field].value_counts()
                       .rename_axis('index')
                       .reset_index(name='count'))
    # Collapse everything past the 14 most common categories into 'Other'
    if summary.shape[0] > 14:
        other_total = summary['count'].iloc[14:].sum()
        other = pd.DataFrame({'index': ['Other'], 'count': [other_total]})
        # DataFrame.append was removed in pandas 2.0, so concatenate instead
        summary = pd.concat([summary.iloc[:14], other], ignore_index=True)
    summary['percentage'] = round(summary['count']/summary['count'].sum()*100.,
                                  0).astype(int)
    labels = summary['index']
    position = summary.index
    counts = summary['count']
    plt.figure(figsize=(12,3))
    plt.bar(position, counts, align='center')
    if summary.shape[0] > 6:
        rotation = 20
    else:
        rotation = 0
    plt.xticks(position, labels, rotation=rotation)
    # Annotate each bar with its percentage and raw count
    rects = plt.gca().patches
    for n, r in enumerate(rects):
        height = r.get_height()
        plt.gca().text(r.get_x() + r.get_width() / 2,
                       height + plt.gca().get_ylim()[1]*0.08,
                       str(summary['percentage'][n]) + '%\n' + str(height),
                       ha='center',
                       va='center')
    for spine in plt.gca().spines.values():
        spine.set_visible(False)
    plt.gca().tick_params(
        axis='x',
        bottom=False)
    plt.gca().tick_params(
        axis='y',
        left=False,
        labelleft=False,
        right=False)
    plt.title(field,
              fontsize=14);
```

```
for field in ['workclass', 'education_level', 'marital-status', 'occupation',
              'relationship', 'race', 'sex', 'native-country', 'income']:
    plot_discrete_distributions(field)
```

# Preprocessing the Data

The data must be preprocessed before it can be used as input for machine learning algorithms. This **preprocessing** consists of handling invalid or missing entries (broadly called **cleaning**), as well as other **formatting** and **restructuring**.

## Split into Features and Labels

```
labels_raw = c['income']
features_raw = c.drop(columns=['income'])
```

## Transforming Skewed Continuous Features

`capital-gain` and `capital-loss` are both highly skewed features. The common practice is to apply a **logarithmic transformation** so that the very large and very small values do not negatively affect the performance of a learning algorithm.

Since the logarithm of `0` is undefined, I shift the values up by `1` before taking the logarithm, so that a value of `0` maps to `0`.

Only the features in the `skewed` list below are log-transformed.

```
skewed = ['capital-gain', 'capital-loss']
# Copy so that the raw features are left untouched
features_log_transformed = features_raw.copy()
features_log_transformed[skewed] = features_raw[skewed].apply(lambda x: np.log(x + 1))
```
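To make the effect of the shift concrete, here is a small worked example on made-up values (the array below is hypothetical, chosen only to mimic a feature that is mostly zero with a few large outliers). It also checks that `np.log(x + 1)` agrees with NumPy's built-in `np.log1p`, which could be used as a drop-in alternative.

```python
import numpy as np

# Hypothetical capital-gain values: mostly zero with a few large outliers.
gains = np.array([0.0, 0.0, 2174.0, 99999.0])

shifted_log = np.log(gains + 1)   # the transformation used above
log1p = np.log1p(gains)           # NumPy's built-in equivalent

# Zeros stay at zero, and the largest value is compressed to about 11.5.
print(shifted_log.round(2))
print(np.allclose(shifted_log, log1p))
```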

Visualize the change.

```
# Temporarily attach the log-transformed columns to `c` for plotting
c['log-capital-gain'] = features_log_transformed['capital-gain']
c['log-capital-loss'] = features_log_transformed['capital-loss']
params = [('capital-gain',np.arange(0,100000,5000),[-5000,105000],
           np.arange(0,100000,10000)),
          ('log-capital-gain',np.arange(0,12,1),[0,12],np.arange(0,14,2)),
          ('capital-loss',np.arange(0,4500,250),[-250,5000],np.arange(0,4500,500)),
          ('log-capital-loss',np.arange(0,8,1),[0,9],np.arange(0,10,2))]
for field, bins, xlim_list, x_tick_list in params:
    plot_continuous_distributions(field, bins, xlim_list, x_tick_list)
# Remove the temporary columns again
c = c.drop(columns=['log-capital-gain','log-capital-loss'])
```

## Normalizing Numerical Features

In addition to transforming highly skewed features, it is also good practice to scale numerical features. Unlike a log transform, scaling does not change the shape of a feature's distribution. It does, however, ensure that each feature is treated equally when applying supervised learning.

Note that the `capital-gain` and `capital-loss` fields were already log-transformed and are now also min-max scaled.

```
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler() # default=(0, 1)
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
# Copy so that the log-transformed frame is left untouched
features_log_minmax_transform = features_log_transformed.copy()
features_log_minmax_transform[numerical] = \
    scaler.fit_transform(features_log_transformed[numerical])
```
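For reference, min-max scaling maps each value to `(x - min) / (max - min)`, which is what `MinMaxScaler` with its default range of (0, 1) computes per column. The sketch below applies that formula by hand to a hypothetical array of ages (made-up values, not taken from the census data).

```python
import numpy as np

# Hypothetical ages; the census values are not used here.
age = np.array([17.0, 35.0, 53.5, 90.0])

# Min-max scaling: x' = (x - min) / (max - min)
scaled = (age - age.min()) / (age.max() - age.min())

# The smallest value maps to 0, the largest to 1, the midpoint to 0.5.
print(scaled)
```

Because the mapping is linear, the relative spacing of the values is preserved; only the units change.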

Visualize the change.

```
# Temporarily attach the scaled columns to `c` for plotting
for col in numerical:
    c['norm-' + col] = features_log_minmax_transform[col]
params = [('age',np.arange(15,95,5),[10,95],np.arange(10,100,10)),
          ('norm-age',np.arange(0,1.05,0.05),[-0.05,1.05],np.arange(0,1.1,0.1)),
          ('education-num',np.arange(0,17,1),[0,17],np.arange(2,18,2)),
          ('norm-education-num',np.arange(0,1.05,0.05),[-0.05,1.05],np.arange(0,1.1,0.1)),
          ('capital-gain',np.arange(0,100000,5000),[-5000,105000],
           np.arange(0,100000,10000)),
          ('norm-capital-gain',np.arange(0,1.05,0.05),[-0.05,1.05],np.arange(0,1.1,0.1)),
          ('capital-loss',np.arange(0,4500,250),[-250,5000],np.arange(0,4500,500)),
          ('norm-capital-loss',np.arange(0,1.05,0.05),[-0.05,1.05],np.arange(0,1.1,0.1)),
          ('hours-per-week',np.arange(0,100,5),[-5,105],np.arange(0,110,10)),
          ('norm-hours-per-week',np.arange(0,1.05,0.05),[-0.05,1.05],np.arange(0,1.1,0.1))]
for field, bins, xlim_list, x_tick_list in params:
    plot_continuous_distributions(field, bins, xlim_list, x_tick_list)
# Remove the temporary columns again
c = c.drop(columns=c.columns[c.columns.str.contains('norm')])
```

## One-Hot Encode Categoricals

Learning algorithms generally expect numeric input. This dataset has 8 categorical features, which need to be converted. This is generally accomplished by **one-hot encoding** the data, which creates one binary indicator column for each category of each feature. I use the `pandas.get_dummies()` function for one-hot encoding.

**Before One-Hot Encode**

| | someFeature |
|---|---|
| 0 | B |
| 1 | C |
| 2 | A |

**After One-Hot Encode**

| someFeature_A | someFeature_B | someFeature_C |
|---|---|---|
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |
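The toy example above can be reproduced with a few lines of pandas. This is only a sketch on the illustrative `someFeature` column; passing `dtype=int` is my own choice here so that the indicators print as `0`/`1` regardless of the pandas version, since newer versions default to booleans.

```python
import pandas as pd

# The illustrative single-column frame from the table above.
toy = pd.DataFrame({'someFeature': ['B', 'C', 'A']})

# One indicator column per category, named <column>_<category>.
encoded = pd.get_dummies(toy, dtype=int)

print(encoded)
```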

The target label, `income`, must also be converted. Since there are only two possible categories, I can skip one-hot encoding and simply encode `<=50K` as `0` and `>50K` as `1`.

```
features_final = pd.get_dummies(features_log_minmax_transform)
before_encoding_cols = len(features_log_minmax_transform.columns)
after_encoding_cols = len(features_final.columns)
print("Before one-hot encode: {} features".format(before_encoding_cols))
print(" After one-hot encode: {} features".format(after_encoding_cols))
```

```
Before one-hot encode: 13 features
After one-hot encode: 103 features
```

```
labels_final = (labels_raw
.replace('<=50K',0)
.replace('>50K',1))
```
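An alternative way to encode a two-class label is an explicit dictionary passed to `Series.map`. The sketch below uses a hypothetical three-element label series; one reason to consider this variant is that any value missing from the dictionary becomes `NaN`, so unexpected label strings are easy to detect.

```python
import pandas as pd

# Hypothetical labels standing in for `labels_raw`.
labels = pd.Series(['<=50K', '>50K', '<=50K'])

# Explicit mapping from class string to integer code.
encoded = labels.map({'<=50K': 0, '>50K': 1})

# Values missing from the mapping would show up as NaN here.
assert encoded.notna().all()
print(encoded.tolist())
```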

## Shuffle and Split the Data

Use an 80/20 training/testing split.

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features_final,
labels_final,
test_size = 0.2,
random_state = 0)
print("Training set: {} samples".format(X_train.shape[0]))
print(" Testing set: {} samples".format(X_test.shape[0]))
```

```
Training set: 36177 samples
Testing set: 9045 samples
```
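The sample counts above follow from simple arithmetic on the 45,222 records. As far as I can tell, scikit-learn rounds a fractional test size up and assigns the remainder to the training set; the sketch below reproduces the two numbers under that assumption.

```python
import math

n_records = 45222      # total records reported earlier
test_size = 0.2

# Test count is rounded up; the training set gets the rest.
n_test = math.ceil(test_size * n_records)
n_train = n_records - n_test

print(n_train, n_test)  # 36177 9045
```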

## Write to File

```
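import os

# Make sure the output directory exists before writing
os.makedirs('processed', exist_ok=True)
for data_subset, name in [(X_train, 'X_train'),
                          (X_test, 'X_test'),
                          (y_train, 'y_train'),
                          (y_test, 'y_test')]:
    data_subset.to_csv('processed/' + name + '.csv')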
```
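Since `to_csv` writes the index as the first column by default, the files need `index_col=0` when they are read back, otherwise the old index shows up as an extra feature. The round trip below is a minimal sketch on a hypothetical two-row frame written to a temporary directory; the column names and values are made up.

```python
import os
import tempfile

import pandas as pd

# Hypothetical feature frame with a shuffled index, as after a split.
X = pd.DataFrame({'age': [0.3, 0.5], 'sex_Male': [1, 0]}, index=[7, 2])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'X_train.csv')
    X.to_csv(path)                             # index is written as the first column
    restored = pd.read_csv(path, index_col=0)  # read that column back as the index

print(restored)
```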