Decision Tree Titanic Example
Training a sklearn Decision Tree on the Titanic Dataset
Import Statements
import numpy as np
import pandas as pd
import random
random.seed(42)
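Note that random.seed only seeds Python's built-in random module; scikit-learn draws its randomness from NumPy, so a hedged addition for reproducibility would also seed NumPy's global generator (train_test_split below additionally pins random_state=42):
# Assumption: also seed NumPy, since scikit-learn estimators use NumPy's RNG
# when no explicit random_state is passed
np.random.seed(42)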
Read Data
full_data = pd.read_csv('decision-tree-titanic-example/titanic_data.csv')
features_raw = full_data.drop(columns=['Survived'])
outcomes = full_data['Survived']
features_raw.head()
 | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
outcomes.head()
0 0
1 1
2 1
3 1
4 0
Name: Survived, dtype: int64
Preprocessing the Data
- Survived: Outcome of survival (0 = No; 1 = Yes)
- Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- Name: Name of passenger
- Sex: Sex of the passenger
- Age: Age of the passenger (Some entries contain NaN)
- SibSp: Number of siblings and spouses of the passenger aboard
- Parch: Number of parents and children of the passenger aboard
- Ticket: Ticket number of the passenger
- Fare: Fare paid by the passenger
- Cabin: Cabin number of the passenger (Some entries contain NaN)
- Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)
features_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Pclass 891 non-null int64
2 Name 891 non-null object
3 Sex 891 non-null object
4 Age 714 non-null float64
5 SibSp 891 non-null int64
6 Parch 891 non-null int64
7 Ticket 891 non-null object
8 Fare 891 non-null float64
9 Cabin 204 non-null object
10 Embarked 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB
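The info() output shows that Age (714 non-null), Cabin (204 non-null), and Embarked (889 non-null) are the columns with missing values. For example, the missing counts can be listed directly:
# Missing values per column; Age, Cabin, and Embarked are the only non-zero entries
missing = features_raw.isnull().sum()
print(missing[missing > 0])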
Drop the Name of the Passenger
features_no_names = features_raw.drop(columns=['Name'])
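Name is dropped because it is essentially a free-text identifier that is unique (or nearly so) per passenger, so one-hot encoding it would mostly add single-passenger columns. A quick check, for example:
# Name is (nearly) unique per passenger, so it is not useful as a raw categorical feature
print(features_raw['Name'].nunique(), 'unique names in', len(features_raw), 'rows')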
One-hot encode the features
Note that following one-hot encoding, there are 839 columns. The remaining missing values (the NaN entries in Age) are filled with 0.0.
features = pd.get_dummies(features_no_names)
features = features.fillna(0.0)
features.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Columns: 839 entries, PassengerId to Embarked_S
dtypes: float64(2), int64(4), uint8(833)
memory usage: 766.7 KB
features.head()
 | PassengerId | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Ticket_110152 | Ticket_110413 | ... | Cabin_F G73 | Cabin_F2 | Cabin_F33 | Cabin_F38 | Cabin_F4 | Cabin_G6 | Cabin_T | Embarked_C | Embarked_Q | Embarked_S
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 3 | 22.0 | 1 | 0 | 7.2500 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 2 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 3 | 3 | 26.0 | 0 | 0 | 7.9250 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 4 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 5 | 3 | 35.0 | 0 | 0 | 8.0500 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 839 columns
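Most of the 839 columns come from one-hot encoding the high-cardinality Ticket and Cabin columns, since get_dummies creates one column per distinct value. A quick way to see this, using the same features_no_names:
# Distinct values per categorical (object) column; each becomes a dummy column
print(features_no_names.select_dtypes(include='object').nunique())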
Split the Data into Training and Testing Sets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(features,
outcomes,
test_size=0.2,
random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(712, 839) (179, 839) (712,) (179,)
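The split above is purely random. Since the classes are imbalanced (more non-survivors than survivors), a stratified split that preserves the class proportions in both sets is a common variation; a hedged sketch:
# Optional variation: keep the Survived class ratio identical in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    features, outcomes, test_size=0.2, random_state=42, stratify=outcomes)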
Train a Model with Default Hyperparameters
default_model = DecisionTreeClassifier().fit(X_train, y_train)
default_model
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')
Testing the Default Model
from sklearn.metrics import accuracy_score
def test_accuracy(model, print_results=False):
    # Evaluate a fitted model on both the training and the testing set
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    train_accuracy = round(accuracy_score(y_train, y_train_pred) * 100., 1)
    test_accuracy = round(accuracy_score(y_test, y_test_pred) * 100., 1)
    if print_results:
        print('Training accuracy: {0:2.1f}%'.format(train_accuracy))
        print('Testing accuracy: {0:2.1f}%'.format(test_accuracy))
    return train_accuracy, test_accuracy
test_accuracy(default_model, True);
Training accuracy: 100.0%
Testing accuracy: 81.6%
Scoring 100% on the training set but only 81.6% on the testing set indicates that the model is overfitting the training data.
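One way to see the overfitting directly is to inspect the size of the fitted tree; with the default hyperparameters it grows until the leaves are (nearly) pure. A quick inspection, for example:
# Depth, leaf count, and node count of the unconstrained default tree
print('depth:', default_model.get_depth())
print('leaves:', default_model.get_n_leaves())
print('nodes:', default_model.tree_.node_count)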
Improve the Model
Tune the following hyperparameters:
- max_depth
- min_samples_leaf
- min_samples_split
The following code performs an exhaustive search over a grid of values for these three hyperparameters, scoring every tree on both the training and testing sets (a GridSearchCV alternative is sketched after the result tables below).
results = pd.DataFrame(columns=['max_depth',
'min_samples_leaf',
'min_samples_split',
'n_nodes',
'training_accuracy',
'testing_accuracy'])
# Exhaustive sweep: train one tree per hyperparameter combination and
# record its size and its training/testing accuracy
for max_depth_i in range(1, 15):
    for min_samples_leaf_i in range(1, 15):
        for min_samples_split_i in range(2, 15):
            model = (DecisionTreeClassifier(max_depth=max_depth_i,
                                            min_samples_leaf=min_samples_leaf_i,
                                            min_samples_split=min_samples_split_i)
                     .fit(X_train, y_train))
            training_accuracy, testing_accuracy = test_accuracy(model)
            n_nodes = model.tree_.node_count
            # Note: DataFrame.append works in the older pandas used here;
            # it was removed in pandas 2.0, where pd.concat is the replacement
            results = results.append(pd.Series([max_depth_i,
                                                min_samples_leaf_i,
                                                min_samples_split_i,
                                                n_nodes,
                                                training_accuracy,
                                                testing_accuracy],
                                               index=results.columns),
                                      ignore_index=True)
results = (results
           .sort_values(['testing_accuracy', 'n_nodes'],
                        ascending=[False, True])
           .reset_index(drop=True))
for col in results.columns[:4]:
    results[col] = results[col].astype(int)
results.head()
 | max_depth | min_samples_leaf | min_samples_split | n_nodes | training_accuracy | testing_accuracy
---|---|---|---|---|---|---
0 | 7 | 6 | 14 | 81 | 87.5 | 86.0 |
1 | 7 | 6 | 2 | 85 | 87.5 | 86.0 |
2 | 7 | 6 | 3 | 85 | 87.5 | 86.0 |
3 | 7 | 6 | 4 | 85 | 87.5 | 86.0 |
4 | 7 | 6 | 5 | 85 | 87.5 | 86.0 |
results[results['testing_accuracy']==86.0].tail()
 | max_depth | min_samples_leaf | min_samples_split | n_nodes | training_accuracy | testing_accuracy
---|---|---|---|---|---|---
111 | 14 | 6 | 8 | 119 | 88.2 | 86.0 |
112 | 14 | 6 | 9 | 119 | 88.2 | 86.0 |
113 | 14 | 6 | 10 | 119 | 88.2 | 86.0 |
114 | 14 | 6 | 11 | 119 | 88.2 | 86.0 |
115 | 14 | 6 | 12 | 119 | 88.2 | 86.0 |
results.tail()
 | max_depth | min_samples_leaf | min_samples_split | n_nodes | training_accuracy | testing_accuracy
---|---|---|---|---|---|---
2543 | 13 | 2 | 6 | 181 | 92.7 | 76.0 |
2544 | 13 | 2 | 4 | 207 | 93.0 | 76.0 |
2545 | 13 | 2 | 3 | 209 | 93.3 | 75.4 |
2546 | 14 | 2 | 3 | 223 | 93.5 | 75.4 |
2547 | 14 | 2 | 2 | 223 | 93.5 | 74.3 |
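As an aside, the same kind of sweep can be written with scikit-learn's GridSearchCV, which selects hyperparameters by cross-validation on the training set rather than by scoring against the held-out test set; a minimal sketch (not the procedure used above, and its best parameters may differ):
from sklearn.model_selection import GridSearchCV
# Cross-validated alternative to the manual loop above (illustrative sketch only)
param_grid = {'max_depth': list(range(1, 15)),
              'min_samples_leaf': list(range(1, 15)),
              'min_samples_split': list(range(2, 15))}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                           param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)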
Results
- 2548 total decision trees were trained.
- Max testing accuracy was 86.0% (116 trees). Of the trees with 86.0% testing accuracy:
  - The simplest had 81 nodes (1 tree).
  - The most complex had 119 nodes (11 trees).
- Min testing accuracy was 74.3% (1 tree).
  - This tree significantly overfit the training data, with 223 nodes.
Of the 116 high-performing trees, 89.6% (104/116) had the min_samples_leaf hyperparameter set to 6, while the values of the other two hyperparameters were spread much more evenly. min_samples_leaf appears to have been the most important hyperparameter for keeping the trees from overfitting the data.
results[results['testing_accuracy']==86.0]['min_samples_leaf'].value_counts()
6 104
5 9
4 3
Name: min_samples_leaf, dtype: int64
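By contrast, the same value_counts check on the other two hyperparameters gives much flatter distributions among the 116 best trees; a quick sketch (output omitted):
# Distributions of the other two hyperparameters among the 86.0% trees
best_trees = results[results['testing_accuracy'] == 86.0]
print(best_trees['max_depth'].value_counts())
print(best_trees['min_samples_split'].value_counts())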
This is further demonstrated by grouping the results DataFrame by min_samples_leaf and taking the median testing accuracy:
(results[['min_samples_leaf','testing_accuracy']]
.groupby(['min_samples_leaf'])
.median()
.sort_values(['testing_accuracy'],
ascending=False)
.round(1))
min_samples_leaf | testing_accuracy
---|---
6 | 86.0 |
7 | 84.9 |
5 | 84.4 |
4 | 83.8 |
3 | 83.2 |
9 | 83.2 |
8 | 82.7 |
1 | 81.6 |
11 | 81.6 |
12 | 81.6 |
13 | 81.6 |
14 | 81.6 |
10 | 81.0 |
2 | 79.9 |
The best configuration was the tree that scored 86.0% testing accuracy with only 81 nodes. The hyperparameters for that tree were:
- max_depth = 7
- min_samples_leaf = 6
- min_samples_split = 14
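To reuse this configuration, the tree can be refit with those hyperparameters and checked with the helper defined earlier; a minimal sketch:
# Refit the best configuration found by the search and report its accuracies
best_model = DecisionTreeClassifier(max_depth=7,
                                    min_samples_leaf=6,
                                    min_samples_split=14).fit(X_train, y_train)
test_accuracy(best_model, True);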