# Decision Tree Titanic Example

## Training a sklearn Decision Tree on the Titanic Dataset

### Imports and Loading the Data

```python
import numpy as np
import pandas as pd
import random

random.seed(42)
```

```python
full_data = pd.read_csv('decision-tree-titanic-example/titanic_data.csv')

# Separate the features from the survival outcome
features_raw = full_data.drop(columns=['Survived'])
outcomes = full_data['Survived']

features_raw.head()
```


| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
```python
outcomes.head()
```

```
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64
```
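
A quick check of the class balance helps put the later accuracy numbers in context: a model that always predicts "did not survive" already gets the majority-class rate right. (The roughly 38% survival rate is a property of the standard Titanic training set, quoted here for context rather than captured from the original run.)

```python
# Fraction of passengers who survived; always predicting "did not
# survive" would therefore score roughly 1 - this value (~62%).
print(outcomes.mean())  # ~0.384 on the standard 891-row training set
```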


### Preprocessing the Data

- Survived: Outcome of survival (0 = No; 1 = Yes)
- Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- Name: Name of passenger
- Sex: Sex of the passenger
- Age: Age of the passenger (some entries contain NaN)
- SibSp: Number of siblings and spouses of the passenger aboard
- Parch: Number of parents and children of the passenger aboard
- Ticket: Ticket number of the passenger
- Fare: Fare paid by the passenger
- Cabin: Cabin number of the passenger (some entries contain NaN)
- Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

```python
features_raw.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Pclass       891 non-null    int64
 2   Name         891 non-null    object
 3   Sex          891 non-null    object
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64
 6   Parch        891 non-null    int64
 7   Ticket       891 non-null    object
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object
 10  Embarked     889 non-null    object
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB
```
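
The Non-Null Count column shows where the gaps are: Age is missing 177 values, Cabin 687, and Embarked 2. The same counts can be read off directly (a quick check; the figures follow from the info() output above, 891 − 714, 891 − 204, and 891 − 889):

```python
# Count missing values per column (Age: 177, Cabin: 687, Embarked: 2)
print(features_raw.isnull().sum())
```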


#### Drop the Name of the Passenger

```python
# Names are unique per passenger and carry no generalizable signal
features_no_names = features_raw.drop(columns=['Name'])
```


#### One-hot encode the features

Note that after one-hot encoding there are 839 columns. The only NaN values that survive the encoding are in Age (missing values in the object columns simply become all-zero dummy rows), and those are filled with 0.0.

```python
features = pd.get_dummies(features_no_names)
features = features.fillna(0.0)

features.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Columns: 839 entries, PassengerId to Embarked_S
dtypes: float64(2), int64(4), uint8(833)
memory usage: 766.7 KB
```

```python
features.head()
```

| | PassengerId | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Ticket_110152 | Ticket_110413 | ... | Cabin_F G73 | Cabin_F2 | Cabin_F33 | Cabin_F38 | Cabin_F4 | Cabin_G6 | Cabin_T | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 22.0 | 1 | 0 | 7.2500 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | 3 | 26.0 | 0 | 0 | 7.9250 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 4 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 5 | 3 | 35.0 | 0 | 0 | 8.0500 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |

5 rows × 839 columns
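
The jump to 839 columns comes almost entirely from one-hot encoding the high-cardinality Ticket and Cabin columns. A quick way to see the breakdown (a sketch; the per-column unique counts are properties of the standard Titanic training data rather than output captured from the original run):

```python
# Unique values per categorical column; one-hot encoding creates one
# dummy column per unique value (NaNs in Cabin/Embarked get no column).
print(features_no_names.select_dtypes(include='object').nunique())
# Roughly: Sex 2, Ticket 681, Cabin 147, Embarked 3; together with the
# 6 numeric columns that accounts for the 839 columns reported above.
```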

### Split the Data into Training and Testing Sets

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(features,
                                                    outcomes,
                                                    test_size=0.2,
                                                    random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

```
(712, 839) (179, 839) (712,) (179,)
```
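
One optional refinement, not used in this notebook: passing stratify=outcomes makes train_test_split preserve the survival rate in both subsets, which can matter on small or imbalanced datasets. A sketch:

```python
# Hypothetical stratified variant of the split above; keeps the
# survived/died ratio identical in the training and testing sets.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    features, outcomes, test_size=0.2, random_state=42, stratify=outcomes)
```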


### Train a Model with Default Hyperparameters

```python
default_model = DecisionTreeClassifier().fit(X_train, y_train)
default_model
```

```
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
```


### Testing the Default Model

```python
from sklearn.metrics import accuracy_score

def test_accuracy(model, print_results=False):
    """Return (training, testing) accuracy of a fitted model, in percent."""
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_accuracy = round(accuracy_score(y_train, y_train_pred) * 100., 1)
    test_accuracy = round(accuracy_score(y_test, y_test_pred) * 100., 1)

    if print_results:
        print('Training accuracy:  {0:2.1f}'.format(train_accuracy) + '%')
        print('Testing accuracy:    {0:2.1f}'.format(test_accuracy) + '%')

    return train_accuracy, test_accuracy

test_accuracy(default_model, True);
```

```
Training accuracy:  100.0%
Testing accuracy:    81.6%
```


100% accuracy on the training set against only 81.6% on the test set indicates overfitting: the unconstrained tree has memorized the training data rather than learning patterns that generalize.
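
The size of the fitted tree makes the overfitting concrete. A quick check (not part of the original notebook; the exact numbers vary between runs because random_state is not set):

```python
# An unconstrained tree keeps splitting until every training example is
# classified correctly, yielding a deep tree with many nodes here.
print('depth:', default_model.get_depth())
print('nodes:', default_model.tree_.node_count)
```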

### Improve the Model

Constrain the tree using the following hyperparameters:

- max_depth
- min_samples_leaf
- min_samples_split

The following code performs an exhaustive search over a grid of these hyperparameters: max_depth from 1 to 14, min_samples_leaf from 1 to 14, and min_samples_split from 2 to 14, training 14 × 14 × 13 = 2548 trees in total.

```python
# Exhaustive grid search: train one tree per hyperparameter combination
# and record its size and accuracy. Rows are collected in a list and the
# DataFrame is built once at the end (DataFrame.append was removed in
# pandas 2.0, and this is faster anyway).
rows = []
for max_depth_i in range(1, 15):
    for min_samples_leaf_i in range(1, 15):
        for min_samples_split_i in range(2, 15):
            model = (DecisionTreeClassifier(max_depth=max_depth_i,
                                            min_samples_leaf=min_samples_leaf_i,
                                            min_samples_split=min_samples_split_i)
                     .fit(X_train, y_train))

            training_accuracy, testing_accuracy = test_accuracy(model)
            n_nodes = model.tree_.node_count

            rows.append([max_depth_i, min_samples_leaf_i, min_samples_split_i,
                         n_nodes, training_accuracy, testing_accuracy])

results = pd.DataFrame(rows, columns=['max_depth',
                                      'min_samples_leaf',
                                      'min_samples_split',
                                      'n_nodes',
                                      'training_accuracy',
                                      'testing_accuracy'])

# Rank by testing accuracy (high to low), breaking ties by tree size
results = (results
           .sort_values(['testing_accuracy', 'n_nodes'],
                        ascending=[False, True])
           .reset_index(drop=True))

# The first four columns are integer-valued
for col in results.columns[:4]:
    results[col] = results[col].astype(int)

results.head()
```


| | max_depth | min_samples_leaf | min_samples_split | n_nodes | training_accuracy | testing_accuracy |
|---|---|---|---|---|---|---|
| 0 | 7 | 6 | 14 | 81 | 87.5 | 86.0 |
| 1 | 7 | 6 | 2 | 85 | 87.5 | 86.0 |
| 2 | 7 | 6 | 3 | 85 | 87.5 | 86.0 |
| 3 | 7 | 6 | 4 | 85 | 87.5 | 86.0 |
| 4 | 7 | 6 | 5 | 85 | 87.5 | 86.0 |
```python
results[results['testing_accuracy']==86.0].tail()
```


| | max_depth | min_samples_leaf | min_samples_split | n_nodes | training_accuracy | testing_accuracy |
|---|---|---|---|---|---|---|
| 111 | 14 | 6 | 8 | 119 | 88.2 | 86.0 |
| 112 | 14 | 6 | 9 | 119 | 88.2 | 86.0 |
| 113 | 14 | 6 | 10 | 119 | 88.2 | 86.0 |
| 114 | 14 | 6 | 11 | 119 | 88.2 | 86.0 |
| 115 | 14 | 6 | 12 | 119 | 88.2 | 86.0 |
```python
results.tail()
```


| | max_depth | min_samples_leaf | min_samples_split | n_nodes | training_accuracy | testing_accuracy |
|---|---|---|---|---|---|---|
| 2543 | 13 | 2 | 6 | 181 | 92.7 | 76.0 |
| 2544 | 13 | 2 | 4 | 207 | 93.0 | 76.0 |
| 2545 | 13 | 2 | 3 | 209 | 93.3 | 75.4 |
| 2546 | 14 | 2 | 3 | 223 | 93.5 | 75.4 |
| 2547 | 14 | 2 | 2 | 223 | 93.5 | 74.3 |
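
As an aside, scikit-learn's GridSearchCV automates this kind of sweep. Unlike the loop above, it scores each candidate by cross-validation on the training set rather than on the held-out test set, which avoids tuning to the test data. A minimal sketch of the equivalent search (not part of the original notebook):

```python
from sklearn.model_selection import GridSearchCV

# Same grid as the manual triple loop, but selection is by 5-fold
# cross-validated accuracy on the training set, not test accuracy.
param_grid = {'max_depth': range(1, 15),
              'min_samples_leaf': range(1, 15),
              'min_samples_split': range(2, 15)}
search = GridSearchCV(DecisionTreeClassifier(), param_grid,
                      scoring='accuracy', cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```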

### Results

- 2548 decision trees were trained in total.
- The maximum testing accuracy was 86.0%, achieved by 116 trees. Of those trees:
  - The simplest had 81 nodes (1 tree).
  - The most complex had 119 nodes (11 trees).
- The minimum testing accuracy was 74.3% (1 tree). That tree overfit significantly, with 223 nodes.

Of the 116 high-performing trees, 89.7% (104/116) had min_samples_leaf set to 6, while the counts were spread far more evenly across the other two hyperparameters. min_samples_leaf therefore appears to have been the most important control for keeping the trees from overfitting the data.

```python
results[results['testing_accuracy']==86.0]['min_samples_leaf'].value_counts()
```

```
6    104
5      9
4      3
Name: min_samples_leaf, dtype: int64
```


This is further demonstrated by grouping the results dataframe by min_samples_leaf and taking each group's median testing accuracy:

```python
(results[['min_samples_leaf', 'testing_accuracy']]
 .groupby(['min_samples_leaf'])
 .median()
 .sort_values(['testing_accuracy'],
              ascending=False)
 .round(1))


```

| min_samples_leaf | testing_accuracy |
|---|---|
| 6 | 86.0 |
| 7 | 84.9 |
| 5 | 84.4 |
| 4 | 83.8 |
| 3 | 83.2 |
| 9 | 83.2 |
| 8 | 82.7 |
| 1 | 81.6 |
| 11 | 81.6 |
| 12 | 81.6 |
| 13 | 81.6 |
| 14 | 81.6 |
| 10 | 81.0 |
| 2 | 79.9 |

The best configuration was the tree that scored 86.0% testing accuracy with only 81 nodes. The hyperparameters for that tree were:

- max_depth = 7
- min_samples_leaf = 6
- min_samples_split = 14
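
To reproduce that tree, the winning configuration can be retrained directly (a sketch reusing the split and the test_accuracy helper defined above; the expected accuracies are taken from the search results):

```python
# Retrain the best configuration found by the search
best_model = DecisionTreeClassifier(max_depth=7,
                                    min_samples_leaf=6,
                                    min_samples_split=14).fit(X_train, y_train)

# Per the results table: ~87.5% training accuracy, ~86.0% testing accuracy
test_accuracy(best_model, print_results=True)
```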