Decision Tree Titanic Example

Training a sklearn Decision Tree on the Titanic Dataset

Import Statements

import numpy as np
import pandas as pd
import random
random.seed(42)

Read Data

full_data = pd.read_csv('decision-tree-titanic-example/titanic_data.csv')

features_raw = full_data.drop(columns=['Survived'])
outcomes = full_data['Survived']
features_raw.head()
   PassengerId  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0            1       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1            2       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2            3       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
3            4       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
4            5       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
outcomes.head()
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Preprocessing the Data

  • Survived: Outcome of survival (0 = No; 1 = Yes)
  • Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
  • Name: Name of passenger
  • Sex: Sex of the passenger
  • Age: Age of the passenger (Some entries contain NaN)
  • SibSp: Number of siblings and spouses of the passenger aboard
  • Parch: Number of parents and children of the passenger aboard
  • Ticket: Ticket number of the passenger
  • Fare: Fare paid by the passenger
  • Cabin: Cabin number of the passenger (Some entries contain NaN)
  • Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)
features_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Pclass       891 non-null    int64  
 2   Name         891 non-null    object 
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB

Drop Name of passenger

features_no_names = features_raw.drop(columns=['Name'])

One-hot encode the features

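One-hot encoding expands the 10 remaining features to 839 columns, mostly because every distinct Ticket and Cabin value becomes its own indicator column. The remaining missing values (the unknown ages) are then filled with 0.0.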

features = pd.get_dummies(features_no_names)
features = features.fillna(0.0)
features.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Columns: 839 entries, PassengerId to Embarked_S
dtypes: float64(2), int64(4), uint8(833)
memory usage: 766.7 KB
features.head()
   PassengerId  Pclass   Age  SibSp  Parch     Fare  Sex_female  Sex_male  Ticket_110152  Ticket_110413  ...  Cabin_F G73  Cabin_F2  Cabin_F33  Cabin_F38  Cabin_F4  Cabin_G6  Cabin_T  Embarked_C  Embarked_Q  Embarked_S
0            1       3  22.0      1      0   7.2500           0         1              0              0  ...            0         0          0          0         0         0        0           0           0           1
1            2       1  38.0      1      0  71.2833           1         0              0              0  ...            0         0          0          0         0         0        0           1           0           0
2            3       3  26.0      0      0   7.9250           1         0              0              0  ...            0         0          0          0         0         0        0           0           0           1
3            4       1  35.0      1      0  53.1000           1         0              0              0  ...            0         0          0          0         0         0        0           0           0           1
4            5       3  35.0      0      0   8.0500           0         1              0              0  ...            0         0          0          0         0         0        0           0           0           1

5 rows × 839 columns
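
To see where the 839 columns come from, here is a minimal sketch on a made-up four-row frame (the values and the name `toy` are illustrative, not taken from the dataset). It assumes a pandas version whose `get_dummies` accepts the `dtype` argument. Each object column gets one indicator column per distinct value, so a column of unique ticket strings grows as fast as the data does.

```python
import pandas as pd

# Toy stand-in for the Titanic features (values are made up for illustration).
toy = pd.DataFrame({
    'Age':    [22.0, None, 26.0, 35.0],
    'Sex':    ['male', 'female', 'female', 'male'],
    'Ticket': ['A/5 21171', 'PC 17599', '113803', '373450'],  # all distinct
})

# get_dummies leaves numeric columns alone and expands each object column
# into one indicator column per distinct value.
encoded = pd.get_dummies(toy, dtype=int)

# Indicator columns never contain NaN, so fillna only touches Age here.
encoded = encoded.fillna(0.0)

n_sex_cols = sum(c.startswith('Sex_') for c in encoded.columns)
n_ticket_cols = sum(c.startswith('Ticket_') for c in encoded.columns)
print(encoded.shape, n_sex_cols, n_ticket_cols)
```

In this toy frame, four rows produce four Ticket columns, which mirrors how Ticket and Cabin account for most of the columns in the real data.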

Split the Data into Training and Testing Sets

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(features,
                                                    outcomes,
                                                    test_size=0.2,
                                                    random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(712, 839) (179, 839) (712,) (179,)

Train a Model with Default Hyperparameters

default_model = DecisionTreeClassifier().fit(X_train, y_train)
default_model
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
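
Note that the model above reports random_state=None. The random.seed(42) call in the import cell seeds only Python's built-in random module, which scikit-learn does not consult, so repeated runs may break ties between equally good splits differently. A minimal sketch of one way to pin the result, using synthetic data from `make_classification` as a stand-in for the Titanic features (an assumption made only to keep the example self-contained):

```python
import random

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the Titanic features (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

random.seed(42)  # seeds Python's random module only; sklearn ignores it

# Passing random_state to the estimator fixes the order in which features
# are considered, so two fits on the same data produce the same tree.
tree_a = DecisionTreeClassifier(random_state=42).fit(X, y)
tree_b = DecisionTreeClassifier(random_state=42).fit(X, y)

same_size = tree_a.tree_.node_count == tree_b.tree_.node_count
same_predictions = np.array_equal(tree_a.predict(X), tree_b.predict(X))
print(same_size, same_predictions)
```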

Testing the Default Model

from sklearn.metrics import accuracy_score
def test_accuracy(model, print_results=False):
    """Return (training, testing) accuracy of a fitted model, in percent."""
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Local names differ from the function name to avoid shadowing it.
    train_acc = round(accuracy_score(y_train, y_train_pred) * 100., 1)
    test_acc = round(accuracy_score(y_test, y_test_pred) * 100., 1)

    if print_results:
        print('Training accuracy:  {0:2.1f}%'.format(train_acc))
        print('Testing accuracy:    {0:2.1f}%'.format(test_acc))

    return train_acc, test_acc
test_accuracy(default_model, True);
Training accuracy:  100.0%
Testing accuracy:    81.6%

A training accuracy of 100% against a testing accuracy of 81.6% indicates overfitting: the unconstrained tree fits the training set perfectly but generalizes noticeably worse.
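
The same pattern can be reproduced on synthetic data. The sketch below is an illustration, not the Titanic experiment: it assumes noisy data from `make_classification` (with `flip_y` mislabeling a share of the samples) and compares an unconstrained tree with one limited to `max_depth=3`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (flip_y mislabels a share of samples); not the Titanic set.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# An unconstrained tree keeps splitting until every training leaf is pure.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# A depth limit stops the tree before it can memorize individual samples.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [('unconstrained', deep), ('max_depth=3', shallow)]:
    gap = model.score(X_tr, y_tr) - model.score(X_te, y_te)
    print(f'{name}: nodes={model.tree_.node_count}, train-test gap={gap:.3f}')
```

The gap between training and testing accuracy, together with the node count, gives a quick read on how much of the tree is fitting noise.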

Improve the Model

Tune the following hyperparameters:

  • max_depth
  • min_samples_leaf
  • min_samples_split

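The following code performs an exhaustive grid search over max_depth (1 to 14), min_samples_leaf (1 to 14), and min_samples_split (2 to 14), training one tree per combination and recording its size and accuracy.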

# Collect one row per trained tree, then build the DataFrame once.
# (DataFrame.append was removed in pandas 2.0.)
rows = []
for max_depth_i in range(1, 15):
    for min_samples_leaf_i in range(1, 15):
        for min_samples_split_i in range(2, 15):
            model = (DecisionTreeClassifier(max_depth=max_depth_i,
                                            min_samples_leaf=min_samples_leaf_i,
                                            min_samples_split=min_samples_split_i)
                     .fit(X_train, y_train))

            training_accuracy, testing_accuracy = test_accuracy(model)

            rows.append({'max_depth': max_depth_i,
                         'min_samples_leaf': min_samples_leaf_i,
                         'min_samples_split': min_samples_split_i,
                         'n_nodes': model.tree_.node_count,
                         'training_accuracy': training_accuracy,
                         'testing_accuracy': testing_accuracy})
results = pd.DataFrame(rows)
results = (results
           .sort_values(['testing_accuracy','n_nodes'],
                        ascending=[False,True])
           .reset_index(drop=True))
for col in results.columns[:4]:
    results[col] = results[col].astype(int)
results.head()
   max_depth  min_samples_leaf  min_samples_split  n_nodes  training_accuracy  testing_accuracy
0          7                 6                 14       81               87.5              86.0
1          7                 6                  2       85               87.5              86.0
2          7                 6                  3       85               87.5              86.0
3          7                 6                  4       85               87.5              86.0
4          7                 6                  5       85               87.5              86.0
results[results['testing_accuracy']==86.0].tail()
     max_depth  min_samples_leaf  min_samples_split  n_nodes  training_accuracy  testing_accuracy
111         14                 6                  8      119               88.2              86.0
112         14                 6                  9      119               88.2              86.0
113         14                 6                 10      119               88.2              86.0
114         14                 6                 11      119               88.2              86.0
115         14                 6                 12      119               88.2              86.0
results.tail()
      max_depth  min_samples_leaf  min_samples_split  n_nodes  training_accuracy  testing_accuracy
2543         13                 2                  6      181               92.7              76.0
2544         13                 2                  4      207               93.0              76.0
2545         13                 2                  3      209               93.3              75.4
2546         14                 2                  3      223               93.5              75.4
2547         14                 2                  2      223               93.5              74.3
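
The search above ranks trees by their accuracy on the test set. A common alternative, sketched here as a swapped-in technique rather than the method used in this notebook, is scikit-learn's `GridSearchCV`, which scores each combination by cross-validation on the training data and leaves the test set for a final check. The synthetic data and the deliberately small grid are assumptions made to keep the example quick and self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (an assumption for this sketch).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Deliberately small grid to keep the sketch fast; widen the lists to
# cover the same ranges as the nested loops.
param_grid = {
    'max_depth': [3, 5, 7],
    'min_samples_leaf': [1, 6],
    'min_samples_split': [2, 14],
}

# Each of the 12 combinations is scored by 5-fold cross-validation
# on the training split only.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

print(search.best_params_)
print('held-out accuracy:', search.score(X_te, y_te))
```

Because the test set plays no part in choosing the hyperparameters, the held-out accuracy printed at the end is a less optimistic estimate than the best score found by searching on the test set directly.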

Results

  • 2548 total decision trees were trained.
  • Max testing accuracy was 86.0% (116 trees). Of the trees with 86.0% testing accuracy:
    • The simplest had 81 nodes (1 tree).
    • The most complex had 119 nodes (11 trees).
  • Min testing accuracy was 74.3% (1 tree).
    • This tree significantly overfit the data, with 223 nodes.

Of the 116 highest-scoring trees, 89.7% (104 of 116) had min_samples_leaf set to 6. The trees were much more evenly split on the other two hyperparameters. min_samples_leaf therefore seems to have been the most important setting for keeping the trees from overfitting the data.

results[results['testing_accuracy']==86.0]['min_samples_leaf'].value_counts()
6    104
5      9
4      3
Name: min_samples_leaf, dtype: int64
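
As a quick arithmetic check, the share of top-scoring trees at each min_samples_leaf value can be recomputed from the counts printed above. This is a small sketch; the Series is typed in by hand from that output rather than derived from the results dataframe.

```python
import pandas as pd

# Counts copied from the value_counts output above.
counts = pd.Series({6: 104, 5: 9, 4: 3}, name='min_samples_leaf')

total = counts.sum()
shares = (counts / total * 100).round(1)

print(total)          # 116
print(shares.loc[6])  # 89.7
```

On the real dataframe, the same shares could be obtained in one step by passing normalize=True to value_counts.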

The median testing accuracy for each value of min_samples_leaf, computed over all 2548 trees in the results dataframe, points the same way.

(results[['min_samples_leaf','testing_accuracy']]
 .groupby(['min_samples_leaf'])
 .median()
 .sort_values(['testing_accuracy'],
              ascending=False)
 .round(1))
                  testing_accuracy
min_samples_leaf
6                             86.0
7                             84.9
5                             84.4
4                             83.8
3                             83.2
9                             83.2
8                             82.7
1                             81.6
11                            81.6
12                            81.6
13                            81.6
14                            81.6
10                            81.0
2                             79.9

The best configuration was the tree that scored 86.0% testing accuracy with only 81 nodes. The hyperparameters for that tree were: