Decision Tree Titanic Example

22 Feb 2020

Training a sklearn Decision Tree on the Titanic Dataset

Import Statements

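import numpy as np
import pandas as pd

# scikit-learn draws from NumPy's global random state when random_state is None,
# so seed NumPy; seeding Python's random module would not affect it.
np.random.seed(42)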

Read Data

full_data = pd.read_csv('decision-tree-titanic-example/titanic_data.csv')

features_raw = full_data.drop(columns=['Survived'])
outcomes = full_data['Survived']

features_raw.head()

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

outcomes.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Preprocessing the Data

Survived: Outcome of survival (0 = No; 1 = Yes)
Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
Name: Name of passenger
Sex: Sex of the passenger
Age: Age of the passenger (Some entries contain NaN)
SibSp: Number of siblings and spouses of the passenger aboard
Parch: Number of parents and children of the passenger aboard
Ticket: Ticket number of the passenger
Fare: Fare paid by the passenger
Cabin: Cabin number of the passenger (Some entries contain NaN)
Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

features_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Pclass       891 non-null    int64  
 2   Name         891 non-null    object 
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB

Drop `Name` of passenger

features_no_names = features_raw.drop(columns=['Name'])

One-hot encode the features

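Note that one-hot encoding expands the data to 839 columns, because every distinct string in Sex, Ticket, Cabin, and Embarked becomes its own indicator column. The remaining missing values (in Age) are then filled with 0.0.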

features = pd.get_dummies(features_no_names)
features = features.fillna(0.0)

features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Columns: 839 entries, PassengerId to Embarked_S
dtypes: float64(2), int64(4), uint8(833)
memory usage: 766.7 KB

features.head()

	PassengerId	Pclass	Age	SibSp	Fare	Sex_female	Sex_male	...	Embarked_C	Embarked_S
0	1	3	22.0	1	7.2500	0	1	...	0	1
1	2	1	38.0	1	71.2833	1	0	...	1	0
2	3	3	26.0	0	7.9250	1	0	...	0	1
3	4	1	35.0	1	53.1000	1	0	...	0	1
4	5	3	35.0	0	8.0500	0	1	...	0	1

5 rows × 839 columns
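
Most of the extra columns come from Ticket and Cabin, whose values are nearly unique per passenger. The sketch below illustrates the effect on a hypothetical four-row frame (values borrowed from the rows shown earlier, not the full dataset). Dropping Ticket and Cabin and imputing Age with the median are assumptions for illustration, not steps taken in these notes.

```python
import pandas as pd

# Hypothetical toy frame that mimics the Titanic column types.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "female"],
    "Age": [22.0, 38.0, None, 35.0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803"],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", "S", "S"],
})

# Encoding everything: each distinct Ticket and Cabin string becomes a column.
wide = pd.get_dummies(toy)

# Encoding only the low-cardinality string columns keeps the frame narrow.
narrow = pd.get_dummies(toy.drop(columns=["Ticket", "Cabin"]))

# Median imputation as one alternative to filling Age with 0.0.
narrow["Age"] = narrow["Age"].fillna(narrow["Age"].median())

print(wide.shape, narrow.shape)
print(narrow["Age"].tolist())
```

On the full dataset the same pattern would leave far fewer columns for the tree to split on, at the cost of discarding whatever signal Ticket and Cabin carry.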

Split the Data into Training and Testing Sets

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(features,
                                                    outcomes,
                                                    test_size=0.2,
                                                    random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(712, 839) (179, 839) (712,) (179,)
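
The 712/179 split follows from test_size=0.2 on 891 rows, with the test count rounded up. As a hedged sketch, the snippet below reproduces those sizes on stand-in arrays and adds stratify, an optional train_test_split argument that these notes did not use, to keep the class balance similar in both subsets. The label vector is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins with the same row count as the Titanic frame.
n_rows = 891
X = np.arange(n_rows).reshape(-1, 1)
y = np.array([1] * 342 + [0] * 549)  # illustrative labels, not the real column

# stratify=y asks for the same share of each class in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(X_tr.shape, X_te.shape)  # (712, 1) (179, 1)
print(round(y.mean(), 3), round(y_te.mean(), 3))
```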

Train a Model with Default Hyperparameters

default_model = DecisionTreeClassifier().fit(X_train, y_train)
default_model

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

Testing the Default Model

from sklearn.metrics import accuracy_score

def test_accuracy(model, print_results=False):
    """Return (train, test) accuracy in percent, rounded to one decimal place."""
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Local names differ from the function name to avoid shadowing it.
    train_acc = round(accuracy_score(y_train, y_train_pred) * 100, 1)
    test_acc = round(accuracy_score(y_test, y_test_pred) * 100, 1)

    if print_results:
        print('Training accuracy:  {0:2.1f}%'.format(train_acc))
        print('Testing accuracy:    {0:2.1f}%'.format(test_acc))

    return train_acc, test_acc

test_accuracy(default_model, True);

Training accuracy:  100.0%
Testing accuracy:    81.6%

The default tree fits the training set perfectly (100.0%) but scores only 81.6% on the test set. That 18.4-point gap indicates overfitting.
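
A tree with no growth limits keeps splitting until every training leaf is pure, which is why it can memorize noisy labels. The sketch below shows this on synthetic data from make_classification (a hypothetical stand-in, not the Titanic set) and compares it with a tree whose leaves must hold at least 6 samples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y assigns random labels to about 20% of the rows.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Unconstrained tree: splits until every training leaf is pure.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Constrained tree: every leaf must hold at least 6 training samples.
constrained = DecisionTreeClassifier(min_samples_leaf=6,
                                     random_state=0).fit(X_tr, y_tr)

for name, m in [("deep", deep), ("constrained", constrained)]:
    print(name, m.get_depth(), m.tree_.node_count,
          round(m.score(X_tr, y_tr), 3), round(m.score(X_te, y_te), 3))
```

The printed depth and node count give a rough sense of how much extra structure the unconstrained tree builds to reach perfect training accuracy.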

Improve the Model

Tune the following hyperparameters, each of which limits how far the tree can grow:

max_depth
min_samples_leaf
min_samples_split

The following code performs an exhaustive grid search over these three hyperparameters: max_depth and min_samples_leaf each range from 1 to 14, and min_samples_split ranges from 2 to 14.

# Collect one row per trained tree; DataFrame.append was removed in pandas 2.0.
rows = []
for max_depth_i in range(1, 15):
    for min_samples_leaf_i in range(1, 15):
        for min_samples_split_i in range(2, 15):
            model = (DecisionTreeClassifier(max_depth=max_depth_i,
                                            min_samples_leaf=min_samples_leaf_i,
                                            min_samples_split=min_samples_split_i)
                     .fit(X_train, y_train))

            training_accuracy, testing_accuracy = test_accuracy(model)

            rows.append({'max_depth': max_depth_i,
                         'min_samples_leaf': min_samples_leaf_i,
                         'min_samples_split': min_samples_split_i,
                         'n_nodes': model.tree_.node_count,
                         'training_accuracy': training_accuracy,
                         'testing_accuracy': testing_accuracy})

# Best testing accuracy first; ties are broken in favour of smaller trees.
results = (pd.DataFrame(rows)
           .sort_values(['testing_accuracy', 'n_nodes'],
                        ascending=[False, True])
           .reset_index(drop=True))

results.head()

	max_depth	min_samples_leaf	min_samples_split	n_nodes	training_accuracy	testing_accuracy
0	7	6	14	81	87.5	86.0
1	7	6	2	85	87.5	86.0
2	7	6	3	85	87.5	86.0
3	7	6	4	85	87.5	86.0
4	7	6	5	85	87.5	86.0

results[results['testing_accuracy']==86.0].tail()

	max_depth	min_samples_leaf	min_samples_split	n_nodes	training_accuracy	testing_accuracy
111	14	6	8	119	88.2	86.0
112	14	6	9	119	88.2	86.0
113	14	6	10	119	88.2	86.0
114	14	6	11	119	88.2	86.0
115	14	6	12	119	88.2	86.0

results.tail()

	max_depth	min_samples_leaf	min_samples_split	n_nodes	training_accuracy	testing_accuracy
2543	13	2	6	181	92.7	76.0
2544	13	2	4	207	93.0	76.0
2545	13	2	3	209	93.3	75.4
2546	14	2	3	223	93.5	75.4
2547	14	2	2	223	93.5	74.3

Results

2548 total decision trees were trained.
Max testing accuracy was 86.0% (116 trees). Of the trees with 86.0% testing accuracy:
- The simplest had 81 nodes (1 tree).
- The most complex had 119 nodes (11 trees).
Min testing accuracy was 74.3% (1 tree).
- This tree was significantly overfit, with 223 nodes.
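
The total of 2548 trees follows directly from the three ranges in the search loop. A quick check with the standard library (variable names here are illustrative):

```python
from itertools import product

# Same ranges as the nested loops in the search.
max_depths = range(1, 15)          # 14 values
min_samples_leafs = range(1, 15)   # 14 values
min_samples_splits = range(2, 15)  # 13 values

grid = list(product(max_depths, min_samples_leafs, min_samples_splits))
print(len(grid))  # 2548
```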

Of the 116 top-performing trees, 89.7% (104/116) had min_samples_leaf set to 6. The trees were much more evenly spread across values of the other two hyperparameters. min_samples_leaf therefore appears to have been the most important setting for keeping the trees from overfitting the data.

results[results['testing_accuracy']==86.0]['min_samples_leaf'].value_counts()

6    104
5      9
4      3
Name: min_samples_leaf, dtype: int64
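
The shares behind these counts can be computed directly. A small sketch, with the counts copied from the output above:

```python
import pandas as pd

# Counts copied from the value_counts output above.
leaf_counts = pd.Series({6: 104, 5: 9, 4: 3}, name="min_samples_leaf")

# Convert each count to a percentage of the 116 top-scoring trees.
shares = (leaf_counts / leaf_counts.sum() * 100).round(1)
print(shares)
```

On the real results frame, passing normalize=True to value_counts would give the same proportions as fractions.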

The same pattern appears when the median testing accuracy across all 2548 trees is grouped by min_samples_leaf:

(results[['min_samples_leaf','testing_accuracy']]
 .groupby(['min_samples_leaf'])
 .median()
 .sort_values(['testing_accuracy'],
              ascending=False)
 .round(1))

	testing_accuracy
min_samples_leaf
6	86.0
7	84.9
5	84.4
4	83.8
3	83.2
9	83.2
8	82.7
1	81.6
11	81.6
12	81.6
13	81.6
14	81.6
10	81.0
2	79.9

The best configuration was the tree that scored 86.0% testing accuracy with only 81 nodes. The hyperparameters for that tree were:

max_depth = 7
min_samples_leaf = 6
min_samples_split = 14
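
Because the search ranks trees by accuracy on the same test set used to report the final score, the 86.0% figure may be optimistic. One common alternative, which these notes did not use, is to choose hyperparameters by cross-validation on the training data and to score the test set only once at the end. The sketch below uses scikit-learn's GridSearchCV on synthetic data; the dataset and the reduced grid are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the Titanic set).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# A reduced grid over the same three hyperparameters (3 * 3 * 2 = 18 combos).
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 6, 12],
    "min_samples_split": [2, 14],
}

# 5-fold cross-validation on the training data only.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.best_score_, 3))        # mean cross-validated accuracy
print(round(search.score(X_te, y_te), 3))  # held-out accuracy, used once
```

With this setup the test set plays no part in choosing the hyperparameters, so the final score is a fairer estimate of performance on unseen data.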

This content is taken from notes I took while pursuing the Intro to Machine Learning with PyTorch nanodegree certification.