Decision Tree Implementations

Manually Calculating Information Gain of Different Split Criteria

Use a small dataset of bugs with three columns: the species label and two features (color and length).

import pandas as pd
from math import log
b = pd.read_csv('decision-tree-implementations/bug_data.csv')

   Species  Color  Length (mm)
0    Mobug  Brown         11.6
1    Mobug   Blue         16.3
2    Lobug   Blue         15.1
3    Lobug  Green         23.7
4    Lobug   Blue         18.4

Which of the following splitting criteria provides the most information gain for discriminating Mobugs from Lobugs?

  1. Color = Brown
  2. Color = Blue
  3. Color = Green
  4. Length < 17 mm
  5. Length < 20 mm


$$\text{Information Gain} = \text{Entropy}(\text{Parent}) - \left[\frac{m}{m+n}\,\text{Entropy}(\text{Child}_1) + \frac{n}{m+n}\,\text{Entropy}(\text{Child}_2)\right]$$
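To make the formula concrete, here is a minimal sketch that computes it from class counts alone, taking m and n to be the number of samples in each child. The function names `entropy` and `information_gain` and the example counts are illustrative assumptions; they are not part of the notebook and are not taken from bug_data.csv.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a node, given the count of each class."""
    total = sum(counts)
    # Classes with zero count contribute nothing (p * log2(p) tends to 0 as p -> 0).
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent, child_1, child_2):
    """Parent entropy minus the size-weighted average entropy of the two children."""
    m, n = sum(child_1), sum(child_2)
    weighted = (m / (m + n)) * entropy(child_1) + (n / (m + n)) * entropy(child_2)
    return entropy(parent) - weighted


# Hypothetical counts, ordered as [Mobug, Lobug].
perfect = information_gain([5, 5], [5, 0], [0, 5])  # children are pure
useless = information_gain([5, 5], [2, 2], [3, 3])  # children keep the 50/50 mix
print(perfect, useless)
```

A split that separates the classes completely recovers the parent's full entropy (1 bit for a 50/50 parent), while a split whose children keep the parent's class proportions gains nothing.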

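def calculate_entropy(sub):
    """Two-class (Lobug vs. Mobug) entropy, in bits, of the rows in `sub`."""
    l = sub[sub['Species']=='Lobug'].shape[0]
    m = sub[sub['Species']=='Mobug'].shape[0]
    d = sub['Species'].shape[0]

    # Skip empty classes: log(0) raises ValueError, and p*log(p) -> 0 as p -> 0.
    entropy = sum(-1 * (c/d) * log((c/d), 2) for c in (l, m) if c > 0)
    return entropy


def calculate_info_gain(series):
    """Information gain of splitting `b` with the boolean mask `series`."""
    sub_1 = b[series].copy()
    sub_2 = b[~series].copy()
    parent_entropy = calculate_entropy(b)
    # Weight each child's entropy by its share of the parent's rows.
    weight_1 = sub_1.shape[0] / b.shape[0]
    weight_2 = sub_2.shape[0] / b.shape[0]
    entropy_1 = calculate_entropy(sub_1)
    entropy_2 = calculate_entropy(sub_2)
    return parent_entropy - (weight_1 * entropy_1 + weight_2 * entropy_2)
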
for series, string in [(b['Color']=='Brown', "b['Color']=='Brown'"),
                       (b['Color']=='Blue', "b['Color']=='Blue'"),
                       (b['Color']=='Green', "b['Color']=='Green'"),
                       (b['Length (mm)'] < 17, "b['Length (mm)'] < 17"),
                       (b['Length (mm)'] < 20, "b['Length (mm)'] < 20")]:
    info_gain = calculate_info_gain(series)
    print(string, '\t\t{:1.4f}'.format(info_gain))
b['Color']=='Brown' 		0.0616
b['Color']=='Blue' 		0.0006
b['Color']=='Green' 		0.0428
b['Length (mm)'] < 17 		0.1126
b['Length (mm)'] < 20 		0.1007
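The length thresholds above were fixed in advance. As a rough sketch of how a threshold could be searched for instead, the snippet below scores the midpoint between each pair of consecutive sorted lengths and keeps the one with the highest gain. It uses only the five preview rows shown earlier, so its numbers will not match the full-dataset output above; the helper names `entropy` and `gain_for_threshold` are assumptions made for illustration.

```python
from math import log2

# The five preview rows shown earlier: (species, length in mm).
rows = [("Mobug", 11.6), ("Mobug", 16.3), ("Lobug", 15.1),
        ("Lobug", 23.7), ("Lobug", 18.4)]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels; 0 for an empty list."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs)


def gain_for_threshold(rows, threshold):
    """Information gain of the split `length < threshold`."""
    left = [s for s, length in rows if length < threshold]
    right = [s for s, length in rows if length >= threshold]
    parent = [s for s, _ in rows]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(rows)
    return entropy(parent) - weighted


# Candidate thresholds: midpoints between consecutive distinct sorted lengths.
lengths = sorted({length for _, length in rows})
candidates = [(a + b) / 2 for a, b in zip(lengths, lengths[1:])]

best_threshold = max(candidates, key=lambda t: gain_for_threshold(rows, t))
best_gain = gain_for_threshold(rows, best_threshold)
print(f"best threshold: {best_threshold:.2f} mm, gain: {best_gain:.4f}")
```

Midpoints are a common choice because any threshold between two adjacent observed values produces the same partition of the training rows.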

Training a sklearn Decision Tree on 2D Binary Classification Data

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

Import Data

# header=None is an assumption: the original call is cut off after the file path.
data = np.asarray(pd.read_csv('decision-tree-implementations/2d_class_data.csv', header=None))

Split data into features and labels

X = data[:,0:2]   # Features
y = data[:,2]     # Labels

Set up the Model

No hyperparameters are specified, so the tree keeps splitting until every leaf is pure. Expect it to overfit the training data.

model = DecisionTreeClassifier()

Train the Model

model.fit(X, y)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

Predict using the Model

y_pred = model.predict(X)

Calculate Accuracy

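The model scores 100% accuracy on the training data, which is what an unconstrained tree that has memorized its training set produces. Confirming overfitting requires comparing this against accuracy on held-out data.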

acc = accuracy_score(y, y_pred)
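Training accuracy alone cannot separate a good fit from memorization, so one way to check for overfitting is to hold out part of the data and compare the two scores. The sketch below does this on synthetic data standing in for 2d_class_data.csv (the data-generating rule, the 25% test split, and the `max_depth` and `min_samples_leaf` values are all illustrative assumptions, not settings from the notebook).

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's data: two features, noisy binary labels.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = ((X[:, 0] + X[:, 1] + rng.normal(0, 0.2, size=200)) > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Default tree: no limits, so it grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Constrained tree: depth and leaf size are illustrative values, not tuned.
small_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                    random_state=0).fit(X_train, y_train)

for name, tree in [("default", full_tree), ("constrained", small_tree)]:
    train_acc = accuracy_score(y_train, tree.predict(X_train))
    test_acc = accuracy_score(y_test, tree.predict(X_test))
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A large gap between training and test accuracy for the default tree would indicate overfitting; limiting depth or leaf size usually narrows that gap at the cost of some training accuracy.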