Decision Tree Implementations

Manually Calculating Information Gain of Different Split Criteria

Use a small dataset of bugs with two features, Color and Length (mm), and a Species label (Mobug or Lobug).

import pandas as pd
from math import log
b = pd.read_csv('decision-tree-implementations/bug_data.csv')
b.head()

  Species  Color  Length (mm)
0   Mobug  Brown         11.6
1   Mobug   Blue         16.3
2   Lobug   Blue         15.1
3   Lobug  Green         23.7
4   Lobug   Blue         18.4
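
The class distribution of the full dataset, which the parent entropy depends on, can be inspected with value_counts; only the first five rows are shown above, so the counts themselves are not reproduced here.

b['Species'].value_counts()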

Which of the following splitting criteria provides the most information gain for discriminating Mobugs from Lobugs?

  1. Color = Brown
  2. Color = Blue
  3. Color = Green
  4. Length < 17 mm
  5. Length < 20 mm

$$Entropy = -\sum^n_{i=1} p_i \log_2 (p_i)$$

$$Information\ Gain = Entropy(Parent) - [\frac{m}{m+n} Entropy(Child_1) + \frac{n}{m+n} Entropy(Child_2)]$$
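
For example, a node containing 3 Lobugs and 1 Mobug (an illustrative node, not one of the candidate splits above) has entropy

$$-\frac{3}{4} \log_2 \frac{3}{4} - \frac{1}{4} \log_2 \frac{1}{4} \approx 0.3113 + 0.5000 = 0.8113$$

while a pure node has entropy 0 and a 50/50 node has entropy 1.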

def calculate_entropy(sub):
    # Count each species in this subset, plus the subset size
    l = sub[sub['Species']=='Lobug'].shape[0]
    m = sub[sub['Species']=='Mobug'].shape[0]
    d = sub['Species'].shape[0]

    # A pure node (only one species present) has zero entropy;
    # returning early also avoids log(0) below
    if l == 0 or m == 0:
        return 0.0

    # Shannon entropy: -sum(p_i * log2(p_i)) over the two species
    return (-1 * (l/d) * log((l/d), 2)) + (-1 * (m/d) * log((m/d), 2))
def calculate_info_gain(series):
    # Split the dataset into rows that satisfy the criterion and rows that do not
    sub_1 = b[series].copy()
    sub_2 = b[~series].copy()

    parent_entropy = calculate_entropy(b)

    # Weight each child by its share of the parent's rows
    weight_1 = sub_1.shape[0] / b.shape[0]
    weight_2 = sub_2.shape[0] / b.shape[0]

    entropy_1 = calculate_entropy(sub_1)
    entropy_2 = calculate_entropy(sub_2)

    # Information gain = parent entropy minus the weighted average of the child entropies
    return parent_entropy - (weight_1 * entropy_1 + weight_2 * entropy_2)
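
As a quick sanity check (assuming scipy is installed; this cross-check is not part of the original notebook), the parent entropy can be compared against scipy.stats.entropy, which computes the same Shannon entropy from the class counts:

from scipy.stats import entropy as scipy_entropy

# scipy normalizes the counts to probabilities; base=2 matches the log base used above
scipy_entropy(b['Species'].value_counts(), base=2)  # should equal calculate_entropy(b)
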
for series, string in [(b['Color']=='Brown', "b['Color']=='Brown'"),
                       (b['Color']=='Blue', "b['Color']=='Blue'"),
                       (b['Color']=='Green', "b['Color']=='Green'"),
                       (b['Length (mm)'] < 17, "b['Length (mm)'] < 17"),
                       (b['Length (mm)'] < 20, "b['Length (mm)'] < 20")]:
    info_gain = calculate_info_gain(series)
    print(string, '\t\t{:1.4f}'.format(info_gain))
b['Color']=='Brown' 		0.0616
b['Color']=='Blue' 		0.0006
b['Color']=='Green' 		0.0428
b['Length (mm)'] < 17 		0.1126
b['Length (mm)'] < 20 		0.1007
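
Of the five candidates, Length < 17 mm provides the most information gain (0.1126), so it is the best single split for discriminating Mobugs from Lobugs.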

Training a sklearn Decision Tree on 2D Binary Classification Data

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

Import Data

data = np.asarray(pd.read_csv('decision-tree-implementations/2d_class_data.csv', 
                              header=None))

Split data into features and labels

X = data[:,0:2]   # Features
y = data[:,2]     # Labels

Set up the Model

Note that, since no hyperparameters are specified (the default max_depth=None places no limit on tree growth), the decision tree is likely to overfit the training data.

model = DecisionTreeClassifier()
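
If overfitting were a concern, hyperparameters could constrain the tree. A minimal sketch (the values below are illustrative and are not used in the rest of this notebook):

# Limiting depth and leaf size reduces the tree's capacity to memorize the training set
constrained_model = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)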

Train the Model

model.fit(X, y)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

Predict using the Model

y_pred = model.predict(X)

Calculate Accuracy

The 100% accuracy below shows that the model has perfectly memorized the training data, which is consistent with overfitting; measuring generalization would require a held-out test set (sketched at the end of this section).

acc = accuracy_score(y, y_pred)
acc
1.0
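
To measure generalization rather than memorization, the data would need a held-out split. A minimal sketch using train_test_split (the 25% test fraction and random_state are arbitrary choices, and the resulting test accuracy is not shown here):

from sklearn.model_selection import train_test_split

# Hold out 25% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

held_out_model = DecisionTreeClassifier()
held_out_model.fit(X_train, y_train)

# Accuracy on unseen rows reflects generalization, unlike the 1.0 above
accuracy_score(y_test, held_out_model.predict(X_test))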