# Decision Tree Implementations

## Manually Calculating Information Gain of Different Split Criteria

This example uses a small dataset of bugs with three columns: `Species` (the label to predict), `Color`, and `Length (mm)`.

import pandas as pd
from math import log

b = pd.read_csv('decision-tree-implementations/bug_data.csv')


|   | Species | Color | Length (mm) |
|---|---------|-------|-------------|
| 0 | Mobug   | Brown | 11.6        |
| 1 | Mobug   | Blue  | 16.3        |
| 2 | Lobug   | Blue  | 15.1        |
| 3 | Lobug   | Green | 23.7        |
| 4 | Lobug   | Blue  | 18.4        |

Which of the following splitting criteria provides the most information gain for discriminating Mobugs from Lobugs?

1. Color = Brown
2. Color = Blue
3. Color = Green
4. Length < 17 mm
5. Length < 20 mm

$$Entropy = -\sum^n_{i=1} p_i \log_2 (p_i)$$

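$$Information\ Gain = Entropy(Parent) - \left[\frac{m}{m+n} Entropy(Child_1) + \frac{n}{m+n} Entropy(Child_2)\right]$$

Here $m$ and $n$ are the numbers of samples that fall into $Child_1$ and $Child_2$, so each child's entropy is weighted by its share of the parent's samples.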
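As a worked instance of these formulas, consider only the five rows displayed above (an assumption made for illustration; the full CSV may contain more rows). The parent node then holds 2 Mobugs and 3 Lobugs, and the split `Length < 17 mm` sends 2 Mobugs and 1 Lobug to $Child_1$ and the remaining 2 Lobugs to $Child_2$. A pure child has entropy 0, using the convention $0 \log_2 0 = 0$.

$$Entropy(Parent) = -\left(\tfrac{2}{5}\log_2\tfrac{2}{5} + \tfrac{3}{5}\log_2\tfrac{3}{5}\right) \approx 0.971$$

$$Entropy(Child_1) = -\left(\tfrac{2}{3}\log_2\tfrac{2}{3} + \tfrac{1}{3}\log_2\tfrac{1}{3}\right) \approx 0.918, \qquad Entropy(Child_2) = 0$$

$$Information\ Gain \approx 0.971 - \left[\tfrac{3}{5}(0.918) + \tfrac{2}{5}(0)\right] \approx 0.420$$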

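def calculate_entropy(sub):
    # Row counts per species; .shape is a (rows, columns) tuple, so take [0].
    l = sub[sub['Species']=='Lobug'].shape[0]
    m = sub[sub['Species']=='Mobug'].shape[0]
    d = sub['Species'].shape[0]

    # Sum -p * log2(p) over both classes; skip empty classes to avoid log(0).
    entropy = sum(-1 * (k/d) * log((k/d), 2) for k in (l, m) if k > 0)

    return entropy

def calculate_info_gain(series):
    # Split the data with the boolean mask: matching rows and the rest.
    sub_1 = b[series].copy()
    sub_2 = b[~series].copy()

    parent_entropy = calculate_entropy(b)

    # Weight each child by its share of the parent's rows.
    weight_1 = sub_1.shape[0] / b.shape[0]
    weight_2 = sub_2.shape[0] / b.shape[0]

    entropy_1 = calculate_entropy(sub_1)
    entropy_2 = calculate_entropy(sub_2)

    return parent_entropy - (weight_1 * entropy_1 + weight_2 * entropy_2)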

for series, string in [(b['Color']=='Brown', "b['Color']=='Brown'"),
                       (b['Color']=='Blue', "b['Color']=='Blue'"),
                       (b['Color']=='Green', "b['Color']=='Green'"),
                       (b['Length (mm)'] < 17, "b['Length (mm)'] < 17"),
                       (b['Length (mm)'] < 20, "b['Length (mm)'] < 20")]:
    info_gain = calculate_info_gain(series)
    print(string, '\t\t{:1.4f}'.format(info_gain))

b['Color']=='Brown' 		0.0616
b['Color']=='Blue' 		0.0006
b['Color']=='Green' 		0.0428
b['Length (mm)'] < 17 		0.1126
b['Length (mm)'] < 20 		0.1007
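The same calculation can be sketched without relying on the global `b` or on hard-coded species names. The helpers `entropy` and `info_gain` below are hypothetical names introduced for illustration, and the DataFrame `preview` contains only the five rows displayed earlier, which is an assumption since the full `bug_data.csv` may hold more rows.

```python
import numpy as np
import pandas as pd

# Only the five rows displayed earlier; the full bug_data.csv may hold more.
preview = pd.DataFrame({
    "Species": ["Mobug", "Mobug", "Lobug", "Lobug", "Lobug"],
    "Color": ["Brown", "Blue", "Blue", "Green", "Blue"],
    "Length (mm)": [11.6, 16.3, 15.1, 23.7, 18.4],
})


def entropy(labels):
    """Shannon entropy (base 2) of a pandas Series of class labels."""
    if len(labels) == 0:
        return 0.0
    # value_counts lists only the classes that are present, so log2(0) never occurs.
    p = labels.value_counts(normalize=True).to_numpy()
    # sum(p * log2(1/p)) equals -sum(p * log2(p)).
    return float(np.sum(p * np.log2(1.0 / p)))


def info_gain(df, mask, target="Species"):
    """Information gain from splitting df into df[mask] and df[~mask]."""
    left, right = df.loc[mask, target], df.loc[~mask, target]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(df)
    return entropy(df[target]) - weighted


criteria = {
    "Color = Brown": preview["Color"] == "Brown",
    "Color = Blue": preview["Color"] == "Blue",
    "Color = Green": preview["Color"] == "Green",
    "Length < 17 mm": preview["Length (mm)"] < 17,
    "Length < 20 mm": preview["Length (mm)"] < 20,
}
gains = {name: info_gain(preview, mask) for name, mask in criteria.items()}
for name, gain in gains.items():
    print(f"{name:<16}{gain:.4f}")
```

Because this sketch sees only five rows, its numbers differ from the output printed above. Counting classes with `value_counts` also keeps the function usable for pure subsets and for more than two classes.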


## Training a sklearn Decision Tree on 2D Binary Classification Data

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np


Import Data

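# header=None is an assumption (the original call is cut off at this point);
# it treats the first row of the file as data rather than as column names.
data = np.asarray(pd.read_csv('decision-tree-implementations/2d_class_data.csv',
                              header=None))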


Split data into features and labels

X = data[:,0:2]   # Features
y = data[:,2]     # Labels


Set up the Model

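No hyperparameters are specified, so the model uses scikit-learn's defaults (for example `max_depth=None` and `min_samples_leaf=1`). With these defaults the tree can keep splitting until every leaf is pure, so it is likely to overfit the training data.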

model = DecisionTreeClassifier()
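For comparison, here is a minimal sketch of how hyperparameters limit tree growth. It uses synthetic data from `make_moons` as a stand-in for the 2D dataset (an assumption, not the original CSV), and the values `max_depth=3` and `min_samples_leaf=5` are arbitrary illustrations rather than tuned settings.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 2D data (an assumption; not the original CSV).
X_demo, y_demo = make_moons(n_samples=200, noise=0.3, random_state=0)

# Default settings: no limit on depth or leaf size.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Illustrative limits; the values 3 and 5 are arbitrary, not tuned.
small_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                    random_state=0).fit(X_demo, y_demo)

print("default:    ", full_tree.get_depth(), "levels,",
      full_tree.get_n_leaves(), "leaves")
print("constrained:", small_tree.get_depth(), "levels,",
      small_tree.get_n_leaves(), "leaves")
```

The constrained tree cannot exceed three levels, so it has at most eight leaves, while the default tree grows as far as the data allows.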


Train the Model

model.fit(X, y)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')


Predict using the Model

y_pred = model.predict(X)


Calculate Accuracy

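The accuracy of 1.0 below is computed on the same data the model was trained on. It shows that the tree fits the training set perfectly, which is consistent with overfitting; confirming it would require evaluating the model on held-out data.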

acc = accuracy_score(y, y_pred)
acc

1.0
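One way to check for overfitting is to hold out part of the data and compare training accuracy with test accuracy. The sketch below does this on synthetic `make_moons` data (an assumption standing in for the original CSV); the 25% test fraction and the `random_state` values are arbitrary choices.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 2D data (an assumption; not the original CSV).
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=0)

# Hold out a quarter of the samples; the model never sees them during fit.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A test accuracy well below the training accuracy would indicate that the tree has memorized noise in the training set rather than learned a pattern that generalizes.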