Decision Tree Implementations
Manually Calculating Information Gain of Different Split Criteria
This section uses a small dataset of bugs with three columns: Species (the label to predict), Color, and Length (mm).
import pandas as pd
from math import log
b = pd.read_csv('decision-tree-implementations/bug_data.csv')
b.head()
| | Species | Color | Length (mm) |
|---|---|---|---|
| 0 | Mobug | Brown | 11.6 |
| 1 | Mobug | Blue | 16.3 |
| 2 | Lobug | Blue | 15.1 |
| 3 | Lobug | Green | 23.7 |
| 4 | Lobug | Blue | 18.4 |
Which of the following splitting criteria provides the most information gain for discriminating Mobugs from Lobugs?
- Color = Brown
- Color = Blue
- Color = Green
- Length < 17 mm
- Length < 20 mm
$$Entropy = -\sum^n_{i=1} p_i \log_2 (p_i)$$
$$Information\ Gain = Entropy(Parent) - [\frac{m}{m+n} Entropy(Child_1) + \frac{n}{m+n} Entropy(Child_2)]$$
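As a quick sanity check on the two formulas, here is a minimal sketch that works from class counts rather than a DataFrame. The helper names `entropy` and `information_gain` and the 4-versus-4 node are hypothetical, chosen only for illustration; they are not part of the bug dataset.

```python
from math import log2


def entropy(counts):
    """Entropy in bits for a list of class counts, e.g. [n_mobug, n_lobug]."""
    total = sum(counts)
    # Skip empty classes: 0 * log2(0) is taken to be 0.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent, child_1, child_2):
    """Parent entropy minus the size-weighted entropy of the two children."""
    m, n = sum(child_1), sum(child_2)
    weighted = (m / (m + n)) * entropy(child_1) + (n / (m + n)) * entropy(child_2)
    return entropy(parent) - weighted


# Hypothetical node holding 4 Mobugs and 4 Lobugs.
parent = [4, 4]
perfect_gain = information_gain(parent, [4, 0], [0, 4])  # both children are pure
useless_gain = information_gain(parent, [2, 2], [2, 2])  # children mirror the parent

print(entropy(parent))  # 1.0
print(perfect_gain)     # 1.0
print(useless_gain)     # 0.0
```

A split that separates the classes completely recovers the full 1 bit of parent entropy, while a split that leaves the class mix unchanged gains nothing.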
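def calculate_entropy(sub):
    # Entropy (in bits) of the Species labels in a subset of the bug data
    d = sub['Species'].shape[0]
    entropy = 0.0
    for species in ('Lobug', 'Mobug'):
        p = sub[sub['Species'] == species].shape[0] / d
        if p > 0:  # treat 0 * log2(0) as 0 so a pure subset does not raise an error
            entropy += -1 * p * log(p, 2)
    return entropy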
def calculate_info_gain(series):
    # series is a boolean mask over b that defines the split
    sub_1 = b[series].copy()
    sub_2 = b[~series].copy()
    parent_entropy = calculate_entropy(b)
    weight_1 = sub_1.shape[0] / b.shape[0]
    weight_2 = sub_2.shape[0] / b.shape[0]
    entropy_1 = calculate_entropy(sub_1)
    entropy_2 = calculate_entropy(sub_2)
    return parent_entropy - (weight_1 * entropy_1 + weight_2 * entropy_2)
for series, string in [(b['Color']=='Brown', "b['Color']=='Brown'"),
                       (b['Color']=='Blue', "b['Color']=='Blue'"),
                       (b['Color']=='Green', "b['Color']=='Green'"),
                       (b['Length (mm)'] < 17, "b['Length (mm)'] < 17"),
                       (b['Length (mm)'] < 20, "b['Length (mm)'] < 20")]:
    info_gain = calculate_info_gain(series)
    print(string, '\t\t{:1.4f}'.format(info_gain))
b['Color']=='Brown' 0.0616
b['Color']=='Blue' 0.0006
b['Color']=='Green' 0.0428
b['Length (mm)'] < 17 0.1126
b['Length (mm)'] < 20 0.1007
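To answer the question directly, the printed gains can be ranked. The sketch below simply copies the five values from the output above into a dictionary (the name `gains` and the label strings are my own) and picks the largest.

```python
# Information gain values copied from the printed output above.
gains = {
    "Color = Brown": 0.0616,
    "Color = Blue": 0.0006,
    "Color = Green": 0.0428,
    "Length < 17 mm": 0.1126,
    "Length < 20 mm": 0.1007,
}

# The criterion with the highest information gain is the best first split.
best_split = max(gains, key=gains.get)
print(best_split)  # Length < 17 mm

# Full ranking, from most to least informative.
ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)
for name, gain in ranked:
    print(f"{name:<16}{gain:.4f}")
```

Of the five candidates, Length < 17 mm gives the most information gain, with Length < 20 mm close behind.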
Training a sklearn Decision Tree on 2D Binary Classification Data
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
Import Data
data = np.asarray(pd.read_csv('decision-tree-implementations/2d_class_data.csv',
header=None))
Split data into features and labels
X = data[:,0:2] # Features
y = data[:,2] # Labels
Set up the Model
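The model is created with no hyperparameters specified, so scikit-learn's defaults apply (for example, max_depth=None and min_samples_leaf=1, as shown in the output below). With these defaults the tree can keep splitting until every leaf is pure, so it will tend to overfit the training data.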
model = DecisionTreeClassifier()
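For comparison, here is a minimal sketch of how the tree's growth could be limited. The data is a synthetic stand-in generated with NumPy (an assumption, not the original CSV), and the values `max_depth=3` and `min_samples_leaf=5` are arbitrary illustrations rather than tuned settings.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 2D data (an assumption, not the original CSV).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] + X_demo[:, 1] + noise > 0).astype(int)

# Default settings: the tree may grow until every leaf is pure.
default_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Constrained settings: cap the depth and require at least 5 samples per leaf.
limited_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X_demo, y_demo)

print("default:", default_tree.get_depth(), "levels,",
      default_tree.get_n_leaves(), "leaves")
print("limited:", limited_tree.get_depth(), "levels,",
      limited_tree.get_n_leaves(), "leaves")
```

The constrained tree cannot carve out a separate leaf for every noisy point, which is the usual way to trade a little training accuracy for better generalization.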
Train the Model
model.fit(X, y)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')
Predict using the Model
y_pred = model.predict(X)
Calculate Accuracy
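The accuracy below is computed on the same data the model was trained on. A value of 100% shows that the tree fit the training set perfectly, which is consistent with overfitting; confirming it would require evaluating the model on held-out data.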
acc = accuracy_score(y, y_pred)
acc
1.0
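One way to check for overfitting is to hold out part of the data and compare training accuracy with test accuracy. The sketch below again uses a synthetic stand-in dataset (an assumption, not the original CSV); the 25% test size and the random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 2D data (an assumption, not the original CSV).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
noise = rng.normal(scale=0.5, size=300)
y_demo = (X_demo[:, 0] + X_demo[:, 1] + noise > 0).astype(int)

# Hold out 25% of the samples; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

A large gap between the two numbers is the signal to look for: perfect training accuracy paired with noticeably lower test accuracy suggests the tree has memorized the training set.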
This content is taken from notes I took while pursuing the Intro to Machine Learning with PyTorch nanodegree certification.