# Decision Tree Implementations

## Manually Calculating Information Gain of Different Split Criteria

Use a small dataset of bugs with two features (color and length) and a species label.

```
import pandas as pd
from math import log
```

```
b = pd.read_csv('decision-tree-implementations/bug_data.csv')
b.head()
```

| | Species | Color | Length (mm) |
|---|---|---|---|
| 0 | Mobug | Brown | 11.6 |
| 1 | Mobug | Blue | 16.3 |
| 2 | Lobug | Blue | 15.1 |
| 3 | Lobug | Green | 23.7 |
| 4 | Lobug | Blue | 18.4 |

Which of the following splitting criteria provides the most information gain for discriminating Mobugs from Lobugs?

- Color = Brown
- Color = Blue
- Color = Green
- Length < 17 mm
- Length < 20 mm

$$Entropy = -\sum^n_{i=1} p_i \log_2 (p_i)$$

$$Information\ Gain = Entropy(Parent) - [\frac{m}{m+n} Entropy(Child_1) + \frac{n}{m+n} Entropy(Child_2)]$$
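As a sanity check on the two formulas, here is a minimal sketch that applies them by hand to only the five preview rows shown in the table above. The helper names `entropy` and `info_gain` are illustrative, and because the full CSV contains more rows, these numbers will not match the results computed on the whole dataset.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of a list of class labels (an empty list gives 0)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(labels, mask):
    """Parent entropy minus the size-weighted entropy of the two children."""
    left = [lab for lab, keep in zip(labels, mask) if keep]
    right = [lab for lab, keep in zip(labels, mask) if not keep]
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted

# Only the five rows printed by b.head(), not the full dataset.
species = ['Mobug', 'Mobug', 'Lobug', 'Lobug', 'Lobug']
lengths = [11.6, 16.3, 15.1, 23.7, 18.4]

parent = entropy(species)                               # about 0.971 bits
gain = info_gain(species, [x < 17 for x in lengths])    # about 0.420 bits
print('{:1.4f} {:1.4f}'.format(parent, gain))
```

On these five rows, the `Length < 17` split sends two Mobugs and one Lobug to one child and two Lobugs to the other, so the second child is pure and contributes zero entropy.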

```
def calculate_entropy(sub):
    """Return the two-class (Lobug/Mobug) entropy, in bits, of a subset of the bug data."""
    d = sub.shape[0]
    entropy = 0.0
    for species in ('Lobug', 'Mobug'):
        p = (sub['Species'] == species).sum() / d
        if p > 0:  # treat 0 * log2(0) as 0 so a pure subset does not raise
            entropy -= p * log(p, 2)
    return entropy
```

```
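def calculate_info_gain(series):
    """Information gain of splitting the global DataFrame `b` on a boolean mask."""
    sub_1 = b[series].copy()    # rows where the criterion is True
    sub_2 = b[~series].copy()   # rows where the criterion is False
    parent_entropy = calculate_entropy(b)
    # Weight each child by its share of the parent rows
    weight_1 = sub_1.shape[0] / b.shape[0]
    weight_2 = sub_2.shape[0] / b.shape[0]
    entropy_1 = calculate_entropy(sub_1)
    entropy_2 = calculate_entropy(sub_2)
    return parent_entropy - (weight_1 * entropy_1 + weight_2 * entropy_2)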
```

```
for series, string in [(b['Color']=='Brown', "b['Color']=='Brown'"),
                       (b['Color']=='Blue', "b['Color']=='Blue'"),
                       (b['Color']=='Green', "b['Color']=='Green'"),
                       (b['Length (mm)'] < 17, "b['Length (mm)'] < 17"),
                       (b['Length (mm)'] < 20, "b['Length (mm)'] < 20")]:
    info_gain = calculate_info_gain(series)
    print(string, '\t\t{:1.4f}'.format(info_gain))
```

```
b['Color']=='Brown' 0.0616
b['Color']=='Blue' 0.0006
b['Color']=='Green' 0.0428
b['Length (mm)'] < 17 0.1126
b['Length (mm)'] < 20 0.1007
```
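The best criterion is the one with the largest gain. As a small sketch, the values printed above can be collected into a dictionary and the maximum picked programmatically. The dictionary name `gains` and its keys are illustrative; the numbers are copied from the output above.

```python
# Information gain values copied from the printed results above.
gains = {
    'Color = Brown': 0.0616,
    'Color = Blue': 0.0006,
    'Color = Green': 0.0428,
    'Length < 17 mm': 0.1126,
    'Length < 20 mm': 0.1007,
}

# max() with key=gains.get returns the key whose value is largest.
best = max(gains, key=gains.get)
print(best, gains[best])
```

This selects `Length < 17 mm`, so of the five candidates it is the split that best discriminates Mobugs from Lobugs.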

## Training a sklearn Decision Tree on 2D Binary Classification Data

```
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
```

Import Data

```
data = np.asarray(pd.read_csv('decision-tree-implementations/2d_class_data.csv',
                              header=None))
```

Split data into features and labels

```
X = data[:,0:2] # Features
y = data[:,2] # Labels
```

Set up the Model

*Note that no hyperparameters are specified, so the tree is free to keep splitting until every leaf is pure. This makes it very likely to overfit the training data.*

`model = DecisionTreeClassifier()`
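For comparison, a tree can be constrained by passing hyperparameters at construction time. The sketch below is not part of the original notes: the variable name `constrained_model` and the specific values are arbitrary choices for illustration, not tuned settings.

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative values only; sensible settings depend on the dataset.
constrained_model = DecisionTreeClassifier(
    max_depth=3,           # stop splitting after three levels
    min_samples_leaf=5,    # every leaf must contain at least five samples
    min_samples_split=10,  # a node needs at least ten samples to be split
)

# get_params() returns the hyperparameters the estimator was built with.
params = constrained_model.get_params()
print(params['max_depth'], params['min_samples_leaf'], params['min_samples_split'])
```

Limiting depth and leaf size trades some training accuracy for a simpler tree that is less able to memorize individual points.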

Train the Model

`model.fit(X, y)`

```
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')
```

Predict using the Model

`y_pred = model.predict(X)`

Calculate Accuracy

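*The accuracy below is measured on the same data the model was trained on. A value of 100% shows that the tree has memorized the training set, which is consistent with overfitting; confirming it would require evaluating on held-out data.*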

```
acc = accuracy_score(y, y_pred)
acc
```

```
1.0
```
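Since the score above is computed on the training data, it says little about how the tree generalizes. A minimal sketch of a held-out evaluation follows. It assumes synthetic two-moons data from `make_moons` as a stand-in for the CSV, and the names `X_demo`, `y_demo`, and `tree` are hypothetical, so the resulting numbers do not describe the dataset used in these notes.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 2D binary classification data (an assumption, not the original CSV).
X_demo, y_demo = make_moons(n_samples=200, noise=0.3, random_state=0)

# Hold out 25% of the rows so the model is scored on data it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Same default, unconstrained tree as above (seeded for reproducibility).
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))
print('train accuracy:', train_acc)
print('test accuracy: ', test_acc)
```

A gap between the training accuracy and the test accuracy is the usual signal that the tree has overfit.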

This content is taken from notes I took while pursuing the Intro to Machine Learning with PyTorch nanodegree certification.