Ensemble Methods Implementations

08 Mar 2020

Train and Test Three Ensemble Methods

Models and Techniques:

BaggingClassifier
- Trains each learner on a bootstrap sample of the training data, drawn with replacement (bagging).
RandomForestClassifier
- Restricts the learners to random subsets of the features; combined with bagging, this gives random forests their two random components.
AdaBoostClassifier
- Combines learners so that those that perform best in certain regions of the data carry the most weight in the final prediction (boosting).
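The two random components mentioned above can be sketched with NumPy alone. This is an illustration under assumptions, not how scikit-learn implements it internally: the toy matrix `X`, the seed, and the choice of 2 features are all made up for the example. Note also that scikit-learn's `RandomForestClassifier` draws a fresh feature subset at each split rather than once per learner.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy data: 6 rows (messages) and 4 columns (words).
X = np.arange(24).reshape(6, 4)

# Bagging: draw as many row indices as there are rows, with replacement,
# so some rows may repeat and others may be left out of this sample.
row_idx = rng.integers(0, X.shape[0], size=X.shape[0])

# Feature subsetting: pick 2 of the 4 columns, without replacement.
col_idx = rng.choice(X.shape[1], size=2, replace=False)

# The data one learner in the ensemble would see.
X_sample = X[row_idx][:, col_idx]
print(X_sample.shape)  # (6, 2)
```

Each learner in the ensemble would get its own `row_idx` (and feature subset), which is what makes the learners differ from one another.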

High-Level Overview of the Steps:

1. Preprocess the data.
2. Import the model.
3. Instantiate the model with the hyperparameters of interest.
4. Fit the model to the training data.
5. Predict on the test data.
6. Score the model by comparing the predictions to the actual values.

1. Preprocess the Data

This first code block performs the same data preprocessing that is documented in my Naive Bayes SMS Example post. For details, see that post.

Following this preprocessing, the training and test data are each a sparse matrix of word counts. Each row is a text message, each column corresponds to a single word, and most entries are zero.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (accuracy_score, 
                             precision_score,
                             recall_score, 
                             f1_score)

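# Load the tab-separated SMS data; the file has no header row.
t = pd.read_table('naive-bayes-implementations/sms_spam_collection.txt',
                  sep='\t',
                  names=['label','text'])
# Encode the labels as integers: ham -> 0, spam -> 1.
t['label'] = t['label'].map({'ham'  : 0,
                             'spam' : 1})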

X_train, X_test, y_train, y_test = train_test_split(t['text'], 
                                                    t['label'], 
                                                    random_state=1)

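count_vector = CountVectorizer()

# Learn the vocabulary from the training messages only,
# then reuse that same vocabulary for the test messages.
train_data = count_vector.fit_transform(X_train)
test_data = count_vector.transform(X_test)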
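To see what the vectorizer produces, here is a small sketch on two made-up messages (the strings are hypothetical and not from the SMS dataset). It assumes the default `CountVectorizer` settings, which lowercase the text and keep tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical messages, just to show the shape of the output.
docs = ["free prize now", "call me now now"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per message

# Columns are ordered alphabetically by word.
print(sorted(vectorizer.vocabulary_))  # ['call', 'free', 'me', 'now', 'prize']

# Dense view for display only; the second row counts "now" twice.
print(X.toarray())
```

The second row contains a 2 in the "now" column, which shows that the entries are counts rather than just zeros and ones.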

The baseline for comparing the ensemble methods is the multinomial Naive Bayes model from the Naive Bayes SMS Example post, retrained here on the same training data.

naive_bayes = MultinomialNB().fit(train_data,
                                  y_train)

naive_bayes_preds = naive_bayes.predict(test_data)

accuracy = accuracy_score(y_test,
                          naive_bayes_preds)
precision = precision_score(y_test,
                            naive_bayes_preds)
recall = recall_score(y_test,
                      naive_bayes_preds)
f1 = f1_score(y_test,
              naive_bayes_preds)

print('  --- Naive Bayes ---')
print(' Accuracy score: {:2.1f}%'.format(accuracy*100.))
print('Precision score: {:2.1f}%'.format(precision*100.))
print('   Recall score: {:2.1f}%'.format(recall*100.))
print('       F1 score: {:2.1f}%'.format(f1*100.))

  --- Naive Bayes ---
 Accuracy score: 98.9%
Precision score: 97.2%
   Recall score: 94.1%
       F1 score: 95.6%

2. Import the Models

from sklearn.ensemble import (BaggingClassifier,
                              RandomForestClassifier,
                              AdaBoostClassifier)

3. Instantiate the Models with the Hyperparameters of Interest

BaggingClassifier
- 200 estimators (base learners)
RandomForestClassifier
- 200 estimators (base learners)
AdaBoostClassifier
- 300 estimators (weak learners)
- Learning rate = 0.2

bagging = BaggingClassifier(n_estimators=200)

random_forest = RandomForestClassifier(n_estimators=200)

adaboost = AdaBoostClassifier(n_estimators=300,
                              learning_rate=0.2)
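All three ensembles involve randomness, so scores can shift slightly from run to run. A minimal sketch of pinning that down with the `random_state` parameter follows; the synthetic data from `make_classification` and the seed value 42 are assumptions for illustration and are not what produced the results in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data, standing in for the SMS features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Two forests built with the same seed draw the same bootstrap samples
# and feature subsets, so they end up making identical predictions.
forest_a = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)
forest_b = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

same = bool((forest_a.predict(X) == forest_b.predict(X)).all())
print(same)  # True
```

`BaggingClassifier` and `AdaBoostClassifier` accept the same `random_state` parameter.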

4. Fit the Models

bagging.fit(train_data,
            y_train)

random_forest.fit(train_data,
                  y_train)

adaboost.fit(train_data,
             y_train)

5. Predict on the Test Data

bagging_preds = bagging.predict(test_data)

random_forest_preds = random_forest.predict(test_data)

adaboost_preds = adaboost.predict(test_data)

6. Score the Models

def print_results(y_test,
                  preds,
                  model_name):
    """Print accuracy, precision, recall, and F1 for one model's predictions."""
    accuracy = accuracy_score(y_test,
                              preds)
    precision = precision_score(y_test,
                                preds)
    recall = recall_score(y_test,
                          preds)
    f1 = f1_score(y_test,
                  preds)

    print('\n  --- {} ---'.format(model_name))
    print(' Accuracy score: {:2.1f}%'.format(accuracy*100.))
    print('Precision score: {:2.1f}%'.format(precision*100.))
    print('   Recall score: {:2.1f}%'.format(recall*100.))
    print('       F1 score: {:2.1f}%'.format(f1*100.))

For all these metrics, the score can range from 0 to 1, with 1 being the best.

Accuracy: how often the classifier makes the right prediction.

$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$

Precision: proportion of the messages classified as spam that actually were spam.

$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$

Recall (Sensitivity): proportion of the messages that actually were spam that were classified as spam.

$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$

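F1 Score: harmonic mean of the precision and recall scores, which is pulled toward the lower of the two.

$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$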
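As a worked example of these definitions, the sketch below computes all four metrics by hand from made-up confusion counts. The counts are hypothetical and do not come from the SMS data.

```python
# Hypothetical confusion counts for a spam classifier.
tp, fp, fn, tn = 90, 10, 30, 870

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(' Accuracy score: {:.1%}'.format(accuracy))   # 96.0%
print('Precision score: {:.1%}'.format(precision))  # 90.0%
print('   Recall score: {:.1%}'.format(recall))     # 75.0%
print('       F1 score: {:.1%}'.format(f1))         # 81.8%
```

Here the F1 score (81.8%) lands slightly below the simple average of precision and recall (82.5%), because it leans toward the weaker of the two.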

for preds, model in [(naive_bayes_preds, 'Naive Bayes'),
                     (bagging_preds, 'Bagging'),
                     (random_forest_preds, 'Random Forest'),
                     (adaboost_preds, 'Adaboost')]:
    print_results(y_test,
                  preds,
                  model)

  --- Naive Bayes ---
 Accuracy score: 98.9%
Precision score: 97.2%
   Recall score: 94.1%
       F1 score: 95.6%

  --- Bagging ---
 Accuracy score: 97.4%
Precision score: 91.2%
   Recall score: 89.2%
       F1 score: 90.2%

  --- Random Forest ---
 Accuracy score: 98.1%
Precision score: 100.0%
   Recall score: 85.9%
       F1 score: 92.4%

  --- Adaboost ---
 Accuracy score: 97.7%
Precision score: 96.9%
   Recall score: 85.4%
       F1 score: 90.8%
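The Random Forest line above pairs 100% precision with the lower recall of 85.9%. A confusion matrix is one way to see how that combination arises. The sketch below uses a handful of made-up labels, not the real predictions, with no false positives and one missed spam message.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = ham, 1 = spam.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 0]  # the last spam message is missed

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # no false positives, so precision is 1.0
recall = tp / (tp + fn)     # one spam slipped through, so recall is 0.75
print(tn, fp, fn, tp)       # 4 0 1 3
```

Running the same call on `y_test` and `random_forest_preds` would show how many spam messages that model let through.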

This content comes from notes I took while pursuing the Intro to Machine Learning with PyTorch nanodegree certification.