Training and Tuning

Types of Errors

There are two general types of errors that are made in Machine Learning: overfitting and underfitting. The difference between these errors will be illustrated using the following simple example. The data shown are the training set.

Underfitting

  • Underfitting occurs when the machine learning model oversimplifies the problem.
  • Characterized by models that underperform on the training set.
  • Also called an error due to bias.

In this example, the model overgeneralizes the difference between these two sets as “animals” and “not animals,” resulting in the cat in the training set being misclassified.

Overfitting

  • Overfitting occurs when the machine learning model overcomplicates the problem.
  • Characterized by models that do well on the training set, but tend to memorize the training set instead of learning the characteristics of it. Consequently, they underperform on the testing set.
  • Also called an error due to variance.

In this example, the model overcomplicates the difference between the two sets as “dogs that are wagging their tail” and “everything that is not a dog wagging its tail.” The model performs well on the training set, which is the initial set of 6 data instances.

The 7th data instance that is considered (from the testing set) is a dog that is not wagging its tail. The dog belongs in the right-hand set, as a “dog.” However, because its tail is not wagging, the model would incorrectly place it in the left-hand set. Relying on the “wagging tail” detail is a form of overfitting and results in the model underperforming on the testing set.
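The contrast between the two error types can also be seen numerically. The following is a minimal sketch, assuming synthetic data from sklearn's make_classification and decision trees as the model family (neither appears in the example above). A very shallow tree stands in for an oversimplified model, and an unrestricted tree stands in for an overcomplicated one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with some label noise (an assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A depth-1 tree is very simple; an unrestricted tree can memorize the training set.
shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)
deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)

for name, model in [("shallow", shallow), ("deep", deep)]:
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name}: train accuracy={train_acc:.2f}, test accuracy={test_acc:.2f}")
```

A model that scores poorly even on the training set suggests underfitting, while a perfect training score paired with a noticeably lower testing score suggests overfitting.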

In summary:

Examples from Regression and Classification

Good models are shown below.

Underfit (or “High Bias”) models are shown below.

Overfit (or “High Variance”) models are shown below.

Model Complexity Graph

A model complexity graph is a visualization of the models’ training and testing error rates as a function of model complexity. The purpose of this graph is to help determine the optimal model complexity so the model is neither underfit nor overfit.

The following linear model makes 3 training errors and 3 testing errors.

The following quadratic model makes 1 training error and 1 testing error.

The following degree 6 polynomial model makes 0 training errors and 2 testing errors.

The model complexity curve is shown below.

With increasing model complexity beyond the quadratic model, the training error decreased but the testing error increased. This indicates the model has begun to overfit. In the case of the linear model, the relatively high training error indicates the model has underfit the data.
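The same sweep over model complexity can be sketched in code. This is an illustrative sketch only: it assumes a noisy quadratic regression problem and uses mean squared error, rather than the classification error counts described above, and the data are generated rather than taken from the figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy quadratic, split into training and testing halves.
x = rng.uniform(-1, 1, size=40)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.3, size=x.size)
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

train_errors, test_errors = {}, {}
for degree in (1, 2, 6):  # linear, quadratic, degree 6 polynomial
    coeffs = np.polyfit(x_train, y_train, degree)
    train_errors[degree] = mse(coeffs, x_train, y_train)
    test_errors[degree] = mse(coeffs, x_test, y_test)
    print(degree, round(train_errors[degree], 3), round(test_errors[degree], 3))
```

Plotting train_errors and test_errors against degree gives a model complexity graph. The training error can only go down as the degree increases, so it is the testing error that reveals where overfitting begins.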

Cross Validation

Since the testing data may not be used for training, or for choosing between models, we use a “cross validation” set instead. The cross validation set is a part of the training data that is set aside for making decisions about the model, such as how complex it should be.

While model complexity curves will not always look exactly like the one shown above, the following shape is typical. The point of ideal model complexity is the point where the cross validation error reaches its minimum, just before the training and cross validation errors begin to diverge.
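One common way to obtain the three sets is to split twice. The following is a minimal sketch, assuming a 60/20/20 split and placeholder data; the proportions and variable names are illustrative choices, not something prescribed here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)   # 100 hypothetical samples, 2 features each
y = np.arange(100) % 2               # hypothetical binary labels

# First hold out the testing set (20%), which is not touched until the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then carve the cross validation set out of what remains (25% of 80 = 20 samples).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # prints: 60 20 20
```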

K-Fold Cross Validation

K-Fold Cross Validation is a means of “recycling” the data across the training and testing cycles, so that every data point is eventually used for both training and testing. Using K-fold cross validation involves splitting the data into K buckets and then training the model K times, each time using a different bucket as the testing set and the remaining buckets as the training set. In the image below, K is 4.

Using K-fold cross validation in sklearn is straightforward:

from sklearn.model_selection import KFold
data = list(range(12))  # 12 data points, indexed 0 to 11
kf = KFold(n_splits=4)  # 4 folds of 3 points each
for train_indices, test_indices in kf.split(data):
  print(test_indices)

[0  1  2]
[3  4  5]
[6  7  8]
[9 10 11]

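Best practice is to shuffle the data. This is accomplished by setting shuffle=True when initializing the KFold object. Because the shuffle is random, the exact indices will differ between runs unless random_state is also set.

kf = KFold(n_splits=4, shuffle=True)
for train_indices, test_indices in kf.split(list(range(12))):
  print(test_indices)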

[7 9 10]
[3 5 11]
[0 2  4]
[1 6  8]
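In practice the folds are usually handed to a helper that trains and scores the model once per fold. The sketch below assumes synthetic data and a small decision tree as a stand-in model, and uses sklearn's cross_val_score to collect one score per fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data set of 120 points.
X, y = make_classification(n_samples=120, random_state=0)

kf = KFold(n_splits=4, shuffle=True, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# The model is trained 4 times; each fold serves once as the testing set.
scores = cross_val_score(model, X, y, cv=kf)
print(scores)
print(scores.mean())
```

Averaging the per-fold scores gives a steadier estimate of performance than a single train/test split.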

Learning Curves

Learning curves are another means of detecting over- and underfitting. To generate a learning curve, the model is trained and cross validated on a small subset of the available data. The training and cross validation errors are plotted as two points on the learning curve, which shows the error as a function of the number of training points.

The size of that subset is then increased step by step, and the two errors are plotted again at each size. As the number of training points increases, the training error and cross validation error tend to move toward each other. Depending on whether the model is underfit, well fit, or overfit, the learning curve will have different characteristic shapes. Examples of these shapes are shown below.
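The points of a learning curve can be computed with sklearn's learning_curve function. The following is a minimal sketch, assuming synthetic data, a depth-3 decision tree, and accuracy as the score (so error is taken as 1 minus accuracy); all of these are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Train on 10%, 32.5%, ..., 100% of the available training points,
# cross validating with 4 folds at each size.
train_sizes, train_scores, cv_scores = learning_curve(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=4,
    shuffle=True, random_state=0)

# Average over the folds and convert accuracy to error.
train_error = 1.0 - train_scores.mean(axis=1)
cv_error = 1.0 - cv_scores.mean(axis=1)

for n, tr, cv in zip(train_sizes, train_error, cv_error):
    print(f"{n:4d} training points: train error={tr:.2f}, cv error={cv:.2f}")
```

Plotting train_error and cv_error against train_sizes gives the learning curve. Two curves that meet at a high error suggest underfitting, while a persistent gap between them suggests overfitting.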

Grid Search

Grid Search is a method in sklearn that helps the user efficiently search for the best-performing model when there are multiple hyper-parameters to tune. The steps to train a machine learning model are the following:

  1. Use the training data to train several candidate models.
  2. Use the cross-validation data to pick the best of these models, based on F1 Score, for example.
  3. Use the testing data to ensure that the final model performs well.

Consider the example of a decision tree. First, using the training data, we train 4 different candidate models using depth values from 1 to 4. Second, using the cross-validation data, we determine which of these models has the highest F1 score. Third, using the test data, we test this final model. In the example below, the tree with depth 3 has the largest F1 score, so it would be the top-performing model.
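The three steps for the decision tree can be sketched by hand as follows. This is an illustration under assumptions: the data are synthetic, the split proportions are arbitrary, and the winning depth on this data need not be 3.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data, split into training, cross-validation and testing sets.
X, y = make_classification(n_samples=500, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

# Step 1: train one candidate model per depth value, using the training data.
candidates = {
    depth: DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    for depth in range(1, 5)
}

# Step 2: pick the candidate with the highest F1 score on the cross-validation data.
val_f1 = {depth: f1_score(y_val, model.predict(X_val))
          for depth, model in candidates.items()}
best_depth = max(val_f1, key=val_f1.get)

# Step 3: check the chosen model once on the testing data.
test_f1 = f1_score(y_test, candidates[best_depth].predict(X_test))
print(val_f1)
print(best_depth, test_f1)
```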

In this case, the hyper-parameters for the decision tree are straightforward. Consider the more complex case of a support vector machine. When training SVMs, we often need to simultaneously consider alternative kernels and alternative values for the hyper-parameter C.

In these and similar, more complex situations, we use Grid Search. Grid Search creates a table using the different possible combinations of hyper-parameters.
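To make the "table" concrete, the combinations can be enumerated directly. The sketch below uses sklearn's ParameterGrid with two kernels and three values of C, chosen here purely for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative hyper-parameter space: 2 kernels x 3 values of C.
parameters = {'kernel': ['poly', 'rbf'], 'C': [0.1, 1, 10]}

# Each entry is one cell of the table: one candidate model to train and score.
grid = list(ParameterGrid(parameters))
for combination in grid:
    print(combination)

print(len(grid))  # prints: 6
```

Grid Search trains and scores one candidate model for each of these combinations, so the cost grows with the product of the number of values per hyper-parameter.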

  1. Import GridSearchCV.
from sklearn.model_selection import GridSearchCV
  2. Define the space of hyper-parameters in a dictionary.
parameters = {'kernel':['poly', 'rbf'],'C':[0.1, 1, 10]}
  3. Create a scorer. This is the metric that will be used to score each of the candidate models.
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score
scorer = make_scorer(f1_score)
  4. Create a GridSearch object with the classifier, the parameters, and the scorer. Then, fit it to the data. Here clf is assumed to be a support vector classifier, and X and y are the training features and labels.
from sklearn.svm import SVC
clf = SVC()
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)
grid_fit = grid_obj.fit(X, y)
  5. Get the best estimator.
best_clf = grid_fit.best_estimator_