Boosting and Ensemble Learning

As an example, what are the simple rules that define Spam Email?

  • Spam: email body contains “manly”

  • Spam: short

  • Spam: just URLs

  • Spam: just image

  • Spam: list of blacklisted words

  • Not Spam: from spouse

The key characteristic of Ensemble Learning is to take a list of simple rules, all of which independently make some sense, but not perfect sense, and combine them to make a complex rule that works very well.

Basic Idea

First, learn a set of rules from a subset of the available already-classified data. Then, combine them into a complex rule.

  1. Choose a subset of the data uniformly at random, then apply a learner to it to generate a hypothesis. Repeat to produce a set of hypotheses.
  2. Combine the hypotheses using an average.

“Bagging” Example

Bagging is a specific type of ensemble learning wherein learners are fit to random subsets of the data and the resulting fits are averaged together. It is also called bootstrap aggregation.

With the housing data as the example: using 5 subsets of the red training data (sampled with replacement), train a 3rd-order polynomial on each subset and average the five fits together.

The five 3rd-order polynomials are shown in the image above. The average is shown below in red. The blue line is a fourth-order polynomial which is fit to the red training data.

This is run over several iterations with randomly-chosen subsets of the red training data. Generally, the red ensemble learning line does not perform as well as the 4th-order polynomial on the training data. But, for the green test point, the red ensemble learning line always outperforms the blue fourth-order curve fit.
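The following is a minimal sketch of that procedure in Python, with synthetic data standing in for the housing example; the data-generating function, subset count, and the specific polynomial degrees are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the housing data: noisy samples of an unknown curve.
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=x_train.shape)

def bagged_poly_predict(x_query, n_subsets=5, degree=3):
    """Fit a degree-3 polynomial to each bootstrap subset, then average."""
    preds = []
    for _ in range(n_subsets):
        # Sample with replacement (the "bootstrap" in bootstrap aggregation).
        idx = rng.integers(0, len(x_train), size=len(x_train))
        coeffs = np.polyfit(x_train[idx], y_train[idx], deg=degree)
        preds.append(np.polyval(coeffs, x_query))
    return np.mean(preds, axis=0)

# Compare the bagged 3rd-order fit to a single 4th-order fit at a held-out point.
x_test = 0.5
single_fit = np.polyval(np.polyfit(x_train, y_train, deg=4), x_test)
print("bagged :", bagged_poly_predict(x_test))
print("single :", single_fit)
```

Averaging the bootstrapped 3rd-order fits smooths out the variance of any single fit, which is why the ensemble tends to generalize better at held-out points.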

Bagging versus Boosting

Bagging:

  1. Pick data uniformly at random and apply a learner to generate some output.
  2. Compute the mean of the outputs.

Boosting:

  1. Apply a learner to generate some output, focusing on the “hardest” examples.
  2. Compute the weighted mean of the outputs.

Boosting Terms

So far in this course, error has been defined as the number of mismatches divided by the total number of examples. Implicit in that definition is the idea that every example is equally important, but this may not be the case. Specifically, some examples may show up more or less often in the test set. Given that, a new definition of error follows.

Error: $$Pr_D[h(x) \neq c(x)]$$ where $Pr$ stands for probability, $D$ for the distribution, $h$ for the hypothesis (what we think is the true concept), and $c$ for the true, underlying concept. This says that error is the probability, under the distribution $D$, that the hypothesis disagrees with the true concept on some particular instance, $x$.

In the lecture's example, the first error rate is computed using a uniform distribution over the examples, and the second using the non-uniform probabilities shown above the dots in the figure.

This notion of distribution ends up being very important to boosting, which relies on information about which samples are important or not important to learn.
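As a small illustration (the mismatch pattern and weights here are made up, not the numbers from the lecture figure), the same hypothesis can have very different error depending on the distribution:

```python
import numpy as np

# Hypothetical setup: 4 examples; the hypothesis disagrees with the true
# concept on the last two (1 = mismatch, 0 = match).
mismatch = np.array([0, 0, 1, 1])

# Under a uniform distribution every example counts equally.
uniform = np.full(4, 0.25)
print(np.sum(uniform * mismatch))   # 0.5

# Under a skewed distribution the same hypothesis has a different error.
skewed = np.array([0.4, 0.4, 0.1, 0.1])
print(np.sum(skewed * mismatch))    # 0.2
```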

Weak Learner: a learner that, no matter the distribution, will perform better than chance when it predicts. Essentially, you can always get some information from the learner, no matter the distribution. Specifically:

$$\forall_D Pr_D [h(x) \neq c(x)] \leq \frac{1}{2} - \epsilon$$

where $\epsilon$ is a small positive number.

Example

The task with this example is to come up with a probability distribution over $\lbrace x_1, x_2, x_3, x_4\rbrace$ such that a learning algorithm choosing among the hypotheses $\lbrace h_1, h_2, h_3\rbrace$ will be able to find one that has a success rate greater than 0.5 and an error rate less than 0.5.

Since there is a distribution for which none of these hypotheses will do better than chance, there is no weak learner for this hypothesis space on this instance set.
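A sketch of how one might check such a claim numerically; the truth table below is hypothetical (not the one from the lecture figure), chosen so that the uniform distribution pins every hypothesis at exactly chance:

```python
import numpy as np

# Hypothetical truth table: rows are hypotheses, columns are x1..x4,
# and a 1 means the hypothesis agrees with the true concept there.
correct = np.array([
    [1, 0, 1, 0],   # h1
    [0, 1, 0, 1],   # h2
    [1, 1, 0, 0],   # h3
])

def accuracies(D):
    """Weighted probability that each hypothesis matches the concept."""
    return correct @ D

# Under this adversarial (here: uniform) distribution, every hypothesis
# sits exactly at chance, so no epsilon > 0 advantage can be guaranteed.
D = np.full(4, 0.25)
print(accuracies(D))   # [0.5 0.5 0.5]
```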

Boosting Pseudocode

  • Given training data $\lbrace(x_i, y_i)\rbrace$, where $y_i \in \lbrace-1,+1\rbrace$
  • For $t = 1$ to $T$:
    • construct a distribution, $D_t$
    • find a weak classifier $h_t(x)$ with small error $\epsilon_t = Pr_{D_t} [h_t(x_i) \neq y_i]$
  • Output $H_{final}$, a final hypothesis

How do we construct the distribution $D_1(i)$? Begin with a uniform distribution:

$$D_1(i)=\frac{1}{n}$$

At every timestep, $t$, construct a new distribution, $D_{t+1}(i)$:

$$D_{t+1}(i) = \frac{D_t(i) e^{-\alpha_t y_i h_t(x_i)}}{z_t}$$

where $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$; because a weak learner guarantees $\epsilon_t < \frac{1}{2}$, $\alpha_t$ is always a positive number. $z_t$ is whatever normalization constant is required at time $t$ to make $D_{t+1}$ a proper distribution.

What happens to $D_t(i)$ when the output of a hypothesis, $h_t$, and the label, $y_i$, agree? Then $y_i h_t(x_i) = +1$, so the numerator is multiplied by $e^{-\alpha_t} < 1$. The final effect depends on the normalization constant $z_t$, but generally the weight will decrease.

The net result is that more weight is placed on the examples that it is getting wrong, meaning they are “harder.” Because we have a weak learner that will always do better than chance on any distribution, we will always be able to output some learner that can get at least some of the ones that we’re getting wrong, right. The overall idea is to take the current distribution and either make it bigger or smaller depending upon how well the current hypothesis does on that particular example.
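A one-step sketch of that reweighting in numpy (the labels, predictions, and error value below are made up for illustration):

```python
import numpy as np

def update_distribution(D, y, h_x, alpha):
    """One boosting reweighting step: D_{t+1}(i) ∝ D_t(i) * exp(-alpha * y_i * h_t(x_i))."""
    unnormalized = D * np.exp(-alpha * y * h_x)
    return unnormalized / unnormalized.sum()   # z_t is this sum

# Toy illustration: 4 examples, the hypothesis gets the last one wrong.
D = np.full(4, 0.25)
y   = np.array([ 1, -1,  1, -1])
h_x = np.array([ 1, -1,  1,  1])    # disagrees with y on example 4
eps = 0.25                          # weighted error under the uniform D
alpha = 0.5 * np.log((1 - eps) / eps)
print(update_distribution(D, y, h_x, alpha))   # [1/6, 1/6, 1/6, 1/2]
```

The misclassified example's weight jumps to 1/2 while each correctly classified example drops to 1/6, so the next weak learner is forced to pay attention to the mistake.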

The final hypothesis, $H_{final}(x)$, is the $sgn$ function of the sum of the weak classifiers' outputs, each weighted by its $\alpha_t$.

$$H_{final}(x)=sgn(\sum_t \alpha_t h_t(x))$$
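Putting the pieces together, here is a minimal AdaBoost sketch in Python using decision stumps as the weak learner; the stump search, tie-breaking, and the toy data in the usage example are assumptions for illustration, not details from the lecture.

```python
import numpy as np

def train_adaboost(X, y, T=10):
    """AdaBoost sketch: decision stumps as the weak learner.
    X: (n, d) feature array; y: labels in {-1, +1}."""
    n, d = X.shape
    D = np.full(n, 1.0 / n)                  # D_1(i) = 1/n (uniform)
    stumps, alphas = [], []
    for _ in range(T):
        best = None                          # (error, feature, threshold, sign)
        for j in range(d):
            for thresh in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = np.where(X[:, j] <= thresh, sign, -sign)
                    err = D[pred != y].sum()  # weighted error under D_t
                    if best is None or err < best[0]:
                        best = (err, j, thresh, sign)
        eps, j, thresh, sign = best
        eps = max(eps, 1e-10)                # guard against log(0) on a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)
        pred = np.where(X[:, j] <= thresh, sign, -sign)
        D = D * np.exp(-alpha * y * pred)    # up-weight mistakes, down-weight hits
        D /= D.sum()                         # divide by z_t to keep D a distribution
        stumps.append((j, thresh, sign))
        alphas.append(alpha)
    return stumps, alphas

def predict_adaboost(stumps, alphas, X):
    """H_final(x) = sgn(sum_t alpha_t * h_t(x))."""
    total = np.zeros(len(X))
    for (j, thresh, sign), alpha in zip(stumps, alphas):
        total += alpha * np.where(X[:, j] <= thresh, sign, -sign)
    return np.sign(total)

# Toy usage: no single stump separates this pattern, but three rounds do.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1, 1])
stumps, alphas = train_adaboost(X, y, T=3)
print(predict_adaboost(stumps, alphas, X))   # matches y on this toy set
```

Each round implements the loop from the pseudocode: construct $D_t$, pick the stump with the lowest weighted error, compute $\alpha_t$, and reweight.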

Graphical Example

Step 1:

  • Uniform distribution.

$$\epsilon_1=.3$$ $$\alpha_1=.42$$
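As a quick check of the $\alpha_t$ formula against these numbers:

$$\alpha_1 = \frac{1}{2}\ln\frac{1-\epsilon_1}{\epsilon_1} = \frac{1}{2}\ln\frac{0.7}{0.3} \approx 0.42$$

The $\alpha$ values in the later steps follow from $\epsilon_2$ and $\epsilon_3$ in the same way, up to rounding.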

Step 2:

  • 3 red pluses are weighted more heavily in the distribution. Obtain new hypothesis, $h_2$.

$$\epsilon_2=.21$$ $$\alpha_2=.65$$

Step 3:

  • 3 red pluses previously emphasized are now weighted less heavily,
  • 3 green minuses in the middle become more heavily weighted, and
  • Obtain a new hypothesis, $h_3$.

$$\epsilon_3=.14$$ $$\alpha_3=.92$$

Weighting the 3 hypotheses accordingly, we end up with the following.

An Intuitive Explanation

Boosting works because of the weak-learner requirement. At each subsequent iteration, the learner is forced to focus on the parts of the input space that it got wrong in previous iterations. Because it is a weak learner, it always provides new information at each step, no matter the distribution. And because new information is gained at every iteration, training error keeps going down.

General Takeaways

  • Combining several simple hypotheses creates a more complex hypothesis.
  • Boosting is learner-agnostic, as long as it is a weak learner.
  • Error must take into account the underlying distribution of the input space.
  • In practice, overfitting tends not to be an issue for boosting: both training error and testing error usually continue to decrease with subsequent iterations (with the caveats discussed below).

The following is taken from a group of lectures on Support Vector Machines.

Boosting and Overfitting

With boosting, cross-validation error does not grow large with increased training time, as it would with most machine learning algorithms. Explaining why requires a new idea: confidence.

  • Error: The probability that the algorithm will come up with an incorrect answer.
  • Confidence: How strongly a given algorithm “believes” in a given answer.

As an example, in k-Nearest Neighbors, if 5 of the nearest points vote the same way, you would expect to have more confidence than if 3 voted one way and 2 voted another.
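A minimal sketch of that voting intuition (the helper below is hypothetical, not an API from the course):

```python
from collections import Counter

def knn_vote(labels):
    """Return (prediction, confidence) from the labels of the k nearest neighbors."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

print(knn_vote(['+', '+', '+', '+', '+']))   # ('+', 1.0): unanimous, high confidence
print(knn_vote(['+', '+', '+', '-', '-']))   # ('+', 0.6): split vote, lower confidence
```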

Recall that the final output of a boosted classifier is given by:

$$H_{final}(x) = sgn(\sum_t \alpha_t h_t(x))$$

Normalizing by the weights, $\alpha_t$:

$$H_{final}(x) = sgn\left(\frac{\sum_t \alpha_t h_t(x)}{\sum_t \alpha_t}\right)$$

That normalization means the quantity inside the $sgn$ function always lies in $[-1, +1]$.
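That normalized quantity is exactly the confidence of the boosted vote; a small sketch, reusing the $\alpha$ values from the graphical example with hypothetical votes:

```python
import numpy as np

def normalized_margin(alphas, h_x, y=None):
    """Confidence of a boosted vote: sum_t alpha_t h_t(x) / sum_t alpha_t, in [-1, +1].
    If the true label y is given, multiply by it so positive means correct."""
    score = np.dot(alphas, h_x) / np.sum(alphas)
    return score if y is None else y * score

alphas = np.array([0.42, 0.65, 0.92])
h_x = np.array([1, -1, 1])       # the three weak hypotheses' votes on one example
print(normalized_margin(alphas, h_x))          # ~0.35: a + vote, modest confidence
print(normalized_margin(alphas, [1, 1, 1]))    # 1.0: unanimous, maximal confidence
```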

On the spectrum shown below, both $+$’s are correct, but the one on the right side is much more confident than the one toward the middle.

The $+$ on the left side in the image below would indicate high confidence in an incorrect classification.

In the example below, most items are classified correctly, but there is still some training error.

With additional training, maybe it would be possible to have no training error:

With even more training, the classifications that are near the boundary will tend to drift away from the boundary.

While the error remains constant at 0, the confidence continues to increase.

The reason boosting tends to do well is that as you add more and more learners, you increase the margin:

One situation where boosting will overfit is when the underlying learner is itself badly overfit, for example a "weak" learner that is actually a very powerful artificial neural network with many layers and nodes.

If the underlying weak learner overfits, it is difficult for boosting to overcome that.

One signal that tends to cause boosting to overfit is pink noise, which is uniform noise. Pink noise is distinct from white noise, which is Gaussian noise.