Bayesian Learning

The purpose of this content is to think generally about what machine learning is “trying to do,” beyond specific algorithms. The goal of the supervised learning algorithms described in other lectures has been to learn the best hypothesis given data and domain knowledge. We assert that the best hypothesis is, more or less, the most probable hypothesis. The task is expressed mathematically as:

$$argmax_{h \in H} Pr(h|D)$$

where $argmax$ selects the element that maximizes the expression, $Pr$ denotes probability, $h$ is a hypothesis, $H$ is the hypothesis set, and $D$ is all the data brought to bear on the problem.

Bayes’ Rule Derivation

Bayes’ Rule is derived as follows. $Pr(a,b)$ is the probability of $a$ and $b$. $Pr(a|b)$ is the probability of $a$ given $b$.

$$Pr(a,b)=Pr(a|b)Pr(b)$$

Because order doesn’t matter:

$$Pr(a,b)=Pr(b|a)Pr(a)$$

Setting one equal to the other and dividing by $Pr(b)$:

$$Pr(a|b)=\frac{Pr(b|a)Pr(a)}{Pr(b)}$$

Bayes’ Rule Example

A lab test returns a correct positive result 98% of the time and a correct negative result 97% of the time. The disease this test looks for is present in only 0.8% of the population.

The following calculation determines the probability that a person has the disease, $s$, given a positive test result, $+$. The denominator $Pr(+)$ can be disregarded because it appears in both quantities; the final percentages come from normalizing the two unnormalized results so they sum to 1.

$$Pr(s|+)=\frac{Pr(+|s)Pr(s)}{Pr(+)} \propto (.98)(.008)=0.0078 \approx 21\%$$

$$Pr(\overline{s}|+)=\frac{Pr(+|\overline{s})Pr(\overline{s})}{Pr(+)} \propto (.03)(.992)=0.0298 \approx 79\%$$

So, if you test positive for the condition, the most likely scenario is that you are still negative for the condition!
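
A quick numeric check of the arithmetic above (a minimal sketch in Python; the variable names are our own):

```python
# Posterior probability of disease given a positive test, via Bayes' Rule.
# Numbers from the example: sensitivity 0.98, specificity 0.97, prevalence 0.008.
sensitivity = 0.98      # Pr(+ | s)
specificity = 0.97      # Pr(- | not s)
prior_s = 0.008         # Pr(s)

# Unnormalized posteriors (the shared denominator Pr(+) is dropped).
p_pos_and_s = sensitivity * prior_s                  # 0.00784
p_pos_and_not_s = (1 - specificity) * (1 - prior_s)  # 0.02976

# Normalizing recovers Pr(s | +) and Pr(not s | +).
total = p_pos_and_s + p_pos_and_not_s
print(p_pos_and_s / total)      # ~0.21 -- still probably healthy
print(p_pos_and_not_s / total)  # ~0.79
```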

What this suggests is that if a random person from the population is tested for the disease, even if they test positive, they still most likely do not have the condition. Essentially, the small likelihood of the test giving an incorrect answer is totally swamped by the fact that “you still probably don’t have the disease!”

An element this reasoning does not take into account is that the test was likely ordered because of other symptoms. If other symptoms were taken into account, the prior would change: the 0.8% prevalence in the general population would no longer apply. This “prior” would increase, which would lead to dramatically different results.

Practically, this is an argument against indiscriminately testing people. If the prior probability is very low, the test is unlikely to be useful. If, on the other hand, the prior probability is higher due to other symptoms, then testing makes sense.

In other words: Priors Matter!

Applying Bayes’ Rule to $Pr(h|D)$

Rewriting Bayes’ Rule in terms of the hypothesis $h$ and data $D$ from the first section:

$$Pr(h|D)=\frac{Pr(D|h)Pr(h)}{Pr(D)}$$

Prior on the Data: $Pr(D)$

  • Your prior belief of seeing some particular set of data.
  • This often ends up being a normalizing term.
  • Typically, this is ignored because the focus is on the hypothesis and not on the data.

Probability of the data given the hypothesis: $Pr(D|h)$

  • The likelihood of seeing some data given that we are in a world where some hypothesis is true.
• The training data is of the form $D=\lbrace(x_i,d_i)\rbrace$. The $x_i$ are given, but if we are in a world where the hypothesis $h$ is true, what is the likelihood that we will see the labels $d_i$? Those labels are what we want to assign probability to.

For example, imagine we are in a universe where this hypothesis is true: $h(x)=\lbrace x \geq 10\rbrace$, and that we have the data point $x=7$. The probability that $x=7$ is labeled True is 0, and the probability that it is labeled False is 1.

Note that $Pr(D|h)$ is more readily computed than $Pr(h|D)$.
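
As a concrete sketch, the following computes $Pr(D|h)$ for the deterministic threshold hypothesis above (the extra data points are illustrative):

```python
# Likelihood Pr(D | h) for the deterministic hypothesis h(x) = (x >= 10).
# With noise-free labels, each example contributes probability 1 if its
# label matches h(x) and 0 otherwise.
def h(x):
    return x >= 10

def likelihood(data, hypothesis):
    p = 1.0
    for x, d in data:
        p *= 1.0 if hypothesis(x) == d else 0.0
    return p

D = [(7, False), (12, True)]        # illustrative (x_i, d_i) pairs
print(likelihood(D, h))             # 1.0 -- h is consistent with these labels
print(likelihood([(7, True)], h))   # 0.0 -- the x = 7 example above
```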

Prior on the Hypothesis: $Pr(h)$

  • Encapsulates our prior belief that one hypothesis is likely or unlikely compared to other hypotheses.
  • This is our domain knowledge.
• Encapsulates our prior beliefs about the way the world works (for example: our similarity metric for k-NN, which features are important for decision trees, the structure of a neural network, etc.)

Algorithm

For each $h \in H$:
    calculate $Pr(h|D) = \frac{P(D|h)P(h)}{P(D)}$ (but note $P(D)$ doesn’t need to be calculated, so instead):
    calculate: $Pr(h|D) \doteq P(D|h)P(h)$
Output:
    $h_{MAP}=argmax_h Pr(h|D)$

  • $h_{MAP}=argmax_h Pr(h|D)$
    • where MAP stands for “maximum a posteriori” hypothesis
  • $h_{ML}=argmax_h Pr(D|h)$
    • where ML stands for “maximum likelihood” hypothesis

When using $h_{ML}$, we are not necessarily “dropping” the $Pr(h)$ term; we are just assuming all hypotheses are uniformly likely.

  • Despite the simplicity of the mathematics above, solving this is not practical computationally, due to the number of hypotheses involved.
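
For a small, finite hypothesis class, though, the brute-force loop is easy to sketch. The threshold-rule class $h_t(x) = \lbrace x \geq t \rbrace$ and toy data below are our own illustrative choices:

```python
# Brute-force MAP over a finite hypothesis class of threshold rules
# h_t(x) = (x >= t), using the unnormalized posterior Pr(D|h) * Pr(h).
thresholds = range(0, 21)
D = [(7, False), (12, True), (15, True)]

def likelihood(t):
    # Noise-free likelihood: 1 if h_t labels every example correctly, else 0.
    return 1.0 if all((x >= t) == d for x, d in D) else 0.0

prior = 1.0 / len(thresholds)   # uniform prior, so h_MAP coincides with h_ML

posteriors = {t: likelihood(t) * prior for t in thresholds}
h_map = max(posteriors, key=posteriors.get)
print(h_map, posteriors[h_map])
```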

Connection between Version Space and Bayesian Learning

Noise-Free Data

Given the following assumptions:

1. Given $\lbrace(x_i, d_i) \rbrace$ as noise-free examples of the target concept $c$
  2. $c \in H$
  3. Uniform prior

It follows from a derivation during the lecture that, for all $h$ in the version space $VS$ (the set of hypotheses in $H$ consistent with $D$):

$$Pr(h|D) = \frac{1}{|VS|}$$
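
A sketch of this result, reusing the toy threshold class from the previous sketch: with a uniform prior and noise-free data, every consistent hypothesis gets the same posterior.

```python
# Version space VS: the hypotheses h_t(x) = (x >= t) consistent with the data.
thresholds = range(0, 21)
D = [(7, False), (12, True), (15, True)]

VS = [t for t in thresholds
      if all((x >= t) == d for x, d in D)]

# With a uniform prior and noise-free data, the posterior is uniform over VS.
for t in VS:
    print(t, 1 / len(VS))   # Pr(h_t | D) = 1/|VS| for every consistent h_t
```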

Noisy Data

Given the following assumptions:

  1. Given $\lbrace(x_i, d_i)\rbrace$
  2. $d_i=f(x_i)+\epsilon_i$, where $\epsilon_i$ is an error term
3. $\epsilon_i \sim N(0, \sigma^2)$, the error term is drawn from a normal distribution

The following comes from a derivation during the lecture. Note that this is simply minimizing the sum of squared errors.

$$h_{ML}=argmin_h \sum_i (d_i-h(x_i))^2$$
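
A numeric sketch of that equivalence: under Gaussian noise, ranking hypotheses by likelihood produces the same ordering as ranking by (negative) sum of squared errors. The linear candidate hypotheses and data here are illustrative:

```python
import math

# Data roughly generated by f(x) = 2x plus noise; candidates h_a(x) = a * x.
D = [(1, 2.1), (2, 3.9), (3, 6.2)]
candidates = [1.5, 2.0, 2.5]
sigma = 1.0

def sse(a):
    # Sum of squared errors for hypothesis h_a.
    return sum((d - a * x) ** 2 for x, d in D)

def log_likelihood(a):
    # log Pr(D | h_a) when d_i = h_a(x_i) + N(0, sigma^2) noise.
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (d - a * x) ** 2 / (2 * sigma**2) for x, d in D)

h_ml = max(candidates, key=log_likelihood)
h_sse = min(candidates, key=sse)
print(h_ml, h_sse)   # the same hypothesis (a = 2.0) wins both criteria
```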

Connection between Information Theory and Bayesian Learning

Recall the maximum a posteriori equation, which states that the best hypothesis is the one that maximizes this expression:

$$h_{MAP}=argmax_h P(D|h) P(h)$$

Taking the log (a monotonic function, so the argmax is unchanged):

$$h_{MAP}=argmax_h [lg P(D|h) + lg P(h)]$$

Negating turns the maximization into a minimization:

$$h_{MAP}=argmin_h [-lg P(D|h) - lg P(h)]$$

From information theory, an event with probability $P$ has an optimal code length of $-lg P$ bits. Applying this notion to the previous expression:

$$h_{MAP}=argmin_h [-lg P(D|h) - lg P(h)] = argmin_h [length(D|h)+length(h)]$$

The length of a hypothesis, $length(h)$, can be thought of as its relative complexity. So a smaller decision tree has a shorter length than a larger one.

The length of the data given a hypothesis, $length(D|h)$, can be thought of as the relative misclassification or error between the hypothesis and the data. So:

$$h_{MAP} \approx argmin_h [\text{error} + \text{size of } h]$$

Note that in reality, there are tradeoffs between the error and the complexity of the hypothesis. A more complex hypothesis may be able to reduce the error, and vice versa. The best hypothesis is the one that minimizes the error without paying too much penalty for complexity.

This is called the “Minimum Description Length.”
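
A toy illustration of the MDL tradeoff; the bit costs below are invented purely for illustration, not real description lengths:

```python
# MDL-style scoring: total description length = length(D|h) + length(h).
# Each candidate is (name, misclassified_examples, size_in_nodes); the
# per-item bit costs are made up for this sketch.
BITS_PER_ERROR = 4   # cost to transmit one misclassified example
BITS_PER_NODE = 2    # cost to transmit one node of the hypothesis

candidates = [
    ("small tree", 5, 3),    # simple but makes 5 errors
    ("medium tree", 1, 8),   # good balance of error and complexity
    ("huge tree", 0, 40),    # zero error but very complex
]

def description_length(candidate):
    _, errors, size = candidate
    return errors * BITS_PER_ERROR + size * BITS_PER_NODE

best = min(candidates, key=description_length)
print(best[0])   # "medium tree": lowest combined error + complexity
```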

Bayesian Classification

So far this section has focused on finding the best hypothesis, but the more pertinent question is usually: what is the best label for a new instance?

In the example from the lecture, the best (MAP) hypothesis is $h_1$, which outputs $+$, yet the best value is $-$: the remaining hypotheses collectively carry more posterior weight and vote for the other label. The best value is computed as a weighted vote over $h \in H$ according to $Pr(h|D)$:

$$v_{MAP} = argmax_v \sum_h P(v|h)P(h|D)$$
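
A sketch of this weighted vote; the posterior values below are illustrative:

```python
# Bayes optimal classification: weighted vote over hypotheses.
# posterior[h] = Pr(h | D); for deterministic hypotheses, Pr(v | h) is 1
# for the label h outputs and 0 otherwise.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
votes = {"h1": "+", "h2": "-", "h3": "-"}   # label each hypothesis outputs

def v_map(posterior, votes):
    scores = {}
    for h, p in posterior.items():
        v = votes[h]
        scores[v] = scores.get(v, 0.0) + p   # accumulate Pr(v|h) * Pr(h|D)
    return max(scores, key=scores.get)

print(v_map(posterior, votes))  # "-" even though the MAP hypothesis h1 says "+"
```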

Final Notes and Summary

• Bayes’ Rule allows us to swap “causes” and “effects”
    • Rather than computing the probability of the hypothesis given the data, $Pr(h|D)$, we calculate the probability of the data given the hypothesis, $Pr(D|h)$, which is usually much easier.
  • Priors matter
  • $h_{MAP}$, $h_{ML}$
  • Derived Bayesian reasoning for sum of squared errors, Occam’s razor
• The Bayes Optimal Classifier is a weighted vote of all the hypotheses according to the probability of each hypothesis given the data, $Pr(h|D)$.