Bayesian Learning
The purpose of this content is to think generally about what machine learning is “trying to do,” beyond specific algorithms. The goal of the supervised learning algorithms described in other lectures has been to learn the best hypothesis given data and domain knowledge. We assert that the best hypothesis is more or less synonymous with the most probable hypothesis. The task is expressed mathematically as:

$$h = \operatorname*{argmax}_{h \in H} P(h \mid D)$$

where $H$ is the hypothesis space, $h$ is a candidate hypothesis, and $D$ is the data.
Bayes’ Rule Derivation
Bayes’ Rule is derived as follows. The joint probability of two events $a$ and $b$ can be written:

$$P(a, b) = P(b \mid a)\,P(a)$$

Because order doesn’t matter:

$$P(a, b) = P(a \mid b)\,P(b)$$

Setting one equal to the other and dividing by $P(b)$:

$$P(a \mid b) = \frac{P(b \mid a)\,P(a)}{P(b)}$$
Bayes’ Rule Example
A lab test returns a correct positive 98% of the time, and a correct negative 97% of the time. The disease this test looks for is only present in 0.8% of the population.
The following calculation determines how likely a person who tests positive is to actually have the disease. Writing $+$ for a positive result:

$$P(\text{disease} \mid +) \propto P(+ \mid \text{disease})\,P(\text{disease}) = 0.98 \times 0.008 = 0.00784$$

$$P(\neg\,\text{disease} \mid +) \propto P(+ \mid \neg\,\text{disease})\,P(\neg\,\text{disease}) = 0.03 \times 0.992 = 0.02976$$

Normalizing, $P(\text{disease} \mid +) = 0.00784 / (0.00784 + 0.02976) \approx 0.21$.
So, if you test positive for the condition, the most likely scenario is that you are still negative for the condition!
What this suggests is that if a random person from the population is tested for the disease, even a positive result most likely does not mean they have the condition. Essentially, the small chance of the test being wrong is swamped by the fact that the disease is so rare in the first place: “you still probably don’t have the disease!”
An element this reasoning does not take into account is that the test was likely ordered because of the presence of some other symptoms. If those symptoms are taken into account, the piece of the calculation that changes is the prevalence in the general population (0.8%). This “prior” would increase, which would lead to dramatically different results.
Practically, this is an argument against testing people indiscriminately. If the prior probability is very low, a positive result is not very informative. If, on the other hand, the prior probability is higher because of other symptoms, then it makes sense to run the test.
In other words: Priors Matter!
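As a quick check of the arithmetic above, here is a minimal Python sketch (not from the lecture) that computes the posterior for an arbitrary prior. The `posterior_given_positive` function and the 30% “symptomatic” prior in the second call are illustrative assumptions, chosen only to show how strongly the prior drives the answer:

```python
def posterior_given_positive(prior, sensitivity=0.98, specificity=0.97):
    """Bayes' Rule: P(disease | +) = P(+ | disease) P(disease) / P(+)."""
    p_pos_given_disease = sensitivity        # correct positive rate
    p_pos_given_healthy = 1.0 - specificity  # false positive rate
    p_pos = (p_pos_given_disease * prior
             + p_pos_given_healthy * (1.0 - prior))
    return p_pos_given_disease * prior / p_pos

print(posterior_given_positive(0.008))  # general-population prior -> ~0.21
print(posterior_given_positive(0.30))   # hypothetical "has symptoms" prior -> ~0.93
```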
Applying Bayes’ Rule to Machine Learning
Substituting the hypothesis $h$ and the data $D$ for the letters in the previous section gives:

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
Prior on the Data: $P(D)$
- Your prior belief of seeing some particular set of data.
- This often ends up being a normalizing term.
- Typically, this is ignored because the focus is on the hypothesis and not on the data.
Probability of the data given the hypothesis: $P(D \mid h)$
- The likelihood of seeing some data given that we are in a world where some hypothesis is true.
- The training data is of the form $D = \{(x_i, d_i)\}$. The $x_i$ are given, but if we are in a world where the hypothesis is true, what is the likelihood that we will see the labels $d_i$? Those labels are what we want to assign probability to.
For example, imagine we are in a universe where some deterministic hypothesis is true. Then the likelihood is determined entirely by whether the labels match what the hypothesis predicts: $P(D \mid h) = 1$ if $d_i = h(x_i)$ for every example, and $P(D \mid h) = 0$ otherwise.
Note that $P(D \mid h)$ assigns probability to the labels $d_i$, not to the inputs $x_i$ themselves.
Prior on the Hypothesis: $P(h)$
- Encapsulates our prior belief that one hypothesis is likely or unlikely compared to other hypotheses.
- This is our domain knowledge.
- Encapsulates our prior belief about the way the world works (for example: our similarity metric for k-NN, which features are important for decision trees, the structure of a neural network, etc.)
Algorithm
For each $h \in H$:
- calculate $P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$
- in practice, calculate $P(h \mid D) \approx P(D \mid h)\,P(h)$, since $P(D)$ is just a normalizing term

Output:
- $h_{MAP} = \operatorname*{argmax}_{h} P(h \mid D)$, where MAP stands for “maximum a posteriori” hypothesis
- $h_{ML} = \operatorname*{argmax}_{h} P(D \mid h)$, where ML stands for “maximum likelihood” hypothesis
When using $h_{ML}$, the prior $P(h)$ is dropped from the maximization, which amounts to assuming a uniform prior over the hypotheses.
- Despite the simplicity of the mathematics above, solving this is not practical computationally, due to the number of hypotheses involved.
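Below is a minimal Python sketch of the brute-force loop above. The threshold hypothesis space, the uniform prior, and the simple noise model used for the likelihood are all assumptions made up for illustration; only the MAP/ML maximizations themselves come from the algorithm:

```python
# Toy data: inputs x with binary labels d.
data = [(1, 0), (4, 0), (7, 1), (9, 1)]

# Hypothetical hypothesis space: threshold rules h_t(x) = 1 if x >= t, else 0.
hypotheses = {f"x >= {t}": (lambda x, t=t: int(x >= t)) for t in range(0, 11)}

# Hypothetical prior P(h): uniform over the hypothesis space.
prior = {name: 1.0 / len(hypotheses) for name in hypotheses}

def likelihood(h, data, noise=0.1):
    """P(D | h): each label matches h(x) with probability 1 - noise."""
    p = 1.0
    for x, d in data:
        p *= (1.0 - noise) if h(x) == d else noise
    return p

# h_MAP maximizes P(D | h) P(h); h_ML maximizes P(D | h) alone.
h_map = max(hypotheses, key=lambda n: likelihood(hypotheses[n], data) * prior[n])
h_ml = max(hypotheses, key=lambda n: likelihood(hypotheses[n], data))
print(h_map, h_ml)
```

With a uniform prior, $h_{MAP}$ and $h_{ML}$ pick the same hypothesis, which is exactly the point of the note above about dropping $P(h)$.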
Connection between Version Space and Bayesian Learning
Noise-Free Data
Given the following assumptions:
- Given $\{(x_i, d_i)\}$ as noise-free examples of the target concept $c$ (that is, $d_i = c(x_i)$)
- Uniform prior over the hypotheses
It follows from a derivation during the lecture that, for all $h$ in the version space $VS_{H,D}$ (the hypotheses consistent with the data), $P(h \mid D) = \frac{1}{|VS_{H,D}|}$, and $P(h \mid D) = 0$ for every other hypothesis.
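A sketch of the standard derivation behind this result (the usual textbook argument; the lecture’s exact steps may differ): with noise-free data, $P(D \mid h) = 1$ if $h$ is consistent with every example and $0$ otherwise, and the uniform prior gives $P(h) = 1/|H|$, so

$$
\begin{aligned}
P(D) &= \sum_{h \in H} P(D \mid h)\,P(h)
      = \sum_{h \in VS_{H,D}} 1 \cdot \frac{1}{|H|}
      = \frac{|VS_{H,D}|}{|H|} \\
P(h \mid D) &= \frac{P(D \mid h)\,P(h)}{P(D)}
            = \frac{1 \cdot \frac{1}{|H|}}{\frac{|VS_{H,D}|}{|H|}}
            = \frac{1}{|VS_{H,D}|}
            \quad \text{for every } h \in VS_{H,D}.
\end{aligned}
$$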
Noisy Data
Given the following assumptions:
- Given $\{(x_i, d_i)\}$, where $d_i = f(x_i) + \varepsilon_i$
- $\varepsilon_i$ is an error term
- the error term is drawn from a normal distribution with zero mean
The following result comes from a derivation during the lecture. Note that it says the maximum likelihood hypothesis is simply the one that minimizes the sum of squared errors:

$$h_{ML} = \operatorname*{argmin}_{h} \sum_i \left(d_i - h(x_i)\right)^2$$
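A sketch of the standard argument behind this result, assuming the errors are independent draws from a Gaussian with mean $0$ and variance $\sigma^2$ (the variance is introduced here only to make the steps explicit; it drops out at the end):

$$
\begin{aligned}
h_{ML} &= \operatorname*{argmax}_{h} \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(d_i - h(x_i))^2}{2\sigma^2}\right) \\
&= \operatorname*{argmax}_{h} \sum_i -\frac{(d_i - h(x_i))^2}{2\sigma^2} \\
&= \operatorname*{argmin}_{h} \sum_i \left(d_i - h(x_i)\right)^2
\end{aligned}
$$

The second line takes the log (which is monotonic, so the argmax is unchanged) and drops the constant normalizing factor.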
Connection between Information Theory and Bayesian Learning
Recall the maximum a posteriori equation, which states that the best hypothesis is the hypothesis that maximizes this expression:

$$h_{MAP} = \operatorname*{argmax}_{h} P(D \mid h)\,P(h)$$
From information theory, an event with probability $P$ has an optimal code of length $-\lg P$ bits.
The length of a hypothesis, $-\lg P(h)$, can therefore be thought of as the number of bits needed to describe the hypothesis, i.e., its complexity or “size.”
The length of the data given a hypothesis, $-\lg P(D \mid h)$, can be thought of as the number of bits needed to describe the data’s mismatches with the hypothesis, i.e., the error; a hypothesis that explains the data perfectly needs no extra bits.
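Putting these pieces together (a restatement of the chain implied above, writing $\lg$ for $\log_2$ and “length” for optimal code length):

$$
\begin{aligned}
h_{MAP} &= \operatorname*{argmax}_{h} P(D \mid h)\,P(h) \\
&= \operatorname*{argmax}_{h} \left[\lg P(D \mid h) + \lg P(h)\right] \\
&= \operatorname*{argmin}_{h} \left[-\lg P(D \mid h) - \lg P(h)\right] \\
&= \operatorname*{argmin}_{h} \left[\text{length}(D \mid h) + \text{length}(h)\right]
\end{aligned}
$$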
Note that in reality, there are tradeoffs between the error and the complexity of the hypothesis. A more complex hypothesis may be able to reduce the error, and vice versa. The best hypothesis is the one that minimizes the error without paying too much penalty for complexity.
This is called the “Minimum Description Length.”
Bayesian Classification
The examples so far have focused on finding the best hypothesis, but the best label for a new instance is usually the more pertinent question to ask.
The best (MAP) hypothesis gives one label, but that label is not necessarily the best one. The Bayes optimal classifier instead takes a weighted vote: every hypothesis votes with its label, and each vote is weighted by that hypothesis’s posterior $P(h \mid D)$. The weighted vote can disagree with the label of the single most probable hypothesis.
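Here is a minimal Python sketch of the weighted vote. The three hypotheses, their posteriors, and their labels are made-up numbers for illustration, not values from the lecture:

```python
# Hypothetical posteriors P(h | D) for three hypotheses, and the label each one
# assigns to a new instance. The numbers are made up for illustration.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
labels = {"h1": "+", "h2": "-", "h3": "-"}

# MAP: pick the single most probable hypothesis and use its label.
h_map = max(posteriors, key=posteriors.get)
map_label = labels[h_map]                        # "+" (from h1)

# Bayes optimal: each hypothesis votes for its label, weighted by P(h | D).
votes = {}
for h, p in posteriors.items():
    votes[labels[h]] = votes.get(labels[h], 0.0) + p
bayes_optimal_label = max(votes, key=votes.get)  # "-" wins 0.6 to 0.4

print(map_label, bayes_optimal_label)
```

Even though h1 is individually the most probable hypothesis, the other hypotheses collectively outweigh it, so the Bayes optimal label disagrees with the MAP label.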
Final Notes and Summary
- Bayes’ Rule allows us to swap “causes” and “effects”
- Rather than computing the probability of the hypothesis given the data, $P(h \mid D)$, we calculate the probability of the data given the hypothesis, $P(D \mid h)$, which is usually much easier.
- Priors matter
- Derived Bayesian reasoning for the sum of squared errors and for Occam’s razor (minimum description length)
- The Bayes optimal classifier is a weighted vote of all the hypotheses according to the probability of each hypothesis given the data, $P(h \mid D)$.
For Fall 2019, CS 7641 is instructed by Dr. Charles Isbell. The course content was originally created by Dr. Charles Isbell and Dr. Michael Littman.