Bayesian Inference
Joint Distribution
The following is an example of a joint distribution.
Storm | Lightning | Probability |
---|---|---|
T | T | .25 |
T | F | .40 |
F | T | .05 |
F | F | .30 |
A few possible calculations:
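For example (these particular quantities are just illustrative choices), the table gives:

$$P(\neg Storm) = .05 + .30 = .35$$

$$P(Lightning \mid Storm) = \frac{.25}{.25 + .40} \approx .385$$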
Each additional variable multiplies the size of the space by the number of possible values of that variable. For example, adding Thunder would double the space to 8 different combinations.
Normal Independence
Independence means that the joint distribution of two variables is equal to the product of their marginals:

$$P(X, Y) = P(X)\,P(Y)$$

Chain Rule

The chain rule factors a joint distribution into a marginal and a conditional:

$$P(X, Y) = P(X)\,P(Y \mid X)$$

Substituting the first equation into the second, we get

$$P(Y \mid X) = P(Y)$$
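As a quick check against the joint table above, Storm and Lightning are clearly not independent:

$$P(Storm, Lightning) = .25 \neq P(Storm)\,P(Lightning) = .65 \times .30 = .195$$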
Conditional Independence

X is conditionally independent of Y given Z if, for all values x, y, and z, the probability of X = x is unaffected by the value of Y once the value of Z is known:

$$P(X = x \mid Y = y, Z = z) = P(X = x \mid Z = z)$$

More compactly, this can be written:

$$P(X \mid Y, Z) = P(X \mid Z)$$
Conditional Independence Example
Storm | Lightning | Thunder | Probability |
---|---|---|---|
T | T | T | .20 |
T | T | F | .05 |
T | F | T | .04 |
T | F | F | .36 |
F | T | T | .04 |
F | T | F | .01 |
F | F | T | .03 |
F | F | F | .27 |
So, we have conditionally independent variables. Thunder is conditionally independent of Storm, given Lightning.
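This can be checked directly from the table; for example,

$$P(Thunder \mid Storm, Lightning) = \frac{.20}{.25} = .8 = \frac{.04}{.05} = P(Thunder \mid \neg Storm, Lightning)$$

and similarly $P(Thunder \mid Storm, \neg Lightning) = .04/.40 = .1 = .03/.30 = P(Thunder \mid \neg Storm, \neg Lightning)$. Once Lightning is known, Storm provides no additional information about Thunder.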
Bayesian Networks
Bayesian Networks, also known as Bayes Nets, Belief Networks, and Graphical Models, are graphical representations for the conditional independence relationships between all the variables in the joint distribution. Nodes correspond to variables and edges correspond to dependencies that need to be explicitly represented.
We have shown that Thunder is conditionally independent of Storm given Lightning, so the joint distribution factors as

$$P(Storm, Lightning, Thunder) = P(Storm)\,P(Lightning \mid Storm)\,P(Thunder \mid Lightning)$$

and the network only needs to store the following values:

Label | Probability |
---|---|
P(Storm) | .650 |
P(Lightning \| Storm) | .385 |
P(Lightning \| ¬Storm) | .143 |
P(Thunder \| Lightning) | .800 |
P(Thunder \| ¬Lightning) | .100 |
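As a sanity check, the factored network reproduces the original joint table; for example,

$$P(Storm, Lightning, Thunder) = .65 \times .385 \times .80 \approx .20$$

which matches the first row of the three-variable joint distribution above.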
If Thunder were not conditionally independent of Storm given Lightning, the network would be more complex. The number of values that must be tracked grows exponentially with the in-degree of a node: a Boolean node with k Boolean parents requires $2^k$ conditional probabilities.
Sampling from the Joint Distribution
To sample from the joint distribution, sample each variable in topological order, so that a node's parents are assigned values before the node itself. This is the only appropriate order, due to the dependencies.
The network must always be directed and acyclic.
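A minimal sketch of this ancestral-sampling procedure for the Storm → Lightning → Thunder network, using the probabilities computed above (the variable and function names are just illustrative):

```python
import random

# Conditional probability tables for the Storm -> Lightning -> Thunder network
P_STORM = 0.65                              # P(Storm)
P_LIGHTNING = {True: 0.385, False: 0.143}   # P(Lightning | Storm)
P_THUNDER = {True: 0.80, False: 0.10}       # P(Thunder | Lightning)

def sample_once():
    """Sample the variables in topological order: parents before children."""
    storm = random.random() < P_STORM
    lightning = random.random() < P_LIGHTNING[storm]
    thunder = random.random() < P_THUNDER[lightning]
    return storm, lightning, thunder

# Approximate inference by simulation: estimate P(Thunder)
samples = [sample_once() for _ in range(100_000)]
print(sum(t for _, _, t in samples) / len(samples))  # roughly 0.31
```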
Why Sampling?
- Two things distributions are for:
  - Determine probability of value
  - Generate values
- Simulation of a complex process
- Approximate inference
- Visualization - get a feel for how the distribution works
Inferencing Rules
The following rules are often used in combination to work out the probabilities for various kinds of events.
Marginalization

$$P(X = x) = \sum_{y} P(X = x, Y = y)$$

Chain Rule

$$P(X, Y) = P(X)\,P(Y \mid X)$$

Bayes Rule

$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$
Marbles and Boxes Example
Consider an example where someone picks a box, picks a ball from that box, then picks another ball from the same box without replacing the first ball.
What is the probability that the second ball drawn is blue, given that the first ball drawn was green?
The Bayes net for this problem has Box as a parent of both Ball 1 and Ball 2, and Ball 1 as a parent of Ball 2.
Ball 1 Probabilities
Box | Green | Yellow | Blue |
---|---|---|---|
1 | 3/4 | 1/4 | 0 |
2 | 2/5 | 0 | 3/5 |
Ball 2 Probabilities, assuming ball 1 was green
Box | Green | Yellow | Blue |
---|---|---|---|
1 | 2/3 | 1/3 | 0 |
2 | 1/4 | 0 | 3/4 |
The formula below follows from the marginalization rule and the chain rule:

$$P(2 = blue \mid 1 = green) = \sum_{box} P(2 = blue \mid 1 = green, box)\,P(box \mid 1 = green)$$

The formulae below follow from Bayes rule:

$$P(box = 1 \mid 1 = green) = \frac{P(1 = green \mid box = 1)\,P(box = 1)}{P(1 = green)} = \frac{3/4 \times 1/2}{P(1 = green)}$$

$$P(box = 2 \mid 1 = green) = \frac{P(1 = green \mid box = 2)\,P(box = 2)}{P(1 = green)} = \frac{2/5 \times 1/2}{P(1 = green)}$$

The following comes from averaging the likelihood of the first draw being green, given that the likelihood of choosing either box is 1/2:

$$P(1 = green) = \frac{1}{2} \cdot \frac{3}{4} + \frac{1}{2} \cdot \frac{2}{5} = \frac{23}{40}$$

so $P(box = 1 \mid 1 = green) = 15/23$ and $P(box = 2 \mid 1 = green) = 8/23$. Plugging back in to the first equation:

$$P(2 = blue \mid 1 = green) = 0 \cdot \frac{15}{23} + \frac{3}{4} \cdot \frac{8}{23} = \frac{6}{23}$$
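The same answer can be reproduced by brute-force enumeration. The sketch below assumes box contents consistent with the tables above (3 green and 1 yellow ball in box 1; 2 green and 3 blue balls in box 2), which is one possible set of counts that yields those probabilities:

```python
from fractions import Fraction as F

# Hypothetical box contents consistent with the Ball 1 and Ball 2 tables
boxes = {1: ["green"] * 3 + ["yellow"], 2: ["green"] * 2 + ["blue"] * 3}

# Enumerate: pick a box uniformly, then two balls without replacement
joint = {}
for box, balls in boxes.items():
    for i, first in enumerate(balls):
        rest = balls[:i] + balls[i + 1:]
        for second in rest:
            p = F(1, 2) * F(1, len(balls)) * F(1, len(rest))
            joint[(first, second)] = joint.get((first, second), 0) + p

p_green_then_blue = joint[("green", "blue")]
p_green = sum(p for (first, _), p in joint.items() if first == "green")
print(p_green_then_blue)             # 3/20  -- P(1 = green, 2 = blue)
print(p_green_then_blue / p_green)   # 6/23  -- P(2 = blue | 1 = green)
```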
Naive Bayes - Algorithmic Approach
Naive Bayes is “naive” because it assumes that attributes are independent of one another, conditional on the label.
If a message is a spam email, it is likely to contain certain words. In the table below, the first row gives the probability that a message is or is not spam, and each remaining row gives the probability that a particular word appears in the message, conditioned on whether the message is spam.

Spam | Not spam |
---|---|
.4 | .6 |
.3 | .001 |
.2 | .1 |
.0001 | .1 |
Naive Bayes - Generic Form
Using the previous example, the type of email (spam or not) can be considered as the class, and the presence or absence of each word as an attribute. In general, Naive Bayes picks the most probable class given the observed attribute values $a_1, \dots, a_n$:

$$\hat{c} = \operatorname*{argmax}_{c} \; P(c) \prod_{i} P(a_i \mid c)$$
If we have a way of generating attribute values from classes, then the foregoing formula allows us to go in the opposite direction. We can observe the attribute values and infer the class.
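A minimal sketch of that inference direction, using the numbers from the spam table above; the word names here are placeholders rather than the actual attributes from the example:

```python
# Prior and per-word conditional probabilities, taken from the spam table above.
# The word names are placeholders for whatever attributes the rows represent.
p_spam = {True: 0.4, False: 0.6}
p_word_given_spam = {
    "word_a": {True: 0.3, False: 0.001},
    "word_b": {True: 0.2, False: 0.1},
    "word_c": {True: 0.0001, False: 0.1},
}

def posterior_spam(words_present):
    """P(Spam | observed words), assuming words are independent given the label."""
    score = {}
    for is_spam in (True, False):
        score[is_spam] = p_spam[is_spam]
        for word, p_word in p_word_given_spam.items():
            p = p_word[is_spam]
            score[is_spam] *= p if word in words_present else (1 - p)
    return score[True] / (score[True] + score[False])

print(posterior_spam({"word_a", "word_b"}))  # close to 1: very likely spam
```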
Naive Bayes Benefits
- Inference is cheap
- Few parameters
- Estimate parameters with labeled data
- Connects inference and classification
- Empirically successful
- For example, widely employed at Google
For Fall 2019, CS 7641 is taught by Dr. Charles Isbell. The course content was originally created by Dr. Charles Isbell and Dr. Michael Littman.