Bayesian Inference
Joint Distribution
The following is an example of a joint distribution.
Storm | Lightning | Probability |
---|---|---|
T | T | .25 |
T | F | .40 |
F | T | .05 |
F | F | .30 |
A few possible calculations:
$$Pr(\overline{Storm})=.30+.05=.35$$
$$Pr(Lightning|Storm)=.25/.65\approx.3846$$
Each additional variable multiplies the size of the joint distribution by the number of possible values of that variable. For example, adding Thunder would double the table to 8 entries.
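As a quick check, the calculations above can be reproduced directly from the table. Below is a minimal Python sketch (the dictionary layout and variable names are my own):

```python
# Joint distribution over (Storm, Lightning), taken from the table above.
joint = {
    (True, True): 0.25,
    (True, False): 0.40,
    (False, True): 0.05,
    (False, False): 0.30,
}

# Pr(not Storm): sum the rows where Storm is False.
p_not_storm = sum(p for (storm, _), p in joint.items() if not storm)

# Pr(Lightning | Storm) = Pr(Lightning, Storm) / Pr(Storm).
p_storm = sum(p for (storm, _), p in joint.items() if storm)
p_lightning_given_storm = joint[(True, True)] / p_storm

print(p_not_storm)              # 0.35
print(p_lightning_given_storm)  # 0.25 / 0.65 ≈ 0.3846
```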
Normal Independence
Independence means that the joint distribution of two variables is equal to the product of their marginals.
$$Pr(X,Y)=Pr(X)Pr(Y)$$
Chain Rule
$$Pr(X,Y)=Pr(X|Y)Pr(Y)$$
Substituting the first equation into the second, we get
$$\forall_{x,y} Pr(X|Y)=Pr(X)$$
Conditional Independence
$X$ is conditionally independent of $Y$ given $Z$ if the probability distribution governing $X$ is independent of the value of $Y$ given the value of $Z$; that is, if
$$\forall_{x,y,z} P(X=x|Y=y, Z=z)=P(X=x|Z=z)$$
More compactly, this can be written:
$$P(X|Y,Z)=P(X|Z)$$
Conditional Independence Example
Storm | Lightning | Thunder | Probability |
---|---|---|---|
T | T | T | .20 |
T | T | F | .05 |
T | F | T | .04 |
T | F | F | .36 |
F | T | T | .04 |
F | T | F | .01 |
F | F | T | .03 |
F | F | F | .27 |
$$P(thunder=T|lightning=F,storm=T)=.04/.40=.10$$
$$P(thunder=T|lightning=F,storm=F)=.03/.30=.10$$
$$Pr(T|L,S)=Pr(T|L)$$
So, we have conditionally independent variables. Thunder is conditionally independent of Storm, given Lightning.
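The check above can be automated. The sketch below (my own encoding of the table) loops over every Storm/Lightning combination and shows that $P(Thunder|Lightning,Storm)$ does not change with Storm:

```python
# Joint distribution over (Storm, Lightning, Thunder), from the table above.
joint = {
    (True, True, True): 0.20,   (True, True, False): 0.05,
    (True, False, True): 0.04,  (True, False, False): 0.36,
    (False, True, True): 0.04,  (False, True, False): 0.01,
    (False, False, True): 0.03, (False, False, False): 0.27,
}

def prob(event):
    """Total probability of the outcomes where event(storm, lightning, thunder) is true."""
    return sum(p for outcome, p in joint.items() if event(*outcome))

# P(Thunder = T | Lightning, Storm) for every setting of Lightning and Storm.
for lightning in (True, False):
    for storm in (True, False):
        numer = prob(lambda s, l, t: t and l == lightning and s == storm)
        denom = prob(lambda s, l, t: l == lightning and s == storm)
        print(f"lightning={lightning}, storm={storm}: {numer / denom:.2f}")
# For each value of Lightning the result is the same for both values of Storm
# (0.80 and 0.10), so Thunder is conditionally independent of Storm given Lightning.
```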
Bayesian Networks
Bayesian Networks, also known as Bayes Nets, Belief Networks, and Graphical Models, are graphical representations for the conditional independence relationships between all the variables in the joint distribution. Nodes correspond to variables and edges correspond to dependencies that need to be explicitly represented.
We have shown that Thunder is conditionally independent of Storm given Lightning, so entries such as $P(Th|L,S)$ and $P(Th|L,\overline{S})$ do not need to be represented separately. As a result, the five numbers shown below are enough to work out any probability in the joint distribution.
Label | Probability |
---|---|
$P(S)$ | .650 |
$P(L|S)$ | .385 |
$P(L|\overline{S})$ | .143 |
$P(Th|L)$ | .800 |
$P(Th|\overline{L})$ | .100 |
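As a sanity check of the claim that these five numbers determine the whole joint, the sketch below (my own code, using the exact ratios that the table rounds to three decimal places) rebuilds joint entries via $P(S,L,Th)=P(S)P(L|S)P(Th|L)$:

```python
# The five numbers from the table above (exact ratios; the table rounds them).
p_s = 0.65
p_l_given_s = {True: 0.25 / 0.65, False: 0.05 / 0.35}  # P(L | S),  P(L | not S)
p_th_given_l = {True: 0.8, False: 0.1}                 # P(Th | L), P(Th | not L)

def joint(s, l, th):
    """P(S=s, L=l, Th=th) = P(S=s) * P(L=l | S=s) * P(Th=th | L=l)."""
    ps = p_s if s else 1 - p_s
    pl = p_l_given_s[s] if l else 1 - p_l_given_s[s]
    pth = p_th_given_l[l] if th else 1 - p_th_given_l[l]
    return ps * pl * pth

print(f"{joint(True, True, True):.2f}")     # 0.20, matching the full eight-row table
print(f"{joint(False, False, False):.2f}")  # 0.27
```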
If Thunder were not conditionally independent of Storm given Lightning, the network would be more complex. The number of values that must be stored grows exponentially with the indegree of a node: a node with $k$ boolean parents requires $2^k$ entries in its conditional probability table.
Sampling from the Joint Distribution
Due to the dependencies, nodes must be sampled in topological order: a node can be sampled only after all of its parents have been sampled. $A\sim P(A),\ B\sim P(B),\ C\sim P(C|A,B),\ D\sim P(D|B,C),\ E\sim P(E|C,D)$
The network must always be directed and acyclic.
$$Pr(A,B,C,D,E)=Pr(A) Pr(B) Pr(C|A,B) Pr(D|B,C) Pr(E|C,D)$$
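The notes do not give numeric tables for this five-node network, so the sketch below uses made-up conditional probability tables purely to illustrate sampling each node in topological order (parents before children):

```python
import random

# Made-up CPTs for illustration; only the structure A,B -> C, B,C -> D, C,D -> E is from the notes.
p_a = 0.3                                            # P(A = T)
p_b = 0.6                                            # P(B = T)
p_c = {(True, True): 0.9, (True, False): 0.7,
       (False, True): 0.4, (False, False): 0.1}      # P(C = T | A, B)
p_d = {(True, True): 0.8, (True, False): 0.5,
       (False, True): 0.3, (False, False): 0.2}      # P(D = T | B, C)
p_e = {(True, True): 0.95, (True, False): 0.6,
       (False, True): 0.5, (False, False): 0.05}     # P(E = T | C, D)

def sample():
    """Draw one joint sample, sampling every node only after its parents."""
    a = random.random() < p_a
    b = random.random() < p_b
    c = random.random() < p_c[(a, b)]
    d = random.random() < p_d[(b, c)]
    e = random.random() < p_e[(c, d)]
    return a, b, c, d, e

print(sample())  # e.g. (False, True, True, True, True)
```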
Why Sampling?
- Two things distributions are for:
  - Determine probability of value
  - Generate values
- Simulation of a complex process
- Approximate inference
- Visualization - get a feel for how the distribution works
Inferencing Rules
The following rules are often used in combination to work out the probabilities for various kinds of events.
Marginalization
$$P(x)=\sum_y P(x,y)$$
Chain Rule
$$P(x,y) = P(x)P(y|x)$$
Bayes Rule
$$P(y|x)=\frac{P(x|y)P(y)}{P(x)}$$
Marbles and Boxes Example
Consider an example where someone picks one of two boxes at random, draws a ball from that box, then draws another ball from the same box without replacing the first.
Given that the first ball drawn was green, what is the probability that the second ball is blue? In other words, what is $P(2=blue|1=green)$?
In the Bayes net for this problem, Box is a parent of both Ball 1 and Ball 2, and Ball 1 is also a parent of Ball 2, since the first ball is not replaced.
Ball 1 Probabilities
Box | Green | Yellow | Blue |
---|---|---|---|
1 | 3/4 | 1/4 | 0 |
2 | 2/5 | 0 | 3/5 |
Ball 2 Probabilities, assuming ball 1 was green
Box | Green | Yellow | Blue |
---|---|---|---|
1 | 2/3 | 1/3 | 0 |
2 | 1/4 | 0 | 3/4 |
The formula below follows from the marginalization rule and the chain rule:
$$P(2=blue|1=green)=P(2=blue|1=green,box=1)P(box=1|1=green)$$ $$+P(2=blue|1=green,box=2)P(box=2|1=green)$$
The formulae below follow from Bayes' rule:
$$P(box=1|1=green)=P(1=green|box=1)\frac{Pr(box=1)}{Pr(1=green)}$$
$$P(box=2|1=green)=P(1=green|box=2)\frac{Pr(box=2)}{Pr(1=green)}$$
The following comes from averaging the probability that the first draw is green over the two boxes, given that each box is chosen with probability $\frac{1}{2}$:
$$P(1=green)=\frac{3}{4}*\frac{1}{2}+\frac{2}{5}*\frac{1}{2}=\frac{23}{40}$$
$$P(box=1|1=green)=\frac{3}{4} \frac{\frac{1}{2}}{\frac{23}{40}}=\frac{15}{23}$$
$$P(box=2|1=green)=\frac{2}{5} \frac{\frac{1}{2}}{\frac{23}{40}}=\frac{8}{23}$$
Plugging back in to the first equation:
$$P(2=blue|1=green)=(0)(\frac{15}{23})+(\frac{3}{4})(\frac{8}{23})=\frac{6}{23}$$
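The derivation can be double-checked with code. Below is a sketch (my own) that recomputes the answer exactly and then confirms it with a simple simulation; the box contents (3 green + 1 yellow in box 1, 2 green + 3 blue in box 2) are the counts implied by the two tables above:

```python
import random
from fractions import Fraction

# Box contents implied by the draw probabilities in the tables above.
boxes = [["green"] * 3 + ["yellow"], ["green"] * 2 + ["blue"] * 3]

# Exact computation, following the derivation above.
p_green = Fraction(3, 4) * Fraction(1, 2) + Fraction(2, 5) * Fraction(1, 2)  # 23/40
p_box1 = Fraction(3, 4) * Fraction(1, 2) / p_green                           # P(box=1 | 1=green) = 15/23
p_box2 = Fraction(2, 5) * Fraction(1, 2) / p_green                           # P(box=2 | 1=green) = 8/23
print(0 * p_box1 + Fraction(3, 4) * p_box2)                                  # 6/23

# Monte Carlo check: pick a box, draw two balls without replacement,
# and condition on the first draw being green.
hits = trials = 0
for _ in range(200_000):
    first, second = random.sample(random.choice(boxes), 2)
    if first == "green":
        trials += 1
        hits += second == "blue"
print(hits / trials)  # ≈ 6/23 ≈ 0.261
```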
Naive Bayes - Algorithmic Approach
Naive Bayes is “naive” because it assumes that attributes are independent of one another, conditional on the label.
If a message is spam, it is more likely to contain certain words. For example:
Probability | spam = T | spam = F |
---|---|---|
$P(spam)$ | .4 | .6 |
$P(“Viagra”|spam)$ | .3 | .001 |
$P(“Price”|spam)$ | .2 | .1 |
$P(“Udacity”|spam)$ | .0001 | .1 |
$$P(spam|“Viagra”,\ not\ “Price”,\ not\ “Udacity”)$$ $$=\frac{P(“Viagra”,\ not\ “Price”,\ not\ “Udacity”|spam)P(spam)}{…}$$ $$=\frac{P(“Viagra”|spam)P(not\ “Price”|spam)P(not\ “Udacity”|spam)P(spam)}{…}$$ $$=\frac{(.3)(.8)(.9999)(.4)}{…}$$
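The $…$ denominator is just the normalizing term, which can be obtained by computing the same product for the non-spam class and summing the two. A small sketch (my own encoding of the table) that carries the calculation through:

```python
# Probabilities from the table above.
p_spam = 0.4
p_word_given = {                                  # P(word appears | class)
    "Viagra":  {"spam": 0.3,    "not spam": 0.001},
    "Price":   {"spam": 0.2,    "not spam": 0.1},
    "Udacity": {"spam": 0.0001, "not spam": 0.1},
}
observed = {"Viagra": True, "Price": False, "Udacity": False}

def score(label, prior):
    """P(class) * prod_i P(a_i | class), using the naive conditional-independence assumption."""
    s = prior
    for word, present in observed.items():
        p = p_word_given[word][label]
        s *= p if present else (1 - p)
    return s

spam = score("spam", p_spam)              # (.3)(.8)(.9999)(.4)
not_spam = score("not spam", 1 - p_spam)  # (.001)(.9)(.9)(.6)
print(spam / (spam + not_spam))           # ≈ 0.995, so the message is almost certainly spam
```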
Naive Bayes - Generic Form
Using the previous example, the type of email can be considered a class, $V$. The various words that are more common in spam emails can be considered specific examples of attributes, $a_1, a_2, \dots, a_n$. $Z$ is a normalizing term that makes the resulting probabilities sum to one. Recall that $\prod_i$ means take the product over $i$.
$$P(V|a_1, a_2, \dots, a_n) = \frac{P(V) \prod_i P(a_i|V)}{Z}$$
If we have a way of generating attribute values from classes, then the foregoing formula allows us to go in the opposite direction. We can observe the attribute values and infer the class.
$$MAP\ class = \arg\max_V\lbrace P(V) \prod_i P(a_i|V)\rbrace$$
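A generic sketch of the MAP rule (the helper below is hypothetical, not from the notes): pick the class that maximizes the prior times the product of attribute likelihoods.

```python
from math import prod

def map_class(priors, likelihoods, observation):
    """Return argmax_V P(V) * prod_i P(a_i | V).

    priors:      {class: P(class)}
    likelihoods: {class: {attribute: {value: P(attribute = value | class)}}}
    observation: {attribute: observed value}
    """
    def score(v):
        return priors[v] * prod(likelihoods[v][a][x] for a, x in observation.items())
    return max(priors, key=score)

# Tiny made-up example with a single binary attribute.
priors = {"spam": 0.4, "not spam": 0.6}
likelihoods = {
    "spam":     {"Viagra": {True: 0.3,   False: 0.7}},
    "not spam": {"Viagra": {True: 0.001, False: 0.999}},
}
print(map_class(priors, likelihoods, {"Viagra": True}))  # spam
```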
Naive Bayes Benefits
- Inference is cheap
- Few parameters
- Estimate parameters with labeled data
- Connects inference and classification
- Empirically successful
  - For example, widely employed at Google
For Fall 2019, CS 7641 is instructed by Dr. Charles Isbell. The course content was originally created by Dr. Charles Isbell and Dr. Michael Littman.