Naive Bayes Implementations
Bag of Words
The Bag of Words (BoW) concept involves converting a piece of text into a frequency distribution of the words in that text. In the process of this conversion, the specific ordering of the words is lost.
More specifically, the data is converted to a matrix with each document being a row and each word (“token”) being a column. Each value (row, column) is the frequency of occurrence of each token in that document.
This can be accomplished using sklearn’s CountVectorizer
class, which also handles converting the text to lowercase, ignores punctuation, and can ignore common words that might skew the results (like “am,” “an,” “and,” “the,” etc.), known as “stop words.”
Implementing Bag of Words from Scratch
Consider the following set of documents.
import pandas as pd
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']
documents
['Hello, how are you!',
'Win money, win from home.',
'Call me now.',
'Hello, Call hello you tomorrow?']
- Convert to lower case.
lower_case_documents = []
for d in documents:
    d = d.lower()
    lower_case_documents.append(d)
lower_case_documents
['hello, how are you!',
'win money, win from home.',
'call me now.',
'hello, call hello you tomorrow?']
- Remove all punctuation.
import string
sans_punctuation_documents = []
for d in lower_case_documents:
    for c in string.punctuation:
        d = d.replace(c, '')
    sans_punctuation_documents.append(d)
sans_punctuation_documents
['hello how are you',
'win money win from home',
'call me now',
'hello call hello you tomorrow']
- Tokenize.
preprocessed_documents = []
for d in sans_punctuation_documents:
    d = d.split(' ')
    preprocessed_documents.append(d)
preprocessed_documents
[['hello', 'how', 'are', 'you'],
['win', 'money', 'win', 'from', 'home'],
['call', 'me', 'now'],
['hello', 'call', 'hello', 'you', 'tomorrow']]
- Count frequencies.
from collections import Counter
frequency_list = []
for d in preprocessed_documents:
    d = dict(Counter(d))
    frequency_list.append(d)
frequency_list
[{'hello': 1, 'how': 1, 'are': 1, 'you': 1},
{'win': 2, 'money': 1, 'from': 1, 'home': 1},
{'call': 1, 'me': 1, 'now': 1},
{'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1}]
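To connect this back to the matrix view described at the start, the per-document counts can be assembled into a single document-term matrix. A minimal sketch (the vocabulary and frequency_matrix_scratch names are just illustrative):
# Build a sorted vocabulary from all tokens, then one row of counts per document.
vocabulary = sorted(set(word for d in preprocessed_documents for word in d))
frequency_matrix_scratch = pd.DataFrame(
    [[Counter(d)[word] for word in vocabulary] for d in preprocessed_documents],
    columns=vocabulary)
frequency_matrix_scratch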
Implementing Bag of Words using scikit-learn
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']
documents
['Hello, how are you!',
'Win money, win from home.',
'Call me now.',
'Hello, Call hello you tomorrow?']
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
The important parameters of CountVectorizer are:

- lowercase, which converts the text to its lower case form,
- token_pattern, which defaults to a regular expression that treats punctuation marks as delimiters and accepts alphanumeric strings of 2+ characters as individual tokens, and
- stop_words, which, if set to english, will remove common filler words (“an”, “the”, “than”, etc.).
count_vector.fit(documents)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
The get_feature_names
method returns the set of words that make up the documents the vectorizer was fit on.
word_list = count_vector.get_feature_names()
word_list
['are',
'call',
'from',
'hello',
'home',
'how',
'me',
'money',
'now',
'tomorrow',
'win',
'you']
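Note that in newer scikit-learn releases (1.0 and later), get_feature_names is deprecated in favor of get_feature_names_out, so on such a version the equivalent call would be:
# On scikit-learn 1.0+; returns a numpy array of the vocabulary rather than a list.
word_list = count_vector.get_feature_names_out()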
The transform
method, chained with toarray
, converts a set of documents to a 2-dimensional numpy array with the documents as the rows and the words as the columns.
doc_array = count_vector.transform(documents).toarray()
doc_array
array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
[0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
[0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
[0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1]])
Next, convert this to a DataFrame that is more human-readable.
frequency_matrix = pd.DataFrame(doc_array,
                                columns=word_list)
frequency_matrix
| | are | call | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
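The walkthrough above leaves CountVectorizer at its defaults. As a quick, illustrative sketch of the stop_words parameter described earlier (not part of the original walkthrough), fitting a second vectorizer with stop_words='english' drops the common filler words from the vocabulary:
# Same documents, but with sklearn's built-in English stop word list applied.
stop_word_vector = CountVectorizer(stop_words='english')
stop_word_vector.fit(documents)
print(stop_word_vector.get_feature_names())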
Bayes' Theorem Implementation from Scratch
Bayes' theorem calculates the probability of an event occurring based on certain other probabilities that are related to the event in question. It involves “prior probabilities,” often shortened to “priors,” which are the probabilities we are aware of or are given, and “posterior probabilities,” often shortened to “posteriors,” which are the probabilities we want to compute.
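In its general form, for events A and B, Bayes' theorem relates the posterior P(A|B) to the prior P(A) and the likelihood P(B|A):

$$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$$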
Diabetes Example
- P(D) is the probability of a person having diabetes. This has a value of 0.01, or 1%.
- P(Pos) is the probability of getting a positive test result.
- P(Neg) is the probability of getting a negative test result.
- P(Pos|D) is the probability of getting a positive test result given that the patient has diabetes. This has a value of 0.90, or 90%. This is called the Sensitivity or True Positive Rate.
- P(Neg|-D) is the probability of getting a negative test result given that the patient does not have diabetes. This also has a value of 0.90, or 90%. This is called the Specificity or True Negative Rate.
Bayes’ Theorem for this example is:
$$P(D|Pos) = P(D) \times \frac{P(Pos|D)}{P(Pos)}$$
The probability of getting a positive test result, P(Pos), can be calculated using the Sensitivity and Specificity as follows:
$$P(Pos) = (P(D) \times Sensitivity) + (P(-D) \times (1 - Specificity))$$
# P(D)
p_diabetes = 0.01
# P(-D)
p_no_diabetes = 0.99
# Sensitivity or P(Pos|D)
p_pos_diabetes = 0.9
# Specificity or P(Neg|-D)
p_neg_no_diabetes = 0.9
# P(Pos)
p_pos = ((p_diabetes * p_pos_diabetes)
         + (p_no_diabetes * (1 - p_neg_no_diabetes)))
print('P(Pos) = {:2.2f}%'.format(p_pos*100))
P(Pos) = 10.80%
Now, the posteriors can be calculated.
$$P(D|Pos) = \frac{P(D) \times P(Pos|D)}{P(Pos)}$$
p_diabetes_pos = (p_diabetes * p_pos_diabetes) / p_pos
print('P(Diabetes|Pos) = {:2.2f}%'.format(p_diabetes_pos*100))
P(Diabetes|Pos) = 8.33%
$$P(Pos|-D) = 1 - P(Neg|-D)$$
$$P(-D|Pos) = \frac{P(-D) \times P(Pos|-D)}{P(Pos)} = \frac{P(-D) \times (1 - P(Neg|-D))}{P(Pos)}$$
p_no_diabetes_pos = (p_no_diabetes * (1-p_neg_no_diabetes)) / p_pos
print('P(No Diabetes|Pos) = {:2.2f}%'.format(p_no_diabetes_pos*100))
P(No Diabetes|Pos) = 91.67%
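As a quick sanity check (not part of the original walkthrough), the two posteriors cover all possibilities for a patient who tests positive, so they should sum to 1:
# P(Diabetes|Pos) + P(No Diabetes|Pos) should be 1 (up to floating-point rounding).
print(p_diabetes_pos + p_no_diabetes_pos)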
The term “Naive” in Naive Bayes comes from the fact that the algorithm treats the features it uses to make predictions as independent of each other, an assumption that may not hold in practice.
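Formally, for a class y and features x_1 through x_n, the naive assumption lets the joint likelihood factor into a product of per-feature likelihoods:

$$P(x_1, \dots, x_n|y) = P(x_1|y) \times P(x_2|y) \times \dots \times P(x_n|y)$$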
Political Speech Example
In this example, assume we have the text of speeches given by Jill Stein of the Green Party and Gary Johnson of the Libertarian Party, along with the following probabilities:
- P(J) = 0.5, the probability that Jill gives a speech
- P(G) = 0.5, the probability that Gary gives a speech
- P(F|J) = 0.1, the probability that Jill says “freedom”
- P(I|J) = 0.1, the probability that Jill says “immigration”
- P(E|J) = 0.8, the probability that Jill says “environment”
- P(F|G) = 0.7, the probability that Gary says “freedom”
- P(I|G) = 0.2, the probability that Gary says “immigration”
- P(E|G) = 0.1, the probability that Gary says “environment”
The ultimate goal is to compute the following posteriors:
P(J|F,I): given the words “freedom” and “immigration” were said, what is the probability they were said by Jill?
$$P(J|F,I) = \frac{P(J) \times P(F|J) \times P(I|J)}{P(F,I)}$$
P(G|F,I): given the words “freedom” and “immigration” were said, what is the probability they were said by Gary?
$$P(G|F,I) = \frac{P(G) \times P(F|G) \times P(I|G)}{P(F,I)}$$
First, find P(F,I), which is the probability that the words “freedom” and “immigration” were said in a speech.
p_j = 0.5    # P(J)
p_f_j = 0.1  # P(F|J)
p_i_j = 0.1  # P(I|J)
p_j_text = p_j * p_f_j * p_i_j
p_g = 0.5    # P(G)
p_f_g = 0.7  # P(F|G)
p_i_g = 0.2  # P(I|G)
p_g_text = p_g * p_f_g * p_i_g
p_fi = p_j_text + p_g_text
print('P(F,I) = {:2.2f}%'.format(p_fi*100))
P(F,I) = 7.50%
Now, P(J|F,I) and P(G|F,I) can be computed.
p_j_fi = (p_j * p_f_j * p_i_j) / p_fi
print('P(J|F,I) = {:2.2f}%'.format(p_j_fi*100))
P(J|F,I) = 6.67%
p_g_fi = (p_g * p_f_g * p_i_g) / p_fi
print('P(G|F,I) = {:2.2f}%'.format(p_g_fi*100))
P(G|F,I) = 93.33%
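Putting the two pieces together, scikit-learn's MultinomialNB classifier applies this same Bayes computation to a Bag of Words matrix. A minimal sketch, assuming hypothetical spam/not-spam labels for the four documents from earlier (the labels and the test message are invented purely for illustration):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']
labels = [0, 1, 0, 0]  # hypothetical: 1 = spam, 0 = not spam
# Build the document-term matrix and fit the Naive Bayes classifier on it.
count_vector = CountVectorizer()
training_data = count_vector.fit_transform(documents)
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, labels)
# Score a new message using the same vocabulary learned from the training documents.
test_data = count_vector.transform(['Win money now'])
print(naive_bayes.predict(test_data))  # predicted label for the new message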