Naive Bayes Implementations
Bag of Words
The Bag of Words (BoW) concept involves converting a piece of text into a frequency distribution of the words in that text. In the process of this conversion, the specific ordering of the words is lost.
More specifically, the data is converted to a matrix with each document being a row and each word (“token”) being a column. Each value (row, column) is the frequency of occurrence of each token in that document.
This can be accomplished using sklearn’s CountVectorizer
class, which also handles converting the text to lowercase, ignores punctuation, and can ignore common words that might skew the results (like “am,” “an,” “and,” “the,” etc.), known as “stop words.”
Implementing Bag of Words from Scratch
Consider the following set of documents.
import pandas as pd
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']
documents
['Hello, how are you!',
'Win money, win from home.',
'Call me now.',
'Hello, Call hello you tomorrow?']
- Convert to lower case.
lower_case_documents = []
for d in documents:
    d = d.lower()
    lower_case_documents.append(d)
lower_case_documents
['hello, how are you!',
'win money, win from home.',
'call me now.',
'hello, call hello you tomorrow?']
- Remove all punctuation.
import string
sans_punctuation_documents = []
for d in lower_case_documents:
    for c in string.punctuation:
        d = d.replace(c, '')
    sans_punctuation_documents.append(d)
sans_punctuation_documents
['hello how are you',
'win money win from home',
'call me now',
'hello call hello you tomorrow']
- Tokenize.
preprocessed_documents = []
for d in sans_punctuation_documents:
    d = d.split(' ')
    preprocessed_documents.append(d)
preprocessed_documents
[['hello', 'how', 'are', 'you'],
['win', 'money', 'win', 'from', 'home'],
['call', 'me', 'now'],
['hello', 'call', 'hello', 'you', 'tomorrow']]
- Count frequencies.
from collections import Counter
frequency_list = []
for d in preprocessed_documents:
    d = dict(Counter(d))
    frequency_list.append(d)
frequency_list
[{'hello': 1, 'how': 1, 'are': 1, 'you': 1},
{'win': 2, 'money': 1, 'from': 1, 'home': 1},
{'call': 1, 'me': 1, 'now': 1},
{'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1}]
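To connect this back to the matrix view described at the start, the per-document counts can be assembled into a single document-term matrix. A minimal sketch (the vocabulary and frequency_matrix_scratch names are just illustrative):
# Build a sorted vocabulary from all tokens, then one row of counts per document.
vocabulary = sorted(set(word for d in preprocessed_documents for word in d))
frequency_matrix_scratch = pd.DataFrame(
    [[Counter(d)[word] for word in vocabulary] for d in preprocessed_documents],
    columns=vocabulary)
frequency_matrix_scratch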
Implementing Bag of Words using scikit-learn
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']
documents
['Hello, how are you!',
'Win money, win from home.',
'Call me now.',
'Hello, Call hello you tomorrow?']
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
The important parameters of CountVectorizer are:

- lowercase, which converts the text to its lower case form,
- token_pattern, which defaults to a regular expression that treats punctuation marks as delimiters and accepts alphanumeric strings of 2+ characters as individual tokens, and
- stop_words, which, if set to english, will remove common filler words (“an”, “the”, “than”, etc.).
count_vector.fit(documents)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
The get_feature_names
method returns the set of words that make up the documents the vectorizer was fit on.
word_list = count_vector.get_feature_names()
word_list
['are',
'call',
'from',
'hello',
'home',
'how',
'me',
'money',
'now',
'tomorrow',
'win',
'you']
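Note that in newer scikit-learn releases (1.0 and later), get_feature_names is deprecated in favor of get_feature_names_out, so on such a version the equivalent call would be:
# On scikit-learn 1.0+; returns a numpy array of the vocabulary rather than a list.
word_list = count_vector.get_feature_names_out()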
The transform
method, chained with toarray
, converts a set of documents to a 2-dimensional numpy array with the documents as the rows and the words as the columns.
doc_array = count_vector.transform(documents).toarray()
doc_array
array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
[0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
[0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
[0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1]])
Next, convert this to a DataFrame that is more human-readable.
frequency_matrix = pd.DataFrame(doc_array,
                                columns=word_list)
frequency_matrix
| | are | call | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
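The walkthrough above leaves CountVectorizer at its defaults. As a quick, illustrative sketch of the stop_words parameter described earlier (not part of the original walkthrough), fitting a second vectorizer with stop_words='english' drops the common filler words from the vocabulary:
# Same documents, but with sklearn's built-in English stop word list applied.
stop_word_vector = CountVectorizer(stop_words='english')
stop_word_vector.fit(documents)
print(stop_word_vector.get_feature_names())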
Bayes' Theorem Implementation from Scratch
Bayes' theorem calculates the probability of an event occurring based on certain other probabilities that are related to the event in question. It involves “prior probabilities,” often shortened to “priors,” which are the probabilities we are aware of or are given, and “posterior probabilities,” often shortened to “posteriors,” which are the probabilities we want to compute.
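In its general form, for events A and B, Bayes' theorem relates the posterior P(A|B) to the prior P(A) and the likelihood P(B|A):

$$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$$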
Diabetes Example
- P(D) is the probability of a person having diabetes. This has a value of 0.01, or 1%.
- P(Pos) is the probability of getting a positive test result.
- P(Neg) is the probability of getting a negative test result.
- P(Pos|D) is the probability of getting a positive test result given that the patient has diabetes. This has a value of 0.90, or 90%. This is called the Sensitivity or True Positive Rate.
- P(Neg|-D) is the probability of getting a negative test result given that the patient does not have diabetes. This also has a value of 0.90, or 90%. This is called the Specificity or True Negative Rate.
Bayes’ Theorem for this example is:
$$P(D|Pos) = P(D) \times \frac{P(Pos|D)}{P(Pos)}$$
The probability of getting a positive test result, P(Pos), can be calculated using the Sensitivity and Specificity as follows:
$$P(Pos) = (P(D) \times Sensitivity) + (P(-D) \times (1 - Specificity))$$
# P(D)
p_diabetes = 0.01
# P(-D)
p_no_diabetes = 0.99
# Sensitivity or P(Pos|D)
p_pos_diabetes = 0.9
# Specificity or P(Neg|-D)
p_neg_no_diabetes = 0.9
# P(Pos)
p_pos = ((p_diabetes * p_pos_diabetes)
         + (p_no_diabetes * (1 - p_neg_no_diabetes)))
print('P(Pos) = {:2.2f}%'.format(p_pos*100))
P(Pos) = 10.80%
Now, the posteriors can be calculated.
$$P(D|Pos) = \frac{P(D) \times P(Pos|D)}{P(Pos)}$$
p_diabetes_pos = (p_diabetes * p_pos_diabetes) / p_pos
print('P(Diabetes|Pos) = {:2.2f}%'.format(p_diabetes_pos*100))
P(Diabetes|Pos) = 8.33%
$$P(Pos|-D) = 1 - P(Neg|-D)$$
$$P(-D|Pos) = \frac{P(-D) \times P(Pos|-D)}{P(Pos)} = \frac{P(-D) \times (1 - P(Neg|-D))}{P(Pos)}$$
p_no_diabetes_pos = (p_no_diabetes * (1-p_neg_no_diabetes)) / p_pos
print('P(No Diabetes|Pos) = {:2.2f}%'.format(p_no_diabetes_pos*100))
P(No Diabetes|Pos) = 91.67%
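As a quick sanity check (not part of the original walkthrough), the two posteriors cover all possibilities for a patient who tests positive, so they should sum to 1:
# P(Diabetes|Pos) + P(No Diabetes|Pos) should be 1 (up to floating-point rounding).
print(p_diabetes_pos + p_no_diabetes_pos)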
The term “Naive” in Naive Bayes comes from the fact that the algorithm treats the features it uses to make predictions as independent of each other, an assumption that may not hold in practice.
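Formally, for a class y and features x_1 through x_n, the naive assumption lets the joint likelihood factor into a product of per-feature likelihoods:

$$P(x_1, \dots, x_n|y) = P(x_1|y) \times P(x_2|y) \times \dots \times P(x_n|y)$$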
Political Speech Example
In this example, assume we have the text of speeches given by Jill Stein of the Green Party and Gary Johnson of the Libertarian Party, along with the following probabilities:
- P(J) = 0.5, the probability that Jill gives a speech
- P(G) = 0.5, the probability that Gary gives a speech
- P(F|J) = 0.1, the probability that Jill says “freedom”
- P(I|J) = 0.1, the probability that Jill says “immigration”
- P(E|J) = 0.8, the probability that Jill says “environment”
- P(F|G) = 0.7, the probability that Gary says “freedom”
- P(I|G) = 0.2, the probability that Gary says “immigration”
- P(E|G) = 0.1, the probability that Gary says “environment”
The ultimate goal is to compute the following posteriors:
P(J|F,I): given the words “freedom” and “immigration” were said, what is the probability they were said by Jill?
$$P(J|F,I) = \frac{P(J) \times P(F|J) \times P(I|J)}{P(F,I)}$$
P(G|F,I): given the words “freedom” and “immigration” were said, what is the probability they were said by Gary?
$$P(G|F,I) = \frac{P(G) \times P(F|G) \times P(I|G)}{P(F,I)}$$
First, find P(F,I), which is the probability that the words “freedom” and “immigration” were said in a speech.
p_j = 0.5    # P(J)
p_f_j = 0.1  # P(F|J)
p_i_j = 0.1  # P(I|J)
p_j_text = p_j * p_f_j * p_i_j
p_g = 0.5    # P(G)
p_f_g = 0.7  # P(F|G)
p_i_g = 0.2  # P(I|G)
p_g_text = p_g * p_f_g * p_i_g
p_fi = p_j_text + p_g_text
print('P(F,I) = {:2.2f}%'.format(p_fi*100))
P(F,I) = 7.50%
Now, P(J|F,I) and P(G|F,I) can be computed.
p_j_fi = (p_j * p_f_j * p_i_j) / p_fi
print('P(J|F,I) = {:2.2f}%'.format(p_j_fi*100))
P(J|F,I) = 6.67%
p_g_fi = (p_g * p_f_g * p_i_g) / p_fi
print('P(G|F,I) = {:2.2f}%'.format(p_g_fi*100))
P(G|F,I) = 93.33%
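Putting the two pieces together, scikit-learn's MultinomialNB classifier applies this same Bayes computation to a Bag of Words matrix. A minimal sketch, assuming hypothetical spam/not-spam labels for the four documents from earlier (the labels and the test message are invented purely for illustration):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']
labels = [0, 1, 0, 0]  # hypothetical: 1 = spam, 0 = not spam
# Build the document-term matrix and fit the Naive Bayes classifier on it.
count_vector = CountVectorizer()
training_data = count_vector.fit_transform(documents)
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, labels)
# Score a new message using the same vocabulary learned from the training documents.
test_data = count_vector.transform(['Win money now'])
print(naive_bayes.predict(test_data))  # predicted label for the new message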