Naive Bayes SMS Example

01 Mar 2020

The SMS Spam Collection Data Set used for this example was originally downloaded from the UCI Machine Learning Repository. A complete description of the data is included in the data set's abstract.

The data contains two columns. One is a label that classifies each message into either “ham” or “spam,” the former indicating a legitimate text message and the latter indicating the text message was spam. The other column is the text content of the SMS message.

Import Data and Preprocess

import pandas as pd

pd.set_option('display.max_colwidth',120)

t = pd.read_table('naive-bayes-implementations/sms_spam_collection.txt',
                  sep='\t',
                  names=['label','text'])
print('{:2.1f}% ham'
      .format((t[t['label']=='ham']).shape[0]/(t.shape[0])*100.))
t.head()

86.6% ham

	label	text
0	ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
1	ham	Ok lar... Joking wif u oni...
2	spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std t...
3	ham	U dun say so early hor... U c already then say...
4	ham	Nah I don't think he goes to usf, he lives around here though

# Encode the labels as integers: ham -> 0, spam -> 1
t['label'] = t['label'].map({'ham'  : 0,
                             'spam' : 1})

Split into Training and Testing Sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(t['text'], 
                                                    t['label'], 
                                                    random_state=1)

print('Total: {} rows'.format(t.shape[0]))
print('Train: {} rows'.format(X_train.shape[0]))
print(' Test: {} rows'.format(X_test.shape[0]))

Total: 5572 rows
Train: 4179 rows
 Test: 1393 rows
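
The row counts above follow from the default behavior of `train_test_split`, which holds out 25% of the rows when neither `test_size` nor `train_size` is given. A quick arithmetic check, assuming the test count is rounded up and the remainder goes to training:

```python
import math

n_rows = 5572          # total rows reported above
test_fraction = 0.25   # train_test_split default when no sizes are given

n_test = math.ceil(n_rows * test_fraction)   # test count is rounded up
n_train = n_rows - n_test                    # the rest goes to training

print('Train: {} rows'.format(n_train))
print(' Test: {} rows'.format(n_test))
```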

Use a `CountVectorizer` to Convert Text to Numerics

Note that the CountVectorizer is fit on the training data only, but is used to transform both the training and test data. Words that appear only in the test messages are therefore ignored.

from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()

train_data = count_vector.fit_transform(X_train)
test_data = count_vector.transform(X_test)
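
To make the fit/transform distinction concrete, here is a minimal plain-Python sketch of a bag-of-words count matrix. It is a stand-in for illustration, not the `CountVectorizer` implementation: the helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy messages are made up, and it returns dense lists rather than a sparse matrix. The tokenizer lowercases and keeps runs of two or more word characters, which roughly mirrors the `CountVectorizer` defaults.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r'\b\w\w+\b', text.lower())

def fit_vocabulary(documents):
    # Map every distinct training token to a column index (sorted for stability).
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(documents, vocabulary):
    # One row of counts per document; tokens missing from the vocabulary are ignored.
    rows = []
    for doc in documents:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

toy_train = ['win a free prize now', 'are you free tonight']   # made-up messages
toy_test = ['free prize inside']                                # 'inside' is unseen

vocabulary = fit_vocabulary(toy_train)          # learned from training text only
train_rows = transform(toy_train, vocabulary)
test_rows = transform(toy_test, vocabulary)     # same columns, 'inside' dropped

print(list(vocabulary))
print(test_rows)
```

Because the vocabulary comes from the training messages alone, the test row has exactly the same columns as the training rows, which is what the classifier needs.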

Train a `MultinomialNB` from sklearn

This specific classifier is suitable for classification with discrete features (such as word counts for text classification).

Another option would be Gaussian Naive Bayes, which is better suited to continuous data. That classifier assumes that, within each class, each feature follows a Gaussian (normal) distribution.

from sklearn.naive_bayes import MultinomialNB

Fit the classifier.

naive_bayes = MultinomialNB().fit(train_data,
                                  y_train)

Make predictions.

predictions = naive_bayes.predict(test_data)
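
For intuition about what a multinomial Naive Bayes classifier computes, the sketch below fits one by hand on a few made-up, pre-tokenized messages. It is a simplified illustration, not the scikit-learn implementation: the function names and toy data are hypothetical, and it assumes add-one (Laplace) smoothing, which matches the `MultinomialNB` default of `alpha=1.0`.

```python
import math
from collections import Counter, defaultdict

def fit_multinomial_nb(docs, labels, alpha=1.0):
    # docs: list of token lists; labels: one class label per document.
    vocab = {tok for doc in docs for tok in doc}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    # log P(class) from class frequencies
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    # log P(token | class) with additive smoothing
    log_likelihood = {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {tok: math.log((token_counts[c][tok] + alpha) / total)
                             for tok in vocab}
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    # Sum log-probabilities per class; tokens never seen in training are skipped.
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_likelihood[c][tok]
                                       for tok in doc if tok in log_likelihood[c])
    return max(scores, key=scores.get)

# Made-up toy messages, already tokenized; 0 = ham, 1 = spam.
toy_docs = [['free', 'prize', 'now'],
            ['win', 'free', 'cash'],
            ['see', 'you', 'tonight'],
            ['are', 'you', 'free', 'tonight']]
toy_labels = [1, 1, 0, 0]

log_prior, log_likelihood = fit_multinomial_nb(toy_docs, toy_labels)
spam_guess = predict(['free', 'cash', 'prize'], log_prior, log_likelihood)
ham_guess = predict(['see', 'you', 'tonight'], log_prior, log_likelihood)
print(spam_guess, ham_guess)
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.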

Evaluate the Model

When the class distribution is skewed (as in our data: 86.6% legitimate text messages), accuracy alone is not always the best indicator, since a classifier that labels every message as ham would already be correct about 86.6% of the time. In these situations, precision and recall are useful.
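
A quick sketch of that effect, using a made-up label vector with the same class balance as the data (866 ham and 134 spam per 1,000 messages, an assumption for illustration): a baseline that labels everything as ham scores well on accuracy while catching no spam at all.

```python
# Hypothetical labels with the same balance as the data: 0 = ham, 1 = spam.
y_true = [0] * 866 + [1] * 134
y_pred = [0] * len(y_true)    # baseline that always predicts ham

correct = sum(t == p for t, p in zip(y_true, y_pred))
baseline_accuracy = correct / len(y_true)
spam_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print('Baseline accuracy: {:2.1f}%'.format(baseline_accuracy * 100.))
print('Spam caught: {} of {}'.format(spam_caught, sum(y_true)))
```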

For all these metrics, the score can range from 0 to 1, with 1 being the best.

Accuracy: how often the classifier makes the right prediction.

$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$

Precision: the proportion of messages classified as spam that actually were spam.

$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$

Recall (Sensitivity): the proportion of messages that actually were spam that were classified as spam.

$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$

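F1 Score: the harmonic mean of the precision and recall scores, which is high only when both are high.

$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$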
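
The four metrics can be worked through by hand from the counts of true and false positives and negatives. The sketch below does this on ten made-up labels (1 = spam, 0 = ham); the helper name `confusion_counts` and the data are illustrative only and are not part of the example above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count true positives, false positives, false negatives, true negatives.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

# Made-up labels: 3 spam and 7 ham messages.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + fp + fn + tn)   # correct / total
precision = tp / (tp + fp)                   # flagged messages that were spam
recall = tp / (tp + fn)                      # spam messages that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)
print('{:.3f} {:.3f} {:.3f} {:.3f}'.format(accuracy, precision, recall, f1))
```

Here two of the four flagged messages really are spam (precision 0.5), while two of the three spam messages are caught (recall about 0.667), so the two metrics capture different kinds of error.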

from sklearn.metrics import (accuracy_score, 
                             precision_score,
                             recall_score, 
                             f1_score)

# Store the results under new names so the imported metric
# functions are not overwritten and can be called again.
accuracy = accuracy_score(y_test,
                          predictions)
precision = precision_score(y_test,
                            predictions)
recall = recall_score(y_test,
                      predictions)
f1 = f1_score(y_test,
              predictions)

print(' Accuracy score: {:2.1f}%'.format(accuracy*100.))
print('Precision score: {:2.1f}%'.format(precision*100.))
print('   Recall score: {:2.1f}%'.format(recall*100.))
print('       F1 score: {:2.1f}%'.format(f1*100.))

 Accuracy score: 98.9%
Precision score: 97.2%
   Recall score: 94.1%
       F1 score: 95.6%
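
As a sanity check, the reported F1 score can be recovered from the reported precision and recall, since F1 is their harmonic mean. The inputs below are the rounded percentages printed above, so the result is only approximate:

```python
precision = 0.972   # reported above, rounded to one decimal place
recall = 0.941      # reported above, rounded to one decimal place

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print('F1 from rounded precision and recall: {:2.1f}%'.format(f1 * 100.))
```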

Advantages of Naive Bayes:

Handles an extremely large number of features (7456 in this example, one for every distinct word in the training messages),
Largely unaffected by irrelevant features,
Relatively simple and usually works without parameter tuning,
Very rarely overfits the data, and
Training and prediction are very fast for the amount of data it handles.

This content is taken from notes I took while pursuing the Intro to Machine Learning with PyTorch nanodegree certification.