Naive Bayes SMS Example
The SMS Spam Collection Data Set used for this example was originally downloaded from the UCI Machine Learning Repository. A complete description of the data is included in the abstract that accompanies the data set.
The data contains two columns. One is a label that classifies each message into either “ham” or “spam,” the former indicating a legitimate text message and the latter indicating the text message was spam. The other column is the text content of the SMS message.
Import Data and Preprocess
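import pandas as pd

# Show up to 120 characters per column so the message text stays readable.
pd.set_option('display.max_colwidth', 120)

# The file is tab-separated, and the column names are supplied here.
t = pd.read_table('naive-bayes-implementations/sms_spam_collection.txt',
                  sep='\t',
                  names=['label', 'text'])

# Report the share of legitimate ("ham") messages.
print('{:2.1f}% ham'
      .format((t[t['label'] == 'ham']).shape[0] / (t.shape[0]) * 100.))
t.head()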
86.6% ham
| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std t... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |
# Encode the labels numerically: ham -> 0, spam -> 1.
t['label'] = t['label'].map({'ham': 0, 'spam': 1})
Split into Training and Testing Sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(t['text'],
                                                    t['label'],
                                                    random_state=1)
print('Total: {} rows'.format(t.shape[0]))
print('Train: {} rows'.format(X_train.shape[0]))
print(' Test: {} rows'.format(X_test.shape[0]))
Total: 5572 rows
Train: 4179 rows
Test: 1393 rows
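The 4179/1393 split above follows from the default behavior of `train_test_split`, which holds out 25% of the rows when no `test_size` is given. The sketch below reproduces that on a small made-up label list (the data and variable names are invented for illustration) and also shows the optional `stratify` argument, which the example above does not use but which keeps the ham/spam ratio similar in both subsets.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the SMS data (invented): 87 ham (0) and 13 spam (1).
labels = [0] * 87 + [1] * 13
texts = ['message {}'.format(i) for i in range(len(labels))]

# With no test_size given, 25% of the rows are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, random_state=1)
print('Train: {} rows, Test: {} rows'.format(len(X_tr), len(X_te)))

# Optional: stratify on the labels to preserve the class ratio in each subset.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(texts, labels,
                                                  random_state=1,
                                                  stratify=labels)
print('Spam share in stratified test set: {:.2f}'
      .format(sum(y_te_s) / len(y_te_s)))
```

Stratifying matters most when one class is rare, as spam is here, because a purely random split can leave the test set with noticeably more or fewer spam messages than the training set.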
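Use a CountVectorizer to Convert Text to Numerics
Note that the CountVectorizer is fit on the training data only, but it is used to transform both the training and test data. Fitting on the training data alone keeps information about the test messages out of the vocabulary the model learns from.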
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
train_data = count_vector.fit_transform(X_train)
test_data = count_vector.transform(X_test)
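To make the fit/transform distinction concrete, here is a minimal sketch on a few made-up messages (the texts and variable names are invented for illustration). It shows that words appearing only in the test text are simply ignored, because the vocabulary was fixed when the vectorizer was fit.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy messages, not taken from the SMS data set.
train_texts = ['free entry to win', 'ok see you later', 'win a free prize']
test_texts = ['free pizza later']

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns one row of counts per message.
train_counts = vectorizer.fit_transform(train_texts)
# transform reuses that vocabulary; 'pizza' was never seen, so it is dropped.
test_counts = vectorizer.transform(test_texts)

print(sorted(vectorizer.vocabulary_))
print(train_counts.shape)     # (messages, distinct words)
print(test_counts.toarray())  # counts for 'free' and 'later' only
```

With the default settings, tokens are lowercased and single-character tokens such as "a" are not included in the vocabulary.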
Train a MultinomialNB from sklearn
This specific classifier is suitable for classification with discrete features (such as word counts for text classification).
Another option would be Gaussian Naive Bayes, which is better suited to continuous data. That classifier assumes that each feature follows a Gaussian (normal) distribution within each class.
from sklearn.naive_bayes import MultinomialNB
Fit the classifier.
naive_bayes = MultinomialNB().fit(train_data, y_train)
Make predictions.
predictions = naive_bayes.predict(test_data)
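The vectorize, fit, and predict steps can also be chained together. The sketch below is an alternative arrangement, not the approach used in this example: it uses scikit-learn's `make_pipeline` on a handful of invented messages, and calls `predict_proba` to show the class probabilities behind each prediction.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy messages: 1 = spam, 0 = ham.
toy_texts = ['win a free prize now',
             'free entry to win cash',
             'are you coming home later',
             'ok see you at lunch']
toy_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier in a single call.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(toy_texts, toy_labels)

new_messages = ['free cash prize', 'see you later']
toy_predictions = model.predict(new_messages)
# One row per message; columns are the probabilities of class 0 and class 1.
toy_probabilities = model.predict_proba(new_messages)

print(toy_predictions)
print(toy_probabilities.round(3))
```

Bundling the steps this way means new text is always transformed with the vocabulary learned during training, which avoids accidentally refitting the vectorizer on test data.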
Evaluate the Model
For data sets with a skewed class distribution (like ours: 86.6% legitimate text messages), accuracy alone is not always the best indicator, because a classifier that labels every message as ham would already be 86.6% accurate. In these situations, precision and recall are useful.
For all these metrics, the score can range from 0 to 1, with 1 being the best.
- Accuracy: how often the classifier makes the right prediction.
$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$$
- Precision: the proportion of messages classified as spam that actually were spam.
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
- Recall (Sensitivity): the proportion of actual spam messages that were classified as spam.
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
- F1 Score: the harmonic mean of the precision and recall scores.
$$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
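As a worked illustration of these definitions, the sketch below computes each metric by hand from counts of true/false positives and negatives on a small invented set of labels, then cross-checks the results against the scikit-learn functions. The label lists and variable names are made up for this illustration.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Invented labels: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Count each outcome by comparing the two lists pairwise.
tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
tn = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 0)

manual_accuracy = (tp + tn) / len(y_true)
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)
# F1 is the harmonic mean of precision and recall.
manual_f1 = (2 * manual_precision * manual_recall
             / (manual_precision + manual_recall))

print(manual_accuracy, accuracy_score(y_true, y_pred))
print(manual_precision, precision_score(y_true, y_pred))
print(manual_recall, recall_score(y_true, y_pred))
print(manual_f1, f1_score(y_true, y_pred))
```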
from sklearn.metrics import (accuracy_score,
                             precision_score,
                             recall_score,
                             f1_score)
# Store the results under new names so the imported functions are not overwritten.
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
print(' Accuracy score: {:2.1f}%'.format(accuracy*100.))
print('Precision score: {:2.1f}%'.format(precision*100.))
print(' Recall score: {:2.1f}%'.format(recall*100.))
print(' F1 score: {:2.1f}%'.format(f1*100.))
Accuracy score: 98.9%
Precision score: 97.2%
Recall score: 94.1%
F1 score: 95.6%
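To put the accuracy figure in context, it helps to compare it with a trivial baseline. The sketch below uses invented labels with roughly the same 86.6% ham share as this data set and scores a "classifier" that labels every message as ham. The label list is a stand-in for illustration, not the actual test set.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Invented labels with roughly the same mix as the SMS data: 0 = ham, 1 = spam.
y_true = [0] * 866 + [1] * 134
# A baseline that never predicts spam.
always_ham = [0] * len(y_true)

baseline_accuracy = accuracy_score(y_true, always_ham)
baseline_recall = recall_score(y_true, always_ham)
# No message is predicted as spam, so precision is undefined;
# zero_division=0 reports it as 0 instead of emitting a warning.
baseline_precision = precision_score(y_true, always_ham, zero_division=0)

print(' Accuracy score: {:2.1f}%'.format(baseline_accuracy * 100.))
print('Precision score: {:2.1f}%'.format(baseline_precision * 100.))
print(' Recall score: {:2.1f}%'.format(baseline_recall * 100.))
```

The baseline reaches high accuracy while catching no spam at all, which is why precision and recall are the more informative numbers for a skewed data set like this one.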
Advantages of Naive Bayes:
- Handles an extremely large number of features (7,456 in this example, one for every distinct word in the training vocabulary),
- Largely unaffected by irrelevant features,
- Relatively simple and usually works without parameter tuning,
- Very rarely overfits the data, and
- Training and prediction times are very fast for the amount of data it handles.