Basic Steps to k-NN Classification
- Import libraries
- Load the dataset(s)
- Split into data ($X$) and labels ($y$)
- Split into training ($X_{train}$, $y_{train}$) and test ($X_{test}$, $y_{test}$) data
- Fit the classifier
- Calculate the accuracy of the classifier using $y_{test}$
- Use the fitted classifier for prediction
Optionally, visualize the accuracy of the classifier on the various types of data:
- Positive training
- Negative training
- Positive testing
- Negative testing
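For orientation, the steps above condense into a short script like the following. This is a minimal sketch, assuming only that scikit-learn is installed; the detailed walkthrough follows.
# Minimal end-to-end sketch of the steps listed above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)          # data and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))  # fraction correct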
1. Import libraries
import numpy as np
import pandas as pd
import matplotlib as mpl
import sklearn
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
print(' Numpy version ' + str(np.__version__))
print(' Pandas version ' + str(pd.__version__))
print('Matplotlib version ' + str(mpl.__version__))
print(' Sklearn version ' + str(sklearn.__version__))
%matplotlib notebook
Numpy version 1.17.4
Pandas version 0.25.3
Matplotlib version 3.1.1
Sklearn version 0.22.1
2. Load the Dataset
First, load the dataset from scikit-learn. It is returned as a scikit-learn “Bunch” object, which behaves like a dictionary.
More information about the dataset is included at the bottom of this notebook.
cancer = load_breast_cancer()
print(type(cancer))
print(cancer.keys())
<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
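Because the Bunch behaves like a dictionary, its entries can be read with either key or attribute access. For example (a quick sanity check, not part of the main flow):
print(cancer['target_names'])  # key access: ['malignant' 'benign']
print(cancer.target_names)     # attribute access, same array
print(cancer.data.shape)       # (569, 30)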
Convert the Bunch object to a Pandas DataFrame.
d = pd.DataFrame(cancer['data'],
                 columns=cancer['feature_names'])
d['target'] = cancer['target']

target_mapping = {
    0: 'malignant',
    1: 'benign'
}

(d['target']
 .replace(target_mapping)
 .value_counts())
benign 357
malignant 212
Name: target, dtype: int64
3. Split into the data, $X$, and labels, $y$
X = d[d.columns[:-1]].copy()
y = d['target'].copy()
print(X.shape, y.shape)
(569, 30) (569,)
4. Split into Training and Testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(426, 30) (143, 30) (426,) (143,)
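By default, train_test_split holds out 25% of the rows for testing, which is where the 426/143 split above comes from. The split can also be controlled explicitly; for instance, this variant (not used in the rest of the notebook) fixes the test size and stratifies so both splits keep the original benign/malignant proportions:
X_tr, X_te, y_tr, y_te = train_test_split(X, y,
                                          test_size=0.25,
                                          stratify=y,
                                          random_state=0)
print(y_tr.value_counts(normalize=True))  # class proportions preserved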
5. Fit a K-Nearest Neighbors (KNN) Classifier
knn = KNeighborsClassifier(n_neighbors = 1)
knn.fit(X_train, y_train);
print(knn)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
6. Calculate the accuracy of the classifier using $y_{test}$
The fitted knn can predict across the entire $X_{test}$ set at once; $y_{pred}$ contains the classifier's predictions for the test set. accuracy_score compares the predictions against the true labels and returns the fraction that are correct, which is formatted as a percentage below.
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("{:2.1f}%".format(accuracy * 100.))
91.6%
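Note that knn.score(X_test, y_test) computes the same accuracy directly. Also, n_neighbors=1 is the most flexible (and most overfitting-prone) setting; a quick sketch for comparing a few values of k, where the exact numbers will depend on the split:
# Refit with several neighbor counts and compare test accuracy.
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print('k={}: test accuracy {:.3f}'.format(k, model.score(X_test, y_test)))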
7. Use the fitted classifier for prediction
As a test, predict the class label using the mean value of each feature in the dataset.
- The values attribute of a Pandas Series contains its raw values as a 1D numpy array.
- The numpy reshape method converts the 1D array into a 2D array containing that single row (an array of 1D arrays). The -1 parameter means infer the length from the length of the array.
- The classifier returns a one-element array, accessible with the usual Python list indexing.
First, predict using the mean values across all fields in the dataset.
means = d.mean()[:-1].values.reshape(1, -1)
print(means)
pred = knn.predict(means)
target_mapping[pred[0]]
[[1.41272917e+01 1.92896485e+01 9.19690334e+01 6.54889104e+02
9.63602812e-02 1.04340984e-01 8.87993158e-02 4.89191459e-02
1.81161863e-01 6.27976098e-02 4.05172056e-01 1.21685343e+00
2.86605923e+00 4.03370791e+01 7.04097891e-03 2.54781388e-02
3.18937163e-02 1.17961371e-02 2.05422988e-02 3.79490387e-03
1.62691898e+01 2.56772232e+01 1.07261213e+02 8.80583128e+02
1.32368594e-01 2.54265044e-01 2.72188483e-01 1.14606223e-01
2.90075571e-01 8.39458172e-02]]
'benign'
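The same pattern works for any single observation, since predict just needs a 2D input. As an illustrative check (not part of the original run), compare the prediction for the first test row with its true label:
sample = X_test.iloc[[0]]   # iloc[[0]] keeps a one-row DataFrame, so no reshape is needed
print('predicted:', target_mapping[knn.predict(sample)[0]])
print('   actual:', target_mapping[y_test.iloc[0]])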
Optionally, visualize the accuracy by target class using a plot like the one below.
def accuracy_plot():
    # Find the training and testing accuracies by target value (i.e., malignant, benign)
    mal_train_X = X_train[y_train == 0]
    mal_train_y = y_train[y_train == 0]
    ben_train_X = X_train[y_train == 1]
    ben_train_y = y_train[y_train == 1]
    mal_test_X = X_test[y_test == 0]
    mal_test_y = y_test[y_test == 0]
    ben_test_X = X_test[y_test == 1]
    ben_test_y = y_test[y_test == 1]

    scores = [knn.score(mal_train_X, mal_train_y),
              knn.score(ben_train_X, ben_train_y),
              knn.score(mal_test_X, mal_test_y),
              knn.score(ben_test_X, ben_test_y)]

    plt.figure()
    bars = plt.bar(np.arange(4), scores, color=['#4c72b0',
                                                '#4c72b0',
                                                '#55a868',
                                                '#55a868'])

    # Label each bar with its score
    for bar in bars:
        height = bar.get_height()
        plt.gca().text(bar.get_x() + bar.get_width() / 2,
                       height * .90,
                       '{0:.{1}f}'.format(height, 2),
                       ha='center',
                       color='w',
                       fontsize=11)

    # Remove the ticks and frame for a cleaner look
    plt.tick_params(top=False,
                    bottom=False,
                    left=False,
                    right=False,
                    labelleft=False,
                    labelbottom=True)
    for spine in plt.gca().spines.values():
        spine.set_visible(False)

    plt.xticks([0, 1, 2, 3],
               ['Malignant\nTraining',
                'Benign\nTraining',
                'Malignant\nTest',
                'Benign\nTest'],
               alpha=0.8)
    plt.title('Training and Test Accuracies for Malignant and Benign Cells',
              alpha=0.8)
accuracy_plot()
[Figure: bar chart of training and test accuracies for malignant and benign cells]
The following shows summary information about the raw data used in this notebook.
d.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
mean radius 569 non-null float64
mean texture 569 non-null float64
mean perimeter 569 non-null float64
mean area 569 non-null float64
mean smoothness 569 non-null float64
mean compactness 569 non-null float64
mean concavity 569 non-null float64
mean concave points 569 non-null float64
mean symmetry 569 non-null float64
mean fractal dimension 569 non-null float64
radius error 569 non-null float64
texture error 569 non-null float64
perimeter error 569 non-null float64
area error 569 non-null float64
smoothness error 569 non-null float64
compactness error 569 non-null float64
concavity error 569 non-null float64
concave points error 569 non-null float64
symmetry error 569 non-null float64
fractal dimension error 569 non-null float64
worst radius 569 non-null float64
worst texture 569 non-null float64
worst perimeter 569 non-null float64
worst area 569 non-null float64
worst smoothness 569 non-null float64
worst compactness 569 non-null float64
worst concavity 569 non-null float64
worst concave points 569 non-null float64
worst symmetry 569 non-null float64
worst fractal dimension 569 non-null float64
target 569 non-null int64
dtypes: float64(30), int64(1)
memory usage: 137.9 KB
The following text contains the description of the dataset used above. It is included with sklearn.
print(cancer.DESCR)
.. _breast_cancer_dataset:
Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------
**Data Set Characteristics:**
:Number of Instances: 569
:Number of Attributes: 30 numeric, predictive attributes and the class
:Attribute Information:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
The mean, standard error, and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
- class:
- WDBC-Malignant
- WDBC-Benign
:Summary Statistics:
===================================== ====== ======
Min Max
===================================== ====== ======
radius (mean): 6.981 28.11
texture (mean): 9.71 39.28
perimeter (mean): 43.79 188.5
area (mean): 143.5 2501.0
smoothness (mean): 0.053 0.163
compactness (mean): 0.019 0.345
concavity (mean): 0.0 0.427
concave points (mean): 0.0 0.201
symmetry (mean): 0.106 0.304
fractal dimension (mean): 0.05 0.097
radius (standard error): 0.112 2.873
texture (standard error): 0.36 4.885
perimeter (standard error): 0.757 21.98
area (standard error): 6.802 542.2
smoothness (standard error): 0.002 0.031
compactness (standard error): 0.002 0.135
concavity (standard error): 0.0 0.396
concave points (standard error): 0.0 0.053
symmetry (standard error): 0.008 0.079
fractal dimension (standard error): 0.001 0.03
radius (worst): 7.93 36.04
texture (worst): 12.02 49.54
perimeter (worst): 50.41 251.2
area (worst): 185.2 4254.0
smoothness (worst): 0.071 0.223
compactness (worst): 0.027 1.058
concavity (worst): 0.0 1.252
concave points (worst): 0.0 0.291
symmetry (worst): 0.156 0.664
fractal dimension (worst): 0.055 0.208
===================================== ====== ======
:Missing Attribute Values: None
:Class Distribution: 212 - Malignant, 357 - Benign
:Creator: Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian
:Donor: Nick Street
:Date: November, 1995
This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2
Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image.
Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree. Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.
The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].
This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/
.. topic:: References
- W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction
for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
San Jose, CA, 1993.
- O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
prognosis via linear programming. Operations Research, 43(4), pages 570-577,
July-August 1995.
- W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994)
163-171.