Basic Steps to k-NN Classification

  1. Import libraries
  2. Load the dataset(s)
  3. Split into data ($X$) and labels ($y$)
  4. Split into training ($X_{train}$, $y_{train}$) and test ($X_{test}$, $y_{test}$) data
  5. Fit the classifier
  6. Calculate the accuracy of the classifier using $y_{test}$
  7. Use the fitted classifier for prediction

Optionally, visualize the accuracy of the classifier on the various types of data:

  • Positive training
  • Negative training
  • Positive testing
  • Negative testing

1. Import libraries

import numpy as np
import pandas as pd
import matplotlib as mpl
import sklearn

import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

print('     Numpy version ' + str(np.__version__))
print('    Pandas version ' + str(pd.__version__))
print('Matplotlib version ' + str(mpl.__version__))
print('   Sklearn version ' + str(sklearn.__version__))

%matplotlib notebook
     Numpy version 1.17.4
    Pandas version 0.25.3
Matplotlib version 3.1.1
   Sklearn version 0.22.1

2. Load the Dataset

First, load the dataset from scikit-learn. It is a scikit-learn “Bunch” object, which is similar to a dictionary.

Information about the dataset is included at the bottom of this notebook.

cancer = load_breast_cancer()
print(type(cancer))
print(cancer.keys())
<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
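
As an aside (an addition, not from the original notebook), a Bunch also supports attribute-style access, so cancer['data'] and cancer.data refer to the same array:

# Sketch: dictionary-style and attribute-style access are equivalent on a Bunch.
print((cancer['data'] == cancer.data).all())   # True
print(cancer.target_names)                     # ['malignant' 'benign']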

Convert the Bunch object to a Pandas DataFrame.

d = pd.DataFrame(cancer['data'],
                 columns = cancer['feature_names'])
d['target'] = cancer['target']
target_mapping = {
    0 : 'malignant',
    1 : 'benign'
}
(d['target']
 .replace(target_mapping)
 .value_counts())
benign       357
malignant    212
Name: target, dtype: int64

3. Split into the data, $X$, and labels, $y$

X = d[d.columns[:-1]].copy()
y = d['target'].copy()
print(X.shape, y.shape)
(569, 30) (569,)

4. Split into Training and Testing datasets

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(426, 30) (143, 30) (426,) (143,)
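
By default, train_test_split holds out 25% of the samples, and random_state=0 makes the shuffle reproducible. As a sketch (an assumption, not part of the original notebook), passing stratify=y would preserve the benign/malignant ratio in both subsets:

# Sketch: a stratified split keeps the class ratio equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(X, y,
                                          random_state=0,
                                          stratify=y)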

5. Fit a K-Nearest Neighbors (KNN) Classifier

knn = KNeighborsClassifier(n_neighbors = 1)
knn.fit(X_train, y_train);
print(knn)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
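
With n_neighbors=1, each test sample receives the label of its single closest training sample. As a sketch (not in the original notebook), one might compare a few values of k before settling on one:

# Sketch: test accuracy for several choices of n_neighbors.
for k in [1, 3, 5, 7]:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, clf.score(X_test, y_test))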

6. Calculate the accuracy of the classifier using $y_{test}$

The fitted knn classifier can predict across the entire $X_{test}$ set at once. $y_{pred}$ will contain the predicted label for each sample in the test set. accuracy_score compares the predictions against the correct classifications and returns the fraction of correct predictions, a value between 0 and 1 (multiplied by 100 below to display a percentage).

y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("{:2.1f}%"
      .format(accuracy*100.)
     )
91.6%
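
As a sanity check (an addition, not from the original notebook), the same number falls out of a direct elementwise comparison:

# Sketch: accuracy is simply the fraction of predictions that match y_test.
manual_accuracy = (y_pred == y_test).mean()
print(manual_accuracy)   # equals accuracy_score(y_test, y_pred)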

7. Use the fitted classifier for prediction

As a test, predict the class label using the mean value of each feature in the dataset.

  • The values attribute of a Pandas Series contains its raw values as a 1D numpy array.
  • The numpy reshape method converts the 1D array into a 2D array with a single row, which is the samples-by-features layout the classifier expects; see the sketch after this list. The -1 parameter means infer that dimension from the length of the array.
  • The classifier returns a one-element array, accessible using the usual Python index operator.
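
A minimal sketch of the reshape step, using a hypothetical three-element array:

# Sketch: reshape(1, -1) turns a 1D array of n values into a 2D
# array of shape (1, n), i.e. a single sample row.
s = np.array([1.0, 2.0, 3.0])   # shape (3,)
row = s.reshape(1, -1)          # shape (1, 3)
print(s.shape, row.shape)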

First, predict using the mean values across all fields in the dataset.

means = d.mean()[:-1].values.reshape(1, -1)
print(means)
pred = knn.predict(means)
target_mapping[pred[0]]
[[1.41272917e+01 1.92896485e+01 9.19690334e+01 6.54889104e+02
  9.63602812e-02 1.04340984e-01 8.87993158e-02 4.89191459e-02
  1.81161863e-01 6.27976098e-02 4.05172056e-01 1.21685343e+00
  2.86605923e+00 4.03370791e+01 7.04097891e-03 2.54781388e-02
  3.18937163e-02 1.17961371e-02 2.05422988e-02 3.79490387e-03
  1.62691898e+01 2.56772232e+01 1.07261213e+02 8.80583128e+02
  1.32368594e-01 2.54265044e-01 2.72188483e-01 1.14606223e-01
  2.90075571e-01 8.39458172e-02]]

'benign'
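
The same pattern works for any single sample. As a sketch (an assumption, not from the original notebook), predict the first held-out test sample and inspect its nearest training neighbor:

# Sketch: predict one test sample; double brackets keep the selection 2D.
sample = X_test.iloc[[0]]
print(target_mapping[knn.predict(sample)[0]])

# kneighbors returns the distance to, and index of, the closest
# training sample(s) for each query row.
dist, idx = knn.kneighbors(sample)
print(dist, idx)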

Optionally, visualize the accuracy on each subset using the plot below, or something similar.

def accuracy_plot():
    # Find the training and testing accuracies by target value (i.e. malignant, benign)
    mal_train_X = X_train[y_train==0]
    mal_train_y = y_train[y_train==0]
    ben_train_X = X_train[y_train==1]
    ben_train_y = y_train[y_train==1]

    mal_test_X = X_test[y_test==0]
    mal_test_y = y_test[y_test==0]
    ben_test_X = X_test[y_test==1]
    ben_test_y = y_test[y_test==1]

    scores = [knn.score(mal_train_X, mal_train_y), 
              knn.score(ben_train_X, ben_train_y), 
              knn.score(mal_test_X, mal_test_y), 
              knn.score(ben_test_X, ben_test_y)]


    plt.figure()

    bars = plt.bar(np.arange(4), scores, color=['#4c72b0',
                                                '#4c72b0',
                                                '#55a868',
                                                '#55a868'])

    for bar in bars:
        height = bar.get_height()
        plt.gca().text(bar.get_x() + bar.get_width()/2, 
                       height*.90, 
                       '{0:.{1}f}'.format(height, 2), 
                       ha='center', 
                       color='w', 
                       fontsize=11)

    plt.tick_params(top=False, 
                    bottom=False, 
                    left=False, 
                    right=False, 
                    labelleft=False, 
                    labelbottom=True)

    for spine in plt.gca().spines.values():
        spine.set_visible(False)

    plt.xticks([0,1,2,3], 
               ['Malignant\nTraining', 
                'Benign\nTraining', 
                'Malignant\nTest', 
                'Benign\nTest'], 
               alpha=0.8);
    plt.title('Training and Test Accuracies for Malignant and Benign Cells', 
              alpha=0.8)

accuracy_plot()
[Figure: bar chart of the training and test accuracies for malignant and benign cells]

The following summarizes the raw data used in this note.

d.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
mean radius                569 non-null float64
mean texture               569 non-null float64
mean perimeter             569 non-null float64
mean area                  569 non-null float64
mean smoothness            569 non-null float64
mean compactness           569 non-null float64
mean concavity             569 non-null float64
mean concave points        569 non-null float64
mean symmetry              569 non-null float64
mean fractal dimension     569 non-null float64
radius error               569 non-null float64
texture error              569 non-null float64
perimeter error            569 non-null float64
area error                 569 non-null float64
smoothness error           569 non-null float64
compactness error          569 non-null float64
concavity error            569 non-null float64
concave points error       569 non-null float64
symmetry error             569 non-null float64
fractal dimension error    569 non-null float64
worst radius               569 non-null float64
worst texture              569 non-null float64
worst perimeter            569 non-null float64
worst area                 569 non-null float64
worst smoothness           569 non-null float64
worst compactness          569 non-null float64
worst concavity            569 non-null float64
worst concave points       569 non-null float64
worst symmetry             569 non-null float64
worst fractal dimension    569 non-null float64
target                     569 non-null int64
dtypes: float64(30), int64(1)
memory usage: 137.9 KB

The following text contains the description of the dataset used above. It is included with sklearn.

print(cancer.DESCR)
.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign

    :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

    :Donor: Nick Street

    :Date: November, 1995

This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass.  They describe
characteristics of the cell nuclei present in the image.

Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree.  Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

.. topic:: References

   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction 
     for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on 
     Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
     San Jose, CA, 1993.
   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and 
     prognosis via linear programming. Operations Research, 43(4), pages 570-577, 
     July-August 1995.
   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
     to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 
     163-171.