kNearest Neighbors
The advantages of the kNN approach are that:
 It is simple and easy to understand why a particular prediction is made.
 A kNN approach can be a reasonable baseline against which you can compare more sophisticated methods.
The disadvantages of the kNN model are that:
 When the training dat has many instances, or each instance has lots of features, the kNN model may slow down significantly. If your dataset has hundreds or thousands of features, especially if they are sparse, consider alternatives.
The key parameters in both classification and regression models are:
 n_neighbors: the number of nearest neighbors (k) to consider, which directly controls model complexity.
 Default = 5
 metric: the function that computes the distance between data points, which controls which neighbors are calculated as nearest.
 Default = Minkowski distance with power parameter 2 (Euclidean), which works well for most datasets.
The kNearest Neighbor (kNN) Classifier Algorithm
Given a training set $X_{train}$ with labels $y_{train}$, and given a new instance $X_{test}$ to be classified:

Find the most similar instances (called $X_{NN}$) to $X_{test}$ that are in $X_{train}$.

Get the labels $y_{NN}$ for the instances in $X_{NN}$.

Predict the label for $X_{test}$ by combining the labels $y_{NN}$ by, for example, simple majority vote.
Low k leads to Overfitting
As shown in the image below, a toolow value for k
will result in jagged, high variance decision boundaries. This is because for k = 1
, the regions that are nearest to each training point will have the same class as that training point.
This is an example of a classification model that has high model complexity.
plot_two_class_knn(X_train, y_train, 1, 'uniform', X_test, y_test)
<IPython.core.display.Javascript object>
Contrast this with k = 3
, and k = 11
, which have much smoother boundaries. Because the value of k
is higher, the estimator is required to take into account many more points. The net result is that a single training data point has much less influence on the prediction, overall.
The result is a smoother decision boundary that better captures the overall global trends. The model’s generalization ability has improved.
These overall trends, from overfit to better fit to excellent fit, are born out in some of the accuracy measures in the titles of these plots. These are tabulated below. Note that the testing accuracy increases with increasing k
, ultimately exceeding the training accuracy. This indicates successful generalization to the broader dataset.
k  Training Accuracy  Testing Accuracy 

1  100%  64% 
3  80%  72% 
11  73%  80% 
plot_two_class_knn(X_train, y_train, 3, 'uniform', X_test, y_test)
plot_two_class_knn(X_train, y_train, 11, 'uniform', X_test, y_test)
<IPython.core.display.Javascript object>
<IPython.core.display.Javascript object>
kNN for Regression

The $R^2$ (“rsquared”) Regression Score measures how well a prediction model for regression fits the given data. This is also known as the “coefficient of determination.”

The score is between 0 and 1:
 A value of 0 corresponds to a constant model that predicts the mean value of all training target values.
 A value of 1 corresponds to perfect prediction.
print('Rsquared test score: {:.3f}'
.format(knnreg.score(X_test, y_test)))
Rsquared test score: 0.425
plot_regression_curves_given_k()
<IPython.core.display.Javascript object>
Again, note that for increasing values of k
, the value of testing $R^2$ increases, reaching a maximum with k = 15
.
k  Training $R^2$  Testing $R^2$ 

1  1.000  0.155 
3  0.797  0.323 
7  0.720  0.471 
15  0.647  0.485 
55  0.357  0.371 
For the extreme value of k = 55
, performance decreases, indicating an underfit.
plot_regression_model_complexity()
<IPython.core.display.Javascript object>
Utility Functions for Plots Above
Library Imports
%matplotlib notebook
import numpy as np
import matplotlib
import pandas as pd
import sklearn
print(' Numpy version ' + str(np.__version__))
print(' Matplotlib version ' + str(matplotlib.__version__))
print(' Pandas version ' + str(pd.__version__))
print(' Sklearn version ' + str(sklearn.__version__))
Numpy version 1.18.1
Matplotlib version 3.1.1
Pandas version 0.25.3
Sklearn version 0.22.1
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import matplotlib.patches as mpatches
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
Low k leads to Overfitting
X_C2, y_C2 = make_classification(n_samples = 100,
n_features=2,
n_redundant=0,
n_informative=2,
n_clusters_per_class=1,
flip_y = 0.20,
class_sep = 0.5,
random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_C2, y_C2, random_state=0)
def plot_two_class_knn(X, y, n_neighbors, weights, X_test, y_test):
X_mat = X
y_mat = y
# Create color maps
cmap_light = ListedColormap(['#FFFFAA', '#AAFFAA', '#AAAAFF','#EFEFEF'])
cmap_bold = ListedColormap(['#FFFF00', '#00FF00', '#0000FF','#000000'])
clf = KNeighborsClassifier(n_neighbors,
weights=weights)
clf.fit(X_mat, y_mat)
# Plot the decision boundary by assigning a color in the color map
# to each mesh point.
mesh_step_size = .01 # step size in the mesh
plot_symbol_size = 50
x_min, x_max = X_mat[:, 0].min()  1, X_mat[:, 0].max() + 1
y_min, y_max = X_mat[:, 1].min()  1, X_mat[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, mesh_step_size),
np.arange(y_min, y_max, mesh_step_size))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
# Plot training points
plt.scatter(X_mat[:, 0],
X_mat[:, 1],
s=plot_symbol_size,
c=y,
cmap=cmap_bold,
edgecolor = 'black')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
title = "Neighbors = {}".format(n_neighbors)
if (X_test is not None):
train_score = clf.score(X_mat, y_mat)
test_score = clf.score(X_test, y_test)
title = title + "\nTrain score = {:.2f}, Test score = {:.2f}".format(train_score, test_score)
patch0 = mpatches.Patch(color='#FFFF00', label='class 0')
patch1 = mpatches.Patch(color='#000000', label='class 1')
plt.legend(handles=[patch0, patch1])
plt.xlabel('Feature 0')
plt.ylabel('Feature 1')
plt.title(title)
plt.show()
kNN for Regression
X_R1, y_R1 = make_regression(n_samples = 100, n_features=1,
n_informative=1, bias = 150.0,
noise = 30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_R1, y_R1, random_state = 0)
knnreg = KNeighborsRegressor(n_neighbors = 5).fit(X_train, y_train)
def plot_regression_curves_given_k():
fig, subaxes = plt.subplots(1, 2, figsize=(8,4))
X_predict_input = np.linspace(3, 3, 50).reshape(1,1)
X_train, X_test, y_train, y_test = train_test_split(X_R1[0::5], y_R1[0::5], random_state = 0)
for thisaxis, K in zip(subaxes, [1, 3]):
knnreg = KNeighborsRegressor(n_neighbors = K).fit(X_train, y_train)
y_predict_output = knnreg.predict(X_predict_input)
thisaxis.set_xlim([2.5, 0.75])
thisaxis.plot(X_predict_input, y_predict_output, '^', markersize = 10,
label='Predicted', alpha=0.8)
thisaxis.plot(X_train, y_train, 'o', label='True Value', alpha=0.8)
thisaxis.set_xlabel('Input feature')
thisaxis.set_ylabel('Target value')
thisaxis.set_title('KNN regression (K={})'.format(K))
thisaxis.legend()
plt.tight_layout()
def plot_regression_model_complexity():
fig, subaxes = plt.subplots(5, 1, figsize=(5,20))
X_predict_input = np.linspace(3, 3, 500).reshape(1,1)
X_train, X_test, y_train, y_test = train_test_split(X_R1, y_R1,
random_state = 0)
for thisaxis, K in zip(subaxes, [1, 3, 7, 15, 55]):
knnreg = KNeighborsRegressor(n_neighbors = K).fit(X_train, y_train)
y_predict_output = knnreg.predict(X_predict_input)
train_score = knnreg.score(X_train, y_train)
test_score = knnreg.score(X_test, y_test)
thisaxis.plot(X_predict_input, y_predict_output)
thisaxis.plot(X_train, y_train, 'o', alpha=0.9, label='Train')
thisaxis.plot(X_test, y_test, '^', alpha=0.9, label='Test')
thisaxis.set_xlabel('Input feature')
thisaxis.set_ylabel('Target value')
thisaxis.set_title('KNN Regression (K={})\n\
Train $R^2 = {:.3f}$, Test $R^2 = {:.3f}$'
.format(K, train_score, test_score))
thisaxis.legend()
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)