Clustering Examples

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

%matplotlib inline

# Make the images larger
plt.rcParams['figure.figsize'] = (8, 6)

Visuals of Datasets

X, y = make_blobs(n_samples=500, n_features=3, centers=4, random_state=5)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2]);

[Figure: 3D scatter of the four-center dataset X]

Generate Example 2 Data

There appear to be 2 clusters in the image below.

Z, y = make_blobs(n_samples=500, n_features=5, centers=2, random_state=42)

fig = plt.figure()
plt.scatter(Z[:, 0], Z[:, 1]);

[Figure: 2D scatter of the first two features of Z, showing two clusters]

Generate Example 3 Data

There appear to be 6 clusters in the image below, even though the data was generated with 8 centers.

T, y = make_blobs(n_samples=500, n_features=5, centers=8, random_state=5)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(T[:, 1], T[:, 3], T[:, 4]);

[Figure: 3D scatter of features 1, 3, and 4 of T]

Plot Data for Example 4

This is the same data as in Example 3, but plotted along a different combination of features. Viewed this way, there are actually at least 7 groups in the data.

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(T[:, 1], T[:, 2], T[:, 3]);

[Figure: 3D scatter of features 1, 2, and 3 of T, revealing additional clusters]
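
Any single 3D view can still hide structure. One quick check (a sketch, not part of the original notebook) is to scan every pairwise 2D projection of T's five features in a grid:

import itertools

# All 10 pairs of T's 5 features, one 2D scatter per pair
pairs = list(itertools.combinations(range(T.shape[1]), 2))
fig, axes = plt.subplots(2, 5, figsize=(16, 6))
for ax, (i, j) in zip(axes.ravel(), pairs):
    ax.scatter(T[:, i], T[:, j], s=5)
    ax.set_title(f'features {i} vs {j}')
plt.tight_layout();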

Examples with More Labels

Create a dataset with 200 data points (rows), 5 features (columns), and 4 centers.

data, y = make_blobs(n_samples=200, 
                     n_features=5, 
                     centers=4, 
                     random_state=8)

Three steps to use KMeans:

  1. Instantiate the model.
  2. Fit the model to the data.
  3. Predict labels for the data.

kmeans_4 = KMeans(n_clusters=4, 
                  random_state=8)  # instantiate your model
model_4 = kmeans_4.fit(data)       # fit the model to your data using kmeans_4
labels_4 = model_4.predict(data)   # predict labels using model_4 on your dataset
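
Once fitted, the model exposes the learned centroids and the inertia (the sum of squared distances from each point to its nearest centroid). Printing them is a quick sanity check:

print(model_4.cluster_centers_.shape)  # (4, 5): one 5-dimensional centroid per cluster
print(model_4.inertia_)                # sum of squared distances to nearest centroid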

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(data[:, 0], 
           data[:, 1], 
           data[:, 2], 
           c=labels_4, 
           cmap='tab10');

[Figure: 3D scatter of data colored by the 4 predicted cluster labels]

kmeans_2 = KMeans(n_clusters=2, 
                  random_state=8)  # instantiate your model
model_2 = kmeans_2.fit(data)       # fit the model to your data using kmeans_2
labels_2 = model_2.predict(data)   # predict labels using model_2 on your dataset

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(data[:, 0], 
           data[:, 1], 
           data[:, 2], 
           c=labels_2, 
           cmap='tab10');

[Figure: 3D scatter of data colored by the 2 predicted cluster labels]

kmeans_7 = KMeans(n_clusters=7, 
                  random_state=8)  # instantiate your model
model_7 = kmeans_7.fit(data)       # fit the model to your data using kmeans_7
labels_7 = model_7.predict(data)   # predict labels using model_7 on your dataset

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(data[:, 0], 
           data[:, 1], 
           data[:, 2], 
           c=labels_7, 
           cmap='tab10');

[Figure: 3D scatter of data colored by the 7 predicted cluster labels]

Build a Scree Plot

The score method takes the data as a parameter and returns the negative of the model's inertia: the sum of squared distances from each point to its nearest centroid. Taking the absolute value of that score gives the SSE plotted below.
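
As a quick sanity check (a sketch using the model_4 fitted above), the score on the training data should equal the negative inertia:

# score(data) returns the negative inertia when evaluated on the training data
print(model_4.score(data))
print(-model_4.inertia_)  # should match the score above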

scores = []
centers = list(range(1, 11))
for center in centers:
    kmeans = KMeans(n_clusters=center, random_state=8)
    model = kmeans.fit(data)
    score = np.abs(model.score(data))  # |score| = SSE for this K
    scores.append(score)

plt.plot(centers, scores, linestyle='--', marker='o', color='b')
plt.xlabel('K')
plt.ylabel('SSE')
plt.title('SSE vs. K');

[Figure: Scree plot of SSE vs. K, with an elbow at K = 4]

As shown in the scree plot, the “elbow”, and therefore the ideal number of clusters, appears to be K = 4.
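
As an optional cross-check (not part of the original analysis), the silhouette score from sklearn.metrics measures how well-separated the clusters are and should peak near the best K:

from sklearn.metrics import silhouette_score

# Silhouette requires at least 2 clusters, so start at K = 2
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=8).fit_predict(data)
    print(f'K={k}: silhouette={silhouette_score(data, labels):.3f}')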