# Clustering Examples

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

%matplotlib inline

# Make the images larger
plt.rcParams['figure.figsize'] = (8, 6)

### Visuals of Datasets

X, y = make_blobs(n_samples=500, n_features=3, centers=4, random_state=5)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')   # Axes3D(fig) is deprecated in newer matplotlib
ax.scatter(X[:, 0], X[:, 1], X[:, 2]);

### Generate Example 2 Data

There appear to be 2 clusters in the image below.

Z, y = make_blobs(n_samples=500, n_features=5, centers=2, random_state=42)

fig = plt.figure()
plt.scatter(Z[:, 0], Z[:, 1]);

### Generate Example 3 Data

There appear to be 6 clusters in the image below.

T, y = make_blobs(n_samples=500, n_features=5, centers=8, random_state=5)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(T[:, 1], T[:, 3], T[:, 4]);

### Plot data for Example 4

But this is the same data as in Example 3. Plotted along a different set of features, we see there are actually at least 7 groups in this data.

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(T[:, 1], T[:, 2], T[:, 3]);

## Examples with More Labels

Create a dataset with 200 data points (rows), 5 features (columns), and 4 centers.

data, y = make_blobs(n_samples=200,
                     n_features=5,
                     centers=4,
                     random_state=8)

3 steps to use KMeans:

1. Instantiate model.
2. Fit model to data.
3. Predict labels for data.

kmeans_4 = KMeans(n_clusters=4)    # instantiate the model with 4 clusters
model_4 = kmeans_4.fit(data)       # fit the model to your data using kmeans_4
labels_4 = model_4.predict(data)   # predict labels using model_4 on your dataset

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(data[:, 0],
           data[:, 1],
           data[:, 2],
           c=labels_4,
           cmap='tab10');
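Since `make_blobs` also returns the true cluster assignments (`y`), we can sanity-check the K=4 clustering against them. Cluster labels are arbitrary permutations, so raw accuracy would not work; `adjusted_rand_score` is permutation-invariant. This is an illustrative sketch (the `n_init` and `random_state` values here are my choices, not from the original notebook):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Recreate the dataset and fit K-Means with the true number of centers
data, y = make_blobs(n_samples=200, n_features=5, centers=4, random_state=8)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(data)

# ARI is 1.0 for a perfect label-permutation-invariant match, near 0 for random labels
ari = adjusted_rand_score(y, labels)
print(ari)
```

With well-separated blobs like these, the ARI should be close to 1.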

kmeans_2 = KMeans(n_clusters=2)    # instantiate the model with 2 clusters
model_2 = kmeans_2.fit(data)       # fit the model to your data using kmeans_2
labels_2 = model_2.predict(data)   # predict labels using model_2 on your dataset

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(data[:, 0],
           data[:, 1],
           data[:, 2],
           c=labels_2,
           cmap='tab10');

kmeans_7 = KMeans(n_clusters=7)    # instantiate the model with 7 clusters
model_7 = kmeans_7.fit(data)       # fit the model to your data using kmeans_7
labels_7 = model_7.predict(data)   # predict labels using model_7 on your dataset

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(data[:, 0],
           data[:, 1],
           data[:, 2],
           c=labels_7,
           cmap='tab10');

### Build a Scree Plot

The `score` method takes the data as a parameter and returns a value indicating how far the points are from the centroids: the negative of the sum of squared distances from each point to its nearest centroid (the model's `inertia_`). Because it is negative, the loop below takes the absolute value before plotting.
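A quick sketch of that relationship, checking `score` against `inertia_` on a fitted model (the `n_init` and `random_state` values are illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=200, n_features=5, centers=4, random_state=8)
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(data)

# score(X) is the negative inertia: -sum of squared distances to nearest centroid
print(model.score(data))   # a negative number
print(-model.inertia_)     # the same value (up to floating-point error)
```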

scores = []
centers = list(range(1, 11))
for center in centers:
    kmeans = KMeans(n_clusters=center)
    model = kmeans.fit(data)
    score = np.abs(model.score(data))
    scores.append(score)

plt.plot(centers, scores, linestyle='--', marker='o', color='b');
plt.xlabel('K');
plt.ylabel('SSE');
plt.title('SSE vs. K');

As shown in the scree plot, the “elbow,” and therefore the ideal number of clusters, appears to be 4.
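As a cross-check on the elbow reading, the average silhouette score (also in sklearn) tends to peak near a good K. This is a supplementary sketch, not part of the original analysis; the `n_init` and `random_state` values are illustrative choices:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

data, _ = make_blobs(n_samples=200, n_features=5, centers=4, random_state=8)

# Silhouette score ranges from -1 to 1; higher means tighter, better-separated clusters
sil = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    sil[k] = silhouette_score(data, labels)
    print(k, round(sil[k], 3))
```

On well-separated blobs like these, the score should be high at K=4, agreeing with the elbow.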