Feature Scaling with KMeans

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn import preprocessing as p
from sklearn.datasets import make_blobs

%matplotlib inline

plt.rcParams['figure.figsize'] = (4, 3)

data, y = make_blobs(n_samples=200, 
                     n_features=2, 
                     centers=4, 
                     random_state=42)

df = pd.DataFrame(data)
df.columns = ['height', 'weight']
df['height'] = np.abs(df['height']*100)
df['weight'] = df['weight'] + np.random.normal(50, 10, 200)

Next, take a look at the data to get familiar with it. The dataset has two columns, height and weight, and is stored in the df variable. Summary statistics show the spread of each feature, and a scatter plot gives a visual of the points.

df.describe()

            height      weight
count   200.000000  200.000000
mean    569.726207   52.100192
std     246.966215   11.937835
min      92.998481   20.656424
25%     357.542793   42.704606
50%     545.766752   52.038721
75%     773.310607   61.404287
max    1096.222348   78.362636

df.head()

        height     weight
0   650.565335  56.061702
1   512.894273  67.478788
2   885.057453  58.352092
3  1028.641210  54.324747
4   746.899195  56.624122

plt.scatter(df['height'], df['weight']);

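There are two common types of feature scaling:

  1. StandardScaler: centers each feature and scales it so that it has mean 0 and variance 1.
  2. MinMaxScaler: rescales each feature to the range 0 to 1, so every value is expressed as a fraction of the distance between that feature's minimum and maximum. This is useful when it makes sense to think of the data relative to its observed range.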
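
As a rough sketch of what the two scalers compute, the snippet below applies both formulas by hand with NumPy and checks them against scikit-learn. The array X is a hypothetical toy dataset, not the data used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: two features on very different scales.
X = np.array([[100.0, 20.0],
              [300.0, 50.0],
              [500.0, 80.0],
              [1100.0, 50.0]])

# Standardization: subtract the column mean, then divide by the column
# standard deviation (population std, ddof=0, as StandardScaler uses).
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: map each column's minimum to 0 and its maximum to 1.
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The manual versions should agree with scikit-learn's transformers.
assert np.allclose(X_standard, StandardScaler().fit_transform(X))
assert np.allclose(X_minmax, MinMaxScaler().fit_transform(X))
```

Both transformations are applied column by column, so each feature is rescaled independently of the others.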

StandardScaler

df_ss = p.StandardScaler().fit_transform(df)
df_ss = pd.DataFrame(df_ss)
df_ss.columns = ['height', 'weight']

plt.scatter(df_ss['height'], df_ss['weight']);

MinMaxScaler

df_mm = p.MinMaxScaler().fit_transform(df)
df_mm = pd.DataFrame(df_mm)
df_mm.columns = ['height', 'weight']

plt.scatter(df_mm['height'], df_mm['weight']);

Fit KMeans

def fit_kmeans(data, centers):
    """Fit KMeans with `centers` clusters and return each point's cluster label."""
    kmeans = KMeans(n_clusters=centers)
    labels = kmeans.fit_predict(data)
    return labels

Raw Data

labels = fit_kmeans(df, 10)
plt.scatter(df['height'], df['weight'], c=labels, cmap='Set1');

StandardScaler

labels = fit_kmeans(df_ss, 10)
plt.scatter(df_ss['height'], df_ss['weight'], c=labels, cmap='Set1');

Min-Max

labels = fit_kmeans(df_mm, 10)
plt.scatter(df_mm['height'], df_mm['weight'], c=labels, cmap='Set1');

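Scaling the data differently changes the groupings that KMeans generates. KMeans assigns each point to its nearest centroid by Euclidean distance, so a feature with a much larger range dominates that distance. In the raw data, height (standard deviation about 247) varies far more than weight (about 12), so the clusters are driven almost entirely by height. After scaling, both features contribute on a comparable footing. Note that KMeans is refit on each version of the dataset, so cluster labels and colors are not comparable one-to-one across the plots.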
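
One way to see why the groupings change is to look at how much each feature contributes to the squared Euclidean distance between two points. The sketch below uses rows 0 and 1 from the df.head() output and the standard deviations from df.describe(). Dividing by those sample standard deviations only approximates StandardScaler (which uses the population standard deviation), and the helper squared_contributions is a hypothetical name introduced for illustration.

```python
import numpy as np

# Rows 0 and 1 from df.head() above, as (height, weight).
a = np.array([650.565335, 56.061702])
b = np.array([512.894273, 67.478788])

# Standard deviations reported by df.describe() (sample std, ddof=1),
# used here as an approximation of what StandardScaler divides by.
std = np.array([246.966215, 11.937835])

def squared_contributions(p, q):
    """Return each feature's share of the squared Euclidean distance."""
    sq = (p - q) ** 2
    return sq / sq.sum()

raw_share = squared_contributions(a, b)
scaled_share = squared_contributions(a / std, b / std)

print("raw    (height, weight):", raw_share.round(3))
print("scaled (height, weight):", scaled_share.round(3))
```

On the raw values, height accounts for roughly 99% of the squared distance between these two points. Once each feature is divided by its standard deviation, weight carries the larger share for this particular pair.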

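This example highlights the importance of scaling data appropriately BEFORE fitting KMeans, or any other model that is sensitive to feature scale. Remember that feature scaling matters for any distance-based machine learning model, because features measured on larger scales otherwise dominate the distance computation.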
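
One possible way to make sure scaling always happens before clustering is to chain the two steps in a scikit-learn pipeline. This is a minimal sketch rather than the approach used above: the data X, the choice of 4 clusters, and the n_init and random_state values are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with the first feature deliberately stretched.
X, _ = make_blobs(n_samples=200, n_features=2, centers=4, random_state=42)
X[:, 0] *= 100

# The scaler is fit first, so KMeans only ever sees the scaled features.
model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
labels = model.fit_predict(X)
```

Because the scaler lives inside the pipeline, calling model.predict on new points applies the same scaling that was learned during fitting before assigning clusters.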