Feature Scaling with KMeans

29 Jul 2020

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn import preprocessing as p
from sklearn.datasets import make_blobs

%matplotlib inline

plt.rcParams['figure.figsize'] = (4, 3)

data, y = make_blobs(n_samples=200, 
                     n_features=2, 
                     centers=4, 
                     random_state=42)

df = pd.DataFrame(data)
df.columns = ['height', 'weight']
df['height'] = np.abs(df['height']*100)
df['weight'] = df['weight'] + np.random.normal(50, 10, 200)

Next, take a look at the data to get familiar with it. The dataset has two columns, and it is stored in the df variable. It might be useful to get an idea of the spread in the current data, as well as a visual of the points.

df.describe()

	height	weight
count	200.000000	200.000000
mean	569.726207	52.100192
std	246.966215	11.937835
min	92.998481	20.656424
25%	357.542793	42.704606
50%	545.766752	52.038721
75%	773.310607	61.404287
max	1096.222348	78.362636

df.head()

	height	weight
0	650.565335	56.061702
1	512.894273	67.478788
2	885.057453	58.352092
3	1028.641210	54.324747
4	746.899195	56.624122

plt.scatter(df['height'], df['weight']);

png

There are two common types of feature scaling:

StandardScalar: scales the data so it has mean 0 and variance 1.
MinMaxScalar: useful in cases where it makes sense to think of the data in terms of the percent they are as compared to the maximum value.

StandardScalar

df_ss = p.StandardScaler().fit_transform(df)

df_ss = pd.DataFrame(df_ss)
df_ss.columns = ['height', 'weight']

plt.scatter(df_ss['height'], df_ss['weight']);

png

MinMaxScalar

df_mm = p.MinMaxScaler().fit_transform(df)

df_mm = pd.DataFrame(df_mm)
df_mm.columns = ['height', 'weight']

plt.scatter(df_mm['height'], df_mm['weight']);

png

Fit KMeans

def fit_kmeans(data, centers):
    kmeans = KMeans(centers)
    labels = kmeans.fit_predict(data)
    return labels

Raw Data

labels = fit_kmeans(df, 10)
plt.scatter(df['height'], df['weight'], c=labels, cmap='Set1');

png

Standard Scalar

labels = fit_kmeans(df_ss, 10)
plt.scatter(df_ss['height'], df_ss['weight'], c=labels, cmap='Set1');

png

Min-Max

labels = fit_kmeans(df_mm, 10)
plt.scatter(df_mm['height'], df_mm['weight'], c=labels, cmap='Set1');

png

Scaling the data differently changes the groupings that KMeans generates. Note that this is because the KMeans algorithm is refit on each of the different datasets.

This example highlights the importance of always scaling data correctly BEFORE fitting KMeans, or any other model. And, remember that scaling data is required for any distance-based machine learning model.

This content is taken from notes I took while pursuing the Intro to Machine Learning with Pytorch nanodegree certification.