Gaussian Mixture Model Examples

1. KMeans versus GMM on a Generated Dataset

Use sklearn’s make_blobs function to create a dataset of Gaussian blobs.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import cluster, datasets, mixture

%matplotlib inline

n_samples = 1000

varied = datasets.make_blobs(n_samples=n_samples,
                             cluster_std=[5, 1, 0.5],
                             random_state=3)
X, y = varied[0], varied[1]

plt.figure(figsize=(16,12))
plt.scatter(X[:,0], X[:,1], 
            c=y, edgecolor='black', lw=1.5, 
            s=100, cmap=plt.get_cmap('viridis'))
plt.title('Ground Truth')
plt.show()


KMeans

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
pred = kmeans.fit_predict(X)

plt.figure(figsize=(16,12))
plt.scatter(X[:,0], X[:,1], 
            c=pred, edgecolor='black', lw=1.5, 
            s=100, cmap=plt.get_cmap('viridis'))
plt.title('KMeans')
plt.show()


GaussianMixture

from sklearn.mixture import GaussianMixture

# Fit once; a second call to fit() would only repeat the same work
gmm = GaussianMixture(n_components=3).fit(X)
pred_gmm = gmm.predict(X)

plt.figure(figsize=(16,12))
plt.scatter(X[:,0], X[:,1],
            c=pred_gmm, edgecolor='black', lw=1.5,
            s=100, cmap=plt.get_cmap('viridis'))
plt.title('GaussianMixture')
plt.show()


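GaussianMixture clearly outperforms KMeans on this dataset. This is expected: make_blobs draws each cluster from a Gaussian, and the three clusters have very different spreads (cluster_std of 5, 1, and 0.5). KMeans assigns each point to the nearest centroid regardless of spread, while GaussianMixture fits a covariance for each component.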
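To see why cluster spread matters, here is a minimal one-dimensional sketch. The two components, their parameters, and the test point are hypothetical values chosen for illustration, and the components are assumed to have equal weights. It compares a nearest-centroid rule (the KMeans assignment step) with a highest-density rule (what a Gaussian mixture with equal weights would pick).

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

# Hypothetical components as (mean, std): one wide, one narrow
components = {"wide": (0.0, 5.0), "narrow": (10.0, 0.5)}
x = 7.0  # hypothetical point between the two means

# Nearest-centroid rule: only the distance to each mean matters
nearest = min(components, key=lambda k: abs(x - components[k][0]))

# Highest-density rule: the spread of each component matters too
likeliest = max(components, key=lambda k: gaussian_pdf(x, *components[k]))

print(nearest, likeliest)  # narrow wide
```

The point is closer to the narrow component's mean, but it lies six standard deviations away from it, so the wide component explains it far better.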

2. KMeans versus GMM on the Iris Dataset

For the second example, use the Iris dataset. It suits this comparison because the measurements within each species are plausibly close to normally distributed.

import seaborn as sns

iris = sns.load_dataset("iris")

iris.info()
iris.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

There are several ways to visualize a dataset with four dimensions. Here, use PairGrid since it does not distort the dataset: it simply plots every pair of features against each other in a subplot.

g = sns.PairGrid(iris, hue="species", 
                 palette=sns.color_palette("cubehelix", 3), 
                 vars=['sepal_length','sepal_width',
                       'petal_length','petal_width'])
g.map(plt.scatter)
plt.show()


kmeans_iris = KMeans(n_clusters=3)
pred_kmeans_iris = kmeans_iris.fit_predict(iris[['sepal_length',
                                                 'sepal_width',
                                                 'petal_length',
                                                 'petal_width']])

iris['kmeans_pred'] = pred_kmeans_iris

g = sns.PairGrid(iris, hue="kmeans_pred", 
                 palette=sns.color_palette("cubehelix", 3), 
                 vars=['sepal_length','sepal_width',
                       'petal_length','petal_width'])
g.map(plt.scatter)
plt.show()


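Visual inspection is much less reliable here, since the four dimensions are spread across many two-dimensional panels. Instead, use the adjusted Rand index (ARI), computed by sklearn's adjusted_rand_score. The score is 1 for an exact match (cluster numbering does not matter), close to 0 for random assignments, and negative for assignments that agree less than chance would.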

from sklearn.metrics import adjusted_rand_score

iris_kmeans_score = adjusted_rand_score(iris['species'], 
                                        iris['kmeans_pred'])
round(iris_kmeans_score,3)
0.73
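For intuition about what the score measures, the following is a minimal pure-Python sketch of the standard pair-counting formula for the adjusted Rand index. The function name adjusted_rand_index and the toy label lists are illustrative, and the code is not sklearn's implementation; in practice, prefer adjusted_rand_score.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from pair counts (illustrative sketch)."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Pairs of points that share a cell, a true class, or a predicted cluster
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)      # agreement of a perfect match
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (sum_cells - expected) / (max_index - expected)

truth = ["a", "a", "b", "b"]
print(round(adjusted_rand_index(truth, [1, 1, 0, 0]), 3))  # 1.0
print(round(adjusted_rand_index(truth, [0, 1, 0, 1]), 3))  # -0.5
```

The first prediction matches the true grouping even though the label names differ, so it scores 1. The second splits every true class in half and scores below zero.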

Now try the same with a Gaussian mixture model.

gmm_iris = GaussianMixture(n_components=3).fit(iris[['sepal_length',
                                                     'sepal_width',
                                                     'petal_length',
                                                     'petal_width']])
pred_gmm_iris = gmm_iris.predict(iris[['sepal_length',
                                       'sepal_width',
                                       'petal_length',
                                       'petal_width']])

iris['gmm_pred'] = pred_gmm_iris

iris_gmm_score = adjusted_rand_score(iris['species'], 
                                     iris['gmm_pred'])

round(iris_gmm_score,3)
0.904

The Gaussian mixture model outperforms KMeans on the Iris dataset according to the ARI scores (0.904 versus 0.73).
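Both estimators start from a random initialization, so a single run may not be representative. The sketch below repeats the comparison over a few seeds. It assumes sklearn's bundled load_iris holds the same measurements as the seaborn copy used above, and the number of seeds and the n_init value are arbitrary choices made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Assumption: sklearn's copy of Iris stands in for sns.load_dataset("iris")
X_iris, y_iris = load_iris(return_X_y=True)

kmeans_scores, gmm_scores = [], []
for seed in range(5):  # five seeds, an arbitrary choice
    km_labels = KMeans(n_clusters=3, n_init=10,
                       random_state=seed).fit_predict(X_iris)
    gm_labels = GaussianMixture(n_components=3,
                                random_state=seed).fit_predict(X_iris)
    kmeans_scores.append(adjusted_rand_score(y_iris, km_labels))
    gmm_scores.append(adjusted_rand_score(y_iris, gm_labels))

print("KMeans ARI per seed:", [round(s, 3) for s in kmeans_scores])
print("GMM ARI per seed:   ", [round(s, 3) for s in gmm_scores])
```

If the scores vary noticeably between seeds, comparing their means or ranges is a fairer summary than a single run.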