Hierarchical Clustering

In this notebook, I use sklearn to perform hierarchical clustering on the Iris dataset, which contains 150 samples with 4 dimensions/attributes each. Each sample is labeled as one of three types of Iris flower.

Load the Iris Dataset

from sklearn import datasets

iris = datasets.load_iris()


iris.data contains the features; iris.target contains the labels.

iris.data[45:55]

array([[4.8, 3. , 1.4, 0.3],
       [5.1, 3.8, 1.6, 0.2],
       [4.6, 3.2, 1.4, 0.2],
       [5.3, 3.7, 1.5, 0.2],
       [5. , 3.3, 1.4, 0.2],
       [7. , 3.2, 4.7, 1.4],
       [6.4, 3.2, 4.5, 1.5],
       [6.9, 3.1, 4.9, 1.5],
       [5.5, 2.3, 4. , 1.3],
       [6.5, 2.8, 4.6, 1.5]])

iris.target[45:55]

array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
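For reference, load_iris also exposes the species and feature names, so we can see what the integer labels and the four columns actually mean:

```python
from sklearn import datasets

iris = datasets.load_iris()

# The integer labels 0, 1, 2 map to the three species names
print(dict(enumerate(iris.target_names)))
# → {0: 'setosa', 1: 'versicolor', 2: 'virginica'}

# The four attributes, in column order
print(iris.feature_names)
```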

import pandas as pd

df = pd.DataFrame(iris.data)
df.describe()


0 1 2 3
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

Perform Clustering

Use sklearn’s AgglomerativeClustering to perform the hierarchical clustering.

Ward is the default linkage algorithm

from sklearn.cluster import AgglomerativeClustering

ward = AgglomerativeClustering(n_clusters=3)
ward_pred = ward.fit_predict(iris.data)


Also try the other linkage methods: complete, average, and single.

complete = AgglomerativeClustering(n_clusters=3, linkage='complete')
complete_pred = complete.fit_predict(iris.data)

single = AgglomerativeClustering(n_clusters=3, linkage='single')
single_pred = single.fit_predict(iris.data)

average = AgglomerativeClustering(n_clusters=3, linkage='average')
average_pred = average.fit_predict(iris.data)


To determine which clustering result best matches the original labels, use adjusted_rand_score, an external cluster validation index that produces a score between -1 and 1. A score of 1 means the two clusterings are identical; the score is unaffected by how the specific labels are assigned.
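A small check makes the label-invariance concrete: permuting the cluster labels leaves the score unchanged, so it measures agreement in grouping, not in label names.

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]

# Same grouping with the label names permuted: still a perfect score
permuted = [2, 2, 0, 0, 1, 1]
print(adjusted_rand_score(true_labels, permuted))  # → 1.0

# A grouping that disagrees with the labels scores lower (near 0,
# and possibly negative, for an essentially random assignment)
shuffled = [0, 1, 2, 0, 1, 2]
print(adjusted_rand_score(true_labels, shuffled))
```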

from sklearn.metrics import adjusted_rand_score

ward_ar_score = adjusted_rand_score(iris.target, ward_pred)
complete_ar_score = adjusted_rand_score(iris.target, complete_pred)
single_ar_score = adjusted_rand_score(iris.target, single_pred)
average_ar_score = adjusted_rand_score(iris.target, average_pred)

print("Ward       {:2.1%}".format(ward_ar_score),
"\nComplete   {:2.1%}".format(complete_ar_score),
"\nAverage    {:2.1%}".format(average_ar_score),
"\nSingle     {:2.1%}".format(single_ar_score))

Ward       73.1%
Complete   64.2%
Average    75.9%
Single     56.4%


The Effect of Normalization on Clustering

The fourth column has smaller values than the rest of the columns, so its variance currently counts for less in the clustering process. Rescaling the dataset puts the dimensions on a more equal footing. Note that sklearn’s preprocessing.normalize, used below, rescales each sample (row) to unit Euclidean length; scaling each column to the range 0 to 1 by subtracting the column minimum and dividing by the range is a different operation, available as preprocessing.MinMaxScaler.
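To make the distinction between the two rescalings concrete, here is a quick sketch on a tiny array: preprocessing.normalize gives every row unit length, while MinMaxScaler maps every column onto [0, 1].

```python
import numpy as np
from sklearn.preprocessing import normalize, MinMaxScaler

X = np.array([[4.8, 3.0, 1.4, 0.3],
              [7.0, 3.2, 4.7, 1.4]])

# Row normalization: each sample is divided by its Euclidean norm,
# so every row has length 1 afterwards.
row_norm = normalize(X)
print(np.linalg.norm(row_norm, axis=1))  # → [1. 1.]

# Min-max scaling: each column is mapped to [0, 1] independently.
col_scaled = MinMaxScaler().fit_transform(X)
print(col_scaled.min(axis=0))  # → [0. 0. 0. 0.]
print(col_scaled.max(axis=0))  # → [1. 1. 1. 1.]
```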

df.describe()


0 1 2 3
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
from sklearn import preprocessing

normalized_X = preprocessing.normalize(iris.data)

pd.DataFrame(normalized_X).describe()


0 1 2 3
count 150.000000 150.000000 150.000000 150.000000
mean 0.751400 0.405174 0.454784 0.141071
std 0.044368 0.105624 0.159986 0.077977
min 0.653877 0.238392 0.167836 0.014727
25% 0.715261 0.326738 0.250925 0.048734
50% 0.754883 0.354371 0.536367 0.164148
75% 0.786912 0.527627 0.580025 0.197532
max 0.860939 0.607125 0.636981 0.280419
ward = AgglomerativeClustering(n_clusters=3)
ward_pred = ward.fit_predict(normalized_X)

complete = AgglomerativeClustering(n_clusters=3, linkage="complete")
complete_pred = complete.fit_predict(normalized_X)

avg = AgglomerativeClustering(n_clusters=3, linkage="average")
avg_pred = avg.fit_predict(normalized_X)

single = AgglomerativeClustering(n_clusters=3, linkage="single")
single_pred = single.fit_predict(normalized_X)

n_ward_ar_score = adjusted_rand_score(iris.target, ward_pred)
n_complete_ar_score = adjusted_rand_score(iris.target, complete_pred)
n_avg_ar_score = adjusted_rand_score(iris.target, avg_pred)
n_single_ar_score = adjusted_rand_score(iris.target, single_pred)

print('          Unnormalized   Normalized',
"\nWard      {:12.1%}   {:10.1%}".format(ward_ar_score, n_ward_ar_score),
"\nComplete  {:12.1%}   {:10.1%}".format(complete_ar_score, n_complete_ar_score),
"\nAverage   {:12.1%}   {:10.1%}".format(average_ar_score, n_avg_ar_score),
"\nSingle    {:12.1%}   {:10.1%}".format(single_ar_score, n_single_ar_score))

          Unnormalized   Normalized
Ward             73.1%        88.6%
Complete         64.2%        64.4%
Average          75.9%        55.8%
Single           56.4%        55.8%


Dendrogram visualization with scipy

Visualize the highest-scoring clustering result (Ward on the normalized data). To do this, use SciPy’s linkage function to perform the clustering again, which gives us the linkage matrix we need to plot the hierarchy.

from scipy.cluster.hierarchy import linkage

linkage_type = 'ward'

linkage_matrix = linkage(normalized_X, method=linkage_type)

from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt
plt.figure(figsize=(22,18))

# plot using 'dendrogram()'
dendrogram(linkage_matrix)

plt.show()
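The linkage matrix can also be cut into flat clusters with SciPy’s fcluster. As a sketch, asking for 3 clusters with the same ward linkage should closely reproduce the AgglomerativeClustering assignments above:

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import normalize
from scipy.cluster.hierarchy import linkage, fcluster

iris = datasets.load_iris()
X = normalize(iris.data)

# Build the ward linkage matrix, then cut the tree into 3 flat clusters.
Z = linkage(X, method='ward')
flat = fcluster(Z, t=3, criterion='maxclust')  # labels are 1, 2, 3

print(np.unique(flat))  # → [1 2 3]
```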


Visualization with Seaborn’s clustermap

The seaborn plotting library for Python can plot a clustermap: a dendrogram paired with a heatmap of the dataset itself. It performs the clustering as well, so you only need to pass it the dataset and the linkage method, and it uses SciPy internally to do the clustering.

import seaborn as sns

sns.clustermap(normalized_X,
               figsize=(15, 40),
               method=linkage_type,
               cmap='viridis')
plt.show()
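One optional refinement, not in the original notebook: clustermap accepts a row_colors argument, which can color each row by its true species so you can visually check how well the dendrogram separates the classes. A sketch, with an arbitrary choice of palette:

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import normalize

iris = datasets.load_iris()
X = normalize(iris.data)

# Map each sample's true label (0, 1, 2) to a color for the row strip.
palette = ['tab:red', 'tab:green', 'tab:blue']
row_colors = [palette[label] for label in iris.target]

sns.clustermap(X,
               figsize=(15, 40),
               method='ward',
               cmap='viridis',
               row_colors=row_colors)
plt.show()
```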