Hierarchical Clustering
In this notebook, I use sklearn to conduct hierarchical clustering on the Iris dataset, which contains 150 samples with 4 dimensions/attributes each. Each sample is labeled as one of three types of Iris flowers.
Load the Iris Dataset
from sklearn import datasets
iris = datasets.load_iris()
iris.data contains the features; iris.target contains the labels.
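A quick optional sanity check on the shapes and label names:

print(iris.data.shape)    # (150, 4)
print(iris.target.shape)  # (150,)
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']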
iris.data[45:55]
array([[4.8, 3. , 1.4, 0.3],
[5.1, 3.8, 1.6, 0.2],
[4.6, 3.2, 1.4, 0.2],
[5.3, 3.7, 1.5, 0.2],
[5. , 3.3, 1.4, 0.2],
[7. , 3.2, 4.7, 1.4],
[6.4, 3.2, 4.5, 1.5],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4. , 1.3],
[6.5, 2.8, 4.6, 1.5]])
iris.target[45:55]
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
import pandas as pd
df = pd.DataFrame(iris.data)
df.describe()
|       | 0          | 1          | 2          | 3          |
|-------|------------|------------|------------|------------|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean  | 5.843333   | 3.057333   | 3.758000   | 1.199333   |
| std   | 0.828066   | 0.435866   | 1.765298   | 0.762238   |
| min   | 4.300000   | 2.000000   | 1.000000   | 0.100000   |
| 25%   | 5.100000   | 2.800000   | 1.600000   | 0.300000   |
| 50%   | 5.800000   | 3.000000   | 4.350000   | 1.300000   |
| 75%   | 6.400000   | 3.300000   | 5.100000   | 1.800000   |
| max   | 7.900000   | 4.400000   | 6.900000   | 2.500000   |
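The integer column headers above are just positional. For a more readable summary, iris.feature_names can be assigned as column names (a small optional aside; df_named is an illustrative name used only here, and the rest of the notebook keeps the positional columns):

df_named = pd.DataFrame(iris.data, columns=iris.feature_names)
df_named.describe()  # same statistics, labeled as sepal/petal length and width (cm)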
Perform Clustering
Use sklearn’s AgglomerativeClustering to conduct the hierarchical clustering. Ward is the default linkage algorithm.
from sklearn.cluster import AgglomerativeClustering
ward = AgglomerativeClustering(n_clusters=3)
ward_pred = ward.fit_predict(iris.data)
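An optional quick look at the resulting cluster sizes:

import numpy as np
np.unique(ward_pred, return_counts=True)  # cluster labels and the number of samples in each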
Also try the other linkage options: complete, average, and single.
complete = AgglomerativeClustering(n_clusters=3, linkage='complete')
complete_pred = complete.fit_predict(iris.data)

single = AgglomerativeClustering(n_clusters=3, linkage='single')
single_pred = single.fit_predict(iris.data)

average = AgglomerativeClustering(n_clusters=3, linkage='average')
average_pred = average.fit_predict(iris.data)
To determine which clustering result best matches the original labels, use adjusted_rand_score, an external cluster validation index that yields a score between -1 and 1, where 1 indicates the two clusterings are identical (regardless of which numeric label each cluster is assigned).
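For instance, the index is invariant to how the cluster labels are numbered; a minimal illustration:

from sklearn.metrics import adjusted_rand_score
adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])  # 1.0: same partition, labels swapped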
from sklearn.metrics import adjusted_rand_score
ward_ar_score = adjusted_rand_score(iris.target, ward_pred)
complete_ar_score = adjusted_rand_score(iris.target, complete_pred)
single_ar_score = adjusted_rand_score(iris.target, single_pred)
average_ar_score = adjusted_rand_score(iris.target, average_pred)
print("Ward {:2.1%}".format(ward_ar_score),
"\nComplete {:2.1%}".format(complete_ar_score),
"\nAverage {:2.1%}".format(average_ar_score),
"\nSingle {:2.1%}".format(single_ar_score))
Ward 73.1%
Complete 64.2%
Average 75.9%
Single 56.4%
The Effect of Normalization on Clustering
The fourth column has smaller values than the rest of the columns, so its variance currently counts for less in the clustering process. Rescaling the dataset so that the dimensions are on comparable scales gives them more equal weight. One common recipe is min-max scaling: subtract the minimum from each column, then divide by the range so every column lies between 0 and 1. Below, however, we use sklearn's preprocessing.normalize, which instead rescales each sample (row) to unit length, as the describe() output shows.
df.describe()
|       | 0          | 1          | 2          | 3          |
|-------|------------|------------|------------|------------|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean  | 5.843333   | 3.057333   | 3.758000   | 1.199333   |
| std   | 0.828066   | 0.435866   | 1.765298   | 0.762238   |
| min   | 4.300000   | 2.000000   | 1.000000   | 0.100000   |
| 25%   | 5.100000   | 2.800000   | 1.600000   | 0.300000   |
| 50%   | 5.800000   | 3.000000   | 4.350000   | 1.300000   |
| 75%   | 6.400000   | 3.300000   | 5.100000   | 1.800000   |
| max   | 7.900000   | 4.400000   | 6.900000   | 2.500000   |
from sklearn import preprocessing
normalized_X = preprocessing.normalize(iris.data)
pd.DataFrame(normalized_X).describe()
|       | 0          | 1          | 2          | 3          |
|-------|------------|------------|------------|------------|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean  | 0.751400   | 0.405174   | 0.454784   | 0.141071   |
| std   | 0.044368   | 0.105624   | 0.159986   | 0.077977   |
| min   | 0.653877   | 0.238392   | 0.167836   | 0.014727   |
| 25%   | 0.715261   | 0.326738   | 0.250925   | 0.048734   |
| 50%   | 0.754883   | 0.354371   | 0.536367   | 0.164148   |
| 75%   | 0.786912   | 0.527627   | 0.580025   | 0.197532   |
| max   | 0.860939   | 0.607125   | 0.636981   | 0.280419   |
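As an aside, the subtract-the-minimum, divide-by-the-range recipe mentioned above is column-wise min-max scaling, available as sklearn's MinMaxScaler. A minimal sketch, not used for the results below:

from sklearn.preprocessing import MinMaxScaler
minmax_X = MinMaxScaler().fit_transform(iris.data)
pd.DataFrame(minmax_X).describe()  # every column now runs from 0.0 to 1.0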
ward = AgglomerativeClustering(n_clusters=3)
ward_pred = ward.fit_predict(normalized_X)
complete = AgglomerativeClustering(n_clusters=3, linkage="complete")
complete_pred = complete.fit_predict(normalized_X)
avg = AgglomerativeClustering(n_clusters=3, linkage="average")
avg_pred = avg.fit_predict(normalized_X)
single = AgglomerativeClustering(n_clusters=3, linkage="single")
single_pred = single.fit_predict(normalized_X)
n_ward_ar_score = adjusted_rand_score(iris.target, ward_pred)
n_complete_ar_score = adjusted_rand_score(iris.target, complete_pred)
n_avg_ar_score = adjusted_rand_score(iris.target, avg_pred)
n_single_ar_score = adjusted_rand_score(iris.target, single_pred)
print(' Unnormalized Normalized',
"\nWard {:12.1%} {:10.1%}".format(ward_ar_score, n_ward_ar_score),
"\nComplete {:12.1%} {:10.1%}".format(complete_ar_score, n_complete_ar_score),
"\nAverage {:12.1%} {:10.1%}".format(average_ar_score, n_avg_ar_score),
"\nSingle {:12.1%} {:10.1%}".format(single_ar_score, n_single_ar_score))
Unnormalized Normalized
Ward 73.1% 88.6%
Complete 64.2% 64.4%
Average 75.9% 55.8%
Single 56.4% 55.8%
Dendrogram visualization with scipy
Visualize the highest scoring clustering result. To do this, use Scipy’s linkage function to perform the clustering again; this gives us the linkage matrix that the dendrogram function will use to visualize the hierarchy.
from scipy.cluster.hierarchy import linkage
linkage_type = 'ward'
linkage_matrix = linkage(normalized_X, linkage_type)
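Each row of the linkage matrix records one merge of two clusters. Peeking at the first few rows is an optional way to see this structure:

linkage_matrix[:5]  # columns: cluster index 1, cluster index 2, merge distance, size of the new cluster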
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt
plt.figure(figsize=(22,18))
# plot using 'dendrogram()'
dendrogram(linkage_matrix)
plt.show()
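As an optional cross-check, SciPy’s fcluster can cut this same tree into flat clusters; asking for 3 clusters should closely match the sklearn ward result above:

from scipy.cluster.hierarchy import fcluster
scipy_pred = fcluster(linkage_matrix, 3, criterion='maxclust')
adjusted_rand_score(iris.target, scipy_pred)  # expected to be close to the 88.6% ward score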
Visualization with Seaborn’s clustermap
The seaborn plotting library for Python can plot a clustermap: a dendrogram combined with a heatmap that visualizes the dataset itself in more detail. It conducts the clustering as well, so you only need to pass it the dataset and the linkage type, and it will use scipy internally to conduct the clustering.
import seaborn as sns
sns.clustermap(normalized_X,
figsize=(15,40),
method=linkage_type,
cmap='viridis')
plt.show()