Unsupervised Learning to Identify Customer Segments

In this project, I apply unsupervised machine learning techniques to better understand the core customer base of a mail-order sales company in Germany.

The project involved segmenting large, high-dimensional data related to the German population. By comparing similar data for the company's customers, I determined which population segments are disproportionately represented among them. These insights could be used to direct marketing campaigns towards the audiences most likely to be responsive.

The result of the project was a clearer understanding of the company's typical customer. Specifically, the analysis showed that the company's customer base is disproportionately financially thrifty, relatively old, dutiful, religious, and traditional.

The data used was provided by Udacity partner Bertelsmann Arvato Analytics.


This is page 2 of 2 for the project, where I

Transform Features and Cluster the Data

The specific tasks break down as follows:

  • 1: Feature Transformation
    • 1.1: Apply Feature Scaling
    • 1.2: Perform Dimensionality Reduction
    • 1.3: Interpret Principal Components
  • 2: Clustering
    • 2.1: Apply Clustering to General Population
  • 3: Apply All Steps to the Customer Data
    • 3.1: Feature Scaling
    • 3.2: Dimensionality Reduction
    • 3.3: Clustering
  • 4: Compare Customer Data to Demographics Data
  • 5: Discuss Results

1. Feature Transformation

The raw demographics and customer data were cleansed and preprocessed on the previous page of this project and saved as a_processed.csv and c_processed.csv.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

Preprocessed Demographics Data

a_path = '/home/ryan/large-files/datasets/customer-segments/processed/a_processed.csv'
a = pd.read_csv(a_path, index_col=0)
print(a.shape)
print(str(a.isna().sum().sum()) + ' missing values')
a.head()
(623211, 179)
0 missing values

ALTERSKATEGORIE_GROB FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ ... ZABEOTYP_4.0 ZABEOTYP_5.0 ZABEOTYP_6.0 NATIONALITAET_KZ_1.0 NATIONALITAET_KZ_2.0 NATIONALITAET_KZ_3.0 PRAEGENDE_JUGENDJAHRE_decade PRAEGENDE_JUGENDJAHRE_movement CAMEO_INTL_2015_wealth CAMEO_INTL_2015_life_stage
0 1.0 1.0 5.0 2.0 5.0 4.0 5.0 3.0 1.0 5.0 ... 0 1 0 1 0 0 6.0 0.0 5.0 1.0
1 3.0 1.0 4.0 1.0 2.0 3.0 5.0 3.0 3.0 4.0 ... 0 1 0 1 0 0 6.0 1.0 2.0 4.0
2 3.0 4.0 3.0 4.0 1.0 3.0 2.0 3.0 5.0 6.0 ... 1 0 0 1 0 0 4.0 0.0 4.0 3.0
3 1.0 3.0 1.0 5.0 2.0 2.0 5.0 3.0 3.0 2.0 ... 1 0 0 1 0 0 2.0 0.0 5.0 4.0
4 2.0 1.0 5.0 1.0 5.0 4.0 3.0 2.0 4.0 2.0 ... 1 0 0 1 0 0 5.0 0.0 2.0 2.0

5 rows × 179 columns

1.1. Apply Feature Scaling

By default, StandardScaler scales each feature to mean 0 and standard deviation 1.

scaler = StandardScaler()
a_ss = scaler.fit_transform(a)
a = pd.DataFrame(a_ss,
                 columns=a.columns)
a.head()

ALTERSKATEGORIE_GROB FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ ... ZABEOTYP_4.0 ZABEOTYP_5.0 ZABEOTYP_6.0 NATIONALITAET_KZ_1.0 NATIONALITAET_KZ_2.0 NATIONALITAET_KZ_3.0 PRAEGENDE_JUGENDJAHRE_decade PRAEGENDE_JUGENDJAHRE_movement CAMEO_INTL_2015_wealth CAMEO_INTL_2015_life_stage
0 -1.746287 -1.512226 1.581061 -1.045045 1.539061 1.047076 1.340485 1.044646 -1.665693 0.388390 ... -0.609065 2.966193 -0.309313 0.383221 -0.307557 -0.208434 1.164455 -0.553672 1.147884 -1.251111
1 0.202108 -1.512226 0.900446 -1.765054 -0.531624 0.318375 1.340485 1.044646 -0.290658 -0.123869 ... -0.609065 2.966193 -0.309313 0.383221 -0.307557 -0.208434 1.164455 1.806125 -0.909992 0.749820
2 0.202108 0.692400 0.219832 0.394972 -1.221852 0.318375 -0.856544 1.044646 1.084378 0.900649 ... 1.641861 -0.337133 -0.309313 0.383221 -0.307557 -0.208434 -0.213395 -0.553672 0.461926 0.082843
3 -1.746287 -0.042475 -1.141397 1.114980 -0.531624 -0.410325 1.340485 1.044646 -0.290658 -1.148386 ... 1.641861 -0.337133 -0.309313 0.383221 -0.307557 -0.208434 -1.591245 -0.553672 1.147884 0.749820
4 -0.772089 -1.512226 1.581061 -1.765054 1.539061 1.047076 -0.124201 -0.273495 0.396860 -1.148386 ... 1.641861 -0.337133 -0.309313 0.383221 -0.307557 -0.208434 0.475530 -0.553672 -0.909992 -0.584134

5 rows × 179 columns
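As a quick sanity check (a sketch reusing the scaled frame above), each standardized column should now have mean roughly 0 and standard deviation roughly 1:

print('max |column mean|: {:.2e}'.format(a.mean().abs().max()))
print('column std range: {:.4f} to {:.4f}'.format(a.std().min(), a.std().max()))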

1.2. Perform Dimensionality Reduction

The following helper function fits sklearn's PCA (Principal Component Analysis) to the scaled data using the specified number of components. It returns the fitted PCA object as well as the transformed data.

def do_pca(n_components, data):
    pca = PCA(n_components)
    X_pca = pca.fit_transform(data)
    X_pca = pd.DataFrame(X_pca)
    return pca, X_pca

The following helper plots the amount of variance explained by each component.

def scree_plot(pca):
    num_components = len(pca.explained_variance_ratio_)
    ind = np.arange(num_components)
    vals = pca.explained_variance_ratio_
 
    plt.figure(figsize=(14, 7))
    ax = plt.subplot(111)
    cumvals = np.cumsum(vals)
    ax.bar(ind, vals)
    ax.plot(ind, cumvals, color='darkorange')
    
    if num_components <= 30:
        for i in range(num_components):
            ax.annotate(r"%s%%" % ((str(round(vals[i]*100,1))[:3])), 
                        (ind[i]+0.2, vals[i]), 
                        va="bottom", 
                        ha="center", 
                        fontsize=8)
 
    ax.xaxis.set_tick_params(width=0)
    ax.yaxis.set_tick_params(width=1, length=6)
 
    ax.set_xlabel("Principal Component")
    ax.set_ylabel("Variance Explained (%)")
    
    for spine in ax.spines.values():
        spine.set_visible(False)
    
    plt.title('Explained Variance Per Principal Component')
pca_179, a_pca = do_pca(a.shape[1], a)
scree_plot(pca_179)

[Scree plot: explained variance per principal component, all 179 components]

The first 30 principal components appear to explain roughly 50% of the overall variance.
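A quick numeric check of this estimate, reusing the full-rank fit from above:

cum_var = np.cumsum(pca_179.explained_variance_ratio_)
print('First 30 components explain {:.1%} of the variance'.format(cum_var[29]))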

pca, a_pca = do_pca(30, a)
scree_plot(pca)

[Scree plot: explained variance per principal component, first 30 components]

1.3. Interpret Principal Components

Now that we have our transformed principal components, it's worth examining the weight of each variable on the first few components to see whether they can be interpreted in some fashion.

As a reminder, each principal component is a unit vector that points in the direction of highest variance (after accounting for the variance captured by earlier principal components). The further a weight is from zero, the more the principal component points in the direction of the corresponding feature. If two features have large weights of the same sign (both positive or both negative), then increases in one tend to be associated with increases in the other. By contrast, features with weights of opposite signs can be expected to show a negative correlation: increases in one variable should be associated with decreases in the other.
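To make this sign rule concrete, here is a minimal sketch on toy data (not the project dataset): two positively correlated features receive same-sign weights on the first principal component, while an anti-correlated feature gets the opposite sign.

rng = np.random.RandomState(42)
x = rng.normal(size=1000)
toy = np.column_stack([x + 0.1 * rng.normal(size=1000),    # feature A
                       x + 0.1 * rng.normal(size=1000),    # feature B, correlated with A
                       -x + 0.1 * rng.normal(size=1000)])  # feature C, anti-correlated with A and B
pc1 = PCA(n_components=1).fit(toy).components_[0]
print(pc1)  # A and B share a sign; C takes the opposite sign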

The following helper function creates a results DataFrame with the feature weights.

def pca_results(full_dataset, pca):
    # Dimension indexing
    dimensions = ['Dimension {}'.format(i)
                  for i in range(1, len(pca.components_) + 1)]
    
    # PCA components
    components = pd.DataFrame(np.round(pca.components_, 4), 
                              columns = full_dataset.keys())
    components.index = dimensions
    
    # PCA explained variance
    ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1)
    variance_ratios = pd.DataFrame(np.round(ratios, 4), 
                                   columns = ['Explained Variance'])
    variance_ratios.index = dimensions
    
    # Return a concatenated DataFrame
    return pd.concat([variance_ratios, components], axis = 1)
results = pca_results(a, pca)
print(results.shape)
results.T[['Dimension 1','Dimension 2','Dimension 3']].head()
(30, 180)

Dimension 1 Dimension 2 Dimension 3
Explained Variance 0.0732 0.0557 0.0430
ALTERSKATEGORIE_GROB 0.2145 -0.1243 0.0587
FINANZ_MINIMALIST 0.2184 0.0940 0.0845
FINANZ_SPARER -0.2297 0.0920 -0.0701
FINANZ_VORSORGER 0.2002 -0.1027 0.0720

The following helper function sorts the results DataFrame to show the most important features for each dimension.

def top_features(dimension, count=5):
    col_label = 'Dimension {}'.format(dimension)
    
    # Rank every feature by its weight on this dimension
    weights = (results
               .drop(columns=['Explained Variance'])
               .T
               [[col_label]]
               .sort_values([col_label], ascending=False))
    
    most_positive = weights.head(count)
    most_negative = weights.tail(count)
    
    # DataFrame.append was removed in pandas 2.0; use concat instead
    return pd.concat([most_positive, most_negative])

Dimension 1

top_1 = top_features(1)
top_1

Dimension 1
FINANZ_MINIMALIST 0.2184
ALTERSKATEGORIE_GROB 0.2145
FINANZ_VORSORGER 0.2002
SEMIO_LUST 0.1543
SEMIO_ERL 0.1502
SEMIO_TRADV -0.1830
SEMIO_REL -0.1906
SEMIO_PFLICHT -0.1958
PRAEGENDE_JUGENDJAHRE_decade -0.2030
FINANZ_SPARER -0.2297

Interpretation of Dimension 1

  • People who are financially thrifty, financially disinterested, relatively older, dutiful, religious, and traditional.

Detailed feature breakdown:

Direction of Vector:

  • FINANZ_MINIMALIST: Financial typology, for each dimension: low financial interest
    • 1: very high
    • 5: very low
  • ALTERSKATEGORIE_GROB: Estimated age based on given name analysis
    • 1: < 30 years old
    • 4: > 60 years old
  • FINANZ_VORSORGER: Financial typology, for each dimension: be prepared
  • SEMIO_LUST: Personality typology, for each dimension: sensual-minded
    • 1: highest affinity
    • 7: lowest affinity
  • SEMIO_ERL: Personality typology, for each dimension: event-oriented

Opposite Direction:

  • SEMIO_TRADV: Personality typology, for each dimension: traditional-minded
  • SEMIO_REL: Personality typology, for each dimension: religious
  • SEMIO_PFLICHT: Personality typology, for each dimension: dutiful
  • PRAEGENDE_JUGENDJAHRE_decade: Dominating movement of person’s youth (avantgarde vs. mainstream; east vs. west): decade of event
    • 1: 40s - war years (Mainstream, E+W)
    • 2: 40s - reconstruction years (Avantgarde, E+W)
    • 14: 90s - digital media kids (Mainstream, E+W)
    • 15: 90s - ecological awareness (Avantgarde, E+W)
  • FINANZ_SPARER: Financial typology, for each dimension: money-saver

Dimension 2

top_2 = top_features(2)
top_2

Dimension 2
ONLINE_AFFINITAET 0.1715
SEMIO_KULT 0.1641
SEMIO_REL 0.1570
ZABEOTYP_1.0 0.1565
ANZ_PERSONEN 0.1356
SEMIO_ERL -0.1630
ZABEOTYP_3.0 -0.1658
CAMEO_INTL_2015_wealth -0.1746
HH_EINKOMMEN_SCORE -0.1778
FINANZ_HAUSBAUER -0.1861

Interpretation of Dimension 2

  • People who are home-owners, high income, wealthy, have a high affinity for being online, and are “green” in terms of energy consumption.

Detailed feature breakdown:

Direction of Vector:

  • ONLINE_AFFINITAET: Online affinity
    • 0: none
    • 5: highest
  • SEMIO_KULT: Personality typology, for each dimension: cultural-minded
    • 1: highest affinity
    • 7: lowest affinity
  • SEMIO_REL: Personality typology, for each dimension: religious
  • ZABEOTYP_1.0: Energy consumption typology: green
  • ANZ_PERSONEN: Number of adults in household

Opposite Direction:

  • SEMIO_ERL: Personality typology, for each dimension: event-oriented
  • ZABEOTYP_3.0: Energy consumption typology: fair supplied
  • CAMEO_INTL_2015_wealth: German CAMEO: Wealth / Life Stage Typology, mapped to international code: Wealth
    • 1: Wealthy
    • 5: Poorer
  • HH_EINKOMMEN_SCORE: Estimated household net income
    • 1: highest income
    • 6: very low income
  • FINANZ_HAUSBAUER: Financial typology, for each dimension: home ownership

Dimension 3

top_3 = top_features(3)
top_3

Dimension 3
ANREDE_KZ_1.0 0.3151
SEMIO_VERT 0.2851
SEMIO_SOZ 0.2259
SEMIO_FAM 0.2255
SEMIO_KULT 0.2043
SEMIO_RAT -0.1764
SEMIO_KRIT -0.2134
SEMIO_DOM -0.2560
SEMIO_KAEM -0.2759
ANREDE_KZ_2.0 -0.3151

Interpretation of Dimension 3

  • Males who are personally combative, dominant, critical, and rational. They are specifically not dreamful, socially-minded, family-minded, or culturally-minded.

Detailed feature breakdown:

Direction of Vector:

  • ANREDE_KZ_1.0: Male Gender
  • SEMIO_VERT: Personality typology, for each dimension: dreamful
    • 1: highest affinity
    • 7: lowest affinity
  • SEMIO_SOZ: Personality typology, for each dimension: socially-minded
  • SEMIO_FAM: Personality typology, for each dimension: family-minded
  • SEMIO_KULT: Personality typology, for each dimension: cultural-minded

Opposite Direction:

  • SEMIO_RAT: Personality typology, for each dimension: rational
  • SEMIO_KRIT: Personality typology, for each dimension: critical-minded
  • SEMIO_DOM: Personality typology, for each dimension: dominant-minded
  • SEMIO_KAEM: Personality typology, for each dimension: combative attitude
  • ANREDE_KZ_2.0: Female Gender

2. Clustering

2.1. Apply Clustering to General Population

Now let's see how the data clusters in principal-component space. In this step, I apply k-means clustering to the dataset and use the “elbow method” to determine the optimal number of clusters to use.

scores = []
centers = list(range(1,16))
for center in centers:
    kmeans = KMeans(n_clusters=center)
    model = kmeans.fit(a_pca)
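    # score() returns the negative SSE (inertia), so take the absolute value.
    # Note: k-means initialization is random; scores can vary slightly per run.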
    score = np.abs(model.score(a_pca))
    scores.append(score)
plt.figure(figsize=(14, 7))
plt.plot(centers, scores, linestyle='--', marker='o', color='b')
plt.ylim(bottom=0)
plt.xlabel('K')
plt.xticks([1,5,10,15])
plt.ylabel('SSE')
plt.title('SSE vs. K')
for spine in plt.gca().spines.values():
    spine.set_visible(False)

[Elbow plot: SSE vs. K for K = 1 to 15]

Based on the plot above, I select 5 as a reasonable point of diminishing returns.

kmeans = KMeans(n_clusters=5)
model = kmeans.fit(a_pca)
a_labels = (pd.DataFrame(model.predict(a_pca))
            .rename(columns={0:'demo label'}))
print(a_labels.shape)
(623211, 1)
a_label_counts = (pd.DataFrame(a_labels['demo label'].value_counts())
                  .rename(columns={'demo label':'demo count'}))
a_label_counts['demo percent'] = round(a_label_counts['demo count'] / 
                                       a_labels.shape[0]*100,1)

3. Apply All Steps to the Customer Data

Next, I apply all of these steps to the cleansed customer data using the same fitted sklearn models generated above. Note that I call transform rather than fit_transform, so the customer data is scaled and projected using the statistics learned from the general population.

c_path = '/home/ryan/large-files/datasets/customer-segments/processed/c_processed.csv'
c = pd.read_csv(c_path, index_col=0)
print(c.shape)
print(str(c.isna().sum().sum()) + ' missing values')
c.head()
(133377, 179)
0 missing values

ALTERSKATEGORIE_GROB FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ ... ZABEOTYP_4.0 ZABEOTYP_5.0 ZABEOTYP_6.0 NATIONALITAET_KZ_1.0 NATIONALITAET_KZ_2.0 NATIONALITAET_KZ_3.0 PRAEGENDE_JUGENDJAHRE_decade PRAEGENDE_JUGENDJAHRE_movement CAMEO_INTL_2015_wealth CAMEO_INTL_2015_life_stage
0 4.0 5.0 1.0 5.0 1.0 2.0 2.0 1.0 5.0 6.0 ... 0 0 0 1 0 0 2.0 1.0 1.0 3.0
1 4.0 5.0 1.0 5.0 1.0 4.0 4.0 2.0 5.0 2.0 ... 0 0 0 1 0 0 2.0 1.0 3.0 4.0
2 4.0 5.0 1.0 5.0 2.0 1.0 2.0 2.0 3.0 6.0 ... 0 0 0 1 0 0 1.0 0.0 2.0 4.0
3 3.0 3.0 1.0 4.0 4.0 5.0 2.0 3.0 5.0 4.0 ... 0 0 0 1 0 0 4.0 0.0 4.0 1.0
4 3.0 5.0 1.0 5.0 1.0 2.0 3.0 3.0 3.0 6.0 ... 0 0 0 1 0 0 2.0 1.0 3.0 4.0

5 rows × 179 columns

3.1. Feature Scaling

c_ss = scaler.transform(c)
c = pd.DataFrame(c_ss,
                 columns=c.columns)
print(c.shape)
c.head()
(133377, 179)

ALTERSKATEGORIE_GROB FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ ... ZABEOTYP_4.0 ZABEOTYP_5.0 ZABEOTYP_6.0 NATIONALITAET_KZ_1.0 NATIONALITAET_KZ_2.0 NATIONALITAET_KZ_3.0 PRAEGENDE_JUGENDJAHRE_decade PRAEGENDE_JUGENDJAHRE_movement CAMEO_INTL_2015_wealth CAMEO_INTL_2015_life_stage
0 1.176305 1.427276 -1.141397 1.114980 -1.221852 -0.410325 -0.856544 -1.591635 1.084378 0.900649 ... -0.609065 -0.337133 -0.309313 0.383221 -0.307557 -0.208434 -1.591245 1.806125 -1.595951 0.082843
1 1.176305 1.427276 -1.141397 1.114980 -1.221852 1.047076 0.608142 -0.273495 1.084378 -1.148386 ... -0.609065 -0.337133 -0.309313 0.383221 -0.307557 -0.208434 -1.591245 1.806125 -0.224033 0.749820
2 1.176305 1.427276 -1.141397 1.114980 -0.531624 -1.139026 -0.856544 -0.273495 -0.290658 0.900649 ... -0.609065 -0.337133 -0.309313 0.383221 -0.307557 -0.208434 -2.280170 -0.553672 -0.909992 0.749820
3 0.202108 -0.042475 -1.141397 0.394972 0.848833 1.775776 -0.856544 1.044646 1.084378 -0.123869 ... -0.609065 -0.337133 -0.309313 0.383221 -0.307557 -0.208434 -0.213395 -0.553672 0.461926 -1.251111
4 0.202108 1.427276 -1.141397 1.114980 -1.221852 -0.410325 -0.124201 1.044646 -0.290658 0.900649 ... -0.609065 -0.337133 -0.309313 0.383221 -0.307557 -0.208434 -1.591245 1.806125 -0.224033 0.749820

5 rows × 179 columns

3.2. Dimensionality Reduction

c_pca = pca.transform(c)
c_pca = pd.DataFrame(c_pca)
print(c_pca.shape)
c_pca.head()
(133377, 30)

0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 28 29
0 7.058972 3.420497 2.500324 2.100575 -0.997242 0.357982 1.643074 -1.987261 -0.131893 3.745730 ... -2.629017 -3.100104 -1.653428 1.266194 -1.564352 -1.139092 0.593162 -0.507610 -0.417441 -0.898603
1 5.527108 -2.299668 -1.057538 4.004213 -3.509783 0.377007 -0.706439 0.596483 0.403730 -0.726155 ... 2.007500 -1.508163 1.866077 1.669437 -0.639520 0.501114 -1.797476 1.960346 -1.062995 1.349571
2 3.965549 2.526799 1.897971 -4.488827 -0.490012 -0.328013 -1.467358 -0.244206 -0.963943 1.767246 ... -0.256628 1.314349 1.159087 -0.805965 -0.282223 -1.838544 0.288024 1.200617 0.077224 0.680797
3 -0.952263 1.129877 1.647568 0.479191 2.608330 0.865320 -3.497100 -0.222191 -1.224088 -0.148981 ... -0.071952 -0.157687 0.032621 -0.235610 -2.169996 -0.179192 -3.537217 0.913882 -0.013844 -1.198688
4 3.920457 2.322970 2.579195 3.451799 0.253436 3.131984 -1.969115 2.462193 3.717401 2.315231 ... 1.433696 -1.553335 1.425058 -1.520453 1.390906 5.829346 0.425688 1.927881 -0.090094 -0.447725

5 rows × 30 columns

3.3. Clustering

c_labels = model.predict(c_pca)
c_labels = (pd.DataFrame(c_labels)
            .rename(columns={0:'cust label'}))
print(c_labels.shape)
c_labels.head()
(133377, 1)

cust label
0 2
1 2
2 0
3 4
4 2
c_label_counts = (pd.DataFrame(c_labels['cust label'].value_counts())
                  .rename(columns={'cust label':'cust count'}))
c_label_counts['cust percent'] = round(c_label_counts['cust count']/c_labels.shape[0]*100,
                                       1)
c_label_counts

cust count cust percent
2 62982 47.2
0 37226 27.9
1 27909 20.9
4 3026 2.3
3 2234 1.7

4. Compare Customer Data to Demographics Data

In this step, I compare the two cluster distributions to see where the strongest customer base for the company is.

cluster_counts = a_label_counts.join(c_label_counts)
cluster_counts['delta percent'] = (cluster_counts['cust percent'] - 
                                   cluster_counts['demo percent'])
cluster_counts = (cluster_counts
                  .sort_values(['delta percent'],
                               ascending=False)
                  .reset_index()
                  .rename(columns={'index':'clusters'}))
cluster_counts['delta str'] = (cluster_counts['delta percent']
                               .round(0)
                               .astype(int)
                               .astype(str))
cluster_counts.loc[cluster_counts['delta percent']>0, 
                 'delta str'] = '+' + cluster_counts['delta str']
cluster_counts['delta str'] = cluster_counts['delta str'] + '%'
ordered_cluster_list = list(cluster_counts['clusters'])
ordered_cluster_list
[2, 0, 1, 4, 3]
plt.figure(figsize=(14,7))
plt.bar(cluster_counts.index-0.1875,
        cluster_counts['demo percent'],
        width=0.375,
        label='General Population')
plt.bar(cluster_counts.index+0.1875,
        cluster_counts['cust percent'],
        width=0.375,
        label='Customer')
plt.xticks(range(0,len(cluster_counts)),
           cluster_counts['clusters'])
plt.ylim([0,70])
for spine in plt.gca().spines.values():
    spine.set_visible(False)
plt.xlabel('Cluster Labels')
plt.ylabel('% of Dataset')
plt.legend(fancybox=True,
           loc=9,
           ncol=4,
           facecolor='white',
           edgecolor='white',
           framealpha=1,
           fontsize=12)
plt.tick_params(
    axis='y',
    left=False,
    right=False,
    labelleft=False)
plt.tick_params(
    axis='x',
    bottom=False)
rects = plt.gca().patches[len(cluster_counts):]
deltas = list(cluster_counts['delta str'].values)
for n, r in enumerate(rects):
    height = r.get_height()
    plt.gca().text(r.get_x() + r.get_width() / 2,
                   height+1.2,
                   str(deltas[n]),
                   ha='center',
                   va='center')
plt.title('General Population Demographic Groups versus Customer Groups');

[Bar chart: cluster shares of the general population vs. customers, annotated with percentage-point deltas]

cust_median_centroids = (c_pca.join(c_labels)
                         .groupby('cust label')
                         .median()
                         .T)
cust_median_centroids.columns.name = 'pc'
cust_median_centroids = (cust_median_centroids[ordered_cluster_list]
                         .round(1))
print(cust_median_centroids.shape)
cust_median_centroids.index += 1
cust_median_centroids.head()
(30, 5)

pc 2 0 1 4 3
1 4.9 3.8 2.0 -1.8 -2.5
2 2.8 1.6 -3.1 2.2 -0.4
3 1.8 0.9 1.0 2.5 -3.0
4 3.0 -2.9 0.5 0.3 0.8
5 0.3 1.2 0.1 0.5 0.2

The following helper function returns the most important principal components for a given cluster.

def most_important_components(df, cluster, top=2):
    # Rank the principal components by this cluster's median score
    ranked = df.sort_values([cluster], ascending=False)
    high = ranked[:top][[cluster]]
    low = ranked[-top:][[cluster]]
    
    # DataFrame.append was removed in pandas 2.0; use concat instead
    combined = pd.concat([high, low])
    combined['abs'] = combined[cluster].abs()
    combined = (combined
                .sort_values(['abs'],
                             ascending=False)
                .drop(columns=['abs']))
    return combined

The following table shows the most important principal component vectors that characterize each of the 5 clusters. The most important principal components can be either in the direction of a component or strongly opposite (negative).

top = 3
top_components_across_clusters = most_important_components(cust_median_centroids,
                                                           ordered_cluster_list[0],
                                                           top)
    
for cluster in ordered_cluster_list[1:]:
    next_cluster = most_important_components(cust_median_centroids,
                                             cluster,
                                             top)
    top_components_across_clusters = (top_components_across_clusters
                                      .join(next_cluster,
                                            how='outer'))
top_components_across_clusters.fillna('')

pc 2 0 1 4 3
1 4.9 3.8 2 -1.8 -2.5
2 2.8 1.6 -3.1 2.2
3 1 2.5 -3
4 3 -2.9 0.8
5 1.2 0.2
6 -0.6 -0.6 0.9 0.7 0.4
7 -1.2 -0.5 -0.5
8 -0.6 -0.9
9 -0.9
12 -0.6
13 -0.6
top_components_across_clusters.fillna('').head(3)

pc 2 0 1 4 3
1 4.9 3.8 2 -1.8 -2.5
2 2.8 1.6 -3.1 2.2
3 1 2.5 -3

Based on the tables above:

  • Clusters 2 and 0: overrepresented, typified by the characteristics of principal component 1.
  • Cluster 1: underrepresented, typified by the opposite characteristics of principal component 2.
  • Cluster 4: underrepresented, typified by the characteristics of principal components 3 and 2.
  • Cluster 3: underrepresented, typified by the opposite characteristics of principal components 1 and 3.

5: Discuss Results

Overrepresentation Among Customers

Principal Component 1

The results show clearly that the customer base is strongly typified by the characteristics of principal component 1:

  • People who are financially thrifty, financially disinterested, relatively older, dutiful, religious, and traditional.

The more strongly the cluster aligns to principal component 1, the more overrepresented the cluster is among customers.
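A quick check of this claim (a sketch reusing cluster_counts and cust_median_centroids from above): listing each cluster's representation delta alongside its median score on principal component 1 shows that the two fall in the same order.

check = cluster_counts[['clusters', 'delta percent']].copy()
check['PC1 median'] = cust_median_centroids.loc[1, list(check['clusters'])].values
print(check)  # deltas and PC1 medians decrease together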

Underrepresentation Among Customers

The characteristics of the underrepresented groups are generally much less clear. I suggest follow-up work to better understand them.

Principal Component 2

People typified primarily by the opposite of principal component 2 are underrepresented among the customer base.

  • People who are not the combination of the following: home-owners, high income, wealthy, have a high affinity for being online, and are “green” in terms of energy consumption.

Generally speaking, I would expect this to mean that people typified primarily by being renters, low-income, impoverished, offline, and indifferent about their energy consumption are underrepresented among the customer base.

Principal Component 3

People typified primarily by large values (positive or negative) of principal component 3 are underrepresented among the customer base.

This means that people who strongly fit either of the following profiles:

  • Males who are personally combative, dominant, critical, and rational.

or

  • Females who are dreamful, socially-minded, family-minded, or culturally-minded.

are unlikely to be among the customer base.