Unsupervised Learning to Identify Customer Segments
In this project, I apply unsupervised machine learning techniques to better understand the core customer base of a mail-order sales company in Germany.
The project involved segmenting large, high-dimensional data describing the German population. By comparing similar data for customers of the company, I determined which population segments are disproportionately represented among the company’s customers. These insights could be used to direct marketing campaigns towards the audiences most likely to be responsive.
The result of the project was a clearer understanding of the company’s typical customer. Specifically, the analysis showed that the company’s customer base is disproportionately financially thrifty, relatively old, dutiful, religious, and traditional.
The data used was provided by Udacity partner Bertelsmann Arvato Analytics.
This is page 1 of 2 for the project, where I load and preprocess the data.
The specific tasks break down as follows:
- 1: Load the Data
- 2: Preprocessing
- 2.1: Assess Missing Data
- 2.1.1: Convert Missing Value Codes to NaNs
- 2.1.2: Assess Missing Data in Each Column
- Discussion 2.1.2: Assess Missing Data in Each Column
- 2.1.3: Assess Missing Data in Each Row
- Discussion 2.1.3: Assess Missing Data in Each Row
- 2.2: Select and Re-Encode Features
- 2.2.1: Re-Encode Categorical Features
- 2.2.2: Engineer Mixed-Type Features
- 2.2.3: Complete Feature Selection
- 2.3: Convert Processing Steps to a Function
- 3: Write Processed Data to File
1. Load the Data
There are four files associated with this project:
- Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891,221 persons (rows) x 85 features (columns).
- Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191,652 persons (rows) x 85 features (columns).
- Data_Dictionary.md: Detailed information file about the features in the provided datasets. This file contains written explanations of the data included in the other files.
- AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Demographics Data
a_path = '/home/ryan/large-files/datasets/customer-segments/Udacity_AZDIAS_Subset.csv'
a = pd.read_csv(a_path,
sep=';')
print(a.shape)
a.head()
(891221, 85)
AGER_TYP | ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | ... | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -1 | 2 | 1 | 2.0 | 3 | 4 | 3 | 5 | 5 | 3 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | -1 | 1 | 2 | 5.0 | 1 | 5 | 2 | 5 | 4 | 5 | ... | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 5.0 | 4.0 | 3.0 | 5.0 | 4.0 |
2 | -1 | 3 | 2 | 3.0 | 1 | 4 | 1 | 2 | 3 | 5 | ... | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 4.0 | 4.0 | 3.0 | 5.0 | 2.0 |
3 | 2 | 4 | 2 | 2.0 | 4 | 2 | 5 | 2 | 1 | 2 | ... | 2.0 | 2.0 | 2.0 | 0.0 | 1.0 | 3.0 | 4.0 | 2.0 | 3.0 | 3.0 |
4 | -1 | 3 | 1 | 5.0 | 4 | 3 | 4 | 1 | 3 | 2 | ... | 2.0 | 4.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 | 6.0 | 5.0 |
5 rows × 85 columns
Customers Data
c_path = '/home/ryan/large-files/datasets/customer-segments/Udacity_CUSTOMERS_Subset.csv'
c = pd.read_csv(c_path,
sep=';')
print(c.shape)
c.head()
(191652, 85)
AGER_TYP | ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | ... | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 4 | 1 | 5.0 | 5 | 1 | 5 | 1 | 2 | 2 | ... | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 5.0 | 5.0 | 1.0 | 2.0 | 1.0 |
1 | -1 | 4 | 1 | NaN | 5 | 1 | 5 | 1 | 3 | 2 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | -1 | 4 | 2 | 2.0 | 5 | 1 | 5 | 1 | 4 | 4 | ... | 2.0 | 3.0 | 3.0 | 1.0 | 3.0 | 3.0 | 2.0 | 3.0 | 5.0 | 3.0 |
3 | 1 | 4 | 1 | 2.0 | 5 | 1 | 5 | 2 | 1 | 2 | ... | 3.0 | 2.0 | 1.0 | 0.0 | 1.0 | 3.0 | 4.0 | 1.0 | 3.0 | 1.0 |
4 | -1 | 3 | 1 | 6.0 | 3 | 1 | 4 | 4 | 5 | 2 | ... | 2.0 | 4.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 | 5.0 | 1.0 |
5 rows × 85 columns
Feature Summary
f_path = '/home/ryan/large-files/datasets/customer-segments/AZDIAS_Feature_Summary.csv'
f = pd.read_csv(f_path,
sep=';')
print(f.shape)
f.head()
(85, 4)
attribute | information_level | type | missing_or_unknown | |
---|---|---|---|---|
0 | AGER_TYP | person | categorical | [-1,0] |
1 | ALTERSKATEGORIE_GROB | person | ordinal | [-1,0,9] |
2 | ANREDE_KZ | person | categorical | [-1,0] |
3 | CJT_GESAMTTYP | person | categorical | [0] |
4 | FINANZ_MINIMALIST | person | ordinal | [-1] |
2. Preprocessing
For purposes of establishing the steps of the cleaning process, use a combined dataset d containing the customer (c) and demographics (a) datasets appended together.
# Stack the two datasets row-wise. (DataFrame.append was removed in pandas 2.0;
# pd.concat([a, c]) is the modern equivalent.)
d = (a
     .append(c)
     .reset_index(drop=True))
print(d.shape)
(1082873, 85)
2.1 Assess Missing Data
2.1.1 Convert Missing Value Codes to NaNs
The feature summary dataset, f, has a missing_or_unknown column listing the values that indicate the data is missing or unknown for each attribute.
missing_vals = [-1, 0, 9, 'X', 'XX']
missing = f[['attribute','missing_or_unknown']].copy()
# For each possible code, add a column that records the code for every
# attribute whose missing_or_unknown string mentions it.
for col in missing_vals:
    missing.loc[missing['missing_or_unknown'].str.contains(str(col)), col] = col
missing.head()
attribute | missing_or_unknown | -1 | 0 | 9 | X | XX | |
---|---|---|---|---|---|---|---|
0 | AGER_TYP | [-1,0] | -1.0 | 0.0 | NaN | NaN | NaN |
1 | ALTERSKATEGORIE_GROB | [-1,0,9] | -1.0 | 0.0 | 9.0 | NaN | NaN |
2 | ANREDE_KZ | [-1,0] | -1.0 | 0.0 | NaN | NaN | NaN |
3 | CJT_GESAMTTYP | [0] | NaN | 0.0 | NaN | NaN | NaN |
4 | FINANZ_MINIMALIST | [-1] | -1.0 | NaN | NaN | NaN | NaN |
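As an aside, the missing_or_unknown strings could also be parsed directly into Python lists rather than spread across per-code columns. Below is a minimal alternative sketch (the helper name parse_missing_codes is mine, not part of the project code), assuming every code is either an integer or one of the strings 'X'/'XX':

def parse_missing_codes(s):
    # Turn a string like '[-1,0]' or '[-1,XX]' into a list such as [-1, 0] or [-1, 'XX'].
    codes = []
    for token in s.strip('[]').split(','):
        token = token.strip()
        if not token:
            continue
        try:
            codes.append(int(token))
        except ValueError:
            codes.append(token)  # non-numeric codes such as 'X' or 'XX'
    return codes

# Example: a Series mapping each attribute to its list of missing codes.
missing_code_lists = f.set_index('attribute')['missing_or_unknown'].apply(parse_missing_codes)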
The following cell iterates through all the columns of the combined dataset d, replacing values with NaN where indicated by the missing dataframe above. Note that the columns of the demographics and customer datasets are identical and match the rows of the missing dataframe.
for attr in missing['attribute']:
    if d[attr].dtype in [float, int]:
        # Numeric columns: replace the numeric missing codes (-1, 0, 9)
        # flagged for this attribute.
        d[attr] = d[attr].astype(float)
        for col in missing_vals[:3]:
            missing_val = missing[missing['attribute']==attr][col].values[0]
            d[attr] = d[attr].replace(missing_val, np.nan, regex=False)
    else:
        # Object columns: replace the string missing codes ('X', 'XX')
        # flagged for this attribute.
        for col in missing_vals[3:]:
            missing_val = missing[missing['attribute']==attr][col].values[0]
            d[attr] = d[attr].replace(missing_val, np.nan, regex=False)
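A quick spot check that the conversion behaved as expected for one attribute (AGER_TYP, whose codes -1 and 0 should now be NaN):

# The codes -1 and 0 should no longer appear in AGER_TYP after conversion.
assert not d['AGER_TYP'].isin([-1, 0]).any()
print(d['AGER_TYP'].value_counts(dropna=False).head())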
2.1.2 Assess Missing Data in Each Column
# Start from a one-row transpose just to get a dataframe with one row per
# attribute, then overwrite the placeholder column with the count of NaNs.
ratio_missing = (d
                 .head(1)
                 .T
                 .reset_index()
                 .rename(columns={'index' : 'attribute',
                                  0 : 'missing'}))
ratio_missing['missing'] = np.nan
for col in d.columns:
    ratio_missing.loc[ratio_missing['attribute']==col, 'missing'] = d[col].isna().sum()
ratio_missing['ratio'] = ratio_missing['missing'] / d.shape[0]
ratio_missing.head()
attribute | missing | ratio | |
---|---|---|---|
0 | AGER_TYP | 782581.0 | 0.722690 |
1 | ALTERSKATEGORIE_GROB | 3121.0 | 0.002882 |
2 | ANREDE_KZ | 0.0 | 0.000000 |
3 | CJT_GESAMTTYP | 8067.0 | 0.007450 |
4 | FINANZ_MINIMALIST | 0.0 | 0.000000 |
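The same per-column counts and ratios can also be computed more concisely with pandas built-ins; a minimal equivalent sketch (the name ratio_check is mine):

# Count and fraction of NaNs per column, matching the table above.
ratio_check = (d.isna().sum()
                 .rename('missing')
                 .to_frame()
                 .assign(ratio=lambda t: t['missing'] / len(d))
                 .rename_axis('attribute')
                 .reset_index())
print(ratio_check.head())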
plt.figure(figsize=(8,24))
plt.barh(ratio_missing['attribute'],
ratio_missing['ratio'],
zorder=3)
plt.xlabel('Fraction of Missing Values')
plt.title('Fraction of Missing Values Across Combined Customer and Demographic Datasets')
plt.ylim([-1,85])
plt.tick_params(
axis='y',
left=False)
plt.tick_params(
axis='x',
bottom=False)
plt.grid(True,
axis='x',
zorder=0)
for spine in plt.gca().spines.values():
    spine.set_visible(False)
Since most columns have fewer than 27% missing values, drop the fields with more than 27% missing. This threshold is chosen specifically to ensure that the CAMEO_INTL_2015 field is preserved in both the customer and demographics datasets. The fields with notably high fractions of missing values include:
- AGER_TYP: Best-ager typology
- GEBURTSJAHR: Year of birth
- TITEL_KZ: Academic title flag
- ALTER_HH: Birthdate of head of household
- KK_KUNDENTYP: Consumer pattern over past 12 months
- KBA05_BAUMAX: Most common building type within the microcell
- KKK: Purchasing power in region
- REGIOTYP: Neighborhood typology
It seems that data features related to age and geography are particularly likely to have a high ratio of missing values.
columns_to_drop = ratio_missing[ratio_missing['ratio']>0.27]
columns_to_drop
attribute | missing | ratio | |
---|---|---|---|
0 | AGER_TYP | 782581.0 | 0.722690 |
11 | GEBURTSJAHR | 485342.0 | 0.448198 |
40 | TITEL_KZ | 1078401.0 | 0.995870 |
43 | ALTER_HH | 379014.0 | 0.350008 |
47 | KK_KUNDENTYP | 696549.0 | 0.643242 |
64 | KBA05_BAUMAX | 586059.0 | 0.541208 |
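To confirm that the 27% cutoff behaves as intended, it is worth checking that CAMEO_INTL_2015 sits below the threshold and is therefore not in the drop list; a quick check:

# CAMEO_INTL_2015 should fall below the 0.27 cutoff and stay in the data.
print(ratio_missing.loc[ratio_missing['attribute'] == 'CAMEO_INTL_2015', 'ratio'])
assert 'CAMEO_INTL_2015' not in columns_to_drop['attribute'].values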
print(d.shape)
d = d.drop(columns=columns_to_drop['attribute'].unique())
print(d.shape)
(1082873, 85)
(1082873, 79)
2.1.3 Assess Missing Data in Each Row
Label rows with values in fewer than 50 features as low completeness, and the rest as high completeness.
d['features with data'] = d.count(axis='columns')
d.loc[d['features with data']>=50, 'completeness'] = 'high'
d['completeness'] = d['completeness'].fillna('low')
print(d['completeness'].value_counts())
high 939786
low 143087
Name: completeness, dtype: int64
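For reference, the same labeling can be written in a single step with numpy; a minimal sketch:

# Label rows 'high' when at least 50 features are non-null, otherwise 'low'.
d['completeness'] = np.where(d['features with data'] >= 50, 'high', 'low')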
plt.figure(figsize=(12,6))
d['features with data'].hist()
plt.xlabel('Features with Values')
plt.ylabel('Row Count')
for spine in plt.gca().spines.values():
    spine.set_visible(False)
plt.vlines(50, 0, 250000, 'red')
plt.ylim(0,249999);
Helper function to plot distributions of codes for several features.
def plot_feature_distribution(feature):
    plt.figure(figsize=(12,4))
    plt.subplot(1, 2, 1)
    ax1 = sns.countplot(d[d['completeness']=='low'][feature].fillna('None'))
    ax1.set_title('Low Completeness')
    for spine in ax1.spines.values():
        spine.set_visible(False)
    plt.subplot(1, 2, 2)
    ax2 = sns.countplot(d[d['completeness']=='high'][feature].fillna('None'))
    ax2.set_title('High Completeness')
    for spine in ax2.spines.values():
        spine.set_visible(False)
I picked out five features that are nearly complete across the dataset and examined the distributions of their values for the two groups of rows labeled above: high completeness and low completeness.
for feature in ['ONLINE_AFFINITAET', 'ZABEOTYP', 'SEMIO_TRADV',
                'FINANZ_HAUSBAUER', 'FINANZ_MINIMALIST']:
    plot_feature_distribution(feature)
I note that the rows with relatively low completeness appear to have markedly different value distributions than the rows with high completeness.
I flag this for follow-up later; for now, the plan is to remove any rows with missing data once the remaining processing steps are complete.
d = d.drop(columns=['completeness',
'features with data'])
2.2 Select and Re-Encode Features
The dataset includes a few different types of features.
f['type'].value_counts()
ordinal 49
categorical 21
numeric 7
mixed 7
interval 1
Name: type, dtype: int64
- Numeric and Interval data can be kept without changes.
- I assume that the Ordinal data all have linear spacing and therefore can be treated the same as Interval data: kept without changes.
- The Categorical and Mixed data may require special handling.
2.2.1 Re-Encode Categorical Features
Categorical data typically needs to be encoded as dummy variables, but the right approach varies depending on the number of categories in the feature.
- Binary (2-level) categoricals with non-numeric values: encode as numbers.
- Binary (2-level) categoricals with numeric values: no re-encoding required.
- Multi-level (3+) categoricals with non-numeric values: one-hot encode.
- Multi-level (3+) categoricals with numeric values: one-hot encode.
categorical_features = set(f[f['type']=='categorical']['attribute'])
remaining_cols_in_data = set(d.columns)
categorical_features = categorical_features.intersection(remaining_cols_in_data)
for feature in categorical_features:
    f.loc[f['attribute']==feature, 'dtype'] = d[feature].dtype
    f.loc[f['attribute']==feature, 'unique values'] = len(d[feature].unique())
(f[f['attribute'].isin(categorical_features)]
[['attribute','type','dtype','unique values']]
.reset_index(drop=True))
attribute | type | dtype | unique values | |
---|---|---|---|---|
0 | ANREDE_KZ | categorical | float64 | 2.0 |
1 | CJT_GESAMTTYP | categorical | float64 | 7.0 |
2 | FINANZTYP | categorical | float64 | 6.0 |
3 | GFK_URLAUBERTYP | categorical | float64 | 13.0 |
4 | GREEN_AVANTGARDE | categorical | float64 | 2.0 |
5 | LP_FAMILIE_FEIN | categorical | float64 | 12.0 |
6 | LP_FAMILIE_GROB | categorical | float64 | 6.0 |
7 | LP_STATUS_FEIN | categorical | float64 | 11.0 |
8 | LP_STATUS_GROB | categorical | float64 | 6.0 |
9 | NATIONALITAET_KZ | categorical | float64 | 4.0 |
10 | SHOPPER_TYP | categorical | float64 | 5.0 |
11 | SOHO_KZ | categorical | float64 | 3.0 |
12 | VERS_TYP | categorical | float64 | 3.0 |
13 | ZABEOTYP | categorical | float64 | 6.0 |
14 | GEBAEUDETYP | categorical | float64 | 8.0 |
15 | OST_WEST_KZ | categorical | object | 3.0 |
16 | CAMEO_DEUG_2015 | categorical | object | 10.0 |
17 | CAMEO_DEU_2015 | categorical | object | 45.0 |
One-hot encode the categorical features. For simplicity, the binary numeric categoricals are passed through get_dummies as well, which simply expands each of them into two indicator columns.
print(d.shape)
d = pd.get_dummies(d, columns=categorical_features)
print(d.shape)
(1082873, 79)
(1082873, 199)
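To make the effect of one-hot encoding concrete, here is a toy example on a single non-numeric categorical; note that the dtype of the indicator columns (0/1 integers versus booleans) depends on the pandas version:

toy = pd.DataFrame({'OST_WEST_KZ': ['O', 'W', 'W', 'O']})
print(pd.get_dummies(toy, columns=['OST_WEST_KZ']))
# Produces one indicator column per level, named OST_WEST_KZ_O and OST_WEST_KZ_W.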
2.2.2 Engineer Mixed-Type Features
The mixed-type features require additional engineering.
f[f['type']=='mixed']
attribute | information_level | type | missing_or_unknown | dtype | unique values | |
---|---|---|---|---|---|---|
15 | LP_LEBENSPHASE_FEIN | person | mixed | [0] | NaN | NaN |
16 | LP_LEBENSPHASE_GROB | person | mixed | [0] | NaN | NaN |
22 | PRAEGENDE_JUGENDJAHRE | person | mixed | [-1,0] | NaN | NaN |
56 | WOHNLAGE | building | mixed | [-1] | NaN | NaN |
59 | CAMEO_INTL_2015 | microcell_rr4 | mixed | [-1,XX] | NaN | NaN |
64 | KBA05_BAUMAX | microcell_rr3 | mixed | [-1,0] | NaN | NaN |
79 | PLZ8_BAUMAX | macrocell_plz8 | mixed | [-1,0] | NaN | NaN |
Because they combine several inconsistent category schemes, five of these seven features are more trouble than they are worth and will be dropped.
for drop_col in ['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'WOHNLAGE',
                 'PLZ8_BAUMAX', 'KBA05_BAUMAX']:
    if drop_col in d.columns:
        d = d.drop(columns=[drop_col])
PRAEGENDE_JUGENDJAHRE
The following is from reference info provided by Udacity in the file Data_Dictionary.md:
“1.18. PRAEGENDE_JUGENDJAHRE
Dominating movement of person’s youth (avantgarde vs. mainstream; east vs. west)
- -1: unknown
- 0: unknown
- 1: 40s - war years (Mainstream, E+W)
- 2: 40s - reconstruction years (Avantgarde, E+W)
- 3: 50s - economic miracle (Mainstream, E+W)
- 4: 50s - milk bar / Individualisation (Avantgarde, E+W)
- 5: 60s - economic miracle (Mainstream, E+W)
- 6: 60s - generation 68 / student protestors (Avantgarde, W)
- 7: 60s - opponents to the building of the Wall (Avantgarde, E)
- 8: 70s - family orientation (Mainstream, E+W)
- 9: 70s - peace movement (Avantgarde, E+W)
- 10: 80s - Generation Golf (Mainstream, W)
- 11: 80s - ecological awareness (Avantgarde, W)
- 12: 80s - FDJ / communist party youth organisation (Mainstream, E)
- 13: 80s - Swords into ploughshares (Avantgarde, E)
- 14: 90s - digital media kids (Mainstream, E+W)
- 15: 90s - ecological awareness (Avantgarde, E+W)”
Enough information is available to separate out a generation-by-decade field, PRAEGENDE_JUGENDJAHRE_decade, and a movement field, PRAEGENDE_JUGENDJAHRE_movement.
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([1,2]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 1
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([3,4]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 2
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([5,6,7]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 3
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([8,9]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 4
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([10,11,12,13]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 5
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([14,15]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 6
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([1,3,5,8,10,12,14]),
'PRAEGENDE_JUGENDJAHRE_movement'] = 0
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([2,4,6,7,9,11,13,15]),
'PRAEGENDE_JUGENDJAHRE_movement'] = 1
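An equivalent way to express this mapping, which keeps the lookup tables in one place, is a pair of dictionaries passed to Series.map; a minimal sketch (rows with unknown codes remain NaN either way):

decade_map = {1: 1, 2: 1,                    # 40s
              3: 2, 4: 2,                    # 50s
              5: 3, 6: 3, 7: 3,              # 60s
              8: 4, 9: 4,                    # 70s
              10: 5, 11: 5, 12: 5, 13: 5,    # 80s
              14: 6, 15: 6}                  # 90s
avantgarde_codes = {2, 4, 6, 7, 9, 11, 13, 15}
movement_map = {code: int(code in avantgarde_codes) for code in range(1, 16)}

d['PRAEGENDE_JUGENDJAHRE_decade'] = d['PRAEGENDE_JUGENDJAHRE'].map(decade_map)
d['PRAEGENDE_JUGENDJAHRE_movement'] = d['PRAEGENDE_JUGENDJAHRE'].map(movement_map)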
CAMEO_INTL_2015
The following is from reference info provided by Udacity in the file Data_Dictionary.md:
“4.3. CAMEO_INTL_2015
German CAMEO: Wealth / Life Stage Typology, mapped to international code
- -1: unknown
- 11: Wealthy Households - Pre-Family Couples & Singles
- 12: Wealthy Households - Young Couples With Children
- 13: Wealthy Households - Families With School Age Children
- 14: Wealthy Households - Older Families & Mature Couples
- 15: Wealthy Households - Elders In Retirement
- 21: Prosperous Households - Pre-Family Couples & Singles
- 22: Prosperous Households - Young Couples With Children
- 23: Prosperous Households - Families With School Age Children
- 24: Prosperous Households - Older Families & Mature Couples
- 25: Prosperous Households - Elders In Retirement
- 31: Comfortable Households - Pre-Family Couples & Singles
- 32: Comfortable Households - Young Couples With Children
- 33: Comfortable Households - Families With School Age Children
- 34: Comfortable Households - Older Families & Mature Couples
- 35: Comfortable Households - Elders In Retirement
- 41: Less Affluent Households - Pre-Family Couples & Singles
- 42: Less Affluent Households - Young Couples With Children
- 43: Less Affluent Households - Families With School Age Children
- 44: Less Affluent Households - Older Families & Mature Couples
- 45: Less Affluent Households - Elders In Retirement
- 51: Poorer Households - Pre-Family Couples & Singles
- 52: Poorer Households - Young Couples With Children
- 53: Poorer Households - Families With School Age Children
- 54: Poorer Households - Older Families & Mature Couples
- 55: Poorer Households - Elders In Retirement
- XX: unknown”
The tens and ones digits can be split into two different features: a wealth field, CAMEO_INTL_2015_wealth, and a life stage field, CAMEO_INTL_2015_life_stage.
# Slice the two-character code: the tens digit is wealth, the ones digit is
# life stage. NaN values become the string 'nan' under astype(str), so the
# slices 'n' and 'an' are mapped back to NaN before converting to float.
d['CAMEO_INTL_2015_wealth'] = (d['CAMEO_INTL_2015']
                               .astype(str).str[:1]
                               .replace('n',np.nan)
                               .astype(float))
d['CAMEO_INTL_2015_life_stage'] = (d['CAMEO_INTL_2015']
                                   .astype(str).str[1:]
                                   .replace('an',np.nan)
                                   .astype(float))
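Equivalently, the split can be done arithmetically after coercing the code to a number, which avoids slicing the string 'nan'; a minimal sketch:

# Coerce the two-digit code to a number; unknown values become NaN.
cameo = pd.to_numeric(d['CAMEO_INTL_2015'], errors='coerce')
# Tens digit is the wealth scale, ones digit is the life stage.
d['CAMEO_INTL_2015_wealth'] = cameo // 10
d['CAMEO_INTL_2015_life_stage'] = cameo % 10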
d = d.drop(columns=['PRAEGENDE_JUGENDJAHRE','CAMEO_INTL_2015'])
print(d.shape)
d = d.dropna(how='any').reset_index(drop=True)
print(d.shape)
(1082873, 197)
(738854, 197)
2.3 Convert Processing Steps to a Function
The following helper function combines all the preprocessing steps created above into a single function that takes a dataframe as a parameter and returns a cleaned dataframe. Note that it relies on the feature summary dataframe f loaded earlier.
def preprocess(dataframe):
    d = dataframe.copy()
    # Convert Missing Value Codes to NaNs
    missing_vals = [-1, 0, 9, 'X', 'XX']
    missing = f[['attribute','missing_or_unknown']].copy()
    for col in missing_vals:
        missing.loc[missing['missing_or_unknown'].str.contains(str(col)), col] = col
    for attr in missing['attribute']:
        if attr in d.columns:
            if d[attr].dtype in [float, int]:
                d[attr] = d[attr].astype(float)
                for col in missing_vals[:3]:
                    missing_val = missing[missing['attribute']==attr][col].values[0]
                    d[attr] = d[attr].replace(missing_val, np.nan, regex=False)
            else:
                for col in missing_vals[3:]:
                    missing_val = missing[missing['attribute']==attr][col].values[0]
                    d[attr] = d[attr].replace(missing_val, np.nan, regex=False)
    # Address Missing Data in Each Column
    ratio_missing = (d
                     .head(1)
                     .T
                     .reset_index()
                     .rename(columns={'index' : 'attribute',
                                      0 : 'missing'}))
    ratio_missing['missing'] = np.nan
    for col in d.columns:
        ratio_missing.loc[ratio_missing['attribute']==col, 'missing'] = d[col].isna().sum()
    ratio_missing['ratio'] = ratio_missing['missing'] / d.shape[0]
    columns_to_drop = ratio_missing[ratio_missing['ratio']>0.27]
    d = d.drop(columns=columns_to_drop['attribute'].unique())
    # Re-Encode Categorical Features
    categorical_features = set(f[f['type']=='categorical']['attribute'])
    remaining_cols_in_data = set(d.columns)
    categorical_features = categorical_features.intersection(remaining_cols_in_data)
    d = pd.get_dummies(d, columns=categorical_features)
    # Engineer or Drop Mixed-Type Features
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([1,2]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 1
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([3,4]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 2
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([5,6,7]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 3
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([8,9]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 4
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([10,11,12,13]),
          'PRAEGENDE_JUGENDJAHRE_decade'] = 5
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([14,15]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 6
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([1,3,5,8,10,12,14]),
          'PRAEGENDE_JUGENDJAHRE_movement'] = 0
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([2,4,6,7,9,11,13,15]),
          'PRAEGENDE_JUGENDJAHRE_movement'] = 1
    d['CAMEO_INTL_2015_wealth'] = (d['CAMEO_INTL_2015']
                                   .astype(str).str[:1]
                                   .replace('n',np.nan)
                                   .astype(float))
    d['CAMEO_INTL_2015_life_stage'] = (d['CAMEO_INTL_2015']
                                       .astype(str).str[1:]
                                       .replace('an',np.nan)
                                       .astype(float))
    for drop_col in ['PRAEGENDE_JUGENDJAHRE','CAMEO_INTL_2015',
                     'LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB',
                     'WOHNLAGE', 'PLZ8_BAUMAX', 'KBA05_BAUMAX']:
        if drop_col in d.columns:
            d = d.drop(columns=[drop_col])
    # Address Missing Data in Each Row
    d = d.dropna(how='any').reset_index(drop=True)
    return d
3. Write Processed Data to File
First, preprocess the demographics and customer data as outlined above.
print(a.shape)
a = preprocess(a)
print(a.shape)
(891221, 85)
(623211, 197)
print(c.shape)
c = preprocess(c)
print(c.shape)
(191652, 85)
(133377, 179)
I note that the demographics and customer datasets had different columns filtered out due to different data completeness levels in the two files. To handle this, I keep just the features that are present in both files.
print(a.shape)
a = a[c.columns].copy()
print(a.shape)
(623211, 197)
(623211, 179)
print(c.shape)
(133377, 179)
assert list(a.columns) == list(c.columns)
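A more defensive version of this step, which does not assume that the customer file’s columns are a strict subset of the demographics file’s, would take the intersection explicitly; a minimal sketch:

# Keep only the features present in both processed datasets, preserving order.
c_cols = set(c.columns)
common_cols = [col for col in a.columns if col in c_cols]
a = a[common_cols].copy()
c = c[common_cols].copy()
assert list(a.columns) == list(c.columns)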
a.to_csv('/home/ryan/large-files/datasets/customer-segments/processed/a_processed.csv')
c.to_csv('/home/ryan/large-files/datasets/customer-segments/processed/c_processed.csv')