Unsupervised Learning to Identify Customer Segments

In this project, I apply unsupervised machine learning techniques to better understand the core customer base of a mail-order sales company in Germany.

The project involved segmenting large, high-dimensional demographics data on the German population. By comparing it with similar data for the company's customers, I determined which population segments are disproportionately represented among those customers. These insights could be used to direct marketing campaigns toward the audiences most likely to respond.

The result of the project was a clearer understanding of the company's typical customer. Specifically, the analysis showed that the company's customer base is disproportionately financially thrifty, relatively old, dutiful, religious, and traditional.

The data used was provided by Udacity partner Bertelsmann Arvato Analytics.


This is page 1 of 2 for the project, in which I:

Load and Preprocess the Data

The specific tasks break down as follows:

  • 1: Load the Data
  • 2: Preprocessing
    • 2.1: Assess Missing Data
      • 2.1.1: Convert Missing Value Codes to NaNs
      • 2.1.2: Assess Missing Data in Each Column
      • Discussion 2.1.2: Assess Missing Data in Each Column
      • 2.1.3: Assess Missing Data in Each Row
      • Discussion 2.1.3: Assess Missing Data in Each Row
    • 2.2: Select and Re-Encode Features
      • 2.2.1: Re-Encode Categorical Features
      • 2.2.2: Engineer Mixed-Type Features
      • 2.2.3: Complete Feature Selection
    • 2.3: Convert Processing Steps to a Function
  • 3: Write Processed Data to File

1. Load the Data

There are four files associated with this project:

  • Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891,221 persons (rows) x 85 features (columns).
  • Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191,652 persons (rows) x 85 features (columns).
  • Data_Dictionary.md: Detailed information file about the features in the provided datasets. This file contains written explanations of the data included in the other files.
  • AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Demographics Data

a_path = '/home/ryan/large-files/datasets/customer-segments/Udacity_AZDIAS_Subset.csv'
a = pd.read_csv(a_path,
                sep=';')
print(a.shape)
a.head()
(891221, 85)

AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 -1 2 1 2.0 3 4 3 5 5 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 -1 1 2 5.0 1 5 2 5 4 5 ... 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0
2 -1 3 2 3.0 1 4 1 2 3 5 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0
3 2 4 2 2.0 4 2 5 2 1 2 ... 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
4 -1 3 1 5.0 4 3 4 1 3 2 ... 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0

5 rows × 85 columns

Customers Data

c_path = '/home/ryan/large-files/datasets/customer-segments/Udacity_CUSTOMERS_Subset.csv'
c = pd.read_csv(c_path,
                sep=';')
print(c.shape)
c.head()
(191652, 85)

AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 2 4 1 5.0 5 1 5 1 2 2 ... 3.0 3.0 1.0 0.0 1.0 5.0 5.0 1.0 2.0 1.0
1 -1 4 1 NaN 5 1 5 1 3 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 -1 4 2 2.0 5 1 5 1 4 4 ... 2.0 3.0 3.0 1.0 3.0 3.0 2.0 3.0 5.0 3.0
3 1 4 1 2.0 5 1 5 2 1 2 ... 3.0 2.0 1.0 0.0 1.0 3.0 4.0 1.0 3.0 1.0
4 -1 3 1 6.0 3 1 4 4 5 2 ... 2.0 4.0 2.0 1.0 2.0 3.0 3.0 3.0 5.0 1.0

5 rows × 85 columns

Feature Summary

f_path = '/home/ryan/large-files/datasets/customer-segments/AZDIAS_Feature_Summary.csv'
f = pd.read_csv(f_path,
                sep=';')
print(f.shape)
f.head()
(85, 4)

attribute information_level type missing_or_unknown
0 AGER_TYP person categorical [-1,0]
1 ALTERSKATEGORIE_GROB person ordinal [-1,0,9]
2 ANREDE_KZ person categorical [-1,0]
3 CJT_GESAMTTYP person categorical [0]
4 FINANZ_MINIMALIST person ordinal [-1]

2. Preprocessing

To establish the steps of the cleaning process, I use a combined dataset d, containing the demographics (a) and customer (c) datasets concatenated together.

d = (pd
     .concat([a, c])  # DataFrame.append is deprecated; concat is equivalent here
     .reset_index(drop=True))
print(d.shape)
(1082873, 85)
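As a sanity check, 891,221 demographics rows plus 191,652 customer rows gives the expected 1,082,873 combined rows.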

2.1 Assess Missing Data

2.1.1 Convert Missing Value Codes to NaNs

The feature summary dataset, f, has a missing_or_unknown column listing, per feature, the values that indicate missing or unknown data.

# Build a lookup table: one column per possible missing-value code, filled
# only where that code appears in the feature's missing_or_unknown string
missing_vals = [-1, 0, 9, 'X', 'XX']
missing = f[['attribute','missing_or_unknown']].copy()
for col in missing_vals:
    missing.loc[missing['missing_or_unknown'].str.contains(str(col)), col] = col
missing.head()

attribute missing_or_unknown -1 0 9 X XX
0 AGER_TYP [-1,0] -1.0 0.0 NaN NaN NaN
1 ALTERSKATEGORIE_GROB [-1,0,9] -1.0 0.0 9.0 NaN NaN
2 ANREDE_KZ [-1,0] -1.0 0.0 NaN NaN NaN
3 CJT_GESAMTTYP [0] NaN 0.0 NaN NaN NaN
4 FINANZ_MINIMALIST [-1] -1.0 NaN NaN NaN NaN

The following cell iterates through all the columns in the combined dataset d, replacing values with NaN where indicated by the missing dataframe above. Note that the columns in the demographics and customer datasets are identical and match the rows in the missing dataframe.

for attr in missing['attribute']:
    if d[attr].dtype in [float, int]:
        # Numeric column: check the numeric missing codes (-1, 0, 9)
        d[attr] = d[attr].astype(float)
        for col in missing_vals[:3]:
            missing_val = missing[missing['attribute']==attr][col].values[0]
            d[attr] = d[attr].replace(missing_val, np.nan, regex=False)
    else:
        # Object column: check the string missing codes ('X', 'XX')
        for col in missing_vals[3:]:
            missing_val = missing[missing['attribute']==attr][col].values[0]
            d[attr] = d[attr].replace(missing_val, np.nan, regex=False)
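For reference, the missing_or_unknown strings could also be parsed directly into Python lists, rather than matched code by code. A minimal sketch, where parse_missing_codes is a hypothetical helper assuming the '[-1,XX]'-style format shown in f above:

# Hypothetical helper: parse '[-1,0]' or '[-1,XX]' into a list of codes,
# numeric where possible, keeping string codes like 'X' and 'XX' as-is
def parse_missing_codes(s):
    codes = []
    for token in s.strip('[]').split(','):
        token = token.strip()
        if not token:
            continue  # handles an empty list '[]'
        try:
            codes.append(float(token))
        except ValueError:
            codes.append(token)
    return codes

parse_missing_codes('[-1,XX]')  # -> [-1.0, 'XX']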

2.1.2 Assess Missing Data in Each Column

ratio_missing = (d.isna().sum()
                 .astype(float)
                 .rename('missing')
                 .rename_axis('attribute')
                 .reset_index())
ratio_missing['ratio'] = ratio_missing['missing'] / d.shape[0]
ratio_missing.head()

attribute missing ratio
0 AGER_TYP 782581.0 0.722690
1 ALTERSKATEGORIE_GROB 3121.0 0.002882
2 ANREDE_KZ 0.0 0.000000
3 CJT_GESAMTTYP 8067.0 0.007450
4 FINANZ_MINIMALIST 0.0 0.000000
plt.figure(figsize=(8,24))
plt.barh(ratio_missing['attribute'],
         ratio_missing['ratio'],
         zorder=3)

plt.xlabel('Fraction of Missing Values')
plt.title('Fraction of Missing Values Across Combined Customer and Demographic Datasets')
plt.ylim([-1,85])
plt.tick_params(
    axis='y',
    left=False)
plt.tick_params(
    axis='x',
    bottom=False)
plt.grid(True,
         axis='x',
         zorder=0)

for spine in plt.gca().spines.values():
    spine.set_visible(False)

[Figure: horizontal bar chart of the fraction of missing values for each feature across the combined dataset]

Since most columns have less than 27% of their values missing, drop the fields that exceed that threshold. This threshold is chosen specifically to ensure the CAMEO_INTL_2015 field is preserved in both the customer and demographic datasets. The dropped fields are:

  • AGER_TYP: Best-ager typology
  • GEBURTSJAHR: Year of birth
  • TITEL_KZ: Academic title flag
  • ALTER_HH: Birthdate of head of household
  • KK_KUNDENTYP: Consumer pattern over past 12 months
  • KBA05_BAUMAX: Most common building type within the microcell

It seems that features related to age in particular are likely to have a high ratio of missing values.

columns_to_drop = ratio_missing[ratio_missing['ratio']>0.27]
columns_to_drop

attribute missing ratio
0 AGER_TYP 782581.0 0.722690
11 GEBURTSJAHR 485342.0 0.448198
40 TITEL_KZ 1078401.0 0.995870
43 ALTER_HH 379014.0 0.350008
47 KK_KUNDENTYP 696549.0 0.643242
64 KBA05_BAUMAX 586059.0 0.541208
print(d.shape)
d = d.drop(columns=columns_to_drop['attribute'].unique())
print(d.shape)
(1082873, 85)
(1082873, 79)

2.1.3 Assess Missing Data in Each Row

Label rows with fewer than 50 populated features as low completeness, otherwise high.

d['features with data'] = d.count(axis='columns')
d.loc[d['features with data']>=50, 'completeness'] = 'high'
d['completeness'] = d['completeness'].fillna('low')
print(d['completeness'].value_counts())
high    939786
low     143087
Name: completeness, dtype: int64
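In other words, roughly 13% of rows (143,087 of 1,082,873) fall below the completeness threshold.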
plt.figure(figsize=(12,6))
d['features with data'].hist()
plt.xlabel('Features with Values')
plt.ylabel('Row Count')
for spine in plt.gca().spines.values():
    spine.set_visible(False)
plt.vlines(50, 0, 250000, 'red')
plt.ylim(0,249999);

[Figure: histogram of populated-feature counts per row; the red line marks the threshold of 50]

The following helper function plots the distribution of codes for a given feature, split by completeness label.

def plot_feature_distribution(feature):
    """Plot the distribution of codes for one feature, split by completeness label."""
    plt.figure(figsize=(12,4))

    plt.subplot(1, 2, 1)
    ax1 = sns.countplot(x=d[d['completeness']=='low'][feature].fillna('None'))
    ax1.set_title('Low Completeness')
    for spine in ax1.spines.values():
        spine.set_visible(False)

    plt.subplot(1, 2, 2)
    ax2 = sns.countplot(x=d[d['completeness']=='high'][feature].fillna('None'))
    ax2.set_title('High Completeness')
    for spine in ax2.spines.values():
        spine.set_visible(False)

I picked out five features that are nearly complete across the dataset and examined how their value distributions differ between the two row labels assigned above: high completeness and low completeness.

for feature in ['ONLINE_AFFINITAET', 'ZABEOTYP', 'SEMIO_TRADV', 
                'FINANZ_HAUSBAUER', 'FINANZ_MINIMALIST']:
    plot_feature_distribution(feature)

[Figures: paired count plots (low vs. high completeness) for each of the five features]

I note that the rows with low completeness have markedly different value distributions from the rows with high completeness.

I flag this for follow-up later; for now, the plan is to remove any rows with missing data once the rest of the processing is complete.

d = d.drop(columns=['completeness',
                    'features with data'])

2.2 Select and Re-Encode Features

The dataset includes a few different types of features.

f['type'].value_counts()
ordinal        49
categorical    21
numeric         7
mixed           7
interval        1
Name: type, dtype: int64
  • Numeric and Interval data can be kept without changes.
  • I assume that the Ordinal data all have linear spacing and therefore can be treated the same as Interval data: kept without changes.
  • The Categorical and Mixed data may require special handling; the sketch below tallies how many such columns remain.
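Under these assumptions, the feature summary f can be used to count how many of the remaining columns need attention. A minimal sketch:

# Count remaining columns whose declared type needs special handling
special = f[f['type'].isin(['categorical', 'mixed'])]['attribute']
special_in_d = [col for col in special if col in d.columns]
print(len(special_in_d), 'remaining columns need re-encoding or engineering')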

2.2.1 Re-Encode Categorical Features

Categorical data typically needs to be encoded as dummy variables, but the right approach varies depending on the number of categories in the feature.

  • Binary (2-level) categoricals with non-numeric values - encode as numbers.
  • Binary (2-level) categoricals with numeric values - no re-encoding required.
  • Multi-level (3+) categoricals with non-numeric values - one-hot encode.
  • Multi-level (3+) categoricals with numeric values - one-hot encode.
categorical_features = set(f[f['type']=='categorical']['attribute'])
remaining_cols_in_data = set(d.columns)
categorical_features = categorical_features.intersection(remaining_cols_in_data)
# Record each categorical feature's dtype and cardinality (incl. NaN) in f
for feature in categorical_features:
    f.loc[f['attribute']==feature, 'dtype'] = d[feature].dtype
    f.loc[f['attribute']==feature, 'unique values'] = len(d[feature].unique())
(f[f['attribute'].isin(categorical_features)]
 [['attribute','type','dtype','unique values']]
 .reset_index(drop=True))

attribute type dtype unique values
0 ANREDE_KZ categorical float64 2.0
1 CJT_GESAMTTYP categorical float64 7.0
2 FINANZTYP categorical float64 6.0
3 GFK_URLAUBERTYP categorical float64 13.0
4 GREEN_AVANTGARDE categorical float64 2.0
5 LP_FAMILIE_FEIN categorical float64 12.0
6 LP_FAMILIE_GROB categorical float64 6.0
7 LP_STATUS_FEIN categorical float64 11.0
8 LP_STATUS_GROB categorical float64 6.0
9 NATIONALITAET_KZ categorical float64 4.0
10 SHOPPER_TYP categorical float64 5.0
11 SOHO_KZ categorical float64 3.0
12 VERS_TYP categorical float64 3.0
13 ZABEOTYP categorical float64 6.0
14 GEBAEUDETYP categorical float64 8.0
15 OST_WEST_KZ categorical object 3.0
16 CAMEO_DEUG_2015 categorical object 10.0
17 CAMEO_DEU_2015 categorical object 45.0

One-hot encode the categorical features. For simplicity, all categoricals are one-hot encoded here, including the binary ones.

print(d.shape)
d = pd.get_dummies(d, columns=categorical_features)
print(d.shape)
(1082873, 79)
(1082873, 199)
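For reference, a binary non-numeric categorical such as OST_WEST_KZ could instead have been re-encoded in place, per the strategy above. A minimal sketch, assuming its two codes are 'O' and 'W', with d_raw as a hypothetical stand-in for the dataframe before get_dummies:

# Illustrative only: map a two-level non-numeric flag to 0.0/1.0
# (d_raw is hypothetical here; unmapped values become NaN)
d_raw['OST_WEST_KZ'] = d_raw['OST_WEST_KZ'].map({'O': 0.0, 'W': 1.0})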

2.2.2 Engineer Mixed-Type Features

The mixed features require additional engineering.

f[f['type']=='mixed']

attribute information_level type missing_or_unknown dtype unique values
15 LP_LEBENSPHASE_FEIN person mixed [0] NaN NaN
16 LP_LEBENSPHASE_GROB person mixed [0] NaN NaN
22 PRAEGENDE_JUGENDJAHRE person mixed [-1,0] NaN NaN
56 WOHNLAGE building mixed [-1] NaN NaN
59 CAMEO_INTL_2015 microcell_rr4 mixed [-1,XX] NaN NaN
64 KBA05_BAUMAX microcell_rr3 mixed [-1,0] NaN NaN
79 PLZ8_BAUMAX macrocell_plz8 mixed [-1,0] NaN NaN

With several inconsistent category schemes each, five of these seven features are more trouble than they are worth and will be dropped. (KBA05_BAUMAX already fell out in the column-level missing-data step, hence the membership check below.)

for drop_col in ['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'WOHNLAGE',
                 'PLZ8_BAUMAX', 'KBA05_BAUMAX']:
    if drop_col in d.columns:
        d = d.drop(columns=[drop_col])

PRAEGENDE_JUGENDJAHRE

The following is from reference info provided by Udacity in the file Data_Dictionary.md:

“1.18. PRAEGENDE_JUGENDJAHRE

Dominating movement of person’s youth (avantgarde vs. mainstream; east vs. west)

  • -1: unknown
  • 0: unknown
  • 1: 40s - war years (Mainstream, E+W)
  • 2: 40s - reconstruction years (Avantgarde, E+W)
  • 3: 50s - economic miracle (Mainstream, E+W)
  • 4: 50s - milk bar / Individualisation (Avantgarde, E+W)
  • 5: 60s - economic miracle (Mainstream, E+W)
  • 6: 60s - generation 68 / student protestors (Avantgarde, W)
  • 7: 60s - opponents to the building of the Wall (Avantgarde, E)
  • 8: 70s - family orientation (Mainstream, E+W)
  • 9: 70s - peace movement (Avantgarde, E+W)
  • 10: 80s - Generation Golf (Mainstream, W)
  • 11: 80s - ecological awareness (Avantgarde, W)
  • 12: 80s - FDJ / communist party youth organisation (Mainstream, E)
  • 13: 80s - Swords into ploughshares (Avantgarde, E)
  • 14: 90s - digital media kids (Mainstream, E+W)
  • 15: 90s - ecological awareness (Avantgarde, E+W)”

Enough information is available to separate out a generation by decade field, PRAEGENDE_JUGENDJAHRE_decade, and a movement field, PRAEGENDE_JUGENDJAHRE_movement.

# Decade of youth: 1 = 40s, 2 = 50s, 3 = 60s, 4 = 70s, 5 = 80s, 6 = 90s
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([1,2]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 1
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([3,4]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 2
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([5,6,7]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 3
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([8,9]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 4
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([10,11,12,13]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 5
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([14,15]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 6

# Movement: 0 = Mainstream, 1 = Avantgarde
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([1,3,5,8,10,12,14]), 
      'PRAEGENDE_JUGENDJAHRE_movement'] = 0
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([2,4,6,7,9,11,13,15]), 
      'PRAEGENDE_JUGENDJAHRE_movement'] = 1
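The same groupings can be expressed more compactly with Series.map; a sketch equivalent to the .loc assignments above:

# Equivalent sketch: dict lookups for the decade and movement groupings;
# codes absent from the dicts (i.e. NaN) stay NaN under .map
decade_map = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 3, 8: 4, 9: 4,
              10: 5, 11: 5, 12: 5, 13: 5, 14: 6, 15: 6}
movement_map = {code: (0 if code in (1, 3, 5, 8, 10, 12, 14) else 1)
                for code in range(1, 16)}
# d['PRAEGENDE_JUGENDJAHRE_decade'] = d['PRAEGENDE_JUGENDJAHRE'].map(decade_map)
# d['PRAEGENDE_JUGENDJAHRE_movement'] = d['PRAEGENDE_JUGENDJAHRE'].map(movement_map)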

CAMEO_INTL_2015

The following is from reference info provided by Udacity in the file Data_Dictionary.md:

“4.3. CAMEO_INTL_2015

German CAMEO: Wealth / Life Stage Typology, mapped to international code

  • -1: unknown
  • 11: Wealthy Households - Pre-Family Couples & Singles
  • 12: Wealthy Households - Young Couples With Children
  • 13: Wealthy Households - Families With School Age Children
  • 14: Wealthy Households - Older Families & Mature Couples
  • 15: Wealthy Households - Elders In Retirement
  • 21: Prosperous Households - Pre-Family Couples & Singles
  • 22: Prosperous Households - Young Couples With Children
  • 23: Prosperous Households - Families With School Age Children
  • 24: Prosperous Households - Older Families & Mature Couples
  • 25: Prosperous Households - Elders In Retirement
  • 31: Comfortable Households - Pre-Family Couples & Singles
  • 32: Comfortable Households - Young Couples With Children
  • 33: Comfortable Households - Families With School Age Children
  • 34: Comfortable Households - Older Families & Mature Couples
  • 35: Comfortable Households - Elders In Retirement
  • 41: Less Affluent Households - Pre-Family Couples & Singles
  • 42: Less Affluent Households - Young Couples With Children
  • 43: Less Affluent Households - Families With School Age Children
  • 44: Less Affluent Households - Older Families & Mature Couples
  • 45: Less Affluent Households - Elders In Retirement
  • 51: Poorer Households - Pre-Family Couples & Singles
  • 52: Poorer Households - Young Couples With Children
  • 53: Poorer Households - Families With School Age Children
  • 54: Poorer Households - Older Families & Mature Couples
  • 55: Poorer Households - Elders In Retirement
  • XX: unknown”

The tens and ones places can be split into two separate features: a wealth field, CAMEO_INTL_2015_wealth (the tens digit), and a life stage field, CAMEO_INTL_2015_life_stage (the ones digit).

# After astype(str), NaN becomes the string 'nan', so the leading character
# is 'n' and the remainder is 'an'; map those artifacts back to NaN
d['CAMEO_INTL_2015_wealth'] = (d['CAMEO_INTL_2015']
                               .astype(str).str[:1]
                               .replace('n',np.nan)
                               .astype(float))
d['CAMEO_INTL_2015_life_stage'] = (d['CAMEO_INTL_2015']
                                   .astype(str).str[1:]
                                   .replace('an',np.nan)
                                   .astype(float))
d = d.drop(columns=['PRAEGENDE_JUGENDJAHRE','CAMEO_INTL_2015'])
print(d.shape)
d = d.dropna(how='any').reset_index(drop=True)
print(d.shape)
(1082873, 197)
(738854, 197)
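As an aside, the same split can be done numerically rather than through strings. A sketch, with cameo_raw standing in for the CAMEO_INTL_2015 column before it was dropped:

# Numeric alternative: tens digit = wealth, ones digit = life stage;
# errors='coerce' turns any non-numeric leftovers into NaN, which propagates
cameo = pd.to_numeric(cameo_raw, errors='coerce')
wealth = cameo // 10
life_stage = cameo % 10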

2.3 Convert Processing Steps to a Function

The following helper function combines all the preprocessing steps created above into a single function that takes a dataframe as a parameter and returns a cleaned dataframe.

def preprocess(dataframe):
    # Note: relies on the feature summary dataframe f loaded in section 1
    d = dataframe.copy()
    
    # Convert Missing Value Codes to NaNs
    missing_vals = [-1, 0, 9, 'X', 'XX']
    missing = f[['attribute','missing_or_unknown']].copy()
    for col in missing_vals:
        missing.loc[missing['missing_or_unknown'].str.contains(str(col)), col] = col
    
    for attr in missing['attribute']:
        if attr in d.columns:
            if d[attr].dtype in [float, int]:
                d[attr] = d[attr].astype(float)
                for col in missing_vals[:3]:
                    missing_val = missing[missing['attribute']==attr][col].values[0]
                    d[attr] = d[attr].replace(missing_val, np.nan, regex=False)
            else:
                for col in missing_vals[3:]:
                    missing_val = missing[missing['attribute']==attr][col].values[0]
                    d[attr] = d[attr].replace(missing_val, np.nan, regex=False)
    
    # Address Missing Data in Each Column
    ratio_missing = (d.isna().sum()
                     .astype(float)
                     .rename('missing')
                     .rename_axis('attribute')
                     .reset_index())
    ratio_missing['ratio'] = ratio_missing['missing'] / d.shape[0]
    columns_to_drop = ratio_missing[ratio_missing['ratio']>0.27]
    d = d.drop(columns=columns_to_drop['attribute'].unique())
    
    # Re-Encode Categorical Features
    categorical_features = set(f[f['type']=='categorical']['attribute'])
    remaining_cols_in_data = set(d.columns)
    categorical_features = categorical_features.intersection(remaining_cols_in_data)
    d = pd.get_dummies(d, columns=categorical_features)

    # Engineer or Drop Mixed-Type Features
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([1,2]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 1
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([3,4]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 2
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([5,6,7]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 3
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([8,9]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 4
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([10,11,12,13]), 
          'PRAEGENDE_JUGENDJAHRE_decade'] = 5
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([14,15]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 6
    
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([1,3,5,8,10,12,14]), 
          'PRAEGENDE_JUGENDJAHRE_movement'] = 0
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([2,4,6,7,9,11,13,15]), 
          'PRAEGENDE_JUGENDJAHRE_movement'] = 1
    
    d['CAMEO_INTL_2015_wealth'] = (d['CAMEO_INTL_2015']
                                   .astype(str).str[:1]
                                   .replace('n',np.nan)
                                   .astype(float))
    d['CAMEO_INTL_2015_life_stage'] = (d['CAMEO_INTL_2015']
                                       .astype(str).str[1:]
                                       .replace('an',np.nan)
                                       .astype(float))
    
    for drop_col in ['PRAEGENDE_JUGENDJAHRE','CAMEO_INTL_2015', 
                     'LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 
                     'WOHNLAGE', 'PLZ8_BAUMAX', 'KBA05_BAUMAX']:
        if drop_col in d.columns:
            d = d.drop(columns=[drop_col])
    
    # Address Missing Data in Each Row
    d = d.dropna(how='any').reset_index(drop=True)
    
    return d

3. Write Processed Data to File

First, preprocess the demographics and customer data as outlined above.

print(a.shape)
a = preprocess(a)
print(a.shape)
(891221, 85)
(623211, 197)
print(c.shape)
c = preprocess(c)
print(c.shape)
(191652, 85)
(133377, 179)

I note that the demographics and customer datasets ended up with different columns filtered out, because data completeness differs between the two files. To handle this, I keep just the features that are present in both.

print(a.shape)
a = a[c.columns].copy()
print(a.shape)
(623211, 197)
(623211, 179)
print(c.shape)
(133377, 179)
assert list(a.columns) == list(c.columns)
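A slightly more defensive version of the same alignment uses the column-index intersection, which works even when neither column set is a subset of the other; a sketch:

# Sketch: keep only the columns common to both processed datasets
common_cols = a.columns.intersection(c.columns)
a, c = a[common_cols].copy(), c[common_cols].copy()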
# index=False avoids writing a spurious unnamed index column
a.to_csv('/home/ryan/large-files/datasets/customer-segments/processed/a_processed.csv',
         index=False)
c.to_csv('/home/ryan/large-files/datasets/customer-segments/processed/c_processed.csv',
         index=False)