Unsupervised Learning to Identify Customer Segments
In this project, I apply unsupervised machine learning techniques to better understand the core customer base of a mail-order sales company in Germany.
The project involved segmenting large, high-dimensional data describing the German population. By comparing similar data for customers of the company, I determined which population segments are disproportionately represented among the company’s customers. These insights could be used to direct marketing campaigns towards the audiences most likely to be responsive.
The result of the project was a clearer understanding of the company’s typical customer. Specifically, the analysis showed that the company’s customer base is disproportionately financially thrifty, relatively old, dutiful, religious, and traditional.
The data used was provided by Udacity partner Bertelsmann Arvato Analytics.
This is page 1 of 2 for the project, where I load and preprocess the data.
The specific tasks break down as follows:
- 1: Load the Data
- 2: Preprocessing
- 2.1: Assess Missing Data
- 2.1.1: Convert Missing Value Codes to NaNs
- 2.1.2: Assess Missing Data in Each Column
- Discussion 2.1.2: Assess Missing Data in Each Column
- 2.1.3: Assess Missing Data in Each Row
- Discussion 2.1.3: Assess Missing Data in Each Row
- 2.2: Select and Re-Encode Features
- 2.2.1: Re-Encode Categorical Features
- 2.2.2: Engineer Mixed-Type Features
- 2.2.3: Complete Feature Selection
- 2.3: Convert Processing Steps to a Function
- 3: Write Processed Data to File
1. Load the Data
There are four files associated with this project:
- Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891,221 persons (rows) x 85 features (columns).
- Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191,652 persons (rows) x 85 features (columns).
- Data_Dictionary.md: Detailed information file about the features in the provided datasets. This file contains written explanations of the data included in the other files.
- AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Demographics Data
a_path = '/home/ryan/large-files/datasets/customer-segments/Udacity_AZDIAS_Subset.csv'
a = pd.read_csv(a_path,
sep=';')
print(a.shape)
a.head()
(891221, 85)
AGER_TYP | ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | ... | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -1 | 2 | 1 | 2.0 | 3 | 4 | 3 | 5 | 5 | 3 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | -1 | 1 | 2 | 5.0 | 1 | 5 | 2 | 5 | 4 | 5 | ... | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 5.0 | 4.0 | 3.0 | 5.0 | 4.0 |
2 | -1 | 3 | 2 | 3.0 | 1 | 4 | 1 | 2 | 3 | 5 | ... | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 4.0 | 4.0 | 3.0 | 5.0 | 2.0 |
3 | 2 | 4 | 2 | 2.0 | 4 | 2 | 5 | 2 | 1 | 2 | ... | 2.0 | 2.0 | 2.0 | 0.0 | 1.0 | 3.0 | 4.0 | 2.0 | 3.0 | 3.0 |
4 | -1 | 3 | 1 | 5.0 | 4 | 3 | 4 | 1 | 3 | 2 | ... | 2.0 | 4.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 | 6.0 | 5.0 |
5 rows × 85 columns
Customers Data
c_path = '/home/ryan/large-files/datasets/customer-segments/Udacity_CUSTOMERS_Subset.csv'
c = pd.read_csv(c_path,
sep=';')
print(c.shape)
c.head()
(191652, 85)
AGER_TYP | ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | ... | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 4 | 1 | 5.0 | 5 | 1 | 5 | 1 | 2 | 2 | ... | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 5.0 | 5.0 | 1.0 | 2.0 | 1.0 |
1 | -1 | 4 | 1 | NaN | 5 | 1 | 5 | 1 | 3 | 2 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | -1 | 4 | 2 | 2.0 | 5 | 1 | 5 | 1 | 4 | 4 | ... | 2.0 | 3.0 | 3.0 | 1.0 | 3.0 | 3.0 | 2.0 | 3.0 | 5.0 | 3.0 |
3 | 1 | 4 | 1 | 2.0 | 5 | 1 | 5 | 2 | 1 | 2 | ... | 3.0 | 2.0 | 1.0 | 0.0 | 1.0 | 3.0 | 4.0 | 1.0 | 3.0 | 1.0 |
4 | -1 | 3 | 1 | 6.0 | 3 | 1 | 4 | 4 | 5 | 2 | ... | 2.0 | 4.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 | 5.0 | 1.0 |
5 rows × 85 columns
Feature Summary
f_path = '/home/ryan/large-files/datasets/customer-segments/AZDIAS_Feature_Summary.csv'
f = pd.read_csv(f_path,
sep=';')
print(f.shape)
f.head()
(85, 4)
attribute | information_level | type | missing_or_unknown | |
---|---|---|---|---|
0 | AGER_TYP | person | categorical | [-1,0] |
1 | ALTERSKATEGORIE_GROB | person | ordinal | [-1,0,9] |
2 | ANREDE_KZ | person | categorical | [-1,0] |
3 | CJT_GESAMTTYP | person | categorical | [0] |
4 | FINANZ_MINIMALIST | person | ordinal | [-1] |
2. Preprocessing
For purposes of establishing the steps of the cleaning process, use a combined dataset d containing the customer (c) and demographics (a) datasets appended together.
# Stack the two datasets row-wise. (DataFrame.append was removed in pandas 2.0;
# pd.concat([a, c]) is the modern equivalent.)
d = (a
     .append(c)
     .reset_index(drop=True))
print(d.shape)
(1082873, 85)
2.1 Assess Missing Data
2.1.1 Convert Missing Value Codes to NaNs
The feature summary dataset, f, has a missing_or_unknown column listing the values that indicate the data is missing or unknown for each attribute.
missing_vals = [-1, 0, 9, 'X', 'XX']
missing = f[['attribute','missing_or_unknown']].copy()
# For each possible code, add a column that records the code for every
# attribute whose missing_or_unknown string mentions it.
for col in missing_vals:
    missing.loc[missing['missing_or_unknown'].str.contains(str(col)), col] = col
missing.head()
attribute | missing_or_unknown | -1 | 0 | 9 | X | XX | |
---|---|---|---|---|---|---|---|
0 | AGER_TYP | [-1,0] | -1.0 | 0.0 | NaN | NaN | NaN |
1 | ALTERSKATEGORIE_GROB | [-1,0,9] | -1.0 | 0.0 | 9.0 | NaN | NaN |
2 | ANREDE_KZ | [-1,0] | -1.0 | 0.0 | NaN | NaN | NaN |
3 | CJT_GESAMTTYP | [0] | NaN | 0.0 | NaN | NaN | NaN |
4 | FINANZ_MINIMALIST | [-1] | -1.0 | NaN | NaN | NaN | NaN |
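As an aside, the missing_or_unknown strings could also be parsed directly into Python lists rather than spread across per-code columns. Below is a minimal alternative sketch (the helper name parse_missing_codes is mine, not part of the project code), assuming every code is either an integer or one of the strings 'X'/'XX':

def parse_missing_codes(s):
    # Turn a string like '[-1,0]' or '[-1,XX]' into a list such as [-1, 0] or [-1, 'XX'].
    codes = []
    for token in s.strip('[]').split(','):
        token = token.strip()
        if not token:
            continue
        try:
            codes.append(int(token))
        except ValueError:
            codes.append(token)  # non-numeric codes such as 'X' or 'XX'
    return codes

# Example: a Series mapping each attribute to its list of missing codes.
missing_code_lists = f.set_index('attribute')['missing_or_unknown'].apply(parse_missing_codes)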
The following cell iterates through all the columns of the combined dataset d, replacing values with NaN where indicated by the missing dataframe above. Note that the columns of the demographics and customer datasets are identical and match the rows of the missing dataframe.
for attr in missing['attribute']:
    if d[attr].dtype in [float, int]:
        # Numeric columns: replace the numeric missing codes (-1, 0, 9)
        # flagged for this attribute.
        d[attr] = d[attr].astype(float)
        for col in missing_vals[:3]:
            missing_val = missing[missing['attribute']==attr][col].values[0]
            d[attr] = d[attr].replace(missing_val, np.nan, regex=False)
    else:
        # Object columns: replace the string missing codes ('X', 'XX')
        # flagged for this attribute.
        for col in missing_vals[3:]:
            missing_val = missing[missing['attribute']==attr][col].values[0]
            d[attr] = d[attr].replace(missing_val, np.nan, regex=False)
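A quick spot check that the conversion behaved as expected for one attribute (AGER_TYP, whose codes -1 and 0 should now be NaN):

# The codes -1 and 0 should no longer appear in AGER_TYP after conversion.
assert not d['AGER_TYP'].isin([-1, 0]).any()
print(d['AGER_TYP'].value_counts(dropna=False).head())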
2.1.2 Assess Missing Data in Each Column
# Start from a one-row transpose just to get a dataframe with one row per
# attribute, then overwrite the placeholder column with the count of NaNs.
ratio_missing = (d
                 .head(1)
                 .T
                 .reset_index()
                 .rename(columns={'index' : 'attribute',
                                  0 : 'missing'}))
ratio_missing['missing'] = np.nan
for col in d.columns:
    ratio_missing.loc[ratio_missing['attribute']==col, 'missing'] = d[col].isna().sum()
ratio_missing['ratio'] = ratio_missing['missing'] / d.shape[0]
ratio_missing.head()
attribute | missing | ratio | |
---|---|---|---|
0 | AGER_TYP | 782581.0 | 0.722690 |
1 | ALTERSKATEGORIE_GROB | 3121.0 | 0.002882 |
2 | ANREDE_KZ | 0.0 | 0.000000 |
3 | CJT_GESAMTTYP | 8067.0 | 0.007450 |
4 | FINANZ_MINIMALIST | 0.0 | 0.000000 |
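The same per-column counts and ratios can also be computed more concisely with pandas built-ins; a minimal equivalent sketch (the name ratio_check is mine):

# Count and fraction of NaNs per column, matching the table above.
ratio_check = (d.isna().sum()
                 .rename('missing')
                 .to_frame()
                 .assign(ratio=lambda t: t['missing'] / len(d))
                 .rename_axis('attribute')
                 .reset_index())
print(ratio_check.head())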
plt.figure(figsize=(8,24))
plt.barh(ratio_missing['attribute'],
ratio_missing['ratio'],
zorder=3)
plt.xlabel('Fraction of Missing Values')
plt.title('Fraction of Missing Values Across Combined Customer and Demographic Datasets')
plt.ylim([-1,85])
plt.tick_params(
axis='y',
left=False)
plt.tick_params(
axis='x',
bottom=False)
plt.grid(True,
axis='x',
zorder=0)
for spine in plt.gca().spines.values():
    spine.set_visible(False)
Since most columns have fewer than 27% missing values, drop the fields with more than 27% missing. This threshold is chosen specifically to ensure that the CAMEO_INTL_2015 field is preserved in both the customer and demographics datasets. The fields with notably high fractions of missing values include:
- AGER_TYP: Best-ager typology
- GEBURTSJAHR: Year of birth
- TITEL_KZ: Academic title flag
- ALTER_HH: Birthdate of head of household
- KK_KUNDENTYP: Consumer pattern over past 12 months
- KBA05_BAUMAX: Most common building type within the microcell
- KKK: Purchasing power in region
- REGIOTYP: Neighborhood typology
It seems that data features related to age and geography are particularly likely to have a high ratio of missing values.
columns_to_drop = ratio_missing[ratio_missing['ratio']>0.27]
columns_to_drop
attribute | missing | ratio | |
---|---|---|---|
0 | AGER_TYP | 782581.0 | 0.722690 |
11 | GEBURTSJAHR | 485342.0 | 0.448198 |
40 | TITEL_KZ | 1078401.0 | 0.995870 |
43 | ALTER_HH | 379014.0 | 0.350008 |
47 | KK_KUNDENTYP | 696549.0 | 0.643242 |
64 | KBA05_BAUMAX | 586059.0 | 0.541208 |
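To confirm that the 27% cutoff behaves as intended, it is worth checking that CAMEO_INTL_2015 sits below the threshold and is therefore not in the drop list; a quick check:

# CAMEO_INTL_2015 should fall below the 0.27 cutoff and stay in the data.
print(ratio_missing.loc[ratio_missing['attribute'] == 'CAMEO_INTL_2015', 'ratio'])
assert 'CAMEO_INTL_2015' not in columns_to_drop['attribute'].values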
print(d.shape)
d = d.drop(columns=columns_to_drop['attribute'].unique())
print(d.shape)
(1082873, 85)
(1082873, 79)
2.1.3 Assess Missing Data in Each Row
Label rows with values in fewer than 50 features as low completeness, and the rest as high completeness.
d['features with data'] = d.count(axis='columns')
d.loc[d['features with data']>=50, 'completeness'] = 'high'
d['completeness'] = d['completeness'].fillna('low')
print(d['completeness'].value_counts())
high 939786
low 143087
Name: completeness, dtype: int64
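For reference, the same labeling can be written in a single step with numpy; a minimal sketch:

# Label rows 'high' when at least 50 features are non-null, otherwise 'low'.
d['completeness'] = np.where(d['features with data'] >= 50, 'high', 'low')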
plt.figure(figsize=(12,6))
d['features with data'].hist()
plt.xlabel('Features with Values')
plt.ylabel('Row Count')
for spine in plt.gca().spines.values():
    spine.set_visible(False)
plt.vlines(50, 0, 250000, 'red')
plt.ylim(0,249999);
Helper function to plot distributions of codes for several features.
def plot_feature_distribution(feature):
    plt.figure(figsize=(12,4))
    plt.subplot(1, 2, 1)
    ax1 = sns.countplot(d[d['completeness']=='low'][feature].fillna('None'))
    ax1.set_title('Low Completeness')
    for spine in ax1.spines.values():
        spine.set_visible(False)
    plt.subplot(1, 2, 2)
    ax2 = sns.countplot(d[d['completeness']=='high'][feature].fillna('None'))
    ax2.set_title('High Completeness')
    for spine in ax2.spines.values():
        spine.set_visible(False)
I picked out five features that are nearly complete across the dataset and examined the distributions of their values for the two groups of rows labeled above: high completeness and low completeness.
for feature in ['ONLINE_AFFINITAET', 'ZABEOTYP', 'SEMIO_TRADV',
                'FINANZ_HAUSBAUER', 'FINANZ_MINIMALIST']:
    plot_feature_distribution(feature)
I note that the rows with relatively low completeness appear to have markedly different value distributions than the rows with high completeness.
I flag this for follow-up later; for now, the plan is to remove any rows with missing data once the remaining processing steps are complete.
d = d.drop(columns=['completeness',
'features with data'])
2.2 Select and Re-Encode Features
The dataset includes a few different types of features.
f['type'].value_counts()
ordinal 49
categorical 21
numeric 7
mixed 7
interval 1
Name: type, dtype: int64
- Numeric and Interval data can be kept without changes.
- I assume that the Ordinal data all have linear spacing and therefore can be treated the same as Interval data: kept without changes.
- The Categorical and Mixed data may require special handling.
2.2.1 Re-Encode Categorical Features
Categorical data typically needs to be encoded as dummy variables, but the right approach varies depending on the number of categories in the feature.
- Binary (2-level) categoricals with non-numeric values: encode as numbers.
- Binary (2-level) categoricals with numeric values: no re-encoding required.
- Multi-level (3+) categoricals with non-numeric values: one-hot encode.
- Multi-level (3+) categoricals with numeric values: one-hot encode.
categorical_features = set(f[f['type']=='categorical']['attribute'])
remaining_cols_in_data = set(d.columns)
categorical_features = categorical_features.intersection(remaining_cols_in_data)
for feature in categorical_features:
    f.loc[f['attribute']==feature, 'dtype'] = d[feature].dtype
    f.loc[f['attribute']==feature, 'unique values'] = len(d[feature].unique())
(f[f['attribute'].isin(categorical_features)]
[['attribute','type','dtype','unique values']]
.reset_index(drop=True))
attribute | type | dtype | unique values | |
---|---|---|---|---|
0 | ANREDE_KZ | categorical | float64 | 2.0 |
1 | CJT_GESAMTTYP | categorical | float64 | 7.0 |
2 | FINANZTYP | categorical | float64 | 6.0 |
3 | GFK_URLAUBERTYP | categorical | float64 | 13.0 |
4 | GREEN_AVANTGARDE | categorical | float64 | 2.0 |
5 | LP_FAMILIE_FEIN | categorical | float64 | 12.0 |
6 | LP_FAMILIE_GROB | categorical | float64 | 6.0 |
7 | LP_STATUS_FEIN | categorical | float64 | 11.0 |
8 | LP_STATUS_GROB | categorical | float64 | 6.0 |
9 | NATIONALITAET_KZ | categorical | float64 | 4.0 |
10 | SHOPPER_TYP | categorical | float64 | 5.0 |
11 | SOHO_KZ | categorical | float64 | 3.0 |
12 | VERS_TYP | categorical | float64 | 3.0 |
13 | ZABEOTYP | categorical | float64 | 6.0 |
14 | GEBAEUDETYP | categorical | float64 | 8.0 |
15 | OST_WEST_KZ | categorical | object | 3.0 |
16 | CAMEO_DEUG_2015 | categorical | object | 10.0 |
17 | CAMEO_DEU_2015 | categorical | object | 45.0 |
One-hot encode the categorical features. For simplicity, the binary numeric categoricals are passed through get_dummies as well, which simply expands each of them into two indicator columns.
print(d.shape)
d = pd.get_dummies(d, columns=categorical_features)
print(d.shape)
(1082873, 79)
(1082873, 199)
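To make the effect of one-hot encoding concrete, here is a toy example on a single non-numeric categorical; note that the dtype of the indicator columns (0/1 integers versus booleans) depends on the pandas version:

toy = pd.DataFrame({'OST_WEST_KZ': ['O', 'W', 'W', 'O']})
print(pd.get_dummies(toy, columns=['OST_WEST_KZ']))
# Produces one indicator column per level, named OST_WEST_KZ_O and OST_WEST_KZ_W.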
2.2.2 Engineer Mixed-Type Features
The mixed-type features require additional engineering.
f[f['type']=='mixed']
attribute | information_level | type | missing_or_unknown | dtype | unique values | |
---|---|---|---|---|---|---|
15 | LP_LEBENSPHASE_FEIN | person | mixed | [0] | NaN | NaN |
16 | LP_LEBENSPHASE_GROB | person | mixed | [0] | NaN | NaN |
22 | PRAEGENDE_JUGENDJAHRE | person | mixed | [-1,0] | NaN | NaN |
56 | WOHNLAGE | building | mixed | [-1] | NaN | NaN |
59 | CAMEO_INTL_2015 | microcell_rr4 | mixed | [-1,XX] | NaN | NaN |
64 | KBA05_BAUMAX | microcell_rr3 | mixed | [-1,0] | NaN | NaN |
79 | PLZ8_BAUMAX | macrocell_plz8 | mixed | [-1,0] | NaN | NaN |
Because they combine several inconsistent category schemes, five of these seven features are more trouble than they are worth and will be dropped.
for drop_col in ['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'WOHNLAGE',
                 'PLZ8_BAUMAX', 'KBA05_BAUMAX']:
    if drop_col in d.columns:
        d = d.drop(columns=[drop_col])
PRAEGENDE_JUGENDJAHRE
The following is from reference info provided by Udacity in the file Data_Dictionary.md:
“1.18. PRAEGENDE_JUGENDJAHRE
Dominating movement of person’s youth (avantgarde vs. mainstream; east vs. west)
- -1: unknown
- 0: unknown
- 1: 40s - war years (Mainstream, E+W)
- 2: 40s - reconstruction years (Avantgarde, E+W)
- 3: 50s - economic miracle (Mainstream, E+W)
- 4: 50s - milk bar / Individualisation (Avantgarde, E+W)
- 5: 60s - economic miracle (Mainstream, E+W)
- 6: 60s - generation 68 / student protestors (Avantgarde, W)
- 7: 60s - opponents to the building of the Wall (Avantgarde, E)
- 8: 70s - family orientation (Mainstream, E+W)
- 9: 70s - peace movement (Avantgarde, E+W)
- 10: 80s - Generation Golf (Mainstream, W)
- 11: 80s - ecological awareness (Avantgarde, W)
- 12: 80s - FDJ / communist party youth organisation (Mainstream, E)
- 13: 80s - Swords into ploughshares (Avantgarde, E)
- 14: 90s - digital media kids (Mainstream, E+W)
- 15: 90s - ecological awareness (Avantgarde, E+W)”
Enough information is available to separate out a generation-by-decade field, PRAEGENDE_JUGENDJAHRE_decade, and a movement field, PRAEGENDE_JUGENDJAHRE_movement.
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([1,2]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 1
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([3,4]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 2
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([5,6,7]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 3
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([8,9]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 4
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([10,11,12,13]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 5
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([14,15]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 6
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([1,3,5,8,10,12,14]),
'PRAEGENDE_JUGENDJAHRE_movement'] = 0
d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([2,4,6,7,9,11,13,15]),
'PRAEGENDE_JUGENDJAHRE_movement'] = 1
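An equivalent way to express this mapping, which keeps the lookup tables in one place, is a pair of dictionaries passed to Series.map; a minimal sketch (rows with unknown codes remain NaN either way):

decade_map = {1: 1, 2: 1,                    # 40s
              3: 2, 4: 2,                    # 50s
              5: 3, 6: 3, 7: 3,              # 60s
              8: 4, 9: 4,                    # 70s
              10: 5, 11: 5, 12: 5, 13: 5,    # 80s
              14: 6, 15: 6}                  # 90s
avantgarde_codes = {2, 4, 6, 7, 9, 11, 13, 15}
movement_map = {code: int(code in avantgarde_codes) for code in range(1, 16)}

d['PRAEGENDE_JUGENDJAHRE_decade'] = d['PRAEGENDE_JUGENDJAHRE'].map(decade_map)
d['PRAEGENDE_JUGENDJAHRE_movement'] = d['PRAEGENDE_JUGENDJAHRE'].map(movement_map)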
CAMEO_INTL_2015
The following is from reference info provided by Udacity in the file Data_Dictionary.md:
“4.3. CAMEO_INTL_2015
German CAMEO: Wealth / Life Stage Typology, mapped to international code
- -1: unknown
- 11: Wealthy Households - Pre-Family Couples & Singles
- 12: Wealthy Households - Young Couples With Children
- 13: Wealthy Households - Families With School Age Children
- 14: Wealthy Households - Older Families & Mature Couples
- 15: Wealthy Households - Elders In Retirement
- 21: Prosperous Households - Pre-Family Couples & Singles
- 22: Prosperous Households - Young Couples With Children
- 23: Prosperous Households - Families With School Age Children
- 24: Prosperous Households - Older Families & Mature Couples
- 25: Prosperous Households - Elders In Retirement
- 31: Comfortable Households - Pre-Family Couples & Singles
- 32: Comfortable Households - Young Couples With Children
- 33: Comfortable Households - Families With School Age Children
- 34: Comfortable Households - Older Families & Mature Couples
- 35: Comfortable Households - Elders In Retirement
- 41: Less Affluent Households - Pre-Family Couples & Singles
- 42: Less Affluent Households - Young Couples With Children
- 43: Less Affluent Households - Families With School Age Children
- 44: Less Affluent Households - Older Families & Mature Couples
- 45: Less Affluent Households - Elders In Retirement
- 51: Poorer Households - Pre-Family Couples & Singles
- 52: Poorer Households - Young Couples With Children
- 53: Poorer Households - Families With School Age Children
- 54: Poorer Households - Older Families & Mature Couples
- 55: Poorer Households - Elders In Retirement
- XX: unknown”
The tens and ones digits can be split into two different features: a wealth field, CAMEO_INTL_2015_wealth, and a life stage field, CAMEO_INTL_2015_life_stage.
# Slice the two-character code: the tens digit is wealth, the ones digit is
# life stage. NaN values become the string 'nan' under astype(str), so the
# slices 'n' and 'an' are mapped back to NaN before converting to float.
d['CAMEO_INTL_2015_wealth'] = (d['CAMEO_INTL_2015']
                               .astype(str).str[:1]
                               .replace('n',np.nan)
                               .astype(float))
d['CAMEO_INTL_2015_life_stage'] = (d['CAMEO_INTL_2015']
                                   .astype(str).str[1:]
                                   .replace('an',np.nan)
                                   .astype(float))
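Equivalently, the split can be done arithmetically after coercing the code to a number, which avoids slicing the string 'nan'; a minimal sketch:

# Coerce the two-digit code to a number; unknown values become NaN.
cameo = pd.to_numeric(d['CAMEO_INTL_2015'], errors='coerce')
# Tens digit is the wealth scale, ones digit is the life stage.
d['CAMEO_INTL_2015_wealth'] = cameo // 10
d['CAMEO_INTL_2015_life_stage'] = cameo % 10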
d = d.drop(columns=['PRAEGENDE_JUGENDJAHRE','CAMEO_INTL_2015'])
print(d.shape)
d = d.dropna(how='any').reset_index(drop=True)
print(d.shape)
(1082873, 197)
(738854, 197)
2.3 Convert Processing Steps to a Function
The following helper function combines all the preprocessing steps created above into a single function that takes a dataframe as a parameter and returns a cleaned dataframe. Note that it relies on the feature summary dataframe f loaded earlier.
def preprocess(dataframe):
    d = dataframe.copy()
    # Convert Missing Value Codes to NaNs
    missing_vals = [-1, 0, 9, 'X', 'XX']
    missing = f[['attribute','missing_or_unknown']].copy()
    for col in missing_vals:
        missing.loc[missing['missing_or_unknown'].str.contains(str(col)), col] = col
    for attr in missing['attribute']:
        if attr in d.columns:
            if d[attr].dtype in [float, int]:
                d[attr] = d[attr].astype(float)
                for col in missing_vals[:3]:
                    missing_val = missing[missing['attribute']==attr][col].values[0]
                    d[attr] = d[attr].replace(missing_val, np.nan, regex=False)
            else:
                for col in missing_vals[3:]:
                    missing_val = missing[missing['attribute']==attr][col].values[0]
                    d[attr] = d[attr].replace(missing_val, np.nan, regex=False)
    # Address Missing Data in Each Column
    ratio_missing = (d
                     .head(1)
                     .T
                     .reset_index()
                     .rename(columns={'index' : 'attribute',
                                      0 : 'missing'}))
    ratio_missing['missing'] = np.nan
    for col in d.columns:
        ratio_missing.loc[ratio_missing['attribute']==col, 'missing'] = d[col].isna().sum()
    ratio_missing['ratio'] = ratio_missing['missing'] / d.shape[0]
    columns_to_drop = ratio_missing[ratio_missing['ratio']>0.27]
    d = d.drop(columns=columns_to_drop['attribute'].unique())
    # Re-Encode Categorical Features
    categorical_features = set(f[f['type']=='categorical']['attribute'])
    remaining_cols_in_data = set(d.columns)
    categorical_features = categorical_features.intersection(remaining_cols_in_data)
    d = pd.get_dummies(d, columns=categorical_features)
    # Engineer or Drop Mixed-Type Features
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([1,2]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 1
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([3,4]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 2
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([5,6,7]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 3
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([8,9]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 4
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([10,11,12,13]),
          'PRAEGENDE_JUGENDJAHRE_decade'] = 5
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([14,15]), 'PRAEGENDE_JUGENDJAHRE_decade'] = 6
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([1,3,5,8,10,12,14]),
          'PRAEGENDE_JUGENDJAHRE_movement'] = 0
    d.loc[d['PRAEGENDE_JUGENDJAHRE'].isin([2,4,6,7,9,11,13,15]),
          'PRAEGENDE_JUGENDJAHRE_movement'] = 1
    d['CAMEO_INTL_2015_wealth'] = (d['CAMEO_INTL_2015']
                                   .astype(str).str[:1]
                                   .replace('n',np.nan)
                                   .astype(float))
    d['CAMEO_INTL_2015_life_stage'] = (d['CAMEO_INTL_2015']
                                       .astype(str).str[1:]
                                       .replace('an',np.nan)
                                       .astype(float))
    for drop_col in ['PRAEGENDE_JUGENDJAHRE','CAMEO_INTL_2015',
                     'LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB',
                     'WOHNLAGE', 'PLZ8_BAUMAX', 'KBA05_BAUMAX']:
        if drop_col in d.columns:
            d = d.drop(columns=[drop_col])
    # Address Missing Data in Each Row
    d = d.dropna(how='any').reset_index(drop=True)
    return d
3. Write Processed Data to File
First, preprocess the demographics and customer data as outlined above.
print(a.shape)
a = preprocess(a)
print(a.shape)
(891221, 85)
(623211, 197)
print(c.shape)
c = preprocess(c)
print(c.shape)
(191652, 85)
(133377, 179)
I note that the demographics and customer datasets had different columns filtered out due to different data completeness levels in the two files. To handle this, I keep just the features that are present in both files.
print(a.shape)
a = a[c.columns].copy()
print(a.shape)
(623211, 197)
(623211, 179)
print(c.shape)
(133377, 179)
assert list(a.columns) == list(c.columns)
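A more defensive version of this step, which does not assume that the customer file’s columns are a strict subset of the demographics file’s, would take the intersection explicitly; a minimal sketch:

# Keep only the features present in both processed datasets, preserving order.
c_cols = set(c.columns)
common_cols = [col for col in a.columns if col in c_cols]
a = a[common_cols].copy()
c = c[common_cols].copy()
assert list(a.columns) == list(c.columns)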
a.to_csv('/home/ryan/large-files/datasets/customer-segments/processed/a_processed.csv')
c.to_csv('/home/ryan/large-files/datasets/customer-segments/processed/c_processed.csv')