Using PCA on Cars Dataset

15 Aug 2020

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt

The dataset for this example consists of 387 cars described by 18 features. Seven of these features (Sports, SUV, Wagon, Minivan, Pickup, AWD, and RWD) are 0/1 dummy variables.

c = pd.read_csv('pca-on-cars-dataset/cars.csv')
c.info()
c.head().T

<class 'pandas.core.frame.DataFrame'>
Index: 387 entries, Acura 3.5 RL to Volvo XC90 T6
Data columns (total 18 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Sports      387 non-null    int64  
 1   SUV         387 non-null    int64  
 2   Wagon       387 non-null    int64  
 3   Minivan     387 non-null    int64  
 4   Pickup      387 non-null    int64  
 5   AWD         387 non-null    int64  
 6   RWD         387 non-null    int64  
 7   Retail      387 non-null    int64  
 8   Dealer      387 non-null    int64  
 9   Engine      387 non-null    float64
 10  Cylinders   387 non-null    int64  
 11  Horsepower  387 non-null    int64  
 12  CityMPG     387 non-null    int64  
 13  HighwayMPG  387 non-null    int64  
 14  Weight      387 non-null    int64  
 15  Wheelbase   387 non-null    int64  
 16  Length      387 non-null    int64  
 17  Width       387 non-null    int64  
dtypes: float64(1), int64(17)
memory usage: 57.4+ KB

	Acura 3.5 RL	Acura 3.5 RL Navigation	Acura MDX	Acura NSX S	Acura RSX
Sports	0.0	0.0	0.0	1.0	0.0
SUV	0.0	0.0	1.0	0.0	0.0
Wagon	0.0	0.0	0.0	0.0	0.0
Minivan	0.0	0.0	0.0	0.0	0.0
Pickup	0.0	0.0	0.0	0.0	0.0
AWD	0.0	0.0	1.0	0.0	0.0
RWD	0.0	0.0	0.0	1.0	0.0
Retail	43755.0	46100.0	36945.0	89765.0	23820.0
Dealer	39014.0	41100.0	33337.0	79978.0	21761.0
Engine	3.5	3.5	3.5	3.2	2.0
Cylinders	6.0	6.0	6.0	6.0	4.0
Horsepower	225.0	225.0	265.0	290.0	200.0
CityMPG	18.0	18.0	17.0	17.0	24.0
HighwayMPG	24.0	24.0	23.0	24.0	31.0
Weight	3880.0	3893.0	4451.0	3153.0	2778.0
Wheelbase	115.0	115.0	106.0	100.0	101.0
Length	197.0	197.0	189.0	174.0	172.0
Width	72.0	72.0	77.0	71.0	68.0

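The first helper function standardizes the features with StandardScaler and then fits the PCA. It returns both the fitted PCA object and the transformed data.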

def do_pca(n_components, data):
    X = StandardScaler().fit_transform(data)
    pca = PCA(n_components)
    X_pca = pca.fit_transform(X)
    
    return pca, X_pca
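To see why the scaling step in do_pca matters, here is a minimal sketch on made-up data (the `price` and `mpg` columns are hypothetical stand-ins, not the cars dataset). Without standardization, the feature with the largest units dominates the first component. After standardization, two independent features share the variance roughly evenly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical toy data: two independent features on very different scales
price = rng.normal(30000, 10000, size=200)  # large units, like a retail price
mpg = rng.normal(25, 5, size=200)           # small units, like fuel economy
toy = np.column_stack([price, mpg])

# PCA on the raw data: the large-unit feature dominates the first component
raw_ratio = PCA(n_components=2).fit(toy).explained_variance_ratio_

# PCA on standardized data: each feature has mean 0 and unit variance first
scaled = StandardScaler().fit_transform(toy)
scaled_ratio = PCA(n_components=2).fit(scaled).explained_variance_ratio_

print('raw   :', np.round(raw_ratio, 4))
print('scaled:', np.round(scaled_ratio, 4))
```

In the cars data, Retail and Dealer are in the tens of thousands while Engine is a single-digit number, so the same effect would likely apply there if the scaling were skipped.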

The second helper function creates a DataFrame of the PCA results, including the feature weights and explained variance of each dimension, and optionally visualizes them as a bar plot.

def pca_results(full_dataset, pca, plot=True):
    '''
    Create a DataFrame of the PCA results
    Includes dimension feature weights and explained variance
    Visualizes the PCA results
    '''

    # Dimension indexing
    dimensions = ['Dimension {}'.format(i)
                  for i in range(1, len(pca.components_) + 1)]

    # PCA components
    components = pd.DataFrame(np.round(pca.components_, 4), columns = full_dataset.keys())
    components.index = dimensions

    # PCA explained variance
    ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1)
    variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance'])
    variance_ratios.index = dimensions

    if plot:
        # Create a bar plot visualization
        fig, ax = plt.subplots(figsize = (14,8))

        # Plot the feature weights as a function of the components
        components.plot(ax = ax, kind = 'bar')
        ax.set_ylabel("Feature Weights")
        ax.set_xticklabels(dimensions, rotation=0)

        # Display the explained variance ratios
        for i, ev in enumerate(pca.explained_variance_ratio_):
            ax.text(i-0.40, ax.get_ylim()[1] + 0.05, 
                    "Explained Variance\n          %.4f"%(ev))

    # Return a concatenated DataFrame
    return pd.concat([variance_ratios, components], axis = 1)

For context, as outlined in my other notes on PCA:

Principal components span the directions of maximum variability.
Principal components are always orthogonal to one another.
Eigenvalues tell us the amount of variance (information) each principal component captures.
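These properties can be checked numerically. The following is a small sketch on synthetic data (the `toy` array is made up for illustration, not the cars dataset). It confirms that the rows of `pca.components_` are orthonormal, and that `pca.explained_variance_` matches the eigenvalues of the sample covariance matrix of the standardized data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical toy data with some correlated columns
toy = rng.normal(size=(300, 4))
toy[:, 1] += toy[:, 0]
toy[:, 3] += 0.5 * toy[:, 2]

X = StandardScaler().fit_transform(toy)
pca = PCA().fit(X)

# Orthogonality: the Gram matrix of the components is the identity
gram = pca.components_ @ pca.components_.T
print(np.allclose(gram, np.eye(4)))

# Eigenvalues of the sample covariance matrix, sorted largest first,
# match the variance explained by each component
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print(np.allclose(eigenvalues, pca.explained_variance_))
```

Dividing each eigenvalue by the sum of all eigenvalues gives the explained variance ratio reported in the tables in this post.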

First, consider the first 3 principal components:

pca, X_pca = do_pca(3, c)
pca_results(c, pca).T

	Dimension 1	Dimension 2	Dimension 3
Explained Variance	0.4352	0.1667	0.1034
Sports	-0.0343	0.4420	0.0875
SUV	-0.1298	-0.2261	0.4898
Wagon	0.0289	-0.0106	0.0496
Minivan	-0.0481	-0.2074	-0.2818
Pickup	0.0000	-0.0000	0.0000
AWD	-0.0928	-0.1447	0.5506
RWD	-0.1175	0.3751	-0.2416
Retail	-0.2592	0.3447	0.0154
Dealer	-0.2576	0.3453	0.0132
Engine	-0.3396	0.0022	-0.0489
Cylinders	-0.3263	0.0799	-0.0648
Horsepower	-0.3118	0.2342	0.0040
CityMPG	0.3063	0.0169	-0.1421
HighwayMPG	0.3061	0.0433	-0.2486
Weight	-0.3317	-0.1832	0.0851
Wheelbase	-0.2546	-0.3066	-0.2846
Length	-0.2414	-0.2701	-0.3361
Width	-0.2886	-0.2163	-0.1369

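[Figure: bar plot of the feature weights for each of the three dimensions, annotated with the explained variance of each dimension]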
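The Pickup row has a weight of zero in every dimension. That is what one would expect if the Pickup column holds the same value for every car. This is an assumption based on the output, not something checked here. The sketch below uses made-up data to show the mechanism: StandardScaler maps a constant column to all zeros, and PCA then assigns it no weight.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical toy data: four features, the third one constant
toy = rng.normal(size=(100, 4))
toy[:, 2] = 0.0

# A zero-variance column is left as all zeros after standardization
X = StandardScaler().fit_transform(toy)
pca = PCA(n_components=3).fit(X)

# Weights of the constant feature in each of the three components
constant_weights = pca.components_[:, 2]
print(np.round(constant_weights, 4))
```

A column like this carries no variance, so it could be dropped before running PCA without changing the other weights.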

print('Number of Components :   Variance Explained')
for component_count in range(1,17):
    pca, X_pca = do_pca(component_count, c)
    results = pca_results(c, pca, False)
    print('{:20} :   {:2.1%}'.format(component_count,
                                     results['Explained Variance'].sum()))

Number of Components :   Variance Explained
                   1 :   43.5%
                   2 :   60.2%
                   3 :   70.5%
                   4 :   76.8%
                   5 :   82.4%
                   6 :   86.8%
                   7 :   89.7%
                   8 :   92.5%
                   9 :   95.0%
                  10 :   96.8%
                  11 :   97.7%
                  12 :   98.5%
                  13 :   99.0%
                  14 :   99.5%
                  15 :   99.8%
                  16 :   100.0%

Note that the first two components alone account for about 60% of the variance in the dataset, and nine components are enough to reach roughly 95%.
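The loop above refits the PCA for every component count. As an alternative sketch, the same cumulative figures can usually be read from a single fit with all components, by taking the cumulative sum of `explained_variance_ratio_`. The helper names `cumulative_variance` and `components_needed` are hypothetical and introduced only for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def cumulative_variance(data):
    '''Fit one full PCA on standardized data and return the cumulative
    explained variance ratio for 1, 2, ... components.'''
    X = StandardScaler().fit_transform(data)
    pca = PCA().fit(X)
    return np.cumsum(pca.explained_variance_ratio_)

def components_needed(cumulative, threshold):
    '''Smallest number of components whose cumulative explained variance
    reaches the threshold (a fraction below 1).'''
    return int(np.searchsorted(cumulative, threshold) + 1)

# Cumulative values copied from the printed table above (first 9 rows)
printed = np.array([0.435, 0.602, 0.705, 0.768, 0.824,
                    0.868, 0.897, 0.925, 0.950])
print(components_needed(printed, 0.60))  # 2, the first two components
print(components_needed(printed, 0.90))  # 8, matching the table
```

On the cars DataFrame this would be called as `cumulative_variance(c)`. Thresholds that sit right at a rounded table value, such as 95%, are best checked against the unrounded output.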

This content is taken from notes I took while pursuing the Intro to Machine Learning with PyTorch nanodegree certification.