Visualizing with Pandas

25 Nov 2019

Pandas uses Matplotlib under the hood and provides convenient plotting functions on top of it. This built-in capability is particularly useful for quickly plotting data series during data exploration.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib notebook

Matplotlib comes with several built-in styles that change the look of our plots. Their names are listed in the plt.style.available attribute; the exact list depends on the Matplotlib version.

plt.style.available

['seaborn-dark',
 'seaborn-darkgrid',
 'seaborn-ticks',
 'fivethirtyeight',
 'seaborn-whitegrid',
 'classic',
 '_classic_test',
 'fast',
 'seaborn-talk',
 'seaborn-dark-palette',
 'seaborn-bright',
 'seaborn-pastel',
 'grayscale',
 'seaborn-notebook',
 'ggplot',
 'seaborn-colorblind',
 'seaborn-muted',
 'seaborn',
 'Solarize_Light2',
 'seaborn-paper',
 'bmh',
 'tableau-colorblind10',
 'seaborn-white',
 'dark_background',
 'seaborn-poster',
 'seaborn-deep']

The following sets the color palette to use RGB values similar to the Tableau colorblind palette, which favors blues and oranges.

plt.style.use('tableau-colorblind10')
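
plt.style.use changes the style for the rest of the session. To apply a style to a single plot only, Matplotlib also offers the plt.style.context context manager. The following is a minimal sketch, assuming the 'ggplot' style from the list above is available; the plotted numbers are made up for illustration.

```python
import matplotlib.pyplot as plt

# Record one setting before entering the context so we can compare later.
facecolor_before = plt.rcParams['axes.facecolor']

# The 'ggplot' style applies only inside the `with` block.
with plt.style.context('ggplot'):
    facecolor_inside = plt.rcParams['axes.facecolor']
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])  # illustrative data

# Outside the block the previous settings are restored.
facecolor_after = plt.rcParams['axes.facecolor']
plt.close(fig)
```

This keeps the session-wide style untouched, which helps when only one figure needs a different look.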

The following uses the NumPy cumsum method, which computes the cumulative sum of an array. The pandas date_range function sets the index to every day in 2017.

np.random.seed(123)

d = pd.DataFrame({'A': np.random.randn(365).cumsum(0), 
                  'B': np.random.randn(365).cumsum(0) + 20,
                  'C': np.random.randn(365).cumsum(0) - 20}, 
                 index=pd.date_range('1/1/2017', periods=365))
d.head()

	A	B	C
2017-01-01	-1.085631	20.059291	-20.230904
2017-01-02	-0.088285	21.803332	-16.659325
2017-01-03	0.194693	20.835588	-17.055481
2017-01-04	-1.311601	21.255156	-17.093802
2017-01-05	-1.890202	21.462083	-19.518638
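
To see what the two building blocks do on their own, here is a small sketch. The steps array is a made-up example, not part of the data above.

```python
import numpy as np
import pandas as pd

# cumsum returns the running total of the elements.
steps = np.array([1, -2, 3, 4])
walk = steps.cumsum()
print(walk)  # [ 1 -1  2  6]

# 365 daily timestamps starting on 1 January 2017 cover the whole year,
# because 2017 is not a leap year.
dates = pd.date_range('1/1/2017', periods=365)
print(dates[0].date(), dates[-1].date())  # 2017-01-01 2017-12-31
```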

d.plot();

Line plots are the default, but there are many kinds of plots that can be generated.

kind :

'line' : line plot (default)
'bar' : vertical bar plot
'barh' : horizontal bar plot
'hist' : histogram
'box' : boxplot
'kde' : Kernel Density Estimation plot
'density' : same as 'kde'
'area' : area plot
'pie' : pie plot
'scatter' : scatter plot
'hexbin' : hexbin plot

This can be changed using the kind parameter, or by calling the corresponding method on the plot accessor, such as d.plot.scatter. The following shows column B plotted against column A.

d.plot.scatter('A','B');
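
The two spellings are interchangeable. The sketch below draws the same scatter plot both ways, using a small made-up DataFrame (df is an illustrative name, not the d used above).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.standard_normal(50),
                   'B': rng.standard_normal(50)})

# Form 1: pass the plot type through the kind parameter.
ax1 = df.plot(kind='scatter', x='A', y='B')

# Form 2: call the matching method on the plot accessor.
ax2 = df.plot.scatter(x='A', y='B')
```

The accessor form has the advantage that each method documents its own parameters, so tab completion and help() are more useful.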

The following shows C plotted against A, but with the color, c, and size, s, varying with B. Additionally, the colormap is set to viridis.

d.plot.scatter('A', 'C', 
               c='B', 
               s=d['B'],
               colormap='viridis');

plot.scatter returns a Matplotlib Axes object, so we can modify it in the same way as if we were using Matplotlib directly. In the following, the plot is identical, except that the aspect ratio is locked to equal scaling, which makes it obvious to the viewer that the range of C is greater than the range of A.

ax = d.plot.scatter('A', 'C', 
                    c='B', 
                    s=d['B'],
                    colormap='viridis')
ax.set_aspect('equal')
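
Any other Axes method can be used in the same way. The sketch below checks the type of the returned object and then adds a title, an axis label, and a grid. The DataFrame and the label text are made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.axes

rng = np.random.default_rng(1)
df = pd.DataFrame({'A': rng.standard_normal(100).cumsum(),
                   'C': rng.standard_normal(100).cumsum() - 20})

ax = df.plot.scatter('A', 'C')

# The returned object is a regular Matplotlib Axes.
print(isinstance(ax, matplotlib.axes.Axes))  # True

# Standard Axes methods work on it.
ax.set_title('C against A')
ax.set_xlabel('A (cumulative sum)')
ax.grid(True)
```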

Other examples of plots created using built-in Pandas visualization are shown below. Examples include boxplots, histograms, and kernel density estimate plots.

d.plot.box();

d.plot.hist(alpha=0.7);
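
When overlapping histograms are hard to read, each column can be drawn in its own subplot instead. This is a sketch on made-up data shaped like the example above; the bin count, layout, and figure size are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({'A': rng.standard_normal(365).cumsum(),
                   'B': rng.standard_normal(365).cumsum() + 20,
                   'C': rng.standard_normal(365).cumsum() - 20})

# One histogram per column, side by side, each with 20 bins.
axes = df.plot.hist(subplots=True, layout=(1, 3), bins=20, figsize=(9, 3))
```

With subplots=True the call returns an array of Axes, one per column, rather than a single Axes.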

d.plot.kde();

Pandas also has plotting tools designed to help with visualizing large amounts of data or high dimensional data.

The iris dataset is a classic multivariate dataset that can be loaded with seaborn. It contains the sepal length, sepal width, petal length, and petal width for 150 samples, 50 from each of 3 species of the iris flower.

import seaborn as sns

iris = sns.load_dataset('iris')
iris.head()

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

A scatter matrix is a way of comparing each numeric column in a DataFrame to every other column in a pairwise fashion. It is a grid of subplots: the diagonal shows a histogram of each column, and the off-diagonal cells show scatter plots of one column against another.

pd.plotting.scatter_matrix(iris);
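
The same function works on any DataFrame with numeric columns. The sketch below uses a small synthetic DataFrame (so it runs without downloading the iris data) in which y depends on x and z is independent noise; the column names and parameter values are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x = rng.standard_normal(200)
df = pd.DataFrame({'x': x,
                   'y': 2 * x + rng.standard_normal(200),  # correlated with x
                   'z': rng.standard_normal(200)})         # independent noise

# diagonal='hist' draws histograms on the diagonal; 'kde' is the alternative.
axes = pd.plotting.scatter_matrix(df, alpha=0.5, figsize=(6, 6), diagonal='hist')

# One Axes per pair of columns.
print(axes.shape)  # (3, 3)
```

In the resulting grid, the x/y cells should show a clear linear trend while the cells involving z look like shapeless clouds.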

Pandas also includes a tool for creating parallel coordinates plots, which are a common way of visualizing high-dimensional, multivariate data. Each variable in the dataset corresponds to an equally-spaced, parallel, vertical line. The values of each variable are connected by lines between them, for each observation. Coloring the lines by class allows the viewer to quickly determine whether there are any clear patterns or clustering.

In the example below, it is clear that petal_length and petal_width both separate the species very well.

plt.figure()
pd.plotting.parallel_coordinates(iris, 'species');
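
All variables in a parallel coordinates plot share one y-axis, so a column with much larger values can flatten the others. One possible remedy, which is an addition here and not part of the course material, is to min-max scale each column to the range 0 to 1 before plotting. The sketch below does this on synthetic data; the column names and class labels are made up.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n = 30
df = pd.DataFrame({
    'length': np.concatenate([rng.normal(1.5, 0.2, n), rng.normal(5.0, 0.5, n)]),
    'width': np.concatenate([rng.normal(0.3, 0.1, n), rng.normal(1.8, 0.3, n)]),
    'noise': rng.normal(100.0, 10.0, 2 * n),  # large scale, unrelated to class
    'label': ['small'] * n + ['large'] * n,
})

# Min-max scale each feature so that all columns span 0 to 1.
features = ['length', 'width', 'noise']
scaled = df.copy()
span = df[features].max() - df[features].min()
scaled[features] = (df[features] - df[features].min()) / span

fig, ax = plt.subplots()
pd.plotting.parallel_coordinates(scaled, 'label', cols=features, ax=ax, alpha=0.5)
```

After scaling, the lines for the two classes should separate on length and width and cross freely on noise.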

These notes were taken from the Coursera course Applied Plotting, Charting & Data Representation in Python. The information is presented by Filip Jankovic, who at the time the course was created was a research assistant at the University of Michigan. He now works as a Data Scientist.