Box Plots

A box plot (also called a box-and-whisker plot) shows aggregate statistics for several samples in a concise way. For each sample, it simultaneously shows:

  • the median,
  • the minimum and maximum, and
  • the interquartile range.

The following code block draws three samples of 10,000 values each with numpy: one from a normal distribution, one from a uniform distribution on [0, 1) (np.random.random), and one from a gamma distribution.

%matplotlib notebook

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

normal_sample = np.random.normal(loc=0.0, scale=1.0, size=10000)
random_sample = np.random.random(size=10000)
gamma_sample = np.random.gamma(2, size=10000)

df = pd.DataFrame({'normal': normal_sample, 
                   'random': random_sample, 
                   'gamma': gamma_sample})

df.describe()

             normal        random         gamma
count  10000.000000  10000.000000  10000.000000
mean       0.005574      0.503947      1.980398
std        1.003427      0.289082      1.400785
min       -4.013316      0.000017      0.014506
25%       -0.666291      0.252406      0.947084
50%       -0.004336      0.507529      1.671177
75%        0.684656      0.751771      2.687132
max        3.474749      0.999931     12.756334

The summary statistics shown above divide the sorted data into four quarters. The first quarter lies between the minimum and the 25% mark, which is called the first quartile. The second and third quarters lie between the first quartile and the 75% mark, which is called the third quartile; the median (the 50% mark) separates them. The final quarter lies between the third quartile and the maximum.

The interquartile range (IQR) is the span between the first and third quartiles, so it covers the middle 50% of the data.
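As a small illustration of these definitions, the quartiles and the interquartile range can be computed directly with np.percentile. The toy array below is made up for this example and is not one of the samples above; np.percentile interpolates linearly between data points by default.

```python
import numpy as np

# Hypothetical toy data (not from the samples above): the integers 1..9
data = np.arange(1, 10)

# 25th, 50th and 75th percentiles: first quartile, median, third quartile
q1, median, q3 = np.percentile(data, [25, 50, 75])

# The interquartile range is the width of the middle 50% of the data
iqr = q3 - q1

print(q1, median, q3, iqr)  # 3.0 5.0 7.0 4.0
```

These are the same quantities that df.describe() reports in its 25%, 50%, and 75% rows.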

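In the box plots shown below:

  • the interquartile range is the boxed region,
  • the maximum and minimum values are the horizontal lines (caps) at the ends of the whiskers, and
  • the median is the colored line inside the boxed region (red in older Matplotlib versions, orange by default since Matplotlib 2.0).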

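It is also possible to specify more limited ranges for the whiskers. This is controlled by the whis parameter of plt.boxplot. Passing whis=(0, 100) places the whiskers at the 0th and 100th percentiles, that is, at the minimum and maximum of the data. Older Matplotlib releases accepted the string 'range' for the same purpose, but that value was deprecated in Matplotlib 3.2 and is no longer accepted by current releases.

plt.figure()

# whis=(0, 100): whiskers span the full range of each sample
plt.boxplot([ df['normal'], 
              df['random'], 
              df['gamma'] ], 
            whis=(0, 100));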

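If the whis parameter is left out, it defaults to 1.5: the top whisker reaches to the last datum no greater than Q3 + 1.5*IQR, and the bottom whisker reaches to the first datum no less than Q1 - 1.5*IQR, where Q1 and Q3 are the first and third quartiles and IQR is the interquartile range.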
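The default whisker rule can be reproduced by hand. The following is a minimal sketch on a made-up array (not one of the samples above); it mirrors the rule as described here and is not claimed to match Matplotlib's internal implementation line for line.

```python
import numpy as np

# Hypothetical toy data with one obviously extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences implied by the default whis=1.5
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points still inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences would be drawn as an individual point
outside = data[(data < lower_fence) | (data > upper_fence)]

print(lower_fence, upper_fence)   # -3.5 14.5
print(whisker_low, whisker_high)  # 1 9
print(outside)                    # [30]
```

Note that the whiskers stop at actual data values (1 and 9), not at the fences themselves.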

Plots like the one shown below are good for detecting outliers. The data points beyond the whiskers are drawn individually and are called “fliers.”

plt.figure()

plt.boxplot([ df['normal'], 
              df['random'], 
              df['gamma'] ]);
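The fliers can also be inspected numerically instead of visually. Here is a minimal sketch, assuming the helper matplotlib.cbook.boxplot_stats and freshly drawn, seeded samples rather than the df built earlier; the seed and the names samples and flier_counts are chosen only for this illustration.

```python
import numpy as np
from matplotlib import cbook

# Seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)
samples = {
    'normal': rng.normal(loc=0.0, scale=1.0, size=10000),
    'random': rng.random(size=10000),
    'gamma': rng.gamma(2, size=10000),
}

# boxplot_stats returns one dict per dataset; 'fliers' holds the points
# that fall beyond the whiskers for the given whis value
stats = cbook.boxplot_stats(list(samples.values()), whis=1.5)

flier_counts = {name: len(s['fliers']) for name, s in zip(samples, stats)}
print(flier_counts)
```

The uniform sample should report no fliers, because its fences lie well outside the interval [0, 1), while the long right tail of the gamma sample produces many.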