# Box Plots

A **box plot** (or box-and-whisker plot) is a concise way of showing aggregate statistics for several samples side by side. For each sample, it simultaneously shows:

- the **median**,
- the **minimum** and **maximum**, and
- the **interquartile range**.

The following code block draws three samples with NumPy: one from a **normal** distribution, one from a **random** (uniform on [0, 1)) distribution, and one from a **gamma** distribution.

```
%matplotlib notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Draw 10,000 values from each of three distributions
normal_sample = np.random.normal(loc=0.0, scale=1.0, size=10000)
random_sample = np.random.random(size=10000)  # uniform on [0, 1)
gamma_sample = np.random.gamma(2, size=10000)

# Collect the samples as columns of one DataFrame and summarize them
df = pd.DataFrame({'normal': normal_sample,
                   'random': random_sample,
                   'gamma': gamma_sample})
df.describe()
```

| | normal | random | gamma |
|---|---|---|---|
| count | 10000.000000 | 10000.000000 | 10000.000000 |
| mean | 0.005574 | 0.503947 | 1.980398 |
| std | 1.003427 | 0.289082 | 1.400785 |
| min | -4.013316 | 0.000017 | 0.014506 |
| 25% | -0.666291 | 0.252406 | 0.947084 |
| 50% | -0.004336 | 0.507529 | 1.671177 |
| 75% | 0.684656 | 0.751771 | 2.687132 |
| max | 3.474749 | 0.999931 | 12.756334 |

The summary statistics shown above split each sample into four quarters. The first quarter lies between the minimum and the 25% mark, which is called the **first quartile**. The second and third quarters lie between the **first quartile** and the 75% mark, which is called the **third quartile**; the 50% mark that separates them is the median. The final quarter lies between the third quartile and the maximum.

The **interquartile range** (IQR) is the span between the **first** and **third quartiles**, so it covers the middle 50% of the data.
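As a minimal sketch of these definitions, the quartiles and the interquartile range can be computed directly with `np.percentile`. The seeded generator and the variable names (`rng`, `sample`, `q1`, `q3`, `iqr`) are illustrative assumptions, not part of the original notebook, so the numbers will differ from the table above.

```python
import numpy as np

# Assumption: a seeded generator so the sketch is reproducible;
# the original notebook uses the unseeded np.random functions.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=10000)

# The 25%, 50% and 75% marks are the first quartile, median and third quartile
q1, median, q3 = np.percentile(sample, [25, 50, 75])

# The interquartile range is the distance between the first and third quartiles
iqr = q3 - q1

print(f"Q1={q1:.3f}  median={median:.3f}  Q3={q3:.3f}  IQR={iqr:.3f}")
```

These are the same values that `df.describe()` reports in its `25%`, `50%` and `75%` rows for a given column.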

In the box plots shown below:

- the **interquartile range** is the boxed region,
- the **maximum** and **minimum** values are the horizontal lines, and
- the **median** is the red line in the center of the boxed region.

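It is also possible to specify more limited ranges for the whiskers. This is controlled by the `whis` parameter of `plt.boxplot`. Passing `'range'` extends the whiskers to the minimum and maximum of the data; newer Matplotlib releases replace that string with the equivalent percentile pair `whis=(0, 100)`.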

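```
plt.figure()
# whis=(0, 100) extends the whiskers to the minimum and maximum of the data;
# older Matplotlib releases also accepted whis='range' for this.
plt.boxplot([df['normal'],
             df['random'],
             df['gamma']],
            whis=(0, 100));
```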


If the `whis` parameter is left out, it defaults to 1.5: the top whisker reaches to the last datum less than `Q3 + 1.5*IQR`, and the bottom whisker reaches to the first datum greater than `Q1 - 1.5*IQR`, where `IQR` represents the interquartile range.

Plots like the one shown below are good for detecting outliers. The data points beyond the whiskers are called “fliers.”

```
plt.figure()
# With whis omitted, points beyond the whiskers are drawn individually as fliers
plt.boxplot([df['normal'],
             df['random'],
             df['gamma']]);
```
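The default whisker rule can also be reproduced by hand, which makes it easier to see which points end up as fliers. The following is a rough sketch under stated assumptions: the seeded generator and all variable names are illustrative, and the lower whisker is taken to follow the mirror-image rule `Q1 - 1.5*IQR`.

```python
import numpy as np

# Assumption: a seeded gamma sample standing in for df['gamma']
rng = np.random.default_rng(0)
sample = rng.gamma(2, size=10000)

q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1

# Limits implied by the default factor of 1.5
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points that are still inside the limits
inside = sample[(sample >= lower_limit) & (sample <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the limits would be drawn as a flier
fliers = sample[(sample < lower_limit) | (sample > upper_limit)]

print(f"whiskers: [{whisker_low:.3f}, {whisker_high:.3f}]")
print(f"number of fliers: {fliers.size}")
```

Because the gamma distribution has a long right tail, nearly all of its fliers sit above the top whisker, which matches the cluster of points above the gamma box in the plot.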
