Detailed files are available above. The following is a summary of key results and brief discussion of parts of the analysis.
R is a programming language used for statistical computing. It is most commonly used with RStudio, a free programming environment that is reminiscent of the Matlab user interface. R is quite different from almost all of the programming languages I have used. This distinctiveness reflects its purpose as a tool for mathematics and statistics. Its syntax stands apart from that of more traditional languages designed for large-scale software development (C++, Java, even Python). Some code snippets are included below.
This project is a dataset exploration. Despite its quirks, R appears to be well-suited to exploring data. R produces beautiful visualizations with relatively little code.
The approach to dataset exploration advocated by Udacity is to begin by analyzing the distributions of single variables (univariate analysis). From there, relationships between two of these variables can be analyzed (bivariate analysis). Finally, datapoint coloration, size, and shape can be employed to examine the relationships between more than two variables at a time (multivariate analysis).
The publicly available dataset used for this project contains physicochemical properties and quality ratings for a set of nearly 1600 wines. Information regarding the dataset, including where it can be downloaded, is available here.
The dependent variable for the dataset is the wine quality. This quality factor is a rating between 0 and 10 based on the subjective judgment of a group of wine experts. The actual distribution of these ratings is computed and plotted as follows.
As shown in the histogram, the assigned ratings range from 3 to 8.
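A minimal sketch of how such a distribution can be tabulated and plotted in base R. The quality scores below are made up for illustration and are not the real dataset; the variable names are likewise assumptions.

```r
# Toy quality scores, invented for illustration (NOT the real wine data).
quality <- c(5, 6, 5, 7, 3, 6, 5, 8, 4, 6, 5, 6)

# Tabulate over the full 0-10 scale so that unused scores appear as zero counts.
counts <- table(factor(quality, levels = 0:10))
print(counts)

# Draw the distribution as a bar chart. pdf(NULL) discards the drawing so the
# script runs without a display; drop the pdf()/dev.off() pair to see the plot.
pdf(NULL)
barplot(counts, xlab = "Quality", ylab = "Number of wines")
invisible(dev.off())

# Lowest and highest score actually assigned.
observed_range <- range(quality)
print(observed_range)
```

Tabulating against the full 0 to 10 scale makes it obvious when the ends of the scale are never used, which is the pattern described above.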
To aid in subsequent visualizations that benefit from a smaller number of groups, I create a wine “rating” that groups the quality factors as follows:
- 0: Quality 3 or 4 (low quality, 63 wines)
- 1: Quality 5 or 6 (moderate quality, 1319 wines)
- 2: Quality 7 or 8 (high quality, 217 wines)
Admittedly, as “feature engineering” goes, this simple grouping of variables is very tame.
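One way to express this grouping in base R is with `cut()`. This is a sketch under assumptions: the `quality` vector is a toy example, and the break points are chosen only to reproduce the three groups listed above.

```r
# One toy score per observed quality level (illustrative, not the real data).
quality <- c(3, 4, 5, 6, 7, 8)

# cut() uses right-closed intervals by default: (2,4], (4,6], (6,8].
rating <- cut(quality,
              breaks = c(2, 4, 6, 8),
              labels = c("0", "1", "2"))

# Show each quality score next to its rating group.
print(data.frame(quality, rating))
```

The result is a factor, which plotting functions treat as a discrete grouping variable rather than a number.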
An examination of the histograms for the independent variables shows that a majority of them are positively skewed. I test the skewness by running summaries for each of the independent factors.
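As a rough illustration of this check, the sketch below runs `summary()` on a toy data frame and compares each column's mean with its median. The column names and values are hypothetical; a mean well above the median is only a quick indicator of positive skew, not a formal test.

```r
# Toy data frame with hypothetical column names (NOT the real wine data).
wines <- data.frame(
  skewed_var    = c(1, 2, 2, 3, 3, 4, 5, 8, 20),  # long right tail
  symmetric_var = c(1, 2, 3, 4, 5, 6, 7, 8, 9)    # evenly spread
)

# Five-number summary plus the mean for every column.
print(summary(wines))

# Mean minus median per column; a clearly positive gap hints at positive skew.
gap <- sapply(wines, function(col) mean(col) - median(col))
print(gap)
```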
The only parameters for which the mean is not substantially above the median are density and pH. For the remainder, the data is positively skewed. To compensate, I perform a log-transform on the data. The resulting histograms appear as follows.
In most cases, the spread of the data better approximates a normal distribution following the log transform.
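The effect of the transform can be sketched on a toy right-skewed variable. Everything here is an assumption made for illustration: the data are invented, base 10 is an arbitrary choice of logarithm, and `sample_skew` is a hypothetical helper that computes the standard moment-based skewness.

```r
# Toy right-skewed values (illustrative, not the real wine data).
x <- c(1, 2, 2, 3, 3, 4, 5, 8, 20)

# Hypothetical helper: third central moment divided by the 1.5 power of the
# second central moment. Values near zero indicate a roughly symmetric sample.
sample_skew <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Log-transform; all values must be strictly positive for this to be defined.
log_x <- log10(x)

skew_raw <- sample_skew(x)
skew_log <- sample_skew(log_x)
print(c(raw = skew_raw, log10 = skew_log))
```

For this sample the skewness drops after the transform, which is the behaviour the histograms suggest for most of the wine properties.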
The ggpairs function is one of the most impressive tools for dataset exploration I have ever found. I am interested to learn what the Python equivalent is. The results of running ggpairs on this dataset follow.
As shown, for each combination of variables, ggpairs produces a scatterplot and calculates the correlation coefficient. From this generalized pairs plot, the factors with the strongest correlations to wine quality are alcohol, volatile acidity, sulphates, and citric acid, in that order. I re-run the ggpairs plot with just that subset of factors.
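The ranking step can be approximated without ggpairs by computing the correlation matrix directly. The sketch below uses base R's `cor()` on a small invented data frame; the column names follow the wine properties mentioned above, but the values are made up and the resulting coefficients say nothing about the real dataset.

```r
# Toy data frame (values invented for illustration, NOT the real wine data).
wines <- data.frame(
  alcohol          = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.45, 0.40, 0.30),
  quality          = c(5, 4, 5, 6, 7, 8)
)

# Pearson correlation of every column with quality, dropping quality itself.
cors <- cor(wines)[, "quality"]
cors <- cors[names(cors) != "quality"]

# Rank the factors by the strength (absolute value) of their correlation.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(ranked)
```

Sorting on the absolute value keeps strongly negative relationships, such as the one expected for volatile acidity, near the top of the list.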
For each of the possible pairings between the independent variables and wine quality, I produce a scatterplot and a trendline to better understand the relationship. The parameter with the strongest relationship to wine quality is alcohol content.
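A scatterplot with a trendline can be sketched in base R as follows. This assumes a straight-line fit via `lm()`, which may differ from the smoother used for the original plots, and the data are invented for illustration.

```r
# Toy data (invented values, NOT the real wine data).
wines <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  quality = c(5, 4, 5, 6, 7, 8)
)

# Ordinary least-squares line: quality as a linear function of alcohol.
fit <- lm(quality ~ alcohol, data = wines)

# Scatterplot with the fitted line overlaid. pdf(NULL) discards the drawing;
# drop the pdf()/dev.off() pair to view it interactively.
pdf(NULL)
plot(wines$alcohol, wines$quality, xlab = "Alcohol", ylab = "Quality")
abline(fit)
invisible(dev.off())

# Slope of the trendline: change in quality per unit of alcohol.
slope <- coef(fit)[["alcohol"]]
print(slope)
```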
I also create scatterplots describing a few of the relationships between key features of interest. These are shown below.
In the following plots, the dimensionality of the quality factor is reduced by using the previously discussed (High, Medium, Low) rating system. In these plots, the best wines (blue) minimize volatile acidity (on the y-axis) and maximize each of the other features (on the x-axis).
The left-hand plot is on a linear scale (untransformed, raw data), whereas the right-hand plot has been log-transformed to improve the spread of the data.
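A side-by-side pair of plots of this kind, coloured by rating group, can be sketched in base R. The data, the colour choices other than blue for the best wines, and the use of logarithmic axes for the right-hand panel are all assumptions made for illustration.

```r
# Toy data (invented values, NOT the real wine data).
wines <- data.frame(
  alcohol          = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.45, 0.40, 0.30),
  quality          = c(5, 4, 5, 6, 7, 8)
)

# Collapse quality into three rating groups, then map each group to a colour.
wines$rating <- cut(wines$quality, breaks = c(2, 4, 6, 8),
                    labels = c("Low", "Medium", "High"))
palette_by_rating <- c(Low = "red", Medium = "grey50", High = "blue")
point_col <- palette_by_rating[as.character(wines$rating)]

# Left panel: linear axes. Right panel: log-scaled axes via log = "xy".
# pdf(NULL) discards the drawing; remove pdf()/dev.off() to view it.
pdf(NULL)
op <- par(mfrow = c(1, 2))
plot(wines$alcohol, wines$volatile.acidity, col = point_col, pch = 19,
     xlab = "Alcohol", ylab = "Volatile acidity", main = "Linear")
plot(wines$alcohol, wines$volatile.acidity, col = point_col, pch = 19,
     xlab = "Alcohol", ylab = "Volatile acidity", main = "Log scale",
     log = "xy")
par(op)
invisible(dev.off())
```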
The following plots are similar, except the best wines maximize the parameters on both axes. The code used to generate these plots is similar to the previous code block.
The final permutation follows.
Another way to visualize the same data is to split the quality factor into two groups: less than or equal to 5, and greater than 5. The following data are plotted with and without log transforms. The advantage of this visualization is that the location of the highest-quality (blue) group becomes very apparent when comparing the left-hand group to the right-hand one.
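The two-group split can be sketched with `ifelse()`. The scores and the group labels below are illustrative assumptions; the resulting factor could then be used to divide a plot into a left-hand and a right-hand panel.

```r
# Toy quality scores (illustrative, NOT the real wine data).
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Two groups: scores of 5 or below, and scores above 5.
quality_group <- factor(ifelse(quality <= 5, "quality <= 5", "quality > 5"),
                        levels = c("quality <= 5", "quality > 5"))

# Number of wines falling in each group.
group_counts <- table(quality_group)
print(group_counts)
```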
For all of these plots, the clustering of the quality groups becomes more evident as the plots are printed larger.
Analytics Lessons Learned
- R seems to be a great tool for visualization and exploration. It is able to produce great visuals with relatively few lines of code. However, I find the syntax unintuitive and strange. Given a choice between Python and R, I would prefer Python. I would, however, like to learn how to reproduce the plots from this project in Python. I imagine that doing so would require more libraries and more code than R does.