Detailed files are available above. The following is a summary of key results and brief discussion of parts of the analysis.
This project involves wrangling data obtained from three sources, all of which relate to the famous WeRateDogs (@dog_rates) Twitter account. WeRateDogs is a Twitter account that tweets images of dogs their owners send in, along with a funny caption and a rating that almost always exceeds 10/10. An example tweet follows.
The overall approach to this project is typical of many analytics projects: gathering, assessing, cleaning, storing, and finally analyzing and visualizing the data.
The three datasets include:
- The complete Twitter archive for the account, in the form of a manually downloaded CSV file.
- A set of neural network image predictions for the set of dog images the account has tweeted. The neural network was trained to recognize dog breeds, but also predicts items, such as paper towels and, randomly, seat belts, if that is what it perceives in the images. This file was downloaded programmatically from Udacity servers using the Requests library.
- The Tweet information stored on the Twitter servers for each of the “Tweet IDs” contained in the Twitter archive. This information was downloaded from Twitter servers in the form of JSON entries, using the Tweepy library.
The following example code shows how one might interface with the Tweepy API.
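A minimal sketch of that download step, not the project's exact code. It assumes valid Twitter API credentials and a Tweepy version that still exposes `OAuthHandler`, `API`, and `get_status`. The helper names `build_api` and `fetch_tweets` are hypothetical, and the credentials are placeholders.

```python
import json


def build_api(consumer_key, consumer_secret, access_token, access_secret):
    """Create an authenticated Tweepy client (all four arguments are placeholders)."""
    import tweepy  # imported lazily so fetch_tweets can be used without Tweepy

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    # wait_on_rate_limit makes Tweepy pause when the rate limit is reached
    return tweepy.API(auth, wait_on_rate_limit=True)


def fetch_tweets(api, tweet_ids, out_path):
    """Write one JSON object per line for every tweet that can be retrieved.

    Returns the IDs that could not be fetched (for example, deleted tweets).
    """
    failed = []
    with open(out_path, "w", encoding="utf-8") as f:
        for tweet_id in tweet_ids:
            try:
                status = api.get_status(tweet_id, tweet_mode="extended")
            except Exception:  # Tweepy's error class name differs between versions
                failed.append(tweet_id)
                continue
            json.dump(status._json, f)
            f.write("\n")
    return failed
```

Storing one JSON object per line keeps the file easy to read back line by line, and the returned list of failed IDs records which tweets are missing from the result.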
Data assessment involves examining data quality and tidiness. The following definitions for these terms are taken from Udacity coursework:
Quality issues pertain to the content of data. Low quality data is also known as dirty data. There are four dimensions of quality data:
- Completeness: do we have all of the records that we should? Do we have missing records or not? Are there specific rows, columns, or cells missing?
- Validity: we have the records, but they’re not valid, i.e., they don’t conform to a defined schema. A schema is a defined set of rules for data. These rules can be real-world constraints (e.g. negative height is impossible) and table-specific constraints (e.g. unique key constraints in tables).
- Accuracy: inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect. Example: a patient’s weight that is 5 lbs too heavy because the scale was faulty.
- Consistency: inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. Consistency, i.e., a standard format, in columns that represent the same data across tables and/or within tables is desired.
Tidiness issues pertain to the structure of data. These structural problems generally prevent easy analysis. Untidy data is also known as messy data. The requirements for tidy data are:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
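Some of these dimensions can be checked programmatically. The sketch below uses a toy dataframe. The column names `rating_denominator` and `name` are assumptions for illustration, and the checks shown are examples rather than the full assessment performed in this project.

```python
import pandas as pd

# Toy data; only tweet_id is a column name taken from this write-up.
df = pd.DataFrame({
    "tweet_id": [1, 2, 2, 4],
    "rating_denominator": [10, 10, 10, 0],
    "name": ["Rex", None, None, "Bella"],
})

# Completeness: count missing cells in each column.
missing_per_column = df.isnull().sum()

# Validity (table-specific constraint): tweet_id should be a unique key.
duplicate_ids = df["tweet_id"].duplicated().sum()

# Validity (real-world constraint): a rating denominator must be positive.
bad_denominators = (df["rating_denominator"] <= 0).sum()

print(missing_per_column)
print("duplicate ids:", duplicate_ids)
print("non-positive denominators:", bad_denominators)
```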
The most egregious completeness issue was that the various data files did not contain the same number of tweets.
Cleaning this issue involved dropping the tweets that were not part of the intersection of all three dataframes. This left 2069 tweets, each with an entry in every table.
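One way to compute that intersection is with Python sets of tweet IDs. The sketch below uses three toy dataframes. The variable names are assumptions, and only `tweet_id` comes from the project.

```python
import pandas as pd

# Toy stand-ins for the archive, image prediction, and JSON dataframes.
archive = pd.DataFrame({"tweet_id": [1, 2, 3, 4]})
images = pd.DataFrame({"tweet_id": [2, 3, 4, 5]})
tweet_json = pd.DataFrame({"tweet_id": [1, 3, 4]})

# IDs present in all three tables.
common_ids = (
    set(archive["tweet_id"]) & set(images["tweet_id"]) & set(tweet_json["tweet_id"])
)

# Keep only the rows whose ID is in the intersection.
archive_clean, images_clean, json_clean = (
    df[df["tweet_id"].isin(common_ids)].copy()
    for df in (archive, images, tweet_json)
)
```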
I created a new column called “prediction” in the images dataframe to simplify what seemed to be an excessive number of columns in that dataframe. The “prediction” variable was “Dog” if all three predictions were a dog breed, “Mixed” if the predictions contained both dog breeds and miscellaneous items, and “Not Dog” if none of the three predictions was a dog breed. This single column eliminated the need for “p1_dog,” “p2_dog,” and “p3_dog” while still providing much of the same analytical value.
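The rule for the “prediction” column can be sketched as follows. It assumes the three flag columns are boolean, and it uses toy rows rather than the project's data.

```python
import numpy as np
import pandas as pd

# Toy rows: all dogs, a mix, and no dogs.
images = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "p1_dog": [True, True, False],
    "p2_dog": [True, False, False],
    "p3_dog": [True, True, False],
})

dog_cols = ["p1_dog", "p2_dog", "p3_dog"]
n_dogs = images[dog_cols].sum(axis=1)  # how many of the three predictions are dog breeds

# 3 dog predictions -> "Dog", 0 -> "Not Dog", anything in between -> "Mixed".
images["prediction"] = np.select(
    [n_dogs == 3, n_dogs == 0], ["Dog", "Not Dog"], default="Mixed"
)

# The three flag columns are no longer needed.
images = images.drop(columns=dog_cols)
```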
A sample of the same dataframe following cleaning looks as follows.
The before and after of the archive and json dataframes are included below. The dataframe prints very wide in a Jupyter notebook and has been split into two images.
The json dataframe is very small, by comparison.
Fundamentally, these three tables represent two observation types: tweets (the archive and json dataframes) and images. The key that joins the three tables together is the ‘tweet_id’ (example: 668537837512433665). In keeping with these two observation types, the archive and json dataframes are joined into a single tweets dataframe.
Accordingly, the data are written to two CSV files: ‘twitter_archive_master.csv’ and ‘image_archive_master.csv’.
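A sketch of the join and store steps, assuming an inner merge on `tweet_id`. The rows are toy data, the columns other than `tweet_id` are assumptions, and the files are written to a temporary directory here so that nothing is overwritten.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins; the second ID is made up for illustration.
archive = pd.DataFrame({
    "tweet_id": [668537837512433665, 668537837512433666],
    "text": ["first tweet", "second tweet"],
})
tweet_json = pd.DataFrame({
    "tweet_id": [668537837512433665, 668537837512433666],
    "favorite_count": [5, 7],
    "retweet_count": [1, 2],
})
images = pd.DataFrame({
    "tweet_id": [668537837512433665, 668537837512433666],
    "prediction": ["Dog", "Mixed"],
})

# One table per observational unit: tweets and images.
tweets = archive.merge(tweet_json, on="tweet_id", how="inner")

out_dir = Path(tempfile.mkdtemp())
tweets.to_csv(out_dir / "twitter_archive_master.csv", index=False)
images.to_csv(out_dir / "image_archive_master.csv", index=False)
```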
Analyze and Visualize
The following visualizations were created with Matplotlib.
Since the first tweet in this dataset (2015/11/15), the number of times an @dog_rates tweet is favorited has increased at a rate of roughly 32 per day.
Since the first tweet in this dataset (2015/11/15), the number of times an @dog_rates tweet is retweeted has increased at a rate of roughly 7 per day.
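A per-day rate like the ones above can be estimated as the slope of a least-squares line fitted to counts against days elapsed. The write-up does not state how its rates were computed, so this is only one plausible approach. The column names are assumptions, and the data are synthetic, constructed to have a slope of exactly 32.

```python
import numpy as np
import pandas as pd

# Synthetic data: favorites grow by exactly 32 per day.
tweets = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2015-11-15", "2015-11-16", "2015-11-17", "2015-11-18"]
    ),
    "favorite_count": [100, 132, 164, 196],
})

# Days elapsed since the first tweet in the data.
elapsed = tweets["timestamp"] - tweets["timestamp"].min()
days = (elapsed.dt.total_seconds() / 86400).to_numpy()

# Degree-1 polynomial fit: slope is the change in favorites per day.
slope, intercept = np.polyfit(days, tweets["favorite_count"].to_numpy(), deg=1)
print(f"favorites increase by about {slope:.1f} per day")
```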
Golden Retrievers are the most popular breed on the WeRateDogs Twitter feed, followed by Pembrokes and Labrador Retrievers.
Golden Retrievers seem especially highly rated, particularly given the very large number of tweets that included Golden Retriever images.
It is unsurprising that the breeds that resulted in the highest average prediction confidence (in particular, the top three: Pomeranian, French Bulldog, and Pug) are also among the most distinctive-looking dog breeds in the dataset.
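Breed popularity and average prediction confidence can both be derived with a count and a group-by. The sketch below assumes the top prediction and its confidence are stored in columns named `p1` and `p1_conf`. Those names and the values are illustrative, not the project's data.

```python
import pandas as pd

# Toy predictions; p1 and p1_conf are assumed column names.
images = pd.DataFrame({
    "p1": ["pug", "pug", "golden_retriever", "pomeranian"],
    "p1_conf": [0.9, 0.7, 0.6, 0.95],
})

# Number of tweets per predicted breed, most frequent first.
breed_counts = images["p1"].value_counts()

# Mean prediction confidence per breed, highest first.
mean_conf = images.groupby("p1")["p1_conf"].mean().sort_values(ascending=False)

print(breed_counts)
print(mean_conf)
```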
Analytics Lessons Learned
- Data cleaning takes much longer than you initially estimate. This project required over 70 hours to complete, a large majority of which was spent manipulating and transforming the data. Visualizing and analyzing the cleaned data is a very brief exercise by comparison.
- The data quality and tidiness definitions set forth above are great working definitions that should be reused in future analyses.
- A good practice is to assess the data in an assessment section, noting any issues inline. Then, summarize the complete set of those issues at the end of the assessment section, just above the clean section. This makes the list of issues easy to reference during the cleaning process.
- Perform cleaning on a copy of the original dataframe, so “before” and “after” states are easily compared.
- Perform cleaning systematically, using “Define,” “Code,” and “Test” sections for each issue identified during assessment. This ensures that results are as expected.
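The last two lessons can be illustrated with a small sketch: clean a copy, and give each issue a Define, Code, and Test step. The issue shown (a timestamp stored as text) and the `timestamp` column are assumptions chosen for illustration.

```python
import pandas as pd

# Toy archive with the timestamp stored as text.
archive = pd.DataFrame({
    "tweet_id": [1, 2],
    "timestamp": ["2015-11-15 22:32:08", "2015-11-16 00:04:52"],
})

# Work on a copy so the original stays available for comparison.
archive_clean = archive.copy()

# Define: timestamp is stored as a string; convert it to a datetime type.

# Code
archive_clean["timestamp"] = pd.to_datetime(archive_clean["timestamp"])

# Test
assert pd.api.types.is_datetime64_any_dtype(archive_clean["timestamp"])
# The original dataframe is unchanged.
assert not pd.api.types.is_datetime64_any_dtype(archive["timestamp"])
```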