Wrangling the WeRateDogs Twitter Feed

Jupyter Notebook | Github Repo

Detailed files are available above. The following is a summary of key results and brief discussion of parts of the analysis.

This project involves wrangling data obtained from three sources, all of which relate to the famous WeRateDogs (@dog_rates) Twitter account. WeRateDogs is a Twitter account that tweets images of dogs their owners send in, along with a funny caption and a rating that almost always exceeds 10/10. An example tweet follows.

Example Tweet

The overall approach to this project is typical of many analytics projects. It involved gathering, assessing, cleaning, storing, and finally analyzing and visualizing the data.

Gather

The three datasets include:

  1. The complete Twitter archive for the account, in the form of a manually downloaded CSV file.
  2. A set of neural network image predictions for the dog images the account has tweeted. The neural network was trained to recognize dog breeds, but it also predicts miscellaneous items, such as paper towels and, randomly, seat belts, when that is what it perceives in an image. This file was downloaded programmatically from Udacity servers using the Requests library (a short sketch follows this list).
  3. The Tweet information stored on the Twitter servers for each of the “Tweet IDs” contained in the Twitter archive. This information was downloaded from Twitter servers in the form of JSON entries, using the Tweepy library.
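
For the second source, the programmatic download amounts to a GET request whose body is written to disk. A minimal sketch, with a placeholder URL standing in for the actual Udacity link:

import requests

# Placeholder URL; the real file lives on Udacity's servers
url = 'https://example.com/image-predictions.tsv'

response = requests.get(url)
response.raise_for_status()

# Write the downloaded bytes to a local TSV file for later loading with pandas
with open('image_predictions.tsv', mode='wb') as file:
    file.write(response.content)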

The following example code shows how one might interface with the Tweepy API. Learn more here.

import json

import tweepy

consumer_key = 'QsQAXrXsa0IDTpWr6ZmQ7EKua'                             # Fake
consumer_secret = 'T8Cv9CNciLgCSjuXhfeMsliQ4fCk9GrcfqW16pEvTiwwzsniK9' # Fake
access_token = '046106474066380523-C3f8KxZt7iwEo3k9TfuO9vn0UklxUba'    # Fake
access_secret = 'w6nrLYXKwq98Sy3pEVVQu0DyLtMMYXvVcXWjqUVukTPhZ'        # Fake

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth,
                 wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)  # tweepy 3.x

# tweet_ids holds the IDs taken from the manually downloaded Twitter archive
with open('tweet_json.txt', mode='w') as file:
    for tweet_id in tweet_ids:
        try:
            status = api.get_status(tweet_id)
            json_str = json.dumps(status._json)
        except tweepy.TweepError:
            # Tweet has been deleted or is otherwise unavailable
            continue
        file.write(json_str + '\n')
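
Once the download completes, the saved JSON lines can be read back into the json_info dataframe. A minimal sketch, assuming only the retweet and favorite counts are kept (the actual notebook may retain additional fields):

import json

import pandas as pd

# Parse each saved line back into a dict and keep the fields of interest
rows = []
with open('tweet_json.txt') as file:
    for line in file:
        tweet = json.loads(line)
        rows.append({'tweet_id': tweet['id'],
                     'retweet_count': tweet['retweet_count'],
                     'favorite_count': tweet['favorite_count']})

json_info = pd.DataFrame(rows)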

Assess

Data assessment involves examining data quality and tidiness. The following definitions for these terms are taken from Udacity coursework; a sketch of the kinds of programmatic checks they translate into follows the two lists.

Quality issues pertain to the content of the data. Low-quality data is also known as dirty data. There are four dimensions of data quality:

  • Completeness: do we have all of the records that we should? Are specific rows, columns, or cells missing?
  • Validity: we have the records, but they are not valid, i.e., they do not conform to a defined schema. A schema is a defined set of rules for the data; these rules can be real-world constraints (e.g., a negative height is impossible) or table-specific constraints (e.g., unique key constraints in tables).
  • Accuracy: inaccurate data is valid but wrong; it adheres to the defined schema yet does not reflect reality. Example: a patient’s weight recorded 5 lbs too heavy because the scale was faulty.
  • Consistency: inconsistent data is both valid and accurate, but the same thing is referred to in multiple correct ways. A single standard format is desired for columns that represent the same data, within and across tables.

Tidiness issues pertain to the structure of data. These structural problems generally prevent easy analysis. Untidy data is also known as messy data. The requirements for tidy data are:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.
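
In practice, many of these checks can be made programmatically with pandas, alongside visual inspection of sample rows. A minimal sketch of the kinds of checks run against the archive; the column names (e.g., rating_denominator, retweeted_status_id) are assumed from the archive used in this project and are shown for illustration:

# Programmatic spot checks against the quality dimensions above
archive.info()                                # completeness: dtypes, non-null counts
archive.tweet_id.duplicated().sum()           # validity: duplicate tweet IDs
archive.rating_denominator.value_counts()     # validity: denominators other than 10
archive.name.value_counts().head(20)          # accuracy: dog "names" like 'a' or 'an'
archive.retweeted_status_id.notnull().sum()   # retweets mixed in with original tweets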

Clean

The most egregious completeness issue was that the various data files did not contain the same number of tweets.

print('  archive tweet count = ' + str(len(archive)))
print('   images tweet count = ' + str(len(images)))
print('json_info tweet count = ' + str(len(json_info)))
-------------------------------------------------------
  archive tweet count = 2356
   images tweet count = 2075
json_info tweet count = 2345

Addressing this issue involved dropping the tweets that were not part of the intersection of all three dataframes. Ultimately, this reduced the total count to 2069 tweets with entries in each table.

# Keep only the json entries whose tweet_id also appears in images_clean
tweets_to_keep = set(images_clean.tweet_id)
json_clean = json_clean[json_clean['tweet_id'].isin(tweets_to_keep)]

# json_clean now holds the intersection of the image and json tweet IDs;
# restrict archive_clean and images_clean to that same set
tweets_to_keep = set(json_clean.tweet_id)
archive_clean = archive_clean[archive_clean['tweet_id'].isin(tweets_to_keep)]
images_clean = images_clean[images_clean['tweet_id'].isin(tweets_to_keep)]

I created a new category called “prediction” in the images dataframe to simplify what seemed to be an excessive number of columns in that dataframe. The “prediction” variable was “Dog” if all three predictions were a dog breed, “Mixed” if the predictions contained both dog breeds and miscellaneous items, and “Not Dog” if all three predictions were not dogs. This single column eliminated the need for “p1_dog,” “p2_dog,” and “p3_dog” while still providing much of the same analytical value.

Example rows of Images dataframe before cleaning

ps = ['p1_dog', 'p2_dog', 'p3_dog']

for p in ps:
    images_clean[p] = images_clean[p].astype(int)

images_clean['prediction'] = (
    images_clean.p1_dog + images_clean.p2_dog + images_clean.p3_dog)

images_clean['prediction'] = images_clean['prediction'].replace(3, 'Dog')
images_clean['prediction'] = images_clean['prediction'].replace(2, 'Mixed')
images_clean['prediction'] = images_clean['prediction'].replace(1, 'Mixed')
images_clean['prediction'] = images_clean['prediction'].replace(0, 'Not Dog')

ps = ['p1_conf', 'p2_conf', 'p3_conf']

for p in ps:
    images_clean[p] = round(images_clean[p]*100).astype(int)

A sample of the same dataframe after cleaning is shown below.

Example rows of Images dataframe following cleaning

The before and after of the archive and json dataframes are included below. The archive dataframe prints very wide in a Jupyter notebook, so it has been split into two images.

Example rows of archive dataframe before cleaning
Example rows of archive dataframe before cleaning

The json dataframe is very small, by comparison.

Example rows of json dataframe before cleaning

Fundamentally, the two observation types these three tables represented were Tweets (in the case of the archive and json dataframes) and Images. The key that joins the three tables together is the ‘tweet_id’ (example: 668537837512433665). In keeping with the two observation types just listed, the archive and json dataframes are joined into a single tweets dataframe.
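
The join itself is a single merge on tweet_id. A minimal sketch (an inner join, although after the filtering above all three tables already share the same set of IDs):

import pandas as pd

# Combine the tweet-level observations into one dataframe
tweets_clean = pd.merge(archive_clean, json_clean, on='tweet_id', how='inner')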

Example rows of the tweets dataframe after cleaning

Store

In keeping with the types of observations each of these tables represents, the data are written to two CSV files: ‘twitter_archive_master.csv’ and ‘image_archive_master.csv’.

tweets_clean.to_csv('twitter_archive_master.csv')
images_clean.to_csv('image_archive_master.csv')

Analyze and Visualize

The following visualizations were created with Matplotlib.
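
The daily growth rates quoted below come from linear fits. A minimal sketch of how such a fit and plot might be produced, assuming the merged tweets dataframe has timestamp and favorite_count columns:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

tweets = pd.read_csv('twitter_archive_master.csv', parse_dates=['timestamp'])

# Days elapsed since the first tweet in the dataset
days = (tweets.timestamp - tweets.timestamp.min()).dt.days

# Slope of the linear fit = average increase in favorites per day
slope, intercept = np.polyfit(days, tweets.favorite_count, 1)

plt.scatter(days, tweets.favorite_count, s=10, alpha=0.5)
plt.plot(days, intercept + slope * days, color='red')
plt.xlabel('Days since first tweet (2015/11/15)')
plt.ylabel('Favorite count')
plt.title('Favorites per tweet over time (slope ~ {:.0f} per day)'.format(slope))
plt.show()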

Since the first tweet in this dataset (2015/11/15), the number of times an @dog_rates tweet is favorited has increased at a rate of roughly 32 per day.

Visualization 1


Since the first tweet in this dataset (2015/11/15), the number of times an @dog_rates tweet is retweeted has increased at a rate of roughly 7 per day.

Visualization 2


Golden Retrievers are the most popular breed on the WeRateDogs Twitter feed, followed by Pembrokes and Labrador Retrievers.

Visualization 3


Golden Retrievers seem to be rated especially highly, particularly given the very large number of tweets that include Golden Retriever images.

Visualization 4


It is unsurprising that the breeds with the highest average prediction confidence (in particular, the top three: Pomeranian, French Bulldog, and Pug) are also among the most distinctive-looking dog breeds in the dataset.

Visualization 5

Analytics Lessons Learned

  • Data cleaning takes much longer than you initially estimate. This project required over 70 hours to complete, the large majority of which was spent manipulating and transforming the data. Visualizing and analyzing the cleaned data is a very brief exercise by comparison.
  • The data quality and tidiness definitions set forth above are great working definitions that should be reused in future analyses.
  • A good practice is to assess the data in an assessment section, noting any issues inline. Then, summarize the complete set of those issues at the end of the assessment section, just above the clean section. This makes the list of issues easy to reference during the cleaning process.
  • Perform cleaning on a copy of the original dataframe, so “before” and “after” states are easily compared.
  • Perform cleaning systematically, using “Define,” “Code,” and “Test” sections for each issue identified during assessment. This ensures that results are as expected. (A brief sketch of these last two practices follows this list.)
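
For reference, the last two practices look roughly like this in a notebook; the specific issue shown (retweets identified via retweeted_status_id) is just one illustrative example from the assessment:

# Work on copies so the original dataframes remain available for comparison
archive_clean = archive.copy()
images_clean = images.copy()
json_clean = json_info.copy()

# Define: remove retweets so that only original ratings remain.

# Code:
archive_clean = archive_clean[archive_clean.retweeted_status_id.isnull()]

# Test:
assert archive_clean.retweeted_status_id.notnull().sum() == 0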