We Rate Dogs Twitter Feed Analysis

By: Ryan Wingate
Completed: April 15, 2018

Intro

This project consists of gathering, assessing, cleaning, storing, and finally analyzing and visualizing the tweet history of the famous Twitter account WeRateDogs.

Preliminaries

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import requests
import tweepy
import json
import pprint as pp
import matplotlib.pyplot as plt

Gather

This section of the report details the processes by which the data for this analysis was obtained.

Downloaded CSV File

'twitter-archive-enhanced.csv' is a file downloaded manually from Udacity. It contains the Twitter archive for the WeRateDogs Twitter account.

It also contains several columns that were programmatically extracted from the tweet text; these columns will likely require cleaning.

In [2]:
# Downloaded this CSV from Udacity
archive = pd.read_csv('twitter-archive-enhanced.csv')

Programmatic Download Using Requests

'tweet-image-predictions.tsv' is a file downloaded programmatically from Udacity using the requests Python library. It contains the results of a neural network's analysis of tweet images.

Specifically, it contains predictions as to the image's contents. If it is a dog, then it predicts that dog's breed.

In [3]:
# Download tweet image predictions from Udacity's cloud storage
# url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
# response = requests.get(url)
# with open('tweet-image-predictions.tsv', mode = 'wb') as file:
#     file.write(response.content)
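
The download above is commented out after the initial run. To keep the cell rerunnable without repeating the request, it could instead be guarded by a file-existence check; a minimal sketch using the standard os module:

import os

if not os.path.exists('tweet-image-predictions.tsv'):
    url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on a bad download
    with open('tweet-image-predictions.tsv', mode='wb') as file:
        file.write(response.content)
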
In [4]:
images = pd.read_csv('tweet-image-predictions.tsv', sep='\t')

Download via Twitter API Call

The Tweepy library is used to extract JSON data for each of the tweets included in the 'twitter-archive-enhanced.csv' file.

In [5]:
# consumer_key = ''
# consumer_secret = ''
# access_token = ''
# access_secret = ''

# auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# auth.set_access_token(access_token, access_secret)
In [6]:
# api = tweepy.API(auth, 
#                  wait_on_rate_limit=True,
#                  wait_on_rate_limit_notify=True)
In [7]:
# Determine how many tweet_ids there are
print('Unique Values = ' + str(archive.tweet_id.nunique()))
print(' Total Length = ' + str(len(archive)))

# Convert to a plain list to iterate over
tweet_ids = archive.tweet_id.tolist()
Unique Values = 2356
 Total Length = 2356
In [8]:
# In a loop, access each tweet_id using the API and write each line to
# the file.

# with open('tweet_json.txt', mode = 'w') as file:
#     for tweet_id in tweet_ids:
#         try:
#             status = api.get_status(tweet_id)
#             json_str = json.dumps(status._json)
#         except tweepy.TweepError:
#             # Tweet was presumably deleted, so skip it
#             continue
#         file.write(json_str + '\n')

Print a line of the complete JSON downloaded from Twitter.

In [9]:
with open('tweet_json.txt') as json_file:
    line = json_file.readline()
    tweet = json.loads(line)
    pp.pprint(tweet)
{'contributors': None,
 'coordinates': None,
 'created_at': 'Tue Aug 01 16:23:56 +0000 2017',
 'entities': {'hashtags': [],
              'media': [{'display_url': 'pic.twitter.com/MgUWQ76dJU',
                         'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
                         'id': 892420639486877696,
                         'id_str': '892420639486877696',
                         'indices': [86, 109],
                         'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
                         'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
                         'sizes': {'large': {'h': 528,
                                             'resize': 'fit',
                                             'w': 540},
                                   'medium': {'h': 528,
                                              'resize': 'fit',
                                              'w': 540},
                                   'small': {'h': 528,
                                             'resize': 'fit',
                                             'w': 540},
                                   'thumb': {'h': 150,
                                             'resize': 'crop',
                                             'w': 150}},
                         'type': 'photo',
                         'url': 'https://t.co/MgUWQ76dJU'}],
              'symbols': [],
              'urls': [],
              'user_mentions': []},
 'extended_entities': {'media': [{'display_url': 'pic.twitter.com/MgUWQ76dJU',
                                  'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
                                  'id': 892420639486877696,
                                  'id_str': '892420639486877696',
                                  'indices': [86, 109],
                                  'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
                                  'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
                                  'sizes': {'large': {'h': 528,
                                                      'resize': 'fit',
                                                      'w': 540},
                                            'medium': {'h': 528,
                                                       'resize': 'fit',
                                                       'w': 540},
                                            'small': {'h': 528,
                                                      'resize': 'fit',
                                                      'w': 540},
                                            'thumb': {'h': 150,
                                                      'resize': 'crop',
                                                      'w': 150}},
                                  'type': 'photo',
                                  'url': 'https://t.co/MgUWQ76dJU'}]},
 'favorite_count': 38984,
 'favorited': False,
 'geo': None,
 'id': 892420643555336193,
 'id_str': '892420643555336193',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'is_quote_status': False,
 'lang': 'en',
 'place': None,
 'possibly_sensitive': False,
 'possibly_sensitive_appealable': False,
 'retweet_count': 8647,
 'retweeted': False,
 'source': '<a href="http://twitter.com/download/iphone" '
           'rel="nofollow">Twitter for iPhone</a>',
 'text': "This is Phineas. He's a mystical boy. Only ever appears in the hole "
         'of a donut. 13/10 https://t.co/MgUWQ76dJU',
 'truncated': False,
 'user': {'contributors_enabled': False,
          'created_at': 'Sun Nov 15 21:41:29 +0000 2015',
          'default_profile': False,
          'default_profile_image': False,
          'description': 'Your Only Source for Professional Dog Ratings STORE: '
                         '@ShopWeRateDogs | IG, FB & SC: WeRateDogs | MOBILE '
                         'APP: @GoodDogsGame Business: '
                         'dogratingtwi[email protected]',
          'entities': {'description': {'urls': []},
                       'url': {'urls': [{'display_url': 'weratedogs.com',
                                         'expanded_url': 'http://weratedogs.com',
                                         'indices': [0, 23],
                                         'url': 'https://t.co/N7sNNHAEXS'}]}},
          'favourites_count': 132909,
          'follow_request_sent': False,
          'followers_count': 6450951,
          'following': False,
          'friends_count': 103,
          'geo_enabled': True,
          'has_extended_profile': True,
          'id': 4196983835,
          'id_str': '4196983835',
          'is_translation_enabled': False,
          'is_translator': False,
          'lang': 'en',
          'listed_count': 4070,
          'location': '𝓶𝓮𝓻𝓬𝓱 ↴      DM YOUR DOGS',
          'name': 'WeRateDogs™',
          'notifications': False,
          'profile_background_color': '000000',
          'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
          'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
          'profile_background_tile': False,
          'profile_banner_url': 'https://pbs.twimg.com/profile_banners/4196983835/1515037507',
          'profile_image_url': 'http://pbs.twimg.com/profile_images/948761950363664385/Fpr2Oz35_normal.jpg',
          'profile_image_url_https': 'https://pbs.twimg.com/profile_images/948761950363664385/Fpr2Oz35_normal.jpg',
          'profile_link_color': 'F5ABB5',
          'profile_sidebar_border_color': '000000',
          'profile_sidebar_fill_color': '000000',
          'profile_text_color': '000000',
          'profile_use_background_image': False,
          'protected': False,
          'screen_name': 'dog_rates',
          'statuses_count': 6840,
          'time_zone': None,
          'translator_type': 'none',
          'url': 'https://t.co/N7sNNHAEXS',
          'utc_offset': None,
          'verified': True}}

Extract a few of the important features from the JSON.

In [10]:
with open('tweet_json.txt') as json_file:
    json_info = pd.DataFrame(columns = ['tweet_id', 
                                        'favorites', 
                                        'retweets'])
    for line in json_file:
        tweet = json.loads(line)
        json_info = json_info.append({
            'tweet_id': tweet['id'],
            'favorites': tweet['favorite_count'],
            'retweets': tweet['retweet_count']
        }, ignore_index=True)
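
As an aside, appending to a DataFrame inside a loop copies the frame on every iteration. An equivalent sketch that collects the rows in a plain list and builds the DataFrame once (this also infers integer dtypes directly, avoiding the all-object columns flagged in the Assess section):

rows = []
with open('tweet_json.txt') as json_file:
    for line in json_file:
        tweet = json.loads(line)
        rows.append({'tweet_id': tweet['id'],
                     'favorites': tweet['favorite_count'],
                     'retweets': tweet['retweet_count']})
json_info = pd.DataFrame(rows, columns=['tweet_id', 'favorites', 'retweets'])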

Assess

This section details my assessment of the data with regard to quality and tidiness. The general approach will be to assess against the following definitions of those terms.

Quality issues pertain to content. Low quality data is also known as dirty data. There are four dimensions of quality data:

  • Completeness: do we have all of the records that we should? Do we have missing records or not? Are there specific rows, columns, or cells missing?
  • Validity: we have the records, but they're not valid, i.e., they don't conform to a defined schema. A schema is a defined set of rules for data. These rules can be real-world constraints (e.g., a negative height is impossible) or table-specific constraints (e.g., unique key constraints in tables).
  • Accuracy: inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect. Example: a patient's weight that is 5 lbs too heavy because the scale was faulty.
  • Consistency: inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. Consistency, i.e., a standard format, in columns that represent the same data across tables and/or within tables is desired.

Tidiness issues pertain to structure. These structural problems generally prevent easy analysis. Untidy data is also known as messy data. The requirements for tidy data are:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.
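
For example, the doggo, floofer, pupper, and puppo columns in archive spread a single variable (dog stage) across four columns, violating the first rule. A sketch of the standard remedy, which the Clean section applies later with pd.melt:

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
tidy = pd.melt(archive,
               id_vars=[c for c in archive.columns if c not in stage_cols],
               value_vars=stage_cols,
               var_name='stage')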

Quality

Compare columns to get a sense of how the dataframes will eventually be merged. "tweet_id" appears to be the "primary key," meaning it is the parameter on which the dataframes should be joined.
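
For instance, once the tables are cleaned, they can be combined with an inner join on that key, as the Clean section does below; a one-line sketch:

combined = archive.merge(images, on='tweet_id', how='inner')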

In [11]:
print('archive columns = ')
print(list(archive.columns))
print('\nimages columns = ')
print(list(images.columns))
print('\njson_info columns = ')
print(list(json_info.columns))
archive columns = 
['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp', 'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator', 'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo']

images columns = 
['tweet_id', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog']

json_info columns = 
['tweet_id', 'favorites', 'retweets']

Sample each of the dataframes to get a feel for the data.

In [12]:
archive.head(2)
Out[12]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None
In [13]:
images.head()
Out[13]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
In [14]:
json_info.head(2)
Out[14]:
tweet_id favorites retweets
0 892420643555336193 38984 8647
1 892177421306343426 33369 6354

Completeness

Compare the lengths of the dataframes to get a sense of which is a subset of the others.

In [15]:
print('  archive tweet count = ' + str(len(archive)))
print('   images tweet count = ' + str(len(images)))
print('json_info tweet count = ' + str(len(json_info)))
  archive tweet count = 2356
   images tweet count = 2075
json_info tweet count = 2345

- Tables do not contain the same number of entries.

It appears the images dataframe covers the fewest tweets. Since I cannot obtain more complete image information, the best course of action is to subset the other dataframes to just the tweets contained in images.

Tweets that are contained in archive but not in json_info are likely tweets that were deleted.

Check whether the tweet_ids in images are also contained in archive and json_info.

In [16]:
print('Set of images tweet_ids contained in the set of archive tweet_ids?  ' + 
      str(images.tweet_id.isin(archive.tweet_id).all()))

print('   Set of images tweet_ids contained in the set of json_info tweet_ids?  ' +
      str(images.tweet_id.isin(pd.to_numeric(json_info.tweet_id)).all()))
Set of images tweet_ids contained in the set of archive tweet_ids?  True
   Set of images tweet_ids contained in the set of json_info tweet_ids?  False

- Some tweet_ids contained in images are not included in json_info.

This indicates that those tweets were deleted following the creation of the image predictions dataset. The appropriate set of tweet_ids to use is the intersection of all three dataframes.
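
That intersection can be computed directly; the sketch below is equivalent to the stepwise filtering performed in the Clean section, which keeps 2,069 tweets:

common_ids = (set(archive.tweet_id)
              & set(images.tweet_id)
              & set(pd.to_numeric(json_info.tweet_id)))
print('Usable tweets = ' + str(len(common_ids)))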

Validity

Check the values of the prediction probabilities in images.

In [17]:
images.p1_conf.describe()
Out[17]:
count    2075.000000
mean        0.594548
std         0.271174
min         0.044333
25%         0.364412
50%         0.588230
75%         0.843855
max         1.000000
Name: p1_conf, dtype: float64
In [18]:
images.p2_conf.describe()
Out[18]:
count    2.075000e+03
mean     1.345886e-01
std      1.006657e-01
min      1.011300e-08
25%      5.388625e-02
50%      1.181810e-01
75%      1.955655e-01
max      4.880140e-01
Name: p2_conf, dtype: float64
In [19]:
images.p3_conf.describe()
Out[19]:
count    2.075000e+03
mean     6.032417e-02
std      5.090593e-02
min      1.740170e-10
25%      1.622240e-02
50%      4.944380e-02
75%      9.180755e-02
max      2.734190e-01
Name: p3_conf, dtype: float64

These appear correct.

Compare the relative confidence when the prediction is of a dog versus when it is some other object.

In [20]:
images[images.p1_dog == True].p1_conf.describe()
Out[20]:
count    1532.000000
mean        0.613823
std         0.259735
min         0.044333
25%         0.390981
50%         0.614025
75%         0.850559
max         0.999956
Name: p1_conf, dtype: float64
In [21]:
images[images.p1_dog == False].p1_conf.describe()
Out[21]:
count    543.000000
mean       0.540167
std        0.294639
min        0.059033
25%        0.280340
50%        0.493257
75%        0.821904
max        1.000000
Name: p1_conf, dtype: float64

The neural network predicts the images contain dogs nearly 75% of the time, and tends to be more certain about its classifications when the image is predicted to contain a dog.
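
That share can be verified directly, since p1_dog is boolean:

print('Share of p1 predictions that are dogs = {:.1%}'.format(images.p1_dog.mean()))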

Check formatting of the timestamp received from Twitter.

In [22]:
print(archive.timestamp.max())
print(archive.timestamp.min())
2017-08-01 16:23:56 +0000
2015-11-15 22:32:08 +0000

These appear reasonable.

Check whether the rating data interpreted into archive is correct, based on the text of each tweet.

From a visual inspection, the denominator is almost always 10. Also, the 25%, 50%, and 75% values for rating_denominator are all 10.

In [23]:
archive.rating_denominator.describe()
Out[23]:
count    2356.000000
mean       10.455433
std         6.745237
min         0.000000
25%        10.000000
50%        10.000000
75%        10.000000
max       170.000000
Name: rating_denominator, dtype: float64
In [24]:
len(archive[archive.rating_denominator != 10])
Out[24]:
23
In [25]:
print(archive.loc[313].tweet_id)
print(archive.loc[313].text)
print(archive.loc[313].rating_numerator)
print(archive.loc[313].rating_denominator)
835246439529840640
@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho
960
0

- Denominator is not 10 for 23 tweets.

From a visual inspection, the numerator is almost always less than 16.

In [26]:
archive.rating_numerator.describe()
Out[26]:
count    2356.000000
mean       13.126486
std        45.876648
min         0.000000
25%        10.000000
50%        11.000000
75%        12.000000
max      1776.000000
Name: rating_numerator, dtype: float64
In [27]:
len(archive[archive.rating_numerator > 15])
Out[27]:
26
In [28]:
len(archive[archive.rating_numerator < 5])
Out[28]:
56
In [29]:
plt.hist(archive.rating_numerator.sort_values()[0:2330]);

- Numerator is greater than 15 (26 times).
- Investigate rows for which the numerator is less than 5 (56 times).

I suspect that dog ratings have increased over time. Dog rating by date might make for an interesting visualization.
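
A minimal sketch of that idea (assuming the timestamps are parsed first, which the Clean section does later):

by_month = archive.copy()
by_month['timestamp'] = pd.to_datetime(by_month.timestamp)
by_month.set_index('timestamp').rating_numerator.resample('M').mean().plot();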

Accuracy

There do not appear to be any glaring Accuracy issues.

Consistency

Check that the types of data in the tables make sense.

In [30]:
archive.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB

- doggo, floofer, pupper, and puppo are better represented as categories in a single column.
- timestamp and retweeted_status_timestamp should be of data type datetime.

In [31]:
images.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB

Determine whether there are few enough dog breeds represented to store them as a categorical variable.

The following are the counts of unique categorizations, regardless of whether they are dogs or not.

In [32]:
print('Unique predictions in p1: ' + str(images.p1.nunique()))
print('Unique predictions in p2: ' + str(images.p2.nunique()))
print('Unique predictions in p3: ' + str(images.p3.nunique()))
Unique predictions in p1: 378
Unique predictions in p2: 405
Unique predictions in p3: 408

Now restrict the predictions to dogs only.

In [33]:
print(images[images.p1_dog == True].p1.describe())
print('\r')
print(images[images.p2_dog == True].p2.describe())
print('\r')
print(images[images.p3_dog == True].p3.describe())
print('\r')
print(images[images.p1_dog == True].p1.value_counts(normalize=True))
count                 1532
unique                 111
top       golden_retriever
freq                   150
Name: p1, dtype: object

count                   1553
unique                   113
top       Labrador_retriever
freq                     104
Name: p2, dtype: object

count                   1499
unique                   116
top       Labrador_retriever
freq                      79
Name: p3, dtype: object

golden_retriever               0.097911
Labrador_retriever             0.065274
Pembroke                       0.058094
Chihuahua                      0.054178
pug                            0.037206
chow                           0.028721
Samoyed                        0.028068
toy_poodle                     0.025457
Pomeranian                     0.024804
malamute                       0.019582
cocker_spaniel                 0.019582
French_bulldog                 0.016971
miniature_pinscher             0.015013
Chesapeake_Bay_retriever       0.015013
Siberian_husky                 0.013055
German_shepherd                0.013055
Staffordshire_bullterrier      0.013055
Cardigan                       0.012402
beagle                         0.011749
Maltese_dog                    0.011749
Shetland_sheepdog              0.011749
Eskimo_dog                     0.011749
Shih-Tzu                       0.011097
Rottweiler                     0.011097
Lakeland_terrier               0.011097
Italian_greyhound              0.010444
kuvasz                         0.010444
West_Highland_white_terrier    0.009138
Great_Pyrenees                 0.009138
vizsla                         0.008486
                                 ...   
Afghan_hound                   0.002611
Gordon_setter                  0.002611
Mexican_hairless               0.002611
Norwich_terrier                0.002611
miniature_schnauzer            0.002611
Brabancon_griffon              0.001958
Ibizan_hound                   0.001958
Welsh_springer_spaniel         0.001958
curly-coated_retriever         0.001958
giant_schnauzer                0.001958
cairn                          0.001958
briard                         0.001958
Scottish_deerhound             0.001958
Irish_water_spaniel            0.001958
Greater_Swiss_Mountain_dog     0.001958
komondor                       0.001958
Leonberg                       0.001958
Appenzeller                    0.001305
wire-haired_fox_terrier        0.001305
black-and-tan_coonhound        0.001305
Sussex_spaniel                 0.001305
Australian_terrier             0.001305
toy_terrier                    0.001305
Scotch_terrier                 0.000653
clumber                        0.000653
groenendael                    0.000653
Japanese_spaniel               0.000653
standard_schnauzer             0.000653
EntleBucher                    0.000653
silky_terrier                  0.000653
Name: p1, Length: 111, dtype: float64

- Most popular dog breed predictions should be represented as categorical variables.

For fun, see what non-dog images are most common in the images dataset.

In [34]:
print(images[images.p1_dog == False].p1.describe())
print('\r')
print(images[images.p2_dog == False].p2.describe())
print('\r')
print(images[images.p3_dog == False].p3.describe())
print('\r')
print(images[images.p1_dog == False].p1.value_counts(normalize=True))
count           543
unique          267
top       seat_belt
freq             22
Name: p1, dtype: object

count         522
unique        292
top       doormat
freq           12
Name: p2, dtype: object

count         576
unique        292
top       doormat
freq           16
Name: p3, dtype: object

seat_belt          0.040516
web_site           0.034991
teddy              0.033149
tennis_ball        0.016575
dingo              0.016575
doormat            0.014733
bath_towel         0.012891
swing              0.012891
Siamese_cat        0.012891
tub                0.012891
hamster            0.012891
llama              0.011050
home_theater       0.011050
ice_bear           0.011050
car_mirror         0.011050
porcupine          0.009208
minivan            0.009208
shopping_cart      0.009208
ox                 0.009208
hippopotamus       0.009208
Arctic_fox         0.007366
bathtub            0.007366
jigsaw_puzzle      0.007366
goose              0.007366
patio              0.007366
barrow             0.007366
wombat             0.007366
brown_bear         0.007366
hog                0.007366
bow_tie            0.007366
                     ...   
pedestal           0.001842
syringe            0.001842
Egyptian_cat       0.001842
pencil_box         0.001842
mud_turtle         0.001842
fountain           0.001842
bonnet             0.001842
pole               0.001842
limousine          0.001842
African_grey       0.001842
lynx               0.001842
picket_fence       0.001842
lorikeet           0.001842
crane              0.001842
rotisserie         0.001842
pool_table         0.001842
piggy_bank         0.001842
bald_eagle         0.001842
bookcase           0.001842
sliding_door       0.001842
otter              0.001842
washer             0.001842
harp               0.001842
minibus            0.001842
tailed_frog        0.001842
china_cabinet      0.001842
ice_lolly          0.001842
sandbar            0.001842
radio_telescope    0.001842
leopard            0.001842
Name: p1, Length: 267, dtype: float64

- Most popular non-dog predictions should be represented as categorical variables.

It would be interesting to see which dog breeds are most commonly mistaken for objects, and what objects they are typically mistaken for.
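
A sketch of how that could be checked: take images whose top prediction is an object but whose second prediction is a dog, and tabulate the breed/object pairs:

mistaken = images[(images.p1_dog == False) & (images.p2_dog == True)]
mistaken.groupby(['p2', 'p1']).size().sort_values(ascending=False).head(10)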

- Classify each image as having been predicted to be Dog (if all three predictions were dogs), Mixed (if the predictions included both dog and non-dog predictions), or Not Dog (if all three predictions were not dogs).

In [35]:
json_info.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2345 entries, 0 to 2344
Data columns (total 3 columns):
tweet_id     2345 non-null object
favorites    2345 non-null object
retweets     2345 non-null object
dtypes: object(3)
memory usage: 55.0+ KB

- All data columns in json_info can be better represented as int64s.

Tidiness

Ensure there are no duplicate tweet_ids across all dataframes.

In [36]:
print('  archive tweet_id duplicate values = ' + str(archive.tweet_id.duplicated().sum()))
print('   images tweet_id duplicate values = ' + str(images.tweet_id.duplicated().sum()))
print('json_info tweet_id duplicate values = ' + str(json_info.tweet_id.duplicated().sum()))
  archive tweet_id duplicate values = 0
   images tweet_id duplicate values = 0
json_info tweet_id duplicate values = 0

Check whether all images are unique.

In [37]:
print('Unique images in images = ' + str(images.jpg_url.nunique()))
print('Unique tweets in images = ' + str(images.tweet_id.nunique()))
Unique images in images = 2009
Unique tweets in images = 2075
In [38]:
(images.jpg_url.value_counts() == 2).sum()
Out[38]:
66
In [39]:
print(images[images.jpg_url.duplicated(keep=False)].p1_conf.nunique())
print(images[images.jpg_url.duplicated(keep=False)].p2_conf.nunique())
print(images[images.jpg_url.duplicated(keep=False)].p3_conf.nunique())
66
66
66

66 jpg_urls appear twice in images, meaning they are included in more than one tweet. Investigation shows that these tweets can have different names and ratings associated with the same image.

But the immediately preceding code cell indicates that the neural network produced identical results for the duplicate and original photos, as would be expected.

The best way to handle this is to recognize that these datasets really describe two different observational units, tweets and images, so two tables are required. This remedies the fact that there are currently multiple rows belonging to the same observation.
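
For reference, the duplicated image rows can be pulled up side by side like this:

dupes = images[images.jpg_url.duplicated(keep=False)].sort_values('jpg_url')
dupes[['tweet_id', 'jpg_url', 'p1', 'p1_conf']].head(6)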

In [40]:
images.head()
Out[40]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True

- archive and json_info can be consolidated into a single table whose observational units are tweets. images can be left independent, because its observational units are images.
- Classify each image according to its three predictions:

  • "Dog" - all three predictions are dogs.
  • "Mixed" - the predictions include both dogs and non-dogs.
  • "Not Dog" - all three predictions are non-dogs.

Summary

Assess Quality

Completeness

- Tables do not contain the same number of entries.
- Some tweet_ids contained in images are not included in json_info.

Validity

- Denominator is not 10 for 23 tweets.
- Numerator is greater than 15 (26 times) or less than 5 (56 times).

Accuracy

- *No issues of note.*

Consistency

- doggo, floofer, pupper, and puppo are better represented as categories in a single column.
- timestamp and retweeted_status_timestamp should be of data type datetime.
- Most popular dog breed predictions should be represented as categorical variables.
- Most popular non-dog predictions should be represented as categorical variables.
- All data columns in json_info can be better represented as int64s.

Assess Tidiness

- Classify each image as having been predicted to be Dog (if all three predictions were dogs), Mixed (if the predictions included both dog and non-dog predictions), or Not Dog (if all three predictions were not dogs).
- archive and json_info can be consolidated into a single table for which the observational units are tweets. images can be left as-is, because the observational units are images.
- Rewrite the prediction confidences to be 2-digit integers expressed as percentages.

Clean

This section details the programmatic correction of the issues discovered in the Assess section.

In [41]:
archive_clean = archive.copy()
images_clean = images.copy()
json_clean = json_info.copy()

Quality

Define

- Determine the subset of tweets common to archive, images, and json_info.
- Drop the rows that aren't part of that subset.

Code

In [42]:
print('archive_clean tweet count = ' + str(len(archive_clean)))
print(' images_clean tweet count = ' + str(len(images_clean)))
print('   json_clean tweet count = ' + str(len(json_clean)))
archive_clean tweet count = 2356
 images_clean tweet count = 2075
   json_clean tweet count = 2345
In [43]:
# Reduce set of tweets_to_keep to the set in images_clean
tweets_to_keep = set(images_clean.tweet_id)

json_clean = json_clean[json_clean['tweet_id'].isin(tweets_to_keep)]

# Reduce set of tweets_to_keep to the intersection of the images_clean tweets and the json_clean tweets
tweets_to_keep = set(json_clean.tweet_id)

archive_clean = archive_clean[archive_clean['tweet_id'].isin(tweets_to_keep)]
images_clean = images_clean[images_clean['tweet_id'].isin(tweets_to_keep)]

Test

In [44]:
print('     tweets to keep count = ' + str(len(tweets_to_keep)))
print('\r')
print('archive_clean tweet count = ' + str(len(archive_clean)))
print(' images_clean tweet count = ' + str(len(images_clean)))
print('   json_clean tweet count = ' + str(len(json_clean)))
     tweets to keep count = 2069

archive_clean tweet count = 2069
 images_clean tweet count = 2069
   json_clean tweet count = 2069

Define

- For all tables, remove tweets that do not contain the string '/10' in the tweet text.

Code

In [45]:
print("    Count of tweets without '/10' in the text = " + \
      str(archive_clean[~archive_clean.text.str.contains('/10')].tweet_id.count()))
print("Count of tweets without 10 rating_denominator = " + \
      str(archive_clean[archive_clean.rating_denominator != 10].tweet_id.count()))
    Count of tweets without '/10' in the text = 13
Count of tweets without 10 rating_denominator = 18
In [46]:
tweets_to_remove = set(archive_clean[~archive_clean.text.str.contains('/10')].tweet_id)

json_clean = json_clean[~json_clean['tweet_id'].isin(tweets_to_remove)]
archive_clean = archive_clean[~archive_clean['tweet_id'].isin(tweets_to_remove)]
images_clean = images_clean[~images_clean['tweet_id'].isin(tweets_to_remove)]

Test

In [47]:
print('archive_clean tweet count = ' + str(len(archive_clean)))
print(' images_clean tweet count = ' + str(len(images_clean)))
print('   json_clean tweet count = ' + str(len(json_clean)))
archive_clean tweet count = 2056
 images_clean tweet count = 2056
   json_clean tweet count = 2056
In [48]:
print("    Count of tweets without '/10' in the text = " +\
      str(archive_clean[~archive_clean.text.str.contains('/10')].tweet_id.count()))
print("Count of tweets without 10 rating_denominator = " +\
      str(archive_clean[archive_clean.rating_denominator != 10].tweet_id.count()))
    Count of tweets without '/10' in the text = 0
Count of tweets without 10 rating_denominator = 5

Define

- Set the rating_denominator to 10 for the 5 tweets still in archive with an incorrect denominator.

Code

In [49]:
print(list(archive_clean[archive_clean.rating_denominator != 10].text))
['After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ', 'Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a', 'This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq', 'This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5', 'This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv']
In [50]:
indexes_to_repair = set(archive_clean[archive_clean.rating_denominator != 10].index)
for i in indexes_to_repair:
    archive_clean.at[i, 'rating_denominator'] = 10

Test

In [51]:
print("Count of tweets without 10 rating_denominator = " +\
      str(archive_clean[archive_clean.rating_denominator != 10].tweet_id.count()))
Count of tweets without 10 rating_denominator = 0

Define

- Check all tweets to see whether the rating_numerator matches the digits in front of '/10'. Where it doesn't, investigate. If invalid, remove. If valid, replace.

Code

In [52]:
cannot_parse = set()
incorrect = set()

for i in archive_clean.index:
    index = int(archive_clean.loc[i].text.find('/10'))
    try:
        numerator = int(archive_clean.loc[i].text[index-2:index].strip())
    except:
        cannot_parse.add(i)
        continue
    if numerator != archive_clean.loc[i].rating_numerator:
        incorrect.add(i)

print('    Count of indexes this code cannot parse =  ' + str(len(cannot_parse)))
print('Count of incorrect rating_numerator indexes =  ' + str(len(incorrect)))
    Count of indexes this code cannot parse =  7
Count of incorrect rating_numerator indexes =  8
In [53]:
# Indexes to manually check because they cannot be parsed or they are incorrect in the table.

for i in cannot_parse.union(incorrect):
    print(str(i) + ' - ' + str(archive_clean.loc[i].text))
1025 - This is an Iraqi Speed Kangaroo. It is not a dog. Please only send in dogs. I'm very angry with all of you ...9/10 https://t.co/5qpBTTpgUt
2246 - This is Tedrick. He lives on the edge. Needs someone to hit the gas tho. Other than that he's a baller. 10&amp;2/10 https://t.co/LvP1TTYSCN
1610 - For the last time, WE. DO. NOT. RATE. BULBASAUR. We only rate dogs. Please only send dogs. Thank you ...9/10 https://t.co/GboDG8WhJG
1068 - After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ
45 - This is Bella. She hopes her smile made you smile. If not, she is also offering you her favorite monkey. 13.5/10 https://t.co/qjrljjt948
1165 - Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a
1202 - This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq
979 - This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh
2260 - RT @dogratingrating: Unoriginal idea. Blatant plagiarism. Curious grammar. -5/10 https://t.co/r7XzeQZWzb
1653 - "Hello forest pupper I am house pupper welcome to my abode" (8/10 for both) https://t.co/qFD8217fUT
2074 - After so many requests... here you go.

Good dogg. 420/10 https://t.co/yfAAo1gdeY
1435 - Please stop sending in saber-toothed tigers. This is getting ridiculous. We only rate dogs.
...8/10 https://t.co/iAeQNueou8
1372 - I know it's tempting, but please stop sending in pics of Donald Trump. Thank you ...9/10 https://t.co/y35Y1TJERY
1662 - This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5
2335 - This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv
In [54]:
# Drop tweets at the following indexes:
# 2246- "10&2/10", 
# 45- ignore fractional rating, 
# 979- "1776/10",
# 2260- negative rating
# 1653- deer and a dog,
# 2074 - Snoop Dogg

tweets_to_remove = set()
for i in [2246, 45, 979, 2260, 1653, 2074]:
    tweets_to_remove.add(archive_clean.loc[i].tweet_id)

json_clean = json_clean[~json_clean['tweet_id'].isin(tweets_to_remove)]
archive_clean = archive_clean[~archive_clean['tweet_id'].isin(tweets_to_remove)]
images_clean = images_clean[~images_clean['tweet_id'].isin(tweets_to_remove)]
In [55]:
cannot_parse = set()
incorrect = set()

for i in archive_clean.index:
    index = int(archive_clean.loc[i].text.find('/10'))
    try:
        numerator = int(archive_clean.loc[i].text[index-2:index].strip('.').strip())
    except:
        cannot_parse.add(i)
        continue
    if numerator != archive_clean.loc[i].rating_numerator:
        archive_clean.at[i, 'rating_numerator'] = numerator

Test

In [56]:
cannot_parse = set()
incorrect = set()

for i in archive_clean.index:
    index = int(archive_clean.loc[i].text.find('/10'))
    try:
        numerator = int(archive_clean.loc[i].text[index-2:index].strip('.').strip())
    except:
        cannot_parse.add(i)
        continue
    if numerator != archive_clean.loc[i].rating_numerator:
        incorrect.add(i)

print('    Count of indexes this code cannot parse =  ' + str(len(cannot_parse)))
print('Count of incorrect rating_numerator indexes =  ' + str(len(incorrect)))
    Count of indexes this code cannot parse =  0
Count of incorrect rating_numerator indexes =  0

Define

- Investigate high and low rating_numerators. Remove values that appear invalid.

Code

In [57]:
low_numerators = set(archive_clean[archive_clean.rating_numerator < 5].index)
high_numerators = set(archive_clean[archive_clean.rating_numerator > 15].index)

print('archive_clean tweet count = ' + str(len(archive_clean)))
print(' images_clean tweet count = ' + str(len(images_clean)))
print('   json_clean tweet count = ' + str(len(json_clean)))
print('\r')
print("Count of low_numerators = " + str(len(low_numerators)))
print("Count of high_numerators = " + str(len(high_numerators)))
archive_clean tweet count = 2050
 images_clean tweet count = 2050
   json_clean tweet count = 2050

Count of low_numerators = 48
Count of high_numerators = 3
In [58]:
string = ""
for i in low_numerators:
    index = int(archive_clean.loc[i].text.find('/10'))
    string += str(i) + ": '" + archive_clean.loc[i].text[index-2:index+3] + "'\t\t"
string += '\n\n'
for i in high_numerators:
    index = int(archive_clean.loc[i].text.find('/10'))
    string += str(i) + ": '" + archive_clean.loc[i].text[index-6:index+3] + "'\t" 
print(string)
1920: ' 2/10'		2305: ' 3/10'		2310: ' 2/10'		2183: ' 3/10'		1928: ' 3/10'		2186: ' 4/10'		2316: ' 4/10'		912: ' 4/10'		1938: ' 3/10'		1941: ' 4/10'		2070: ' 4/10'		1303: ' 4/10'		2326: ' 2/10'		2202: ' 3/10'		1947: ' 3/10'		1692: ' 3/10'		2076: ' 4/10'		2334: ' 3/10'		2079: ' 2/10'		1314: ' 3/10'		2338: ' 1/10'		1189: ' 3/10'		1701: ' 4/10'		2091: ' 1/10'		1836: ' 3/10'		2349: ' 2/10'		2222: ' 4/10'		1459: ' 4/10'		315: ' 0/10'		2237: ' 2/10'		2239: ' 3/10'		1601: ' 3/10'		1219: ' 4/10'		1478: ' 3/10'		1869: ' 1/10'		2261: ' 1/10'		2136: ' 3/10'		1629: ' 4/10'		1249: ' 3/10'		1761: ' 2/10'		1764: ' 2/10'		1898: ' 3/10'		1004: ' 4/10'		2288: ' 4/10'		883: ' 4/10'		1016: ' 0/10'		765: ' 3/10'		1406: ' 3/10'		

1712: ' 11.26/10'	763: ' 11.27/10'	695: 'f 9.75/10'	

Low numerators appear to be valid.

High numerators are not valid, and decimal numerators cause more trouble than they're worth. Drop them.

In [59]:
tweets_to_remove = set()
for i in high_numerators:
    tweets_to_remove.add(archive_clean.loc[i].tweet_id)

json_clean = json_clean[~json_clean['tweet_id'].isin(tweets_to_remove)]
archive_clean = archive_clean[~archive_clean['tweet_id'].isin(tweets_to_remove)]
images_clean = images_clean[~images_clean['tweet_id'].isin(tweets_to_remove)]

Test

In [60]:
print('     High numerator count = ' + str(len(set(archive_clean[archive_clean.rating_numerator > 15].index))))
print('\r')
print('archive_clean tweet count = ' + str(len(archive_clean)))
print(' images_clean tweet count = ' + str(len(images_clean)))
print('   json_clean tweet count = ' + str(len(json_clean)))
     High numerator count = 0

archive_clean tweet count = 2047
 images_clean tweet count = 2047
   json_clean tweet count = 2047

Define

- Ensure no tweet is tagged with more than one of doggo, floofer, pupper, and puppo.
- Create a fifth column, none, to indicate that a tweet is not classified by dog stage.

Code

In [61]:
archive_clean.doggo = archive_clean.doggo.replace('None', 0)
archive_clean.doggo = archive_clean.doggo.replace('doggo', 1)
archive_clean.floofer = archive_clean.floofer.replace('None', 0)
archive_clean.floofer = archive_clean.floofer.replace('floofer', 1)
archive_clean.pupper = archive_clean.pupper.replace('None', 0)
archive_clean.pupper = archive_clean.pupper.replace('pupper', 1)
archive_clean.puppo = archive_clean.puppo.replace('None', 0)
archive_clean.puppo = archive_clean.puppo.replace('puppo', 1)

archive_clean['none'] = 1 - (archive_clean.doggo + archive_clean.floofer + archive_clean.pupper + archive_clean.puppo)
print('Duplicate stage categories = ' + str(archive_clean[archive_clean.none == -1].tweet_id.count()))
Duplicate stage categories = 13

Investigation shows that many of the tweets with duplicate stages contain multiple dogs. Remove all of these, as they introduce too many complications for the scope of this project.
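
Those rows can be inspected directly:

archive_clean[archive_clean.none == -1][['tweet_id', 'text']].head()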

In [62]:
duplicate_categories = set(archive_clean[archive_clean.none == -1].tweet_id)

json_clean = json_clean[~json_clean['tweet_id'].isin(duplicate_categories)]
archive_clean = archive_clean[~archive_clean['tweet_id'].isin(duplicate_categories)]
images_clean = images_clean[~images_clean['tweet_id'].isin(duplicate_categories)]

Test

In [63]:
print('Duplicate stage categories = ' + str(archive_clean[archive_clean.none == -1].tweet_id.count()))
print('\r')
print('archive_clean tweet count = ' + str(len(archive_clean)))
print(' images_clean tweet count = ' + str(len(images_clean)))
print('   json_clean tweet count = ' + str(len(json_clean)))
Duplicate stage categories = 0

archive_clean tweet count = 2034
 images_clean tweet count = 2034
   json_clean tweet count = 2034

Define

- Convert doggo, floofer, pupper, puppo, and none to categories in a single column.

Code

In [64]:
values = ['doggo', 'floofer', 'pupper', 'puppo', 'none']

ids = [x for x in list(archive_clean.columns) if x not in values]

print(archive_clean.shape)
print(archive_clean.columns)
(2034, 18)
Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo',
       'none'],
      dtype='object')
In [65]:
archive_clean = pd.melt(archive_clean, id_vars = ids, value_vars = values, var_name='stage')

print(archive_clean.shape)
(10170, 15)
In [66]:
archive_clean = archive_clean[archive_clean.value == 1]
archive_clean.drop('value', axis=1, inplace=True)
archive_clean.reset_index(drop=True, inplace=True);
archive_clean.stage = archive_clean.stage.astype('category')

Test

In [67]:
print(archive_clean.shape)
print(archive_clean.columns)
archive_clean.stage.value_counts()
(2034, 14)
Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'stage'],
      dtype='object')
Out[67]:
none       1728
pupper      209
doggo        67
puppo        23
floofer       7
Name: stage, dtype: int64

Define

- Convert timestamp and retweeted_status_timestamp to data type datetime.

Code

In [68]:
archive_clean.timestamp = pd.to_datetime(archive_clean.timestamp)
archive_clean.retweeted_status_timestamp = pd.to_datetime(archive_clean.retweeted_status_timestamp)

Test

In [69]:
archive_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2034 entries, 0 to 2033
Data columns (total 14 columns):
tweet_id                      2034 non-null int64
in_reply_to_status_id         21 non-null float64
in_reply_to_user_id           21 non-null float64
timestamp                     2034 non-null datetime64[ns]
source                        2034 non-null object
text                          2034 non-null object
retweeted_status_id           72 non-null float64
retweeted_status_user_id      72 non-null float64
retweeted_status_timestamp    72 non-null datetime64[ns]
expanded_urls                 2034 non-null object
rating_numerator              2034 non-null int64
rating_denominator            2034 non-null int64
name                          2034 non-null object
stage                         2034 non-null category
dtypes: category(1), datetime64[ns](2), float64(4), int64(3), object(4)
memory usage: 208.8+ KB

Define

- Convert most popular dog breed predictions to categories.
- Convert most popular non-dog predictions to categories.

Code

In [70]:
ps = ['p1', 'p2', 'p3']

string = '    Dog tweet count for P1 / P2 / P3: \t\t\t'
for p in ps:
    string += str(images_clean[images_clean[p + '_dog']==True].tweet_id.count()) + ' / '

string += '\nNot dog tweet count for P1 / P2 / P3: \t\t\t'
for p in ps:
    string += str(images_clean[images_clean[p + '_dog']==False].tweet_id.count()) + ' / '

string += '\n    Dog classification count for P1 / P2 / P3: \t\t'
for p in ps:
    string += str(images_clean[images_clean[p + '_dog']==True][p].nunique()) + ' / '

string += '\nNot dog classification count for P1 / P2 / P3: \t\t'
for p in ps:
    string += str(images_clean[images_clean[p + '_dog']==False][p].nunique()) + ' / '
    
print(string)
    Dog tweet count for P1 / P2 / P3: 			1501 / 1522 / 1469 / 
Not dog tweet count for P1 / P2 / P3: 			533 / 512 / 565 / 
    Dog classification count for P1 / P2 / P3: 		110 / 113 / 116 / 
Not dog classification count for P1 / P2 / P3: 		266 / 287 / 289 / 
In [71]:
ps = ['p1', 'p2', 'p3']
dog_labels = set(images_clean[images_clean['p1_dog']==True].p1.value_counts()[0:75].index)
item_labels = set(images_clean[images_clean['p1_dog']==False].p1.value_counts()[0:25].index)

for i in images_clean.index:
    for p in ps:
        if images_clean.loc[i,p+'_dog']:
            if images_clean.loc[i,p] not in dog_labels:
                images_clean.loc[i,p] = 'other_dog'
            else:
                continue
        else:
            if images_clean.loc[i,p] not in item_labels:
                images_clean.loc[i,p] = 'other_item'
            else:
                continue

images_clean.p1 = images_clean.p1.astype('category')
images_clean.p2 = images_clean.p2.astype('category')
images_clean.p3 = images_clean.p3.astype('category')

Test

In [72]:
images_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2034 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2034 non-null int64
jpg_url     2034 non-null object
img_num     2034 non-null int64
p1          2034 non-null category
p1_conf     2034 non-null float64
p1_dog      2034 non-null bool
p2          2034 non-null category
p2_conf     2034 non-null float64
p2_dog      2034 non-null bool
p3          2034 non-null category
p3_conf     2034 non-null float64
p3_dog      2034 non-null bool
dtypes: bool(3), category(3), float64(3), int64(2), object(1)
memory usage: 218.0+ KB

Define

- Convert all data columns in json_info (and its filtered working copy, json_clean) to int64s.

Code

In [73]:
json_info.tweet_id = json_info.tweet_id.astype(int)
json_info.favorites = json_info.favorites.astype(int)
json_info.retweets = json_info.retweets.astype(int)

# json_clean is the filtered working copy used downstream, so convert it as well
json_clean = json_clean.astype('int64')

Test

In [74]:
json_info.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2345 entries, 0 to 2344
Data columns (total 3 columns):
tweet_id     2345 non-null int64
favorites    2345 non-null int64
retweets     2345 non-null int64
dtypes: int64(3)
memory usage: 55.0 KB

Tidiness

Define

- Classify each image as having been predicted to be Dog (if all three predictions were dogs), Mixed (if the predictions included both dog and non-dog predictions), or Not Dog (if all three predictions were not dogs).

Code

In [75]:
ps = ['p1_dog', 'p2_dog', 'p3_dog']

for p in ps:
    images_clean[p] = images_clean[p].astype(int)

images_clean['prediction'] = images_clean.p1_dog + images_clean.p2_dog + images_clean.p3_dog
In [76]:
images_clean['prediction'] = images_clean['prediction'].replace(
    {3: 'Dog', 2: 'Mixed', 1: 'Mixed', 0: 'Not Dog'})

Test

In [77]:
images_clean.prediction.value_counts()
Out[77]:
Dog        1217
Mixed       499
Not Dog     318
Name: prediction, dtype: int64

Define

- Consolidate archive and json_info into a single table for which the observational units are tweets.

Code

In [78]:
print(archive_clean.shape)
print(json_clean.shape)
print(json_clean.columns)
(2034, 14)
(2034, 3)
Index(['tweet_id', 'favorites', 'retweets'], dtype='object')
In [79]:
tweets_clean = pd.merge(archive_clean, json_clean, on='tweet_id')

Test

In [80]:
tweets_clean.shape
Out[80]:
(2034, 16)
In [81]:
print(archive_clean[archive_clean['tweet_id']==890240255349198849].name)
print(json_clean[json_clean['tweet_id']==890240255349198849].favorites)
print(tweets_clean[tweets_clean['tweet_id']==890240255349198849].name)
print(tweets_clean[tweets_clean['tweet_id']==890240255349198849].favorites)
0    Cassie
Name: name, dtype: object
9    32077
Name: favorites, dtype: int64
0    Cassie
Name: name, dtype: object
0    32077
Name: favorites, dtype: int64

Define

- Final cleanup: reorder columns and delete unnecessary columns from tweets_clean.

Code

In [82]:
tweets_clean.shape
Out[82]:
(2034, 16)

Reorder columns.

In [83]:
cols = tweets_clean.columns.tolist()
print(cols)
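# New order: tweet_id, rating_numerator, rating_denominator, stage, favorites,
# retweets, name, text, timestamp, then the reply/retweet/source/url columns
# (which are dropped in the next cell)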
cols = cols[0:1] + cols[10:12] + cols[13:] + cols[12:13] + cols[5:6] + cols[3:4] + cols[1:3] + cols[4:5] + cols[6:10]
tweets_clean = tweets_clean[cols]
['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp', 'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator', 'rating_denominator', 'name', 'stage', 'favorites', 'retweets']

Drop columns that aren't useful for subsequent analysis.

In [84]:
tweets_clean.drop(cols[9:], axis=1, inplace=True)

Since rating_denominator is 10 for all tweets, it can be dropped.

In [85]:
tweets_clean.drop('rating_denominator', axis=1, inplace=True)
tweets_clean.rename(index=str, columns={'rating_numerator': 'rating'}, inplace=True)

Test

In [86]:
print(tweets_clean.shape)
tweets_clean.head(5)
(2034, 8)
Out[86]:
tweet_id rating stage favorites retweets name text timestamp
0 890240255349198849 14 doggo 32077 7526 Cassie This is Cassie. She is a college pup. Studying... 2017-07-26 15:59:51
1 884162670584377345 12 doggo 20472 3035 Yogi Meet Yogi. He doesn't have any important dog m... 2017-07-09 21:29:42
2 872967104147763200 12 doggo 27600 5521 None Here's a very large dog. He has a date later. ... 2017-06-09 00:02:31
3 871515927908634625 12 doggo 20432 3544 Napolean This is Napolean. He's a Raggedy East Nicaragu... 2017-06-04 23:56:03
4 869596645499047938 12 doggo 16225 3238 Scout This is Scout. He just graduated. Officially a... 2017-05-30 16:49:31

Define

- Final cleanup: reorder columns and delete unnecessary columns from images_clean.

Code

Drop columns that aren't useful.

In [87]:
for p in ps:
    images_clean.drop(p, axis=1, inplace=True)

images_clean.drop('img_num', axis=1, inplace=True)

Reorder remaining columns.

In [88]:
cols = images_clean.columns.tolist()
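# New order: tweet_id, prediction, p1, p1_conf, p2, p2_conf, p3, p3_conf, jpg_url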
cols = cols[:1] + cols[-1:] + cols[2:-1] + cols[1:2]
images_clean = images_clean[cols]

Test

In [89]:
images_clean.tail(5)
Out[89]:
tweet_id prediction p1 p1_conf p2 p2_conf p3 p3_conf jpg_url
2070 891327558926688256 Dog basset 0.555712 English_springer 0.225770 German_short-haired_pointer 0.175219 https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg
2071 891689557279858688 Mixed other_item 0.170278 Labrador_retriever 0.168086 other_item 0.040836 https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg
2072 891815181378084864 Dog Chihuahua 0.716012 malamute 0.078253 kelpie 0.031379 https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg
2073 892177421306343426 Dog Chihuahua 0.323581 Pekinese 0.090647 papillon 0.068957 https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg
2074 892420643555336193 Not Dog other_item 0.097049 other_item 0.085851 other_item 0.076110 https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg

Define

- Final cleanup: convert the prediction confidences to two-digit integer percentages.

The extra digits are not particularly useful.

Code

In [90]:
ps = ['p1_conf', 'p2_conf', 'p3_conf']

for p in ps:
    images_clean[p] = round(images_clean[p]*100).astype(int)

Test

In [91]:
images_clean.sample(5)
Out[91]:
tweet_id prediction p1 p1_conf p2 p2_conf p3 p3_conf jpg_url
44 666781792255496192 Dog Italian_greyhound 62 other_dog 15 vizsla 9 https://pbs.twimg.com/media/CUDigRXXIAATI_H.jpg
1228 745422732645535745 Mixed Labrador_retriever 66 golden_retriever 31 ice_bear 0 https://pbs.twimg.com/media/ClhGBCAWIAAFCsz.jpg
1305 753375668877008896 Mixed other_dog 36 other_item 13 other_item 10 https://pbs.twimg.com/media/CnSHLFeWgAAwV-I.jpg
1017 709918798883774466 Dog Pembroke 96 Cardigan 2 Chihuahua 1 https://pbs.twimg.com/media/CdojYQmW8AApv4h.jpg
1954 864197398364647424 Dog golden_retriever 95 Labrador_retriever 2 Tibetan_mastiff 2 https://pbs.twimg.com/media/C_4-8iPV0AA1Twg.jpg

Final Format

In [92]:
# print(list(tweets_clean.sample(5).tweet_id))
In [93]:
id_sample = [780192070812196864, 693647888581312512, 786363235746385920, 743980027717509120, 711732680602345472]
In [94]:
tweets_clean[tweets_clean.tweet_id.isin(id_sample)].sort_values(by='tweet_id')
Out[94]:
tweet_id rating stage favorites retweets name text timestamp
1313 693647888581312512 7 none 2901 653 None What kind of person sends in a pic without a d... 2016-01-31 04:11:58
1146 711732680602345472 10 none 9550 4530 None I want to hear the joke this dog was just told... 2016-03-21 01:54:29
998 743980027717509120 11 none 4473 1203 Geno This is Geno. He's a Wrinkled Baklavian Velvee... 2016-06-18 01:33:55
795 780192070812196864 11 none 9545 2529 None We only rate dogs. Pls stop sending in non-can... 2016-09-25 23:47:39
40 786363235746385920 13 doggo 11976 3971 Rizzo This is Rizzo. He has many talents. A true ren... 2016-10-13 00:29:39
In [95]:
images_clean[images_clean.tweet_id.isin(id_sample)].sort_values(by='tweet_id')
Out[95]:
tweet_id prediction p1 p1_conf p2 p2_conf p3 p3_conf jpg_url
832 693647888581312512 Not Dog other_item 27 doormat 17 bathtub 7 https://pbs.twimg.com/media/CaBVE80WAAA8sGk.jpg
1034 711732680602345472 Mixed dingo 37 other_dog 33 Eskimo_dog 7 https://pbs.twimg.com/media/CeCVGEbUYAASeY4.jpg
1220 743980027717509120 Dog bull_mastiff 98 other_dog 1 pug 1 https://pbs.twimg.com/media/ClMl4VLUYAA5qBb.jpg
1473 780192070812196864 Mixed vizsla 14 other_item 9 other_item 7 https://pbs.twimg.com/media/CtPMhwvXYAIt6NG.jpg
1512 786363235746385920 Dog golden_retriever 93 Labrador_retriever 6 other_dog 0 https://pbs.twimg.com/media/Cum5LlfWAAAyPcS.jpg
In [96]:
tweets_clean.head()
Out[96]:
tweet_id rating stage favorites retweets name text timestamp
0 890240255349198849 14 doggo 32077 7526 Cassie This is Cassie. She is a college pup. Studying... 2017-07-26 15:59:51
1 884162670584377345 12 doggo 20472 3035 Yogi Meet Yogi. He doesn't have any important dog m... 2017-07-09 21:29:42
2 872967104147763200 12 doggo 27600 5521 None Here's a very large dog. He has a date later. ... 2017-06-09 00:02:31
3 871515927908634625 12 doggo 20432 3544 Napolean This is Napolean. He's a Raggedy East Nicaragu... 2017-06-04 23:56:03
4 869596645499047938 12 doggo 16225 3238 Scout This is Scout. He just graduated. Officially a... 2017-05-30 16:49:31

Store

In [97]:
# Note that the default to_csv call also writes the DataFrame index as an
# unnamed first column.
tweets_clean.to_csv('twitter_archive_master.csv')
images_clean.to_csv('image_archive_master.csv')
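
If the master files are reloaded in a later session, parse_dates restores the datetime dtype that the plotting functions below rely on (a sketch, not part of the original notebook):

tweets_clean = pd.read_csv('twitter_archive_master.csv',
                           index_col=0, parse_dates=['timestamp'])
images_clean = pd.read_csv('image_archive_master.csv', index_col=0)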

Analyzing and Visualizing

Function Definitions

In [98]:
def plot_favorites_per_tweet_by_day():
    # Trim tweets outside the 5th-95th percentile of favorites so that
    # outliers do not dominate the scatter or the fit.
    q = tweets_clean.favorites.quantile([0.05, 0.95])

    g1_subset = tweets_clean
    g1_subset = g1_subset.drop(tweets_clean[tweets_clean['favorites'] < q[0.05]].index)
    g1_subset = g1_subset.drop(tweets_clean[tweets_clean['favorites'] > q[0.95]].index)
    
    fig, ax = plt.subplots(1, 1, figsize=(16, 9))
    
    ax.spines['top'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)

    # Approximate days since the first tweet: 2015-11-15 is day 319 of
    # 2015, and leap days are ignored.
    x = g1_subset['timestamp'].dt.dayofyear + \
        (g1_subset['timestamp'].dt.year-2015)*365-319
    y = g1_subset['favorites']

    plt.scatter(x, y);

    # Fit and overlay a linear (degree-1) trend line.
    z = np.polyfit(x, y, 1)
    p = np.poly1d(z)
    plt.plot(x,p(x),"r--")

    label = "faves = %.0f * days + %.0f"%(z[0],z[1])

    plt.text(625, 21500, label, fontsize=18, color='red')
    plt.xticks(fontsize=18)
    plt.yticks(fontsize=18)
    plt.xlabel('Days Since First Tweet', fontsize=18)
    plt.ylabel('Favorites per Tweet', fontsize=18)
    # fig.suptitle('Tweet Favorites by Days Since First Tweet', fontsize=24, ha='center');
In [99]:
def plot_retweets_per_tweet_by_day():
    # Same 5th-95th percentile trim as above, applied to retweets.
    q = tweets_clean.retweets.quantile([0.05, 0.95])

    g2_subset = tweets_clean
    g2_subset = g2_subset.drop(tweets_clean[tweets_clean['retweets'] < q[0.05]].index)
    g2_subset = g2_subset.drop(tweets_clean[tweets_clean['retweets'] > q[0.95]].index)

    fig, ax = plt.subplots(1, 1, figsize=(16, 9))

    ax.spines['top'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)

    # Same day-offset approximation as in the favorites plot.
    x = g2_subset['timestamp'].dt.dayofyear + \
        (g2_subset['timestamp'].dt.year-2015)*365-319
    y = g2_subset['retweets']

    plt.scatter(x, y);

    z = np.polyfit(x, y, 1)
    p = np.poly1d(z)
    plt.plot(x,p(x),"r--")

    label = "retweets = %.0f * days + %.0f"%(z[0],z[1])

    plt.text(630, 5220, label, fontsize=18, color='red')
    plt.xticks(fontsize=18)
    plt.yticks(fontsize=18)
    plt.xlabel('Days Since First Tweet', fontsize=18)
    plt.ylabel('Retweets per Tweet', fontsize=18)
    # fig.suptitle('Retweets by Days Since First Tweet', fontsize=24, ha='center');
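
The two scatter functions above differ only in the metric plotted. A single parameterized helper (plot_metric_per_tweet_by_day and text_xy are hypothetical names, not from the original notebook) would remove the duplication and compute the day offset exactly:

def plot_metric_per_tweet_by_day(metric, text_xy):
    # Trim the 5% tails of the metric, as in the functions above.
    q = tweets_clean[metric].quantile([0.05, 0.95])
    subset = tweets_clean[(tweets_clean[metric] >= q[0.05]) &
                          (tweets_clean[metric] <= q[0.95])]

    fig, ax = plt.subplots(1, 1, figsize=(16, 9))
    for side in ['top', 'bottom', 'left', 'right']:
        ax.spines[side].set_visible(False)

    # Exact days since the earliest tweet, with no leap-year approximation.
    x = (subset['timestamp'] - tweets_clean['timestamp'].min()).dt.days
    y = subset[metric]
    plt.scatter(x, y)

    # Linear trend line, as before.
    z = np.polyfit(x, y, 1)
    plt.plot(x, np.poly1d(z)(x), 'r--')
    plt.text(*text_xy, '%s = %.0f * days + %.0f' % (metric, z[0], z[1]),
             fontsize=18, color='red')

    plt.xticks(fontsize=18)
    plt.yticks(fontsize=18)
    plt.xlabel('Days Since First Tweet', fontsize=18)
    plt.ylabel('%s per Tweet' % metric.title(), fontsize=18)

Calling plot_metric_per_tweet_by_day('favorites', (625, 21500)) would then reproduce the first plot.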
In [100]:
def plot_tweets_per_breed():
    # Keep images classified 'Dog' whose top prediction exceeds 50% confidence.
    hi_conf_breeds = images_clean[images_clean['prediction'] == 'Dog']
    hi_conf_breeds = hi_conf_breeds[hi_conf_breeds['p1_conf'] > 50]

    g3_data = hi_conf_breeds.p1.value_counts()[0:11].sort_values(axis=0, ascending=False)

    fig, ax = plt.subplots(1, 1, figsize=(16, 9))

    ax.spines['top'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)

    x = g3_data.index    
    y = g3_data.values
    
    plt.bar(range(len(x)), y) # Passing string labels directly can alphabetize the x-axis; plot against integer positions, then relabel.

    label = []
    for item in x:
        label.append(item.replace('_', '\n').title())

    plt.xticks(range(len(x)), label);

    plt.xticks(fontsize=18, rotation=90)
    plt.yticks(fontsize=18)
    plt.ylabel('Tweet Count', fontsize=18)
    # fig.suptitle('@dog_rates Tweet Count by Dog Breed', fontsize=24, ha='center');
In [101]:
def plot_rating_by_breed():
    images_w_rating = pd.merge(images_clean, tweets_clean[['tweet_id', 'rating']], on='tweet_id')

    hi_conf_breeds = images_w_rating[images_w_rating['prediction'] == 'Dog']
    hi_conf_breeds = hi_conf_breeds[hi_conf_breeds['p1_conf'] > 50]

    hi_rating_counts = list(hi_conf_breeds.p1.value_counts()[0:21].index) # Keep the 21 most frequently predicted breeds

    g4_data = hi_conf_breeds[hi_conf_breeds['p1'].isin(hi_rating_counts)]
    g4_data = g4_data[['tweet_id', 'p1', 'rating']]
    g4_data = g4_data.groupby(['p1']).mean() # Mean rating per breed (the tweet_id mean is unused)
    g4_data = g4_data.dropna()
        
    g4_data = g4_data.sort_values(by='rating', ascending=False)
    
    fig, ax = plt.subplots(1, 1, figsize=(16, 9))

    ax.spines['top'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)

    x = g4_data.index
    y = g4_data.rating
    
    plt.bar(range(len(x)), y) # Passing string labels directly can alphabetize the x-axis; plot against integer positions, then relabel.

    label = []
    for item in x:
        label.append(item.replace('_', ' ').title())

    plt.xticks(range(len(x)), label);

    plt.xticks(fontsize=14, rotation=90)
    plt.yticks(fontsize=18)
    plt.ylabel('Rating', fontsize=18)
    # fig.suptitle('Average Rating by Breed', fontsize=24, ha='center');
In [102]:
def plot_confidence_by_breed():
    hi_conf_dogs = images_clean[images_clean['prediction'] == 'Dog']

    # The 21 most frequently predicted breeds among images classified 'Dog'.
    hi_rating_counts = list(hi_conf_dogs.p1.value_counts()[0:21].index)
    
    g4_data = hi_conf_dogs[hi_conf_dogs['p1'].isin(hi_rating_counts)]
    g4_data = g4_data[['tweet_id', 'p1', 'p1_conf']]
    g4_data = g4_data.groupby(['p1']).mean()
    g4_data = g4_data.dropna()
        
    g4_data = g4_data.sort_values(by='p1_conf', ascending=False)
    
    fig, ax = plt.subplots(1, 1, figsize=(16, 9))

    ax.spines['top'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)

    x = g4_data.index
    y = g4_data.p1_conf
    
    plt.bar(range(len(x)), y) # Passing string labels directly can alphabetize the x-axis; plot against integer positions, then relabel.

    label = []
    for item in x:
        label.append(item.replace('_', ' ').title())

    plt.xticks(range(len(x)), label);

    plt.xticks(fontsize=14, rotation=90)
    plt.yticks(fontsize=18)
    plt.ylabel('Prediction Confidence', fontsize=18)
    # fig.suptitle('Average Neural Network Confidence by Breed', fontsize=24, ha='center');
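
Similarly, the three bar-chart functions share one skeleton. A generic helper (plot_ranked_bars is a hypothetical name, not from the original notebook) could render any pre-ranked Series:

def plot_ranked_bars(series, ylabel, tick_fontsize=14):
    # Bar chart of a pandas Series, preserving the Series' existing order.
    fig, ax = plt.subplots(1, 1, figsize=(16, 9))
    for side in ['top', 'bottom', 'left', 'right']:
        ax.spines[side].set_visible(False)

    positions = range(len(series))
    plt.bar(positions, series.values)

    # Relabel the integer positions with cleaned-up breed names.
    labels = [str(name).replace('_', ' ').title() for name in series.index]
    plt.xticks(positions, labels, fontsize=tick_fontsize, rotation=90)
    plt.yticks(fontsize=18)
    plt.ylabel(ylabel, fontsize=18)

For example, passing the ranked Series built inside plot_tweets_per_breed (g3_data) as plot_ranked_bars(g3_data, 'Tweet Count') would reproduce the tweets-per-breed chart.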

Plots and Insights

In [103]:
plot_favorites_per_tweet_by_day()
  • Since the first tweet in this dataset (2015/11/15), the number of favorites an @dog_rates tweet receives has increased at a rate of roughly 32 per day.
In [104]:
plot_retweets_per_tweet_by_day()
  • Since the first tweet in this dataset (2015/11/15), the number of retweets an @dog_rates tweet receives has increased at a rate of roughly 7 per day.
In [105]:
plot_tweets_per_breed()
  • Golden Retrievers are the most popular breed on the WeRateDogs Twitter feed, followed by Pembrokes and Labrador Retrievers.
In [106]:
plot_rating_by_breed()
  • Golden Retrievers are rated especially highly, which is notable given the very large number of tweets that included Golden Retriever images.
In [107]:
plot_confidence_by_breed()
  • It is unsurprising that the breeds with the highest average prediction confidence (the top three: Pomeranian, French Bulldog, and Pug) are also among the most distinctive-looking breeds in the dataset.