Queries that Test Relationships


The context for this note is that it is a description of a SQL investigation into two particular branches of the sPAP described at a high level, below.

sPAP Description

S.M.A.R.T. Goal

Project Goal: Increase the number of Dognition tests completed

Note this goal is not particularly Specific or and is not Time-Bound, at all.

Relevant Dependent Variables

The relevant dependent variables are not defined at this point.

According to Jana Schaich Borg, at this level of the pyramid one should:

…write out all the ways you plan on operationalizing the business metric you are analyzing in your project here. Your description would include the equations you plan to use, and the field and table names of the columns in your database that you need to calculate the metric.

Relevant Independent Variables

  • Features of the Dogs
    • Dognition Personality Dimension
    • Speed of game completion
    • Breed
      • Breed group, Breed type (purebred?)
    • Neutered
    • Previous Behavioral Training
  • Features of the Owners
    • Personality Traits
      • Knowledge-seeking, Easily bored
    • Demographics
      • Education, Socioeconomic status, Age
    • Dog History
      • # dogs owned now, # lifetime dogs owned, history buying pure-breeds, ever bred dogs?
  • Features of Testing Circumstances
    • When?
      • Time of year, day of week, time of day
    • Where?
      • Geographic location, home or work?
  • Interaction with Dognition and Product
    • How did they learn about Dognition?
      • 60 Minutes, Internet, Recommendation
    • Promotions
      • Free Start?
  • Technology Interactions
    • Internet browser, phone/tablet/desktop?

Features of the Dogs

The branch of the sPAP that will be investigated is the Features of the Dogs branch.

Dognition Personality Dimensions

SELECT u.dimension AS uDimension,
       COUNT(u.dog_guid),
       AVG(u.tests_complete)
  FROM (SELECT d.dog_guid,
               d.dimension,
               COUNT(t.created_at) tests_complete
          FROM dogs d
          JOIN complete_tests t
            ON d.dog_guid = t.dog_guid
         GROUP BY d.dog_guid,
                  d.dimension) AS u
 GROUP BY u.dimension;
uDimension COUNT(u.dog_guid) AVG(u.tests_complete)
None 13705 6.9416
’’ 71 9.5352
ace 477 23.3878
charmer 690 23.2594
einstein 129 23.2171
expert 298 23.3926
maverick 272 22.8199
protodog 602 22.9336
renaissance-dog 510 23.0157
socialite 871 23.1194
stargazer 361 22.7368

A few criteria are added in a WHERE clause based upon brief examination. In particular, these are the following.

WHERE d.dimension != ""
  AND d.dimension IS NOT NULL
  AND (d.exclude IS NULL
       OR d.exclude = 0)

The end result is the following table.

uDimension COUNT(u.dog_guid) AVG(u.tests_complete)
ace 402 23.5100
charmer 626 23.3594
einstein 109 23.2385
expert 273 23.4249
maverick 245 22.7673
protodog 535 22.9570
renaissance-dog 463 23.0410
socialite 792 23.0997
stargazer 310 22.7968

These results suggest there are not important differences in the number of tests completed by dogs with different Dognition personality dimensions.

Dog Breeds

SELECT u.breed_type AS uBreedType,
       COUNT(u.dog_guid),
       AVG(u.tests_complete)
  FROM (SELECT d.dog_guid,
               d.breed_type,
               COUNT(t.created_at) tests_complete
          FROM dogs d
          JOIN complete_tests t
            ON d.dog_guid = t.dog_guid
         WHERE (d.exclude IS NULL
                OR d.exclude = 0)
         GROUP BY d.dog_guid,
                  d.breed_type) AS u
 GROUP BY u.breed_type;
uBreedType COUNT(u.dog_guid) AVG(u.tests_complete)
Cross Breed 2884 10.6009
Mixed Breed/ Other/ I Don’t Know 4818 10.2688
Popular Hybrid 634 10.8423
Pure Breed 8865 10.4107

Dog Breeds and Neutering

SELECT u.pure_breed,
       u.dog_fixed,
       COUNT(u.dog_guid),
       AVG(u.tests_complete)
  FROM (SELECT d.dog_guid,
               d.breed_type,
               d.dog_fixed,
               COUNT(t.created_at) tests_complete,
               CASE d.breed_type
                    WHEN 'Pure Breed' THEN 'Pure_Breed'
                    ELSE 'Not_Pure_Breed'
               END AS pure_breed
          FROM dogs d
          JOIN complete_tests t
            ON d.dog_guid = t.dog_guid
         WHERE (d.exclude IS NULL
            OR d.exclude = 0)
         GROUP BY d.dog_guid,
                  d.breed_type,
                  d.dog_fixed) AS u
 GROUP BY u.pure_breed,
          u.dog_fixed;
pure_breed dog_fixed COUNT(u.dog_guid) AVG(u.tests_complete)
Not_Pure_Breed None 97 9.9897
Not_Pure_Breed 0 592 8.6807
Not_Pure_Breed 1 7647 10.5681
Pure_Breed None 135 8.2815
Pure_Breed 0 1687 9.3788
Pure_Breed 1 7043 10.6987

These results suggest that a dog’s breeding doesn’t have a strong relationship with how many tests a dog completed. Neutered dogs, on the other hand, seem to finish 1-2 more tests than non-neutered dogs.

Next steps to investigate this further would be to determine whether the neutering effect is consistent across segments of dogs broken up by according to other variables. In particular, it may be difficult to determine whether neutered dogs complete more tests due to traits that arise from being neutered, or whether owners who neuter their dogs are more likely to want to complete more tests.

Other Dog Features

Two other dog features that were listed in the sPAP were speed of game completion and previous behavioral training. The relationship between the speed of game completion and number of games completed would be most readily examined by creating a scatter plot with a best fit line. Alternatively, a statistical regression could be performed. Either way, the best course of action is to export the data to a program like Tableau, Matlab, or R.

A Note on Averages as Summary Metrics

Averages may not be the best feature to use for summary statistics. This is because averages are very sensitive to outliers. The median is a more robust summary statistic. Medians are more computationally intensive than averages, however, so there is no built-in function that allows median calculation in SQL. There are advanced strategies that make this possible.

A practical means of testing for misleading average values is to also measure the standard deviation of the data. Outliers will result in large standard deviation values. Whenever standard deviations are a large percentage of the average, and absolutely whenever the standard deviation is larger, it is necessary to perform sophisticated statistical analysis before interpreting the results of an analysis too strongly.

Data Set Median Mode Mean Standard Deviation
1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4 2.5 2 2.58 1.08
1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 400 2.5 2 35.38 114.74

Practical Averages Example 1

SELECT u.dimension AS uDimension,
       COUNT(u.dog_guid),
       AVG(u.tests_complete),
       STDDEV(u.tests_complete)
  FROM (SELECT d.dog_guid,
               d.dimension,
               COUNT(t.created_at) tests_complete
          FROM dogs d
          JOIN complete_tests t
            ON d.dog_guid = t.dog_guid
         WHERE d.dimension != ''
           AND d.dimension IS NOT NULL
           AND (d.exclude IS NULL
                OR d.exclude = 0)
         GROUP BY d.dog_guid,
                  d.dimension) AS u
 GROUP BY u.dimension;
uDimension COUNT(u.dog_guid) AVG(u.tests_complete) STDDEV(u.tests_complete)
ace 402 23.5100 5.4896
charmer 626 23.3594 5.1919
einstein 109 23.2385 5.3155
expert 273 23.4249 4.7589
maverick 245 22.7673 4.7353
protodog 535 22.9570 5.3742
renaissance-dog 463 23.0410 4.9508
socialite 792 23.0997 4.9748
stargazer 310 22.7968 4.8254

Since the standard deviation is only a small fraction of the average, and consistent across the data, it is reasonable to assume that there averages are trustworthy.

Practical Averages Example 2

SELECT d.breed_type,
       ROUND(AVG(TIMESTAMPDIFF(MINUTE,start_time,end_time))) AS avg_duration,
       ROUND(STDDEV(TIMESTAMPDIFF(MINUTE,start_time,end_time))) AS stddev_duration
  FROM dogs d
  JOIN exam_answers e
    ON d.dog_guid = e.dog_guid
 WHERE TIMESTAMPDIFF(MINUTE,start_time,end_time) > 0
 GROUP BY d.breed_type;
breed_type avg_duration stddev_duration
Cross Breed 11810 59113.0
Mixed Breed/ Other/ I Don’t Know 9145 48749.0
Popular Hybrid 7734 45578.0
Pure Breed 12311 60997.0

Note that in the preceding example the standard deviations are much larger than the averages. This indicates that the data are likely to have extreme outliers, and that the averages reported are not likely to be trustworthy. More in-depth statistical analysis is required in this case.

Features of Testing Circumstances

The branch of the sPAP that will be investigated is the Features of Testing Circumstances branch.

This note focuses on assessing the relationships between the following data features and the number of tests completed, which is the business metric this analysis ultimately seeks to improve.

Which Day of Week?

The query below was developed incrementally. It involves excluding the users and dogs indicated by Dognition as needing to be excluded (exclude = 1). It also limited the users to the contiguous United States (excluding Hawaii and Alaska).

SELECT YEAR(DATE_SUB(t.created_at, INTERVAL 6 HOUR)) AS year,
       CASE DAYOFWEEK(DATE_SUB(t.created_at, INTERVAL 6 HOUR))
            WHEN 1 THEN 'Sunday'
            WHEN 2 THEN 'Monday'
            WHEN 3 THEN 'Tuesday'
            WHEN 4 THEN 'Wednesday'
            WHEN 5 THEN 'Thursday'
            WHEN 6 THEN 'Friday'
            WHEN 7 THEN 'Saturday'
        END AS weekday_name,
       COUNT(DATE_SUB(t.created_at, INTERVAL 6 HOUR)) AS test_count
  FROM complete_tests t
  JOIN (SELECT DISTINCT d.dog_guid
          FROM dogs d
          JOIN users u
            ON d.user_guid = u.user_guid
         WHERE (d.exclude IS NULL
            OR  d.exclude = 0)
           AND (u.exclude IS NULL
            OR  u.exclude = 0)
           AND u.country = 'US'
           AND u.state NOT IN ('AK','HI')) AS included_dogs_users
    ON t.dog_guid = included_dogs_users.dog_guid
 GROUP BY year,
          weekday_name
 ORDER BY year,
          FIELD(weekday_name,
                'Monday',
                'Tuesday',
                'Wednesday',
                'Thursday',
                'Friday',
                'Saturday',
                'Sunday');
year weekday_name test_count
2013 Monday 3798
2013 Tuesday 3281
2013 Wednesday 3410
2013 Thursday 3079
2013 Friday 3049
2013 Saturday 4754
2013 Sunday 6061
2014 Monday 7908
2014 Tuesday 6508
2014 Wednesday 5823
2014 Thursday 4800
2014 Friday 4339
2014 Saturday 6081
2014 Sunday 7736
2015 Monday 8229
2015 Tuesday 6673
2015 Wednesday 6215
2015 Thursday 5881
2015 Friday 5275
2015 Saturday 7154
2015 Sunday 10406

The conclusion from this data is that Sundays are generally the day with the most completed tests, and Fridays are typically the worst days. This indicates it may make sense for Dognition to send reminders or encouragement messages on Sundays.

Which States?

SELECT included_dogs_users.state user_state,
       COUNT(DISTINCT included_dogs_users.user_guid) user_count
  FROM complete_tests t
  JOIN (SELECT DISTINCT u.user_guid,
               u.state,
               d.dog_guid
          FROM dogs d
          JOIN users u
            ON d.user_guid = u.user_guid
         WHERE (d.exclude IS NULL
            OR  d.exclude = 0)
           AND (u.exclude IS NULL
            OR  u.exclude = 0)
           AND u.country = 'US') AS included_dogs_users
    ON t.dog_guid = included_dogs_users.dog_guid
 GROUP BY user_state
 ORDER BY user_count DESC
 LIMIT 5;
user_state user_count
CA 1363
NY 628
TX 536
FL 502
NC 467

California has more than twice the users as the next state. This information could be very useful. The next questions to ask include:

  • What other characteristics distinguish California users?
  • Were specific advertising channels or promotions used in California?
  • Can the answers to those questions be used in future marketing efforts?

Which Countries?

SELECT included_dogs_users.country user_country,
       COUNT(DISTINCT included_dogs_users.user_guid) user_count
  FROM complete_tests t
  JOIN (SELECT DISTINCT u.user_guid,
               u.country,
               d.dog_guid
          FROM dogs d
          JOIN users u
            ON d.user_guid = u.user_guid
         WHERE (d.exclude IS NULL
            OR  d.exclude = 0)
           AND (u.exclude IS NULL
            OR  u.exclude = 0)) AS included_dogs_users
    ON t.dog_guid = included_dogs_users.dog_guid
 GROUP BY user_country
 ORDER BY user_count DESC
 LIMIT 10;
user_country user_count
US 8936
N/A 5466
CA 484
AU 142
GB 123
DE 40
NZ 38
DK 34
NO 30
FR 23

It is obvious from these results that users from English-speaking countries represent a large majority of Dognition’s userbase. It may make sense to create translated versions of the Dognition website in order to penetrate foreign markets to a greater degree.

Some content in this note are taken from the Jupyter Notebooks below. They are accessible as part of the “Managing Big Data with MySQL” course on coursera.org, and licensed by Jana Schaich Borg under CC BY-NC 4.0.

  • MySQL Exercise 11 Queries that Test Relationships Between Test Completion and Dog Characteristics
  • MySQL Exercise 12 Queries that Test Relationships Between Test Completion and Testing Characteristics

Other content for this note is taken from the Coursera course “Managing Big Data with MySQL.”