The context for this note is that it is a description of a SQL investigation into two particular branches of the sPAP described at a high level, below.
Project Goal: Increase the number of Dognition tests completed
Note this goal is not particularly Specific or and is not Time-Bound, at all.
Relevant Dependent Variables
The relevant dependent variables are not defined at this point.
According to Jana Schaich Borg, at this level of the pyramid one should:
…write out all the ways you plan on operationalizing the business metric you are analyzing in your project here. Your description would include the equations you plan to use, and the field and table names of the columns in your database that you need to calculate the metric.
Relevant Independent Variables
- Features of the Dogs
- Dognition Personality Dimension
- Speed of game completion
- Breed group, Breed type (purebred?)
- Previous Behavioral Training
- Features of the Owners
- Personality Traits
- Knowledge-seeking, Easily bored
- Education, Socioeconomic status, Age
- Dog History
- # dogs owned now, # lifetime dogs owned, history buying pure-breeds, ever bred dogs?
- Personality Traits
- Features of Testing Circumstances
- Time of year, day of week, time of day
- Geographic location, home or work?
- Interaction with Dognition and Product
- How did they learn about Dognition?
- 60 Minutes, Internet, Recommendation
- Free Start?
- How did they learn about Dognition?
- Technology Interactions
- Internet browser, phone/tablet/desktop?
Features of the Dogs
The branch of the sPAP that will be investigated is the Features of the Dogs branch.
Dognition Personality Dimensions
A few criteria are added in a
WHERE clause based upon brief examination. In particular, these are the following.
The end result is the following table.
These results suggest there are not important differences in the number of tests completed by dogs with different Dognition personality dimensions.
|Mixed Breed/ Other/ I Don’t Know||4818||10.2688|
Dog Breeds and Neutering
These results suggest that a dog’s breeding doesn’t have a strong relationship with how many tests a dog completed. Neutered dogs, on the other hand, seem to finish 1-2 more tests than non-neutered dogs.
Next steps to investigate this further would be to determine whether the neutering effect is consistent across segments of dogs broken up by according to other variables. In particular, it may be difficult to determine whether neutered dogs complete more tests due to traits that arise from being neutered, or whether owners who neuter their dogs are more likely to want to complete more tests.
Other Dog Features
Two other dog features that were listed in the sPAP were speed of game completion and previous behavioral training. The relationship between the speed of game completion and number of games completed would be most readily examined by creating a scatter plot with a best fit line. Alternatively, a statistical regression could be performed. Either way, the best course of action is to export the data to a program like Tableau, Matlab, or R.
A Note on Averages as Summary Metrics
Averages may not be the best feature to use for summary statistics. This is because averages are very sensitive to outliers. The median is a more robust summary statistic. Medians are more computationally intensive than averages, however, so there is no built-in function that allows median calculation in SQL. There are advanced strategies that make this possible.
A practical means of testing for misleading average values is to also measure the standard deviation of the data. Outliers will result in large standard deviation values. Whenever standard deviations are a large percentage of the average, and absolutely whenever the standard deviation is larger, it is necessary to perform sophisticated statistical analysis before interpreting the results of an analysis too strongly.
|Data Set||Median||Mode||Mean||Standard Deviation|
|1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4||2.5||2||2.58||1.08|
|1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 400||2.5||2||35.38||114.74|
Practical Averages Example 1
Since the standard deviation is only a small fraction of the average, and consistent across the data, it is reasonable to assume that there averages are trustworthy.
Practical Averages Example 2
|Mixed Breed/ Other/ I Don’t Know||9145||48749.0|
Note that in the preceding example the standard deviations are much larger than the averages. This indicates that the data are likely to have extreme outliers, and that the averages reported are not likely to be trustworthy. More in-depth statistical analysis is required in this case.
Features of Testing Circumstances
The branch of the sPAP that will be investigated is the Features of Testing Circumstances branch.
This note focuses on assessing the relationships between the following data features and the number of tests completed, which is the business metric this analysis ultimately seeks to improve.
Which Day of Week?
The query below was developed incrementally. It involves excluding the users and dogs indicated by Dognition as needing to be excluded (
exclude = 1). It also limited the users to the contiguous United States (excluding Hawaii and Alaska).
The conclusion from this data is that Sundays are generally the day with the most completed tests, and Fridays are typically the worst days. This indicates it may make sense for Dognition to send reminders or encouragement messages on Sundays.
California has more than twice the users as the next state. This information could be very useful. The next questions to ask include:
- What other characteristics distinguish California users?
- Were specific advertising channels or promotions used in California?
- Can the answers to those questions be used in future marketing efforts?
It is obvious from these results that users from English-speaking countries represent a large majority of Dognition’s userbase. It may make sense to create translated versions of the Dognition website in order to penetrate foreign markets to a greater degree.
Some content in this note are taken from the Jupyter Notebooks below. They are accessible as part of the “Managing Big Data with MySQL” course on coursera.org, and licensed by Jana Schaich Borg under CC BY-NC 4.0.
- MySQL Exercise 11 Queries that Test Relationships Between Test Completion and Dog Characteristics
- MySQL Exercise 12 Queries that Test Relationships Between Test Completion and Testing Characteristics
Other content for this note is taken from the Coursera course “Managing Big Data with MySQL.”