Aggregations and HAVING

This week’s content enables analysts to manipulate SQL data into whatever segments or categories they are interested in.

Aggregations

  • SQL aggregations are different from Excel aggregations as a result of the table format that queries must return. The aggregations must be a single level.
DESCRIBE dogs;
Field Type Null Key Default Extra
gender varchar(255) YES   None  
birthday varchar(255) YES   None  
breed varchar(255) YES   None  
weight int(11) YES   None  
dog_fixed tinyint(1) YES   None  
dna_tested tinyint(1) YES   None  
created_at datetime NO   None  
updated_at datetime NO   None  
dimension varchar(255) YES   None  
exclude tinyint(1) YES   None  
breed_type varchar(255) YES   None  
breed_group varchar(255) YES   None  
dog_guid varchar(60) YES MUL None  
user_guid varchar(60) YES MUL None  
total_tests_completed varchar(255) YES   None  
mean_iti_days varchar(255) YES   None  
mean_iti_minutes varchar(255) YES   None  
median_iti_days varchar(255) YES   None  
median_iti_minutes varchar(255) YES   None  
time_diff_between_first_and_last_game_days varchar(255) YES   None  
time_diff_between_first_and_last_game_minutes varchar(255) YES   None  

COUNT

  • When a column is included in a count function, null values are ignored in the count. When an asterisk is included in a count function, nulls are included in the count.
SELECT COUNT(*), COUNT(exclude)
FROM dogs;
COUNT(*) COUNT(exclude)
35050 1025

SUM

SUM can be combined with ISNULL to find the number of null values in a given column.

SELECT COUNT(*), COUNT(exclude), SUM(ISNULL(exclude))
FROM dogs;
COUNT(*) COUNT(exclude) SUM(ISNULL(exclude))
35050 1025 34025

AVG, MIN, MAX

SELECT test_name,
AVG(rating) AS AVG_Rating,
MIN(rating) AS MIN_Rating,
MAX(rating) AS MAX_Rating
FROM reviews
test_name AVG_Rating MIN_Rating MAX_Rating
Memory versus Pointing 3.5584 0 9

Examples

How would you query how much time it took to complete each test provided in the exam_answers table, in minutes?

DESCRIBE exam_answers;
Field Type Null Key Default Extra
script_detail_id int(11) YES   None  
subcategory_name varchar(255) YES   None  
test_name varchar(255) YES   None  
step_type varchar(255) YES   None  
start_time datetime YES   None  
end_time datetime YES   None  
loop_number int(11) YES   None  
dog_guid varchar(60) YES   None  
SELECT TIMESTAMPDIFF(MINUTE,start_time,end_time) "Duration",
       start_time,
       end_time
FROM exam_answers
LIMIT 3;
Duration start_time end_time
345139 2013-02-05 03:58:13 2013-10-02 20:18:06
345139 2013-02-05 03:58:31 2013-10-02 20:18:06
345139 2013-02-05 03:59:03 2013-10-02 20:18:06

What is the average amount of time it took customers to complete all of the tests in the exam_answers table when the negative durations are excluded from your calculation?

SELECT AVG(TIMESTAMPDIFF(MINUTE,start_time,end_time)) "Average Duration"
FROM exam_answers
WHERE TIMESTAMPDIFF(MINUTE,start_time,end_time) > 0;
Average Duration
11233.0951

Group By

MySQL allows you to use aliases in a GROUP BY clause, but some database systems do not.

Note that the following query does not return the correct ordering, because the alias “Dogs” is the name of one of the tables. If instead of double-quotes, the alias is wrapped in back-ticks, then the query works as expected.

SELECT breed_group, gender, COUNT(*) "Dogs"
FROM dogs
GROUP BY breed_group, gender
ORDER BY "Dogs" DESC
LIMIT 3;
breed_group gender Dogs
Sporting female 2262
Herding male 1736
Sporting male 2584
SELECT breed_group, gender, COUNT(*) `Dogs`
FROM dogs
GROUP BY breed_group, gender
ORDER BY `Dogs` DESC
LIMIT 3;
breed_group gender Dogs
None male 8466
None female 8367
Sporting male 2584

HAVING

Whereas the expression that follows a WHERE clause has to be applicable to each row of data in a column, the expression that follows a HAVING clause has to be applicable or computable using a group of data.

SELECT breed_group, gender, COUNT(DISTINCT dog_guid) `Dogs`
FROM dogs
WHERE breed_group IS NOT NULL
AND breed_group != ""
GROUP BY breed_group, gender
HAVING COUNT(DISTINCT dog_guid) > 1000
ORDER BY `Dogs` DESC
LIMIT 3;
breed_group gender Dogs
Sporting male 2584
Sporting female 2262
Herding male 1736

Examples

Write a query that outputs the average number of tests completed and average mean inter-test-interval for every breed type, sorted by the average number of completed tests in descending order.

SELECT breed_type,
       AVG(total_tests_completed) 'Tests Completed',
       AVG(mean_iti_minutes) 'Avg ITI'
FROM dogs
GROUP BY breed_type
ORDER BY AVG(total_tests_completed) DESC;
breed_type Tests Completed Avg ITI
Popular Hybrid 10.257530120481928 2834.3205728931534
Cross Breed 9.945900537634408 2872.351156110182
Pure Breed 9.871602824737856 3193.350493795222
Mixed Breed/ Other/ I Don’t Know 9.54250850170034 3023.0711302156274

Write a query that outputs the average amount of time it took customers to complete each type of test where any individual reaction times over 6000 hours are excluded and only average reaction times that are greater than 0 seconds are included.

SELECT test_name,
       AVG(TIMESTAMPDIFF(HOUR,start_time,end_time))
FROM exam_answers
WHERE TIMESTAMPDIFF(HOUR,start_time,end_time) < 6000
GROUP BY test_name
HAVING AVG(TIMESTAMPDIFF(SECOND,start_time,end_time)) > 0
ORDER BY AVG(TIMESTAMPDIFF(HOUR,start_time,end_time)) DESC
LIMIT 3;
test_name AVG(TIMESTAMPDIFF(HOUR,start_time,end_time))
Social-Quiz 81.9923
Diet 32.7712
Sociability 21.1726

Write a query that outputs the total number of unique User_Guids in each combination of State and ZIP code (postal code) in the United States, sorted first by state name in ascending alphabetical order, and second by total number of unique User_Guids in descending order.

SELECT state,
       zip,
       COUNT(DISTINCT user_guid)
FROM users
WHERE country = 'US'
GROUP BY state, zip
ORDER BY state, COUNT(DISTINCT user_guid) DESC
LIMIT 3;
state zip COUNT(DISTINCT user_guid)
AE 9128 2
AE 9053 1
AE 9107 1

Write a query that outputs the total number of unique User_Guids in each combination of State and ZIP code in the United States that have at least 5 users, sorted first by state name in ascending alphabetical order, and second by total number of unique User_Guids in descending order.

SELECT state,
       zip,
       COUNT(DISTINCT user_guid)
FROM users
WHERE country = 'US'
GROUP BY state, zip
HAVING COUNT(DISTINCT user_guid) > 5
ORDER BY state, COUNT(DISTINCT user_guid) DESC
LIMIT 3;
state zip COUNT(DISTINCT user_guid)
AZ 86303 14
AZ 85718 6
CA 92107 16

Pitfalls of GROUP BY

The following query executes without an error, but the query is nonsensical. It includes both aggregate and non-aggregate functions in the SELECT statement.

SELECT breed_type, COUNT(DISTINCT dog_guid) AS NumDogs, weight
FROM dogs
GROUP BY breed_type;
breed_type NumDogs weight
Cross Breed 5568 0
Mixed Breed/ Other/ I Don’t Know 9499 50
Popular Hybrid 1160 70
Pure Breed 18823 50

MySQL resolves this by choosing its own way to “summarize” the unaggregated field, “weight.” Other database systems may through an error, such as the following.

Column 'X' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

The order in which SQL queries are written and actually executed are tabulated below.

Written Executed
SELECT FROM
DISTINCT WHERE
FROM GROUP BY
WHERE HAVING
GROUP BY SELECT
HAVING DISTINCT
ORDER BY ORDER BY
LIMIT LIMIT

Some content in this note are taken from the Jupyter Notebooks enumerated below. They are accessible as part of the “Managing Big Data with MySQL” course on coursera.org, and licensed by Jana Schaich Borg under CC BY-NC 4.0.

  • MySQL Exercise 04 Summarizing Your Data
  • MySQL Exercise 05 Breaking Your Summaries into Groups
  • MySQL Exercise 06 Common Pitfalls of GROUP BY

Other content for this note is taken from the Coursera course “Managing Big Data with MySQL.”