SQL Aggregations

The tables queried this blog post are the publicly-available “tutorial.us_flights,” “tutorial.city_populations,” “tutorial.billboard_top_100_year_end,” “tutorial.sat_scores,” “tutorial.excel_sql_inventory_data,” and “tutorial.excel_sql_transaction_data” tables that are available at modeanalytics.com.

What are Aggregations?

SQL aggregations are a means of consolidating data down to a more concise representation, and are useful in situations where row-level data would not be useful or would be entirely overwhelming. There are many situations where aggregated data are useful in business environments. Examples include measuring sales transactions by month, inventory by year, or sales volume per salesperson per year.

Aggregations operate down columns, and not across rows. Examples of aggregate functions include COUNT, SUM, MIN, MAX, and AVG.

NULLs

NULLs are a special data type included in tables to indicate the absence of data. NULLs are not zero and not considered a value. Rather, they are best considered as a property of the data. NULLs commonly occur when performing LEFT JOINs.

In order to identify NULL using a WHERE clause, we use IS NULL or IS NOT NULL. = NULL will result in an error, since NULL is not a value.

SELECT i.product_id,
       i.price_unit,
       t.time,
       t.transaction_id
  FROM tutorial.excel_sql_inventory_data i
  LEFT JOIN tutorial.excel_sql_transaction_data t
    ON i.product_id = t.product_id
 WHERE t.transaction_id IS NULL
product_id price_unit time transaction_id
85 0.08    


SELECT school,
       teacher,
       hrs_studied,
       sat_writing,
       sat_verbal,
       sat_math
  FROM tutorial.sat_scores
 WHERE hrs_studied IS NULL
school teacher hrs_studied sat_writing sat_verbal sat_math
Washington HS Frederickson   642 287 282
Washington HS Spellman   502 291 716
Petersville HS Davis   756 595 427
Petersville HS Davis   779 656 724
Petersville HS Perry   649 654 636

COUNT

COUNT counts how many rows are in a particular column.

SELECT count(*)
  FROM tutorial.sat_scores
count
135


SELECT count(city)
  FROM tutorial.city_populations
count
50


Note that COUNT excludes NULL values, as demonstrated in the query below.

SELECT COUNT(i.product_id) "Count of Product IDs",
       COUNT(t.transaction_id) "Count of Transaction IDs"
  FROM tutorial.excel_sql_inventory_data i
  LEFT JOIN tutorial.excel_sql_transaction_data t
    ON i.product_id = t.product_id
Count of Product IDs Count of Transaction IDs
595 594

SUM

SUM adds all values in a particular column.

SELECT COUNT(*) "Total Flights",
       SUM(cancelled) "Cancelled Flights" /* Cancelled = 1, Not cancelled = 0 */
  FROM tutorial.us_flights
Total Flights Cancelled Flights
201664 4791

MIN and MAX

MIN and MAX return the lowest and highest values, respectively, of the values in a particular column.

SELECT MAX(sat_writing) "Max Writing",
       MIN(sat_writing) "Min Writing",
       MAX(sat_verbal) "Max Verbal",
       MIN(sat_verbal) "Min Verbal",
       MAX(sat_math) "Max Math",
       MIN(sat_math) "Min Math"
  FROM tutorial.sat_scores
Max Writing Min Writing Max Verbal Min Verbal Max Math Min Math
799 206 799 200 797 201

MIN and MAX also operate on nonnumeric data, returning the earliest date and nonnumeric data in the case of the former, and latest and closest to ‘Z’ in the case of the latter.

AVG

AVG calculates the average of the values in a particular column.

SELECT AVG(actual_elapsed_time)
  FROM tutorial.us_flights
 WHERE dest = 'SFO'
   AND origin = 'LAX'
avg
75.2930648769575


SELECT AVG(sat_writing) "Avg Writing",
       AVG(sat_verbal) "Avg Verbal",
       AVG(sat_math) "Avg Math"
  FROM tutorial.sat_scores
Avg Writing Avg Verbal Avg Math
500.5037037037037 495.72592592592594 517.4148148148148

GROUP BY

GROUP BY creates segments that can be aggregated independently of one another. The GROUP BY clause is placed between the WHERE clause and the ORDER BY clause. Any column in the SELECT statement that is not within an aggregator must be in the GROUP BY clause.

Examples of parameters that might be grouped on in a business context might include different sales representatives, accounts, or regions.

SELECT artist "Artist",
       COUNT(song_name) "Top 100 Hits"
  FROM tutorial.billboard_top_100_year_end
 GROUP BY artist
 ORDER BY count(song_name) DESC
Artist Top 100 Hits
Madonna 36
Elvis Presley 36
Mariah Carey 33
Rihanna 33
Ludacris 28


SELECT state "State",
       SUM(population_estimate_2012) "Urban Population",
       COUNT(city) "Large City Count"
  FROM tutorial.city_populations
 GROUP BY state
 ORDER BY "Large City Count" DESC
State Urban Population Large City Count
CA 8854805 8
TX 7453656 7
AZ 2465129 3
TN 1279651 2
MI 1165785 2


GROUP BY and ORDER BY can also be used on multiple columns at once, as shown below.

SELECT product_type "Product Type",
       AVG(price_unit) "Average Price Per Unit",
       unit "Unit"
  FROM tutorial.excel_sql_inventory_data
 GROUP BY product_type, "Unit"
 ORDER BY "Average Price Per Unit" DESC
Product Type Average Price Per Unit Unit
meat 29.79 lb
other 6 fl_oz
produce 4.29 oz
frozen 3.95375 oz
other 2.67 cubic_ft


SELECT school "School",
       teacher "Teacher",
       AVG(sat_writing + sat_verbal + sat_math) "Avg Total"
  FROM tutorial.sat_scores
 GROUP BY school, teacher
 ORDER BY "Avg Total" DESC
School Teacher Avg Total SAT
St. John’s Tran 1643.9333333333334
Petersville HS Davis 1605.1538461538462
Petersville HS Perry 1594.304347826087
St. John’s Williams 1508.6875
St. John’s Rajaram 1459.25


Instead of retyping the entire label in the GROUP BY and ORDER BY clauses, the columns can instead be referred to by the numerical order in which they appear in the SELECT clause.

SELECT school "School",
       teacher "Teacher",
       AVG(sat_writing + sat_verbal + sat_math) "Avg Total"
  FROM tutorial.sat_scores
 GROUP BY 1, 2
 ORDER BY 1 DESC, 2
School Teacher Avg Total
Washington HS Frederickson 1430.076923076923
Washington HS Spellman 1451
St. John’s Rajaram 1459.25
St. John’s Tran 1643.9333333333334
St. John’s Williams 1508.6875

The order that the columns appear in the GROUP BY clause does not impact the results. The order that the columns appear in the ORDER BY does. Compare the following to the foregoing.

SELECT school "School",
       teacher "Teacher",
       AVG(sat_writing + sat_verbal + sat_math) "Avg Total"
  FROM tutorial.sat_scores
 GROUP BY 1, 2
 ORDER BY 2, 1 DESC
School Teacher Avg Total
Petersville HS Brown 1418.4705882352941
Petersville HS Davis 1605.1538461538462
Washington HS Frederickson 1430.076923076923
Petersville HS Perry 1594.304347826087
St. John’s Rajaram 1459.25

In the former table, the rows are ordered in reverse alphabetical by school. For each school, the rows are ordered alphabetically by teacher.

In the latter table, the rows are ordered alphabetically by teacher and the school is actually not used in the order, since the teachers are all unique names.

DISTINCT

DISTINCT is a modifier placed, once, in a SELECT statement. It returns the unique entries for each of the columns in the SELECT statement.

It is used when you do not want to GROUP BY some columns, but don’t want to perform any other aggregations.

The results of the following query are the first three of 201664 rows.

SELECT origin,
       dest
  FROM tutorial.us_flights
origin dest
JFK LAX
JFK LAX
JFK LAX


Inserting the DISTINCT statement into the query produces 4152 rows.

SELECT DISTINCT origin,
       dest
  FROM tutorial.us_flights
origin dest
PDX HNL
ATL PWM
LRD DFW
MIA TUL
DAL ICT


Note that it also collapses rows containing duplicate combinations of entries. The following query produces only a single row. Without DISTINCT, there are 487 rows with origin ‘JFK’ and destination ‘LAX’.

SELECT DISTINCT origin,
       dest
  FROM tutorial.us_flights
 WHERE origin = 'JFK'
   AND dest = 'LAX'
origin dest
JFK LAX

A final note on DISTINCT: it may dramatically increase the execution time of some queries. This is particularly the case when used with aggregations.

HAVING

The WHERE clause does not allow you to filter based on aggregate data. In order to filter based upon an aggregation, the HAVING clause must be used. The HAVING clause follows the GROUP BY clause and precedes the ORDER BY clause.

SELECT state,
       SUM(population_estimate_2012) "Top 50 City Population"
  FROM tutorial.city_populations
 GROUP BY state
HAVING SUM(population_estimate_2012) < 500000
   AND SUM(population_estimate_2012) > 0
 ORDER BY SUM(population_estimate_2012) DESC
 
state Top 50 City Population
VA 447021
GA 443775
NE 421570
MN 392880
KS 385577


SELECT origin,
       COUNT(*) "Flights"
  FROM tutorial.us_flights
 GROUP BY origin
HAVING COUNT(*) > 7000
 ORDER BY COUNT(*) DESC
origin Flights
ATL 12678
ORD 10046
DFW 9854
DEN 7441
LAX 7434

Note that the HAVING’s functionality is frequently performed with a subquery instead of with HAVING. Using the HAVING clause is the more ‘clean’ way to filter based on aggregate data, however.

DATE

Datetime data in SQL is stored in the following format: YYYY-MM-DD HH:MM:SS. Because the datetime data is commonly stored with granularity to the second, GROUP BY using date information, as-is, is not very helpful. An example of typical datetime data follows.

SELECT time,
       transaction_id,
       product_id
  FROM tutorial.excel_sql_transaction_data
 ORDER BY transaction_id
time transaction_id product_id
2016-01-08 17:46:17 1 3
2016-01-08 17:46:17 1 61
2016-01-07 14:11:57 2 23
2016-01-06 17:57:42 4 52
2016-01-06 17:57:42 4 4


SELECT MIN(time),
       MAX(time)
  FROM tutorial.excel_sql_transaction_data
min max
2016-01-03 08:02:39 2016-01-08 19:59:51

SQL has several built-in functions to cope with the granularity of datetime data. Two of the most important include DATE_TRUNC and DATE_PART.

DATE_TRUNC

DATE_TRUNC allows you to truncate a datetime to a particular part of the datetime column. Commonly, this truncation is to a year, month, or day. The query from above is run with the DATE_TRUNC function applied to the datetime, with the following results.

SELECT DATE_TRUNC('day',time),
       transaction_id,
       product_id
  FROM tutorial.excel_sql_transaction_data
 ORDER BY transaction_id
date_trunc transaction_id product_id
2016-01-08 00:00:00 1 3
2016-01-08 00:00:00 1 61
2016-01-07 00:00:00 2 23
2016-01-06 00:00:00 4 52
2016-01-06 00:00:00 4 4

The data can now be aggregated much more readily.

SELECT DATE_TRUNC('day',time) "Date",
       COUNT(*) "Transaction Count"
  FROM tutorial.excel_sql_transaction_data
 GROUP BY DATE_TRUNC('day',time)
Date Transaction Count
2016-01-04 00:00:00 105
2016-01-08 00:00:00 101
2016-01-03 00:00:00 96
2016-01-06 00:00:00 126
2016-01-05 00:00:00 78

Other parameters to which datetime can be truncated include second, minute, hour, week, month, quarter, and year.

DATE_PART

DATE_PART is distinct from DATE_TRUNC in that it does not keep track of the larger time denominations. In the following example, the number of transactions that occurs is counted regardless of the day on which the transaction occurred. This function is useful for examining trends based on day of week (‘dow’), as an example.

SELECT DATE_PART('hour', time) "Hour",
       COUNT(*) "Transaction Count"
  FROM tutorial.excel_sql_transaction_data
 GROUP BY 1
 ORDER BY 1
 

Hour Transaction Count
8 54
9 60
10 33
11 55
12 55


SELECT DATE_PART('dow', flight_date) "Day of Week",
       SUM(distance) "Total Scheduled Flight Distance"
  FROM tutorial.us_flights
 GROUP BY DATE_PART('dow', flight_date)
 ORDER BY DATE_PART('dow', flight_date)
Day of Week Total Scheduled Flight Distance
0 25646241
1 26147981
2 23837864
3 24379063
4 12694557
5 26762083
6 23354741

Mondays and Fridays appear to be common days for flying.

CASE

CASE statements are SQL’s way of handling if-then logic, and are used to create derived columns. CASE statements are placed within SELECT clauses.

CASE statements need to include CASE, WHEN, and END keywords. ELSE is optional, and used as a catch-all for situations where the input doesn’t find a match in the other conditions.

SELECT transaction_id,
       CASE WHEN COUNT(product_id) = 1 THEN 'Single Item'
            ELSE 'Multiple Item' END purchase_type
  FROM tutorial.excel_sql_transaction_data
 GROUP BY transaction_id
 ORDER BY transaction_id
transaction_id purchase_type
1 Multiple Item
2 Single Item
4 Multiple Item
5 Single Item
7 Multiple Item


 SELECT flight_date,
        origin,
        dest,
        arr_delay,
        CASE WHEN arr_delay < -10 THEN 'Early'
             WHEN arr_delay > 10 AND arr_delay < 60 THEN 'Late'
             WHEN arr_delay >= 60 THEN 'Very Late'
             WHEN arr_delay IS NULL THEN 'Cancelled'
             ELSE 'On Time' END "Flight Status"
   FROM tutorial.us_flights
  WHERE origin = 'LAX'
flight_date origin dest arr_delay Flight Status
2015-01-02 00:00:00 LAX JFK -8 On Time
2015-01-03 00:00:00 LAX JFK 24 Late
2015-01-04 00:00:00 LAX JFK 150 Very Late
2015-01-05 00:00:00 LAX JFK -15 Early
2015-01-06 00:00:00 LAX JFK -18 Early


Note that WHERE clauses can be used within CASE statements as well.

Content for this note is taken from information I learned while pursuing the Udacity Data Analyst Nanodegree. Udacity is an online tech education service that offers both free and paid, tech-focused training. Learn more.