Aggregations

16 Apr 2018

The tables queried this blog post are the publicly-available "tutorial.us_flights," "tutorial.city_populations," "tutorial.billboard_top_100_year_end," "tutorial.sat_scores," "tutorial.excel_sql_inventory_data," and "tutorial.excel_sql_transaction_data" tables that are available at modeanalytics.com.

For each of the example queries, the first few rows of the result are included.

What are Aggregations?

SQL aggregations are a means of consolidating data down to a more concise representation, and are useful in situations where row-level data would not be useful or would be entirely overwhelming. There are many situations where aggregated data are useful in business environments. Examples include measuring sales transactions by month, inventory by year, or sales volume per salesperson per year.

Aggregations operate down columns, and not across rows. Examples of aggregate functions include COUNT, SUM, MIN, MAX, and AVG.

NULLs

NULLs are a special data type included in tables to indicate the absence of data. NULLs are not zero and not considered a value. Rather, they are best considered as a property of the data. NULLs commonly occur when performing LEFT JOINs.

In order to identify NULL using a WHERE clause, we use IS NULL or IS NOT NULL. = NULL will result in an error, since NULL is not a value.

SELECT i.product_id,
       i.price_unit,
       t.time,
       t.transaction_id
  FROM tutorial.excel_sql_inventory_data i
  LEFT JOIN tutorial.excel_sql_transaction_data t
    ON i.product_id = t.product_id
 WHERE t.transaction_id IS NULL

product_id	price_unit	time	transaction_id
85	0.08

SELECT school,
       teacher,
       hrs_studied,
       sat_writing,
       sat_verbal,
       sat_math
  FROM tutorial.sat_scores
 WHERE hrs_studied IS NULL

school	teacher	sat_writing	sat_verbal	sat_math
Washington HS	Frederickson	642	287	282
Washington HS	Spellman	502	291	716
Petersville HS	Davis	756	595	427
Petersville HS	Davis	779	656	724
Petersville HS	Perry	649	654	636

COUNT

COUNT counts how many rows are in a particular column.

SELECT count(*)
  FROM tutorial.sat_scores

count
135

SELECT count(city)
  FROM tutorial.city_populations

count
50

Note that COUNT excludes NULL values, as demonstrated in the query below.

SELECT COUNT(i.product_id) "Count of Product IDs",
       COUNT(t.transaction_id) "Count of Transaction IDs"
  FROM tutorial.excel_sql_inventory_data i
  LEFT JOIN tutorial.excel_sql_transaction_data t
    ON i.product_id = t.product_id

Count of Product IDs	Count of Transaction IDs
595	594

SUM

SUM adds all values in a particular column.

SELECT COUNT(*) "Total Flights",
       SUM(cancelled) "Cancelled Flights" /* Cancelled = 1, Not cancelled = 0 */
  FROM tutorial.us_flights

Total Flights	Cancelled Flights
201664	4791

MIN and MAX

MIN and MAX return the lowest and highest values, respectively, of the values in a particular column.

SELECT MAX(sat_writing) "Max Writing",
       MIN(sat_writing) "Min Writing",
       MAX(sat_verbal) "Max Verbal",
       MIN(sat_verbal) "Min Verbal",
       MAX(sat_math) "Max Math",
       MIN(sat_math) "Min Math"
  FROM tutorial.sat_scores

Max Writing	Min Writing	Max Verbal	Min Verbal	Max Math	Min Math
799	206	799	200	797	201

MIN and MAX also operate on nonnumeric data, returning the earliest date and nonnumeric data in the case of the former, and latest and closest to ‘Z’ in the case of the latter.

AVG

AVG calculates the average of the values in a particular column.

SELECT AVG(actual_elapsed_time)
  FROM tutorial.us_flights
 WHERE dest = 'SFO'
   AND origin = 'LAX'

avg
75.2930648769575

SELECT AVG(sat_writing) "Avg Writing",
       AVG(sat_verbal) "Avg Verbal",
       AVG(sat_math) "Avg Math"
  FROM tutorial.sat_scores

Avg Writing	Avg Verbal	Avg Math
500.5037037037037	495.72592592592594	517.4148148148148

GROUP BY

GROUP BY creates segments that can be aggregated independently of one another. The GROUP BY clause is placed between the WHERE clause and the ORDER BY clause. Any column in the SELECT statement that is not within an aggregator must be in the GROUP BY clause.

Examples of parameters that might be grouped on in a business context might include different sales representatives, accounts, or regions.

SELECT artist "Artist",
       COUNT(song_name) "Top 100 Hits"
  FROM tutorial.billboard_top_100_year_end
 GROUP BY artist
 ORDER BY count(song_name) DESC

Artist	Top 100 Hits
Madonna	36
Elvis Presley	36
Mariah Carey	33
Rihanna	33
Ludacris	28

SELECT state "State",
       SUM(population_estimate_2012) "Urban Population",
       COUNT(city) "Large City Count"
  FROM tutorial.city_populations
 GROUP BY state
 ORDER BY "Large City Count" DESC

State	Urban Population	Large City Count
CA	8854805	8
TX	7453656	7
AZ	2465129	3
TN	1279651	2
MI	1165785	2

GROUP BY and ORDER BY can also be used on multiple columns at once, as shown below.

SELECT product_type "Product Type",
       AVG(price_unit) "Average Price Per Unit",
       unit "Unit"
  FROM tutorial.excel_sql_inventory_data
 GROUP BY product_type, "Unit"
 ORDER BY "Average Price Per Unit" DESC

Product Type	Average Price Per Unit	Unit
meat	29.79	lb
other	6	fl_oz
produce	4.29	oz
frozen	3.95375	oz
other	2.67	cubic_ft

SELECT school "School",
       teacher "Teacher",
       AVG(sat_writing + sat_verbal + sat_math) "Avg Total"
  FROM tutorial.sat_scores
 GROUP BY school, teacher
 ORDER BY "Avg Total" DESC

School	Teacher	Avg Total SAT
St. John’s	Tran	1643.9333333333334
Petersville HS	Davis	1605.1538461538462
Petersville HS	Perry	1594.304347826087
St. John’s	Williams	1508.6875
St. John’s	Rajaram	1459.25

Instead of retyping the entire label in the GROUP BY and ORDER BY clauses, the columns can instead be referred to by the numerical order in which they appear in the SELECT clause.

SELECT school "School",
       teacher "Teacher",
       AVG(sat_writing + sat_verbal + sat_math) "Avg Total"
  FROM tutorial.sat_scores
 GROUP BY 1, 2
 ORDER BY 1 DESC, 2

School	Teacher	Avg Total
Washington HS	Frederickson	1430.076923076923
Washington HS	Spellman	1451
St. John’s	Rajaram	1459.25
St. John’s	Tran	1643.9333333333334
St. John’s	Williams	1508.6875

The order that the columns appear in the GROUP BY clause does not impact the results. The order that the columns appear in the ORDER BY does. Compare the following to the foregoing.

SELECT school "School",
       teacher "Teacher",
       AVG(sat_writing + sat_verbal + sat_math) "Avg Total"
  FROM tutorial.sat_scores
 GROUP BY 1, 2
 ORDER BY 2, 1 DESC

School	Teacher	Avg Total
Petersville HS	Brown	1418.4705882352941
Petersville HS	Davis	1605.1538461538462
Washington HS	Frederickson	1430.076923076923
Petersville HS	Perry	1594.304347826087
St. John’s	Rajaram	1459.25

In the former table, the rows are ordered in reverse alphabetical by school. For each school, the rows are ordered alphabetically by teacher.

In the latter table, the rows are ordered alphabetically by teacher and the school is actually not used in the order, since the teachers are all unique names.

DISTINCT

DISTINCT is a modifier placed, once, in a SELECT statement. It returns the unique entries for each of the columns in the SELECT statement.

It is used when you do not want to GROUP BY some columns, but don’t want to perform any other aggregations.

The results of the following query are the first three of 201664 rows.

SELECT origin,
       dest
  FROM tutorial.us_flights

origin	dest
JFK	LAX
JFK	LAX
JFK	LAX

Inserting the DISTINCT statement into the query produces 4152 rows.

SELECT DISTINCT origin,
       dest
  FROM tutorial.us_flights

origin	dest
PDX	HNL
ATL	PWM
LRD	DFW
MIA	TUL
DAL	ICT

Note that it also collapses rows containing duplicate combinations of entries. The following query produces only a single row. Without DISTINCT, there are 487 rows with origin ‘JFK’ and destination ‘LAX’.

SELECT DISTINCT origin,
       dest
  FROM tutorial.us_flights
 WHERE origin = 'JFK'
   AND dest = 'LAX'

origin	dest
JFK	LAX

Note DISTINCT may dramatically increase the execution time of some queries. This is particularly the case when used with aggregations.

HAVING

The WHERE clause does not allow you to filter based on aggregate data. In order to filter based upon an aggregation, the HAVING clause must be used. The HAVING clause follows the GROUP BY clause and precedes the ORDER BY clause.

SELECT state,
       SUM(population_estimate_2012) "Top 50 City Population"
  FROM tutorial.city_populations
 GROUP BY state
HAVING SUM(population_estimate_2012) < 500000
   AND SUM(population_estimate_2012) > 0
 ORDER BY SUM(population_estimate_2012) DESC

state	Top 50 City Population
VA	447021
GA	443775
NE	421570
MN	392880
KS	385577

SELECT origin,
       COUNT(*) "Flights"
  FROM tutorial.us_flights
 GROUP BY origin
HAVING COUNT(*) > 7000
 ORDER BY COUNT(*) DESC

origin	Flights
ATL	12678
ORD	10046
DFW	9854
DEN	7441
LAX	7434

Note that the HAVING’s functionality is frequently performed with a subquery instead of with HAVING. Using the HAVING clause is the more ‘clean’ way to filter based on aggregate data, however.

DATE

Datetime data in SQL is stored in the following format: YYYY-MM-DD HH:MM:SS. Because the datetime data is commonly stored with granularity to the second, GROUP BY using date information, as-is, is not very helpful. An example of typical datetime data follows.

SELECT time,
       transaction_id,
       product_id
  FROM tutorial.excel_sql_transaction_data
 ORDER BY transaction_id

time	transaction_id	product_id
2016-01-08 17:46:17	1	3
2016-01-08 17:46:17	1	61
2016-01-07 14:11:57	2	23
2016-01-06 17:57:42	4	52
2016-01-06 17:57:42	4	4

SELECT MIN(time),
       MAX(time)
  FROM tutorial.excel_sql_transaction_data

min	max
2016-01-03 08:02:39	2016-01-08 19:59:51

SQL has several built-in functions to cope with the granularity of datetime data. Two of the most important include DATE_TRUNC and DATE_PART.

DATE_TRUNC

DATE_TRUNC allows you to truncate a datetime to a particular part of the datetime column. Commonly, this truncation is to a year, month, or day. The query from above is run with the DATE_TRUNC function applied to the datetime, with the following results.

SELECT DATE_TRUNC('day',time),
       transaction_id,
       product_id
  FROM tutorial.excel_sql_transaction_data
 ORDER BY transaction_id

date_trunc	transaction_id	product_id
2016-01-08 00:00:00	1	3
2016-01-08 00:00:00	1	61
2016-01-07 00:00:00	2	23
2016-01-06 00:00:00	4	52
2016-01-06 00:00:00	4	4

The data can now be aggregated much more readily.

SELECT DATE_TRUNC('day',time) "Date",
       COUNT(*) "Transaction Count"
  FROM tutorial.excel_sql_transaction_data
 GROUP BY DATE_TRUNC('day',time)

Date	Transaction Count
2016-01-04 00:00:00	105
2016-01-08 00:00:00	101
2016-01-03 00:00:00	96
2016-01-06 00:00:00	126
2016-01-05 00:00:00	78

Other parameters to which datetime can be truncated include second, minute, hour, week, month, quarter, and year.

DATE_PART

DATE_PART is distinct from DATE_TRUNC in that it does not keep track of the larger time denominations. In the following example, the number of transactions that occurs is counted regardless of the day on which the transaction occurred. This function is useful for examining trends based on day of week (‘dow’), as an example.

SELECT DATE_PART('hour', time) "Hour",
       COUNT(*) "Transaction Count"
  FROM tutorial.excel_sql_transaction_data
 GROUP BY 1
 ORDER BY 1

Hour	Transaction Count
8	54
9	60
10	33
11	55
12	55

SELECT DATE_PART('dow', flight_date) "Day of Week",
       SUM(distance) "Total Scheduled Flight Distance"
  FROM tutorial.us_flights
 GROUP BY DATE_PART('dow', flight_date)
 ORDER BY DATE_PART('dow', flight_date)

Day of Week	Total Scheduled Flight Distance
0	25646241
1	26147981
2	23837864
3	24379063
4	12694557
5	26762083
6	23354741

CASE

CASE statements are SQL’s way of handling if-then logic, and are used to create derived columns. CASE statements are placed within SELECT clauses.

CASE statements need to include CASE, WHEN, and END keywords. ELSE is optional, and used as a catch-all for situations where the input doesn’t find a match in the other conditions.

SELECT transaction_id,
       CASE WHEN COUNT(product_id) = 1 THEN 'Single Item'
            ELSE 'Multiple Item' END purchase_type
  FROM tutorial.excel_sql_transaction_data
 GROUP BY transaction_id
 ORDER BY transaction_id

transaction_id	purchase_type
1	Multiple Item
2	Single Item
4	Multiple Item
5	Single Item
7	Multiple Item

 SELECT flight_date,
        origin,
        dest,
        arr_delay,
        CASE WHEN arr_delay < -10 THEN 'Early'
             WHEN arr_delay > 10 AND arr_delay < 60 THEN 'Late'
             WHEN arr_delay >= 60 THEN 'Very Late'
             WHEN arr_delay IS NULL THEN 'Cancelled'
             ELSE 'On Time' END "Flight Status"
  FROM tutorial.us_flights
 WHERE origin = 'LAX'

flight_date	origin	dest	arr_delay	Flight Status
2015-01-02 00:00:00	LAX	JFK	-8	On Time
2015-01-03 00:00:00	LAX	JFK	24	Late
2015-01-04 00:00:00	LAX	JFK	150	Very Late
2015-01-05 00:00:00	LAX	JFK	-15	Early
2015-01-06 00:00:00	LAX	JFK	-18	Early

Note that WHERE clauses can be used within CASE statements as well.

Content for these notes was taken from information I learned while pursuing the Udacity Data Analyst Nanodegree. Udacity is an online tech education service that offers tech-focused training. Learn more.