Aggregations
For each of the example queries, the first few rows of the result are included.
What are Aggregations?
SQL aggregations are a means of consolidating data down to a more concise representation, and are useful in situations where row-level data would not be useful or would be entirely overwhelming. There are many situations where aggregated data are useful in business environments. Examples include measuring sales transactions by month, inventory by year, or sales volume per salesperson per year.
Aggregations operate down columns, and not across rows. Examples of aggregate functions include COUNT
, SUM
, MIN
, MAX
, and AVG
.
NULLs
NULL
s are a special data type included in tables to indicate the absence of data. NULL
s are not zero and not considered a value. Rather, they are best considered as a property of the data. NULL
s commonly occur when performing LEFT JOIN
s.
In order to identify NULL
using a WHERE
clause, we use IS NULL
or IS NOT NULL
. = NULL
will result in an error, since NULL
is not a value.
SELECT i.product_id,
i.price_unit,
t.time,
t.transaction_id
FROM tutorial.excel_sql_inventory_data i
LEFT JOIN tutorial.excel_sql_transaction_data t
ON i.product_id = t.product_id
WHERE t.transaction_id IS NULL
product_id | price_unit | time | transaction_id |
---|---|---|---|
85 | 0.08 |
SELECT school,
teacher,
hrs_studied,
sat_writing,
sat_verbal,
sat_math
FROM tutorial.sat_scores
WHERE hrs_studied IS NULL
school | teacher | hrs_studied | sat_writing | sat_verbal | sat_math |
---|---|---|---|---|---|
Washington HS | Frederickson | 642 | 287 | 282 | |
Washington HS | Spellman | 502 | 291 | 716 | |
Petersville HS | Davis | 756 | 595 | 427 | |
Petersville HS | Davis | 779 | 656 | 724 | |
Petersville HS | Perry | 649 | 654 | 636 |
COUNT
COUNT
counts how many rows are in a particular column.
SELECT count(*)
FROM tutorial.sat_scores
count |
---|
135 |
SELECT count(city)
FROM tutorial.city_populations
count |
---|
50 |
Note that COUNT
excludes NULL
values, as demonstrated in the query below.
SELECT COUNT(i.product_id) "Count of Product IDs",
COUNT(t.transaction_id) "Count of Transaction IDs"
FROM tutorial.excel_sql_inventory_data i
LEFT JOIN tutorial.excel_sql_transaction_data t
ON i.product_id = t.product_id
Count of Product IDs | Count of Transaction IDs |
---|---|
595 | 594 |
SUM
SUM
adds all values in a particular column.
SELECT COUNT(*) "Total Flights",
SUM(cancelled) "Cancelled Flights" /* Cancelled = 1, Not cancelled = 0 */
FROM tutorial.us_flights
Total Flights | Cancelled Flights |
---|---|
201664 | 4791 |
MIN and MAX
MIN
and MAX
return the lowest and highest values, respectively, of the values in a particular column.
SELECT MAX(sat_writing) "Max Writing",
MIN(sat_writing) "Min Writing",
MAX(sat_verbal) "Max Verbal",
MIN(sat_verbal) "Min Verbal",
MAX(sat_math) "Max Math",
MIN(sat_math) "Min Math"
FROM tutorial.sat_scores
Max Writing | Min Writing | Max Verbal | Min Verbal | Max Math | Min Math |
---|---|---|---|---|---|
799 | 206 | 799 | 200 | 797 | 201 |
MIN
and MAX
also operate on nonnumeric data, returning the earliest date and nonnumeric data in the case of the former, and latest and closest to ‘Z’ in the case of the latter.
AVG
AVG
calculates the average of the values in a particular column.
SELECT AVG(actual_elapsed_time)
FROM tutorial.us_flights
WHERE dest = 'SFO'
AND origin = 'LAX'
avg |
---|
75.2930648769575 |
SELECT AVG(sat_writing) "Avg Writing",
AVG(sat_verbal) "Avg Verbal",
AVG(sat_math) "Avg Math"
FROM tutorial.sat_scores
Avg Writing | Avg Verbal | Avg Math |
---|---|---|
500.5037037037037 | 495.72592592592594 | 517.4148148148148 |
GROUP BY
GROUP BY
creates segments that can be aggregated independently of one another. The GROUP BY
clause is placed between the WHERE
clause and the ORDER BY
clause. Any column in the SELECT
statement that is not within an aggregator must be in the GROUP BY
clause.
Examples of parameters that might be grouped on in a business context might include different sales representatives, accounts, or regions.
SELECT artist "Artist",
COUNT(song_name) "Top 100 Hits"
FROM tutorial.billboard_top_100_year_end
GROUP BY artist
ORDER BY count(song_name) DESC
Artist | Top 100 Hits |
---|---|
Madonna | 36 |
Elvis Presley | 36 |
Mariah Carey | 33 |
Rihanna | 33 |
Ludacris | 28 |
SELECT state "State",
SUM(population_estimate_2012) "Urban Population",
COUNT(city) "Large City Count"
FROM tutorial.city_populations
GROUP BY state
ORDER BY "Large City Count" DESC
State | Urban Population | Large City Count |
---|---|---|
CA | 8854805 | 8 |
TX | 7453656 | 7 |
AZ | 2465129 | 3 |
TN | 1279651 | 2 |
MI | 1165785 | 2 |
GROUP BY
and ORDER BY
can also be used on multiple columns at once, as shown below.
SELECT product_type "Product Type",
AVG(price_unit) "Average Price Per Unit",
unit "Unit"
FROM tutorial.excel_sql_inventory_data
GROUP BY product_type, "Unit"
ORDER BY "Average Price Per Unit" DESC
Product Type | Average Price Per Unit | Unit |
---|---|---|
meat | 29.79 | lb |
other | 6 | fl_oz |
produce | 4.29 | oz |
frozen | 3.95375 | oz |
other | 2.67 | cubic_ft |
SELECT school "School",
teacher "Teacher",
AVG(sat_writing + sat_verbal + sat_math) "Avg Total"
FROM tutorial.sat_scores
GROUP BY school, teacher
ORDER BY "Avg Total" DESC
School | Teacher | Avg Total SAT |
---|---|---|
St. John’s | Tran | 1643.9333333333334 |
Petersville HS | Davis | 1605.1538461538462 |
Petersville HS | Perry | 1594.304347826087 |
St. John’s | Williams | 1508.6875 |
St. John’s | Rajaram | 1459.25 |
Instead of retyping the entire label in the GROUP BY
and ORDER BY
clauses, the columns can instead be referred to by the numerical order in which they appear in the SELECT
clause.
SELECT school "School",
teacher "Teacher",
AVG(sat_writing + sat_verbal + sat_math) "Avg Total"
FROM tutorial.sat_scores
GROUP BY 1, 2
ORDER BY 1 DESC, 2
School | Teacher | Avg Total |
---|---|---|
Washington HS | Frederickson | 1430.076923076923 |
Washington HS | Spellman | 1451 |
St. John’s | Rajaram | 1459.25 |
St. John’s | Tran | 1643.9333333333334 |
St. John’s | Williams | 1508.6875 |
The order that the columns appear in the GROUP BY
clause does not impact the results. The order that the columns appear in the ORDER BY
does. Compare the following to the foregoing.
SELECT school "School",
teacher "Teacher",
AVG(sat_writing + sat_verbal + sat_math) "Avg Total"
FROM tutorial.sat_scores
GROUP BY 1, 2
ORDER BY 2, 1 DESC
School | Teacher | Avg Total |
---|---|---|
Petersville HS | Brown | 1418.4705882352941 |
Petersville HS | Davis | 1605.1538461538462 |
Washington HS | Frederickson | 1430.076923076923 |
Petersville HS | Perry | 1594.304347826087 |
St. John’s | Rajaram | 1459.25 |
In the former table, the rows are ordered in reverse alphabetical by school. For each school, the rows are ordered alphabetically by teacher.
In the latter table, the rows are ordered alphabetically by teacher and the school is actually not used in the order, since the teachers are all unique names.
DISTINCT
DISTINCT
is a modifier placed, once, in a SELECT
statement. It returns the unique entries for each of the columns in the SELECT
statement.
It is used when you do not want to GROUP BY
some columns, but don’t want to perform any other aggregations.
The results of the following query are the first three of 201664 rows.
SELECT origin,
dest
FROM tutorial.us_flights
origin | dest |
---|---|
JFK | LAX |
JFK | LAX |
JFK | LAX |
Inserting the DISTINCT
statement into the query produces 4152 rows.
SELECT DISTINCT origin,
dest
FROM tutorial.us_flights
origin | dest |
---|---|
PDX | HNL |
ATL | PWM |
LRD | DFW |
MIA | TUL |
DAL | ICT |
Note that it also collapses rows containing duplicate combinations of entries. The following query produces only a single row. Without DISTINCT
, there are 487 rows with origin ‘JFK’ and destination ‘LAX’.
SELECT DISTINCT origin,
dest
FROM tutorial.us_flights
WHERE origin = 'JFK'
AND dest = 'LAX'
origin | dest |
---|---|
JFK | LAX |
Note DISTINCT
may dramatically increase the execution time of some queries. This is particularly the case when used with aggregations.
HAVING
The WHERE
clause does not allow you to filter based on aggregate data. In order to filter based upon an aggregation, the HAVING
clause must be used. The HAVING
clause follows the GROUP BY
clause and precedes the ORDER BY
clause.
SELECT state,
SUM(population_estimate_2012) "Top 50 City Population"
FROM tutorial.city_populations
GROUP BY state
HAVING SUM(population_estimate_2012) < 500000
AND SUM(population_estimate_2012) > 0
ORDER BY SUM(population_estimate_2012) DESC
state | Top 50 City Population |
---|---|
VA | 447021 |
GA | 443775 |
NE | 421570 |
MN | 392880 |
KS | 385577 |
SELECT origin,
COUNT(*) "Flights"
FROM tutorial.us_flights
GROUP BY origin
HAVING COUNT(*) > 7000
ORDER BY COUNT(*) DESC
origin | Flights |
---|---|
ATL | 12678 |
ORD | 10046 |
DFW | 9854 |
DEN | 7441 |
LAX | 7434 |
Note that the HAVING
’s functionality is frequently performed with a subquery instead of with HAVING
. Using the HAVING
clause is the more ‘clean’ way to filter based on aggregate data, however.
DATE
Datetime data in SQL is stored in the following format: YYYY-MM-DD HH:MM:SS. Because the datetime data is commonly stored with granularity to the second, GROUP BY
using date information, as-is, is not very helpful. An example of typical datetime data follows.
SELECT time,
transaction_id,
product_id
FROM tutorial.excel_sql_transaction_data
ORDER BY transaction_id
time | transaction_id | product_id |
---|---|---|
2016-01-08 17:46:17 | 1 | 3 |
2016-01-08 17:46:17 | 1 | 61 |
2016-01-07 14:11:57 | 2 | 23 |
2016-01-06 17:57:42 | 4 | 52 |
2016-01-06 17:57:42 | 4 | 4 |
SELECT MIN(time),
MAX(time)
FROM tutorial.excel_sql_transaction_data
min | max |
---|---|
2016-01-03 08:02:39 | 2016-01-08 19:59:51 |
SQL has several built-in functions to cope with the granularity of datetime data. Two of the most important include DATE_TRUNC
and DATE_PART
.
DATE_TRUNC
DATE_TRUNC
allows you to truncate a datetime to a particular part of the datetime column. Commonly, this truncation is to a year, month, or day. The query from above is run with the DATE_TRUNC
function applied to the datetime, with the following results.
SELECT DATE_TRUNC('day',time),
transaction_id,
product_id
FROM tutorial.excel_sql_transaction_data
ORDER BY transaction_id
date_trunc | transaction_id | product_id |
---|---|---|
2016-01-08 00:00:00 | 1 | 3 |
2016-01-08 00:00:00 | 1 | 61 |
2016-01-07 00:00:00 | 2 | 23 |
2016-01-06 00:00:00 | 4 | 52 |
2016-01-06 00:00:00 | 4 | 4 |
The data can now be aggregated much more readily.
SELECT DATE_TRUNC('day',time) "Date",
COUNT(*) "Transaction Count"
FROM tutorial.excel_sql_transaction_data
GROUP BY DATE_TRUNC('day',time)
Date | Transaction Count |
---|---|
2016-01-04 00:00:00 | 105 |
2016-01-08 00:00:00 | 101 |
2016-01-03 00:00:00 | 96 |
2016-01-06 00:00:00 | 126 |
2016-01-05 00:00:00 | 78 |
Other parameters to which datetime can be truncated include second, minute, hour, week, month, quarter, and year.
DATE_PART
DATE_PART
is distinct from DATE_TRUNC
in that it does not keep track of the larger time denominations. In the following example, the number of transactions that occurs is counted regardless of the day on which the transaction occurred. This function is useful for examining trends based on day of week (‘dow’), as an example.
SELECT DATE_PART('hour', time) "Hour",
COUNT(*) "Transaction Count"
FROM tutorial.excel_sql_transaction_data
GROUP BY 1
ORDER BY 1
Hour | Transaction Count |
---|---|
8 | 54 |
9 | 60 |
10 | 33 |
11 | 55 |
12 | 55 |
SELECT DATE_PART('dow', flight_date) "Day of Week",
SUM(distance) "Total Scheduled Flight Distance"
FROM tutorial.us_flights
GROUP BY DATE_PART('dow', flight_date)
ORDER BY DATE_PART('dow', flight_date)
Day of Week | Total Scheduled Flight Distance |
---|---|
0 | 25646241 |
1 | 26147981 |
2 | 23837864 |
3 | 24379063 |
4 | 12694557 |
5 | 26762083 |
6 | 23354741 |
CASE
CASE
statements are SQL’s way of handling if-then logic, and are used to create derived columns. CASE
statements are placed within SELECT
clauses.
CASE
statements need to include CASE
, WHEN
, and END
keywords. ELSE
is optional, and used as a catch-all for situations where the input doesn’t find a match in the other conditions.
SELECT transaction_id,
CASE WHEN COUNT(product_id) = 1 THEN 'Single Item'
ELSE 'Multiple Item' END purchase_type
FROM tutorial.excel_sql_transaction_data
GROUP BY transaction_id
ORDER BY transaction_id
transaction_id | purchase_type |
---|---|
1 | Multiple Item |
2 | Single Item |
4 | Multiple Item |
5 | Single Item |
7 | Multiple Item |
SELECT flight_date,
origin,
dest,
arr_delay,
CASE WHEN arr_delay < -10 THEN 'Early'
WHEN arr_delay > 10 AND arr_delay < 60 THEN 'Late'
WHEN arr_delay >= 60 THEN 'Very Late'
WHEN arr_delay IS NULL THEN 'Cancelled'
ELSE 'On Time' END "Flight Status"
FROM tutorial.us_flights
WHERE origin = 'LAX'
flight_date | origin | dest | arr_delay | Flight Status |
---|---|---|---|---|
2015-01-02 00:00:00 | LAX | JFK | -8 | On Time |
2015-01-03 00:00:00 | LAX | JFK | 24 | Late |
2015-01-04 00:00:00 | LAX | JFK | 150 | Very Late |
2015-01-05 00:00:00 | LAX | JFK | -15 | Early |
2015-01-06 00:00:00 | LAX | JFK | -18 | Early |
Note that WHERE
clauses can be used within CASE
statements as well.