For each of the example queries, the first few rows of the result are included.
What are Aggregations?
SQL aggregations are a means of consolidating data down to a more concise representation, and are useful in situations where row-level data would not be useful or would be entirely overwhelming. There are many situations where aggregated data are useful in business environments. Examples include measuring sales transactions by month, inventory by year, or sales volume per salesperson per year.
Aggregations operate down columns, and not across rows. Examples of aggregate functions include
NULLs are a special data type included in tables to indicate the absence of data.
NULLs are not zero and not considered a value. Rather, they are best considered as a property of the data.
NULLs commonly occur when performing
In order to identify
NULL using a
WHERE clause, we use
IS NULL or
IS NOT NULL.
= NULL will result in an error, since
NULL is not a value.
SELECT i.product_id, i.price_unit, t.time, t.transaction_id FROM tutorial.excel_sql_inventory_data i LEFT JOIN tutorial.excel_sql_transaction_data t ON i.product_id = t.product_id WHERE t.transaction_id IS NULL
SELECT school, teacher, hrs_studied, sat_writing, sat_verbal, sat_math FROM tutorial.sat_scores WHERE hrs_studied IS NULL
COUNT counts how many rows are in a particular column.
SELECT count(*) FROM tutorial.sat_scores
SELECT count(city) FROM tutorial.city_populations
NULL values, as demonstrated in the query below.
SELECT COUNT(i.product_id) "Count of Product IDs", COUNT(t.transaction_id) "Count of Transaction IDs" FROM tutorial.excel_sql_inventory_data i LEFT JOIN tutorial.excel_sql_transaction_data t ON i.product_id = t.product_id
|Count of Product IDs||Count of Transaction IDs|
SUM adds all values in a particular column.
SELECT COUNT(*) "Total Flights", SUM(cancelled) "Cancelled Flights" /* Cancelled = 1, Not cancelled = 0 */ FROM tutorial.us_flights
|Total Flights||Cancelled Flights|
MIN and MAX
MAX return the lowest and highest values, respectively, of the values in a particular column.
SELECT MAX(sat_writing) "Max Writing", MIN(sat_writing) "Min Writing", MAX(sat_verbal) "Max Verbal", MIN(sat_verbal) "Min Verbal", MAX(sat_math) "Max Math", MIN(sat_math) "Min Math" FROM tutorial.sat_scores
|Max Writing||Min Writing||Max Verbal||Min Verbal||Max Math||Min Math|
MAX also operate on nonnumeric data, returning the earliest date and nonnumeric data in the case of the former, and latest and closest to ‘Z’ in the case of the latter.
AVG calculates the average of the values in a particular column.
SELECT AVG(actual_elapsed_time) FROM tutorial.us_flights WHERE dest = 'SFO' AND origin = 'LAX'
SELECT AVG(sat_writing) "Avg Writing", AVG(sat_verbal) "Avg Verbal", AVG(sat_math) "Avg Math" FROM tutorial.sat_scores
|Avg Writing||Avg Verbal||Avg Math|
GROUP BY creates segments that can be aggregated independently of one another. The
GROUP BY clause is placed between the
WHERE clause and the
ORDER BY clause. Any column in the
SELECT statement that is not within an aggregator must be in the
GROUP BY clause.
Examples of parameters that might be grouped on in a business context might include different sales representatives, accounts, or regions.
SELECT artist "Artist", COUNT(song_name) "Top 100 Hits" FROM tutorial.billboard_top_100_year_end GROUP BY artist ORDER BY count(song_name) DESC
|Artist||Top 100 Hits|
SELECT state "State", SUM(population_estimate_2012) "Urban Population", COUNT(city) "Large City Count" FROM tutorial.city_populations GROUP BY state ORDER BY "Large City Count" DESC
|State||Urban Population||Large City Count|
GROUP BY and
ORDER BY can also be used on multiple columns at once, as shown below.
SELECT product_type "Product Type", AVG(price_unit) "Average Price Per Unit", unit "Unit" FROM tutorial.excel_sql_inventory_data GROUP BY product_type, "Unit" ORDER BY "Average Price Per Unit" DESC
|Product Type||Average Price Per Unit||Unit|
SELECT school "School", teacher "Teacher", AVG(sat_writing + sat_verbal + sat_math) "Avg Total" FROM tutorial.sat_scores GROUP BY school, teacher ORDER BY "Avg Total" DESC
|School||Teacher||Avg Total SAT|
Instead of retyping the entire label in the
GROUP BY and
ORDER BY clauses, the columns can instead be referred to by the numerical order in which they appear in the
SELECT school "School", teacher "Teacher", AVG(sat_writing + sat_verbal + sat_math) "Avg Total" FROM tutorial.sat_scores GROUP BY 1, 2 ORDER BY 1 DESC, 2
The order that the columns appear in the
GROUP BY clause does not impact the results. The order that the columns appear in the
ORDER BY does. Compare the following to the foregoing.
SELECT school "School", teacher "Teacher", AVG(sat_writing + sat_verbal + sat_math) "Avg Total" FROM tutorial.sat_scores GROUP BY 1, 2 ORDER BY 2, 1 DESC
In the former table, the rows are ordered in reverse alphabetical by school. For each school, the rows are ordered alphabetically by teacher.
In the latter table, the rows are ordered alphabetically by teacher and the school is actually not used in the order, since the teachers are all unique names.
DISTINCT is a modifier placed, once, in a
SELECT statement. It returns the unique entries for each of the columns in the
It is used when you do not want to
GROUP BY some columns, but don’t want to perform any other aggregations.
The results of the following query are the first three of 201664 rows.
SELECT origin, dest FROM tutorial.us_flights
DISTINCT statement into the query produces 4152 rows.
SELECT DISTINCT origin, dest FROM tutorial.us_flights
Note that it also collapses rows containing duplicate combinations of entries. The following query produces only a single row. Without
DISTINCT, there are 487 rows with origin ‘JFK’ and destination ‘LAX’.
SELECT DISTINCT origin, dest FROM tutorial.us_flights WHERE origin = 'JFK' AND dest = 'LAX'
DISTINCT may dramatically increase the execution time of some queries. This is particularly the case when used with aggregations.
WHERE clause does not allow you to filter based on aggregate data. In order to filter based upon an aggregation, the
HAVING clause must be used. The
HAVING clause follows the
GROUP BY clause and precedes the
ORDER BY clause.
SELECT state, SUM(population_estimate_2012) "Top 50 City Population" FROM tutorial.city_populations GROUP BY state HAVING SUM(population_estimate_2012) < 500000 AND SUM(population_estimate_2012) > 0 ORDER BY SUM(population_estimate_2012) DESC
|state||Top 50 City Population|
SELECT origin, COUNT(*) "Flights" FROM tutorial.us_flights GROUP BY origin HAVING COUNT(*) > 7000 ORDER BY COUNT(*) DESC
Note that the
HAVING’s functionality is frequently performed with a subquery instead of with
HAVING. Using the
HAVING clause is the more ‘clean’ way to filter based on aggregate data, however.
Datetime data in SQL is stored in the following format: YYYY-MM-DD HH:MM:SS. Because the datetime data is commonly stored with granularity to the second,
GROUP BY using date information, as-is, is not very helpful. An example of typical datetime data follows.
SELECT time, transaction_id, product_id FROM tutorial.excel_sql_transaction_data ORDER BY transaction_id
SELECT MIN(time), MAX(time) FROM tutorial.excel_sql_transaction_data
|2016-01-03 08:02:39||2016-01-08 19:59:51|
SQL has several built-in functions to cope with the granularity of datetime data. Two of the most important include
DATE_TRUNC allows you to truncate a datetime to a particular part of the datetime column. Commonly, this truncation is to a year, month, or day. The query from above is run with the
DATE_TRUNC function applied to the datetime, with the following results.
SELECT DATE_TRUNC('day',time), transaction_id, product_id FROM tutorial.excel_sql_transaction_data ORDER BY transaction_id
The data can now be aggregated much more readily.
SELECT DATE_TRUNC('day',time) "Date", COUNT(*) "Transaction Count" FROM tutorial.excel_sql_transaction_data GROUP BY DATE_TRUNC('day',time)
Other parameters to which datetime can be truncated include second, minute, hour, week, month, quarter, and year.
DATE_PART is distinct from
DATE_TRUNC in that it does not keep track of the larger time denominations. In the following example, the number of transactions that occurs is counted regardless of the day on which the transaction occurred. This function is useful for examining trends based on day of week (‘dow’), as an example.
SELECT DATE_PART('hour', time) "Hour", COUNT(*) "Transaction Count" FROM tutorial.excel_sql_transaction_data GROUP BY 1 ORDER BY 1
SELECT DATE_PART('dow', flight_date) "Day of Week", SUM(distance) "Total Scheduled Flight Distance" FROM tutorial.us_flights GROUP BY DATE_PART('dow', flight_date) ORDER BY DATE_PART('dow', flight_date)
|Day of Week||Total Scheduled Flight Distance|
CASE statements are SQL’s way of handling if-then logic, and are used to create derived columns.
CASE statements are placed within
CASE statements need to include
ELSE is optional, and used as a catch-all for situations where the input doesn’t find a match in the other conditions.
SELECT transaction_id, CASE WHEN COUNT(product_id) = 1 THEN 'Single Item' ELSE 'Multiple Item' END purchase_type FROM tutorial.excel_sql_transaction_data GROUP BY transaction_id ORDER BY transaction_id
SELECT flight_date, origin, dest, arr_delay, CASE WHEN arr_delay < -10 THEN 'Early' WHEN arr_delay > 10 AND arr_delay < 60 THEN 'Late' WHEN arr_delay >= 60 THEN 'Very Late' WHEN arr_delay IS NULL THEN 'Cancelled' ELSE 'On Time' END "Flight Status" FROM tutorial.us_flights WHERE origin = 'LAX'
|2015-01-02 00:00:00||LAX||JFK||-8||On Time|
|2015-01-04 00:00:00||LAX||JFK||150||Very Late|
WHERE clauses can be used within
CASE statements as well.