The tables queried this blog post are the publicly-available “tutorial.us_flights,” “tutorial.city_populations,” “tutorial.billboard_top_100_year_end,” “tutorial.sat_scores,” “tutorial.excel_sql_inventory_data,” and “tutorial.excel_sql_transaction_data” tables that are available at modeanalytics.com.
What are Aggregations?
SQL aggregations are a means of consolidating data down to a more concise representation, and are useful in situations where row-level data would not be useful or would be entirely overwhelming. There are many situations where aggregated data are useful in business environments. Examples include measuring sales transactions by month, inventory by year, or sales volume per salesperson per year.
Aggregations operate down columns, and not across rows. Examples of aggregate functions include
NULLs are a special data type included in tables to indicate the absence of data.
NULLs are not zero and not considered a value. Rather, they are best considered as a property of the data.
NULLs commonly occur when performing
In order to identify
NULL using a
WHERE clause, we use
IS NULL or
IS NOT NULL.
= NULL will result in an error, since
NULL is not a value.
COUNT counts how many rows are in a particular column.
NULL values, as demonstrated in the query below.
|Count of Product IDs||Count of Transaction IDs|
SUM adds all values in a particular column.
|Total Flights||Cancelled Flights|
MIN and MAX
MAX return the lowest and highest values, respectively, of the values in a particular column.
|Max Writing||Min Writing||Max Verbal||Min Verbal||Max Math||Min Math|
MAX also operate on nonnumeric data, returning the earliest date and nonnumeric data in the case of the former, and latest and closest to ‘Z’ in the case of the latter.
AVG calculates the average of the values in a particular column.
|Avg Writing||Avg Verbal||Avg Math|
GROUP BY creates segments that can be aggregated independently of one another. The
GROUP BY clause is placed between the
WHERE clause and the
ORDER BY clause. Any column in the
SELECT statement that is not within an aggregator must be in the
GROUP BY clause.
Examples of parameters that might be grouped on in a business context might include different sales representatives, accounts, or regions.
|Artist||Top 100 Hits|
|State||Urban Population||Large City Count|
GROUP BY and
ORDER BY can also be used on multiple columns at once, as shown below.
|Product Type||Average Price Per Unit||Unit|
|School||Teacher||Avg Total SAT|
Instead of retyping the entire label in the
GROUP BY and
ORDER BY clauses, the columns can instead be referred to by the numerical order in which they appear in the
The order that the columns appear in the
GROUP BY clause does not impact the results. The order that the columns appear in the
ORDER BY does. Compare the following to the foregoing.
In the former table, the rows are ordered in reverse alphabetical by school. For each school, the rows are ordered alphabetically by teacher.
In the latter table, the rows are ordered alphabetically by teacher and the school is actually not used in the order, since the teachers are all unique names.
DISTINCT is a modifier placed, once, in a
SELECT statement. It returns the unique entries for each of the columns in the
It is used when you do not want to
GROUP BY some columns, but don’t want to perform any other aggregations.
The results of the following query are the first three of 201664 rows.
DISTINCT statement into the query produces 4152 rows.
Note that it also collapses rows containing duplicate combinations of entries. The following query produces only a single row. Without
DISTINCT, there are 487 rows with origin ‘JFK’ and destination ‘LAX’.
A final note on
DISTINCT: it may dramatically increase the execution time of some queries. This is particularly the case when used with aggregations.
WHERE clause does not allow you to filter based on aggregate data. In order to filter based upon an aggregation, the
HAVING clause must be used. The
HAVING clause follows the
GROUP BY clause and precedes the
ORDER BY clause.
|state||Top 50 City Population|
Note that the
HAVING’s functionality is frequently performed with a subquery instead of with
HAVING. Using the
HAVING clause is the more ‘clean’ way to filter based on aggregate data, however.
Datetime data in SQL is stored in the following format: YYYY-MM-DD HH:MM:SS. Because the datetime data is commonly stored with granularity to the second,
GROUP BY using date information, as-is, is not very helpful. An example of typical datetime data follows.
|2016-01-03 08:02:39||2016-01-08 19:59:51|
SQL has several built-in functions to cope with the granularity of datetime data. Two of the most important include
DATE_TRUNC allows you to truncate a datetime to a particular part of the datetime column. Commonly, this truncation is to a year, month, or day. The query from above is run with the
DATE_TRUNC function applied to the datetime, with the following results.
The data can now be aggregated much more readily.
Other parameters to which datetime can be truncated include second, minute, hour, week, month, quarter, and year.
DATE_PART is distinct from
DATE_TRUNC in that it does not keep track of the larger time denominations. In the following example, the number of transactions that occurs is counted regardless of the day on which the transaction occurred. This function is useful for examining trends based on day of week (‘dow’), as an example.
|Day of Week||Total Scheduled Flight Distance|
Mondays and Fridays appear to be common days for flying.
CASE statements are SQL’s way of handling if-then logic, and are used to create derived columns.
CASE statements are placed within
CASE statements need to include
ELSE is optional, and used as a catch-all for situations where the input doesn’t find a match in the other conditions.
|2015-01-02 00:00:00||LAX||JFK||-8||On Time|
|2015-01-04 00:00:00||LAX||JFK||150||Very Late|
WHERE clauses can be used within
CASE statements as well.