Joins
For each of the example queries, the first few rows of the result are included.
What are SQL JOINs?
JOIN
s are where the power of SQL truly lies, and are a primary reason for SQL’s continued popularity in the marketplace. They allow users to join multiple tables from a relational database together.
A common question about relational databases is why databases are split up in the first place. Why can’t all the data be included in a single, large table? The short answer is that separating the data into a series of tables limits the overall number of reads and writes that would otherwise need to occur on a single, very large table. Splitting up the data also speeds up queries, and makes sense given that different types of data are likely to required different sorts of computing activity to maintain. As an example, an orders
or web_events
table is likely to be written to, once, and then subsequently read occasionally. A sales_reps
table, on the other hand, may require frequent access to update contact information and which accounts a given salesperson is responsible for.
In practice, the approach companies have adopted is to split up the data logically, based upon the different objects that the data refers to. ERDs, or Entity Relationship Diagrams, define these relationships. The example I showed previously is included below, for reference.
The process of breaking up a database into a series of data tables is called “Database Normalization.” This is a deep topic and requires its own skillset, but is generally outside the job scope of most analysts, who are principally concerned with pulling data from a database that already exists.
JOIN Syntax
The following syntax demonstrates how an analyst could pull data from multiple tables at one time.
Note that JOIN
s consist of two keywords: JOIN
precedes the table being joined, and ON
precedes the column on which the tables are to be joined.
SELECT tutorial.excel_sql_inventory_data.product_id,
tutorial.excel_sql_inventory_data.price_unit,
tutorial.excel_sql_transaction_data.time,
tutorial.excel_sql_transaction_data.transaction_id
FROM tutorial.excel_sql_inventory_data
JOIN tutorial.excel_sql_transaction_data
ON tutorial.excel_sql_inventory_data.product_id = tutorial.excel_sql_transaction_data.product_id
product_id | price_unit | time | transaction_id |
---|---|---|---|
3 | 1.02 | 2016-01-08 17:46:17 | 1 |
61 | 0.55 | 2016-01-08 17:46:17 | 1 |
23 | 0.91 | 2016-01-07 14:11:57 | 2 |
52 | 1.79 | 2016-01-06 17:57:42 | 4 |
4 | 1.19 | 2016-01-06 17:57:42 | 4 |
Aliases
It is possible to create aliases for table names for subsequent use in a given SQL query. The keyword AS
is sometimes used in between the table and its alias, but it can be omitted. As an example, utilizing aliases in the query above would result in the following much more readable query:
SELECT i.product_id,
i.price_unit,
t.time,
t.transaction_id
FROM tutorial.excel_sql_inventory_data i
JOIN tutorial.excel_sql_transaction_data t
ON i.product_id = t.product_id
In this example, only a few keystrokes are saved, but aliases greatly enhance readability in larger, more complex queries. Note that while the aliases are “defined” in the FROM
and JOIN
clauses, they can be used before those clauses, in the SELECT
clause.
It is most common to alias tables, but it is also possible to alias columns, as shown below. The quotes are only necessary if spaces are included in the aliases.
SELECT i.product_id "Product ID",
i.price_unit "Price Per Unit",
t.time "Transaction Time",
t.transaction_id "Transaction ID"
FROM tutorial.excel_sql_inventory_data i
JOIN tutorial.excel_sql_transaction_data t
ON i.product_id = t.product_id
The table that is returned will have name fields that reflect the alias, which may be much more readable.
Product ID | Price Per Unit | Transaction Time | Transaction ID |
---|---|---|---|
3 | 1.02 | 2016-01-08 17:46:17 | 1 |
61 | 0.55 | 2016-01-08 17:46:17 | 1 |
23 | 0.91 | 2016-01-07 14:11:57 | 2 |
52 | 1.79 | 2016-01-06 17:57:42 | 4 |
4 | 1.19 | 2016-01-06 17:57:42 | 4 |
Types of JOINs
In addition to the standard JOIN
demonstrated above, there are also other types of joins that retain or exclude different data depending on the table from which they are extracted. These JOIN
s are best imagined as a Venn diagram. The left circle includes the rows of data in the table in the FROM
clause. The right circle includes all rows of data in the table in the JOIN
clause. The overlapping section in the middle represents all rows for which the ON
clause is true.
Those types of JOIN
s are defined below.
INNER JOIN
- The most exclusiveJOIN
, it returns only rows at the intersection of the two tables.LEFT JOIN
- Consists of anINNER JOIN
plus all rows in theFROM
table, even if they did not match. Imagine all the data within the left circle of the Venn diagram is included.RIGHT JOIN
- Consists of anINNER JOIN
plus all rows in theJOIN
table, even if they did not match. Imagine all the data within the right circle of the Venn diagram is included.- (Full)
OUTER JOIN
- The most inclusiveJOIN
, it includes all rows in both tables. These are rare.
Note that LEFT JOIN
s and RIGHT JOIN
s are interchangeable by swapping which tables included in the FROM
and JOIN
clauses. The standard has become to architect queries such that LEFT JOIN
s are used preferentially. As a result, RIGHT JOIN
s are rarely seen.
SELECT i.product_id,
i.price_unit,
t.time,
t.transaction_id
FROM tutorial.excel_sql_inventory_data i
LEFT JOIN tutorial.excel_sql_transaction_data t
ON i.product_id = t.product_id
ORDER BY product_id DESC
product_id | price_unit | time | t_transaction_id |
---|---|---|---|
85 | 0.08 | ||
84 | 2.59 | 2016-01-03 12:36:15 | 152 |
84 | 2.59 | 2016-01-04 14:29:52 | 221 |
84 | 2.59 | 2016-01-08 11:57:31 | 260 |
84 | 2.59 | 2016-01-06 08:48:59 | 143 |
Omit the word LEFT
in the foregoing query and an INNER JOIN
is performed instead of LEFT JOIN
. As a result, the row containing product_id 85 is not included in the result set, since it does not find a correlate in the transactions table.
Filtering
It is possible to filter the data returned from a query using JOIN
clauses without using a WHERE
clause, one of the primary means of filtering. Consider the following two examples.
SELECT i.product_id,
i.price_unit,
t.time,
t.transaction_id
FROM tutorial.excel_sql_inventory_data i
LEFT JOIN tutorial.excel_sql_transaction_data t
ON i.product_id = t.product_id
WHERE i.product_id = 5
ORDER BY transaction_id
product_id | price_unit | time | transaction_id |
---|---|---|---|
5 | 3.86 | 2016-01-06 08:42:28 | 26 |
5 | 3.86 | 2016-01-04 12:49:51 | 108 |
The foregoing code will return only rows meeting the WHERE
clause, so the unmatched rows included as a result of the LEFT JOIN
will be excluded. Essentially, the WHERE
clause is performed after the JOIN
is executed.
A similar, but not identical, approach is the following.
SELECT i.product_id,
i.price_unit,
t.time,
t.transaction_id
FROM tutorial.excel_sql_inventory_data i
LEFT JOIN tutorial.excel_sql_transaction_data t
ON i.product_id = t.product_id
AND i.product_id = 5
ORDER BY transaction_id
product_id | price_unit | time | transaction_id |
---|---|---|---|
5 | 3.86 | 2016-01-06 08:42:28 | 26 |
5 | 3.86 | 2016-01-04 12:49:51 | 108 |
48 | 0.89 | ||
53 | 1.69 | ||
32 | 2.99 |
Setting up the code this way results in rows being returned that match the product_id
, or are blank in the columns sourced from the transactions table. It is like executing a WHERE
clause on the right (or JOIN
) table before executing the JOIN
. The rows that do not match the product_id
above will be included, but the columns containing time
or transaction_id
data will be blank.
The simplest way to remember this is to remember that when this query is executed, the ON
clause is always executed first.
Keys
A “primary key” is a column that exists in every table. It is a column whose value is unique for every row. Designated with “PK” in ERDs, the primary key is commonly the first column in a table.
A “foreign key” (FK) in one table is a primary key in another. They are a link between the tables. Unlike primary keys, foreign keys do not need to be unique.
In the ON
clause, a PK is always linked a FK.
Union
UNION
combines the results of multiple SELECT
statements into a single set of results. Two rules must be followed when using UNION
:
- All queries must return the same number of columns,
- Corresponding columns in the queries must have compatible data types.
The syntax follows.
SELECT column_1,
column_2
FROM table_A
UNION
SELECT column_1,
column_2
FROM table_B
Reasons to use UNION
include:
UNION
removes all duplicate rows unlessUNION ALL
is used. Duplicate rows are defined as having the same data across all columns.UNION
is frequently used to combine data from similar tables that are not perfectly normalized, meaning, the information has not been combined into a single table yet.