SQL Joins

The tables queried this blog post are the publicly-available “tutorial.excel_sql_inventory_data” and “tutorial.excel_sql_transaction_data” tables that are available at modeanalytics.com.

What are SQL JOINs?

JOINs are where the power of SQL truly lies, and are a primary reason for SQL’s continued popularity in the marketplace. They allow users to join multiple tables from a relational database together.

A common question about relational databases is why databases are split up in the first place. Why can’t all the data be included in a single, large table? The short answer is that separating the data into a series of tables limits the overall number of reads and writes that would otherwise need to occur on a single, very large table. Splitting up the data also speeds up queries, and makes sense given that different types of data are likely to required different sorts of computing activity to maintain. As an example, an orders or web_events table is likely to be written to, once, and then subsequently read occasionally. A sales_reps table, on the other hand, may require frequent access to update contact information and which accounts a given salesperson is responsible for.

In practice, the approach companies have adopted is to split up the data logically, based upon the different objects that the data refers to. ERDs, or Entity Relationship Diagrams, define these relationships. The example I showed previously is included below, for reference.

The process of breaking up a database into a series of data tables is called “Database Normalization.” This is a deep topic and requires its own skillset, but is generally outside the job scope of most analysts, who are principally concerned with pulling data from a database that already exists.

Example Entity Relationship Diagram
Image courtesy of lucidchart, an online service with which you can create diagrams, like that shown here.

JOIN Syntax

The following syntax demonstrates how an analyst could pull data from multiple tables at one time.

Note that JOINs consist of two keywords: JOIN precedes the table being joined, and ON precedes the column on which the tables are to be joined.

SELECT tutorial.excel_sql_inventory_data.product_id,
       tutorial.excel_sql_inventory_data.price_unit,
       tutorial.excel_sql_transaction_data.time,
       tutorial.excel_sql_transaction_data.transaction_id
  FROM tutorial.excel_sql_inventory_data
  JOIN tutorial.excel_sql_transaction_data
    ON tutorial.excel_sql_inventory_data.product_id = tutorial.excel_sql_transaction_data.product_id
product_id price_unit time transaction_id
3 1.02 2016-01-08 17:46:17 1
61 0.55 2016-01-08 17:46:17 1
23 0.91 2016-01-07 14:11:57 2
52 1.79 2016-01-06 17:57:42 4
4 1.19 2016-01-06 17:57:42 4

Aliases

It is possible to create aliases for table names for subsequent use in a given SQL query. The keyword AS is sometimes used in between the table and its alias, but it can be omitted. As an example, utilizing aliases in the query above would result in the following much more readable query:

SELECT i.product_id,
       i.price_unit,
       t.time,
       t.transaction_id
  FROM tutorial.excel_sql_inventory_data i
  JOIN tutorial.excel_sql_transaction_data t
    ON i.product_id = t.product_id

In this example, only a few keystrokes are saved, but aliases greatly enhance readability in larger, more complex queries. Note that while the aliases are “defined” in the FROM and JOIN clauses, they can be used before those clauses, in the SELECT clause.

It is most common to alias tables, but it is also possible to alias columns, as shown below. The quotes are only necessary if spaces are included in the aliases.

SELECT i.product_id "Product ID",
       i.price_unit "Price Per Unit",
       t.time "Transaction Time",
       t.transaction_id "Transaction ID"
  FROM tutorial.excel_sql_inventory_data i
  JOIN tutorial.excel_sql_transaction_data t
    ON i.product_id = t.product_id

The table that is returned will have name fields that reflect the alias, which may be much more readable.

Product ID Price Per Unit Transaction Time Transaction ID
3 1.02 2016-01-08 17:46:17 1
61 0.55 2016-01-08 17:46:17 1
23 0.91 2016-01-07 14:11:57 2
52 1.79 2016-01-06 17:57:42 4
4 1.19 2016-01-06 17:57:42 4

Types of JOINs

In addition to the standard JOIN demonstrated above, there are also other types of joins that retain or exclude different data depending on the table from which they are extracted. These JOINs are best imagined as a Venn diagram. The left circle includes the rows of data in the table in the FROM clause. The right circle includes all rows of data in the table in the JOIN clause. The overlapping section in the middle represents all rows for which the ON clause is true.

Those types of JOINs are defined below.

  • INNER JOIN - The most exclusive JOIN, it returns only rows at the intersection of the two tables.
  • LEFT JOIN - Consists of an INNER JOIN plus all rows in the FROM table, even if they did not match. Imagine all the data within the left circle of the Venn diagram is included.
  • RIGHT JOIN - Consists of an INNER JOIN plus all rows in the JOIN table, even if they did not match. Imagine all the data within the right circle of the Venn diagram is included.
  • (Full) OUTER JOIN - The most inclusive JOIN, it includes all rows in both tables. These are rare.

Note that LEFT JOINs and RIGHT JOINs are interchangeable by swapping which tables included in the FROM and JOIN clauses. The standard has become to architect queries such that LEFT JOINs are used preferentially. As a result, RIGHT JOINs are rarely seen.

SELECT i.product_id,
       i.price_unit,
       t.time,
       t.transaction_id
  FROM tutorial.excel_sql_inventory_data i
  LEFT JOIN tutorial.excel_sql_transaction_data t
    ON i.product_id = t.product_id
 ORDER BY product_id DESC
product_id price_unit time t_transaction_id
85 0.08    
84 2.59 2016-01-03 12:36:15 152
84 2.59 2016-01-04 14:29:52 221
84 2.59 2016-01-08 11:57:31 260
84 2.59 2016-01-06 08:48:59 143

Omit the word LEFT in the foregoing query and an INNER JOIN is performed instead of LEFT JOIN. As a result, the row containing product_id 85 is not included in the result set, since it does not find a correlate in the transactions table.

Filtering

It is possible to filter the data returned from a query using JOIN clauses without using a WHERE clause, one of the primary means of filtering. Consider the following two examples.

SELECT i.product_id,
       i.price_unit,
       t.time,
       t.transaction_id
  FROM tutorial.excel_sql_inventory_data i
  LEFT JOIN tutorial.excel_sql_transaction_data t
    ON i.product_id = t.product_id
 WHERE i.product_id = 5
 ORDER BY transaction_id
product_id price_unit time transaction_id
5 3.86 2016-01-06 08:42:28 26
5 3.86 2016-01-04 12:49:51 108

The foregoing code will return only rows meeting the WHERE clause, so the unmatched rows included as a result of the LEFT JOIN will be excluded. Essentially, the WHERE clause is performed after the JOIN is executed.

A similar, but not identical, approach is the following.

SELECT i.product_id,
       i.price_unit,
       t.time,
       t.transaction_id
  FROM tutorial.excel_sql_inventory_data i
  LEFT JOIN tutorial.excel_sql_transaction_data t
    ON i.product_id = t.product_id
   AND i.product_id = 5
 ORDER BY transaction_id
product_id price_unit time transaction_id
5 3.86 2016-01-06 08:42:28 26
5 3.86 2016-01-04 12:49:51 108
48 0.89    
53 1.69    
32 2.99    

Setting up the code this way results in rows being returned that match the product_id, or are blank in the columns sourced from the transactions table. It is like executing a WHERE clause on the right (or JOIN) table before executing the JOIN. The rows that do not match the product_id above will be included, but the columns containing time or transaction_id data will be blank.

The simplest way to remember this is to remember that when this query is executed, the ON clause is always executed first.

Keys

A “primary key” is a column that exists in every table. It is a column whose value is unique for every row. Designated with “PK” in ERDs, the primary key is commonly the first column in a table.

A “foreign key” (FK) in one table is a primary key in another. They are a link between the tables. Unlike primary keys, foreign keys do not need to be unique.

In the ON clause, a PK is always linked a FK.

Content for this note is taken from information I learned while pursuing the Udacity Data Analyst Nanodegree. Udacity is an online tech education service that offers both free and paid, tech-focused training. Learn more.