The tables queried this blog post are the publicly-available “tutorial.excel_sql_inventory_data” and “tutorial.excel_sql_transaction_data” tables that are available at modeanalytics.com.
What are SQL JOINs?
JOINs are where the power of SQL truly lies, and are a primary reason for SQL’s continued popularity in the marketplace. They allow users to join multiple tables from a relational database together.
A common question about relational databases is why databases are split up in the first place. Why can’t all the data be included in a single, large table? The short answer is that separating the data into a series of tables limits the overall number of reads and writes that would otherwise need to occur on a single, very large table. Splitting up the data also speeds up queries, and makes sense given that different types of data are likely to required different sorts of computing activity to maintain. As an example, an
web_events table is likely to be written to, once, and then subsequently read occasionally. A
sales_reps table, on the other hand, may require frequent access to update contact information and which accounts a given salesperson is responsible for.
In practice, the approach companies have adopted is to split up the data logically, based upon the different objects that the data refers to. ERDs, or Entity Relationship Diagrams, define these relationships. The example I showed previously is included below, for reference.
The process of breaking up a database into a series of data tables is called “Database Normalization.” This is a deep topic and requires its own skillset, but is generally outside the job scope of most analysts, who are principally concerned with pulling data from a database that already exists.
The following syntax demonstrates how an analyst could pull data from multiple tables at one time.
JOINs consist of two keywords:
JOIN precedes the table being joined, and
ON precedes the column on which the tables are to be joined.
It is possible to create aliases for table names for subsequent use in a given SQL query. The keyword
AS is sometimes used in between the table and its alias, but it can be omitted. As an example, utilizing aliases in the query above would result in the following much more readable query:
In this example, only a few keystrokes are saved, but aliases greatly enhance readability in larger, more complex queries. Note that while the aliases are “defined” in the
JOIN clauses, they can be used before those clauses, in the
It is most common to alias tables, but it is also possible to alias columns, as shown below. The quotes are only necessary if spaces are included in the aliases.
The table that is returned will have name fields that reflect the alias, which may be much more readable.
|Product ID||Price Per Unit||Transaction Time||Transaction ID|
Types of JOINs
In addition to the standard
JOIN demonstrated above, there are also other types of joins that retain or exclude different data depending on the table from which they are extracted. These
JOINs are best imagined as a Venn diagram. The left circle includes the rows of data in the table in the
FROM clause. The right circle includes all rows of data in the table in the
JOIN clause. The overlapping section in the middle represents all rows for which the
ON clause is true.
Those types of
JOINs are defined below.
INNER JOIN- The most exclusive
JOIN, it returns only rows at the intersection of the two tables.
LEFT JOIN- Consists of an
INNER JOINplus all rows in the
FROMtable, even if they did not match. Imagine all the data within the left circle of the Venn diagram is included.
RIGHT JOIN- Consists of an
INNER JOINplus all rows in the
JOINtable, even if they did not match. Imagine all the data within the right circle of the Venn diagram is included.
OUTER JOIN- The most inclusive
JOIN, it includes all rows in both tables. These are rare.
LEFT JOINs and
RIGHT JOINs are interchangeable by swapping which tables included in the
JOIN clauses. The standard has become to architect queries such that
LEFT JOINs are used preferentially. As a result,
RIGHT JOINs are rarely seen.
Omit the word
LEFT in the foregoing query and an
INNER JOIN is performed instead of
LEFT JOIN. As a result, the row containing product_id 85 is not included in the result set, since it does not find a correlate in the transactions table.
It is possible to filter the data returned from a query using
JOIN clauses without using a
WHERE clause, one of the primary means of filtering. Consider the following two examples.
The foregoing code will return only rows meeting the
WHERE clause, so the unmatched rows included as a result of the
LEFT JOIN will be excluded. Essentially, the
WHERE clause is performed after the
JOIN is executed.
A similar, but not identical, approach is the following.
Setting up the code this way results in rows being returned that match the
product_id, or are blank in the columns sourced from the transactions table. It is like executing a
WHERE clause on the right (or
JOIN) table before executing the
JOIN. The rows that do not match the
product_id above will be included, but the columns containing
transaction_id data will be blank.
The simplest way to remember this is to remember that when this query is executed, the
ON clause is always executed first.
A “primary key” is a column that exists in every table. It is a column whose value is unique for every row. Designated with “PK” in ERDs, the primary key is commonly the first column in a table.
A “foreign key” (FK) in one table is a primary key in another. They are a link between the tables. Unlike primary keys, foreign keys do not need to be unique.
ON clause, a PK is always linked a FK.