Recall that relational databases work by breaking large, complex datasets into smaller tables with unified themes. This only works if users can combine the information back together when required. This is achieved using a class of SQL commands called
The way that SQL efficiently joins tables is by performing a Cartesian product, or cross product, of the possible pairs of items in the two tables. This generates all possible combinations of rows. In practice, this means filtering the columns by the criteria in the query, taking the Cartesian product of those filtered columns, and appending any additional columns required.
So, the only way to join tables appropriately is to specify the Cartesian product of rows that have the same IDs as a unique ID column. This is the explanation of unexpected outputs when you join tables that have duplicate rows, or columns that have many to many relationships. The results will be more output rows than expected.
- Full Outer Joins are not possible in MySQL, but are possible in Teradata. The fact that not all database systems use Full Outer Joins should give some indication as to how rare full outer joins are.
- While primary keys are not supposed to have duplicate values, this is not always the case in practice. This can lead to some of the
JOINissues described above.
- Many-to-One cardinality constraints, where one primary key links to multiple items in the tabled being
JOINed, will also result in extra resulting rows.
JOINs (inner) will omit nulls in either table.
LEFT JOINs will retain nulls in the left table.
The issues outlined above are some of the reasons why database experts prefer to use terminology from set theory (tuple) for database design, rather than more common terminology (row). In theory, tuples can’t be duplicated. In practice, rows can be duplicated. So, different terminology can be useful.
- Clean Your Data! Get rid of as many duplicates as possible before doing analysis.
- Look at small testing examples of rows that will be outputted by every pair of tables to be joined, before aggregating the results of a join. This also helps to ensure the wrong kind of join isn’t used at any point.
- Double-check and build large queries incrementally.
The “equijoin” syntax involves telling the database how to relate tables in a
WHERE clause, like this:
The more traditional syntax involves the
ON keywords, and frees up the
WHERE clause for other things that users might want to include in the query. The bulk of this section will use this more common, traditional syntax, but first, a few examples using the “equijoin” syntax.
How would you extract the user_guid, dog_guid, breed, breed_type, and breed_group for all animals who completed the “Yawn Warm-up” game?
|ce134e42-7144-11e5-ba71-058fbc01cf0b||fd27b272-7144-11e5-ba71-058fbc01cf0b||Labrador Retriever||Pure Breed||Sporting|
|ce1353d8-7144-11e5-ba71-058fbc01cf0b||fd27b5ba-7144-11e5-ba71-058fbc01cf0b||Shetland Sheepdog||Pure Breed||Herding|
|ce135ab8-7144-11e5-ba71-058fbc01cf0b||fd27b6b4-7144-11e5-ba71-058fbc01cf0b||Golden Retriever||Pure Breed||Sporting|
How would you extract the user_guid, membership_type, and dog_guid of all the golden retrievers who completed at least 1 Dognition test?
How many unique Golden Retrievers who live in North Carolina are there in the Dognition database?
How many unique customers within each membership type provided reviews?
For which 3 dog breeds do we have the greatest amount of site_activity data, (as defined by non-NULL values in script_detail_id)?
|Labrador Retriever-Golden Retriever Mix||27498|
Traditional joins are best explained in comparison to the “equijoin” syntax. The following queries produce equivalent results.
Use COUNT and DISTINCT to determine how many distinct skus there are in pairs of the skuinfo, skstinfo, and trnsact tables. Which skus are common to pairs of tables, or unique to specific tables?
|Left Table||Right Table||Left Join Sku Count||Inner Join Sku Count|
This analysis also can be consolidated into a single query:
|Table with Null||i.sku||s.sku||t.sku|
- The SKUs in the skuinfo table are a superset of the SKUs in the trnsact table
- The SKUs in the skuinfo table are a superset of the SKUs in the skstinfo table
- Both skstinfo and trnsact contain SKUs that are not contained in the other
Use COUNT to determine how many instances there are of each sku associated with each store in the skstinfo table and the trnsact table?
- There are multiple instances of every sku/store combination in the trnsact table, but only one instance of every sku/store combination in the skstinfo table.
Use COUNT and DISTINCT to determine how many distinct stores there are in the strinfo, store_msa, skstinfo, and trnsact tables.
Which stores are common to all four tables, or unique to specific tables?
|Table with Null||strinfo Stores||store_msa Stores||skstinfo Stores||trnsact Stores|
Some SKUs in the trnsact table are not in the skstinfo table. What do those rows have in common?
What is Dillard’s average profit per day?
Result is $10,779.20.
On what day was the total value (in $) of returned goods the greatest? On what day was the total number of individual returned items the greatest?
What is the maximum price paid for an item in our database? What is the minimum price paid for an item in our database?
How many departments have more than 100 brands associated with them, and what are their descriptions?
Write a query that retrieves the department descriptions of each of the skus in the skstinfo table.
What department (with department description), brand, style, and color had the greatest total value of returned items?
In what state and zip code is the store that had the greatest total revenue during the time period monitored in our dataset?
On what day was Dillard’s income, based on total sum of purchases, the greatest?
What is the deptdesc of the departments that have the top 3 greatest numbers of skus from the skuinfo table associated with them?
Which table contains the most distinct sku numbers?
This could also be determined by examining the entity-relationship diagram.
How many skus are in the skstinfo table, but NOT in the skuinfo table?
What is the average amount of profit Dillard’s made per day?
Using one query to retrieve your answer, how many MSAs are there within the state of North Carolina (abbreviated “NC”), and within these MSAs, what is the lowest population level (msa_pop) and highest income level (msa_income)?
How many stores have more than 180,000 distinct skus associated with them in the skstinfo table?
Returns 12 rows, indicating 12 stores.
Look at the data from all the distinct skus in the “cop” department with a “federal” brand and a “rinse wash” color.
How many skus are in the skuinfo table, but NOT in the skstinfo table?
In what city and state is the store that had the greatest total sum of sales?
Returns store 8402
Returns Metairie, LA.
How many states have more than 10 Dillards stores in them?
Returns 15 rows, indicating 15 states.
What is the suggested retail price of all the skus in the “reebok” department with the “skechers” brand and a “wht/saphire” color?
Some content in this note are taken from the Jupyter Notebooks enumerated below. They are accessible as part of the “Managing Big Data with MySQL” course on coursera.org, and licensed by Jana Schaich Borg under CC BY-NC 4.0.
- MySQL Exercise 07 Inner Joins
- MySQL Exercise 08 Joining Tables with Outer Joins
Other content for this note is taken from the Coursera course “Managing Big Data with MySQL.”