Aggregations and Joins Queries

JOINs

Recall that relational databases work by breaking large, complex datasets into smaller tables with unified themes. This only works if users can combine the information back together when required. This is achieved using a class of SQL commands called JOINs.

The way that SQL efficiently joins tables is by performing a Cartesian product, or cross product, of the possible pairs of items in the two tables. This generates all possible combinations of rows. In practice, this means filtering the columns by the criteria in the query, taking the Cartesian product of those filtered columns, and appending any additional columns required.

So, the only way to join tables appropriately is to specify the Cartesian product of rows that have the same IDs as a unique ID column. This is the explanation of unexpected outputs when you join tables that have duplicate rows, or columns that have many to many relationships. The results will be more output rows than expected.

  • Full Outer Joins are not possible in MySQL, but are possible in Teradata. The fact that not all database systems use Full Outer Joins should give some indication as to how rare full outer joins are.
  • While primary keys are not supposed to have duplicate values, this is not always the case in practice. This can lead to some of the JOIN issues described above.
  • Many-to-One cardinality constraints, where one primary key links to multiple items in the tabled being JOINed, will also result in extra resulting rows.
  • Standard JOINs (inner) will omit nulls in either table. LEFT JOINs will retain nulls in the left table.

The issues outlined above are some of the reasons why database experts prefer to use terminology from set theory (tuple) for database design, rather than more common terminology (row). In theory, tuples can’t be duplicated. In practice, rows can be duplicated. So, different terminology can be useful.

Mitigation Strategies

  • Clean Your Data! Get rid of as many duplicates as possible before doing analysis.
  • Look at small testing examples of rows that will be outputted by every pair of tables to be joined, before aggregating the results of a join. This also helps to ensure the wrong kind of join isn’t used at any point.
  • Double-check and build large queries incrementally.

Equijoin Syntax

The “equijoin” syntax involves telling the database how to relate tables in a WHERE clause, like this:

WHERE d.dog_guid = r.dog_guid

The more traditional syntax involves the JOIN and ON keywords, and frees up the WHERE clause for other things that users might want to include in the query. The bulk of this section will use this more common, traditional syntax, but first, a few examples using the “equijoin” syntax.

Examples

How would you extract the user_guid, dog_guid, breed, breed_type, and breed_group for all animals who completed the “Yawn Warm-up” game?

SELECT d.user_guid,
       d.dog_guid,
       d.breed,
       d.breed_type,
       d.breed_group
FROM complete_tests c,
     dogs d
WHERE c.dog_guid = d.dog_guid
  AND c.test_name = "Yawn Warm-up"
LIMIT 3;
user_guid dog_guid breed breed_type breed_group
ce134e42-7144-11e5-ba71-058fbc01cf0b fd27b272-7144-11e5-ba71-058fbc01cf0b Labrador Retriever Pure Breed Sporting
ce1353d8-7144-11e5-ba71-058fbc01cf0b fd27b5ba-7144-11e5-ba71-058fbc01cf0b Shetland Sheepdog Pure Breed Herding
ce135ab8-7144-11e5-ba71-058fbc01cf0b fd27b6b4-7144-11e5-ba71-058fbc01cf0b Golden Retriever Pure Breed Sporting

How would you extract the user_guid, membership_type, and dog_guid of all the golden retrievers who completed at least 1 Dognition test?

SELECT DISTINCT u.user_guid,
       u.membership_type,
       d.dog_guid,
       d.breed
FROM users u, dogs d, complete_tests c
WHERE u.user_guid = d.user_guid
  AND d.dog_guid = c.dog_guid
  AND d.breed = 'Golden Retriever'
LIMIT 3;
user_guid membership_type dog_guid breed
ce135ab8-7144-11e5-ba71-058fbc01cf0b 1 fd27b6b4-7144-11e5-ba71-058fbc01cf0b Golden Retriever
ce13507c-7144-11e5-ba71-058fbc01cf0b 1 fd27b79a-7144-11e5-ba71-058fbc01cf0b Golden Retriever
ce1389d4-7144-11e5-ba71-058fbc01cf0b 1 fd27efb2-7144-11e5-ba71-058fbc01cf0b Golden Retriever

How many unique Golden Retrievers who live in North Carolina are there in the Dognition database?

SELECT COUNT(DISTINCT d.dog_guid)
FROM dogs d,
     users u
WHERE d.user_guid = u.user_guid
  AND d.breed = 'Golden Retriever'
  AND u.state = 'NC';
COUNT(DISTINCT d.dog_guid)
30

How many unique customers within each membership type provided reviews?

SELECT u.membership_type,
       COUNT(DISTINCT u.user_guid)
  FROM users u, reviews r
 WHERE u.user_guid = r.user_guid
   AND r.rating IS NOT null
 GROUP BY membership_type;
membership_type COUNT(DISTINCT u.user_guid)
1 2900
2 1120
3 238
4 816
5 15

For which 3 dog breeds do we have the greatest amount of site_activity data, (as defined by non-NULL values in script_detail_id)?

SELECT d.breed,
       COUNT(s.script_detail_id) Activities
FROM dogs d, site_activities s
WHERE d.dog_guid = s.dog_guid
GROUP BY d.breed
ORDER BY Activities DESC
LIMIT 3;
breed Activities
Mixed 93415
Labrador Retriever 38804
Labrador Retriever-Golden Retriever Mix 27498

Traditional Joins

Traditional joins are best explained in comparison to the “equijoin” syntax. The following queries produce equivalent results.

SELECT d.user_guid AS UserID,
       d.dog_guid AS DogID
  FROM dogs d,
       complete_tests c
 WHERE d.dog_guid = c.dog_guid
   AND c.test_name = 'Yawn Warm-up';
SELECT d.user_guid AS UserID,
       d.dog_guid AS DogID
  FROM dogs d
  JOIN complete_tests c
    ON d.dog_guid = c.dog_guid
 WHERE c.test_name = 'Yawn Warm-up';

Exercises

Use COUNT and DISTINCT to determine how many distinct skus there are in pairs of the skuinfo, skstinfo, and trnsact tables. Which skus are common to pairs of tables, or unique to specific tables?

SELECT COUNT(DISTINCT sku)
  FROM skuinfo;
Table Sku Count
skuinfo 1,564,178
skstinfo 760,212
trnsact 714,499
SELECT COUNT(DISTINCT i.sku)
  FROM skuinfo i
  LEFT JOIN skstinfo s
    ON i.sku = s.sku;
Left Table Right Table Left Join Sku Count Inner Join Sku Count
skuinfo skstinfo 1,564,178 760,212
skuinfo trnsact 1,564,178 714,499
skstinfo skuinfo 760,212 760,212
skstinfo trnsact 760,212 542,513
trnsact skuinfo 714,499 714,499
trnsact skstinfo 714,499 542,513

This analysis also can be consolidated into a single query:

SELECT COUNT(DISTINCT i.sku),
       COUNT(DISTINCT s.sku),
       COUNT(DISTINCT t.sku)
  FROM skuinfo i
  LEFT JOIN skstinfo s
    ON i.sku = s.sku
  LEFT JOIN trnsact t
    ON i.sku = t.sku
 WHERE t.sku IS NULL
Table with Null i.sku s.sku t.sku
i.sku 0 0 0
s.sku 803,966 0 171,986
t.sku 849,679 217,699 0
  • The SKUs in the skuinfo table are a superset of the SKUs in the trnsact table
  • The SKUs in the skuinfo table are a superset of the SKUs in the skstinfo table
  • Both skstinfo and trnsact contain SKUs that are not contained in the other

Use COUNT to determine how many instances there are of each sku associated with each store in the skstinfo table and the trnsact table?

SELECT store,
       sku,
       COUNT(*)
  FROM skstinfo
 GROUP BY store,
          sku
 ORDER BY store,
          sku;
  • There are multiple instances of every sku/store combination in the trnsact table, but only one instance of every sku/store combination in the skstinfo table.

Use COUNT and DISTINCT to determine how many distinct stores there are in the strinfo, store_msa, skstinfo, and trnsact tables.

SELECT COUNT(DISTINCT store)
  FROM strinfo;
Table Count
strinfo 453
skstinfo 357
store_msa 333
trnsact 332

Which stores are common to all four tables, or unique to specific tables?

SELECT COUNT(DISTINCT strinfo.store),
       COUNT(DISTINCT store_msa.store),
       COUNT(DISTINCT skstinfo.store),
       COUNT(DISTINCT trnsact.store)
  FROM strinfo
  LEFT JOIN store_msa
    ON strinfo.store = store_msa.store
  LEFT JOIN skstinfo
    ON strinfo.store = skstinfo.store
  LEFT JOIN trnsact
    ON strinfo.store = trnsact.store
 WHERE strinfo.store IS NULL;
Table with Null strinfo Stores store_msa Stores skstinfo Stores trnsact Stores
strinfo 0 0 0 0
store_msa 120 0 29 0
skstinfo 96 5 0 5
trnsact ??? ??? ??? ???

Some SKUs in the trnsact table are not in the skstinfo table. What do those rows have in common?

SELECT trnsact.*,
       skstinfo.sku
  FROM trnsact
  LEFT JOIN skstinfo
    ON trnsact.sku = skstinfo.sku
   AND trnsact.store = skstinfo.store
 WHERE skstinfo.sku IS NULL;

What is Dillard’s average profit per day?

SELECT SUM(t.amt-s.cost)/COUNT(DISTINCT t.saledate)
  FROM trnsact t
  JOIN skstinfo s
    ON t.sku = s.sku
   AND t.store = s.store
 WHERE t.quantity = 1
   AND t.register = 640
   AND t.stype = 'P';

Result is $10,779.20.

On what day was the total value (in $) of returned goods the greatest? On what day was the total number of individual returned items the greatest?

SELECT TOP 3 saledate,
       SUM(quantity) returns_qty,
       SUM(amt) returns_value
  FROM trnsact t
 WHERE t.stype = 'R'
 GROUP BY saledate
 ORDER BY SUM(quantity) DESC;
SALEDATE RETURNS_QTY RETURNS_VALUE
04/12/27 82512 3030259.76
04/12/26 71710 2665283.86
04/12/28 64265 2332544.44

What is the maximum price paid for an item in our database? What is the minimum price paid for an item in our database?

SELECT MAX(amt),
       MIN(amt)
  FROM trnsact t
 WHERE t.stype = 'P'
   AND t.quantity = 1;
MAXIMUM(AMT) MINIMUM(AMT)
6017.0 0.00

How many departments have more than 100 brands associated with them, and what are their descriptions?

SELECT d.deptdesc,
       COUNT(DISTINCT s.brand) brand_count
  FROM deptinfo d
  JOIN skuinfo s
    ON d.dept = s.dept
 GROUP BY d.deptdesc
HAVING COUNT(DISTINCT s.brand) > 100
 ORDER BY COUNT(DISTINCT s.brand) DESC;
DEPTDESC BRAND_COUNT
ENVIRON 389
COLEHAAN 118
CARTERS 109

Write a query that retrieves the department descriptions of each of the skus in the skstinfo table.

SELECT s.sku,
       d.deptdesc
  FROM skstinfo s
  JOIN skuinfo i
    ON s.sku = i.sku
  JOIN deptinfo d
    ON i.dept = d.dept;

What department (with department description), brand, style, and color had the greatest total value of returned items?

SELECT SUM(t.amt),
       d.deptdesc,
       s.brand,
       s.style,
       s.color
  FROM trnsact t
  JOIN skuinfo s
    ON t.sku = s.sku
  JOIN deptinfo d
    ON s.dept = d.dept
 WHERE t.stype = 'R'
 GROUP BY d.deptdesc,
          s.brand,
          s.style,
          s.color
 ORDER BY SUM(t.amt) DESC;

In what state and zip code is the store that had the greatest total revenue during the time period monitored in our dataset?

SELECT TOP 3 s.store,
       s.zip,
       s.state,
       SUM(t.amt)
  FROM trnsact t
  JOIN strinfo s
    ON t.store = s.store
 WHERE t.stype = 'P'
 GROUP BY s.store,
       s.zip,
       s.state
 ORDER BY SUM(t.amt) DESC;

On what day was Dillard’s income, based on total sum of purchases, the greatest?

SELECT saledate,
       SUM(amt) Income
  FROM trnsact
 WHERE saledate in ('2004-11-01',
                    '2005-02-28',
                    '2005-02-01',
                    '2004-12-18')
 GROUP BY saledate
 ORDER BY SUM(amt) DESC;

What is the deptdesc of the departments that have the top 3 greatest numbers of skus from the skuinfo table associated with them?

SELECT d.deptdesc,
       COUNT(s.sku)
  FROM skuinfo s
  JOIN deptinfo d
    ON s.dept = d.dept
 GROUP BY d.deptdesc
 ORDER BY COUNT(s.sku) DESC;

Which table contains the most distinct sku numbers?

SELECT COUNT(DISTINCT sku)
  FROM skstinfo;
Table Distinct SKUs
skuinfo 1564178
skstinfo 760212
trnsact 714499

This could also be determined by examining the entity-relationship diagram.

How many skus are in the skstinfo table, but NOT in the skuinfo table?

SELECT COUNT(s.sku)
  FROM skstinfo s
  LEFT JOIN skuinfo u
    ON s.sku = u.sku
 WHERE u.sku IS NULL;

0.

What is the average amount of profit Dillard’s made per day?

SELECT SUM(t.amt-s.cost)/COUNT(DISTINCT t.saledate)
  FROM trnsact t
  JOIN skstinfo s
    ON t.sku = s.sku
   AND t.store = s.store
 WHERE t.quantity = 1
   AND t.stype = 'P';

$1,527,903.

Using one query to retrieve your answer, how many MSAs are there within the state of North Carolina (abbreviated “NC”), and within these MSAs, what is the lowest population level (msa_pop) and highest income level (msa_income)?

SELECT COUNT(msa),
       MIN(msa_pop),
       MAX(msa_income)
  FROM store_msa
 WHERE state = "NC";
COUNT(msa) MIN(msa_pop) MAX(msa_income)
16 339511 36151

How many stores have more than 180,000 distinct skus associated with them in the skstinfo table?

SELECT r.store,
       COUNT(DISTINCT s.sku)
  FROM skstinfo s
  JOIN strinfo r
    ON s.store = r.store
 GROUP BY r.store
HAVING COUNT(DISTINCT s.sku) > 180000;

Returns 12 rows, indicating 12 stores.

Look at the data from all the distinct skus in the “cop” department with a “federal” brand and a “rinse wash” color.

SELECT d.deptdesc,
       s.*
  FROM skuinfo s
  JOIN deptinfo d
    ON s.dept = d.dept
 WHERE d.deptdesc = 'cop'
   AND s.brand = 'federal'
   AND s.color = 'rinse wash';

How many skus are in the skuinfo table, but NOT in the skstinfo table?

SELECT COUNT(u.sku)
  FROM skuinfo u
  LEFT JOIN skstinfo s
    ON u.sku = s.sku
 WHERE s.sku IS NULL;

803,966.

In what city and state is the store that had the greatest total sum of sales?

SELECT s.store,
       SUM(t.amt)
  FROM transact t
  JOIN strinfo s
    ON t.store = s.store
 GROUP BY s.store
 ORDER BY SUM(t.amt) DESC;

Returns store 8402

SELECT city,
       state
  FROM strinfo
 WHERE store = 8402;

Returns Metairie, LA.

How many states have more than 10 Dillards stores in them?

SELECT state,
       COUNT(store)
  FROM strinfo
 GROUP BY state
HAVING COUNT(store) > 10;

Returns 15 rows, indicating 15 states.

What is the suggested retail price of all the skus in the “reebok” department with the “skechers” brand and a “wht/saphire” color?

SELECT MIN(r.retail),
       MAX(r.retail)
  FROM deptinfo d
  JOIN skuinfo s
    ON d.dept = s.dept
  JOIN skstinfo r
    ON s.sku = r.sku
 WHERE d.deptdesc = 'reebok'
   AND s.brand = 'skechers'
   AND s.color = 'wht/saphire';

$29.00.

Some content in this note are taken from the Jupyter Notebooks enumerated below. They are accessible as part of the “Managing Big Data with MySQL” course on coursera.org, and licensed by Jana Schaich Borg under CC BY-NC 4.0.

  • MySQL Exercise 07 Inner Joins
  • MySQL Exercise 08 Joining Tables with Outer Joins

Other content for this note is taken from the Coursera course “Managing Big Data with MySQL.”