Complex Teradata Queries


Exercises

How many distinct dates are there in the saledate column of the transaction table for each month/year combination in the database?

SELECT DISTINCT EXTRACT(MONTH from saledate) t_month,
       EXTRACT(YEAR from saledate) t_year,
       COUNT(DISTINCT saledate) distinct_saledates
  FROM trnsact
 GROUP BY t_month,
          t_year
 ORDER BY t_year,
          t_month;
T_MONTH T_YEAR DISTINCT_SALEDATES
8 2004 31
9 2004 30
10 2004 31
11 2004 29
12 2004 30
1 2005 31
2 2005 28
3 2005 30
4 2005 30
5 2005 31
6 2005 30
7 2005 31
8 2005 27

Note there are 27 days recorded in August 2005, and 31 recorded for August 2004. August 2005 transactions will be excluded. Thanksgiving (Nov 25), Christmas (Dec 25), and March 27 do not have sales data.

Use a CASE statement within an aggregate function to determine which sku had the greatest total sales during the combined summer months of June, July, and August.

SELECT sku,
       SUM(CASE EXTRACT(MONTH from saledate)
                WHEN 6 THEN amt
                WHEN 7 THEN amt
                WHEN 8 THEN amt
                ELSE 0
            END) AS summer_sales
  FROM trnsact
 WHERE stype = 'P'
 GROUP BY sku
 ORDER BY summer_sales DESC;
SKU SUMMER_SALES
4108011 1646017.38
3524026 1464189.00
5528349 1315121.00

How many distinct dates are there in the saledate column of the transaction table for each month/year/store combination in the database? Sort your results by the number of days per combination in ascending order.

SELECT DISTINCT EXTRACT(MONTH from saledate) t_month,
       EXTRACT(YEAR from saledate) t_year,
       store,
       COUNT(DISTINCT saledate) distinct_saledates
  FROM trnsact
 GROUP BY t_month,
          t_year,
          store
 ORDER BY distinct_saledates;
T_MONTH T_YEAR STORE DISTINCT_SALEDATES
3 2005 8304 1
7 2005 7604 1
8 2004 8304 1

There are multiple rows with only one saledate for a given month year and store. This is most likely a result of some data being removed before the dataset was donated. This will need to be taken into account when examining sales trends for subsets of stores.

What is the average daily revenue for each store/month/year combination in the database? Calculate this by dividing the total revenue for a group by the number of sales days available in the transaction table for that group.

SELECT DISTINCT EXTRACT(MONTH from saledate) t_month,
       EXTRACT(YEAR from saledate) t_year,
       store,
       COUNT(DISTINCT saledate) distinct_saledates,
       SUM(amt) total_sales,
       (total_sales / distinct_saledates) AS avg_daily_revenue
  FROM trnsact
 WHERE stype = 'P'
 GROUP BY t_month,
          t_year,
          store
 ORDER BY avg_daily_revenue DESC;
T_MONTH T_YEAR STORE DISTINCT_SALEDATES TOTAL_SALES AVG_DAILY_REVENUE
12 2004 8402 30 3152448.34 105081.61
12 2004 504 30 2724376.37 90812.55
12 2004 2707 30 2711980.93 90399.36

Now build a query to remove all data from August, 2005, because the data for this month is incomplete. This is accomplished by creating a string including both the year and month for each row, and then using a WHERE clause to exclude rows having 200508 in that column.

Also, remove all combinations of year_month and store having fewer than 20 transactions, using a HAVING clause.

SELECT EXTRACT(YEAR from saledate) t_year,
       EXTRACT(MONTH from saledate) t_month,
       store,
       CASE WHEN t_month < 10
            THEN t_year||'0'||TRIM(t_month)
            ELSE t_year||TRIM(t_month)
             END AS year_month,
       COUNT(DISTINCT saledate) distinct_saledates
  FROM trnsact
 WHERE stype = 'P'
   AND year_month <> 200508
 GROUP BY t_year,
          t_month,
          store,
          year_month
HAVING distinct_saledates >= 20
 ORDER BY distinct_saledates;
T_YEAR T_MONTH STORE YEAR_MONTH DISTINCT_SALEDATES
2005 4 1704 200504 21
2004 9 3802 200409 21
2004 9 5502 200409 22

This filtering query can be used as a subquery by joining into it on YEAR, MONTH, and STORE. The subquery included below is a shorted version of the query above, but has identical functionality.

Note that it is necessary to specify the purchase type in the outer query as well as the inner query, because the inner query only excludes transactions on the basis of year, month, and store.

SELECT EXTRACT(YEAR from t.saledate) t_year,
       EXTRACT(MONTH from t.saledate) t_month,
       t.store t_store,
       (SUM(t.amt) / COUNT(DISTINCT t.saledate)) AS avg_daily_revenue
  FROM trnsact t
  JOIN (SELECT EXTRACT(YEAR from saledate) y,
               EXTRACT(MONTH from saledate) m,
               store s,
               CASE WHEN m < 10 THEN y||'0'||TRIM(m)
                    ELSE y||TRIM(m)
                END AS y_m
          FROM trnsact
         WHERE stype = 'P'
           AND y_m <> 200508
         GROUP BY y, m, s, y_m
        HAVING COUNT(DISTINCT saledate) >= 20) AS clean
    ON t_year = clean.y
   AND t_month = clean.m
   AND t_store = clean.s
 WHERE stype = 'P'
 GROUP BY t_year,
          t_month,
          t_store;
T_YEAR T_MONTH T_STORE AVG_DAILY_REVENUE
2004 12 8402 105081.61
2004 12 504 90812.55
2004 12 2707 90399.36

The subquery above will be used to clean (IE, exclude some of) the data for the majority of the queries in the remainder of this section.

What is the average daily revenue brought in by Dillard’s stores in areas of high, medium, or low levels of high school education?

SELECT CASE
            WHEN s_msa.msa_high > 70 THEN 'high'
            WHEN s_msa.msa_high > 60 THEN 'medium'
            WHEN s_msa.msa_high > 50 THEN 'low'
        END AS hs_attainment,
       SUM(stores.store_revenue)/SUM(stores.store_saledates) avg_daily_revenue
  FROM (SELECT t.store,
               SUM(t.amt) store_revenue,
               COUNT(DISTINCT t.saledate) store_saledates
          FROM trnsact t
          JOIN (SELECT EXTRACT(YEAR from saledate) y,
                       EXTRACT(MONTH from saledate) m,
                       store s,
                       CASE WHEN m < 10 THEN y||'0'||TRIM(m)
                            ELSE y||TRIM(m)
                        END AS y_m
                  FROM trnsact
                 WHERE stype = 'P'
                   AND y_m <> 200508
                 GROUP BY y, m, s, y_m
                HAVING COUNT(DISTINCT saledate) >= 20) AS clean
            ON EXTRACT(YEAR from t.saledate) = clean.y
           AND EXTRACT(MONTH from t.saledate) = clean.m
           AND t.store = clean.s
         WHERE stype = 'P'
         GROUP BY t.store) AS stores
  JOIN store_msa s_msa
    ON stores.store = s_msa.store
 GROUP BY hs_attainment;
HS_ATTAINMENT AVG_DAILY_REVENUE
low 34159.76
medium 25037.89
high 20937.31

Compare the average daily revenues of the stores with the highest median msa_income and the lowest median msa_income. In what city and state were these stores, and which store had a higher average daily revenue?

SELECT Top 3 t.store,
       SUM(t.amt)/COUNT(DISTINCT t.saledate) avg_daily_revenue,
       s_msa.msa_income,
       s_msa.city,
       s_msa.state
  FROM trnsact t
  JOIN (SELECT EXTRACT(YEAR from saledate) y,
               EXTRACT(MONTH from saledate) m,
               store s,
               CASE WHEN m < 10 THEN y||'0'||TRIM(m)
                    ELSE y||TRIM(m)
               END AS y_m
          FROM trnsact
         WHERE stype = 'P'
           AND y_m <> 200508
         GROUP BY y, m, s, y_m
        HAVING COUNT(DISTINCT saledate) >= 20) AS clean
    ON EXTRACT(YEAR from t.saledate) = clean.y
   AND EXTRACT(MONTH from t.saledate) = clean.m
   AND t.store = clean.s
  JOIN store_msa s_msa
    ON t.store = s_msa.store
 WHERE stype = 'P'
 GROUP BY t.store,
          s_msa.msa_income,
          s_msa.city,
          s_msa.state
 ORDER BY s_msa.msa_income DESC;

Top 3 msa_income:

STORE AVG_DAILY_REVENUE MSA_INCOME CITY STATE
3902 17884.08 56099 SPANISH FORT AL
3403 10072.60 46140 SALINA KS
9609 14644.10 45944 LONGMONT CO

Bottom 3 msa_income:

STORE AVG_DAILY_REVENUE MSA_INCOME CITY STATE
2707 56601.99 16022 MCALLEN TX
2907 34590.64 17374 BROWNSVILLE TX
2807 16479.60 17374 HARLINGEN TX

Both stores (highest and lowest median msa_income) can be retrieved in a single query, as well, as follows.

SELECT t.store,
       s_m.msa_income,
       s_m.city,
       s_m.state,
       SUM(t.amt) / COUNT(DISTINCT t.saledate) avg_daily_revenue
  FROM trnsact t
  JOIN (SELECT EXTRACT(YEAR from saledate) y,
               EXTRACT(MONTH from saledate) m,
               store s,
               CASE WHEN m < 10 THEN y||'0'||TRIM(m)
                    ELSE y||TRIM(m)
               END AS y_m
          FROM trnsact
         WHERE stype = 'P'
           AND y_m <> 200508
         GROUP BY y, m, s, y_m
        HAVING COUNT(DISTINCT saledate) >= 20) AS clean
    ON EXTRACT(YEAR from t.saledate) = clean.y
   AND EXTRACT(MONTH from t.saledate) = clean.m
   AND t.store = clean.s
  JOIN store_msa s_m
    ON t.store = s_m.store
 WHERE stype = 'P'
   AND (msa_income = (SELECT MAX(msa_income)
                       FROM store_msa)
    OR msa_income = (SELECT MIN(msa_income)
                       FROM store_msa))
 GROUP BY t.store,
          s_m.msa_income,
          s_m.city,
          s_m.state;
STORE MSA_INCOME CITY STATE AVG_DAILY_REVENUE  
3902 56099 SPANISH FORT AL 17884.08  
2707 16022 MCALLEN TX 56601.99  

What is the brand of the sku with the greatest standard deviation in sprice? Only examine skus that have been part of over 100 transactions.

SELECT t_c.sku,
       s.brand,
       STDDEV_SAMP(t.sprice) stddev_sprice,
       AVG(t.sprice) avg_sprice
  FROM (SELECT sku
          FROM trnsact
         GROUP BY sku
        HAVING COUNT(sku) > 100) AS t_c
  JOIN trnsact t
    ON t_c.sku = t.sku
  JOIN skuinfo s
    ON t_c.sku = s.sku
 GROUP BY t_c.sku,
          s.brand
 ORDER BY stddev_sprice DESC;
SKU BRAND STDDEV_SPRICE AVG_SPRICE
3733090 CABERNET 178.6 11.49
2762683 HART SCH 175.9 335.7
5453849 POLO FAS 170.5 286.7

Examine all the transactions for the sku with the greatest standard deviation in sprice, but only consider skus that are part of more than 100 transactions.

SELECT t.sku,
       t.store,
       t.stype,
       t.orgprice,
       t.amt
  FROM (SELECT TOP 1 t_c.sku,
               STDDEV_SAMP(t.sprice) stddev_sprice
          FROM (SELECT sku
                  FROM trnsact
                 GROUP BY sku
                HAVING COUNT(sku) > 100) AS t_c
          JOIN trnsact t
            ON t_c.sku = t.sku
          JOIN skuinfo s
            ON t_c.sku = s.sku
         GROUP BY t_c.sku
         ORDER BY STDDEV_SAMP(t.sprice) DESC) AS top_sku
 JOIN trnsact t
   ON t.sku = top_sku.sku
ORDER BY amt DESC;
SKU STORE STYPE ORGPRICE AMT
3733090 1007 R 6.00 5005.00
3733090 1007 P 6.00 5005.00
3733090 9804 R 6.00 6.40

The large standard deviation for this SKU is explained by a single mistaken purchase. A $6 item was mistakenly sold for $5005, and then immediately returned.

What was the average daily revenue Dillard’s brought in during each month of the year?

SELECT CASE EXTRACT(MONTH from t.saledate)
            WHEN 1 THEN 'January' WHEN 2 THEN 'February' WHEN 3 THEN 'March'
            WHEN 4 THEN 'April' WHEN 5 THEN 'May' WHEN 6 THEN 'June'
            WHEN 7 THEN 'July' WHEN 8 THEN 'August' WHEN 9 THEN 'September'
            WHEN 10 THEN 'October' WHEN 11 THEN 'November' WHEN 12 THEN 'December'
        END AS sale_month,
       SUM(t.amt) / COUNT(DISTINCT t.saledate) avg_daily_revenue
  FROM trnsact t
  JOIN (SELECT EXTRACT(YEAR from saledate) y,
               EXTRACT(MONTH from saledate) m,
               store s,
               CASE WHEN m < 10 THEN y||'0'||TRIM(m)
                    ELSE y||TRIM(m)
               END AS y_m
          FROM trnsact
         WHERE stype = 'P'
           AND y_m <> 200508
         GROUP BY y, m, s, y_m
        HAVING COUNT(DISTINCT saledate) >= 20) AS clean
    ON EXTRACT(YEAR from t.saledate) = clean.y
   AND EXTRACT(MONTH from t.saledate) = clean.m
   AND t.store = clean.s
 GROUP BY sale_month
 ORDER BY avg_daily_revenue DESC;
SALE_MONTH AVG_DAILY_REVENUE SALE_MONTH AVG_DAILY_REVENUE
December 12528288.13 June 7339800.24
February 8122385.46 November 6932939.04
July 8058432.50 October 6721241.89
April 7703249.70 January 6708819.58
May 7478860.06 August 6167751.73
March 7435026.42 September 6157985.19

Which department, in which city and state of what store, had the greatest % increase in average daily sales revenue from November to December?

SELECT ((d_sales/d_dates)-(n_sales/n_dates))/(n_sales/n_dates)*100 AS perc_inc,
       SUM(CASE EXTRACT(MONTH from t.saledate)
                WHEN 11 THEN t.amt
                ELSE 0
            END) AS n_sales,
       SUM(CASE EXTRACT(MONTH from t.saledate)
                WHEN 12 THEN t.amt
                ELSE 0
            END) AS d_sales,
       COUNT(DISTINCT (CASE EXTRACT(MONTH from t.saledate)
                            WHEN 11 THEN t.saledate
                            ELSE NULL
                        END)) AS n_dates,
       COUNT(DISTINCT (CASE EXTRACT(MONTH from t.saledate)
                            WHEN 12 THEN t.saledate
                            ELSE NULL
                        END)) AS d_dates,
       d_i.deptdesc d_i_deptdesc,
       t.store t_store,
       s_m.city s_m_city,
       s_m.state s_m_state
  FROM trnsact t
  JOIN (SELECT EXTRACT(YEAR from saledate) y,
               EXTRACT(MONTH from saledate) m,
               store s,
               CASE WHEN m < 10 THEN y||'0'||TRIM(m)
                    ELSE y||TRIM(m)
               END AS y_m
          FROM trnsact
         WHERE stype = 'P'
           AND y_m <> 200508
         GROUP BY y, m, s, y_m
        HAVING COUNT(DISTINCT saledate) >= 20) AS clean
    ON EXTRACT(YEAR from t.saledate) = clean.y
   AND EXTRACT(MONTH from t.saledate) = clean.m
   AND t.store = clean.s
  JOIN skuinfo s_i
    ON t.sku = s_i.sku
  JOIN store_msa s_m
    ON t.store = s_m.store
  JOIN deptinfo d_i
    ON s_i.dept = d_i.dept
 WHERE stype = 'P'
   AND EXTRACT(MONTH from t.saledate) IN (11, 12)
 GROUP BY d_i_deptdesc,
          t_store,
          s_m_city,
          s_m_state
HAVING n_sales <> 0
   AND d_sales <> 0
 ORDER BY perc_inc DESC;
PERC_INC N_SALES D_SALES N_DATES D_DATES DEPT STORE CITY STATE
4182.00 5.99 513.00 1 2 GOTTEX 404 PINE BLUFF AR
3550.00 10.00 365.00 1 1 ENVIRON 507 FORT WORTH TX
2751.00 2.99 170.50 1 2 CATALIN 502 ST PETERSBURG FL

What is the city and state of the store that had the greatest decrease in average daily revenue from August to September?

SELECT t.store t_store,
       s_m.city s_m_city,
       s_m.state s_m_state,
       SUM(CASE EXTRACT(MONTH from t.saledate)
                WHEN 8 THEN t.amt
                ELSE 0
            END) AS a_sales,
       SUM(CASE EXTRACT(MONTH from t.saledate)
                WHEN 9 THEN t.amt
                ELSE 0
            END) AS s_sales,
       COUNT(DISTINCT (CASE EXTRACT(MONTH from t.saledate)
                            WHEN 8 THEN t.saledate
                            ELSE NULL
                        END)) AS a_dates,
       COUNT(DISTINCT (CASE EXTRACT(MONTH from t.saledate)
                            WHEN 9 THEN t.saledate
                            ELSE NULL
                        END)) AS s_dates,
       ((a_sales/a_dates)-(s_sales/s_dates)) AS avg_daily_spend_dec
  FROM trnsact t
  JOIN (SELECT EXTRACT(YEAR from saledate) y,
               EXTRACT(MONTH from saledate) m,
               store s,
               CASE WHEN m < 10 THEN y||'0'||TRIM(m)
                    ELSE y||TRIM(m)
               END AS y_m
          FROM trnsact
         WHERE stype = 'P'
           AND y_m <> 200508
         GROUP BY y, m, s, y_m
        HAVING COUNT(DISTINCT saledate) >= 20) AS clean
    ON EXTRACT(YEAR from t.saledate) = clean.y
   AND EXTRACT(MONTH from t.saledate) = clean.m
   AND t.store = clean.s
  JOIN skuinfo s_i
    ON t.sku = s_i.sku
  JOIN store_msa s_m
    ON t.store = s_m.store
 WHERE stype = 'P'
   AND EXTRACT(MONTH from t.saledate) IN (8, 9)
 GROUP BY t_store,
          s_m_city,
          s_m_state
HAVING a_sales <> 0
   AND s_sales <> 0
 ORDER BY avg_daily_spend_dec DESC;
T_STORE S_M_CITY S_M_STATE A_SALES S_SALES A_DATES S_DATES SPEND_DEC
4003 WEST DES MOINES IA 591489.31 378020.92 31 30 6479.60
9103 LOUISVILLE KY 1385203.66 1183525.98 31 30 5233.12
2707 MCALLEN TX 1526547.16 1324019.74 31 30 5109.47

Determine the month of maximum total revenue for each store. Count the number of stores whose month of maximum total revenue was in each of the twelve months. Then determine the month of maximum average daily revenue. Count the number of stores whose month of maximum average daily revenue was in each of the twelve months. How do they compare?

SELECT mo,
       COUNT(*) top_sales_count
  FROM (SELECT t.store t_store,
               EXTRACT(MONTH from t.saledate) mo,
               SUM(t.amt) monthly_sales,
               ROW_NUMBER() OVER (PARTITION BY t_store
                                  ORDER BY monthly_sales DESC) ranking
          FROM trnsact t
          JOIN (SELECT EXTRACT(YEAR from saledate) y,
                       EXTRACT(MONTH from saledate) m,
                       store s,
                       CASE WHEN m < 10 THEN y||'0'||TRIM(m)
                            ELSE y||TRIM(m)
                       END AS y_m
                  FROM trnsact
                 WHERE stype = 'P'
                   AND y_m <> 200508
                 GROUP BY y, m, s, y_m
                HAVING COUNT(DISTINCT saledate) >= 20) AS clean
             ON EXTRACT(YEAR from t.saledate) = clean.y
            AND EXTRACT(MONTH from t.saledate) = clean.m
            AND t.store = clean.s
          WHERE t.stype = 'P'
          GROUP BY t_store,
                   mo) AS store_month_counts
 WHERE ranking=1
 GROUP BY mo
 ORDER BY TOP_SALES_COUNT DESC;

Maximum Total Revenue

MO 12 3 7 9
TOP_SALES_COUNT 321 3 3 1
SELECT mo,
       COUNT(*) top_sales_count
  FROM (SELECT t.store t_store,
               EXTRACT(MONTH from t.saledate) mo,
               SUM(t.amt)/COUNT(DISTINCT t.saledate) avg_daily_revenue,
               ROW_NUMBER() OVER (PARTITION BY t_store
                                  ORDER BY avg_daily_revenue DESC) ranking
          FROM trnsact t
          JOIN (SELECT EXTRACT(YEAR from saledate) y,
                       EXTRACT(MONTH from saledate) m,
                       store s,
                       CASE WHEN m < 10 THEN y||'0'||TRIM(m)
                            ELSE y||TRIM(m)
                       END AS y_m
                  FROM trnsact
                 WHERE stype = 'P'
                   AND y_m <> 200508
                 GROUP BY y, m, s, y_m
                HAVING COUNT(DISTINCT saledate) >= 20) AS clean
             ON EXTRACT(YEAR from t.saledate) = clean.y
            AND EXTRACT(MONTH from t.saledate) = clean.m
            AND t.store = clean.s
          WHERE t.stype = 'P'
          GROUP BY t_store,
                   mo) AS store_month_counts
 WHERE ranking=1
 GROUP BY mo
 ORDER BY TOP_SALES_COUNT DESC;

Maximum Average Daily Revenue

MO 12 3 7 2 9 5
TOP_SALES_COUNT 317 4 3 2 1 1

Content for this note is taken from the Coursera course “Managing Big Data with MySQL.”