SQL Subqueries

The tables queried this blog post are the publicly-available “tutorial.billboard_top_100_year_end” and “tutorial.sat_scores” tables that are available at modeanalytics.com.

Subquery Basics

Subqueries are queries nested within other queries. They are means of creating a new table, and then querying that table of results.

All subqueries are required to have aliases. In the following simple example, the alias is “sub.”

SELECT *
  FROM (
         SELECT *
           FROM tutorial.billboard_top_100_year_end
       ) sub

The order of operations of subqueries is as one would expect: the inner queries execute before outer queries. Once the inner query has run, the outer query will run using the result set generated by the inner query.

Note that most SQL editors allow users to highlight a portion of a large query, and run just that portion. This is convenient when working with inner queries, as the user can ensure the inner query executes as expected without muddying the waters by executing the outer query as well.

Subqueries can be used anywhere that a tablename, column, or individual value would be used.

WITH Command

The WITH command is a means of separating subqueries into their own “Common Table Expressions” (CTEs). In addition to improving legibility of the query structure, this also enables users to run time-consuming CTEs once, and then to iterate on other queries without running the complex queries over again.

The query above is refactored as a Common Table Expression in the following example.

WITH sub AS (
  SELECT *
    FROM tutorial.billboard_top_100_year_end
)

SELECT *
  FROM sub

It is also possible to chain multiple CTEs together, as shown below.

WITH sub1 AS (
  SELECT *
    FROM tutorial.billboard_top_100_year_end
),
sub2 AS (
  SELECT *
    FROM sub1
)

SELECT *
  FROM sub2

Example 1

The following is an example of a slightly more complex subquery. The subquery below queries the number of top 100 hits each artist has each year. The outer query then queries that result for the number of years each artist had at least one top 100 hit. The results gives some indication as to these artists’ career “staying power.” Only artists that have consistent, long-term mass appeal could rank high in these results.

SELECT artist,
       COUNT("Top 100 Hits") "Years with 1+ Song in Top 100"
  FROM (
          SELECT year,
                 artist,
                 COUNT(song_name) "Top 100 Hits"
            FROM tutorial.billboard_top_100_year_end
           WHERE artist IS NOT NULL
           GROUP BY artist, year
           ORDER BY "Top 100 Hits" DESC
        ) top_100_hits_per_year
  GROUP BY artist
  ORDER BY "Years with 1+ Song in Top 100" DESC

The results of the top_100_hits_per_year subquery are as follows:

year artist Top 100 Hits
1964 Beatles 9
2008 Lil Wayne 8
2008 T-Pain 8
2007 Justin Timberlake 7
2007 T-Pain 7

The results of the complete query are as follows:

artist Years with 1+ Song in Top 100
Mariah Carey 18
Madonna 18
Elton John 18
Stevie Wonder 16
Chicago 13

Example 2

T-Pain and Lil Wayne are outputs of the subquery in the previous example, and the likes of Mariah Carey and Elton John are part of the output of the complete query. It is difficult for me to consider T-Pain and Madonna in the same category when it comes to musical legacy.

This leads me to wonder: how many artists had a single productive year where they had at least one song in the top 100, but then never appeared in the top 100 again?

SELECT artist
  FROM (
          SELECT year,
                 artist,
                 COUNT(song_name) "Top 100 Hits"
            FROM tutorial.billboard_top_100_year_end
           WHERE artist IS NOT NULL
           GROUP BY artist, year
        ) top_100_hits_per_year
  GROUP BY artist
 HAVING COUNT("Top 100 Hits") = 1
  ORDER BY "Years with 1+ Song in Top 100"

This first five rows (out of 1678 total) of this query are the following:

Artists with a Single Year of Top 100 Hit(s)
10,000 Maniacs
100 Proof Aged In Soul
20 Fingers
2 Pistols
.38 Special

The inner query returns 4948 rows. Here’s a subset of the results. Note that ‘10cc’ and ‘112’ did not make it into the results of the outer query, as expected.

year artist Top 100 Hits
1994 10,000 Maniacs 1
1970 100 Proof Aged In Soul 1
1977 10cc 1
1975 10cc 1
1997 112 2
1996 112 1

This means that there are 4948 distinct artist/year combinations accounting for the top 100 hits between 1956 and 2013. This is out of a conceivable maximum of 5700, if every artist only had a single top 100 hit per year.

This implies that a large majority of artists only ever have a single song in the top 100 for any given year, which is somewhat surprising. To confirm this:

SELECT year,
       artist,
       COUNT(song_name) "Top 100 Hits"
  FROM tutorial.billboard_top_100_year_end
 WHERE artist IS NOT NULL
 GROUP BY artist, year
HAVING COUNT(song_name) = 1
 ORDER BY artist

The foregoing returns 3899 rows. The conclusion is that it is rare for an artist to have more than one song in the top 100 in any given year. The following subquery will verify this more conclusively.

Example 3

SELECT "Top 100 Hits",
       COUNT(artist) "Number of Artists with Indicated Number of Top 100 Hits in a Given Year"
  FROM (
          SELECT year,
                 artist,
                 COUNT(song_name) "Top 100 Hits in a Given Year"
            FROM tutorial.billboard_top_100_year_end
           WHERE artist IS NOT NULL
             AND song_name IS NOT NULL
           GROUP BY artist, year
           ORDER BY artist
        ) top_100_hits_per_year
  GROUP BY "Top 100 Hits in a Given Year"
  ORDER BY "Top 100 Hits in a Given Year"
Top 100 Hits in a Given Year Number of Artists with Indicated Number of Top 100 Hits in a Given Year
1 3899
2 724
3 225
4 62
5 22
6 7
7 2
8 2
9 1

As an example, consider the complete Top 100 hit count of 2Pac:

year artist song_name
1993 2Pac I Get Around
1994 2Pac Keep Ya Head Up
1995 2Pac Dear Mama / Old School
1996 2Pac How Do U Want It / California Love
1998 2Pac Do For Love
2003 2Pac Thugz Mansion

I would have expected many more top 100 hits from an artist with the enduring impact of 2Pac. It appears that highly-prolific single years are rare, just as are highly-prolific careers.

Example 4

Back to the initial question: what is the largest number of top 100 hits, for an artist that only had a single year with songs in the top 100? IE, which artist was the biggest “flash in the pan” of musical history?

WITH hit_count_per_year_per_artist AS (
  SELECT year,
         artist,
         COUNT(song_name) "Top 100 Hits"
    FROM tutorial.billboard_top_100_year_end
   WHERE artist IS NOT NULL
   GROUP BY artist, year
   ORDER BY year
),
years_w_hits_per_artist AS (
  SELECT artist,
         COUNT(year) "Years with Hits"
    FROM hit_count_per_year_per_artist
   GROUP BY artist
)

SELECT h.artist,
       y."Years with Hits",
       h.year,
       h."Top 100 Hits"
  FROM years_w_hits_per_artist y
  JOIN hit_count_per_year_per_artist h
    ON y.artist = h.artist
   AND "Years with Hits" = 1
 ORDER BY "Top 100 Hits" DESC

The results are not as exciting as I might have hoped, but 10 artists had a single year with three top 100 hits and no further songs in the top 100. Note that the database has data through 2013. It is likely that the artists with 2013 hits have had top 100 hits since 2013.

artist Years with Hits year Top 100 Hits
Teresa Brewer 1 1956 3
Harry Belafonte 1 1957 3
Buckinghams 1 1967 3
Jim Stafford 1 1974 3
C+C Music Factory 1 1991 3
Cathy Dennis 1 1991 3
CeCe Peniston 1 1992 3
Future 1 2013 3
Macklemore 1 2013 3
Ryan Lewis 1 2013 3

Still Going…

SELECT artist,
       count(song_name) "Total Top 100 Hits"
  FROM tutorial.billboard_top_100_year_end
 GROUP BY artist
 ORDER BY "Total Top 100 Hits" DESC
artist Total Top 100 Hits
Elvis Presley 36
Madonna 36
Mariah Carey 33
Rihanna 33
Elton John 28


A final way to analyze this data is determine which artists had the highest ranked songs. An even more involved composite measure might take into account the number of years during which the artist had songs in the top 100, in addition to the actual ranking of the songs.

For this simple analysis, assign “song_scores” between 1 and 100 for each song. Higher is better. This can be accomplished by subtracting the year_rank from 101. The results are shown below, with the (lengthy) query that generated it shown below.

artist Total Top 100 Hits Years with Top 100 Hit(s) Total Song Score
Elvis Presley 36 13 2247
Mariah Carey 33 18 2205
Rihanna 33 9 2002
Madonna 36 18 1888
Usher 25 11 1677
Elton John 28 18 1578
Beatles 27 8 1572
Whitney Houston 21 10 1432
Michael Jackson 26 12 1430
Janet Jackson 23 10 1415
WITH total_hits AS (
    SELECT artist,
           COUNT(song_name) "Total Top 100 Hits"
      FROM tutorial.billboard_top_100_year_end
     GROUP BY artist
     ORDER BY "Total Top 100 Hits" DESC
),
unique_year_artists AS (
    SELECT artist,
           COUNT(year) "Years with Top 100 Hit(s)"
      FROM (
            SELECT DISTINCT year,
                   artist
              FROM tutorial.billboard_top_100_year_end
             WHERE artist IS NOT NULL
             ORDER BY year
           ) sub
     GROUP BY artist
     ORDER BY COUNT(year) DESC
),
song_scores AS (
    SELECT artist,
           SUM(101-year_rank) "Total Song Score"
      FROM tutorial.billboard_top_100_year_end
     GROUP BY artist
)

SELECT s.artist,
       t."Total Top 100 Hits",
       u."Years with Top 100 Hit(s)",
       s."Total Song Score"
  FROM song_scores s
  JOIN total_hits t
    ON s.artist = t.artist
  JOIN unique_year_artists u
    ON s.artist = u.artist
 ORDER BY "Total Song Score" DESC


Another brief analysis examines which artists with 10 or more appearances in the top 100 database had the highest average ranking.

SELECT artist,
       AVG(year_rank) "Average Hit Ranking",
       COUNT(year_rank) "Total Hits"
  FROM tutorial.billboard_top_100_year_end
 WHERE artist IS NOT NULL
 GROUP BY artist
HAVING COUNT(year_rank) > 10
 ORDER BY "Average Hit Ranking"

The resulting cohort of megahit artists is quite different from the groups seen thus far.

artist Average Hit Ranking Total Hits
Bruno Mars 20 12
Katy Perry 23.214285714285715 14
Maroon 5 30.5 12
Celine Dion 30.636363636363637 11
Lady Gaga 30.75 12
50 Cent 30.857142857142858 14
Bee Gees 32.15384615384615 13
Alicia Keys 32.5 12
Whitney Houston 32.80952380952381 21
Boyz II Men 33.357142857142854 14


Finally, which artists had the most appearances in the top 10?

SELECT artist,
       COUNT(year_rank) "Total Top 10 Hits"
  FROM tutorial.billboard_top_100_year_end
 WHERE year_rank <= 10
 GROUP BY artist
 ORDER BY "Total Top 10 Hits" DESC

Familiar names from previous analyses.

artist Total Top 10 Hits
Elvis Presley 9
Mariah Carey 7
Usher 6
Elton John 6
Beatles 5

Content for this note is taken from information I learned while pursuing the Udacity Data Analyst Nanodegree. Udacity is an online tech education service that offers both free and paid, tech-focused training. Learn more.