Basics

Statements and Queries

The function of a “SQL statement” is to communicate to the database what you’d like to do with the data. Types of statements include:

  • CREATE - Creates a new table in the database.
  • DROP TABLE - Removes a table from the database.
  • SELECT - Allows you to read data and display it. Select statements are commonly referred to as queries.

There are a few other types of statements related to the creation and removal of data. Most analytical work involves reading and manipulating data (using SELECT) without altering the underlying source.

Query Formatting

SQL is generally not case sensitive, but common formatting convention is to capitalize keywords (IE, SELECT and FROM) while leaving the names of columns and tables lowercase (account). This aids in the legibility of commands, which only becomes more important as the queries become more complex.

Column names should use underscores instead of spaces (web_events, not "web events"). In Postgres, if there is a space in a column or table name, that name needs to be referred to with double quotes around them.

SELECT account
  FROM web_events;

Different SQL environments may or may not require a semicolon at the end of the query. Best practice is to include it whether or not it is required.

SELECT & FROM

Every query contains the SELECT and FROM keywords. The SELECT keyword is always followed by the columns the user wants returned. The FROM keyword is always is always followed by the table from which those columns should be selected.

Unlike many other types of programming languages, the order of SQL statements cannot vary. SELECT is always followed by FROM, and not the other way around, as shown in the following examples. The order of keywords will likewise be critical for the other keywords that will be introduced in this post.

SELECT flight_date,
       origin,
       dest,
       actual_elapsed_time
  FROM tutorial.us_flights
flight_date origin dest actual_elapsed_time
2015-01-02 00:00:00 JFK LAX 381
2015-01-03 00:00:00 JFK LAX 358
2015-01-04 00:00:00 JFK LAX 385
2015-01-05 00:00:00 JFK LAX 389
2015-01-06 00:00:00 JFK LAX 424

LIMIT

The keyword LIMIT places an upper limit on the number of rows returned by a given query. LIMIT is always the very last keyword included in a query.

The primary usage of this command is to speed up execution of queries that would otherwise take a very long time, due to there being millions of rows. It would be inconvenient to wait several minutes for a query to return especially if the programmer simply wants to quickly get an idea of what the fields contain, for example.

An * (asterisk) is used here to select all available columns. In SQL, * means ‘all.’

SELECT *
  FROM tutorial.city_populations
 LIMIT 3
city state population_estimate_2012 id
New York NY 8336697 1
Los Angeles CA 3857799 2
Chicago IL 2714856 3

ORDER BY

The keyword ORDER BY specifies which field the results of the query will be ordered by.

By default, the query returns results from A to Z, lowest to highest, and earliest to latest. This is called “Ascending Order.” In order to reverse this ordering, add the keyword DESC following the column name.

SELECT school,
       student_id,
       sat_writing,
       sat_verbal,
       sat_math
  FROM tutorial.sat_scores
 ORDER BY sat_math DESC
school student_id sat_writing sat_verbal sat_math
St. John’s 118 578 492 797
Washington HS 21 624 331 796
Petersville HS 40 513 614 780
Washington HS 6 213 500 771
St. John’s 87 697 798 767

It is also possible to ORDER BY multiple fields. This will result in a query being ordered by the first field, first.

As an example, the following query orders by flight_date, then by actual_elapsed_time from smallest to largest. The latter ordering will be “nested within” the former.

SELECT flight_date,
       origin,
       dest,
       actual_elapsed_time
  FROM tutorial.us_flights
 ORDER BY flight_date,
       actual_elapsed_time
flight_date origin dest actual_elapsed_time
2015-01-02 00:00:00 WRG PSG 24
2015-01-02 00:00:00 PSG WRG 27
2015-01-02 00:00:00 DEN COS 28
2015-01-02 00:00:00 MEI PIB 29
2015-01-02 00:00:00 DEN COS 30

WHERE

A WHERE clause allows you to filter a set of results on the basis of a given criteria. This functionality is similar to Excel’s filter capability.

The following would produce results where all the values in the origin column are equal to ‘LAX.’ In SQL, each row is one datapoint or observation. If a given row has an origin other than ‘LAX,’ it will be excluded in its entirety from the query results.

Whenever using WHERE with non-numeric data, the non-numeric data is placed in single-quotes, as shown below.

SELECT flight_date,
       origin,
       dest,
       actual_elapsed_time
  FROM tutorial.us_flights
 WHERE origin = 'LAX'
 ORDER BY actual_elapsed_time
flight_date origin dest actual_elapsed_time
2015-01-06 00:00:00 LAX SBA 32
2015-01-14 00:00:00 LAX SBA 34
2015-01-07 00:00:00 LAX CLD 35
2015-01-04 00:00:00 LAX SBA 36
2015-01-05 00:00:00 LAX SBA 36

Comparison operators such as <, >, and the not equal symbol can also be used with WHERE.

SELECT *
  FROM tutorial.city_populations
 WHERE population_estimate_2012 < 500000
city state population_estimate_2012 id
Sacramento CA 475516 35
Long Beach CA 467892 36
Kansas City MI 464310 37
Mesa AZ 452084 38
Virginia Beach VA 447021 39

SELECT *
  FROM tutorial.city_populations
 WHERE state = 'CA'
city state population_estimate_2012 id
Los Angeles CA 3857799 2
San Diego CA 1338348 8
San Jose CA 982765 10
San Francisco CA 825863 14
Fresno CA 505882 34

Derived Columns

Derived columns are new columns the results of which come from a manipulation of the existing columns in the database.

The following code snippet demonstrates how it is possible to create new columns from the contents of other columns. In this case, the new column is named ‘sat_total’ using the AS keyword.

SELECT school,
       student_id,
       sat_writing,
       sat_verbal,
       sat_math,
       sat_writing + sat_verbal + sat_math AS sat_total
  FROM tutorial.sat_scores
 ORDER BY sat_total DESC
school student_id sat_writing sat_verbal sat_math sat_total
St. John’s 87 697 798 767 2262
Petersville HS 38 779 656 724 2159
St. John’s 125 741 644 702 2087
St. John’s 132 619 742 718 2079
Petersville HS 32 635 743 695 2073

LIKE

LIKE is a keyword used with the WHERE keyword. It enables more flexible searching. The following example returns rows with the word “Life” in the song_name, where “Life” can be followed by any number of characters. This is useful whenever you have lots of similar but slightly different values in a given column.

LIKE requires the use of wildcards. ‘%’ is a wild card that represents a character or any number of characters.

SELECT *
  FROM tutorial.billboard_top_100_year_end
 WHERE song_name LIKE 'Life%'
year year_rank group artist song_name id
1978 87 Joe Walsh Joe Walsh Life’s Been Good 2325
1986 78 Dream Academy Dream Academy Life In A Northern Town 3161
1992 18 Tom Cochrane Tom Cochrane Life Is A Highway 3718
2006 76 Rascal Flatts Rascal Flatts Life is a Highway 5461
2010 96 Daughtry Daughtry Life After You 6051

IN

IN is a keyword that, similar to LIKE, is used in combination with WHERE. The purpose of IN is to allow you to filter data based on several possible values. For example, the following allows you to search for cities in ‘CA,’ ‘TX,’ or ‘NY.’

Any non-numeric data must be placed in single-quotes, as shown, as with comparison operators. Place a comma in between each pair of distinct values.

SELECT *
  FROM tutorial.city_populations
 WHERE state IN ('CA', 'TX', 'NY')
city state population_estimate_2012 id
New York NY 8336697 1
Los Angeles CA 3857799 2
Houston TX 2160821 4
San Antonio TX 1382951 7
San Diego CA 1338348 8

SELECT flight_date,
       origin,
       dest,
       distance,
       actual_elapsed_time
  FROM tutorial.us_flights
 WHERE actual_elapsed_time IN (30, 45)
flight_date origin dest distance actual_elapsed_time
2015-01-06 00:00:00 DFW OKC 175 45
2015-01-06 00:00:00 WRG KTN 82 30
2015-01-06 00:00:00 WRG PSG 31 30
2015-01-03 00:00:00 YAK CDV 213 45
2015-01-04 00:00:00 CDV ANC 160 45

NOT

NOT provides the inverse results for IN, LIKE, and similar operators.

SELECT *
  FROM tutorial.billboard_top_100_year_end
 WHERE song_name NOT LIKE 'The%'
year year_rank group artist song_name id
1956 1 Elvis Presley Elvis Presley Heartbreak Hotel 1
1956 2 Elvis Presley Elvis Presley Don’t Be Cruel 2
1956 3 Nelson Riddle Nelson Riddle Lisbon Antigua 3
1956 4 Platters Platters My Prayer 4
1956 7 Doris Day Doris Day Whatever Will Be Will Be (Que Sera Sera) 7

SELECT *
  FROM tutorial.billboard_top_100_year_end
 WHERE artist NOT IN ('Elvis Presley', 'Nelson Riddle')
year year_rank group artist song_name id
1956 4 Platters Platters My Prayer 4
1956 5 Gogi Grant Gogi Grant The Wayward Wind 5
1956 6 Les Baxter Les Baxter The Poor People Of Paris 6
1956 7 Doris Day Doris Day Whatever Will Be Will Be (Que Sera Sera) 7
1956 9 Dean Martin Dean Martin Memories Are Made Of This 9

AND & BETWEEN

The following example filters for populations less than 500,000 and greater than 450,000. The AND operator allows you to run two complete logical statements; the word ‘population_estimate_2012’ could not be omitted from the second half of the WHERE clause shown below.

SELECT *
  FROM tutorial.city_populations
 WHERE population_estimate_2012 <= 500000
   AND population_estimate_2012 >= 450000

The keyword BETWEEN provides an alternative way of expressing the identical query, above.

SELECT *
  FROM tutorial.city_populations
 WHERE population_estimate_2012 BETWEEN 450000 AND 500000
city state population_estimate_2012 id
Sacramento CA 475516 35
Long Beach CA 467892 36
Kansas City MI 464310 37
Mesa AZ 452084 38

The commands can be chained, as shown in the examples below.

SELECT school,
       student_id,
       sat_writing,
       sat_verbal,
       sat_math
  FROM tutorial.sat_scores
 WHERE sat_math BETWEEN 700 AND 750
   AND sat_verbal < 300
school student_id sat_writing sat_verbal sat_math
Washington HS 25 502 291 716
Petersville HS 84 747 290 729
St. John’s 91 394 214 735

SELECT *
  FROM tutorial.billboard_top_100_year_end
 WHERE artist = 'Elvis Presley'
   AND year_rank BETWEEN 1 AND 5
year year_rank group artist song_name id
1956 1 Elvis Presley Elvis Presley Heartbreak Hotel 1
1956 2 Elvis Presley Elvis Presley Don’t Be Cruel 2
1957 1 Elvis Presley Elvis Presley All Shook Up 104
1958 3 Elvis Presley Elvis Presley Don’t / I Beg Of You 206

OR

OR performs the expected functionality, similar to AND.

SELECT flight_date,
       origin,
       dest,
       distance,
       actual_elapsed_time
  FROM tutorial.us_flights
 WHERE origin = 'LAX' OR origin = 'SFO'
 ORDER BY actual_elapsed_time
flight_date origin dest distance actual_elapsed_time
2015-01-06 00:00:00 LAX SBA 89 32
2015-01-11 00:00:00 SFO MRY 77 33
2015-01-14 00:00:00 LAX SBA 89 34
2015-01-07 00:00:00 LAX CLD 86 35
2015-01-06 00:00:00 SFO SMF 86 35

AND and OR can also be combined to form more complex queries. Parentheses can be used to group statements as appropriate. See the following code example.

SELECT school,
       student_id,
       sat_writing,
       sat_verbal,
       sat_math
  FROM tutorial.sat_scores
 WHERE sat_math BETWEEN 775 AND 800
    OR sat_verbal BETWEEN 775 AND 800
    OR sat_writing BETWEEN 775 AND 800
school student_id sat_writing sat_verbal sat_math
Washington HS 2 401 791 248
Washington HS 19 799 406 247
Washington HS 21 624 331 796
Washington HS 27 661 782 393
Petersville HS 35 304 780 516

HAVING

HAVING is used in conjunction with GROUP BY to filter group rows that do not satisfy a certain condition.

HAVING is analogous to WHERE, except it is used with GROUP BY.

  • The HAVING clause sets the condition for groups of rows created by the GROUP BY clause after the GROUP BY clause applies.
  • The WHERE clause sets the condition for individual rows before the GROUP BY clause applies.
SELECT column_1,
       aggregate_function(column_2)
  FROM table_name
 GROUP BY column_1
HAVING condition;

SELECT customer_id, SUM(amount)
  FROM payment
 GROUP BY customer_id
HAVING SUM(amount) > 190;
customer_id count
144 40
526 42
148 45

Example using WHERE and HAVING.

SELECT rating,
       AVG(rental_rate)
  FROM film
 WHERE rating IN ('R', 'G', 'PG')
 GROUP BY rating
HAVING AVG(rental_rate) < 3;
rating avg
NC-17 5.142
PG 5.082
PG-13 5.053