Queries to Extract Data from Single Tables

Meet The Data Sets

Querying Responsibly

The Dognition data set has over 1M rows. The Dilliard data set has over 100M rows.

With sizable data like this, the way queries are written will impact the speed with which coworkers can access data. Best practice are to limit this impact as much as possible. Two ways of failing to do this include:

  • querying all the data in a large table, and
  • combining tables that results in output with many more rows than expected, due to many-to-many relationships or duplicate row values.

The best way to avoid these problems is to know the datasets well.

Using the Jupyter Interface for MySQL Queries

Intro to Query Syntax

This course focuses on the DQL, or Data Query Language, part of SQL, because data analysts generally do not change stored data. That part of the language allows users to retrieve data in whatever format is desired.

Query: SQL code that describes the desired data and output format

Front end interface: where users enter queries. These can vary widely from company to company, and platform to platform.

Six main SQL keywords: SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY. Only SELECT and FROM are always required. Best practice is to end queries with a ;, even if the type of SQL being used doesn’t require it. Capitalize keywords, put each clause on a separate line. Functions (SUM, COUNT) should be capitalized as well. Table names and column names should not be capitalized.

SQL code is unique in that it does not describe how the system is to obtain the data. Rather, it describes what the user wants, how the database management system obtains the data remains unspecified.

Dognition data is stored in a MySQL database that will be interacted with using Jupyter. Dillard’s data is stored in a Teradata database that will be interacted with from an interface called Teradata Viewpoint Scratchpad.

Using SQL within Jupyter Notebooks

First, load the SQL library.

%load_ext sql

Next, connect to the database you need to use. To connect to the database used in this course, the following line is used. %sql is used to run single lines of SQL, %%sql is used to run multiple lines of SQL.

%sql mysql://studentuser:studentpw@mysqlserver/dognitiondb

The following line makes the previous database the default database.

%sql USE dognitiondb

Now, the system is set up to query the appropriate database. The remainder of the notes will not use %sql and %%sql.

SQL Queries for Looking at Your Data

The first two steps with any new database are to:

  1. determine how many tables it has and
  2. determine what fields each of those tables have.

The following command will return a list of the tables in the database.

SHOW tables;
Tables_in_dognitiondb
complete_tests
dogs
exam_answers
reviews
site_activities
users

The following commands return the same results: a given table’s list of the columns and their attributes.

SHOW columns
FROM reviews;
DESCRIBE reviews;
Field Type Null Key Default Extra
rating int(11) YES   None  
created_at datetime NO   None  
updated_at datetime NO   None  
user_guid varchar(60) YES MUL None  
dog_guid varchar(60) YES MUL None  
subcategory_name varchar(60) YES   None  
test_name varchar(60) YES   None  

The type column details information about the datatype. There are three main types of data in MySQL: text, number, and datetime. For a breakdown of the subtypes within each of those types, review this link.

The key column is interpreted as follows:

  • Empty: the column either is not indexed or is indexed only as a secondary column in a multiple-column, nonunique index.
  • PRI: the column is a PRIMARY KEY or is one of the columns in a multiple-column PRIMARY KEY.
  • UNI: the column is the first column of a UNIQUE index.
  • MUL: the column is the first column of a nonunique index in which multiple occurrences of a given value are permitted within the column.

LIMIT and OFFSET

LIMIT and OFFSET are used in MySQL to restrict returned data to certain parts of the output. The two examples below produce identical results. They limit the result set to 10 rows, offset five rows from the top. Note the numbers in the limit clause reverse places when the OFFEST keyword is omitted.

SELECT breed
  FROM dogs
 LIMIT 10 OFFSET 5;

SELECT breed
  FROM dogs
 LIMIT 5, 10;

The LIMIT command is an example of syntax that can vary across database platforms. MySQL uses LIMIT 3, but Teradata uses a TOP 3 statement within the SELECT, and Oracle uses a WHERE ROWNUM <= 3. This is an area where looking up the specific syntax for a given platform is required.

Setting Criteria with WHERE

SELECT dog_guid, weight
  FROM dogs
 WHERE weight BETWEEN 10 AND 50;

SELECT dog_guid, breed
  FROM dogs
 WHERE breed IN ('golden retriever', 'poodle');

SELECT dog_guid, breed
  FROM dogs
 WHERE breed LIKE ("s%");

SELECT user_guid
  FROM users
 WHERE free_start_user IS NULL;

WHERE with Datetime Data

Time-related data is more complicated to work with than other types of data due to specific formatting requirements.

DATE - format YYYY-MM-DD
DATETIME - format: YYYY-MM-DD HH:MI:SS
TIMESTAMP - format: YYYY-MM-DD HH:MI:SS
YEAR - format YYYY or YY

SELECT dog_guid, created_at
  FROM complete_tests
 WHERE DAYNAME(created_at)="Tuesday";

SELECT dog_guid, created_at
  FROM complete_tests
 WHERE DAY(created_at) > 15;

SELECT dog_guid, created_at
  FROM complete_tests
 WHERE created_at > '2014-02-04';

SELECT dog_guid, test_name, subcategory_name
  FROM complete_tests
 WHERE YEAR(created_at) = 2014
   AND MONTH(created_at) = 10;

Formatting Results

Aliases Using AS

The keyword AS is used to assign aliases to tables and columns. If spaces are used in a name, then the name must be enclosed in single or double quotation marks in MySQL. Other systems may only accept single quotation marks. If SQL keywords are used in an alias, the alias must be enclosed in backticks.

SELECT start_time AS "exam start time"
  FROM exam_answers
 LIMIT 5;

Using DISTINCT

DISTINCT removes duplicate entries in the column(s) modified with the DISTINCT keyword. If the column contains NULL values, then MySQL will include one NULL value in the output.

SELECT DISTINCT breed
  FROM dogs;

If DISTINCT is used with multiple columns, the combination of columns is used to determine the uniqueness of a row in the results. The next query returns 3390 rows, while the one after it returns 3999. This is because there are multiple cities called “Abilene” in multiple states, for example.

SELECT DISTINCT city
  FROM users;

SELECT DISTINCT city, state
  FROM users;

When using DISTINCT in combination with LIMIT, MySQL will stop searching when it finds the number of unique rows in the LIMIT clause. It does not return the unique rows within that number of total rows.

Using ORDER BY

ORDER BY orders the result set by a particular field. It can also sort by multiple fields. In order to do this, include the fields in the order you want them to be used for sorting. This example sorts the users by alphabetical state first, and membership type second.

SELECT DISTINCT state, membership_type
  FROM users
 WHERE country='US'
 ORDER BY state, membership_type
 LIMIT 5;
state membership_type
AE 1
AE 2
AE 3
AK 1
AK 2

Exporting Results to File

The method for exporting query results is generally a property of the SQL interface. It therefore varies with the system. There are two options with the Jupyter interface.

  1. Select and copy the output from the output window. This only works for limited data, because only 1000 rows can be pasted at a time.
  2. MySQL can put the query results into a variable. Then, the data in a variable can be formatted to CSV using Python code, and downloaded.

To instruct MySQL to put the results of a query into a variable, use the following syntax:

variable_name_of_your_choice = %sql [your full query goes here];

A more specific example follows.

breed_list = %sql SELECT DISTINCT breed FROM dogs ORDER BY breed;

Next, execute the following line, which formats the variable as a CSV file.

the_output_name_you_want.csv('the_output_name_you_want.csv')

More specifically:

breed_list.csv('breed_list.csv')

Jupyter then provides a link to download the text file.

The breed data in this dataset includes dashes that should not be there. There are two ways of fixing this. The latter retains the appropriate formatting for breeds that are supposed to include hyphens, for example, “Affenpinscher-Afghan Hound Mix.”

SELECT DISTINCT breed,
REPLACE(breed,'-','') AS breed_fixed
FROM dogs
ORDER BY breed_fixed

SELECT DISTINCT breed, TRIM(LEADING '-' FROM breed) AS breed_fixed
FROM dogs
ORDER BY breed_fixed
breed breed_fixed
Affenpinscher Affenpinscher
Affenpinscher-Afghan Hound Mix Affenpinscher-Afghan Hound Mix
Affenpinscher-Airedale Terrier Mix Affenpinscher-Airedale Terrier Mix
Affenpinscher-Alaskan Malamute Mix Affenpinscher-Alaskan Malamute Mix
Affenpinscher-American English Coonhound Mix Affenpinscher-American English Coonhound Mix

In how many columns of the STRINFO table of the Dillard’s database are NULL values permitted?

HELP TABLE strinfo;

In how many columns of the STRINFO table of the Dillard’s database are NULL values present?

SELECT COUNT(*)
FROM strinfo
WHERE city IS NULL
   OR state IS NULL
   OR zip IS NULL;

What was the highest original price in the Dillard’s database of the item with SKU 3631365?

SELECT TOP 5 sku, store, orgprice
FROM trnsact
WHERE sku = 3631365
ORDER BY orgprice DESC;

What is the color of the Liz Claiborne brand item with the highest SKU # in the Dillard’s database (the Liz Claiborne brand is abbreviated “LIZ CLAI” in the Dillard’s database)?

SELECT TOP 5 sku, color, brand
FROM skuinfo
WHERE brand = 'LIZ CLAI'
ORDER BY sku DESC;

What is the sku number of the item in the Dillard’s database that had the highest original sales price?

SELECT TOP 5 sku, orgprice
FROM trnsact
WHERE sku IN (4420360,7298484,6200173,4628597)
ORDER BY orgprice DESC;

According to the strinfo table, in how many states within the United States are Dillard’s stores located?

SELECT COUNT(DISTINCT state)
FROM strinfo;

How many Dillard’s departments start with the letter “e”?

SELECT *
FROM deptinfo
WHERE deptdesc LIKE 'E%';

What was the date of the earliest sale in the database where the sale price of the item did not equal the original price of the item, and what was the largest margin (original price minus sale price) of an item sold on that earliest date?

SELECT TOP 5 saledate, orgprice, sprice, (orgprice - sprice) AS margin
FROM trnsact
WHERE margin > 0
ORDER BY saledate, margin DESC;

What register number made the sale with the highest original price and highest sale price between the dates of August 1, 2004 and August 10, 2004? Make sure to sort by original price first and sale price second.

SELECT TOP 5 saledate, register, orgprice, sprice
FROM trnsact
WHERE saledate BETWEEN '2004-08-01' AND '2004-08-10'
ORDER BY orgprice DESC, sprice DESC;

Which of the following brand names with the word/letters “liz” in them exist in the Dillard’s database? Select all that apply. Note that you will need more than a single correct selection to answer the question correctly.

SELECT DISTINCT brand
FROM skuinfo
WHERE brand LIKE '%liz%'
ORDER BY brand;

What is the lowest store number of all the stores in the STORE_MSA table that are in the city of “little rock”,”memphis”, or “tulsa”?

SELECT TOP 5 *
FROM store_msa
WHERE city IN ('Little Rock','Memphis','Tulsa')
ORDER BY store;

Some content in this note are taken from the Jupyter Notebooks enumerated below. They are accessible as part of the Managing Big Data with MySQL course on coursera.org, and licensed by Jana Schaich Borg under CC BY-NC 4.0.

  • MySQL Exercise 01 Looking at Your Data
  • MySQL Exercise 02 Selecting Data Subsets using WHERE
  • MySQL Exercise 03 Formatting Selected Data

Other content for this note is taken from the Coursera course “Managing Big Data with MySQL.”