Meet The Data Sets
The Dognition data set has over 1M rows. The Dilliard data set has over 100M rows.
With sizable data like this, the way queries are written will impact the speed with which coworkers can access data. Best practice are to limit this impact as much as possible. Two ways of failing to do this include:
- querying all the data in a large table, and
- combining tables that results in output with many more rows than expected, due to many-to-many relationships or duplicate row values.
The best way to avoid these problems is to know the datasets well.
Using the Jupyter Interface for MySQL Queries
Intro to Query Syntax
This course focuses on the DQL, or Data Query Language, part of SQL, because data analysts generally do not change stored data. That part of the language allows users to retrieve data in whatever format is desired.
Query: SQL code that describes the desired data and output format
Front end interface: where users enter queries. These can vary widely from company to company, and platform to platform.
Six main SQL keywords:
ORDER BY. Only
FROM are always required. Best practice is to end queries with a
;, even if the type of SQL being used doesn’t require it. Capitalize keywords, put each clause on a separate line. Functions (
COUNT) should be capitalized as well. Table names and column names should not be capitalized.
SQL code is unique in that it does not describe how the system is to obtain the data. Rather, it describes what the user wants, how the database management system obtains the data remains unspecified.
Dognition data is stored in a MySQL database that will be interacted with using Jupyter. Dillard’s data is stored in a Teradata database that will be interacted with from an interface called Teradata Viewpoint Scratchpad.
Using SQL within Jupyter Notebooks
First, load the SQL library.
Next, connect to the database you need to use. To connect to the database used in this course, the following line is used.
%sql is used to run single lines of SQL,
%%sql is used to run multiple lines of SQL.
The following line makes the previous database the default database.
Now, the system is set up to query the appropriate database. The remainder of the notes will not use
SQL Queries for Looking at Your Data
The first two steps with any new database are to:
- determine how many tables it has and
- determine what fields each of those tables have.
The following command will return a list of the tables in the database.
The following commands return the same results: a given table’s list of the columns and their attributes.
type column details information about the datatype. There are three main types of data in MySQL: text, number, and datetime. For a breakdown of the subtypes within each of those types, review this link.
key column is interpreted as follows:
- Empty: the column either is not indexed or is indexed only as a secondary column in a multiple-column, nonunique index.
- PRI: the column is a PRIMARY KEY or is one of the columns in a multiple-column PRIMARY KEY.
- UNI: the column is the first column of a UNIQUE index.
- MUL: the column is the first column of a nonunique index in which multiple occurrences of a given value are permitted within the column.
LIMIT and OFFSET
OFFSET are used in MySQL to restrict returned data to certain parts of the output. The two examples below produce identical results. They limit the result set to 10 rows, offset five rows from the top. Note the numbers in the limit clause reverse places when the
OFFEST keyword is omitted.
LIMIT command is an example of syntax that can vary across database platforms. MySQL uses
LIMIT 3, but Teradata uses a
TOP 3 statement within the
SELECT, and Oracle uses a
WHERE ROWNUM <= 3. This is an area where looking up the specific syntax for a given platform is required.
Setting Criteria with WHERE
WHERE with Datetime Data
Time-related data is more complicated to work with than other types of data due to specific formatting requirements.
DATE - format YYYY-MM-DD
DATETIME - format: YYYY-MM-DD HH:MI:SS
TIMESTAMP - format: YYYY-MM-DD HH:MI:SS
YEAR - format YYYY or YY
Aliases Using AS
AS is used to assign aliases to tables and columns. If spaces are used in a name, then the name must be enclosed in single or double quotation marks in MySQL. Other systems may only accept single quotation marks. If SQL keywords are used in an alias, the alias must be enclosed in backticks.
DISTINCT removes duplicate entries in the column(s) modified with the
DISTINCT keyword. If the column contains
NULL values, then MySQL will include one
NULL value in the output.
DISTINCT is used with multiple columns, the combination of columns is used to determine the uniqueness of a row in the results. The next query returns 3390 rows, while the one after it returns 3999. This is because there are multiple cities called “Abilene” in multiple states, for example.
DISTINCT in combination with
LIMIT, MySQL will stop searching when it finds the number of unique rows in the
LIMIT clause. It does not return the unique rows within that number of total rows.
Using ORDER BY
ORDER BY orders the result set by a particular field. It can also sort by multiple fields. In order to do this, include the fields in the order you want them to be used for sorting. This example sorts the users by alphabetical state first, and membership type second.
Exporting Results to File
The method for exporting query results is generally a property of the SQL interface. It therefore varies with the system. There are two options with the Jupyter interface.
- Select and copy the output from the output window. This only works for limited data, because only 1000 rows can be pasted at a time.
- MySQL can put the query results into a variable. Then, the data in a variable can be formatted to CSV using Python code, and downloaded.
To instruct MySQL to put the results of a query into a variable, use the following syntax:
A more specific example follows.
Next, execute the following line, which formats the variable as a CSV file.
Jupyter then provides a link to download the text file.
The breed data in this dataset includes dashes that should not be there. There are two ways of fixing this. The latter retains the appropriate formatting for breeds that are supposed to include hyphens, for example, “Affenpinscher-Afghan Hound Mix.”
|Affenpinscher-Afghan Hound Mix||Affenpinscher-Afghan Hound Mix|
|Affenpinscher-Airedale Terrier Mix||Affenpinscher-Airedale Terrier Mix|
|Affenpinscher-Alaskan Malamute Mix||Affenpinscher-Alaskan Malamute Mix|
|Affenpinscher-American English Coonhound Mix||Affenpinscher-American English Coonhound Mix|
In how many columns of the STRINFO table of the Dillard’s database are NULL values permitted?
In how many columns of the STRINFO table of the Dillard’s database are NULL values present?
What was the highest original price in the Dillard’s database of the item with SKU 3631365?
What is the color of the Liz Claiborne brand item with the highest SKU # in the Dillard’s database (the Liz Claiborne brand is abbreviated “LIZ CLAI” in the Dillard’s database)?
What is the sku number of the item in the Dillard’s database that had the highest original sales price?
According to the strinfo table, in how many states within the United States are Dillard’s stores located?
How many Dillard’s departments start with the letter “e”?
What was the date of the earliest sale in the database where the sale price of the item did not equal the original price of the item, and what was the largest margin (original price minus sale price) of an item sold on that earliest date?
What register number made the sale with the highest original price and highest sale price between the dates of August 1, 2004 and August 10, 2004? Make sure to sort by original price first and sale price second.
Which of the following brand names with the word/letters “liz” in them exist in the Dillard’s database? Select all that apply. Note that you will need more than a single correct selection to answer the question correctly.
What is the lowest store number of all the stores in the STORE_MSA table that are in the city of “little rock”,”memphis”, or “tulsa”?
Some content in this note are taken from the Jupyter Notebooks enumerated below. They are accessible as part of the Managing Big Data with MySQL course on coursera.org, and licensed by Jana Schaich Borg under CC BY-NC 4.0.
- MySQL Exercise 01 Looking at Your Data
- MySQL Exercise 02 Selecting Data Subsets using WHERE
- MySQL Exercise 03 Formatting Selected Data
Other content for this note is taken from the Coursera course “Managing Big Data with MySQL.”