“Ask any analyst or data scientist, and they’ll tell you the bulk of their work happens in SQL.” - Derek Steer, CEO of Mode Analytics.
Why Databases? Because databases are the best way to store large amounts of data, and 21st century organizations have huge, and exponentially growing, datasets.
A database can be thought of a group of connected Excel tables. They also allow information to be accessed rapidly, and by multiple users concurrently. This differentiates them from other, more limited data storage solutions. These characteristics also enable them to scale to sizes that other software storage solutions would not be able accommodate.
SQL stands for Structured Query Language. SQL, which has existed since the 1970s, is a language used to interact with databases. It is the most common way of accessing databases today, and it has a variety of functions that enable it to read, manipulate, and change data.
SQL is so widely used because it:
- Is easy to understand and use,
- Enables you to access data directly,
- Makes it easy to audit and replicate queries, and it
- Enables you to work on multiple tables at once, across large datasets.
SQL functionality includes performing the sort of aggregations you would normally perform in an Excel pivot table, but Excel maxes out at around a million rows. SQL, on the other hand, can cope with billions. SQL is also far more flexible than most out-of-the-box business dashboard software, an example of which is Google Analytics. This flexibility combined with relative ease of use may be the best reasons for its continued popularity.
Given appropriately constructed databases, companies can use SQL to answer questions like:
- Which product lines are best performing?
- Which marketing channel should be invested in?
- How many users return to my site within 1 and 3 weeks of their first visit?
- What strategies typically bring website users back?
SQL’s Importance in the Job Market
Tons of jobs make use of SQL. This article at Zippia.com discusses a list of 50 job titles that likely require the SQL skillset. Among these are:
- Business Analyst,
- Software Engineer,
- Data Analyst, and 46 more.
A cursory glance at job postings makes it clear how crucial this skillset is to 21st century knowledge workers.
Types of SQL
There are many different dialects of SQL that are used with different types of SQL databases. Some of the most popular databases include:
- Postgres, and
- Microsoft SQL Server.
Three of the most popularly used types of SQL are:
- PostgreSQL, and
Each of these different types of SQL have slight variations in functionality and syntax, but they are sufficiently similar that skills gained with one type are generally transferrable to the others.
NoSQL is a type of SQL that is optimized for web based data, but isn’t as commonly used for data that lives in spreadsheets. NoSQL stands for “not only SQL.” One of the most popular NoSQL languages is called MongoDB.
My notes use PostgreSQL because that is the version used by Udacity.
Entity Relationship Diagrams
ERDs, Entity Relationship Diagrams, describe relationships between tables of data in a database. These diagrams contain such information as
- the names of the tables,
- the columns in each table, and
- how the tables “fit together.”
ERDs are worthy of their own post. The nomenclature, not to mention the design of the systems they describe, is a deep subject that will not be covered here. The image above is included as an example of how a database could be structured. Each of the tables in the image above can be thought of as its own Excel table. These tables are linked together by a series of keys, which allows SQL to join them together or otherwise manipulate them as a group.
Statements and Queries
The function of a “SQL statement” is to communicate to the database what you’d like to do with the data. Types of statements include
- Create - Creates a new table in the database.
- Drop Table - Removes a table from the database.
- Select - Allows you to read data and display it. Select statements are commonly referred to as queries.
There are a few other types of statements related to the creation and removal of data. Most analytical work involves reading and manipulating data without altering the underlying source, as those other commands enable. Therefore, these notes will focus primarily the subset of SQL commands referred to as queries.
SQL is generally not case sensitive, but common formatting convention is to capitalize keywords (IE,
FROM) while leaving the names of columns and tables lowercase (
account). This aids in the legibility of commands, which only becomes more important as the queries become more complex.
Column names should use underscores instead of spaces (
"web events"). In Postgres, if there is a space in a column or table name, that name needs to be referred to with double quotes around them.
Different SQL environments may or may not require a semicolon at the end of the query. Best practice is to include it whether or not it is required, which also allows multiple queries to be run at once.
There are a variety of ways to store files containing SQL code. My workflow when taking online courses is to write SQL into my atom text editor, and save it with the .sql extension. SQL text formatting is applied automatically; it renders with text highlighting, as shown below. Another possibility is to use a site that provides online repositories for code. An example is modeanalytics.com, which also hosts a very in-dept SQL tutorial.