Data Cleaning using SQL
For each of the example queries, the first few rows of the result are included.
String Manipulation
LEFT and RIGHT
The LEFT
(and RIGHT
) function is used to pull data from the left (and right) side of the string and then present them as a separate string. LEFT
and RIGHT
are primarily useful when data is well-structured and predictable.
SELECT DISTINCT date "Datetime",
LEFT(date, 10) "Date"
FROM tutorial.sf_crime_incidents_2014_01
Datetime | Date |
---|---|
12/14/2013 08:00:00 AM +0000 | 12/14/2013 |
11/08/2013 08:00:00 AM +0000 | 11/08/2013 |
11/03/2013 07:00:00 AM +0000 | 11/03/2013 |
LENGTH
LENGTH
returns the length of a string.
SELECT DISTINCT pd_district,
LENGTH(pd_district) pd_length
FROM tutorial.sf_crime_incidents_2014_01
pd_district | pd_length |
---|---|
SOUTHERN | 8 |
TARAVAL | 7 |
CENTRAL | 7 |
POSITION and STRPOS
POSITION
allows you to specify a substring, and provides the position of that substring within another string, counting from the left.
STRPOS
is similar, but with slightly different syntax. Note both POSITION
and STRPOS
are case sensitive.
SELECT address,
POSITION(' Block of ' IN address) AS break_position,
STRPOS(address, ' Block of ') AS str_pos_break_position
FROM tutorial.sf_crime_incidents_2014_01
address | break_position | str_pos_break_position |
---|---|---|
0 Block of GARRISON AV | 2 | 2 |
100 Block of FONT BL | 4 | 4 |
0 Block of CASTRO ST | 2 | 2 |
Combinations
Combinations of the foregoing commands allow users to parse complex inputs.
SELECT location,
LEFT(RIGHT(location, LENGTH(location) - (STRPOS(location, '('))), (STRPOS(location, ', ')-2)) lat,
RIGHT(LEFT(location, LENGTH(location)-1), LENGTH(location) - STRPOS(location, ', ')-2) lon
FROM tutorial.sf_crime_incidents_2014_01
location | lat | lon |
---|---|---|
(37.709725805163, -122.413623946206) | 37.709725805163 | -122.413623946206 |
(37.7154876086057, -122.47370623066) | 37.7154876086057 | -122.47370623066 |
(37.7686887134351, -122.435718550322) | 37.7686887134351 | -122.435718550322 |
UPPER and LOWER
UPPER
and LOWER
force the case of a string to be upper or lower, respectively.
SELECT address,
UPPER(address) AS upper_case,
LOWER(address) AS lower_case
FROM tutorial.sf_crime_incidents_2014_01
address | upper_case | lower_case |
---|---|---|
0 Block of GARRISON AV | 0 BLOCK OF GARRISON AV | 0 block of garrison av |
100 Block of FONT BL | 100 BLOCK OF FONT BL | 100 block of font bl |
0 Block of CASTRO ST | 0 BLOCK OF CASTRO ST | 0 block of castro st |
CONCAT
CONCAT
allows users to concatenate strings from multiple columns together into new strings. CONCAT
can also be performed using two pipe characters: ||
.
SELECT date datetime,
time,
LEFT(date, 10) date,
CONCAT(LEFT(date, 10), ' ', time) AS new_datetime_str,
LEFT(date, 10) || ' ' || time AS new_datetime_pipes_str
FROM tutorial.sf_crime_incidents_2014_01
datetime | time | date | new_datetime_str | new_datetime_pipes_str |
---|---|---|---|---|
01/31/2014 08:00:00 AM +0000 | 17:00 | 01/31/2014 | 01/31/2014 17:00 | 01/31/2014 17:00 |
01/31/2014 08:00:00 AM +0000 | 17:45 | 01/31/2014 | 01/31/2014 17:45 | 01/31/2014 17:45 |
01/31/2014 08:00:00 AM +0000 | 15:30 | 01/31/2014 | 01/31/2014 15:30 | 01/31/2014 15:30 |
Data Manipulation
CAST
CAST
allows users to create new columns with a different data type than the type of the source data. CAST
can also be performed using two colon characters: ::
.
Note that the cast has resulted in the format changing from MM/DD/YYYY to the SQL standard YYYY/MM/DD. The rename CTE was required because attempting to cast a column named ‘date’ to the ‘date’ type generated an error.
The CAST
function is most useful for turning a string into a number or a date. A simple way of converting a number to a string is to perform any kind of string operation on it (LEFT
, RIGHT
, SUBSTRING
, etc) will perform the conversion automatically.
WITH rename AS (
SELECT distinct date datetime_string
FROM tutorial.sf_crime_incidents_2014_01
)
SELECT datetime_string,
LEFT(datetime_string, 10) date_string,
CAST(LEFT(datetime_string, 10) AS date) date_cast,
LEFT(datetime_string, 10)::date date_colons
FROM rename
datetime_string | date_string | date_cast | date_colons |
---|---|---|---|
12/10/2013 08:00:00 AM +0000 | 12/10/2013 | 2013-12-10 00:00:00 | 2013-12-10 00:00:00 |
11/25/2013 08:00:00 AM +0000 | 11/25/2013 | 2013-11-25 00:00:00 | 2013-11-25 00:00:00 |
01/05/2014 08:00:00 AM +0000 | 01/05/2014 | 2014-01-05 00:00:00 | 2014-01-05 00:00:00 |
COALESCE
COALESCE
returns the first NULL
value in each row. It is useful for replacing NULL
values with values of the user’s choice.
SELECT year,
year_rank,
"group",
artist,
COALESCE(artist, "No Artist") coalesced_artist,
song_name
FROM tutorial.billboard_top_100_year_end
WHERE year = 1959
year | year_rank | group | artist | coalesced_artist | song_name |
---|---|---|---|---|---|
1959 | 1 | Johnny Horton | Johnny Horton | Johnny Horton | The Battle Of New Orleans |
1959 | 2 | Bobby Darin | Bobby Darin | Bobby Darin | Mack The Knife |
1959 | 3 | Lloyd Price | Lloyd Price | Lloyd Price | Personality |
1959 | 4 | Frankie Avalon | No Artist | Venus | |
1959 | 5 | Paul Anka | Paul Anka | Paul Anka | Lonely Boy |
SELECT COUNT(artist),
COUNT(COALESCE(artist, 'No Artist')) count_after_coalesce
FROM tutorial.billboard_top_100_year_end
WHERE year = 1959
count | count_after_coalesce |
---|---|
101 | 102 |