Data Cleaning using SQL

String Manipulation

LEFT and RIGHT

The LEFT (and RIGHT) function is used to pull data from the left (and right) side of the string and then present them as a separate string. LEFT and RIGHT are primarily useful when data is well-structured and predictable.

SELECT DISTINCT date "Datetime",
       LEFT(date, 10) "Date"
  FROM tutorial.sf_crime_incidents_2014_01
Datetime Date
12/14/2013 08:00:00 AM +0000 12/14/2013
11/08/2013 08:00:00 AM +0000 11/08/2013
11/03/2013 07:00:00 AM +0000 11/03/2013

LENGTH

LENGTH returns the length of a string.

SELECT DISTINCT pd_district,
       LENGTH(pd_district) pd_length
  FROM tutorial.sf_crime_incidents_2014_01
pd_district pd_length
SOUTHERN 8
TARAVAL 7
CENTRAL 7

POSITION and STRPOS

POSITION allows you to specify a substring, and provides the position of that substring within another string, counting from the left.

STRPOS is similar, but with slightly different syntax. Note both POSITION and STRPOS are case sensitive.

SELECT address,
       POSITION(' Block of ' IN address) AS break_position,
       STRPOS(address, ' Block of ') AS str_pos_break_position
  FROM tutorial.sf_crime_incidents_2014_01
address break_position str_pos_break_position
0 Block of GARRISON AV 2 2
100 Block of FONT BL 4 4
0 Block of CASTRO ST 2 2

Combinations

Combinations of the foregoing commands allow users to parse complex inputs.

SELECT location,
       LEFT(RIGHT(location, LENGTH(location) - (STRPOS(location, '('))), (STRPOS(location, ', ')-2)) lat,
       RIGHT(LEFT(location, LENGTH(location)-1), LENGTH(location) - STRPOS(location, ', ')-2) lon
  FROM tutorial.sf_crime_incidents_2014_01
location lat lon
(37.709725805163, -122.413623946206) 37.709725805163 -122.413623946206
(37.7154876086057, -122.47370623066) 37.7154876086057 -122.47370623066
(37.7686887134351, -122.435718550322) 37.7686887134351 -122.435718550322

UPPER and LOWER

UPPER and LOWER force the case of a string to be upper or lower, respectively.

SELECT address,
       UPPER(address) AS upper_case,
       LOWER(address) AS lower_case
  FROM tutorial.sf_crime_incidents_2014_01
address upper_case lower_case
0 Block of GARRISON AV 0 BLOCK OF GARRISON AV 0 block of garrison av
100 Block of FONT BL 100 BLOCK OF FONT BL 100 block of font bl
0 Block of CASTRO ST 0 BLOCK OF CASTRO ST 0 block of castro st

CONCAT

CONCAT allows users to concatenate strings from multiple columns together into new strings. CONCAT can also be performed using two pipe characters: ||.

SELECT date datetime,
       time,
       LEFT(date, 10) date,
       CONCAT(LEFT(date, 10), ' ', time) AS new_datetime_str,
       LEFT(date, 10) || ' ' || time AS new_datetime_pipes_str
  FROM tutorial.sf_crime_incidents_2014_01
datetime time date new_datetime_str new_datetime_pipes_str
01/31/2014 08:00:00 AM +0000 17:00 01/31/2014 01/31/2014 17:00 01/31/2014 17:00
01/31/2014 08:00:00 AM +0000 17:45 01/31/2014 01/31/2014 17:45 01/31/2014 17:45
01/31/2014 08:00:00 AM +0000 15:30 01/31/2014 01/31/2014 15:30 01/31/2014 15:30

Data Manipulation

CAST

CAST allows users to create new columns with a different data type than the type of the source data. CAST can also be performed using two colon characters: ::.

Note that the cast has resulted in the format changing from MM/DD/YYYY to the SQL standard YYYY/MM/DD. The rename CTE was required because attempting to cast a column named ‘date’ to the ‘date’ type generated an error.

The CAST function is most useful for turning a string into a number or a date. A simple way of converting a number to a string is to perform any kind of string operation on it (LEFT, RIGHT, SUBSTRING, etc) will perform the conversion automatically.

WITH rename AS (
  SELECT distinct date datetime_string
    FROM tutorial.sf_crime_incidents_2014_01
)

SELECT datetime_string,
       LEFT(datetime_string, 10) date_string,
       CAST(LEFT(datetime_string, 10) AS date) date_cast,
       LEFT(datetime_string, 10)::date date_colons
  FROM rename
datetime_string date_string date_cast date_colons
12/10/2013 08:00:00 AM +0000 12/10/2013 2013-12-10 00:00:00 2013-12-10 00:00:00
11/25/2013 08:00:00 AM +0000 11/25/2013 2013-11-25 00:00:00 2013-11-25 00:00:00
01/05/2014 08:00:00 AM +0000 01/05/2014 2014-01-05 00:00:00 2014-01-05 00:00:00

COALESCE

COALESCE returns the first NULL value in each row. It is useful for replacing NULL values with values of the user’s choice.

SELECT year,
       year_rank,
       "group",
       artist,
       COALESCE(artist, "No Artist") coalesced_artist,
       song_name
  FROM tutorial.billboard_top_100_year_end
 WHERE year = 1959
year year_rank group artist coalesced_artist song_name
1959 1 Johnny Horton Johnny Horton Johnny Horton The Battle Of New Orleans
1959 2 Bobby Darin Bobby Darin Bobby Darin Mack The Knife
1959 3 Lloyd Price Lloyd Price Lloyd Price Personality
1959 4 Frankie Avalon No Artist Venus
1959 5 Paul Anka Paul Anka Paul Anka Lonely Boy

SELECT COUNT(artist),
       COUNT(COALESCE(artist, 'No Artist')) count_after_coalesce
  FROM tutorial.billboard_top_100_year_end
 WHERE year = 1959
count count_after_coalesce
101 102