Intro

This file contains code for assessing and cleaning the jobs data scraped from Indeed.

Preliminaries

In [1]:
import pandas as pd
import numpy as np
import pprint as pp
In [2]:
master = pd.read_csv('master.tsv', sep='\t', index_col=0)
In [3]:
master.shape
Out[3]:
(22101, 8)
In [4]:
master.head()
Out[4]:
i title company location posted job_type parse_date hash
0 0 Research Data Analyst 1 Stanford University California Just posted Organic 2018-04-23 Research Data Analyst 1Stanford UniversityCali...
1 1 Program Manager- Commerce/Data Compliance Apple Santa Clara Valley, CA Just posted Organic 2018-04-23 Program Manager- Commerce/Data ComplianceApple...
2 2 Analyst, Integrated Analytics The Irvine Company Irvine, CA 92617 Just posted Organic 2018-04-23 Analyst, Integrated AnalyticsThe Irvine Compan...
3 3 Senior Analyst I - Business Intelligence Ops Stanford Health Care Palo Alto, CA 94305 Just posted Organic 2018-04-23 Senior Analyst I - Business Intelligence OpsSt...
4 4 Data Specialist Los Angeles Homeless Services Authority Los Angeles, CA 90017 Just posted Organic 2018-04-23 Data SpecialistLos Angeles Homeless Services A...

Assess

This section details my assessment of the data with regard to quality and tidiness. The general approach will be to assess with regard to the following definitions of those terms.

Quality issues pertain to content. Low quality data is also known as dirty data. There are four dimensions of quality data:

  • Completeness: do we have all of the records that we should? Are there missing rows, columns, or cells?
  • Validity: we have the records, but they're not valid, i.e., they don't conform to a defined schema. A schema is a defined set of rules for data. These rules can be real-world constraints (e.g. a negative height is impossible) or table-specific constraints (e.g. unique key constraints in tables).
  • Accuracy: inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect. Example: a patient's weight that is 5 lbs too heavy because the scale was faulty.
  • Consistency: inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. Consistency, i.e., a standard format, in columns that represent the same data across tables and/or within tables is desired.

Tidiness issues pertain to structure. These structural problems generally prevent easy analysis. Untidy data is also known as messy data. The requirements for tidy data are:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.
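Several of these quality dimensions map directly onto pandas operations. A minimal sketch on toy data (hypothetical values, not the scraped file) for completeness, validity, and duplicate detection:

```python
import pandas as pd

# Toy records standing in for the scraped jobs (hypothetical values).
df = pd.DataFrame({
    'title':    ['Data Analyst', 'Data Analyst', None],
    'company':  ['Acme', 'Acme', 'Beta Corp'],
    'location': ['Irvine, CA', 'Irvine, CA', 'Fresno, CA'],
})

# Completeness: are any cells missing?
missing_per_column = df.isnull().sum()

# Validity: does each row conform to a simple schema rule
# (location must end in ', CA')?
valid_location = df['location'].str.endswith(', CA')

# Duplication: are any rows exact copies of another?
dupes = df.duplicated()

print(missing_per_column['title'])  # 1 missing title
print(valid_location.all())         # True
print(dupes.sum())                  # 1 duplicate row
```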

Assess Quality: Completeness

In [5]:
print('       Sponsored Jobs   ' + str(len(master[master.job_type == 'Sponsored'])))
print('Unique Sponsored Jobs   ' + str(master[master.job_type == 'Sponsored'].hash.nunique()))
print('         Organic Jobs   ' + str(len(master[master.job_type == 'Organic'])))
print('  Unique Organic Jobs   ' + str(master[master.job_type == 'Organic'].hash.nunique()))
       Sponsored Jobs   191
Unique Sponsored Jobs   191
         Organic Jobs   21910
  Unique Organic Jobs   21909
In [6]:
master[master.duplicated()]
Out[6]:
i title company location posted job_type parse_date hash
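The counts above show 21,910 organic rows but only 21,909 unique hashes, yet duplicated() finds no fully identical rows: two rows must share a hash while differing in some other column. A toy illustration of the difference between full-row and key-based duplicate checks:

```python
import pandas as pd

# Toy data: two rows share a hash but differ in another column,
# so they are key-duplicates without being full-row duplicates.
df = pd.DataFrame({
    'hash':   ['a1', 'a1', 'b2'],
    'posted': ['Just posted', '1 day ago', 'Just posted'],
})

full_row_dupes = df.duplicated()          # compares all columns
key_dupes = df.duplicated(subset='hash')  # compares only the hash

print(full_row_dupes.sum())                # 0
print(key_dupes.sum())                     # 1
print(df[df.hash.duplicated(keep=False)])  # both rows sharing 'a1'
```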

Assess Quality: Validity

In [7]:
master['title'] = master.title.fillna('')
master['company'] = master.company.fillna('')
master['location'] = master.location.fillna('')

print('   Null Titles   ' + str((master.title.isnull()).sum()))
print('Null Companies   ' + str((master.company.isnull()).sum()))
print(' Null Location   ' + str((master.location.isnull()).sum()))
   Null Titles   0
Null Companies   0
 Null Location   0
  • One job listing had a null company name (the counts above read zero because the nulls were filled with empty strings first)

Fix this issue now, since subsequent commands cannot index with vectors containing null values.
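To see why nulls block indexing: str.contains propagates NaN, and pandas will not accept a mask containing NaN. The na= parameter is an alternative to dropping or filling the nulls first (toy data):

```python
import pandas as pd

companies = pd.Series(['California Psychcare', None, 'Apex 2000'])

# With a null present, str.contains returns NaN for that element,
# and pandas refuses to use the result as a boolean mask.
mask = companies.str.contains('Psychcare')
print(mask.isnull().sum())   # 1: the null row

# Passing na=False treats nulls as non-matches, giving a clean mask.
safe_mask = companies.str.contains('Psychcare', na=False)
print(companies[safe_mask].tolist())   # ['California Psychcare']
```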

In [8]:
master = master[~master.company.isnull()]
In [9]:
print('       Sponsored Jobs     ' + str(len(master[master.job_type == 'Sponsored'])))
print('Unique Sponsored Jobs     ' + str(master[master.job_type == 'Sponsored'].hash.nunique()))
print('         Organic Jobs   ' + str(len(master[master.job_type == 'Organic'])))
print('  Unique Organic Jobs   ' + str(master[master.job_type == 'Organic'].hash.nunique()))
       Sponsored Jobs     191
Unique Sponsored Jobs     191
         Organic Jobs   21910
  Unique Organic Jobs   21909

Assess Quality: Accuracy

No known issues

Assess Quality: Consistency

In [10]:
master.title.value_counts()[0:20]
Out[10]:
Data Analyst                       151
Data Scientist                     149
Administrative Assistant           108
Data Engineer                      107
Business Analyst                    94
Software Engineer                   73
Data Entry Clerk                    65
Financial Analyst                   61
Customer Service Representative     60
Senior Software Engineer            57
Machine Learning Engineer           54
Senior Data Scientist               46
Business Intelligence Analyst       45
Senior Data Engineer                43
Senior Data Analyst                 42
Product Manager                     41
Executive Assistant                 39
Business Systems Analyst            38
Office Assistant                    37
Receptionist                        37
Name: title, dtype: int64
  • Some of these job titles cannot be considered "data science" jobs. Will need to build in a way to filter out the noise.
In [11]:
pp.pprint(list(master.company.value_counts().index.sort_values())[500:520])
['Apartment List',
 'Apeel Sciences',
 'Aperio Group',
 'Apex',
 'Apex 2000',
 'Apex 2000 Inc',
 'Apex 2000 Inc.',
 'Apex Life Sciences',
 'Apex Systems Inc',
 'Apex Sytems, Inc',
 'Apio, Inc.',
 'Apixio',
 'Aplex Technology Inc.',
 'Aplus Search',
 'Apollo Communications',
 'Apollo Interactive',
 'AppBuddy',
 'AppDirect',
 'AppDynamics',
 'AppFolio']
In [12]:
master[master.company.str.contains('Psychcare')]
Out[12]:
i title company location posted job_type parse_date hash
1470 417 ABA Therapist California Psychcare Inc. Santa Clarita, CA Just posted Organic 2018-04-24 ABA TherapistCalifornia Psychcare Inc.Santa Cl...
1482 434 Behavior Therapist California Psychcare Inc. Santa Clarita, CA Just posted Organic 2018-04-24 Behavior TherapistCalifornia Psychcare Inc.San...
1695 742 Patient Liaison - Front End Behavioral Health California Psychcare Van Nuys, CA 91406 Just posted Organic 2018-04-24 Patient Liaison - Front End Behavioral HealthC...
18275 25016 Case Coordinator California Psychcare Special Individualized Tr... Upland, CA Just posted Organic 2018-04-29 00:00:00 Case CoordinatorCalifornia Psychcare Special I...
In [13]:
master[master.company.str.contains('Alta Vista Solutions')]
Out[13]:
i title company location posted job_type parse_date hash
1033 110572 NDE Technician Alta Vista Solutions Inc San Pablo, CA 94806 Today Organic 2018-04-24 NDE TechnicianAlta Vista Solutions IncSan Pabl...
5109 681 Staff Engineer Alta Vista Solutions Richmond, CA Just posted Organic 2018-04-25 Staff EngineerAlta Vista SolutionsRichmond, CA...
5652 103272 Staff Engineer Alta Vista Solutions Inc San Pablo, CA 94806 Just posted Organic 2018-04-26 Staff EngineerAlta Vista Solutions IncSan Pabl...
  • In most cases, the company names are input correctly. In a few situations, alternative company names are specified for the same company ('California Psychcare Inc.' versus 'California Psychcare'). Since there is a large number of company names, it would be best to subset the data to just data science jobs prior to tackling the company names.
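When the time comes to reconcile company names, a first pass could strip common corporate suffixes. This is a sketch with a hypothetical normalize_company helper; it will not catch misspellings like 'Sytems', which would need fuzzy matching on top:

```python
import re

def normalize_company(name):
    """Strip common corporate suffixes and trailing punctuation
    so variants of the same company compare equal."""
    name = re.sub(r',?\s+(Incorporated|Inc|LLC|Corp|Co)\.?$', '', name, flags=re.I)
    return name.strip()

print(normalize_company('California Psychcare Inc.'))  # California Psychcare
print(normalize_company('Apex 2000 Inc'))              # Apex 2000
print(normalize_company('Apex Sytems, Inc'))           # Apex Sytems (typo survives)
```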
In [14]:
master.location.nunique()
Out[14]:
1392
  • Location column usually contains city, but sometimes contains 'California', a county, zip code, or area.

Assess Tidiness: Each variable forms a column.

  • Convert "posted" and "job_type" to categories.
  • Drop 'i' and 'hash' columns.
In [15]:
list(master.columns)
Out[15]:
['i',
 'title',
 'company',
 'location',
 'posted',
 'job_type',
 'parse_date',
 'hash']

Assess Tidiness: Each observation forms a row.

No known issues

Assess Tidiness: Each type of observational unit forms a table.

No known issues

Assessment Summary

  • One job listing has a null company name
  • Location column usually contains city, but sometimes contains 'California', a county, zip code, or area.
  • Drop 'i', 'location', 'job_type', 'posted' and 'hash' columns.
  • Very few of these job titles can be considered "data science" jobs. Will need to build in a way to filter out the noise.
  • Convert parse date back to string.

The final issue, listed below, may be addressed at a later time. For now it is not a significant concern.

  • In most cases, the company names are input correctly. In a few situations, alternative company names are specified for the same company ('California Psychcare Inc.' versus 'California Psychcare'). Since there is a large number of company names, it would be best to subset the data to just data science jobs prior to tackling the company names.

Cleaning

Perform cleaning on a copy of the dataframe from file.

In [16]:
jobs = master.copy()

Define

  • One job listing has a null company name

Previously fixed.

Define

  • Location column usually contains city, but sometimes contains 'California', a county, zip code, or area.

Code

In [17]:
jobs.tail()
Out[17]:
i title company location posted job_type parse_date hash
22096 176 Systems Development - Assistant / Associate Pr... San Diego State University San Diego, CA Today Organic 2018-05-02 00:00:00 Systems Development - Assistant / Associate Pr...
22097 177 RPA COE Operational Manager - Business Systems... Wells Fargo San Francisco, CA Today Organic 2018-05-02 00:00:00 RPA COE Operational Manager - Business Systems...
22098 178 Senior Network Architect Instart Logic Palo Alto, CA 94306 (Barron Park area) Today Organic 2018-05-02 00:00:00 Senior Network ArchitectInstart LogicPalo Alto...
22099 180 Customer Success Mgr, Tech Recruiting Stella.ai Redwood City, CA Today Organic 2018-05-02 00:00:00 Customer Success Mgr, Tech RecruitingStella.ai...
22100 183 Machine Learning Engineer PicnicHealth San Francisco, CA Today Organic 2018-05-02 00:00:00 Machine Learning EngineerPicnicHealthSan Franc...
In [18]:
set(jobs[~jobs.location.str.contains(', CA')].location)
Out[18]:
{'California', 'United States'}

If the location does not contain ', CA', then the only location information given is statewide ('California') or national ('United States').

Every job that has a location with ', CA' in it also includes a city or county name.

In [19]:
cnt_contains_ca = len(jobs[jobs.location.str.contains(', CA')])
cnt_contains_county = len(jobs[jobs.location.str.contains('County')])
cnt_contains_digits = len(jobs[jobs.location.str.contains(r'\d{5}')])
cnt_contains_paren = len(jobs[jobs.location.str.contains('(', regex=False)])

print('Cities/Counties: ' + str(cnt_contains_ca))
print('       Counties: ' + str(cnt_contains_county))
print('         Cities: ' + str(cnt_contains_ca - cnt_contains_county))
print('      Zip Codes: ' + str(cnt_contains_digits))
print('          Areas: ' + str(cnt_contains_paren))
Cities/Counties: 21610
       Counties: 74
         Cities: 21536
      Zip Codes: 7229
          Areas: 1615
In [20]:
print(list(set(jobs[jobs.location.str.contains(', CA')].location.values))[0:3])
print(list(set(jobs[jobs.location.str.contains('County')].location.values))[0:3])
print(list(set(jobs[jobs.location.str.contains(r'\d{5}')].location.values))[0:3])
print(list(set(jobs[jobs.location.str.contains('(', regex=False)].location.values))[0:3])
['Firebaugh, CA', 'Edwards AFB, CA 93524', 'Encino, CA 91436']
['Fresno County, CA', 'Alameda County, CA', 'Lake County, CA']
['Fountain Valley, CA 92708', 'Los Angeles, CA 90073', 'Edwards AFB, CA 93524']
['Sacramento, CA 95834 (Natomas Crossing area)', 'San Francisco, CA 94123 (Marina area)', 'Oakland, CA 94621 (Fitchburg area)']
In [21]:
county = jobs.location.str.extractall(r'(?P<name>[a-zA-Z\s]+)? County, CA')
city = jobs.location.str.extractall(r'(?P<name>[a-zA-Z\s]+)?, CA')
zip_code = jobs.location.str.extractall(r', CA (?P<name>\d{5})?')
area = jobs.location.str.extractall(r'\((?P<name>.*) area\)')

If the posting has 'County' in it, it does not have a city associated with it.

In [22]:
city = city.drop(county.index)

print(' Counties:   ' + str(cnt_contains_county) + ' = ' + str(len(county)))
print('   Cities: ' + str(cnt_contains_ca - cnt_contains_county) + ' = ' + str(len(city)))
print('Zip Codes: ' + str(cnt_contains_digits) + ' = ' + str(len(zip_code)))
print('    Areas:  ' + str(cnt_contains_paren) + ' = ' + str(len(area)))
 Counties:   74 = 74
   Cities: 21536 = 21536
Zip Codes: 7229 = 7229
    Areas:  1615 = 1615
In [23]:
county.index = county.index.droplevel('match')
city.index = city.index.droplevel('match')
zip_code.index = zip_code.index.droplevel('match')
area.index = area.index.droplevel('match')
In [24]:
zip_to_area = {}

for i in area.index:
    if zip_code.loc[i].values[0]:
        zip_to_area[zip_code.loc[i].values[0]] =  city.loc[i].values[0] + ' : ' + area.loc[i].values[0]

print(' Zips: ' + str(len(set(zip_to_area.keys()))))
print('Areas: ' + str(len(set(zip_to_area.values()))) + '\n')
pp.pprint(zip_to_area)
 Zips: 101
Areas: 81

{'93701': 'Fresno : Central',
 '93703': 'Fresno : McLane',
 '93704': 'Fresno : Bullard',
 '93706': 'Fresno : Edison',
 '93710': 'Fresno : Hoover',
 '93711': 'Fresno : Bullard',
 '93720': 'Fresno : Woodward Park',
 '93721': 'Fresno : Central',
 '93722': 'Fresno : West',
 '93726': 'Fresno : McLane',
 '94102': 'San Francisco : Downtown',
 '94103': 'San Francisco : South Of Market',
 '94104': 'San Francisco : Financial District',
 '94105': 'San Francisco : Financial District',
 '94107': 'San Francisco : South Of Market',
 '94108': 'San Francisco : Financial District',
 '94109': 'San Francisco : Nob Hill',
 '94110': 'San Francisco : Mission',
 '94111': 'San Francisco : Financial District',
 '94114': 'San Francisco : Castro-Upper Market',
 '94115': 'San Francisco : Western Addition',
 '94117': 'San Francisco : Haight-Ashbury',
 '94118': 'San Francisco : Inner Richmond',
 '94121': 'San Francisco : Outer Richmond',
 '94122': 'San Francisco : Outer Sunset',
 '94123': 'San Francisco : Marina',
 '94124': 'San Francisco : Bayview',
 '94126': 'San Francisco : Financial District',
 '94131': 'San Francisco : Diamond Heights',
 '94132': 'San Francisco : Lakeshore',
 '94133': 'San Francisco : Russian Hill',
 '94134': 'San Francisco : Excelsior',
 '94143': 'San Francisco : Inner Sunset',
 '94158': 'San Francisco : South Of Market',
 '94203': 'Sacramento : Alkali Flat',
 '94204': 'Sacramento : Erikson Industrial Park',
 '94301': 'Palo Alto : Professorville',
 '94303': 'Palo Alto : Duveneck-Saint Francis',
 '94306': 'Palo Alto : Barron Park',
 '94401': 'San Mateo : North Central',
 '94402': 'San Mateo : Beresford',
 '94403': 'San Mateo : Sugerloaf',
 '94536': 'Fremont : Centerville',
 '94537': 'Fremont : Cabrillo',
 '94538': 'Fremont : Irvington',
 '94539': 'Fremont : Cameron Hills',
 '94540': 'Hayward : Santa Clara',
 '94544': 'Hayward : Tennyson-Alquire',
 '94545': 'Hayward : Glen Eden',
 '94601': 'Oakland : Harrington',
 '94602': 'Oakland : Upper Dimond',
 '94606': 'Oakland : Rancho San Antonio',
 '94607': 'Oakland : Acorn-Acorn Industrial',
 '94610': 'Oakland : Lakeshore',
 '94612': 'Oakland : Northgate',
 '94613': 'Oakland : Mills College',
 '94618': 'Oakland : Piedmont',
 '94619': 'Oakland : Redwood Heights',
 '94621': 'Oakland : Fitchburg',
 '95101': 'San Jose : North Valley',
 '95110': 'San Jose : Downtown',
 '95111': 'San Jose : Edenvale-Seven Trees',
 '95112': 'San Jose : Downtown',
 '95113': 'San Jose : Downtown',
 '95115': 'San Jose : Willow Glen',
 '95116': 'San Jose : North Valley',
 '95118': 'San Jose : Cambrian Park',
 '95119': 'San Jose : Santa Teresa',
 '95120': 'San Jose : Almaden Valley',
 '95122': 'San Jose : East San Jose',
 '95123': 'San Jose : Blossom Valley',
 '95124': 'San Jose : Cambrian Park',
 '95125': 'San Jose : Willow Glen',
 '95126': 'San Jose : Rose Garden',
 '95128': 'San Jose : Willow Glen',
 '95129': 'San Jose : West San Jose',
 '95131': 'San Jose : North Valley',
 '95133': 'San Jose : North Valley',
 '95134': 'San Jose : North San Jose',
 '95138': 'San Jose : Santa Teresa',
 '95202': 'Stockton : Civic Center',
 '95203': 'Stockton : Civic Center',
 '95204': 'Stockton : Pacific',
 '95206': 'Stockton : Weston Ranch',
 '95207': 'Stockton : Lakeview',
 '95210': 'Stockton : Valley Oak',
 '95211': 'Stockton : Pacific',
 '95219': 'Stockton : Lincoln Village West',
 '95616': 'Davis : Central Davis',
 '95618': 'Davis : Downtown Core',
 '95811': 'Sacramento : Downtown',
 '95814': 'Sacramento : Mansion Flats',
 '95815': 'Sacramento : South Hagginwood',
 '95816': 'Sacramento : Marshall School',
 '95818': 'Sacramento : Land Park',
 '95819': 'Sacramento : East Sacramento',
 '95822': 'Sacramento : Airport',
 '95823': 'Sacramento : Parkway',
 '95833': 'Sacramento : South Natomas',
 '95834': 'Sacramento : Natomas Crossing',
 '95838': 'Sacramento : Robla'}

As parsed, area names are not unique: there are multiple 'Downtown' areas, for example. Appending the city name makes them unique, as shown above.

In [25]:
city = city.reindex(index=jobs.index)
county = county.reindex(index=jobs.index)
zip_code = zip_code.reindex(index=jobs.index)
area = area.reindex(index=jobs.index)
In [26]:
jobs = jobs.replace({'location': {'California': ''}})
In [27]:
jobs['zip'] = zip_code.values
jobs['city'] = city.values
jobs['county'] = county.values

jobs['area'] = area.values
jobs['area'] = np.where(jobs['area'].isnull(),
                              float('NaN'),
                              jobs['city'] + ' - ' + jobs['area'])
In [28]:
jobs.head()
Out[28]:
i title company location posted job_type parse_date hash zip city county area
0 0 Research Data Analyst 1 Stanford University Just posted Organic 2018-04-23 Research Data Analyst 1Stanford UniversityCali... NaN NaN NaN NaN
1 1 Program Manager- Commerce/Data Compliance Apple Santa Clara Valley, CA Just posted Organic 2018-04-23 Program Manager- Commerce/Data ComplianceApple... NaN Santa Clara Valley NaN NaN
2 2 Analyst, Integrated Analytics The Irvine Company Irvine, CA 92617 Just posted Organic 2018-04-23 Analyst, Integrated AnalyticsThe Irvine Compan... 92617 Irvine NaN NaN
3 3 Senior Analyst I - Business Intelligence Ops Stanford Health Care Palo Alto, CA 94305 Just posted Organic 2018-04-23 Senior Analyst I - Business Intelligence OpsSt... 94305 Palo Alto NaN NaN
4 4 Data Specialist Los Angeles Homeless Services Authority Los Angeles, CA 90017 Just posted Organic 2018-04-23 Data SpecialistLos Angeles Homeless Services A... 90017 Los Angeles NaN NaN

Test

In [29]:
print(' Counties:   ' + str(cnt_contains_county) + \
      '  = ' + str(len(jobs[~jobs.county.isnull()])))
print('   Cities: ' + str(cnt_contains_ca - cnt_contains_county) + \
      ' = ' + str(len(jobs[~jobs.city.isnull()])))
print('Zip Codes: ' + str(cnt_contains_digits) + \
      '  = ' + str(len(jobs[~jobs.zip.isnull()])))
print('    Areas:  ' + str(cnt_contains_paren) + \
      ' = ' + str(len(jobs[~jobs.area.isnull()])))
 Counties:   74  = 74
   Cities: 21536 = 21536
Zip Codes: 7229  = 7229
    Areas:  1615 = 1615

Define

Add a 'weeks_old' column derived from the 'posted' column.

Code

In [30]:
jobs['weeks_old'] = 0

jobs.loc[jobs['posted'].isnull(), 'weeks_old'] = 'N/A'

jobs.loc[jobs['posted'].isnull(), 'posted'] = 'N/A'
In [31]:
posted_to_age = {
    '30 days ago' : 4,
    '29 days ago' : 4,
    '28 days ago' : 4,
    '27 days ago' : 3,
    '26 days ago' : 3,
    '25 days ago' : 3,
    '24 days ago' : 3,
    '23 days ago' : 3,
    '22 days ago' : 3,
    '21 days ago' : 3,
    '20 days ago' : 2,
    '19 days ago' : 2,
    '18 days ago' : 2,
    '17 days ago' : 2,
    '16 days ago' : 2,
    '15 days ago' : 2,
    '14 days ago' : 2,
    '13 days ago' : 1,
    '12 days ago' : 1,
    '11 days ago' : 1,
    '10 days ago' : 1,
    '9 days ago'  : 1,
    '8 days ago'  : 1,
    '7 days ago'  : 1,
}

for posted_phrase, weeks_old in posted_to_age.items():
    jobs.loc[jobs.posted.str.contains(posted_phrase), 'weeks_old'] = weeks_old
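The hand-written table above encodes a simple rule, weeks_old = days // 7, so it could also be generated with a comprehension (a sketch; the literal dict is equally valid):

```python
# Same mapping as the literal posted_to_age table: 7-13 days -> 1 week,
# 14-20 -> 2, 21-27 -> 3, 28-30 -> 4.
posted_to_age = {f'{d} days ago': d // 7 for d in range(7, 31)}

print(posted_to_age['7 days ago'])    # 1
print(posted_to_age['13 days ago'])   # 1
print(posted_to_age['14 days ago'])   # 2
print(posted_to_age['30 days ago'])   # 4
print(len(posted_to_age))             # 24 entries, matching the table above
```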

Test

In [32]:
jobs.weeks_old.value_counts()
Out[32]:
0      17270
1       2743
3       1027
2        704
N/A      191
4        166
Name: weeks_old, dtype: int64
In [33]:
jobs.head()
Out[33]:
i title company location posted job_type parse_date hash zip city county area weeks_old
0 0 Research Data Analyst 1 Stanford University Just posted Organic 2018-04-23 Research Data Analyst 1Stanford UniversityCali... NaN NaN NaN NaN 0
1 1 Program Manager- Commerce/Data Compliance Apple Santa Clara Valley, CA Just posted Organic 2018-04-23 Program Manager- Commerce/Data ComplianceApple... NaN Santa Clara Valley NaN NaN 0
2 2 Analyst, Integrated Analytics The Irvine Company Irvine, CA 92617 Just posted Organic 2018-04-23 Analyst, Integrated AnalyticsThe Irvine Compan... 92617 Irvine NaN NaN 0
3 3 Senior Analyst I - Business Intelligence Ops Stanford Health Care Palo Alto, CA 94305 Just posted Organic 2018-04-23 Senior Analyst I - Business Intelligence OpsSt... 94305 Palo Alto NaN NaN 0
4 4 Data Specialist Los Angeles Homeless Services Authority Los Angeles, CA 90017 Just posted Organic 2018-04-23 Data SpecialistLos Angeles Homeless Services A... 90017 Los Angeles NaN NaN 0

Define

  • Drop 'i', 'location', 'job_type', 'posted' and 'hash' columns.

Code

In [34]:
jobs = jobs.drop(['i', 'location', 'hash', 'job_type', 'posted'], axis=1)
# jobs = jobs[['title', 'company', 'zip', 'city', 'area', 'county', 'parse_date']]

Test

In [35]:
jobs.sample(1)
Out[35]:
title company parse_date zip city county area weeks_old
18556 Data Entry Clerk Performance Settlement 2018-04-30 00:00:00 NaN Irvine NaN NaN 0

Define

  • Truncate parse_date back down to 10-character string.

Code

In [36]:
jobs['parse_date'] = jobs['parse_date'].str.slice(0,10)

Test

In [37]:
jobs.parse_date.value_counts()
Out[37]:
2018-04-29    5409
2018-04-26    4564
2018-04-24    2522
2018-04-27    2128
2018-04-25    2119
2018-05-01    1900
2018-04-30    1463
2018-04-23     995
2018-04-28     778
2018-05-02     223
Name: parse_date, dtype: int64

Define

  • Not all of these job titles can be considered "data science" jobs. Will need to build in a way to filter out the noise.

Code

In [38]:
print('Title contains \'Analyst\'    ' + str(len(jobs[jobs.title.str.contains('Analyst')])))
print('Title contains \'Analysis\'     ' + str(len(jobs[jobs.title.str.contains('Analysis')])))
print('Title contains \'Analytics\'   ' + str(len(jobs[jobs.title.str.contains('Analytics')])))
print('Title contains \'Data\'       ' + str(len(jobs[jobs.title.str.contains('Data')])))
print('Title contains \'SQL\'         ' + str(len(jobs[jobs.title.str.contains('SQL')])))
print('Title contains \'Python\'       ' + str(len(jobs[jobs.title.str.contains('Python')])))
Title contains 'Analyst'    3175
Title contains 'Analysis'     56
Title contains 'Analytics'   410
Title contains 'Data'       2931
Title contains 'SQL'         228
Title contains 'Python'       45
In [39]:
terms_to_analytics = {
    'Analyst'                 : True,
    'Analysis'                : True,
    'Analytics'               : True,
    'Data'                    : True,
    'Machine Learning'        : True,
    'SQL'                     : True,
    'Deep Learning'           : True,
    'Business Intelligence'   : True,
    'Python'                  : True,
    'Artificial Intelligence' : True,
    'ML'                      : True,
    'AI'                      : True,

    'Financial'               : False,
    'Finance'                 : False,
    'Data Entry'              : False,
    'Administrative'          : False,
    'Coatings'                : False,
    'Solidworks'              : False,
    'SOLIDWORKS'              : False,
    'AML'                     : False,
    'HTML'                    : False,
    'NMLS'                    : False,
    'MLT'                     : False,
    'AID'                     : False,
    'RETAIL'                  : False,
    'MAIL'                    : False,
    'AIR'                     : False,
    'RAIN'                    : False,
    'RAIL'                    : False,
    'CHAI'                    : False
}
In [40]:
jobs['analytics']=False

for search_term, analytics in terms_to_analytics.items():
    jobs.loc[jobs.title.str.contains(search_term), 'analytics'] = analytics
In [41]:
jobs.head()
Out[41]:
title company parse_date zip city county area weeks_old analytics
0 Research Data Analyst 1 Stanford University 2018-04-23 NaN NaN NaN NaN 0 True
1 Program Manager- Commerce/Data Compliance Apple 2018-04-23 NaN Santa Clara Valley NaN NaN 0 True
2 Analyst, Integrated Analytics The Irvine Company 2018-04-23 92617 Irvine NaN NaN 0 True
3 Senior Analyst I - Business Intelligence Ops Stanford Health Care 2018-04-23 94305 Palo Alto NaN NaN 0 True
4 Data Specialist Los Angeles Homeless Services Authority 2018-04-23 90017 Los Angeles NaN NaN 0 True

Test

This filtering is far from perfect, but it is a good first approximation. The queries as set up should find the majority of data-related job postings on Indeed, but they will probably also catch roles that are not strictly analytics-related.
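One way to shrink the exclusion list (a sketch, not what the notebook does): match the short acronyms with regex word boundaries, so 'ML' no longer fires on 'HTML' or 'AML':

```python
import pandas as pd

titles = pd.Series(['ML Engineer', 'HTML Developer', 'AML Analyst',
                    'AI Researcher', 'Retail AIDE'])

# Plain substring search flags 'HTML' and 'AML' as machine learning.
print(titles.str.contains('ML').tolist())
# [True, True, True, False, False]

# \b word boundaries only match 'ML'/'AI' as standalone tokens,
# which removes the need for most of the False entries above.
print(titles.str.contains(r'\bML\b').tolist())
# [True, False, False, False, False]
print(titles.str.contains(r'\bAI\b').tolist())
# [False, False, False, True, False]
```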

In [42]:
jobs[jobs.analytics == True].title.value_counts()[0:25]
Out[42]:
Data Analyst                      151
Data Scientist                    149
Data Engineer                     107
Business Analyst                   94
Machine Learning Engineer          54
Senior Data Scientist              46
Business Intelligence Analyst      45
Senior Data Engineer               43
Senior Data Analyst                42
Business Systems Analyst           38
Research Analyst                   30
Senior Business Analyst            24
Analyst                            21
Data Analyst Intern                20
Database Administrator             20
Database Engineer                  16
Business Operations Analyst        15
Machine Learning Scientist         14
Sales Operations Analyst           14
Business Intelligence Engineer     14
Sr. Data Scientist                 13
SQL Developer                      12
Operations Analyst                 12
Data Specialist                    12
Staff Data Scientist               11
Name: title, dtype: int64

Reduce List to only Analytics Jobs

In [43]:
all_unique_jobs = len(jobs)
jobs = jobs[jobs.analytics == True]
jobs = jobs.drop(['analytics'], axis=1)
analytics_jobs_only = len(jobs)

print('Total unique jobs: {}'.format(all_unique_jobs))
print('   Analytics jobs:  {}'.format(analytics_jobs_only))
print('                      {}% Analytics'.format(int((analytics_jobs_only/all_unique_jobs)*100)))
Total unique jobs: 22101
   Analytics jobs:  5816
                      26% Analytics

Aggregate

In [44]:
print('   Jobs with zip and without city      {}'.format(jobs[jobs['city'].isnull()].zip.notnull().sum()))
print('  Jobs with area and without city      {}'.format(jobs[jobs['city'].isnull()].area.notnull().sum()))
print('Jobs with county and without city      {}'.format(jobs[jobs['city'].isnull()].county.notnull().sum()))
print('Jobs with county                    {}'.format(len(jobs[jobs['county'].notnull()])))
print('        Total Jobs without a city    {}'.format(len(jobs[jobs['city'].isnull()])))
print('                       Total Jobs   {}'.format(len(jobs)))
   Jobs with zip and without city      0
  Jobs with area and without city      0
Jobs with county and without city      10
Jobs with county                    10
        Total Jobs without a city    151
                       Total Jobs   5816
In [45]:
jobs.head()
Out[45]:
title company parse_date zip city county area weeks_old
0 Research Data Analyst 1 Stanford University 2018-04-23 NaN NaN NaN NaN 0
1 Program Manager- Commerce/Data Compliance Apple 2018-04-23 NaN Santa Clara Valley NaN NaN 0
2 Analyst, Integrated Analytics The Irvine Company 2018-04-23 92617 Irvine NaN NaN 0
3 Senior Analyst I - Business Intelligence Ops Stanford Health Care 2018-04-23 94305 Palo Alto NaN NaN 0
4 Data Specialist Los Angeles Homeless Services Authority 2018-04-23 90017 Los Angeles NaN NaN 0
In [46]:
cities = pd.read_csv('cities.csv', index_col=0)
cities = cities.drop(['Population'], axis=1)
cities.head()
Out[46]:
City County
0 Los Angeles Los Angeles
1 San Diego San Diego
2 San Jose Santa Clara
3 San Francisco San Francisco
4 Fresno Fresno
In [47]:
jobs = jobs.join(cities.set_index('City'), on='city')
jobs.loc[jobs.county.isnull(), 'county'] = jobs['County']
jobs = jobs.drop('County',axis=1)
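The join above fills in missing counties from the cities lookup. An equivalent approach, sketched on toy data (the city-to-county values here are illustrative), uses a Series map with fillna:

```python
import pandas as pd

# Toy frame: two rows missing a county, one already filled.
jobs_toy = pd.DataFrame({'city':   ['Irvine', 'San Jose', None],
                         'county': [None, None, 'Alameda']})
city_to_county = pd.Series({'Irvine': 'Orange', 'San Jose': 'Santa Clara'})

# Map each city to its county, then fill only the missing county cells.
jobs_toy['county'] = jobs_toy['county'].fillna(jobs_toy['city'].map(city_to_county))
print(jobs_toy['county'].tolist())   # ['Orange', 'Santa Clara', 'Alameda']
```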
In [48]:
print('   Jobs with zip and without city      {}'.format(jobs[jobs['city'].isnull()].zip.notnull().sum()))
print('  Jobs with area and without city      {}'.format(jobs[jobs['city'].isnull()].area.notnull().sum()))
print('Jobs with county and without city      {}'.format(jobs[jobs['city'].isnull()].county.notnull().sum()))
print('Jobs with county                    {}'.format(len(jobs[jobs['county'].notnull()])))
print('        Total Jobs without a city    {}'.format(len(jobs[jobs['city'].isnull()])))
print('                       Total Jobs   {}'.format(len(jobs)))
   Jobs with zip and without city      0
  Jobs with area and without city      0
Jobs with county and without city      10
Jobs with county                    5663
        Total Jobs without a city    151
                       Total Jobs   5816
In [49]:
terms_to_classification = {
    'SQL'                      : 'SQL',

    'Deep.*Learning'           : 'Deep Learning',

    'Artificial.*Intelligence' : 'Artificial Intelligence',
    'AI'                       : 'Artificial Intelligence',

    'Business.*Intelligence'   : 'Business Intelligence',
    'BI'                       : 'Business Intelligence',
    'Intelligence.*Analyst'    : 'Business Intelligence',

    'Business.*Analyst'        : 'Business Analyst',
    'Sales.*Analyst'           : 'Business Analyst',
    'Operations.*Analyst'      : 'Business Analyst',
    'Market.*Analyst'          : 'Business Analyst',
    'Supply.*Analyst'          : 'Business Analyst',
    'Product.*Analyst'         : 'Business Analyst',
    'Research.*Analyst'        : 'Business Analyst',
    'Pricing.*Analyst'         : 'Business Analyst',
    'Reporting.*Analyst'       : 'Business Analyst',
    'Business.*Analytics'      : 'Business Analyst',

    'Machine.*Learning'        : 'Machine Learning',
    'ML'                       : 'Machine Learning',

    'Data.*Engineer'           : 'Data Engineer',
    'Database'                 : 'Data Engineer',
    'Data.*Architect'          : 'Data Engineer',

    'Data.*Analyst'            : 'Data Analyst',
    'Data.*Analytics'          : 'Data Analyst',
    'Programmer.*Analyst'      : 'Data Analyst',

    'Data.*Scientist'          : 'Data Scientist',
    'Data.*Science'            : 'Data Scientist'
}
In [50]:
jobs['classification'] = 'Other'

for search_term, classification in terms_to_classification.items():
    jobs.loc[jobs.title.str.contains(search_term, regex=True), 'classification'] = classification
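This loop relies on dict insertion order (guaranteed in Python 3.7+): patterns later in terms_to_classification overwrite earlier matches, so the 'Data.*Scientist' rules take precedence over 'Business.*Intelligence'. A toy sketch of that behavior:

```python
import pandas as pd

# A title matching two rules ends up classified by the later one.
toy = pd.DataFrame({'title': ['Business Intelligence Data Scientist']})
toy['classification'] = 'Other'

rules = {
    'Business.*Intelligence': 'Business Intelligence',
    'Data.*Scientist':        'Data Scientist',
}
for pattern, label in rules.items():
    toy.loc[toy.title.str.contains(pattern, regex=True), 'classification'] = label

print(toy.classification.iloc[0])   # Data Scientist
```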
In [51]:
print('Total Jobs                 ' + str(len(jobs)))

jobs['classification'].value_counts()
Total Jobs                 5816
Out[51]:
Other                      1790
Business Analyst           1053
Data Analyst                859
Data Scientist              652
Data Engineer               566
Machine Learning            386
Artificial Intelligence     160
SQL                         142
Business Intelligence       128
Deep Learning                80
Name: classification, dtype: int64

Assign Week

In [52]:
jobs['parse_y_w'] = pd.to_datetime(jobs['parse_date']).dt.strftime('%Y_%U').str[2:]
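The %U directive numbers weeks of the year with Sunday as the first day; slicing off the century yields the compact 'YY_WW' label. A quick check on two of the parse dates:

```python
from datetime import date

# 2018-04-29 (a Sunday) falls in week 17 under %U numbering.
label = date(2018, 4, 29).strftime('%Y_%U')[2:]
print(label)   # 18_17

# 2018-04-23 falls in the preceding week.
print(date(2018, 4, 23).strftime('%Y_%U')[2:])   # 18_16
```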
In [53]:
jobs = jobs[['classification', 'title', 'county', 'company', 'parse_y_w', 'weeks_old', 'zip', 'city', 'area']]

Reset Index

In [54]:
jobs = jobs.reset_index(drop=True)
In [55]:
jobs.tail(10)
Out[55]:
classification title county company parse_y_w weeks_old zip city area
5806 SQL Oracle PL/SQL Argus Developer San Mateo Quantum World Technologies Inc 18_17 0 NaN Foster City NaN
5807 SQL SQL Developer Los Angeles IT People Corporation 18_17 0 NaN El Segundo NaN
5808 Data Engineer Database Engineer San Mateo PLAYSTUDIOS, Inc. 18_17 0 NaN Burlingame NaN
5809 Other 01. Analyst, Supply Orange MOBIS Parts America LLC 18_17 0 NaN Fountain Valley NaN
5810 Other MSP|IT-Analyst/Developer - Specialty II, San R... Contra Costa IaIsystem 18_17 0 NaN San Ramon NaN
5811 Business Intelligence BI Analyst San Francisco Avispa Technology 18_17 0 NaN San Francisco NaN
5812 Data Analyst Database Developer/ BI Developer/ Data Analyst San Diego PCS Global Tech Pvt Ltd 18_17 0 NaN San Diego NaN
5813 Data Analyst Sr. Data Analyst Santa Clara Matrix Technology Group 18_17 0 NaN San Jose NaN
5814 Artificial Intelligence Sr. Vehicle Systems Validation Engineer – IAI ... Los Angeles Faraday Future, Inc. 18_17 0 NaN Gardena NaN
5815 Machine Learning Machine Learning Engineer San Francisco PicnicHealth 18_17 0 NaN San Francisco NaN

Write Out

In [56]:
#jobs_clean.to_csv('ca_analytics_job_postings.tsv', sep='\t')
jobs.to_csv('clean_jobs.tsv', sep='\t')