Intro

This file contains code for assessing and cleaning the jobs data scraped from Indeed.

Preliminaries

In [22]:
import pandas as pd
import os.path
In [23]:
# First reading: data-ca-180423.tsv
# Change this to 'jobs_raw.tsv' for subsequent readings
raw = pd.read_csv('jobs_raw.tsv', sep='\t')

Drop Old Postings

Drop any job postings whose 'posted' field indicates they appeared on Indeed more than 30 days prior.

The rationale for this design decision is that postings indicating they were posted '30+ days ago' could, in theory, be very old.

Furthermore, it is not practical to design a posting expiration based upon a 'Post Date,' which could theoretically be deduced from the 'Parse Date' and the 'Posted' field, for two reasons. First, 'Sponsored' posts do not have a post date assigned. Second, the 'Post Date' may be somewhat arbitrary. This is because it may be possible for companies to "spam" job posts in an effort to make their posting more appealing to applicants, and also to appear higher in search rankings. This behavior is likely to vary from company to company.

With the filter coded below, a given job posting can be parsed, assigned a unique weekly hash, and included in the master data file as many as five times (over the course of five weeks). It will not be included in master a sixth time, because by that point the posting must necessarily have been posted '30+ days ago'; the sketch below illustrates this.
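The five-week limit can be illustrated with a small sketch (the 2018-04-01 post date and the weekly parse cadence are hypothetical, not values from the data):

# Illustration: a posting first seen on 2018-04-01 and re-parsed weekly
# receives at most five distinct year-week stamps before its 'posted'
# field reads (roughly) '30+ days ago' and the filter below drops it.
post_date = pd.Timestamp('2018-04-01')
for weeks_later in range(6):
    parse_date = post_date + pd.Timedelta(weeks=weeks_later)
    age_days = (parse_date - post_date).days
    stamp = parse_date.strftime('%Y%U')   # year + week number, as used for the hash later
    status = 'kept' if age_days <= 30 else "dropped ('30+ days ago')"
    print(parse_date.date(), stamp, status)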

In [3]:
raw = raw[~(raw['posted'] == '30+ days ago')]

Assess Raw File Duplication

The data is heavily duplicated. Because of this, I will drop the duplicate rows at the outset, before proceeding with the standard assessment and cleaning steps.
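As a quick illustration of the extent of the duplication (a sketch, not one of the original cells), the share of rows that exactly duplicate an earlier row can be computed before any hash is defined:

# Fraction of rows that are exact duplicates across all scraped columns.
raw.duplicated().mean()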

In [4]:
raw.shape
Out[4]:
(354, 7)

Create a "Hash" column, to define row uniqueness.

I define duplicate rows involve the same company trying to fill the same job in the same place.

In [5]:
raw['parse_date'] = pd.to_datetime(raw['parse_date'])
In [6]:
raw['hash'] = raw['title'] + raw['company'] + raw['location'] + raw['parse_date'].dt.strftime('%Y%U')
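For reference, strftime('%Y%U') yields the four-digit year followed by the zero-padded week number (with weeks starting on Sunday). For example:

# 2018-05-02 (a parse date appearing below) falls in week 17 of 2018,
# giving the '201817' suffix visible in the hashes that follow.
pd.Timestamp('2018-05-02').strftime('%Y%U')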
In [7]:
print('       Sponsored Jobs: ' + str(len(raw[raw.job_type == 'Sponsored'])))
print('Unique Sponsored Jobs: ' + str(raw[raw.job_type == 'Sponsored'].hash.nunique()) + '\n')
print('         Organic Jobs: ' + str(len(raw[raw.job_type == 'Organic'])))
print('  Unique Organic Jobs: ' + str(raw[raw.job_type == 'Organic'].hash.nunique()))
       Sponsored Jobs: 137
Unique Sponsored Jobs: 28

         Organic Jobs: 217
  Unique Organic Jobs: 186

The same set of "Sponsored" postings appears on every page of results, so a very high number of duplicates is expected for these job listings.

The set of "Sponsored" postings appears to change every few hours.

In [8]:
raw[raw.job_type == 'Sponsored'].hash.value_counts()[0:5]
Out[8]:
Senior Systems Analyst/ProgrammerAlliance Member ServicesSanta Cruz, CA 95060201817    8
DB EngineerNHN GlobalLos Angeles, CA 90010201817                                       8
SQL DeveloperEscape TechnologyRoseville, CA 95661201817                                8
AI / Machine Learning ScientistLG ElectronicsSanta Clara, CA201817                     7
Natural Language Processing (NLP) ScientistAstound.aiMenlo Park, CA201817              7
Name: hash, dtype: int64

It seems that only a small subset of "Organic" postings is duplicated in the results, and only a few times each.

In [9]:
raw[raw.job_type == 'Organic'].hash.value_counts()[0:5]
Out[9]:
Data ScienceDOYENSYS INC.Cupertino, CA201817                                               3
Data Analyst LeadTekBusinessLos Angeles, CA201817                                          3
Data Analyst - LeadWittaff Inc.Los Angeles, CA201817                                       3
Internship: Developing Artificial and Human Intelligence by...SiemensBerkeley, CA201817    3
Senior SEO ManagerYummlyRedwood City, CA 94063201817                                       2
Name: hash, dtype: int64

Clean Raw File Duplicates

In [10]:
dedup = raw.copy()

Define

dedup contains rows with duplicate hash strings; drop all but the first occurrence of each hash.

Code

In [11]:
# Keep only the first occurrence of each hash (drop_duplicates keeps 'first' by default)
dedup = dedup.drop_duplicates(subset='hash')

Test

In [12]:
dedup.shape
Out[12]:
(213, 8)
In [13]:
print('       Sponsored Jobs: ' + str(len(dedup[dedup.job_type == 'Sponsored'])))
print('Unique Sponsored Jobs: ' + str(dedup[dedup.job_type == 'Sponsored'].hash.nunique()) + '\n')
print('         Organic Jobs: ' + str(len(dedup[dedup.job_type == 'Organic'])))
print('  Unique Organic Jobs: ' + str(dedup[dedup.job_type == 'Organic'].hash.nunique()))
       Sponsored Jobs: 28
Unique Sponsored Jobs: 28

         Organic Jobs: 185
  Unique Organic Jobs: 185
In [14]:
dedup[dedup.job_type == 'Sponsored'].hash.value_counts()[0:3]
Out[14]:
Data SpecialistExodus Recovery Inc.Los Angeles, CA201817                                     1
SQL DeveloperEscape TechnologyRoseville, CA 95661201817                                      1
Clinical Informatics Data AnalyticsNorthridge Hospital Medical CenterNorthridge, CA201817    1
Name: hash, dtype: int64
In [15]:
dedup[dedup.job_type == 'Organic'].hash.value_counts()[0:3]
Out[15]:
Director, Data Science & Machine LearningAscendo ResourcesSan Jose, CA201817    1
DevOps EngineerCruise AutomationSan Francisco, CA201817                         1
Data ScientistTonalSan Francisco, CA201817                                      1
Name: hash, dtype: int64

Open Master File

In [16]:
# On the first run, master.tsv does not yet exist: seed it with the deduplicated
# jobs. 'master' is deliberately left undefined in that case, so the next cell
# fails normally. On later runs, load the existing master file.
if not os.path.isfile('master.tsv'):
    dedup.to_csv('master.tsv', sep='\t')
    print('master.tsv did not exist and has been created.')
else:
    master = pd.read_csv('master.tsv', sep='\t', index_col=0)
    print('master loaded from master.tsv.')
master loaded from master.tsv.

Append New Jobs to Master

In [17]:
try:
    original_len = len(master)
except NameError:
    # First run: master.tsv was just created on disk, but 'master' was never loaded.
    print('Failing normally. master newly created.')
    raise

dedup_len = len(dedup)

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
master = pd.concat([master, dedup])
In [18]:
master = master.drop_duplicates(subset='hash')

master = master.reset_index(drop=True)

final_len = len(master)
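As an optional sanity check (a sketch, not one of the original cells), the hash column should now contain no duplicates:

# Every hash should appear exactly once after the drop_duplicates call above.
assert master['hash'].is_unique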
In [19]:
master.tail(10)
Out[19]:
i title company location posted job_type parse_date hash
22091 166 Mid Market Account Executive Nauto Palo Alto, CA Just posted Organic 2018-05-02 00:00:00 Mid Market Account ExecutiveNautoPalo Alto, CA...
22092 167 Senior Backend QA Engineer Banjo Redwood City, CA Just posted Organic 2018-05-02 00:00:00 Senior Backend QA EngineerBanjoRedwood City, C...
22093 168 Technical Solutions Consultant, Android Modem/... Google Mountain View, CA Just posted Organic 2018-05-02 00:00:00 Technical Solutions Consultant, Android Modem/...
22094 169 Senior Platform Engineer Banjo Redwood City, CA Just posted Organic 2018-05-02 00:00:00 Senior Platform EngineerBanjoRedwood City, CA2...
22095 170 Sr. Vehicle Systems Validation Engineer – IAI ... Faraday Future, Inc. Gardena, CA Just posted Organic 2018-05-02 00:00:00 Sr. Vehicle Systems Validation Engineer – IAI ...
22096 176 Systems Development - Assistant / Associate Pr... San Diego State University San Diego, CA Today Organic 2018-05-02 00:00:00 Systems Development - Assistant / Associate Pr...
22097 177 RPA COE Operational Manager - Business Systems... Wells Fargo San Francisco, CA Today Organic 2018-05-02 00:00:00 RPA COE Operational Manager - Business Systems...
22098 178 Senior Network Architect Instart Logic Palo Alto, CA 94306 (Barron Park area) Today Organic 2018-05-02 00:00:00 Senior Network ArchitectInstart LogicPalo Alto...
22099 180 Customer Success Mgr, Tech Recruiting Stella.ai Redwood City, CA Today Organic 2018-05-02 00:00:00 Customer Success Mgr, Tech RecruitingStella.ai...
22100 183 Machine Learning Engineer PicnicHealth San Francisco, CA Today Organic 2018-05-02 00:00:00 Machine Learning EngineerPicnicHealthSan Franc...
In [20]:
print('{} \t- jobs newly parsed (loaded from "jobs_raw.tsv")'.format(len(raw)))
print('{} \t- of these are unique (contained in "dedup" dataframe)'.format(dedup_len))

print('\r')

print('{} \t- jobs loaded from "master.tsv".'.format(original_len))
print('{} \t- newly-parsed jobs were duplicates of jobs already in "master" dataframe.'.format(dedup_len - (final_len - original_len)))

print('\r')

print('{} \t- newly-parsed jobs added to "master" dataframe.'.format(final_len - original_len))
print('{} \t- unique jobs saved to "master".'.format(final_len))
354 	- jobs newly parsed (loaded from "jobs_raw.tsv")
213 	- of these are unique (contained in "dedup" dataframe)

22000 	- jobs loaded from "master.tsv".
112 	- newly-parsed jobs were duplicates of jobs already in "master" dataframe.

101 	- newly-parsed jobs added to "master" dataframe.
22101 	- unique jobs saved to "master".
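The arithmetic behind these counts can be sketched as follows (variable names as defined in the cells above; the figures in the comments are from this run):

# 213 unique newly-parsed jobs minus the 101 rows actually added leaves
# 112 that were already present in master; 22000 + 101 == 22101.
new_rows = final_len - original_len            # 22101 - 22000 = 101
already_in_master = dedup_len - new_rows       # 213 - 101 = 112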

Write Master Back to File

In [21]:
master.to_csv('master.tsv', sep='\t')