Overview

The objective of this notebook is to scrape the jobs website Indeed.com for data science jobs located in California. For each job posting the program parses, it gathers six key pieces of information:

  • Job Title
  • Employer
  • Job Location (city, zip code, etc.)
  • Posted Field (time since posted to Indeed)
  • Job Type (organic or sponsored)
  • Scrape Date

The ultimate output of this exercise will be at least one CSV file that will be visualized using Tableau. Those visualizations will be submitted as the final project (entitled "Create a Tableau Story") for the Data Analyst nanodegree through Udacity.
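For illustration, a single (hypothetical) row of the raw output file would look like this, with an index column followed by the six fields above, separated by tabs:

0	Data Analyst	Acme Corp	Sacramento, CA	3 days ago	Organic	2018-06-01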

Questions

  1. In which regions of California are most data analytics jobs located?
  2. In which California counties are most data analytics jobs located?
  3. Within those counties, what companies are the largest employers of data professionals?
  4. How do more cutting-edge analytics jobs (ML and AI) compare to more traditional roles with regard to geographic distribution?

Program Structure

File Structure

  • jobs_raw.tsv File: Contains parsed tab-separated values scraped from Indeed job postings.

Program Flow

  • Set first page.
  • While there is an unparsed page of jobs:
    • While there are more jobs on the page:
      • Parse each job into a string of job fields.
    • Write the jobs to the jobs_raw.tsv file.

Functions

  • Open URL: Opens a URL using the Requests library. Returns an HTML string if the request succeeds.
  • Parse Jobs: Takes the HTML of a single page of job postings and parses the key job information outlined in the Overview, above. Returns a list of job strings.
  • Write Jobs: Writes the current job list to the jobs_raw.tsv file.
  • Next Page: Navigates to the next page of job postings. Returns an HTML string if there is another page, or None if there is not.

Preliminaries

In [1]:
from bs4 import BeautifulSoup
import requests
import pprint as pp
from datetime import date
import os.path
import time

Function Definitions

Open URL

In [2]:
def open_url(url):
    '''
    Obtains the HTML of the given URL using the Requests library.
    Returns an HTML string, or None if the request fails.
    '''
    try:
        response = requests.get(url)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
    except requests.RequestException as e:
        print('Unable to open provided url. Error: {}'.format(e))
        return None

    return response.text
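If Indeed throttles plain requests, the call inside open_url can be hardened with a browser-style User-Agent header and a timeout. Both are standard requests.get parameters, but this variant and its header value are illustrative assumptions, not part of the original run:

# Hypothetical hardened variant of open_url; the User-Agent string and
# 10-second timeout are illustrative choices.
def open_url_hardened(url):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print('Unable to open {}: {}'.format(url, e))
        return None
    return response.text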

Parse Jobs

In [3]:
def parse_jobs(html, print_jobs=False):
    '''
    Parses HTML for job information using the BeautifulSoup library.
    Returns a list of tab-joined job strings.
    '''
    soup = BeautifulSoup(html, 'html.parser')

    job_list = []

    # Finds all "Organic Jobs"
    organic_jobs = soup.find_all('div', {'data-tn-component': 'organicJob'})
    for job in organic_jobs:
        try:
            title = job.h2.text.strip()
        except AttributeError:
            print(' --- Null Title Parsed --- ')
            title = ''

        try:
            company = job.find('span', class_='company').text.strip()
        except AttributeError:
            print(' --- Null Company Parsed --- ')
            company = ''

        try:
            location = job.find('span', class_='location').text.strip()
        except AttributeError:
            print(' --- Null Location Parsed --- ')
            location = ''

        try:
            posted = job.find('span', class_='date').text.strip()
        except AttributeError:
            print(' --- Null Posted Field Parsed --- ')
            posted = ''

        job_type = 'Organic'
        parse_date = str(date.today())

        job_list.append('\t'.join([title, company, location, posted, job_type, parse_date]))

    # Finds all "Sponsored Jobs"
    sponsored_jobs = soup.find_all('a', class_='jobtitle turnstileLink')
    for job in sponsored_jobs:
        # The anchor is the job title; its parent holds the other fields.
        job = job.parent

        try:
            title = job.a.text.strip()
        except AttributeError:
            print(' --- Null Title Parsed --- ')
            title = ''

        try:
            company = job.find('span', class_='company').text.strip()
        except AttributeError:
            print(' --- Null Company Parsed --- ')
            company = ''

        try:
            location = job.find('span', class_='location').text.strip()
        except AttributeError:
            print(' --- Null Location Parsed --- ')
            location = ''

        # Sponsored postings carry no "posted" date on the results page.
        posted = ''

        job_type = 'Sponsored'
        parse_date = str(date.today())

        job_list.append('\t'.join([title, company, location, posted, job_type, parse_date]))

    return job_list
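A quick way to sanity-check parse_jobs is to run it against a small synthetic snippet that mimics the markup the selectors above expect. The snippet below is invented for illustration; real Indeed markup is richer and changes over time:

sample_html = '''
<div data-tn-component="organicJob">
  <h2><a>Data Analyst</a></h2>
  <span class="company">Acme Corp</span>
  <span class="location">Sacramento, CA</span>
  <span class="date">3 days ago</span>
</div>
'''
print(parse_jobs(sample_html))
# ['Data Analyst\tAcme Corp\tSacramento, CA\t3 days ago\tOrganic\t<today>']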

Write Jobs

In [4]:
def write_jobs(job_list, i):
    '''
    Appends the jobs data to jobs_raw.tsv, one indexed row per job.
    Returns the next unused index.
    '''
    with open('jobs_raw.tsv', 'a') as f:
        for job in job_list:
            f.write(str(i) + '\t' + job + '\n')
            i += 1
    return i
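One caveat with hand-rolled TSV output: a stray tab or newline inside a scraped field would corrupt the row. If that becomes a problem, Python's csv module can handle the quoting. A minimal sketch of a drop-in alternative (not the version used for the run below):

import csv

def write_jobs_csv(job_list, i, path='jobs_raw.tsv'):
    # Each job string from parse_jobs is tab-joined, so split it back
    # into fields and let csv.writer quote any awkward characters.
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
        for job in job_list:
            writer.writerow([i] + job.split('\t'))
            i += 1
    return i

A cleaner long-term fix would be for parse_jobs to return lists of fields rather than pre-joined strings, so embedded tabs never get a chance to mis-split a row.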

Next Page

In [5]:
def next_page(html):
    '''
    Follows the "Next" link in the given page of HTML.
    Returns the next page of HTML if it can be opened, or None if not.
    '''
    soup = BeautifulSoup(html, 'html.parser')
    try:
        # The last 'np' span on the page is the "Next" control; its
        # grandparent anchor holds the relative link.
        span = soup.find_all('span', class_='np')
        link = span[-1].parent.parent.get('href')
        url = 'https://www.indeed.com' + link
    except (IndexError, AttributeError, TypeError):
        print('Unable to get next page of postings. Can indicate normal exit or error.')
        return None

    html = open_url(url)

    if html is None:
        print('Unable to retrieve next page.')
        return None
    return html
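If the "Next" link ever disappears from the markup, result pages could also be addressed directly: at the time this notebook was written, Indeed paginated results with a start offset in the query string, in steps of 10. A hedged sketch assuming that parameter still behaves this way:

def page_url(base_url, page_index):
    # Assumes Indeed's 'start' parameter offsets results by 10 per page
    # and that base_url carries no existing 'start' parameter.
    return '{}&start={}'.format(base_url, page_index * 10)

# e.g. page_url('https://www.indeed.com/jobs?q=data&l=CA', 2)
#  -> 'https://www.indeed.com/jobs?q=data&l=CA&start=20'

This bypasses link-following entirely, which also makes it easy to resume a crawl at an arbitrary page.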

Run It

In [6]:
if not os.path.isfile('jobs_raw.tsv'):
    i = 0
    with open('jobs_raw.tsv', 'a') as f:
        f.write('\t'.join(['i', 'title', 'company', 'location', 'posted', 'job_type', 'parse_date']) + '\n')
    print("jobs_raw.tsv did not exist and has been created. Index set to 0.")
else:
    i = 0
    with open('jobs_raw.tsv', 'r') as f:
        next(f)  # skip the header row
        for line in f:
            i = int(line.split('\t')[0]) + 1  # resume one past the last written index
    print("jobs_raw.tsv already exists. Index {} loaded.".format(i))
jobs_raw.tsv did not exist and has been created. Index set to 0.
In [7]:
search_urls = [
    # Sorted by date (newest first):
    'https://www.indeed.com/jobs?q=machine+learning&l=CA&sort=date',
    'https://www.indeed.com/jobs?q=data&l=CA&sort=date',
    'https://www.indeed.com/jobs?q=sql&l=CA&sort=date',
    'https://www.indeed.com/jobs?q=artificial+intelligence&l=CA&sort=date',

    # Sorted by relevance (Indeed's default, no sort parameter):
    'https://www.indeed.com/jobs?q=machine+learning&l=CA',
    'https://www.indeed.com/jobs?q=data&l=CA',
    'https://www.indeed.com/jobs?q=sql&l=CA',
    'https://www.indeed.com/jobs?q=artificial+intelligence&l=CA'
]

# Parses roughly 40000 jobs/hour.
# Actual time to parse 20000*4 + 60000*4 = 320000 jobs: 6.5 hours.
# Actual time to parse 25000*4 + 75000*4 = 400000 jobs: 9 hours.
dated_increment = 50      # demo value; full runs used 20000 (30 mins; 2 hours for all 4 searches)
relevance_increment = 50  # demo value; full runs used 60000 (90 mins; 6 hours for all 4 searches)
increment = dated_increment
threshold = increment
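# The loop below scrapes each URL until the running job count passes
# 'threshold', then raises the threshold by 'increment' and moves on to
# the next URL; halfway through the URL list (after the four date-sorted
# searches) the increment switches to the relevance value.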

print_threshold = 25  # demo value; full runs used 1000
print_threshold_increment = print_threshold

time_list = []

for url in search_urls:
    html = open_url(url)
    print(' ------- Searching ' + url + ' ------- ')
    
    while html is not None:
        try:
            job_list = parse_jobs(html)
            
            if i > print_threshold:
                print_threshold += print_threshold_increment
                print(str(i) + ' jobs parsed')
        except:
            print('Error parsing near index {}. Exiting.'.format(i))
            break

        i = write_jobs(job_list=job_list, i=i)

        time.sleep(.5)
        
        if i > threshold:
            # Switch to the relevance increment once the four date-sorted
            # searches (half the URL list) have been exhausted.
            if threshold == increment * len(search_urls) / 2:
                increment = relevance_increment
            threshold += increment
            
            time_list.append(time.strftime('%X'))
            # Get next URL
            break

        html = next_page(html)

print('\n*** Completed URL search routine. ***')
print(time_list)
 ------- Searching https://www.indeed.com/jobs?q=machine+learning&l=CA&sort=date ------- 
27 jobs parsed
 ------- Searching https://www.indeed.com/jobs?q=data&l=CA&sort=date ------- 
57 jobs parsed
88 jobs parsed
 ------- Searching https://www.indeed.com/jobs?q=sql&l=CA&sort=date ------- 
104 jobs parsed
132 jobs parsed
 ------- Searching https://www.indeed.com/jobs?q=artificial+intelligence&l=CA&sort=date ------- 
162 jobs parsed
190 jobs parsed
 ------- Searching https://www.indeed.com/jobs?q=machine+learning&l=CA ------- 
205 jobs parsed
237 jobs parsed
 ------- Searching https://www.indeed.com/jobs?q=data&l=CA ------- 
253 jobs parsed
285 jobs parsed
 ------- Searching https://www.indeed.com/jobs?q=sql&l=CA ------- 
301 jobs parsed
331 jobs parsed
 ------- Searching https://www.indeed.com/jobs?q=artificial+intelligence&l=CA ------- 
361 jobs parsed
377 jobs parsed

*** Completed URL search routine. ***
['13:06:29', '13:06:34', '13:06:38', '13:06:42', '13:06:45', '13:06:50', '13:06:54', '13:06:58']