Note: The visualization is best viewed on a screen with a resolution greater than 1366x768, which rules out most mobile devices. Screenshots of the various dashboards are shown below, for the data as it existed on May 3, 2018. Other detailed files are available below.
This project is an exercise in data scraping, cleaning, and visualization that answers the question of where most analytics jobs are based in California. The data are 5,500+ analytics job postings scraped from Indeed.com in April and May of 2018. Tools used include Python, Pandas, Tableau, regular expressions, and BeautifulSoup.
Despite containing only 8 million of California’s 39.5 million people (20%), the Bay Area is home to roughly 60% of the open analytics positions posted to Indeed.com in April 2018. The relative proportion of analytics jobs in the Bay Area increases as higher-tech positions are selected using the checkboxes. As an example, of the roughly 1,200 open AI, ML, deep learning, and data science job postings parsed, nearly 960 were in the Bay Area (roughly 80%).
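The percentages above can be checked with a few lines of Python. This is a minimal sketch that uses only the rounded figures quoted in the text; the variable names and the over-representation ratio (posting share divided by population share) are illustrative additions, not part of the original analysis.

```python
# Rounded figures as quoted in the text above.
bay_area_population_m = 8.0       # millions of people
california_population_m = 39.5    # millions of people
high_tech_postings_total = 1200   # AI, ML, deep learning, data science
high_tech_postings_bay_area = 960


def share(part, whole):
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


population_share = share(bay_area_population_m, california_population_m)
high_tech_share = share(high_tech_postings_bay_area, high_tech_postings_total)

# Illustrative ratio: how over-represented the Bay Area is among these
# postings relative to its share of the state's population.
overrepresentation = high_tech_share / population_share

print(f"Population share:        {population_share:.1f}%")
print(f"High-tech posting share: {high_tech_share:.1f}%")
print(f"Over-representation:     {overrepresentation:.2f}x")
```

By these figures, the Bay Area's share of high-tech analytics postings is roughly four times its share of the state's population.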
Within the Bay Area, job postings are concentrated in a few counties. San Francisco and Santa Clara counties accounted for nearly 75% of open positions in the Bay Area. Santa Clara County alone accounted for roughly 43% of total analytics job openings, though it had fewer open positions per capita than San Francisco. AI, ML, deep learning, and data science positions are among the most concentrated subsets of these analytics jobs: nearly 94% are in San Francisco and Santa Clara counties.
Examining the distribution of open job postings at the level of individual cities reveals the results one might expect. Analytics jobs appear to be located in the cities between San Jose and San Francisco, on the south side of the Bay. The high-tech subset of the analytics jobs discussed previously (AI, ML, deep learning, and data science) is almost entirely on the south side of the Bay.
Analytics Lessons Learned
- Data cleaning was the biggest time sink, as usual. Total time required to complete the project was approximately 65 hours, of which 40+ were spent cleaning and aggregating scraped data. Roughly 10 hours were required to set up the web scraper, and the remaining 15 were spent iterating on the Tableau visualizations.
- The ease of web scraping is proportional to the degree to which the target websites are well-structured and predictable. Indeed.com was fairly easy to scrape. Scraping is a potent source of raw data: even with half-second pauses between page reloads, overnight scraping for this project consistently produced 40+ MB.
- Indeed.com has massive duplication of job postings. Overnight scraping might yield data from more than 200,000 job postings, but only a few thousand of these generally end up being unique. Determining how best to define and enforce job posting uniqueness was a major design decision that required careful consideration.
- Tableau is easy to use, ridiculously powerful, produces beautiful visualizations, and the base tier is free. Enough said.
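
To make the deduplication lesson concrete, here is a minimal sketch of one way to define and enforce posting uniqueness. It assumes that a normalized (title, company, location) triple identifies a posting; that key, the field names, and the sample rows are hypothetical and are not the rule or data actually used in this project.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def posting_key(posting):
    # Assumption: title + company + location is enough to identify a posting.
    return tuple(normalize(posting[field])
                 for field in ("title", "company", "location"))


def deduplicate(postings):
    """Keep the first posting seen for each normalized key."""
    seen = {}
    for posting in postings:
        seen.setdefault(posting_key(posting), posting)
    return list(seen.values())


# Made-up rows for illustration only; not data from the project.
sample = [
    {"title": "Data Scientist", "company": "Acme Corp.", "location": "San Jose, CA"},
    {"title": "data scientist ", "company": "ACME Corp", "location": "San Jose,  CA"},
    {"title": "Data Analyst", "company": "Acme Corp.", "location": "San Jose, CA"},
]

unique = deduplicate(sample)
print(len(sample), "scraped ->", len(unique), "unique")
```

The choice of key is the design decision: a looser key (title and company only) merges more postings but risks collapsing genuinely distinct openings, while a stricter key (adding the full description text) keeps near-duplicates apart. With Pandas, the same normalized columns could be passed to `DataFrame.drop_duplicates(subset=[...])`.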