Scraping Data with Beautiful Soup and Requests
import requests
from bs4 import BeautifulSoup
import time
from os import walk
import pandas as pd
In this example, I found a list of links to web pages with well-structured, consistent data in a tabular format: link.
To begin, I saved the HTML containing the list of links by right-clicking the page in Chrome, selecting View Source, and saving the result to a local file. The links in that file are the web-based dataset used as the source for the work that follows.
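As an alternative to saving the page manually, the same HTML could have been fetched with requests. The sketch below is only illustrative; LINK_LIST_URL is a placeholder for the page linked above, not the actual address.

# Sketch only: LINK_LIST_URL stands in for the page of links referenced above.
LINK_LIST_URL = 'https://example.com/page-with-links'

r = requests.get(LINK_LIST_URL)
r.raise_for_status()
with open('beautiful-soup/data/tiger-web-link-list.html', 'w') as file:
    file.write(r.text)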
Build Dictionary with Several Links to Visit
with open('beautiful-soup/data/tiger-web-link-list.html') as file:
    link_list_html = file.read()

soup = BeautifulSoup(link_list_html,
                     "html.parser")
links = soup.find_all('a')

link_dict = {}
for link in links:
    url = link.get('href')
    if url is not None and 'Files/bas20' in url:
        url_prefix = 'https://tigerweb.geo.census.gov/tigerwebmain/'
        link_dict[link.text] = url_prefix + url
{k: link_dict[k] for k in list(link_dict)[:1]}
{'Alabama': 'https://tigerweb.geo.census.gov/tigerwebmain/Files/bas20/tigerweb_bas20_tract_al.html'}
The result, the first entry of which is shown above, is a dictionary whose keys are the link text and whose values are the URLs themselves.
Iteratively Visit Those Links to Download HTML to Local File
Next, I use requests to download the HTML from all 56 URLs included in that dictionary.
Since working with the HTML will require many iterations to refine the parsing code, it is best to download to local files to minimize the number of calls made to the server. This is also the reason for the download_new_data variable: it functions as a switch for use after all the HTML has been downloaded, so the entire notebook can be re-run without initiating 56 web requests every time.
Because this will require repeated calls to the same server, I wait half a second between each request.
download_new_data = False

if download_new_data == True:
    for state in link_dict.keys():
        url = link_dict[state]
        r = requests.get(url)
        if r.status_code == 200:
            html = r.text
            if len(html) > 0:
                filename = '-'.join(state.split()).lower().replace('.','') + '.html'
                filepath = 'beautiful-soup/data/html/' + filename
                with open(filepath,'w') as file:
                    file.write(html)
            else:
                print('No HTML at url for: {}'.format(state))
        else:
            print('Problem accessing url for: {}'.format(state))
        time.sleep(0.5)
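If the server ever throttles or drops connections partway through the 56 requests, one optional hardening step (not part of the run above, just an assumption about how it could be wired up) is to reuse a single requests.Session with automatic retries:

# Optional sketch: retry transient failures with backoff on a shared session.
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503])
session.mount('https://', HTTPAdapter(max_retries=retries))
# Inside the loop above, session.get(url) would then replace requests.get(url).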
Parse the Data from the HTML and Save to CSV
The data parsed from each downloaded page is saved to its own CSV file, 56 in total.
parse_new_data = True

if parse_new_data == True:
    date_dir_list = []
    for _, _, filenames in walk('./beautiful-soup/data/html/'):
        for filename in filenames:
            with open('beautiful-soup/data/html/' + filename) as file:
                soup = BeautifulSoup(file,
                                     "html.parser")
            rows = soup.find_all('tr')
            header_list = []
            for col in rows[0]:
                if col != '\n':
                    header_list.append(col.text.strip().lower())
            row_list = []
            for row in rows[1:]:
                col_list = []
                for col in row:
                    if col != '\n':
                        col_list.append(col.text.strip())
                row_list.append(col_list)
            df = pd.DataFrame(row_list,
                              columns=header_list)
            df['filename'] = filename.replace('.html','')
            df.to_csv('beautiful-soup/data/csv/' + filename.replace('html','csv'),
                      index=False)
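Since each saved page is essentially a single HTML table, pandas.read_html would be a shorter route to the same DataFrame. The sketch below is an alternative, not the approach used above, and it assumes the table structure is as consistent as it appears.

# Alternative sketch (assumes each saved page contains one clean <table>;
# pd.read_html also needs lxml or html5lib installed).
from io import StringIO

for _, _, filenames in walk('./beautiful-soup/data/html/'):
    for filename in filenames:
        with open('beautiful-soup/data/html/' + filename) as file:
            tables = pd.read_html(StringIO(file.read()))  # one DataFrame per <table>
        df = tables[0]
        df.columns = [str(col).strip().lower() for col in df.columns]
        df['filename'] = filename.replace('.html', '')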
Open the CSVs and Consolidate
The CSVs are read back in, combined into a single DataFrame, and reduced to only the necessary columns.
c = pd.DataFrame()
for _, _, filenames in walk('beautiful-soup/data/csv/'):
    for filename in filenames:
        new_df = pd.read_csv('beautiful-soup/data/csv/' + filename)
        # DataFrame.append was removed in pandas 2.x; concat does the same job here.
        c = pd.concat([c, new_df])
c = c.reset_index(drop=True)

c['fips'] = c['state'].astype(str).str.zfill(2) + c['county'].astype(str).str.zfill(3)
c = (c
     .drop(columns=['state','county'])
     .rename(columns={'filename':'state'})
     .sort_values('fips')
     .reset_index(drop=True)
     [['fips','state','arealand']])
c['state'] = (c['state']
              .str.title()
              .str.replace('-',' ',regex=False))
print('Unique Census Tracts: ' + str(c.shape[0]))
c.head()
Unique Census Tracts: 85368
|   | fips  | state   | arealand    |
|---|-------|---------|-------------|
| 0 | 01001 | Alabama | 9825304.0   |
| 1 | 01001 | Alabama | 478151753.0 |
| 2 | 01001 | Alabama | 386865281.0 |
| 3 | 01001 | Alabama | 187504713.0 |
| 4 | 01001 | Alabama | 105252227.0 |
Calculate the Sum of the Land Area for each Unique FIPS (County)
Note: land areas are given in square meters (Source); they must be converted to square miles.
$$1\ mile = 5280\ \frac{foot}{mile} \times 12\ \frac{inch}{foot} \times 2.54\ \frac{cm}{inch} \times \frac{1}{100} \frac{m}{cm} = 1609.344\ meters$$
$$1\ mile^2 = 2,589,988.1\ meters ^2$$
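The conversion factor used in the code below comes directly from that arithmetic; a quick check in Python:

# Sanity check of the square-mile conversion factor derived above.
meters_per_mile = 5280 * 12 * 2.54 / 100        # 1609.344
sq_meters_per_sq_mile = meters_per_mile ** 2    # 2589988.110336, rounded to 2589988.1 below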
c = (c
     .groupby(['fips','state'])
     .sum()
     .reset_index())
print('Unique Counties: ' + str(c.shape[0]))
c['arealand'] = c['arealand']/2589988.1
c.head()
Unique Counties: 3234
|   | fips  | state   | arealand    |
|---|-------|---------|-------------|
| 0 | 01001 | Alabama | 594.443718  |
| 1 | 01003 | Alabama | 1589.824163 |
| 2 | 01005 | Alabama | 885.007981  |
| 3 | 01007 | Alabama | 622.458312  |
| 4 | 01009 | Alabama | 644.839979  |
Write Combined Dataset to File
c.to_csv('beautiful-soup/data/processed/county_land_area.csv',
         index=False)
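One note for whenever this file is read back: pandas will parse the zero-padded FIPS codes as integers by default, so the column should be forced to string on load. A short example, assuming pandas is used for the re-read:

# Keep the leading zeros in the FIPS codes (e.g. '01001') when re-loading.
county_land_area = pd.read_csv('beautiful-soup/data/processed/county_land_area.csv',
                               dtype={'fips': str})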