Scraping Data with Beautiful Soup and Requests

import requests
from bs4 import BeautifulSoup
import time
from os import walk
import pandas as pd

In this example, I found a list of links to web pages with well-structured, consistent data in a tabular format: link.

To begin, I saved the HTML containing the list of links by right-clicking the page in Chrome, selecting View Source, and saving the result to a local file. The links in that file point to the web pages that serve as the data source for everything that follows.
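
Alternatively, the page of links could be fetched programmatically with requests (already imported above) rather than saved by hand. The sketch below assumes a placeholder LIST_PAGE_URL, since the actual URL of the list page is not reproduced here:

# Placeholder for the URL of the page that lists the per-state links
LIST_PAGE_URL = 'https://example.com/link-list'

r = requests.get(LIST_PAGE_URL)
r.raise_for_status()  # raise an error if the request failed

# Save the HTML locally so later steps can re-read it without hitting the server
with open('beautiful-soup/data/tiger-web-link-list.html', 'w') as file:
    file.write(r.text)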


# Read the saved link-list page and parse it
with open('beautiful-soup/data/tiger-web-link-list.html') as file:
    link_list_html = file.read()
soup = BeautifulSoup(link_list_html,
                     "html.parser")
# Collect every anchor tag on the page
links = soup.find_all('a')

# Keep only the links that point to the per-state data pages
url_prefix = 'https://tigerweb.geo.census.gov/tigerwebmain/'
link_dict = {}
for link in links:
    url = link.get('href')
    if url is not None and 'Files/bas20' in url:
        link_dict[link.text] = url_prefix + url

# Display the first entry
{k: link_dict[k] for k in list(link_dict)[:1]}
{'Alabama': 'https://tigerweb.geo.census.gov/tigerwebmain/Files/bas20/tigerweb_bas20_tract_al.html'}

The result, the first entry of which is shown above, is a dictionary whose keys are the link text and whose values are the corresponding URLs.


Next, I use requests to download the HTML of all 56 URLs in that dictionary.

Since working with the HTML will require many iterations to refine the parsing code, it is best to download it to local files to minimize the number of calls made to the server. This is also the reason for the download_new_data variable: it functions as a switch for use after all the HTML has been downloaded, so the entire notebook can be re-run without initiating 56 web requests every time.

Because this will require repeated calls to the same server, I wait half a second between each request.

download_new_data = False
if download_new_data:
    for state in link_dict.keys():
        url = link_dict[state]
        r = requests.get(url)

        if r.status_code == 200:
            html = r.text
            if len(html) > 0:
                # Build a filename like 'new-york.html' from the state name
                filename = '-'.join(state.split()).lower().replace('.','') + '.html'
                filepath = 'beautiful-soup/data/html/' + filename
                with open(filepath,'w') as file:
                    file.write(html)

            else:
                print('No HTML at url for: {}'.format(state))
        else:
            print('Problem accessing url for: {}'.format(state))

        # Be polite to the server: pause between requests
        time.sleep(0.5)

Parse the Data from the HTML and Save to CSV

The table in each downloaded HTML file is parsed and written to its own CSV file, producing 56 separate CSV files.

parse_new_data = True
if parse_new_data:
    for _, _, filenames in walk('./beautiful-soup/data/html/'):
        for filename in filenames:
            with open('beautiful-soup/data/html/' + filename) as file:
                soup = BeautifulSoup(file,
                                     "html.parser")
            rows = soup.find_all('tr')

            # The first row holds the column headers
            header_list = []
            for col in rows[0]:
                if col != '\n':
                    header_list.append(col.text.strip().lower())

            # The remaining rows hold the data
            row_list = []
            for row in rows[1:]:
                col_list = []
                for col in row:
                    if col != '\n':
                        col_list.append(col.text.strip())
                row_list.append(col_list)

            df = pd.DataFrame(row_list,
                              columns=header_list)
            df['filename'] = filename.replace('.html','')
            df.to_csv('beautiful-soup/data/csv/' + filename.replace('.html','.csv'),
                      index=False)

Open the CSVs and Consolidate

Extract only the necessary columns.

# Combine the 56 state CSVs into a single DataFrame
# (DataFrame.append is deprecated, so the frames are collected and concatenated)
df_list = []
for _, _, filenames in walk('beautiful-soup/data/csv/'):
    for filename in filenames:
        df_list.append(pd.read_csv('beautiful-soup/data/csv/' + filename))
c = pd.concat(df_list, ignore_index=True)

# County FIPS = 2-digit state code + 3-digit county code
c['fips'] = c['state'].astype(str).str.zfill(2) + c['county'].astype(str).str.zfill(3)
c = (c
     .drop(columns=['state','county'])
     .rename(columns={'filename':'state'})
     .sort_values('fips')
     .reset_index(drop=True)
     [['fips','state','arealand']])
c['state'] = (c['state']
              .str.title()
              .str.replace('-',' ',regex=False))
print('Unique Census Tracts: ' + str(c.shape[0]))
c.head()
Unique Census Tracts: 85368

fips state arealand
0 01001 Alabama 9825304.0
1 01001 Alabama 478151753.0
2 01001 Alabama 386865281.0
3 01001 Alabama 187504713.0
4 01001 Alabama 105252227.0

Calculate the Sum of the Land Area for each Unique FIPS (County)

Note: land areas are given in square meters (Source) and must be converted to square miles.

$$1\ \text{mile} = 5280\ \frac{\text{ft}}{\text{mile}} \times 12\ \frac{\text{in}}{\text{ft}} \times 2.54\ \frac{\text{cm}}{\text{in}} \times \frac{1}{100}\ \frac{\text{m}}{\text{cm}} = 1609.344\ \text{m}$$
$$1\ \text{mile}^2 = 1609.344^2\ \text{m}^2 = 2{,}589{,}988.1\ \text{m}^2$$
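
As a quick check on the constant used below, the conversion factor can be reproduced in a couple of lines (a small sketch; the variable names are mine):

# Meters per mile, from the conversion chain above
meters_per_mile = 5280 * 12 * 2.54 / 100      # 1609.344
sq_meters_per_sq_mile = meters_per_mile ** 2  # 2,589,988.110336

print(round(sq_meters_per_sq_mile, 1))  # 2589988.1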

# Aggregate tract land area up to the county level
c = (c
     .groupby(['fips','state'])
     .sum()
     .reset_index())
print('Unique Counties: ' + str(c.shape[0]))
# Convert square meters to square miles
c['arealand'] = c['arealand']/2589988.1
c.head()
Unique Counties: 3234

fips state arealand
0 01001 Alabama 594.443718
1 01003 Alabama 1589.824163
2 01005 Alabama 885.007981
3 01007 Alabama 622.458312
4 01009 Alabama 644.839979

Write Combined Dataset to File

c.to_csv('beautiful-soup/data/processed/county_land_area.csv',
         index=False)
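
To confirm the file was written as expected, it can be read back with the pandas import from the top of the notebook and its shape checked against the county count above (a minimal sketch; reading fips as a string preserves the leading zeros):

check = pd.read_csv('beautiful-soup/data/processed/county_land_area.csv',
                    dtype={'fips': str})
print(check.shape)  # expect (3234, 3)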