Scraping Data with Beautiful Soup and Requests
import requests
from bs4 import BeautifulSoup
import time
from os import walk
import pandas as pd
In this example, I found a list of links to web pages with well-structured, consistent data in a tabular format: link.
To begin, I saved the HTML containing the list of links by right-clicking the page in Chrome, selecting View Source, and saving the result to a local file. The links in that file are the web-based dataset used as the source for the work that follows.
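As an alternative to saving the page manually, the same HTML could have been fetched with requests. The sketch below is only illustrative; LINK_LIST_URL is a placeholder for the page linked above, not the actual address.

# Sketch only: LINK_LIST_URL stands in for the page of links referenced above.
LINK_LIST_URL = 'https://example.com/page-with-links'

r = requests.get(LINK_LIST_URL)
r.raise_for_status()
with open('beautiful-soup/data/tiger-web-link-list.html', 'w') as file:
    file.write(r.text)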
Build Dictionary with Several Links to Visit
with open('beautiful-soup/data/tiger-web-link-list.html') as file:
    link_list_html = file.read()

soup = BeautifulSoup(link_list_html,
                     "html.parser")
links = soup.find_all('a')

link_dict = {}
for link in links:
    url = link.get('href')
    if url is not None and 'Files/bas20' in url:
        url_prefix = 'https://tigerweb.geo.census.gov/tigerwebmain/'
        link_dict[link.text] = url_prefix + url
{k: link_dict[k] for k in list(link_dict)[:1]}
{'Alabama': 'https://tigerweb.geo.census.gov/tigerwebmain/Files/bas20/tigerweb_bas20_tract_al.html'}
The result, the first entry of which is shown above, is a dictionary whose keys are the link text and whose values are the URLs themselves.
Iteratively Visit Those Links to Download HTML to Local File
Next, I use requests to download the HTML from all 56 URLs included in that dictionary.
Since working with the HTML will require many iterations to refine the parsing code, it is best to download to local files to minimize the number of calls made to the server. This is also the reason for the download_new_data variable: it functions as a switch for use after all the HTML has been downloaded, so the entire notebook can be re-run without initiating 56 web requests every time.
Because this will require repeated calls to the same server, I wait half a second between each request.
download_new_data = False

if download_new_data == True:
    for state in link_dict.keys():
        url = link_dict[state]
        r = requests.get(url)
        if r.status_code == 200:
            html = r.text
            if len(html) > 0:
                filename = '-'.join(state.split()).lower().replace('.','') + '.html'
                filepath = 'beautiful-soup/data/html/' + filename
                with open(filepath,'w') as file:
                    file.write(html)
            else:
                print('No HTML at url for: {}'.format(state))
        else:
            print('Problem accessing url for: {}'.format(state))
        time.sleep(0.5)
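If the server ever throttles or drops connections partway through the 56 requests, one optional hardening step (not part of the run above, just an assumption about how it could be wired up) is to reuse a single requests.Session with automatic retries:

# Optional sketch: retry transient failures with backoff on a shared session.
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503])
session.mount('https://', HTTPAdapter(max_retries=retries))
# Inside the loop above, session.get(url) would then replace requests.get(url).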
Parse the Data from the HTML and Save to CSV
The data parsed from each downloaded page is saved to its own CSV file, 56 in total.
parse_new_data = True

if parse_new_data == True:
    date_dir_list = []
    for _, _, filenames in walk('./beautiful-soup/data/html/'):
        for filename in filenames:
            with open('beautiful-soup/data/html/' + filename) as file:
                soup = BeautifulSoup(file,
                                     "html.parser")
            rows = soup.find_all('tr')
            header_list = []
            for col in rows[0]:
                if col != '\n':
                    header_list.append(col.text.strip().lower())
            row_list = []
            for row in rows[1:]:
                col_list = []
                for col in row:
                    if col != '\n':
                        col_list.append(col.text.strip())
                row_list.append(col_list)
            df = pd.DataFrame(row_list,
                              columns=header_list)
            df['filename'] = filename.replace('.html','')
            df.to_csv('beautiful-soup/data/csv/' + filename.replace('html','csv'),
                      index=False)
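Since each saved page is essentially a single HTML table, pandas.read_html would be a shorter route to the same DataFrame. The sketch below is an alternative, not the approach used above, and it assumes the table structure is as consistent as it appears.

# Alternative sketch (assumes each saved page contains one clean <table>;
# pd.read_html also needs lxml or html5lib installed).
from io import StringIO

for _, _, filenames in walk('./beautiful-soup/data/html/'):
    for filename in filenames:
        with open('beautiful-soup/data/html/' + filename) as file:
            tables = pd.read_html(StringIO(file.read()))  # one DataFrame per <table>
        df = tables[0]
        df.columns = [str(col).strip().lower() for col in df.columns]
        df['filename'] = filename.replace('.html', '')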
Open the CSVs and Consolidate
The CSVs are read back in, combined into a single DataFrame, and reduced to only the necessary columns.
c = pd.DataFrame()
for _, _, filenames in walk('beautiful-soup/data/csv/'):
    for filename in filenames:
        new_df = pd.read_csv('beautiful-soup/data/csv/' + filename)
        # DataFrame.append was removed in pandas 2.x; concat does the same job here.
        c = pd.concat([c, new_df])
c = c.reset_index(drop=True)

c['fips'] = c['state'].astype(str).str.zfill(2) + c['county'].astype(str).str.zfill(3)
c = (c
     .drop(columns=['state','county'])
     .rename(columns={'filename':'state'})
     .sort_values('fips')
     .reset_index(drop=True)
     [['fips','state','arealand']])
c['state'] = (c['state']
              .str.title()
              .str.replace('-',' ',regex=False))
print('Unique Census Tracts: ' + str(c.shape[0]))
c.head()
Unique Census Tracts: 85368
|   | fips  | state   | arealand    |
|---|-------|---------|-------------|
| 0 | 01001 | Alabama | 9825304.0   |
| 1 | 01001 | Alabama | 478151753.0 |
| 2 | 01001 | Alabama | 386865281.0 |
| 3 | 01001 | Alabama | 187504713.0 |
| 4 | 01001 | Alabama | 105252227.0 |
Calculate the Sum of the Land Area for each Unique FIPS (County)
Note: land areas are given in square meters (Source); they must be converted to square miles.
$$1\ mile = 5280\ \frac{foot}{mile} \times 12\ \frac{inch}{foot} \times 2.54\ \frac{cm}{inch} \times \frac{1}{100} \frac{m}{cm} = 1609.344\ meters$$
$$1\ mile^2 = 2,589,988.1\ meters ^2$$
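The conversion factor used in the code below comes directly from that arithmetic; a quick check in Python:

# Sanity check of the square-mile conversion factor derived above.
meters_per_mile = 5280 * 12 * 2.54 / 100        # 1609.344
sq_meters_per_sq_mile = meters_per_mile ** 2    # 2589988.110336, rounded to 2589988.1 below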
c = (c
     .groupby(['fips','state'])
     .sum()
     .reset_index())
print('Unique Counties: ' + str(c.shape[0]))
c['arealand'] = c['arealand']/2589988.1
c.head()
Unique Counties: 3234
|   | fips  | state   | arealand    |
|---|-------|---------|-------------|
| 0 | 01001 | Alabama | 594.443718  |
| 1 | 01003 | Alabama | 1589.824163 |
| 2 | 01005 | Alabama | 885.007981  |
| 3 | 01007 | Alabama | 622.458312  |
| 4 | 01009 | Alabama | 644.839979  |
Write Combined Dataset to File
c.to_csv('beautiful-soup/data/processed/county_land_area.csv',
         index=False)
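One note for whenever this file is read back: pandas will parse the zero-padded FIPS codes as integers by default, so the column should be forced to string on load. A short example, assuming pandas is used for the re-read:

# Keep the leading zeros in the FIPS codes (e.g. '01001') when re-loading.
county_land_area = pd.read_csv('beautiful-soup/data/processed/county_land_area.csv',
                               dtype={'fips': str})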