Web Scraping Salesforce Tables
It is easy to forget that Salesforce renders much of its setup information in ordinary HTML tables, just like any other webpage. That means any data visible in the browser can be scraped directly from the page instead of going through a formal data export.
As a practical example, we can scrape the field names, labels, and data types available at: Setup » Object Manager » Select an Object » Fields & Relationships.
- First, navigate to the page that has the data. It may be necessary to scroll all the way to the bottom of the page to get all the data rendered into the table.
- Then, in Chrome, select an element on the page that is in the table you want to scrape.
- Right-click the item you selected, then select “Inspect” in the menu.
- In the HTML viewer Chrome opens, select the `table` element holding the data of interest, right-click the highlighted `table` tag, then select Copy » Copy outerHTML.
- Paste the HTML into a plain text document and save it as a .html file. The first item in the file should be the `table` tag; it should begin something like the following.
```html
<table class="slds-table slds-table-bordered uiVirtualDataGrid--default uiVirtualDataGrid"
...
```
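Before parsing, it is worth a quick sanity check that the saved file contains exactly one `<table>` and that all the rows you expect made it into the copy. A minimal sketch, where the `sample` string stands in for the real saved file:

```python
from bs4 import BeautifulSoup

# Stand-in for the copied outerHTML; in practice, read your saved .html file instead.
sample = """<table class="slds-table slds-table-bordered">
<tr><th>Sort Field Label</th></tr>
<tr><td>Birthdate</td></tr>
</table>"""

soup = BeautifulSoup(sample, "html.parser")
tables = soup.find_all("table")
rows = soup.find_all("tr")
print(len(tables), len(rows))  # expect one table; rows = header row + one row per field
```

If the row count is short, scroll the original page further down before copying the outerHTML again.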
- Now the HTML can be parsed and the data extracted into a pandas DataFrame, and from there into a CSV or Excel file, as follows. Some of these parameters (notably the column names passed to `drop` and `rename`) may need to be changed, depending on the specific table you are trying to parse.
```python
from bs4 import BeautifulSoup
import pandas as pd


def scrape_objects(file_string):
    with open(file_string) as file:
        soup = BeautifulSoup(file, "html.parser")

    rows = soup.find_all('tr')

    # The first <tr> holds the column headers; skip the bare "\n" text
    # nodes BeautifulSoup leaves between tags.
    header_list = []
    for col in rows[0]:
        if col != '\n':
            header_list.append(col.text.strip().lower())

    # Every remaining <tr> is one field definition.
    row_list = []
    for row in rows[1:]:
        col_list = []
        for col in row:
            if col != '\n':
                col_list.append(col.text.strip())
        row_list.append(col_list)

    # Drop the grid's housekeeping columns and shorten the headers, which
    # carry the "sort"/"sorted ascending" text from the Lightning UI.
    df = (pd.DataFrame(row_list, columns=header_list)
          .drop(columns=['sort indexed', 'actions', 'sort controlling field'])
          .rename(columns={'sort field labelsorted ascending': 'label',
                           'sort field name': 'name',
                           'sort data type': 'type'}))
    df.info()
    return df
```
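The `col != '\n'` comparison above leans on the whitespace text nodes BeautifulSoup inserts between tags. A slightly sturdier variant (a sketch, not the original code) asks each row for its `<th>`/`<td>` cells directly, so no filtering is needed:

```python
from bs4 import BeautifulSoup

# Small stand-in table with the same header style as the Lightning grid.
sample = """<table>
<tr><th>Sort Field Label</th><th>Sort Field Name</th><th>Sort Data Type</th></tr>
<tr><td>Birthdate</td><td>Birthdate</td><td>Date</td></tr>
</table>"""

soup = BeautifulSoup(sample, "html.parser")
rows = soup.find_all('tr')

# find_all(['th', 'td']) returns only element cells, never "\n" text nodes.
header = [cell.text.strip().lower() for cell in rows[0].find_all(['th', 'td'])]
data = [[cell.text.strip() for cell in row.find_all(['th', 'td'])]
        for row in rows[1:]]
print(header)  # ['sort field label', 'sort field name', 'sort data type']
print(data)    # [['Birthdate', 'Birthdate', 'Date']]
```

Either approach yields the same lists; this one just won't break if the grid's whitespace changes.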
Only the first 31 fields on the Contact object are captured here, because only the first page of the field list had rendered in the browser when I downloaded the HTML.
```python
account = scrape_objects('web-scraping-salesforce-tables/html.html')
account.to_clipboard(index=False)
account
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   label   31 non-null     object
 1   name    31 non-null     object
 2   type    31 non-null     object
dtypes: object(3)
memory usage: 872.0+ bytes
```
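As an aside, pandas can often parse a saved HTML table in a single call with `pd.read_html`, which returns one DataFrame per `<table>` found; the BeautifulSoup loop above simply gives finer control over the Lightning grid's extra columns and header text. A sketch with a stand-in table (recent pandas versions want the HTML wrapped in a file-like object):

```python
import io

import pandas as pd

# Stand-in for the saved outerHTML file.
sample = """<table>
<tr><th>Sort Field Label</th><th>Sort Field Name</th><th>Sort Data Type</th></tr>
<tr><td>Birthdate</td><td>Birthdate</td><td>Date</td></tr>
</table>"""

# read_html returns a list of DataFrames, one per <table> in the document.
frames = pd.read_html(io.StringIO(sample))
df = frames[0]
print(df.shape)  # (1, 3): one data row, three columns from the <th> header
```

For the real Salesforce table you would still rename and drop columns afterward, as in `scrape_objects` above.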
|    | label | name | type |
|---|---|---|---|
| 0 | Account Name | AccountId | Lookup(Account) |
| 1 | Assistant | AssistantName | Text(40) |
| 2 | Asst. Phone | AssistantPhone | Phone |
| 3 | Birthdate | Birthdate | Date |
| 4 | Clean Status | CleanStatus | Picklist |
| 5 | Contact Owner | OwnerId | Lookup(User) |
| 6 | Created By | CreatedById | Lookup(User) |
| 7 | Data.com Key | Jigsaw | Text(20) |
| 8 | Department | Department | Text(80) |
| 9 | Description | Description | Long Text Area(32000) |
| 10 | Do Not Call | DoNotCall | Checkbox |
| 11 | Email | Email | Email |
| 12 | Email Opt Out | HasOptedOutOfEmail | Checkbox |
| 13 | Fax | Fax | Fax |
| 14 | Fax Opt Out | HasOptedOutOfFax | Checkbox |
| 15 | Home Phone | HomePhone | Phone |
| 16 | Individual | IndividualId | Lookup(Individual) |
| 17 | Languages | Languages__c | Text(100) |
| 18 | Last Modified By | LastModifiedById | Lookup(User) |
| 19 | Last Stay-in-Touch Request Date | LastCURequestDate | Date/Time |
| 20 | Last Stay-in-Touch Save Date | LastCUUpdateDate | Date/Time |
| 21 | Lead Source | LeadSource | Picklist |
| 22 | Level | Level__c | Picklist |
| 23 | Mailing Address | MailingAddress | Address |
| 24 | Mobile | MobilePhone | Phone |
| 25 | Name | Name | Name |
| 26 | Other Address | OtherAddress | Address |
| 27 | Other Phone | OtherPhone | Phone |
| 28 | Phone | Phone | Phone |
| 29 | Reports To | ReportsToId | Lookup(Contact) |
| 30 | Title | Title | Text(128) |
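To finish the CSV/Excel step mentioned earlier, the returned DataFrame can be written out with `to_csv` or `to_excel` (the latter requires openpyxl). A sketch, using a one-row stand-in for the scraped `account` DataFrame and a hypothetical output file name:

```python
import pandas as pd

# Stand-in for the DataFrame returned by scrape_objects().
account = pd.DataFrame({'label': ['Birthdate'],
                        'name': ['Birthdate'],
                        'type': ['Date']})

account.to_csv('contact_fields.csv', index=False)       # plain CSV
# account.to_excel('contact_fields.xlsx', index=False)  # needs openpyxl installed
```

`index=False` keeps the 0–30 row index out of the output file, matching the `to_clipboard(index=False)` call above.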