Automate Your Browser: A Guided Selenium Adventure

Prerequisites: Have Python/Anaconda and Selenium installed. See the previous intro to Selenium if you’re not familiar with it. The full code for this post is included at the end. You might find it fun to first run the entire script and watch how it works before jumping in and following along with the post. And don’t forget to use your browser’s Developer Tools to see how XPaths were chosen. Best of luck.
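One more heads-up: the code below uses the older Selenium 3 style of calls (find_element_by_xpath, the executable_path argument, and so on). If you're on Selenium 4 or later, those helpers were deprecated and eventually removed. Here's a minimal sketch of the modern equivalents in case you need to adapt the script (the driver path is just a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

## Selenium 4 passes the driver path via a Service object
## (recent versions can even download the driver for you automatically)
browser = webdriver.Chrome(service=Service("path/to/chromedriver"))

## the find_element_by_* helpers became find_element(By.XPATH, ...)
element = browser.find_element(By.XPATH, "//input[@value='Next form']")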

One morning Todd, a loyal and industrious employee, was hard at work watching cat videos when suddenly his boss, Mr. Peabody, burst into his cubicle.

“Todd, the CEO is sick of paying everyone so much money. Figure out where in the Southwest US the wages are the lowest. We’re relocating there ASAP!”

This wasn’t the first time Todd was asked to look up something like this. Sometimes they want to know where wages are the lowest in the Northeast, or Midwest, or in individual states, and Todd is fed up with it. So many cat videos, but so little time to watch them. If only there was a way to automate it…

With a sudden flash of inspiration, Todd realizes the Selenium package for Python can solve his problem! He just needs to write up a quick script, and then he can kiss these lengthy wage look-ups goodbye.

To start, Todd writes the code to import the packages he thinks he’ll need, open up a browser, and navigate to the Bureau of Labor Statistics (BLS) data page.

import re
from selenium import webdriver

## remember to input your own file path to your chrome driver here if you're using chrome
browser = webdriver.Chrome(executable_path=r"C:\Users\gstanton\Downloads\chromedriver.exe")
browser.maximize_window()

url = 'http://www.bls.gov/data/'
browser.get(url)

Todd wants to use the “Multi-Screen Data Search” tool in the “Employment” section to get earnings (aka wages) data for US metro areas. To do this, he identifies a viable XPath expression for the tool’s button with Chrome’s Developer Tools and then writes the code to select and click that element.

multi_screen_data_search = browser.find_element_by_xpath("//img[@title='Multi Screen Data Search for CES State and Metro Area']")
multi_screen_data_search.click()
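A word of caution: a click like this can fail if the browser hasn't finished loading the element yet. Todd's script gets away with plain clicks, but a more defensive version might wrap each lookup in an explicit wait. A minimal sketch, with an arbitrary 10-second timeout:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## wait (up to 10 seconds) until the element is clickable, then click it
wait = WebDriverWait(browser, 10)
wait.until(EC.element_to_be_clickable((By.XPATH, "//input[@value='Next form']"))).click()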

This brings Todd to a page with a couple of checkboxes, one for data that is seasonally adjusted and one for data that isn't. Todd decides he wants the non-adjusted data. He grabs both checkbox elements (they share the same name attribute), clicks the second one, then finds and clicks the “Next form” button to proceed with the query.

## both checkboxes share the name 'seasonal'; the second is the non-adjusted one
not_seasonally_adjusted = browser.find_elements_by_xpath("//input[@name='seasonal']")
not_seasonally_adjusted[1].click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

Todd wants the average hourly earnings of all employees in dollars, and selects the appropriate options.

average_hourly_earnings = browser.find_element_by_xpath("//select/option[@value='03']")
average_hourly_earnings.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

Now Todd needs to select the Southwest US region. He knows the region is typically defined as Arizona, California, Colorado, Nevada, New Mexico, and Utah, and selects the states accordingly.

## selected states (FIPS codes): AZ, CA, CO, NV, NM, UT
state_values = ['04', '06', '08', '32', '35', '49']
for value in state_values:
    state_element = browser.find_element_by_xpath("//select/option[@value='{}']".format(value))
    state_element.click()
    
next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()
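As an aside, Selenium ships a Select helper made specifically for <select> elements, which can read more clearly than clicking raw option elements. A sketch of the same state selection, assuming the states list is the first <select> on the page:

from selenium.webdriver.support.ui import Select

## wrap the <select> element and pick options by their value attribute
state_select = Select(browser.find_element_by_xpath("//select"))
for value in ['04', '06', '08', '32', '35', '49']:
    state_select.select_by_value(value)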

Todd wants all the metros in these states, and so selects all the metro options, making sure to exclude the “Statewide” option.

all_metro_elements = browser.find_elements_by_xpath("//select[@name='area_code']/option")
for metro_element in all_metro_elements:
    metro_element.click()
    
## de-select the statewide option at the top, we just want metros
statewide_option = browser.find_element_by_xpath("//option[@value='00000']")
statewide_option.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()
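Had Todd gone the Select route here too, the statewide de-selection would be a one-liner (deselect_by_value only works on multi-select lists, which this is):

from selenium.webdriver.support.ui import Select

## the metro list is the <select> named 'area_code' used above
Select(browser.find_element_by_xpath("//select[@name='area_code']")).deselect_by_value('00000')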

There are then a couple of screens where Todd specifies that he wants just private-sector wages.

## 'Total private' at the supersector level
total_private_option = browser.find_element_by_xpath("//option[@value='05']")
total_private_option.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

## 'Total private' again, this time at the more detailed industry level
total_private_option = browser.find_element_by_xpath("//option[@value='05000000']")
total_private_option.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

And finally, Todd gets to retrieve his precious data.

retrieve_data = browser.find_element_by_xpath("//input[@value='Retrieve data']")
retrieve_data.click()

There’s just one problem: the format the data comes in. It’s displayed in tables on the page, alongside links to download a spreadsheet of wage data for each metro. Todd doesn’t want to go to the trouble of downloading all those spreadsheets, and he certainly doesn’t want to copy and paste the most recent wage numbers into a spreadsheet by hand. Instead, he comes up with a way to grab the figures straight from the web page and determine which metro has the lowest average earnings.

## all the most recent wage figures have a '(P)' (for 'Preliminary') after the number
## using this fact, one can easily grab all those most recent figures
recent_hourly_earnings = browser.find_elements_by_xpath("//table/tbody/tr/td[contains(text(), '(P)')]")

cleaned_recent_earnings = []
for wage in recent_hourly_earnings:    
    ## use regex to exclude the '(P)' and grab just the wage figure
    cleaned_wage = re.search(r'\d+\.\d+', wage.text).group()
    ## we want to convert to floats for finding the minimum later on
    cleaned_recent_earnings.append(float(cleaned_wage))
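To see what that pattern does, here's a quick standalone check on a made-up cell value (the '(P)' suffix follows the format described above):

import re

## the pattern grabs the first decimal number and ignores the trailing '(P)'
re.search(r'\d+\.\d+', '27.64(P)').group()   # returns '27.64'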

Great, now Todd has the wages. But he also needs the metro names to pair with the wages, so he knows which figure belongs to which metro. The only problem is that each metro’s name is lumped into the same element as several other pieces of info in the text above its table. But Todd sees a way to extract it.

## get all the groups of text above each table
all_table_texts = browser.find_elements_by_xpath("//table/caption/pre")

metros = []
for table_text in all_table_texts:    
    """Use regex to pick out just the metro name from all the text.
    The name starts with a letter, followed by any combo of letters and spaces,
    has a comma and a space and ends with two uppercase letters (the state abbreviation)"""
    metro_name = re.search('w[w ]+, [A-Z][A-Z]', table_text.text).group()
    metros.append(metro_name)
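A quick sanity check of that pattern too, on a made-up stand-in for the real caption text (which contains more fields than shown here):

import re

sample_caption = 'Average Hourly Earnings of All Employees\nArea: Flagstaff, AZ\nNot Seasonally Adjusted'
re.search(r'\w[\w ]+, [A-Z][A-Z]', sample_caption).group()   # returns 'Flagstaff, AZ'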

With a few final lines of code, Todd zips the metro names together with the wage figures and takes the minimum to get the name of the metro in the Southwest US with the lowest average hourly earnings.

metro_earnings_dict = dict(zip(metros, cleaned_recent_earnings))
metro_to_move_to = min(metro_earnings_dict, key=metro_earnings_dict.get)
print(metro_to_move_to)
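The trick in that last step: calling min on a dict iterates over its keys, and key=metro_earnings_dict.get tells min to rank those keys by their associated values. A toy illustration:

## min() compares the keys by the values they map to
toy = {'Metro A': 25.10, 'Metro B': 21.45, 'Metro C': 28.00}
min(toy, key=toy.get)   # returns 'Metro B'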

Triumphantly, Todd runs the script, shoots his boss a quick email, and goes back to watching cat videos. He gets several similar requests throughout the day, but with his script’s help Todd blessedly incurs only minor disruptions to his cat-watching regimen.

For convenience, here’s the full script:

import re
from selenium import webdriver

## remember to input your own file path to your chrome driver here if you're using chrome
browser = webdriver.Chrome(executable_path=r"C:\Users\gstanton\Downloads\chromedriver.exe")
browser.maximize_window()
url = 'http://www.bls.gov/data/'
browser.get(url)

multi_screen_data_search = browser.find_element_by_xpath("//img[@title='Multi Screen Data Search for CES State and Metro Area']")
multi_screen_data_search.click()

not_seasonally_adjusted = browser.find_elements_by_xpath("//input[@name='seasonal']")
not_seasonally_adjusted[1].click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

average_hourly_earnings = browser.find_element_by_xpath("//select/option[@value='03']")
average_hourly_earnings.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

## selected states (FIPS codes): AZ, CA, CO, NV, NM, UT
state_values = ['04', '06', '08', '32', '35', '49']
for value in state_values:
    state_element = browser.find_element_by_xpath("//select/option[@value='{}']".format(value))
    state_element.click()
    
next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

all_metro_elements = browser.find_elements_by_xpath("//select[@name='area_code']/option")
for metro_element in all_metro_elements:
    metro_element.click()
    
## de-select the statewide option at the top, we just want metros
statewide_option = browser.find_element_by_xpath("//option[@value='00000']")
statewide_option.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

total_private_option = browser.find_element_by_xpath("//option[@value='05']")
total_private_option.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

total_private_option = browser.find_element_by_xpath("//option[@value='05000000']")
total_private_option.click()

next_form = browser.find_element_by_xpath("//input[@value='Next form']")
next_form.click()

retrieve_data = browser.find_element_by_xpath("//input[@value='Retrieve data']")
retrieve_data.click()

## all the most recent wage figures have a '(P)' (for 'Preliminary') after the number
## using this fact, one can easily grab all those most recent figures
recent_hourly_earnings = browser.find_elements_by_xpath("//table/tbody/tr/td[contains(text(), '(P)')]")
cleaned_recent_earnings = []
for wage in recent_hourly_earnings:
    ## use regex to exclude the '(P)' and grab just the wage figure
    cleaned_wage = re.search(r'\d+\.\d+', wage.text).group()
    ## we want to convert to floats for finding the minimum
    cleaned_recent_earnings.append(float(cleaned_wage))

## get all the groups of text above each table
all_table_texts = browser.find_elements_by_xpath("//table/caption/pre")
metros = []
for table_text in all_table_texts:
    ## use regex to pick out just the metro name
    ## it starts with a letter, followed by any combo of letters and spaces...
    ## has a comma and a space and ends with two uppercase letters (the state abbreviation)
    metro_name = re.search(r'\w[\w ]+, [A-Z][A-Z]', table_text.text).group()
    metros.append(metro_name)
    
metro_earnings_dict = dict(zip(metros, cleaned_recent_earnings))
metro_to_move_to = min(metro_earnings_dict, key=metro_earnings_dict.get)
print(metro_to_move_to)