Scraping Financial Data with Selenium

Note: The following post is a significant step up in difficulty from the previous selenium-based post, Automate Your Browser: A Guided Selenium Adventure. Please see the start of that post for links on getting selenium set up if this is your first time using it. If you really do need financial data, there are likely easier ways to obtain it than scraping Nasdaq, Yahoo, or Morningstar with selenium: Quandl, Yahoo’s finance API, or perhaps a scraper built with scrapy and splash, for example. And there are many proprietary (and expensive) databases out there that will provide such data. But in any case, I hope this post is helpful in demonstrating a few more of the practices involved in real-life webscraping. The full script is at the end of the post for your convenience.

One fine Monday morning, Todd is sipping a hot cup of decaf green tea, gazing out the office window in a state of Zen oneness as a Selenium script does his work for him. But just as he is on the brink of enlightenment, his boss, Mr. Peabody, bursts into his cubicle and barks, “TODD, quit daydreaming. I just got word from the CEO: we need quarterly financials on some of our competitors.” “Oh? What for?” “Some competitive analysis or something. We’ll be doing it on a regular basis. In any case, we need that data TODAY or YOU’RE FIRED!”

As Mr. Peabody stomps away Todd lets out a sigh. His morning had been going so well, but now it seems he has to actually do some work. He decides though that if he’s going to do work, he’s going to do everything in his power to make sure he never has to do that work again. Brainstorming sources of financial data, Todd figures he could get it from nasdaq.com as easily as anywhere else. He navigates to the quarterly income statement of the first company on the list, Apple (ticker symbol: AAPL).

http://www.nasdaq.com/symbol/aapl/financials?query=income-statement&data=quarterly

The first thing Todd notices is that the actual financial data table is being generated via JavaScript (look for the script tags in the HTML). This means that Python packages such as lxml and Beautiful Soup, which don’t execute JavaScript, won’t be much help here. Todd knows that selenium doesn’t make for the fastest webscraper, but because he only needs data on 5 companies (Amazon, Apple, Facebook, IBM, Microsoft), he still decides to write up another quick selenium script.
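To double-check, Todd could fetch the page without a browser at all. A rough sketch along these lines (assuming the requests and beautifulsoup4 packages are installed) would turn up none of the dollar figures, since they only get filled in by JavaScript after the page loads:

import requests
from bs4 import BeautifulSoup

## fetch the raw, un-rendered HTML for Apple's income statement page
url = "http://www.nasdaq.com/symbol/aapl/financials?query=income-statement&data=quarterly"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

## the cells that hold the dollar figures aren't in the static HTML,
## so this prints an empty (or nearly empty) list
print([td.get_text(strip=True) for td in soup.select("tbody td")])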

To start, he knows he needs to make some imports, initialize a dataframe to store his scraped data in, and launch the browser.

import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

## create a pandas dataframe to store the scraped data
df = pd.DataFrame(index=range(40),
                  columns=['company', 'quarter', 'quarter_ending', 
                           'total_revenue', 'gross_profit', 'net_income', 
                           'total_assets', 'total_liabilities', 'total_equity', 
                           'net_cash_flow'])

## launch the Chrome browser
my_path = "C:UsersgstantonDownloadschromedriver.exe"
browser = webdriver.Chrome(executable_path=my_path)
browser.maximize_window()

Next, Todd thinks about how he’s going to get from company page to company page. Observing the current page’s url, he sees that substituting in the company’s ticker symbol and desired financial statement at the appropriate places should allow him to navigate to all the pages he needs, no simulated-clicking required. He also sees a common pattern in the xpath for the financial data he’ll be scraping.

url_form = "http://www.nasdaq.com/symbol/{}/financials?query={}&data=quarterly" 
financials_xpath = "//tbody/tr/th[text() = '{}']/../td[contains(text(), '$')]"

## company ticker symbols
symbols = ["amzn", "aapl", "fb", "ibm", "msft"]

for i, symbol in enumerate(symbols):
    ## navigate to income statement quarterly page    
    url = url_form.format(symbol, "income-statement")
    browser.get(url)

The first thing he wants to grab is the company ticker symbol, just so he can verify he’s scraping the correct page.

for i, symbol in enumerate(symbols):
    ## navigate to income statement quarterly page    
    url = url_form.format(symbol, "income-statement")
    browser.get(url)
    
    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    try:
        company = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, company_xpath))).text
    except TimeoutException:
        company = nan

Notice the line that assigns the company variable. The WebDriverWait tells the browser to check whether the element is present, just as it normally would. If the element isn’t there yet, the browser checks again every half second until the specified 10 seconds are up, at which point it raises a TimeoutException (which the except clause catches, recording a nan instead). This sort of precaution can be very useful for making your scrapers more reliable.
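As an aside, both the timeout and the polling interval can be tuned. A minimal sketch (reusing the browser and imports from above; poll_frequency is selenium’s keyword argument for the polling interval, which defaults to half a second) might look like this:

## wait up to 20 seconds, re-checking for the element every 2 seconds
wait = WebDriverWait(browser, 20, poll_frequency=2)
try:
    company = wait.until(EC.presence_of_element_located((By.XPATH, company_xpath))).text
except TimeoutException:
    company = nan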

Examining the xpaths for the rest of the financial info, Todd sees that he will be collecting data points in groups of 4 (one data point for each quarter). To account for the possibility that some data might be missing, and to efficiently extract the text from the web elements, Todd writes the following function to simplify the scraping code.

## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    ## find the elements
    elements = browser.find_elements_by_xpath(xpath)
    ## if any are missing, return all nan values
    if len(elements) != 4:
        return [nan] * 4
    ## otherwise, return just the text of each element
    else:
        text = []
        for e in elements:
            text.append(e.text)
        return text

Todd then finishes the code to loop through each of the company symbols and get the quarterly financial data from each of the financial statements.

## company ticker symbols
symbols = ["amzn", "aapl", "fb", "ibm", "msft"]

for i, symbol in enumerate(symbols):
    ## navigate to income statement quarterly page    
    url = url_form.format(symbol, "income-statement")
    browser.get(url)
    
    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    try:
        company = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, company_xpath))).text
    except TimeoutException:
        company = nan
    
    quarters_xpath = "//thead/tr[th[1][text() = 'Quarter:']]/th[position()>=3]"
    quarters = get_elements(quarters_xpath)
    
    quarter_endings_xpath = "//thead/tr[th[1][text() = 'Quarter Ending:']]/th[position()>=3]"
    quarter_endings = get_elements(quarter_endings_xpath)
    
    total_revenue = get_elements(financials_xpath.format("Total Revenue"))
    gross_profit = get_elements(financials_xpath.format("Gross Profit"))
    net_income = get_elements(financials_xpath.format("Net Income"))
    
    ## navigate to balance sheet quarterly page 
    url = url_form.format(symbol, "balance-sheet")
    browser.get(url)
    
    total_assets = get_elements(financials_xpath.format("Total Assets"))
    total_liabilities = get_elements(financials_xpath.format("Total Liabilities"))
    total_equity = get_elements(financials_xpath.format("Total Equity"))
    
    ## navigate to cash flow quarterly page 
    url = url_form.format(symbol, "cash-flow")
    browser.get(url)
    
    net_cash_flow = get_elements(financials_xpath.format("Net Cash Flow"))

So on each iteration of the loop, Todd collects these data points, but he needs somewhere to store them. That’s where the pandas dataframe comes in. The following inner loop, nested inside the main loop over symbols, places the data in the appropriate rows of the dataframe, four rows per company.

    ## fill the dataframe with the scraped data, 4 rows per company
    for j in range(4):
        row = i * 4 + j
        df.loc[row, 'company'] = company
        df.loc[row, 'quarter'] = quarters[j]
        df.loc[row, 'quarter_ending'] = quarter_endings[j]
        df.loc[row, 'total_revenue'] = total_revenue[j]
        df.loc[row, 'gross_profit'] = gross_profit[j]
        df.loc[row, 'net_income'] = net_income[j]
        df.loc[row, 'total_assets'] = total_assets[j]
        df.loc[row, 'total_liabilities'] = total_liabilities[j]
        df.loc[row, 'total_equity'] = total_equity[j]
        df.loc[row, 'net_cash_flow'] = net_cash_flow[j]

After remembering to close the browser and write his dataframe to a .csv file, Todd has his scraper. Kicking his feet back up on his desk, he breathes a sigh of relief and continues his deep meditations on the nature of being while selenium once again does his work for him.
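Those two finishing touches, shown again at the end of the full script below, are just:

browser.quit()

## create a csv file in our working directory with our scraped data
df.to_csv("test.csv", index=False)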

If you enjoyed this post, be sure to subscribe, and let me know if there are any other topics you’d like to see covered.

import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    ## find the elements
    elements = browser.find_elements_by_xpath(xpath)
    ## if any are missing, return all nan values
    if len(elements) != 4:
        return [nan] * 4
    ## otherwise, return just the text of each element
    else:
        text = []
        for e in elements:
            text.append(e.text)
        return text

## create a pandas dataframe to store the scraped data
df = pd.DataFrame(index=range(40),
                  columns=['company', 'quarter', 'quarter_ending', 
                           'total_revenue', 'gross_profit', 'net_income', 
                           'total_assets', 'total_liabilities', 'total_equity', 
                           'net_cash_flow'])

## launch the Chrome browser   
my_path = "C:UsersastantonDownloadschromedriver.exe"
browser = webdriver.Chrome(executable_path=my_path)
browser.maximize_window()

url_form = "http://www.nasdaq.com/symbol/{}/financials?query={}&data=quarterly" 
financials_xpath = "//tbody/tr/th[text() = '{}']/../td[contains(text(), '$')]"

## company ticker symbols
symbols = ["amzn", "aapl", "fb", "ibm", "msft"]

for i, symbol in enumerate(symbols):
    ## navigate to income statement quarterly page    
    url = url_form.format(symbol, "income-statement")
    browser.get(url)
    
    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    try:
        company = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, company_xpath))).text
    except TimeoutException:
        company = nan
    
    quarters_xpath = "//thead/tr[th[1][text() = 'Quarter:']]/th[position()>=3]"
    quarters = get_elements(quarters_xpath)
    
    quarter_endings_xpath = "//thead/tr[th[1][text() = 'Quarter Ending:']]/th[position()>=3]"
    quarter_endings = get_elements(quarter_endings_xpath)
    
    total_revenue = get_elements(financials_xpath.format("Total Revenue"))
    gross_profit = get_elements(financials_xpath.format("Gross Profit"))
    net_income = get_elements(financials_xpath.format("Net Income"))
    
    ## navigate to balance sheet quarterly page 
    url = url_form.format(symbol, "balance-sheet")
    browser.get(url)
    
    total_assets = get_elements(financials_xpath.format("Total Assets"))
    total_liabilities = get_elements(financials_xpath.format("Total Liabilities"))
    total_equity = get_elements(financials_xpath.format("Total Equity"))
    
    ## navigate to cash flow quarterly page 
    url = url_form.format(symbol, "cash-flow")
    browser.get(url)
    
    net_cash_flow = get_elements(financials_xpath.format("Net Cash Flow"))

    ## fill the dataframe with the scraped data, 4 rows per company
    for j in range(4):
        row = i * 4 + j
        df.loc[row, 'company'] = company
        df.loc[row, 'quarter'] = quarters[j]
        df.loc[row, 'quarter_ending'] = quarter_endings[j]
        df.loc[row, 'total_revenue'] = total_revenue[j]
        df.loc[row, 'gross_profit'] = gross_profit[j]
        df.loc[row, 'net_income'] = net_income[j]
        df.loc[row, 'total_assets'] = total_assets[j]
        df.loc[row, 'total_liabilities'] = total_liabilities[j]
        df.loc[row, 'total_equity'] = total_equity[j]
        df.loc[row, 'net_cash_flow'] = net_cash_flow[j]
   
browser.quit()

## create a csv file in our working directory with our scraped data
df.to_csv("test.csv", index=False)