Selenium + PhantomJS Tutorial

This post borrows from the previous selenium-based post here. If you have heard of PhantomJS, would like to try it out, and are curious to see how it performs against other browsers such as Chrome, this post will help. However, in my experience, using the PhantomJS browser for webscraping doesn’t offer many benefits over Chrome or Firefox (unless you need to run your script on a server, in which case it’s your go-to). It is faster, though not by as much as you might hope, and I’ve found it to be much less reliable: even after extensive tweaking and troubleshooting, it can freeze at random on tasks that run smoothly in Chrome. My current opinion is that it’s more trouble than it’s worth for webscraping purposes, but if you want to try it out for yourself, I hope you’ll find the tutorial below helpful.

If you aren’t familiar with it, PhantomJS is a browser much like Chrome or Firefox but with one important difference: it’s headless. This means that using PhantomJS doesn’t require an actual browser window to be open. To install the PhantomJS browser, go here and choose the appropriate download (I’ll assume Windows from here on out, though the process is similar on other operating systems). Unzip the downloaded file, named something like “phantomjs-2.1.1-windows.zip”, and there you have it: PhantomJS is installed. If you go into the unzipped folder, and then into the bin folder, you should find a file named “phantomjs.exe”. All we need to do now is reference that file’s path in our script to launch the browser.
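If you want to make sure the install worked before getting into the full script, here’s a minimal sanity check (the path is just where my unzipped folder happens to live; adjust it to match yours):

from selenium import webdriver

## point selenium at the PhantomJS executable inside the unzipped bin folder
my_path = r'C:\Users\gstanton\Downloads\phantomjs-2.1.1-windows\bin\phantomjs.exe'
browser = webdriver.PhantomJS(executable_path=my_path)

## no browser window will open -- PhantomJS is headless
browser.get('http://www.example.com')
print(browser.title)  ## should print 'Example Domain'
browser.quit()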

Here is the start of our script from last time:

import time
import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    ## find the elements
    elements = browser.find_elements_by_xpath(xpath)
    ## if any are missing, return all nan values
    if len(elements) != 4:
        return [nan] * 4
    ## otherwise, return just the text of each element
    else:
        text = []
        for e in elements:
            text.append(e.text)
        return text

## create a pandas dataframe to store the scraped data (4 quarters for each of the 5 companies)
df = pd.DataFrame(index=range(20),
                  columns=['company', 'quarter', 'quarter_ending', 
                           'total_revenue', 'gross_profit', 'net_income', 
                           'total_assets', 'total_liabilities', 'total_equity', 
                           'net_cash_flow'])
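As a quick illustration of that fallback logic (the XPath here is one we’ll use later, and the quarter labels are just placeholder values):

## with the browser on a quarterly financials page and all four columns present:
quarters = get_elements("//thead/tr[th[1][text() = 'Quarter:']]/th[position()>=3]")
## e.g. ['4th', '3rd', '2nd', '1st']
## if the table fails to render (not uncommon with PhantomJS), we instead get
## [nan, nan, nan, nan], which keeps the dataframe rows aligned rather than
## crashing the scrape with an exception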

Now we launch the browser, referencing the PhantomJS executable:

my_path = r'C:\Users\gstanton\Downloads\phantomjs-2.1.1-windows\bin\phantomjs.exe'
browser = webdriver.PhantomJS(executable_path=my_path)

However, at least for me, launching the browser this way resulted in highly unreliable webscraping that would freeze at seemingly random times. To make a long story short, here is revised launch code that I found improved reliability: it presents a desktop Chrome user agent instead of PhantomJS’s default, tells PhantomJS to ignore SSL errors and accept any SSL protocol, and turns on debug logging.

dcaps = webdriver.DesiredCapabilities.PHANTOMJS
dcaps["phantomjs.page.settings.userAgent"] = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'
my_path = r'C:\Users\gstanton\Downloads\phantomjs-2.1.1-windows\bin\phantomjs.exe'
browser = webdriver.PhantomJS(executable_path=my_path, 
                              service_args=['--ignore-ssl-errors=true', '--ssl-protocol=any', '--debug=true'], 
                              desired_capabilities=dcaps)
browser.implicitly_wait(5)
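If you want to double-check that the spoofed user agent actually took effect, you can ask the browser directly (just a quick diagnostic, not part of the scraper):

## should echo the Chrome user-agent string we set above, not PhantomJS's default
print(browser.execute_script("return navigator.userAgent"))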

Now you can compare the runtimes of PhantomJS and Chrome for yourself. The full script below is set to run PhantomJS; to test Chrome, comment out the PhantomJS launch section and uncomment the Chrome one.

import time
import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath):
    ## find the elements
    elements = browser.find_elements_by_xpath(xpath)
    ## if any are missing, return all nan values
    if len(elements) != 4:
        return [nan] * 4
    ## otherwise, return just the text of each element
    else:
        text = []
        for e in elements:
            text.append(e.text)
        return text

## create a pandas dataframe to store the scraped data (4 quarters for each of the 5 companies)
df = pd.DataFrame(index=range(20),
                  columns=['company', 'quarter', 'quarter_ending', 
                           'total_revenue', 'gross_profit', 'net_income', 
                           'total_assets', 'total_liabilities', 'total_equity', 
                           'net_cash_flow'])

start_time = time.time()

## launch the PhantomJS browser
###############################################################################
dcaps = webdriver.DesiredCapabilities.PHANTOMJS
dcaps["phantomjs.page.settings.userAgent"] = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'
my_path = r'C:\Users\gstanton\Downloads\phantomjs-2.1.1-windows\bin\phantomjs.exe'
browser = webdriver.PhantomJS(executable_path=my_path, 
                              service_args=['--ignore-ssl-errors=true', '--ssl-protocol=any', '--debug=true'], 
                              desired_capabilities=dcaps)
browser.implicitly_wait(5)
###############################################################################
"""
## launch the Chrome browser
###############################################################################    
my_path = "C:UsersgstantonDownloadschromedriver.exe"
browser = webdriver.Chrome(executable_path=my_path)
browser.maximize_window()
###############################################################################
"""

url_form = "http://www.nasdaq.com/symbol/{}/financials?query={}&data=quarterly" 
financials_xpath = "//tbody/tr/th[text() = '{}']/../td[contains(text(), '$')]"

## company ticker symbols
symbols = ["amzn", "aapl", "fb", "ibm", "msft"]

for i, symbol in enumerate(symbols):
    ## navigate to income statement quarterly page    
    url = url_form.format(symbol, "income-statement")
    browser.get(url)
    
    company_xpath = "//h1[contains(text(), 'Company Financials')]"
    company = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, company_xpath))).text
    
    quarters_xpath = "//thead/tr[th[1][text() = 'Quarter:']]/th[position()>=3]"
    quarters = get_elements(quarters_xpath)
    
    quarter_endings_xpath = "//thead/tr[th[1][text() = 'Quarter Ending:']]/th[position()>=3]"
    quarter_endings = get_elements(quarter_endings_xpath)
    
    total_revenue = get_elements(financials_xpath.format("Total Revenue"))
    gross_profit = get_elements(financials_xpath.format("Gross Profit"))
    net_income = get_elements(financials_xpath.format("Net Income"))
    
    ## navigate to balance sheet quarterly page 
    url = url_form.format(symbol, "balance-sheet")
    browser.get(url)
    
    total_assets = get_elements(financials_xpath.format("Total Assets"))
    total_liabilities = get_elements(financials_xpath.format("Total Liabilities"))
    total_equity = get_elements(financials_xpath.format("Total Equity"))
    
    ## navigate to cash flow quarterly page 
    url = url_form.format(symbol, "cash-flow")
    browser.get(url)
    
    net_cash_flow = get_elements(financials_xpath.format("Net Cash Flow"))

    ## fill the dataframe with the scraped data, 4 rows per company
    for j in range(4):
        row = i * 4 + j
        df.loc[row, 'company'] = company
        df.loc[row, 'quarter'] = quarters[j]
        df.loc[row, 'quarter_ending'] = quarter_endings[j]
        df.loc[row, 'total_revenue'] = total_revenue[j]
        df.loc[row, 'gross_profit'] = gross_profit[j]
        df.loc[row, 'net_income'] = net_income[j]
        df.loc[row, 'total_assets'] = total_assets[j]
        df.loc[row, 'total_liabilities'] = total_liabilities[j]
        df.loc[row, 'total_equity'] = total_equity[j]
        df.loc[row, 'net_cash_flow'] = net_cash_flow[j]
   
browser.quit()

## create a csv file in our working directory with our scraped data
df.to_csv("test.csv", index=False)

print(time.time() - start_time)
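As an aside, the financials_xpath template does most of the heavy lifting in that script: it finds the table row whose header cell matches a given label, then grabs that row’s cells containing dollar amounts. If the templates look opaque, printing a couple of formatted examples makes them easier to read (plain format calls, no browser needed):

print(url_form.format("amzn", "income-statement"))
## http://www.nasdaq.com/symbol/amzn/financials?query=income-statement&data=quarterly
print(financials_xpath.format("Total Revenue"))
## //tbody/tr/th[text() = 'Total Revenue']/../td[contains(text(), '$')]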

When I compared the browsers, I found PhantomJS was generally faster, but not by enough to make Selenium viable for a scraping job that wasn’t already viable with Chrome. Additionally, it took a fair amount of troubleshooting to get PhantomJS to the point where it performed even semi-reliably.

In conclusion, these two browsers are in the same general bracket of webscraping speed, and because Chrome has given me far fewer issues, I still recommend Chrome. In the future, though, I’ll explore other, more powerful ways of scraping pages with JavaScript-rendered content. Stay tuned.