Quick Tip: Consuming Google Search results to use for web scraping


While working on a project recently, I needed to grab Google search results for specific search phrases and then scrape the content from the resulting pages.

For example, when searching for a Sony 16-35mm f2.8 GM lens on Google, I wanted to grab some content (reviews, text, etc.) from the results. While this isn’t hard to build from scratch, I ran across a couple of libraries that make things much easier.

The first is ‘Google Search’ (install via pip install google). This library lets you consume Google search results with just a couple of lines of code. An example is below: it imports the search function, runs a search for Sony 16-35mm f2.8 GM lens, and prints out the URLs that come back.

from googlesearch import search

for url in search('Sony 16-35mm f2.8 GM lens', tld='com', stop=1):
    print(url)

For the above, I’m using google.com for the search and have told it to stop after the first set of results.

The output:

https://www.bhphotovideo.com/c/product/1338516-REG/sony_sel1635gm_fe_16_35mm_f_2_8_gm.html
https://www.amazon.com/Sony-SEL1635GM-16-35mm-2-8-22-Camera/dp/B071LHLS11
https://www.sony.com/electronics/camera-lenses/sel1635gm
https://www.the-digital-picture.com/Reviews/Sony-FE-16-35mm-f-2.8-GM-Lens.aspx
Sony FE 16-35mm f/2.8 GM lens review: Highest-rated wide-angle zoom
Review: Sony 16-35mm f2.8 G Master FE (Sony E Mount, Full Frame)
https://www.adorama.com/iso1635gm.html

That’s pretty easy.
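
Before handing the results to a scraper, it can help to tidy them up a little. The sketch below is one way to do that, assuming you have collected the results into a list first. The helper name clean_urls and its skip_domains parameter are my own for illustration and are not part of the Google Search library.

```python
from urllib.parse import urlparse


def clean_urls(results, skip_domains=()):
    """Keep unique http(s) URLs, optionally skipping some domains.

    `clean_urls` and `skip_domains` are hypothetical names for this sketch.
    """
    seen = set()
    cleaned = []
    for item in results:
        item = item.strip()
        parsed = urlparse(item)
        # Drop anything that is not an absolute http(s) URL.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue
        host = parsed.netloc.lower()
        # Drop hosts that match (or are subdomains of) a skipped domain.
        if any(host == d or host.endswith("." + d) for d in skip_domains):
            continue
        # Keep the first occurrence of each URL, preserving order.
        if item not in seen:
            seen.add(item)
            cleaned.append(item)
    return cleaned
```

For example, clean_urls(list(search('Sony 16-35mm f2.8 GM lens', tld='com', stop=1)), skip_domains=('amazon.com',)) would leave out the shopping result if you only care about reviews.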

Now we can use those URLs to scrape the websites that are returned.

To scrape these sites, you could run a fairly complex scraping system or build your own. But if you just need some basic content and aren’t going to be doing a LOT of scraping, you could use the ‘Newspaper’ library. Of course, there are plenty of other libraries, but Newspaper really simplifies things for those ‘quick and dirty’ projects. Note: This is best used in Python 3.

To get started, install Newspaper with pip3 install newspaper3k (for Python 3).

Now, to scrape a URL returned from the Google search, you can simply do the following (url here is one of the URLs from the search results):

from newspaper import Article
article = Article(url)
article.download()
article.parse()

This will take the URL, download the page and parse it so you can access the content. Here’s an example using the URL https://www.the-digital-picture.com/Reviews/Sony-FE-16-35mm-f-2.8-GM-Lens.aspx.

from newspaper import Article
article = Article('https://www.the-digital-picture.com/Reviews/Sony-FE-16-35mm-f-2.8-GM-Lens.aspx')
article.download()
article.parse()
print(article.text)

The output of print(article.text) is below (I’ve only included an excerpt for this example, but this will grab the entire text):

‘Those putting together the ultimate Sony E-mount lens kit are going to want this lens included. The Sony FE 16-35mm f/2.8 GM Lens covers a key focal length range in wide aperture with high quality. In this case, the term high quality applies both to the lens’ physical attributes and to the image quality delivered by it.\n\nMany are first-attracted to the Alpha MILC (Mirrorless Interchangeable Lens Camera) system for Sony’s high-performing full frame imaging sensors, but lenses are as important as cameras and Sony’s lens lineup was initially viewed by many as deficient. Adapting Canon brand lenses for use on Sony cameras was prevalent. The introduction of Sony’s flagship Grand Master line (the “GM” in the name) was very welcomed by Sony owners and this line is proving attractive to those considering a switch to the Sony camp. The 16-35mm f/2.8 GM is one more reason to stay entirely within the Sony brand.\n\nFocal Length Range\n\nWhen starting a kit, most will first select a general purpose lens (Sony system owners should seriously consider the Sony FE 24-70mm f/2.8 GM Lens) and one of the next-most-needed lenses is typically a wide-angle zoom. This 16-35mm range ideally covers that need.\n\nThe 107° angle of view provided by a 16mm focal length is ultra-wide and all of the narrower angles of view down to 63°, just modestly-wide, are included. To explore what this focal length range looks like, we head to RB Rickett’s falls in Ricketts Glen State Park.\n\nOne of the most popular uses for this range is, as illustrated above, landscape photography.
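
If you want the text from every search result rather than a single page, the same three calls can be wrapped in a loop. The following is a minimal sketch of that idea, not something the Newspaper library provides itself: the names fetch_text, scrape_all, fetch and delay are my own, and the broad except is a deliberate shortcut so that one bad page doesn’t stop the whole run.

```python
import time


def fetch_text(url):
    """Download and parse one page with newspaper, returning its text."""
    # Imported here so the rest of the sketch loads without newspaper installed.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    return article.text


def scrape_all(urls, fetch=fetch_text, delay=1.0):
    """Return (texts, failed): a {url: text} dict and a list of (url, error).

    `scrape_all`, `fetch` and `delay` are hypothetical names for this sketch.
    """
    texts = {}
    failed = []
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay)  # pause between requests to be polite
        try:
            texts[url] = fetch(url)
        except Exception as exc:  # broad on purpose: skip any page that fails
            failed.append((url, str(exc)))
    return texts, failed
```

Calling texts, failed = scrape_all(urls) with the list of URLs from the search gives you the article text keyed by URL, plus a list of the pages that couldn’t be fetched.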

Now, one of the really cool features of the Newspaper library is that it has built-in natural language processing capabilities and can return keywords, summaries and other interesting tidbits. To get this to work, you must have the Natural Language Toolkit (NLTK) installed (install with pip install nltk) and have NLTK’s punkt package downloaded. Here’s an example using the previous URL (and assuming you’ve already done the above steps).

import nltk

# Download the punkt tokenizer data.
# If punkt is already installed,
# you can skip this step.
nltk.download('punkt')

article.nlp()  # this runs the natural language processing
print(article.keywords)

The result:

['focal', '1635mm', 'review', 'gm',
 'lens', 'sony', 'focus', 'aperture', 
'f28', 'fe', 'lenses']

That’s quite nice (and easy!). Of course, if I were doing this as a serious NLP project, I’d write my own NLP functions, but for a quick look at the keywords of an article, this is a fast way to do it.
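
To give a feel for what a hand-rolled keyword function might look like, here is a deliberately naive sketch that just counts word frequencies. It is not the algorithm Newspaper uses, and the names top_keywords and STOPWORDS (along with the tiny stopword list) are made up for this example.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list for illustration only.
STOPWORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is", "for",
    "this", "it", "by", "are", "as", "on", "with",
}


def top_keywords(text, n=10):
    """Return the n most frequent non-stopword tokens in text."""
    # Lowercase the text and split it into runs of letters and digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Ignore stopwords and single-character tokens, then count the rest.
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 1)
    return [word for word, _ in counts.most_common(n)]
```

You could call top_keywords(article.text) on a parsed article and compare the result with article.keywords.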


If you want to learn more about Natural Language Processing using NLTK, the definitive book is Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit.


Photo by Émile Perron on Unsplash

The post Quick Tip: Consuming Google Search results to use for web scraping appeared first on Python Data.