Website scraping using Python

Website scraping is the automated reading of a website's structure to extract information, usually with a script. The line between legal and illegal scraping can be thin: if content is available without logging in or passing any identity verification, and the provider does not explicitly forbid it, scraping it is generally considered acceptable.

Today, we are going to extract information from one of the best information providers out there, Wikipedia. We will use its random-article feature to pull data from a random article. I am going to use the following Python tools:

Python 2
urllib2 module
BeautifulSoup module

To install BeautifulSoup, run pip install beautifulsoup4 (that package provides the bs4 module we import below). urllib2 is part of the Python 2 standard library, so you need not install it explicitly.
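For reference, the setup boils down to a single command (pip must already be available on your system):

```shell
# Installs BeautifulSoup 4, importable as bs4.
# urllib2 ships with Python 2 and needs no install step.
pip install beautifulsoup4
```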

Okay, so first step now is to import required modules:

from bs4 import BeautifulSoup
import urllib2

Now we fetch the content of a random Wikipedia page using urllib2 and build a parseable object with BeautifulSoup.

# Special:Random redirects to a random article; urlopen follows the redirect.
random_page = "https://en.wikipedia.org/wiki/Special:Random"
random_page_content = urllib2.urlopen(random_page)
# Passing an explicit parser avoids BeautifulSoup's "no parser specified" warning.
parsed_page = BeautifulSoup(random_page_content, "html.parser")
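With the parsed page in hand, BeautifulSoup's find() method lets us search the tree by tag name and attributes. As a rough sketch of what a next step could look like, the example below assumes Wikipedia's standard article layout, where the article title sits in an h1 element with id "firstHeading"; it runs against a small static snippet standing in for a fetched page, so it works without a network connection:

```python
from bs4 import BeautifulSoup

# A minimal stand-in for a fetched Wikipedia page; real pages are far larger.
html = """
<html><body>
  <h1 id="firstHeading">Example article</h1>
  <div id="mw-content-text"><p>First paragraph of the article.</p></div>
</body></html>
"""

parsed = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
title = parsed.find("h1", id="firstHeading")
first_para = parsed.find("div", id="mw-content-text").find("p")

print(title.get_text())       # Example article
print(first_para.get_text())  # First paragraph of the article.
```

The same find() calls work on parsed_page from the snippet above; just keep in mind that a live page's markup can change, so it is safer to check each result for None before calling get_text() on it.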

