
Web Scraping in Python with BeautifulSoup

I’m often asked, “Which web scraping library should I choose?” I usually answer: choose the one that is the most popular in your programming language. If it’s Java, choose Jsoup. If it’s Python, BeautifulSoup is your best bet.

BeautifulSoup Installation

You can easily install the most recent version of BeautifulSoup with pip:

pip install beautifulsoup4
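
To confirm the install worked, a quick check like the one below can help. Note that the package is installed as beautifulsoup4 but imported as bs4. The inline HTML string is just a made-up test input, and the version printed will depend on your environment.

```python
# The package is installed as "beautifulsoup4" but imported as "bs4".
import bs4
from bs4 import BeautifulSoup

# Print the installed version string (varies by environment).
print(bs4.__version__)

# Parse a tiny inline document to confirm the library works end to end.
soup = BeautifulSoup('<p>Hello</p>', 'html.parser')
greeting = soup.p.get_text()
print(greeting)
```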

Web Scraping with BeautifulSoup

Because BeautifulSoup cannot load an HTML page from the internet by itself, you need an HTTP library such as urllib2 (urllib.request in Python 3) to download it.

First, you need to request the page you want to scrape (https://scrapethissite.com/pages/simple/ in our case) and set a proper user agent to identify yourself.

from urllib.request import Request, urlopen  # Python 3; in Python 2 these live in urllib2

from bs4 import BeautifulSoup

# create the request and set the user agent
request = Request('https://scrapethissite.com/pages/simple/')
request.add_header('User-Agent', 'ScrapingAuthority (ScrapingAuthority.com)')

# open the page
response = urlopen(request)

Now we have one more task before scraping: deciding which parser library BeautifulSoup should use. Python has a built-in parser, html.parser. I’m using it in this example, but you can choose a third-party library instead, for example lxml.

page = BeautifulSoup(response, 'html.parser')

title = page.title.text  # title of the page
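
If you would rather use lxml when it is available, one possible approach is to fall back to the built-in parser when it is not installed. The sketch below is only an illustration: make_soup is a hypothetical helper name, and the inline HTML stands in for a downloaded page so the example runs offline. It relies on BeautifulSoup raising FeatureNotFound when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for a downloaded page.
html = '<html><head><title>Countries</title></head><body><h1>List</h1></body></html>'


def make_soup(markup, preferred='lxml'):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        return BeautifulSoup(markup, 'html.parser')


page = make_soup(html)
title = page.title.text  # title of the page
print(title)
```

Different parsers can build slightly different trees from broken HTML, so it is probably worth sticking to one parser within a project once you have picked it.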

Now that BeautifulSoup is set up correctly, we can use our “page” object to navigate the page and find elements on it.

Selectors

Using our “page” object, you can navigate to any tag on the page and extract its content right away:

h3 = page.h3.get_text()    #selects the first H3 tag on the page and extracts its text

Also, you can find elements by passing the tag and class name as arguments like this:

country_name_tags = page.find_all('h3', class_='country-name')

This statement selects every “h3” element whose class is “country-name”.
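
To see the difference between tag navigation and find_all, here is a small self-contained sketch. The HTML snippet is made up for illustration (it only borrows the class names used above), so it runs without downloading anything.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class names used above.
html = """
<div class="row">
  <div class="col-md-4 country">
    <h3 class="country-name"> Andorra </h3>
  </div>
  <div class="col-md-4 country">
    <h3 class="country-name"> Belgium </h3>
  </div>
  <h3 class="footer-note">Not a country</h3>
</div>
"""

page = BeautifulSoup(html, 'html.parser')

# Attribute access returns only the first matching tag.
first_h3 = page.h3.get_text(strip=True)

# find_all returns every match; class_ filters on the CSS class.
country_name_tags = page.find_all('h3', class_='country-name')
country_names = [tag.get_text(strip=True) for tag in country_name_tags]

# find returns the first match, or None when nothing matches.
missing = page.find('h3', class_='no-such-class')

print(first_h3)
print(country_names)
print(missing)
```

Because find returns None when nothing matches, it is worth checking the result before calling get_text() on it.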

BeautifulSoup also supports CSS selectors, which make selecting elements very easy. Let’s see an example:

country_name_tags = page.select('.row .col-md-4.country .country-name')    #select country name elements

country_names = []
for country_name_tag in country_name_tags:    #extract text only
    country_names.append(country_name_tag.get_text())
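
CSS selectors also work on individual tags, which is handy when you want several fields per item. The sketch below selects each country block first and then uses select_one inside it. The HTML and the “country-capital” class are assumptions made up for this illustration, not something taken from the page above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; "country-capital" is an assumed class name.
html = """
<div class="row">
  <div class="col-md-4 country">
    <h3 class="country-name">Andorra</h3>
    <span class="country-capital">Andorra la Vella</span>
  </div>
  <div class="col-md-4 country">
    <h3 class="country-name">Belgium</h3>
    <span class="country-capital">Brussels</span>
  </div>
</div>
"""

page = BeautifulSoup(html, 'html.parser')

countries = []
for block in page.select('.row .col-md-4.country'):
    # select_one returns the first match inside this block, or None.
    name_tag = block.select_one('.country-name')
    capital_tag = block.select_one('.country-capital')
    if name_tag is None:
        continue  # skip blocks without a name
    countries.append({
        'name': name_tag.get_text(strip=True),
        'capital': capital_tag.get_text(strip=True) if capital_tag else None,
    })

print(countries)
```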

BeautifulSoup doesn’t support XPath. Check out Darian Moody’s article on why CSS selectors are better than XPath!

Pagination

This is an example of how to find the URLs of the other pages from a page’s pagination links:

link_tags = page.select('.pagination a')
for link_tag in link_tags:
    url = link_tag.get('href')
    # scrape url...
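
Pagination links are often relative and sometimes repeated (a “Next” link usually points to a page that is also listed by number). A minimal sketch of how you might turn them into a clean list of absolute URLs is below. The base_url and the HTML are hypothetical placeholders so the example runs offline.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical address of the page the links were found on.
base_url = 'https://example.com/pages/forms/'

# Hypothetical pagination markup.
html = """
<ul class="pagination">
  <li><a href="/pages/forms/?page_num=1">1</a></li>
  <li><a href="/pages/forms/?page_num=2">2</a></li>
  <li><a href="/pages/forms/?page_num=2">Next</a></li>
  <li><a>disabled</a></li>
</ul>
"""

page = BeautifulSoup(html, 'html.parser')

urls = []
for link_tag in page.select('.pagination a'):
    href = link_tag.get('href')
    if not href:
        continue  # skip anchors without an href attribute
    url = urljoin(base_url, href)  # resolve relative links
    if url not in urls:
        urls.append(url)  # keep the first occurrence only

print(urls)
```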
