Web Scraping in Python with BeautifulSoup


I’m often asked, “Which web scraping library should I choose?” I usually answer: choose the one that is most popular in your programming language. If it’s Java, choose Jsoup. If it’s Python, BeautifulSoup is your best bet.

 

BeautifulSoup Installation

You can install the most recent version of BeautifulSoup with pip:
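The package is published on PyPI as beautifulsoup4 and imported as bs4, so the command is pip install beautifulsoup4. As a quick sanity check (a minimal sketch, assuming the install succeeded in the active environment), the following should print the installed version:

```python
import bs4

# If the import succeeds, the package is installed; print its version string.
installed_version = bs4.__version__
print(installed_version)
```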

Web Scraping with BeautifulSoup

BeautifulSoup only parses HTML; it cannot download a page from the internet, so you need an HTTP library such as urllib2 (urllib.request in Python 3) to fetch it.

First, request the page you want to scrape (https://scrapethissite.com/pages/simple/ in our case), setting a proper user agent on the request to identify yourself.
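A minimal sketch of this step, using Python 3's standard urllib.request module (the successor to urllib2). The user agent string and the helper names build_request and fetch_html are illustrative assumptions, not part of any library:

```python
from urllib.request import Request, urlopen

URL = "https://scrapethissite.com/pages/simple/"
# Hypothetical user agent; replace it with one that identifies you or your project.
USER_AGENT = "my-scraper/0.1 (contact: me@example.com)"


def build_request(url, user_agent=USER_AGENT):
    """Create a request object that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent=USER_AGENT, timeout=10):
    """Download the page and return its body decoded as text."""
    with urlopen(build_request(url, user_agent), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Building the request does not touch the network yet.
request = build_request(URL)
print(request.full_url)
print(request.get_header("User-agent"))
```

Calling fetch_html(URL) performs the actual download and returns the HTML as a string.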

We have one more task before scraping: deciding which parser library BeautifulSoup should use. Python has a built-in parser, html.parser. I’m using it in this example, but you can choose a third-party library instead, for example lxml.
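As a rough sketch, the parser is chosen with the second argument to the BeautifulSoup constructor. The SAMPLE_HTML string is a made-up stand-in for the downloaded page (its markup is an assumption), so the snippet runs without a network connection:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML downloaded in the previous step.
SAMPLE_HTML = """
<html>
  <head><title>Countries of the World</title></head>
  <body>
    <h3 class="country-name">Andorra</h3>
    <h3 class="country-name">Belgium</h3>
  </body>
</html>
"""

# "html.parser" ships with Python; pass "lxml" instead if that package is installed.
page = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(type(page).__name__)
```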

Now that BeautifulSoup is set up correctly, we can use our “page” object to navigate and find elements on the page.

Selectors

Using our “page” object, you can navigate to any tag on the page and extract it right away:
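A small sketch of tag navigation, run against a made-up HTML snippet (the markup is an assumption, not the real page). Note that attribute access such as page.h3 returns only the first matching tag:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the scraped page.
SAMPLE_HTML = """
<html>
  <head><title>Countries of the World</title></head>
  <body>
    <h3 class="country-name"> Andorra </h3>
    <h3 class="country-name"> Belgium </h3>
  </body>
</html>
"""
page = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Dotted access walks to the first tag with that name.
title_text = page.title.string
first_country = page.h3.get_text(strip=True)
print(title_text)
print(first_country)
```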

Also, you can find elements by passing the tag and class name as arguments like this:
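A sketch of such a call, using find_all with the class_ keyword (the trailing underscore avoids Python's reserved word class). The sample markup is an assumption for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the plain <h3> is there to show that the class filter applies.
SAMPLE_HTML = """
<body>
  <h3>Other heading</h3>
  <h3 class="country-name"> Andorra </h3>
  <h3 class="country-name"> Belgium </h3>
</body>
"""
page = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Match by tag name and CSS class at the same time.
country_tags = page.find_all("h3", class_="country-name")
country_names = [tag.get_text(strip=True) for tag in country_tags]
print(country_names)
```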

This statement selects all elements that have the “h3” tag and the class “country-name”.

BeautifulSoup supports CSS selectors, which make selecting elements very easy. Let’s see an example:
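One possible example, using select (all matches) and select_one (first match). The div.country and span.country-capital names are assumptions made up for this sketch:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; class names other than "country-name" are invented here.
SAMPLE_HTML = """
<body>
  <div class="country">
    <h3 class="country-name">Andorra</h3>
    <span class="country-capital">Andorra la Vella</span>
  </div>
  <div class="country">
    <h3 class="country-name">Belgium</h3>
    <span class="country-capital">Brussels</span>
  </div>
</body>
"""
page = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Descendant selector: every capital span nested inside a country block.
capitals = [tag.get_text(strip=True)
            for tag in page.select("div.country span.country-capital")]
# select_one returns only the first element that matches.
first_name = page.select_one("h3.country-name").get_text(strip=True)
print(capitals)
print(first_name)
```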

BeautifulSoup doesn’t support XPath. Check out Darian Moody’s article on why CSS selectors are better than XPath!

Pagination

This is an example of how to find the URLs on a page like this.
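One way this could look, as a sketch: collect the href of every link inside the pagination block and turn it into an absolute URL. The markup, the ul.pagination selector, and the example.com base URL are all assumptions for illustration; the real page's structure may differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and pagination markup.
BASE_URL = "https://example.com/listing/"
SAMPLE_HTML = """
<ul class="pagination">
  <li><a href="/listing/?page=1">1</a></li>
  <li><a href="/listing/?page=2">2</a></li>
  <li><a href="/listing/?page=3">3</a></li>
  <li><a href="/listing/?page=2">Next</a></li>
</ul>
"""
page = BeautifulSoup(SAMPLE_HTML, "html.parser")

page_urls = []
for link in page.select("ul.pagination a[href]"):
    # Resolve relative links against the page they were found on.
    absolute_url = urljoin(BASE_URL, link["href"])
    # Skip duplicates such as the "Next" link, keeping first-seen order.
    if absolute_url not in page_urls:
        page_urls.append(absolute_url)
print(page_urls)
```

Each URL in page_urls can then be fetched and parsed in the same way as the first page.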
