Crawling with Scrapy – Pagination with CrawlSpider

In the previous Scrapy tutorial you learnt how to scrape information from a single page. Going further with web scraping, you will often need to visit a bunch of URLs within a website and run the same scraping script on each of them. In my Jsoup tutorial and BeautifulSoup tutorial I showed you how to paginate on a website; now you will learn how to do it with Scrapy. The CrawlSpider module makes it super easy.

Pagination with Scrapy

As a relevant example, we are going to scrape some data from Amazon. As usual, Scrapy will do most of the work, and this time we're using its CrawlSpider module. It provides an attribute called rules: a tuple in which we define rules about which links we want our crawler to follow.

First and foremost, we should set up a User Agent, because we want our crawler to see the site the way we see it in a browser. That's fine, because we don't intend to do anything harmful. You can set a User Agent in settings.py:
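A minimal sketch of what that setting could look like (any recent browser User Agent string will do; the one below is just an example):

```python
# settings.py
# A browser-like User Agent so pages are served the way a regular
# browser would receive them.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0 Safari/537.36"
)
```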

Scrapy CrawlSpider Rule Attribute

So, for example, say we're looking for the most-reviewed books in each category.
In our spider file we create a Rule which describes where the book category links are on the page, and a callback method we want to execute on each category page (our starting URL is the Amazon books page):
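A minimal sketch of such a spider; the start URL and the restrict_css selector are assumptions and have to be adapted to the actual page markup:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class AmazonBooksSpider(CrawlSpider):
    name = "amazon_books"
    allowed_domains = ["amazon.com"]
    # Assumed starting point: the Amazon books category page.
    start_urls = ["https://www.amazon.com/books-used-books-textbooks/b?node=283155"]

    rules = (
        # Follow the category links and call parse_category on every
        # category page we reach. The restrict_css value is a placeholder
        # for wherever the category links actually live in the sidebar.
        Rule(
            LinkExtractor(restrict_css="div.categoryRefinementsSection"),
            callback="parse_category",
        ),
    )

    def parse_category(self, response):
        # Sorting and book-link extraction happen here (see below).
        pass
```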

Now our crawler knows where to go. We know that Amazon, like most modern sites, uses JavaScript to display content. In some cases that makes scraping much more complicated, but luckily Amazon works perfectly well without any JavaScript, so we don't need a headless browser or anything like that.

Scrapy FormRequest

As I said, we want the most-reviewed books in each category, let's say the first 12 we can find on the first page. But first we have to sort the books by number of reviews.

If you have a look at the source, you'll see that we need to submit a form in order to sort. The “Go” button that refreshes the page according to the form is only visible when the page is viewed without JavaScript enabled, which is exactly how our crawler sees it. In Scrapy we can use a FormRequest object to submit a form:
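As a rough sketch, continuing the spider from above; the form index and the field name/value ("sort": "review-count-rank") are assumptions and have to be read from the real page source:

```python
    # inside AmazonBooksSpider (continued from above);
    # needs "from scrapy import FormRequest" at the top of the file
    def parse_category(self, response):
        # Submit the sort form so the category page lists books by
        # number of reviews, then continue in parse_sorted_page below.
        yield FormRequest.from_response(
            response,
            formnumber=0,
            formdata={"sort": "review-count-rank"},
            callback=self.parse_sorted_page,
        )
```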

Data extraction

The next and final thing we have to do is follow each link that takes the crawler to a book's page, where we invoke the parse_book_page method, which takes care of scraping the data we're looking for.
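A possible sketch, again inside the spider from above; the CSS selector for the book links is an assumption about the result-list markup:

```python
    # inside AmazonBooksSpider (continued from above)
    def parse_sorted_page(self, response):
        # Follow the first 12 book links on the sorted category page
        # and hand each book page to parse_book_page.
        book_links = response.css("div.s-result-item h2 a::attr(href)").getall()
        for href in book_links[:12]:
            yield response.follow(href, callback=self.parse_book_page)
```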

Finally, we extract the desired details in parse_book_page. If you do not know how to extract data from a single page, go here: Crawling with Scrapy – How to Scrape a Single Page
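For completeness, a tiny sketch of what parse_book_page might yield; all selectors here are assumptions, and the single-page tutorial covers how to pick the right ones:

```python
    # inside AmazonBooksSpider (continued from above)
    def parse_book_page(self, response):
        # Collect the fields we care about into one item.
        yield {
            "title": response.css("#productTitle::text").get(default="").strip(),
            "price": response.css("#price ::text").get(),
            "reviews": response.css("#acrCustomerReviewText::text").get(),
        }
```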

Download Full Source Code!