Crawling with Scrapy – Crawling Settings


Scrapy provides a convenient way to customize the crawling behavior of your scraper, including the core mechanism, pipelines, and spiders. When you create a new Scrapy project with the scrapy startproject command, you will find a settings.py file. This is where you customize your scraper's settings.

Scrapy Settings

Let’s examine the key settings you may need to modify in each project.

settings.py

The user agent should identify who you are. Many websites will refuse your requests if you visit without a user agent.
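A minimal sketch of what this looks like in settings.py; the project name and contact URL are placeholder examples:

```python
# settings.py
# Identify your scraper to the sites you visit.
# "mycrawler" and the contact URL below are hypothetical examples.
USER_AGENT = "mycrawler (+https://www.example.com/contact)"
```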

By default this is set to True, so your scraper will follow the rules defined in the site’s robots.txt. Whenever you scrape a website, your scraper should operate ethically.
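In a freshly generated project the setting looks like this:

```python
# settings.py
# Respect the crawl rules the site publishes in robots.txt.
ROBOTSTXT_OBEY = True
```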

Pipelines are meant to process items right after scraping. You have to make it explicit which pipelines you want to apply while scraping. In this example the second pipeline is commented out, so it won’t be invoked.
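A sketch of such a configuration; the project and pipeline names are hypothetical, and the number (0–1000) controls the order in which active pipelines run:

```python
# settings.py
# Lower numbers run first; a commented-out pipeline is never invoked.
ITEM_PIPELINES = {
    "myproject.pipelines.CleaningPipeline": 300,
    # "myproject.pipelines.DatabasePipeline": 400,
}
```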

Here you have to declare where the spiders live inside your project.
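For a project named, say, myproject, the generated settings.py contains something like:

```python
# settings.py
# Where Scrapy looks for existing spiders, and where "scrapy genspider"
# creates new ones. "myproject" is a placeholder project name.
SPIDER_MODULES = ["myproject.spiders"]
NEWSPIDER_MODULE = "myproject.spiders"
```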

The maximum number of requests Scrapy can make at the same time. This is 16 by default. Be careful when you set it, to avoid damaging the website!
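Lowering it from the default, for example, might look like this (8 is an arbitrary illustrative value):

```python
# settings.py
# Maximum number of concurrent requests (Scrapy's default is 16).
CONCURRENT_REQUESTS = 8
```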

The delay between requests, given in seconds. The default value is 0! You might want to modify this to be nicer to the website.
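A sketch with an illustrative two-second delay:

```python
# settings.py
# Wait this many seconds between requests to the same site (default is 0).
DOWNLOAD_DELAY = 2
```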

Some websites recognize Scrapy’s default request headers, so it might be a good idea to customize them.
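A minimal sketch of customized headers; the exact values you send will depend on what the target site expects:

```python
# settings.py
# Headers sent with every request unless overridden per request.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}
```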

Scrapy’s AutoThrottle extension adjusts the crawling speed dynamically, based on the load of both the Scrapy server and the crawled website’s server. In high-volume projects it’s useful to enable.
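A sketch of enabling it; the delay and concurrency values below are illustrative starting points, not recommendations:

```python
# settings.py
# Let Scrapy adapt the delay between requests based on server response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5        # initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60         # maximum delay under heavy latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per server
```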

This is a very brief guide to Scrapy settings. These are the settings I adjust most frequently, in almost every project. I suggest checking out the official docs here if you want to know more.

 

Download a FREE copy of “Learn Web Scraping From Scratch”

Modify Settings in Command Line

You can override any setting on the command line with -s (or --set):
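For example, assuming a spider named "quotes" (a hypothetical name), you could slow it down for one run without touching settings.py:

```shell
# Override settings for a single run; later -s flags can be stacked.
scrapy crawl quotes -s DOWNLOAD_DELAY=2 -s CONCURRENT_REQUESTS=4
```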

Settings for Specific Spiders

You can define settings specifically for certain spiders via the custom_settings class attribute:

This will override the DOWNLOAD_DELAY value defined in settings.py, for that spider only.

Accessing Settings Objects

In your spiders you have access to the settings through self.settings:

If you want to access settings in your pipeline, you have to override the from_crawler method. The crawler has a settings attribute:
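A sketch of the pattern; the pipeline name is hypothetical, and the setting read here could be any key from settings.py:

```python
class DelayAwarePipeline:
    """Hypothetical pipeline that reads a setting at construction time."""

    def __init__(self, download_delay):
        self.download_delay = download_delay

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings holds the fully merged project settings.
        return cls(download_delay=crawler.settings.getint("DOWNLOAD_DELAY"))

    def process_item(self, item, spider):
        # Items pass through unchanged in this sketch.
        return item
```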

 

You should use these settings objects according to their API.
