Crawling with Scrapy – Download Images


One of the most useful features of Scrapy is that it can download and process images. In the e-commerce world, for example, retail companies use web scraping to collect online product data, and scraping images is necessary to match competitors’ products against their own. With Scrapy, you can easily download images from websites using the ImagesPipeline.

Downloading Images with Scrapy

The process of downloading images:

  1. First, install Pillow, the imaging library Scrapy uses to process images.
  2. Enable the ImagesPipeline: go to settings.py and add it as an item pipeline.
  3. Again in settings.py, define IMAGES_STORE, the path where downloaded images should be saved.
  4. In your item class, create two fields: image_urls and images.
  5. In your spider, scrape the URLs of the images you want to download and put them into the image_urls field (it has to be a list). Your spider’s job is then done: Scrapy’s ImagesPipeline downloads the image(s), and the scraper waits until downloading finishes (or fails).
  6. After the images are downloaded, the images field is populated with the results: a list of dictionaries with information about each image, such as its download path, URL, and checksum.

So let’s do it step by step:

    1. Install Pillow with pip:
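Pillow is not installed with Scrapy by default, so install it explicitly:

```shell
pip install Pillow
```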

    2. Enable ImagesPipeline:
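In settings.py, register the built-in pipeline (the priority value 1 here is arbitrary):

```python
# settings.py
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
```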

    3. Define a path for the images:
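The directory name below is just an example; Scrapy creates it if it does not exist:

```python
# settings.py -- all downloaded images land under this directory
IMAGES_STORE = "images"
```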

    4. Create field in item class:

    5. Scrape URLs in your spider:

image_urls needs to be a list and must contain ABSOLUTE URLs, which is why you sometimes have to convert relative URLs to absolute ones.

    6. If everything works correctly you will see an output something like this:
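The exact log output depends on your site, but the scraped item should have roughly this shape (all values below are placeholders, not real output):

```python
# Illustrative item after the pipeline has run
item = {
    "image_urls": ["https://example.com/full-size.jpg"],
    "images": [
        {
            "url": "https://example.com/full-size.jpg",
            "path": "full/<sha1-of-url>.jpg",  # relative to IMAGES_STORE
            "checksum": "<md5-of-file>",
        }
    ],
}
```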


Custom Names for Image Downloading Fields

You can define your own field names instead of image_urls and images. In the settings file, set these:
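For example, to rename both fields (the names my_image_urls and my_images are arbitrary choices):

```python
# settings.py
IMAGES_URLS_FIELD = "my_image_urls"
IMAGES_RESULT_FIELD = "my_images"
```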

Create Thumbnails of the Images

The ImagesPipeline can do this for you. Just include the dimensions of the desired thumbnails in settings.py and it creates them automatically, like this:
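The dictionary keys name the thumbnail sets (the sizes below are example values):

```python
# settings.py -- keys name each thumbnail set, values are (width, height)
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}
```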

It generates two kinds of thumbnails (a smaller and a bigger one) for each image, saving them into two different folders. The aspect ratio is kept.

File Expiration

Scrapy can check whether an image has already been downloaded recently, so it won’t download it again unnecessarily. You can define in the settings how long Scrapy should wait before downloading the same image again:
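The value is a number of days (30 here is just an example):

```python
# settings.py -- skip re-downloading images fetched within the last 30 days
IMAGES_EXPIRES = 30
```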

Custom Filenames for Images

The default filenames of the downloaded images are based on a SHA1 hash of their URLs, which in the real world doesn’t tell you what’s in an image without opening it. To use whatever filenames you want, extend the ImagesPipeline and override two methods: get_media_requests and file_path. In the first, return a Request object carrying the desired filename in its meta information. In the second, use that meta information to override the default file path. Sample code:

You create a field called image_name in the item class and populate it with the desired data in your spider.
