Crawling with Scrapy – ItemLoader


Item Loaders are used to populate your items. Earlier, you learned how to create Scrapy Items and store your scraped data in them. Essentially, Item Loaders provide a way to populate these Items and run any input or output processing you want along the way. Maybe you need to parse the scraped data further, filter it, clean up HTML, or validate it. You can do all of this with an Item Loader.

Scrapy Item Loader

When instantiating an Item Loader to populate items, you don’t use dict-like syntax. Instead, you construct it with the item you want to populate and a Response or Selector object. The item is the Scrapy Item instance you want to populate; the selector or response object defines where the data you need to extract comes from.

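A minimal sketch of what such a spider might look like; MyItem and its text, title and url fields are illustrative names rather than code from a real project:

```python
import scrapy
from scrapy.loader import ItemLoader
from myproject.items import MyItem  # hypothetical item with text, title and url fields

class MySpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        loader = ItemLoader(item=MyItem(), response=response)
        loader.add_css("text", "div.description")  # populate the text field via a CSS selector
        loader.add_xpath("title", "//h1/text()")   # XPath selectors work too
        loader.add_value("url", response.url)      # a value defined in code, not scraped
        yield loader.load_item()                   # build and return the populated item
```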

This is how you create an ItemLoader object and then populate the item. With the add_css(field_name, css_selector) method you populate the text field defined in MyItem; the CSS selector picks the data out of the response. As you can see in the snippet above, XPath is supported as well via add_xpath. With add_value you can also assign a value that is not scraped from the website but defined directly in the code.

It’s important to note that when you invoke add_css, add_xpath or add_value, the data is extracted but the item is not populated yet. The item is populated only when you invoke the load_item() method, which returns the scraped item.

Input and Output Processors

As I mentioned, you can do further parsing and processing on your scraped data with an Item Loader. When you declare an ItemLoader you assign an item to it, and each field of this item has an input processor and an output processor.

Here’s a small snippet, followed by a step-by-step explanation of how it works:
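A sketch of what that snippet might look like; selector1, selector2, xpath1 and my_title are placeholders standing in for real selectors and values:

```python
loader = ItemLoader(item=MyItem(), response=response)
loader.add_css("name", selector1)     # step 1: selector1 is a placeholder CSS selector string
loader.add_css("name", selector2)     # step 2
loader.add_xpath("name", xpath1)      # step 3: xpath1 is a placeholder XPath string
loader.add_value("name", "my_title")  # step 4
item = loader.load_item()             # step 5
```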

  1. The data is extracted by selector1; the input processor is invoked and the extracted data is stored in the ItemLoader (not in the item!) under name.
  2. The data is extracted by selector2; the same input processor is invoked and the extracted data is appended to name (collected in step 1) and stored in the ItemLoader.
  3. The data is extracted by xpath1; the same input processor is invoked and the extracted data is appended to name (collected in steps 1 and 2) and stored in the ItemLoader.
  4. my_title is assigned to name and the input processor is invoked.
  5. Now that all the data has been collected, the output processor is invoked and the ItemLoader populates the item.


Now that you know how and when input/output processors are invoked, let’s see how to declare them. You can do it in several ways; here we will declare them in the Item class, as shown below.
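A sketch of such an item definition, assuming illustrative title, price and tags fields (on newer Scrapy versions the processors are imported from itemloaders.processors instead of scrapy.loader.processors):

```python
from scrapy import Field, Item
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags  # w3lib ships with Scrapy

class MyItem(Item):
    title = Field(
        input_processor=MapCompose(remove_tags),
        output_processor=TakeFirst(),
    )
    price = Field(
        input_processor=MapCompose(remove_tags),
        output_processor=TakeFirst(),
    )
    tags = Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(', '),
    )
```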

You can see that we use the same input processor for each field. MapCompose(remove_tags) removes the HTML tags around the text that contains the data. For the first and second fields we apply the same output processor, TakeFirst(). It does the same thing as a Selector’s extract_first() in a spider: it returns a single value rather than a list. For the last field we apply Join(', ') as the output processor, which simply joins the elements one after another, separated by commas.

You can define default input/output processors for your ItemLoader like this:
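For example, something along these lines; MyItemLoader is an illustrative name, while default_input_processor and default_output_processor are the attribute names ItemLoader actually looks up:

```python
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst
from w3lib.html import remove_tags

class MyItemLoader(ItemLoader):
    # applied to every field that has no field-specific processor declared
    default_input_processor = MapCompose(remove_tags)
    default_output_processor = TakeFirst()
```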

Scrapy ItemLoader Built-in Processors

Scrapy has some built-in input/output processors which you can use. You can also use any callable as a processor.

Identity

It returns the original values unchanged. You might use it when you’ve defined a different default processor for your ItemLoader.

TakeFirst

It returns the first non-null, non-empty value out of the selected elements. It is usually used as an output processor when we want a single value to be stored.

Join

It returns the values joined together with the separator specified in the constructor (u' ' is the default).

Compose

This processor takes multiple functions in its constructor and invokes them in the given order: the first function returns the modified value, the next function modifies that value in turn and returns it, and so on.

MapCompose

It works the same way as the Compose processor; the difference is how returned values are passed to the next function. The input value of this processor (typically a list) is iterated over and the first function is applied to each element. The values returned by the first function are collected to build a new iterable, which is then passed to the next (second) function, and so on. The output iterable of the last function is the output of the processor.
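Since the processors are plain callables, their behaviour can be sketched directly in a Python shell; the expected results are shown as comments (again, import from itemloaders.processors on newer Scrapy versions):

```python
from scrapy.loader.processors import Compose, Identity, Join, MapCompose, TakeFirst

Identity()(["a", "b"])                         # ['a', 'b']  (values pass through unchanged)
TakeFirst()([None, "", "first", "second"])     # 'first'     (first non-null, non-empty value)
Join(", ")(["a", "b", "c"])                    # 'a, b, c'
Compose(lambda v: v[0], str.upper)(["hi"])     # 'HI'        (first function gets the whole list)
MapCompose(str.strip, str.upper)([" a ", "b"]) # ['A', 'B']  (functions applied per element)
```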

Download Full Source Code!