Crawling with Scrapy – Scrapy Items

Web scraping turns the unstructured data found on web pages into structured data; that is essentially its whole goal. Structured data can mean records stored in a database such as MongoDB or an SQL database, but in most cases a simple format such as JSON, CSV or XML is enough. Once the raw information from a website is organized this way, you can process it however you need. Before the data reaches its final destination it has to be held in some in-memory data type, at least for a while. In Scrapy, the Item class fills that role: it holds the scraped data until you export it to one of the structured formats mentioned above.
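To make the idea of "simple data structures" concrete, here is a minimal sketch that uses only the Python standard library, not Scrapy itself. The records, their values, and the helper names to_json and to_csv are made up for illustration; the point is only that data held in built-in types can be turned into JSON or CSV with very little code.

```python
import csv
import io
import json

# Hypothetical scraped records held in built-in types (a list of dicts).
books = [
    {"title": "Example Book", "author": "Jane Doe", "length": 320},
    {"title": "Another Book", "author": "John Roe", "length": 210},
]


def to_json(records):
    """Serialize the records to a JSON string."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize the records to CSV text, using the first record's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


print(to_json(books))
print(to_csv(books))
```

In a real project you would rarely write this by hand, but it shows what "structured" means here: every record has the same named fields.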

Scrapy Item

To extract data with Scrapy, we define Item classes in our project to hold the information we want to fetch. You can define as many items as you wish; just try to organize them in the most readable and logical way. Each item class has to derive from the scrapy.Item class.

import scrapy


class BookExampleItem(scrapy.Item):
    # Each Field() declares one piece of data the item can hold.
    title = scrapy.Field()
    author = scrapy.Field()
    length = scrapy.Field()
    paperback = scrapy.Field()
    publisher = scrapy.Field()

It’s as simple as that. Each scrapy.Field() you add declares one field the item can store.

Associate Fields with data

Now you know how to create Item classes. Setting values on them works the same way as setting values in a dictionary.

book_item = BookExampleItem()
book_item["title"] = 'title'
book_item["author"] = 'author'
book_item["length"] = 'length'
book_item["paperback"] = 'paperback'
book_item["publisher"] = 'publisher'

This way you can easily set values on your Item fields in your spider script, just as you would with a dict.

Now you have a basic understanding of how Scrapy Items work, so you can start web scraping right away. Check out how you can build a spider or paginate on a website with Scrapy. If you have any questions or issues, comment below and I will help you.

