Crawling with Scrapy – Exporting JSON and CSV

If you’ve had a look at my previous posts in this Scrapy series, you already have an idea of how to scrape data from a page and how to follow links with Scrapy. The real beauty of web scraping is being able to use the scraped data. In most cases, the easiest and smartest way to store scraped data is a simple JSON or CSV file. Both formats are readable by humans and by other software, so they fit almost every case. When you work with a huge amount of data, though, a database might be the better choice because it scales better.

Exporting JSON and CSV in Scrapy

There are two main ways to produce JSON or CSV files containing your data in Scrapy.

The first way is to use Feed Exports. You can run your scraper and store your data from the command line by setting the filename and desired format.

The second way is to use an Item Pipeline. You may want to customize your output and produce structured JSON or CSV while your scraper runs. With a pipeline you set the output properties in code rather than on the command line.

Exporting with Feed Export

As you learnt in the first post of this series, you can run your scraper from the command line with the scrapy crawl myspider command. If you want to create an output file, you have to set the filename and extension you want to use with the -o option, for example scrapy crawl myspider -o data.json.

Scrapy has a built-in tool, Feed Exports, that generates JSON, CSV, XML and other serialization formats.

If you want to specify a relative or absolute path for the output file, or set other properties from the command line, you can do that as well.
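The same options can also be kept in the project settings instead of being typed on every run. The fragment below is only a sketch: it assumes Scrapy 2.1 or later, where the FEEDS setting is available, and the paths and field names are made up for illustration.

```python
# settings.py (sketch): feed exports configured in code.
# Roughly comparable to: scrapy crawl myspider -o output/items.json
FEEDS = {
    # Relative path, resolved against the directory the crawl is started from.
    "output/items.json": {
        "format": "json",
        "encoding": "utf8",
        "indent": 4,
    },
    # Absolute path; "fields" fixes which columns are written and in what order.
    "/tmp/scrapy-export/items.csv": {
        "format": "csv",
        "fields": ["title", "url"],
    },
}
```

Each key is an output location and each value holds the options for that feed, so one crawl can write several files at once.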

Exporting with Item Pipeline

Scrapy’s Item Pipeline is a general-purpose tool for processing your data. Typical uses are cleaning HTML, validating scraped data, dropping duplicates and storing scraped data in a database. You can use pipelines if you want a convenient and customizable process to store your data.

You need to use JsonItemExporter:

It works the same way with CSV, but you have to use CsvItemExporter instead:

Be aware that in a CSV file the fields are separated by a comma by default. If your fields contain text with commas and the program reading the file cannot cope with that, you may want to create a function which fixes this:
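The helper below is one possible approach, not necessarily the one originally used in this post. It relies only on the standard library and first shows that Python's csv module, which CsvItemExporter writes through, already quotes fields containing a comma. The replace_commas function is a hypothetical helper for consumers that split lines naively on commas.

```python
import csv
import io


def replace_commas(item, replacement=";"):
    """Return a copy of item with commas in string values replaced.

    Hypothetical helper: only needed when the CSV consumer splits
    lines on "," and cannot handle quoted fields.
    """
    return {
        key: value.replace(",", replacement) if isinstance(value, str) else value
        for key, value in item.items()
    }


item = {"title": "Hello, world", "price": 10}

# The csv module quotes a field that contains the delimiter,
# so a CSV-aware reader gets the original value back.
buffer = io.StringIO()
csv.writer(buffer).writerow([item["title"], item["price"]])
quoted_line = buffer.getvalue()
round_trip = next(csv.reader(io.StringIO(quoted_line)))

# For naive consumers, strip the commas before exporting instead.
cleaned = replace_commas(item)
```

Inside a pipeline this would be applied in process_item, for example by passing replace_commas(item) to the exporter's export_item method.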

You have to call this function before exporting the item, so that the ItemExporter receives values that no longer break the column structure.

Configure settings.py

It’s very important to tell Scrapy that you want to use Item Pipelines, otherwise your pipelines won’t be invoked.

You have to register them in the ITEM_PIPELINES setting in the settings.py of your Scrapy project.
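A sketch of that setting is shown here. The module path myproject.pipelines and the class name CsvPipeline are placeholders, so adjust them to your own project.

```python
# settings.py (sketch): "myproject" stands in for your project package.
ITEM_PIPELINES = {
    "myproject.pipelines.JsonPipeline": 300,  # lower number, runs first
    "myproject.pipelines.CsvPipeline": 500,
}
```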

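If you are wondering what the numbers assigned to the pipelines mean, they indicate the order in which the pipelines run: items pass through the pipelines from the lowest number to the highest. In this example the JsonPipeline will be executed sooner. By convention the numbers are chosen from the 0-1000 range.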

 
