Crawling with Scrapy – Exporting JSON and CSV

If you’ve had a look at my previous posts in this Scrapy series, you now have an idea of how to scrape data from a page and how to follow links with Scrapy. The real beauty of web scraping is being able to use the scraped data afterwards. In most cases, the easiest and smartest way to store it is a simple JSON or CSV file. These formats are readable by humans and other software alike, so they will serve you well almost every time; when you work with huge amounts of data, though, it might be better to choose a database, which is more scalable.

Exporting JSON and CSV in Scrapy

There are a couple of ways to produce JSON or CSV files from your scraped data in Scrapy.

The first way is to use Feed Exports. You can run your scraper and store the data from the command line simply by setting the filename and the desired format.

The second way is to use an Item Pipeline. If you want to customize your output and produce structured JSON or CSV while your scraper runs, you can set the output properties in a pipeline rather than on the command line.

Exporting with Feed Export

As you learnt in the first post of this series, you can run your scraper from the command line with the scrapy crawl myspider command. If you want to create output files, you have to set the filename and the extension you want to use:

scrapy crawl myspider -o data.json

scrapy crawl myspider -o data.csv

scrapy crawl myspider -o data.xml

Scrapy has built-in exporters to generate JSON, CSV, XML and several other serialization formats.

If you want to specify a relative or absolute path for the produced file, or set other feed properties, you can do that from the command line as well:

scrapy crawl reddit -s FEED_URI='/home/user/folder/mydata.csv' -s FEED_FORMAT=csv 

scrapy crawl reddit -s FEED_URI='mydata.json' -s FEED_FORMAT=json
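
If you don’t want to repeat these flags on every run, the same settings can live in your project’s settings.py. A minimal sketch; the path is a placeholder:

# settings.py -- equivalent of the -s flags above
FEED_URI = 'mydata.json'        # relative or absolute path, placeholder here
FEED_FORMAT = 'json'
FEED_EXPORT_ENCODING = 'utf-8'  # keeps non-ASCII characters readable in the output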

Exporting with Item Pipeline

The Scrapy Item Pipeline is a universal tool to process your data. Typical usages are cleaning HTML, validating scraped data, dropping duplicates and storing scraped data in a database. Use pipelines if you want a convenient and customizable way to store your data.
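
To give you an idea of the shape of a pipeline, here is a minimal duplicate-dropping sketch; the url field is an assumption about your item schema:

from scrapy.exceptions import DropItem

class DropDuplicatesPipeline(object):
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        # 'url' is a placeholder field -- use whatever uniquely identifies your items
        if item['url'] in self.seen:
            raise DropItem("Duplicate item found: %s" % item['url'])
        self.seen.add(item['url'])
        return item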

For JSON output you need to use JsonItemExporter:

from scrapy.exporters import JsonItemExporter

class JsonPipeline(object):
    def __init__(self):
        # Open the output file in binary mode; the exporter writes bytes
        self.file = open("books.json", 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        # Write the closing bracket of the JSON array and release the file
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
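
Note that JsonItemExporter writes all items into one big JSON array. If you would rather have one JSON object per line (the .jl format, which is easier to stream and to append to), Scrapy also ships a JsonLinesItemExporter with the same interface. A sketch reusing the pipeline above:

from scrapy.exporters import JsonLinesItemExporter

class JsonLinesPipeline(JsonPipeline):
    # Inherits close_spider() and process_item() from JsonPipeline above
    def __init__(self):
        self.file = open("books.jl", 'wb')
        self.exporter = JsonLinesItemExporter(self.file, encoding='utf-8')
        self.exporter.start_exporting()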

It works the same way with CSV, but you have to use CsvItemExporter:

from scrapy.exporters import CsvItemExporter

class CsvPipeline(object):
    def __init__(self):
        self.file = open("booksdata.csv", 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
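
By default the column order in the produced CSV simply follows the order in which fields are exported. If you need fixed, predictable columns, CsvItemExporter takes a fields_to_export list. A sketch reusing the pipeline above; the field names are placeholders for your item’s own fields:

from scrapy.exporters import CsvItemExporter

class OrderedCsvPipeline(CsvPipeline):
    # Inherits close_spider() and process_item() from CsvPipeline above
    def __init__(self):
        self.file = open("booksdata.csv", 'wb')
        self.exporter = CsvItemExporter(self.file, fields_to_export=['title', 'price'])
        self.exporter.start_exporting()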

Be aware that in a CSV file the fields are separated by "," (comma) by default. If your fields contain text with commas that would break the structure, you may want to create a function that quotes them:

def create_valid_csv(self, item):
    # Wrap comma-containing values in quotes so the CSV structure stays intact
    for key, value in item.items():
        if isinstance(value, str) and "," in value:
            item[key] = '"' + value + '"'

You have to invoke this function before exporting the item so the exporter recognizes the commas in the data and structures the output accordingly.
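
Concretely, that means calling it at the top of process_item in the CSV pipeline, before the item reaches the exporter:

def process_item(self, item, spider):
    # Quote comma-containing fields before handing the item to the exporter
    self.create_valid_csv(item)
    self.exporter.export_item(item)
    return item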

Configure settings.py

It’s very important to tell Scrapy that you want to use Item Pipelines, otherwise your pipelines won’t be invoked.

You have to add these lines to the settings.py of your Scrapy project:

ITEM_PIPELINES = {
    'RedditScraper.pipelines.JsonPipeline': 300,
    'RedditScraper.pipelines.CsvPipeline': 500,
}

If you are wondering what those numbers mean: they indicate the priority of the pipelines. Pipelines with lower numbers run first, so in this example the JsonPipeline will be executed before the CsvPipeline. The numbers must be in the 0-1000 range.
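
If a particular spider should use only one of the pipelines, you can also override this setting per spider through the custom_settings attribute. A sketch; the spider name is a placeholder:

import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'  # placeholder spider name
    # Per-spider override: run only the CSV pipeline for this spider
    custom_settings = {
        'ITEM_PIPELINES': {'RedditScraper.pipelines.CsvPipeline': 500},
    }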
