Store Scraped Data In SQL Database With Scrapy

When we scrape data from the web, it’s usually because we need to use it later. In order to do that, we have to store it somehow. In some cases a simple data file like JSON or CSV is just fine for the job. But other times we need more structure and control over our data, and in that case storing the data in an SQL database might be our best choice.

In this article we’re going to see how to properly store scraped data in a database, and we’re going to discuss some best practices for doing that inside Scrapy. We’ll be using Scrapy, the web scraping framework, to extract and prepare the data, and the python-mysqldb library to create the connection to the database.

Database Pipeline

The most straightforward way to store scraped items in a database is to use a database pipeline. In Scrapy, pipelines can be used to filter, drop, clean and process scraped items. So first, let’s create our pipeline and add a constructor that receives the database settings:

class DatabasePipeline(object):

    def __init__(self, db, user, passwd, host):
        self.db = db
        self.user = user
        self.passwd = passwd
        self.host = host

    def process_item(self, item, spider):
        return item

Now, we could also just hardcode these database settings into the pipeline, and some people may prefer that. But I believe a better approach is to define credentials and similar settings in the Scrapy settings file (settings.py) and then implement the from_crawler function. This function returns a new pipeline instance, and the nice thing about it is that it has access to the Scrapy settings as well.

Inside the from_crawler function, we grab the credentials and other settings from the settings.py file and initialize the new pipeline. The code:

@classmethod
def from_crawler(cls, crawler):
    db_settings = crawler.settings.getdict("DB_SETTINGS")
    if not db_settings:  # no DB settings defined in settings.py
        raise NotConfigured  # so raise an error
    db = db_settings['db']
    user = db_settings['user']
    passwd = db_settings['passwd']
    host = db_settings['host']
    return cls(db, user, passwd, host) # returning pipeline instance

Our settings.py should include this piece of code:

DB_SETTINGS = {
    'db': "my_db",
    'user': 'root',
    'passwd': 'my_pass',
    'host': '0.0.0.0',
}

Okay, now we have the pipeline and the database settings. The next step is to implement the pipeline to do the following:

  1. Initiate database connection
  2. Insert items into db
  3. Close connection

1. Initiate database connection

We’re going to implement the open_spider method. This method is invoked when the spider starts running, and it will handle initiating the database connection:

def open_spider(self, spider):
    self.conn = MySQLdb.connect(db=self.db,
                                user=self.user, passwd=self.passwd,
                                host=self.host,
                                charset='utf8', use_unicode=True)
    self.cursor = self.conn.cursor()

We are using the values defined earlier in settings.py.

2. Insert items into db

We’ve got the connection; now we need to process the items. The process_item method needs to return an item, but before returning it we execute a query that inserts the item into the database:

def process_item(self, item, spider):
    sql = "INSERT INTO table (field1, field2, field3) VALUES (%s, %s, %s)"
    self.cursor.execute(sql,
                        (
                            item.get("field1"),
                            item.get("field2"),
                            item.get("field3"),
                        )
                       )
    self.conn.commit()
    return item
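
Note that my_table and the field1, field2, field3 columns above are just placeholders for your own schema, and the INSERT assumes the target table already exists. As a minimal sketch (with made-up column types, reusing the credentials from DB_SETTINGS above), you could create the table once before running the spider:

import MySQLdb

# Hypothetical schema -- adjust the table name, columns and types to your own items
conn = MySQLdb.connect(db="my_db", user="root", passwd="my_pass", host="0.0.0.0")
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS my_table (
        id INT AUTO_INCREMENT PRIMARY KEY,
        field1 VARCHAR(255),
        field2 VARCHAR(255),
        field3 TEXT
    ) DEFAULT CHARSET=utf8
""")
conn.close()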

3. Close connection

Similar to open_spider, Scrapy has a method called close_spider. This method is invoked when the spider has finished its work and is closed. We close our database connection inside this method:

def close_spider(self, spider):
    self.conn.close()

And that’s it. Our data is saved to the database using a Scrapy pipeline. The full pipeline code:

import MySQLdb
from scrapy.exceptions import NotConfigured


class DatabasePipeline(object):

    def __init__(self, db, user, passwd, host):
        self.db = db
        self.user = user
        self.passwd = passwd
        self.host = host

    @classmethod
    def from_crawler(cls, crawler):
        db_settings = crawler.settings.getdict("DB_SETTINGS")
        if not db_settings:
            raise NotConfigured
        db = db_settings['db']
        user = db_settings['user']
        passwd = db_settings['passwd']
        host = db_settings['host']
        return cls(db, user, passwd, host)

    def open_spider(self, spider):
        self.conn = MySQLdb.connect(db=self.db,
                                    user=self.user, passwd=self.passwd,
                                    host=self.host,
                                    charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        sql = "INSERT INTO table (field1, field2, field3) VALUES (%s, %s, %s)"
        self.cursor.execute(sql,
                            (
                                item.get("field1"),
                                item.get("field2"),
                                item.get("field3"),
                            )
                            )
        self.conn.commit()
        return item


    def close_spider(self, spider):
        self.conn.close()
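
One more thing to keep in mind: Scrapy only runs a pipeline if it is enabled in settings.py. Assuming the project package is called myproject and the pipeline lives in myproject/pipelines.py (adjust the path to your own project), you would register it like this:

# settings.py -- "myproject" is a placeholder for your actual project package
ITEM_PIPELINES = {
    'myproject.pipelines.DatabasePipeline': 300,
}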

Scrapinghub Platform

What do we do if we use Scrapinghub’s platform to run our spiders? The challenge here is that we will not be able to initiate a database connection in our code on Scrapinghub’s platform. Fortunately, there’s a solution: Scrapinghub provides a simple API to fetch the scraped data as JSON. An example API request:

https://storage.scrapinghub.com/items/*project_id*?apikey=*your_api_key*&format=json

Inserting the correct project ID and API key, this URL returns our latest scraped data in JSON format. Then we can process that data just like any other JSON file. For example, using the requests library:

url = "https://storage.scrapinghub.com/items/*project_id*?apikey=*your_api_key*&format=json"
r = requests.get(url)
for item in r.json():
	self.insert_item(item)

Of course, how you need to process the data depends on the structure of your scraped items. A “scraped item” usually means one record in the database table.
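
To make this more concrete, here is a rough, hypothetical sketch of such a script end to end, reusing the placeholder table, columns and credentials from earlier in this article (insert_item is just our own helper, not part of any library):

import MySQLdb
import requests

url = "https://storage.scrapinghub.com/items/*project_id*?apikey=*your_api_key*&format=json"

conn = MySQLdb.connect(db="my_db", user="root", passwd="my_pass",
                       host="0.0.0.0", charset='utf8', use_unicode=True)
cursor = conn.cursor()


def insert_item(item):
    # Same placeholder table and columns as in the pipeline example
    sql = "INSERT INTO my_table (field1, field2, field3) VALUES (%s, %s, %s)"
    cursor.execute(sql, (item.get("field1"), item.get("field2"), item.get("field3")))


for item in requests.get(url).json():
    insert_item(item)

conn.commit()
conn.close()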

Exporting Data File

Lastly, I want to mention one more way to do it. Sometimes you cannot insert the data as soon as you scrape it; you may have some (ETL) processes between scraping and the final database. A possible solution is to simply export the scraped data in a structured format like JSON, then pass this file to the processing service you already have in place.

To export JSON you need to create a pipeline for that as well. Luckily, Scrapy has a built-in way to do this:

from scrapy.exporters import JsonItemExporter


class JsonPipeline(object):
    def __init__(self):
        self.file = open("data_export.json", 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
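
Just like the database pipeline, this exporter pipeline only runs if it is registered in ITEM_PIPELINES in settings.py.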

Then you can easily read and deserialize this file in Python like this:

import json

with open("data_export.json") as f:
    json_data = json.load(f)

Now you can do whatever you want with your scraped data.

Wrapping Up

The purpose of this article was to show you different ways to save your scraped data into a database and make it easy for you to decide which solution is the best fit for your situation. I hope it gave you some value.
