When we scrape data from the web, it's usually because we need to use it later, and that means we have to store it somewhere. In some cases a simple data file like JSON or CSV does the job perfectly. Other times we need more structure and control over our data, and then storing it in an SQL database is often the better choice.
In this article we're going to look at how to properly store scraped data in a database, along with some best practices for doing it in Scrapy. We'll use Scrapy, the web scraping framework, to extract and prepare the data, and the python-mysqldb (MySQLdb) library to connect to the database.
Database Pipeline
The most straightforward way to store scraped items in a database is to use a database pipeline. In Scrapy, pipelines are used to filter, drop, clean and process scraped items. So first, let's create our pipeline and add a constructor that receives the database settings:
class DatabasePipeline(object):

    def __init__(self, db, user, passwd, host):
        self.db = db
        self.user = user
        self.passwd = passwd
        self.host = host

    def process_item(self, item, spider):
        return item
Now, we could simply hardcode these database settings in the pipeline, and some people prefer that. A better approach, in my opinion, is to define the credentials and related settings in the Scrapy settings (settings.py) and then implement the from_crawler class method. This method returns a new pipeline instance, and the good thing about it is that it has access to the Scrapy settings.
Inside from_crawler we grab the credentials and other settings from settings.py and create the new pipeline instance. The code:
@classmethod
def from_crawler(cls, crawler):
    db_settings = crawler.settings.getdict("DB_SETTINGS")
    if not db_settings:  # if the db config is not defined in settings
        raise NotConfigured  # then raise an error
    db = db_settings['db']
    user = db_settings['user']
    passwd = db_settings['passwd']
    host = db_settings['host']
    return cls(db, user, passwd, host)  # return the pipeline instance
Our settings.py should include this piece of code:
DB_SETTINGS = {
    'db': "my_db",
    'user': 'root',
    'passwd': 'my_pass',
    'host': '0.0.0.0',
}
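For the pipeline to run at all, it also has to be enabled in settings.py via ITEM_PIPELINES. A minimal example, assuming the project is called myproject and the pipeline lives in myproject/pipelines.py (adjust the path and the priority number to your own project):

ITEM_PIPELINES = {
    # hypothetical import path -- use your own project and module name
    'myproject.pipelines.DatabasePipeline': 300,
}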
Okay, now we have the pipeline and the database settings. The next step is to implement the pipeline so it does three things:
- Initiate database connection
- Insert items into db
- Close connection
1. Initiate database connection
We're going to implement the open_spider method. Scrapy invokes it when the spider starts running, so it's the right place to initiate the database connection:
def open_spider(self, spider):
    self.conn = MySQLdb.connect(db=self.db,
                                user=self.user,
                                passwd=self.passwd,
                                host=self.host,
                                charset='utf8',
                                use_unicode=True)
    self.cursor = self.conn.cursor()
We are using the values defined earlier in settings.py.
2. Insert items into db
We have the connection; now we need to process the items. The process_item method has to return an item, and before returning it we execute the query that inserts it into the database:
def process_item(self, item, spider):
    # placeholder table and column names -- replace them with your own schema
    sql = "INSERT INTO my_table (field1, field2, field3) VALUES (%s, %s, %s)"
    self.cursor.execute(sql, (
        item.get("field1"),
        item.get("field2"),
        item.get("field3"),
    ))
    self.conn.commit()
    return item
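If an insert can fail (duplicate keys, malformed values and so on), you may not want a single bad item to stop the spider. Here is a minimal sketch of the same method with error handling, assuming the same placeholder table as above and that dropping the failed item is acceptable:

import MySQLdb
from scrapy.exceptions import DropItem

def process_item(self, item, spider):
    sql = "INSERT INTO my_table (field1, field2, field3) VALUES (%s, %s, %s)"
    try:
        self.cursor.execute(sql, (
            item.get("field1"),
            item.get("field2"),
            item.get("field3"),
        ))
        self.conn.commit()
    except MySQLdb.Error as e:
        self.conn.rollback()  # undo the failed INSERT
        raise DropItem("Failed to insert item: %s" % e)
    return item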
3. Close connection
Similar to open_spider, Scrapy has a close_spider method, which is invoked when the spider finishes and is closed. We close the database connection there:
def close_spider(self, spider):
    self.conn.close()
And that's it: our data is saved to the database by a Scrapy pipeline. The full pipeline code:
import MySQLdb
from scrapy.exceptions import NotConfigured


class DatabasePipeline(object):

    def __init__(self, db, user, passwd, host):
        self.db = db
        self.user = user
        self.passwd = passwd
        self.host = host

    @classmethod
    def from_crawler(cls, crawler):
        db_settings = crawler.settings.getdict("DB_SETTINGS")
        if not db_settings:
            raise NotConfigured
        db = db_settings['db']
        user = db_settings['user']
        passwd = db_settings['passwd']
        host = db_settings['host']
        return cls(db, user, passwd, host)

    def open_spider(self, spider):
        self.conn = MySQLdb.connect(db=self.db,
                                    user=self.user,
                                    passwd=self.passwd,
                                    host=self.host,
                                    charset='utf8',
                                    use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        sql = "INSERT INTO my_table (field1, field2, field3) VALUES (%s, %s, %s)"
        self.cursor.execute(sql, (
            item.get("field1"),
            item.get("field2"),
            item.get("field3"),
        ))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
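A note on the design: committing after every item is simple and safe, but it can get slow on large crawls. A common alternative, sketched below (not part of the pipeline above, names are placeholders), is to buffer rows and flush them in batches with executemany:

def open_spider(self, spider):
    # ...same connection setup as before...
    self.buffer = []  # rows waiting to be written

def process_item(self, item, spider):
    self.buffer.append((item.get("field1"), item.get("field2"), item.get("field3")))
    if len(self.buffer) >= 100:  # arbitrary batch size
        self._flush()
    return item

def _flush(self):
    sql = "INSERT INTO my_table (field1, field2, field3) VALUES (%s, %s, %s)"
    self.cursor.executemany(sql, self.buffer)
    self.conn.commit()
    self.buffer = []

def close_spider(self, spider):
    self._flush()  # write whatever is left in the buffer
    self.conn.close()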
Scrapinghub Platform
What do we do if we run our spiders on Scrapinghub's platform? The challenge is that we can't open a connection to our own database from the code running on Scrapinghub. Fortunately, there's a solution: Scrapinghub provides a simple API to fetch the scraped items as JSON. An example API request:
https://storage.scrapinghub.com/items/*project_id*?apikey=*your_api_key*&format=json
With the correct project ID and API key filled in, this URL returns our latest scraped data in JSON format. Then we can process it like any other JSON response, for example with the requests library:
import requests

url = "https://storage.scrapinghub.com/items/*project_id*?apikey=*your_api_key*&format=json"
r = requests.get(url)
for item in r.json():
    self.insert_item(item)  # insert_item: your own helper that writes one item to the database
How exactly you process the response depends on the structure of your scraped data; a scraped item usually maps to one record in the database table.
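Putting the two pieces together, a minimal sketch of a standalone script that pulls items from the Scrapinghub items API and writes them to MySQL could look like this (the project ID, API key, table and field names are all placeholders you'd replace with your own):

import MySQLdb
import requests

# placeholder URL -- substitute your project ID and API key
ITEMS_URL = "https://storage.scrapinghub.com/items/PROJECT_ID?apikey=API_KEY&format=json"

def main():
    conn = MySQLdb.connect(db="my_db", user="root", passwd="my_pass",
                           host="0.0.0.0", charset="utf8", use_unicode=True)
    cursor = conn.cursor()
    sql = "INSERT INTO my_table (field1, field2, field3) VALUES (%s, %s, %s)"
    for item in requests.get(ITEMS_URL).json():
        cursor.execute(sql, (item.get("field1"), item.get("field2"), item.get("field3")))
    conn.commit()
    cursor.close()
    conn.close()

if __name__ == "__main__":
    main()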
Exporting a Data File
Lastly, I want to mention one more way to do it. Sometimes you can't insert the data as soon as you scrape it, for example because there are some (ETL) processes sitting between scraping and the final database. One possible solution is to simply export the scraped data in a structured format like JSON, then pass that file to the processing service you already have in place.
To export JSON you create another pipeline. Luckily, Scrapy ships with a built-in JsonItemExporter we can use:
from scrapy.exporters import JsonItemExporter


class JsonPipeline(object):

    def __init__(self):
        self.file = open("data_export.json", 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
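As a side note, if you don't need any custom logic in the pipeline, Scrapy's built-in feed exports can produce the same JSON file straight from the command line, for example scrapy crawl my_spider -o data_export.json (where my_spider is your spider's name).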
Then you can easily read and deserialize this file in Python:
import json

with open("data_export.json") as f:
    json_data = json.load(f)
Now you can do whatever you want with your scraped data.
Wrapping Up
The purpose of this article was to show you different ways to save your scraped data to a database and to help you decide which solution best fits your situation. I hope you found it useful.