In the last post of my web scraping business blog series I mentioned that I have a spider-creating system. It lets me build Scrapy spiders literally in minutes. My only goal with this system is to be able to produce new, working spiders for websites as fast as possible. I'm not sure it's the best way to do it, but it's how I do it right now: I use my own Scrapy templates. You can check them out on GitHub.
I did some research on how I should organize my spider templates and how I should name them so I'd always know which one fits my needs. Unfortunately, I didn't really find anything on this topic. So I had to figure out my own way, and now I'm sharing it with you in case you find it helpful. Here's the basic gist of it so you can understand my templates.
How the Scrapy templates work
First of all, I analyze how deep I need to go into the website to reach my data fields. For example, if I have to follow a link, then scrape an item and paginate, it looks like this:
1st level: Following links
2nd level: Scraping and pagination
Another example: on the starting URL I need to follow links and paginate. On the followed links I need to scrape data. That task looks like this:
1st level: Following and pagination
2nd level: Scraping
As for the template naming convention, I came up with a way to describe briefly what each template spider does. For example, I have a template named 1fol2scr.py: fol stands for following and scr stands for scraping. It means this template can scrape websites that are two levels deep: following links on the first level and scraping/populating the item on the second level.
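To make the convention concrete, here is a hypothetical helper (not part of the actual templates) that decodes such a name back into a per-level description. The three-letter codes are the ones used in this post; any others would need to be added to the table:

```python
import re

# Action codes used in the template names described in this post
ACTIONS = {
    "fol": "following links",
    "scr": "scraping",
    "pag": "pagination",
}


def describe_template(name):
    """Turn e.g. '1fol2scr_pag' into {1: 'following links', 2: 'scraping + pagination'}."""
    levels = {}
    # Each level is a digit followed by one or more underscore-joined 3-letter codes
    for level, codes in re.findall(r"(\d)((?:[a-z]{3}_?)+)", name):
        actions = [ACTIONS[code] for code in codes.strip("_").split("_")]
        levels[int(level)] = " + ".join(actions)
    return levels
```

So `describe_template("1fol2scr")` reads as "level 1: following links, level 2: scraping".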
I consider the starting URL of my spider the first level. You get to the second level when you follow a URL on the starting page. The third level is when you follow another URL on the second level and go deeper into the website. To understand my templates, it's important to know that pagination does not take you deeper into the website; you stay on the same level.
These templates are so useful to me because every time I need to write a new spider for a website, the first thing I do is figure out how to reach the data fields and extract them. The result of this short investigation is something like 1scr_pag, which means I need to scrape and paginate on the starting URLs. Then I check whether I already have a template that does the same thing. If I do, great: the spider is done in minutes. If not, I create a template for this kind of website as well, so the next time I encounter one I can use it. With a template, I only have to specify the starting URLs and some selectors for hrefs and fields. This way I can quickly produce new Scrapy spiders for most websites.
If you had a look at the source code, you saw that I use item loaders in all of my Scrapy templates. They are a good way to insert a new layer between the selectors and the pipeline to clean and sanitize data before it's sent on. They also make my spider code more readable. Though when I need to write super simple spiders with basic pipelines, I skip item loaders.
What is your routine when it comes to writing spiders? Tell me in the comments!