Crawling with Scrapy – How to Debug Your Spider


When you write software, it's obvious that sooner or later there will be a function or method that doesn't work as you expected, or doesn't work at all. It's the same when you code a web scraper: it doesn't scrape a piece of data, or the response you get from the server is somewhat different from what you would expect. There are a bunch of possible mistakes your scraper can make while running. To find the root of the problem as soon as possible, you have to debug your code. In Scrapy there are several ways to do this, so let's dive into it.

Debugging Your Web Scraper

The first thing that you, and probably many developers, do when something doesn't work perfectly is to put some print statements here and there and hope they will quickly show you what's wrong. Well, it might be the first step, but what if it doesn't help? You should give Scrapy's parse command a shot; it shows you what happens under the hood. Or maybe you want to go hardcore and debug with Scrapy Shell. You will definitely want to use Scrapy Shell at some point because it's a really effective way to inspect each of your objects while scraping. Perhaps you would rather have a look at the HTML response in a real browser. You will now learn how to debug your spiders in all these different ways.

Logging

A really similar approach to print statements is to log short messages in your methods. There are 5 logging levels in Scrapy (and Python): critical, error, warning, info and debug. While debugging, you tend to use the debug level, but if you would like to log general messages to track your scraper, you may use the other ones.

There are two ways to log a message: you can use Python's logging module directly, or you can use the logger that comes with each spider.

In your spider you can easily use the logger because each Scrapy Spider instance has a logger object. It's good practice to log anything significant that was done, or should have been done, by your scraper. That way, when you run your scraper you can tell whether it's working or not, for example by checking that it scraped the data properly.
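As a rough sketch, both approaches could look like this (the spider name and URL are placeholders, not from the original project):

    import logging
    import scrapy

    # 1. Using Python's logging module directly
    logging.warning("This message goes through the root logger")

    class ExampleSpider(scrapy.Spider):
        name = "example"                     # placeholder spider name
        start_urls = ["http://example.com"]  # placeholder URL

        def parse(self, response):
            # 2. Using the spider's own logger
            self.logger.debug("Parsing %s", response.url)
            self.logger.info("Scraped title: %s", response.css("title::text").get())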

Scrapy Commands

OK, so after you've logged the hell out of your spider and still haven't found a solution to your problem, it's time to move on and try a different debugging method. You should look at what is happening under Scrapy's hood. The scrapy parse command gives you good insight at the method level. You can check the scraped items by specifying the name of your spider, the name of the parsing (callback) method, the depth level and the website URL. And if you want, you can add -v (--verbose) to get information about each depth level.
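For example, an invocation could look like this (the spider name, callback name and URL are placeholders):

    scrapy parse --spider=example -c parse_item -d 2 -v http://example.com/page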

Besides scrapy parse, there are two more commands which can help you debug: scrapy fetch and scrapy view.

scrapy fetch downloads the HTML file from the server and prints it to stdout.
scrapy view opens the response in a real browser so you can see what Scrapy “sees” while scraping.

You can see that both commands take a URL, which is then downloaded.
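For example (the URL is a placeholder, and since scrapy fetch prints to stdout you can optionally redirect the output to a file):

    scrapy fetch http://example.com > page.html
    scrapy view http://example.com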

Scrapy Shell

Here comes the monster. If you want to track and inspect your spider thoroughly, Scrapy Shell is the best tool for it. A good thing about Scrapy Shell is that you can use it from the command line and you can also invoke it from your spider code. You can check and test each object you have: crawler, spider, request, response, settings.

You can launch the shell with the URL of the website you want to scrape.
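For example, quoting the URL so your shell doesn't mangle it (the URL is a placeholder):

    scrapy shell "http://example.com"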

Then you will see that the shell gives you a list of objects you can use, along with some useful shortcuts.
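The banner looks roughly like this (trimmed; the exact output depends on your Scrapy version and the page you fetched):

    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0x...>
    [s]   request    <GET http://example.com>
    [s]   response   <200 http://example.com>
    [s]   settings   <scrapy.settings.Settings object at 0x...>
    [s]   spider     <DefaultSpider 'default' at 0x...>
    [s] Useful shortcuts:
    [s]   fetch(url)        Fetch URL and update local objects
    [s]   view(response)    View response in a browser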

Now it's up to you what you want to test, check, or just play around with.

A really useful thing Scrapy Shell can do, and one I like a lot, is that it's capable of testing your CSS selectors and XPath expressions right away. So you can check your extraction code on its own and don't have to run the whole scraper.
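For example, assuming the page you loaded has a title element, a quick test could look like this (the output depends on the actual page):

    >>> response.css("title::text").get()
    'Example Domain'
    >>> response.xpath("//title/text()").get()
    'Example Domain'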

This is how to invoke the shell from your spider:
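A minimal sketch, assuming a hypothetical spider called example and a placeholder URL:

    import scrapy
    from scrapy.shell import inspect_response

    class ExampleSpider(scrapy.Spider):
        name = "example"                     # placeholder spider name
        start_urls = ["http://example.com"]  # placeholder URL

        def parse(self, response):
            # Drop into Scrapy Shell right here, with this response already loaded
            inspect_response(response, self)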

When you invoke Scrapy Shell from your spider code, the crawler stops at that point and lets you check if everything works as expected. When you are done, press Ctrl+D (or Ctrl+Z on Windows) to exit the shell and resume your scraper.

Attaching PyCharm Debugger

I usually use IntelliJ as my IDE, so for Python I use PyCharm (which is essentially IntelliJ). Sometimes, when I don't feel like using the shell or other debugging methods, I just set some breakpoints and run the debugger. I'll show you how you can run and debug your Scrapy project in PyCharm.

First, you need to create a new Python file (run.py) in your project directory. Then, in that file, import Scrapy's cmdline module and add the command you would write on the command line as a string parameter. This is your run.py:
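A minimal run.py could look like this (the spider name example is just a placeholder for your own spider):

    from scrapy import cmdline

    # Equivalent to typing "scrapy crawl example" on the command line
    cmdline.execute("scrapy crawl example".split())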

Finally, you should create a Python run configuration and set run.py as the script:

(screenshot: the PyCharm run configuration with run.py set as the script)

Now you can define breakpoints and debug your scraper in PyCharm.

So, you have learnt some ways to debug your web scraper. I definitely suggest using Scrapy Shell most of the time, because I've found it very useful in every case and it lets you test everything very quickly.

Before you go, let me give you more Scrapy tutorials here. And if you really want to delve deep into Scrapy, I'm giving away some Scrapy Framework ebooks for FREE; sign up here to get them.