Crawling with Scrapy – How to Scrape a Single Page


Web scraping can be really useful, sometimes even unavoidable, and a good framework makes it easy. When working with Python, I like the Scrapy framework because it's very powerful, easy to use even for a novice, and capable of scraping large sites like amazon.com. If you haven't used Scrapy before, check out my guide on How to install scrapy on Ubuntu.

Create a Web Scraper with Scrapy

We're going to create a scraper that crawls some data from reddit's programming subreddit. More specifically, we are going to crawl the hottest programming topics from the first page, along with their links and their posting times.

First, we create a Scrapy project using this command:

Our project structure looks like this:
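A freshly generated project (here assumed to be named reddit_scraper) typically looks like this; the exact files can vary slightly between Scrapy versions:

```
reddit_scraper/
├── scrapy.cfg
└── reddit_scraper/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```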


In Scrapy, we store scraped data in Item classes. In our case, an Item will have fields like title, link and posting_time.

Next, we create a new Python file in the spiders folder, in which we're going to implement the parse method. But first, we should inspect the page's HTML to figure out how to get at the information we need.


We can see that each piece of data we need is inside a div with the class "entry unvoted".

Now in our custom spider we need to define its name as a string and its start URLs as a list. Then, in the parse method, we write a for loop that goes through each "entry unvoted" div and extracts the title, link and posting time. We are using XPath because I find it more efficient and readable than CSS selectors.

Running the scrapy crawl reddit command, we get what we wanted, wrapped in Items:

Download Full Source Code!