Crawling with Scrapy – How to Scrape a Single Page

Web scraping can be really useful, sometimes unavoidable, and a good framework makes it easy. When working with Python, I like the Scrapy framework because it’s powerful, easy to use even for a novice, and capable of scraping large sites. If you haven’t used Scrapy before, check out my guide on How to install Scrapy on Ubuntu.

Create a Web Scraper with Scrapy

We’re going to create a scraper that crawls some data from reddit’s programming subreddit. More specifically, we are going to crawl the hottest programming topics from the first page, along with their links and posting times.

First, we create a Scrapy project using this command:

scrapy startproject RedditScraper

Our project structure looks like this:
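Assuming a default startproject run, the layout should look roughly like this (the exact set of files varies a little between Scrapy versions):

```
RedditScraper/
    scrapy.cfg            # deploy configuration
    RedditScraper/        # the project's Python module
        __init__.py
        items.py          # item definitions go here
        pipelines.py
        settings.py
        spiders/          # our spiders will live here
            __init__.py
```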


In Scrapy, scraped data is stored in Item classes (defined in items.py). In our case, an Item will have three fields: title, link and posting_time.

import scrapy


class RedditscraperItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    posting_time = scrapy.Field()

Next, we create a new Python file in the spiders folder, in which we’re going to implement the parse method. But first, we should inspect the page’s HTML to figure out how to extract the information we need.


We can see that all the data we need lives inside divs with the class “entry unvoted”.

Now, in our custom spider, we define its name as a string and its start URLs as a list. Then, in the parse method, we write a for loop that goes through each “entry unvoted” div and picks out the title, link and posting time. I’m using XPath here because I find it more efficient and readable than CSS selectors.

import scrapy

from RedditScraper.items import RedditscraperItem


class RedditSpider(scrapy.Spider):
    name = "reddit"
    # First page of reddit's programming subreddit
    start_urls = ["https://www.reddit.com/r/programming/"]

    def parse(self, response):
        # Each post is wrapped in a div with the class "entry unvoted"
        for selector in response.xpath("//div[@class='entry unvoted']"):
            item = RedditscraperItem()
            item["title"] = selector.xpath("p[@class='title']/a/text()").extract()
            item["link"] = selector.xpath("p[@class='title']/a/@href").extract()
            item["posting_time"] = selector.xpath("p[@class='tagline']/time/text()").extract()
            yield item
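If you want to get a feel for those XPath expressions without running the whole spider, you can try the same logic standalone with just the standard library. The markup below is a simplified, hypothetical sketch of the listing page (the real page has many more attributes and nested elements), and xml.etree only supports a subset of XPath, but the structure of the queries is the same:

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical sketch of the listing markup
html = """
<div>
  <div class="entry unvoted">
    <p class="title"><a href="https://example.com/post">Example post</a></p>
    <p class="tagline">submitted <time>3 hours ago</time></p>
  </div>
</div>
"""

root = ET.fromstring(html)
# Same idea as the spider: find each entry, then drill into title/link/time
for entry in root.findall(".//div[@class='entry unvoted']"):
    title = entry.find("p[@class='title']/a").text
    link = entry.find("p[@class='title']/a").get("href")
    posting_time = entry.find("p[@class='tagline']/time").text
    print(title, link, posting_time)
```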

Running the scrapy crawl reddit command, we get what we wanted, wrapped in Items:

2016-03-07 18:49:52 [scrapy] DEBUG: Scraped from 200
{'link': [u''],
 'posting_time': [u'3 hours ago'],
 'title': [u'[Codeless Code] Case 225: The Three Most Terrifying Words']}
2016-03-07 18:49:52 [scrapy] DEBUG: Scraped from 200
{'link': [u''],
 'posting_time': [u'2 hours ago'],
 'title': [u'Using HTTPS Properly']}
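One thing to note: extract() returns a list of matches for every field, which is why each value in the log is wrapped in brackets. If you prefer plain strings, you can flatten the items afterwards (the data below is made up for illustration):

```python
# extract() returns a list of matches for each field, e.g.
# {'title': ['Using HTTPS Properly'], ...}. This helper keeps only
# the first match, yielding plain strings instead of lists.
def flatten(item):
    return {key: values[0] if values else None for key, values in item.items()}

scraped = {
    "title": ["Using HTTPS Properly"],
    "link": ["https://example.com/https-properly"],
    "posting_time": ["2 hours ago"],
}

print(flatten(scraped))
# {'title': 'Using HTTPS Properly', 'link': 'https://example.com/https-properly', 'posting_time': '2 hours ago'}
```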
