How To Scrape Multiple Websites With One Spider

Recently I took on a scraping job where I needed to collect the same kind of information from multiple websites. The task was to build a spider that scrapes price data for certain products from various ecommerce sites. Each scraped item also needed a unique ID (UUID). I decided to write a single spider that handles every website rather than a separate spider for each one.

Because I wanted one spider to handle everything, I had to write product name and price selectors for each website individually, though a few websites could share the same selector.

The product URLs were predefined, so I didn’t have to deal with pagination or link following. I just put a ton of URLs into start_urls. Once Scrapy has processed a request, the parsing function figures out which website is being scraped and picks the name and price selectors accordingly.

Choosing the Right Selector For Each Website

To tell which shop’s website is being parsed, I pass a shop meta parameter with each request. Then, with a bunch of IFs (ugly, but whatever), I pick the right name and price selectors for the item that is about to be populated. Here’s the code:

import datetime
import uuid

from itemloaders.processors import MapCompose, TakeFirst  # older Scrapy: scrapy.loader.processors
from scrapy.loader import ItemLoader
from w3lib.html import remove_tags

# ProductItem is the project's own Item class; import it from your items module.

def parse(self, response):
    item_loader = ItemLoader(item=ProductItem(), response=response)
    # Strip HTML tags from every extracted value and keep only the first match
    item_loader.default_input_processor = MapCompose(remove_tags)
    item_loader.default_output_processor = TakeFirst()
    # The shop label was attached to the request via its meta dict
    shop = response.meta["shop"]
    if shop == "shop0":
        price_selector = "meta[itemprop='price']::attr(content)"
        name_selector = "h1[itemprop='name']"
    elif shop == "shop1":
        price_selector = ".price-md > span"
        name_selector = "meta[itemprop='name']::attr(content)"
    elif shop == "shop2":
        price_selector = ".product-price"
        name_selector = ".product-detail-section > h3"
    elif shop == "shop3":
        price_selector = ".new-price"
        name_selector = "span.arial-12-bold"
    elif shop == "shop4":
        price_selector = "span[itemprop='price']"
        name_selector = "h1[itemprop='name']"
    elif shop == "shop5":
        price_selector = "dd[itemprop='price']"
        name_selector = "h1[itemprop='name']"
    else:
        # Unknown shop: fail loudly instead of hitting an unbound variable below
        raise ValueError(f"No selectors configured for shop {shop!r}")

    item_loader.add_css("price", price_selector)
    item_loader.add_css("name", name_selector)

    item_loader.add_value("shop", shop)
    item_loader.add_value("updated", str(datetime.datetime.now()))
    item_loader.add_value("scrape_id", str(uuid.uuid1()))
    item_loader.add_value("url", response.url)
    return item_loader.load_item()

It’s not a fancy solution, I guess, but it gets the job done, which was the priority.
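If the chain of IFs keeps growing, one possible tidy-up is a lookup table keyed by shop label. This is a sketch rather than what the spider actually uses: the SHOP_SELECTORS name and the selectors_for helper are made up for illustration, while the selector strings are the ones from the parse method above.

```python
# (price selector, name selector) per shop, copied from the parse method.
SHOP_SELECTORS = {
    "shop0": ("meta[itemprop='price']::attr(content)", "h1[itemprop='name']"),
    "shop1": (".price-md > span", "meta[itemprop='name']::attr(content)"),
    "shop2": (".product-price", ".product-detail-section > h3"),
    "shop3": (".new-price", "span.arial-12-bold"),
    "shop4": ("span[itemprop='price']", "h1[itemprop='name']"),
    "shop5": ("dd[itemprop='price']", "h1[itemprop='name']"),
}


def selectors_for(shop):
    """Return the (price, name) CSS selectors for a shop label."""
    try:
        return SHOP_SELECTORS[shop]
    except KeyError:
        # An unknown label is a configuration error, so report it clearly.
        raise ValueError(f"No selectors configured for shop {shop!r}") from None
```

Inside parse, the whole IF chain would then shrink to one line: price_selector, name_selector = selectors_for(response.meta["shop"]). Adding a shop becomes a one-line change to the table.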

As a quick aside, I’ve updated my scrapy-templates GitHub repo. I rewrote all the pagination code in the templates to match the latest Scrapy release, where the recommended way to create follow-up requests is the response.follow method.