Lately, I came across a scraping job where I needed to extract the same kind of information from multiple websites. The task was to create a spider that scrapes price data for certain products from various ecommerce sites. Each scraped item also needed a unique id (a UUID). So I decided to write a single spider that scrapes every website rather than a separate spider for each one.
Because I wanted to make it work with one single spider, I had to write the product name and price selectors specifically for each website, though for some websites I could reuse the same selector.
The product URLs were predefined, so I didn’t have to deal with pagination, link following, and such; I just put a ton of URLs into start_urls. Right after Scrapy processes a request, in the parse function I figure out which website is currently being scraped and pick the name and price selectors accordingly.
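Since the shop has to be attached to each request as meta (see below), in practice you would typically override start_requests rather than rely on start_urls alone. A minimal sketch of how the shop key could be derived from a URL's domain; the domain names here are hypothetical placeholders, not the sites the spider actually targets:

```python
from urllib.parse import urlparse

# Hypothetical mapping from a site's domain to the shop key used in the
# spider; the real project would list the actual ecommerce domains here.
SHOP_BY_DOMAIN = {
    "www.shop0-example.com": "shop0",
    "www.shop1-example.com": "shop1",
}

def shop_for_url(url):
    """Return the shop key for a product URL, or None if the domain is unknown."""
    return SHOP_BY_DOMAIN.get(urlparse(url).netloc)
```

In the spider, start_requests would then yield something like scrapy.Request(url, meta={"shop": shop_for_url(url)}) for each predefined URL.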
Choosing the Right Selector For Each Website
So I pass a shop parameter in the request's meta to be able to figure out which shop’s website is being parsed. Then, with a bunch of IFs (which is ugly, but whatever), I assign the right name and price selectors to the item that is about to be populated. Here’s the code:
```python
import datetime
import uuid

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst
from w3lib.html import remove_tags

# ProductItem is the project's own Item class, defined in items.py.

def parse(self, response):
    item_loader = ItemLoader(item=ProductItem(), response=response)
    item_loader.default_input_processor = MapCompose(remove_tags)
    item_loader.default_output_processor = TakeFirst()

    shop = response.meta["shop"]
    if shop == "shop0":
        price_selector = "meta[itemprop='price']::attr(content)"
        name_selector = "h1[itemprop='name']"
    elif shop == "shop1":
        price_selector = ".price-md > span"
        name_selector = "meta[itemprop='name']::attr(content)"
    elif shop == "shop2":
        price_selector = ".product-price"
        name_selector = ".product-detail-section > h3"
    elif shop == "shop3":
        price_selector = ".new-price"
        name_selector = "span.arial-12-bold"
    elif shop == "shop4":
        price_selector = "span[itemprop='price']"
        name_selector = "h1[itemprop='name']"
    elif shop == "shop5":
        price_selector = "dd[itemprop='price']"
        name_selector = "h1[itemprop='name']"

    item_loader.add_css("price", price_selector)
    item_loader.add_css("name", name_selector)
    item_loader.add_value("shop", shop)
    item_loader.add_value("updated", str(datetime.datetime.now()))
    item_loader.add_value("scrape_id", str(uuid.uuid1()))
    item_loader.add_value("url", response.url)
    return item_loader.load_item()
```
It’s not a fancy solution, I guess, but it gets the job done, which was the priority.
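If the IF chain keeps growing, one way to tidy it (a possible refactor sketch, not what the spider above actually does) is a dict mapping each shop key to its selector pair:

```python
# Hypothetical refactor: map each shop key to its (price, name) CSS selectors,
# using the same selectors as the IF chain above.
SELECTORS = {
    "shop0": ("meta[itemprop='price']::attr(content)", "h1[itemprop='name']"),
    "shop1": (".price-md > span", "meta[itemprop='name']::attr(content)"),
    "shop2": (".product-price", ".product-detail-section > h3"),
    "shop3": (".new-price", "span.arial-12-bold"),
    "shop4": ("span[itemprop='price']", "h1[itemprop='name']"),
    "shop5": ("dd[itemprop='price']", "h1[itemprop='name']"),
}

def selectors_for(shop):
    """Look up the (price, name) selector pair; raises KeyError for an unknown shop."""
    return SELECTORS[shop]
```

The parse function would then do price_selector, name_selector = selectors_for(shop), and adding a new shop becomes a one-line change.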
As a quick off-topic note, I’ve updated my scrapy-templates GitHub repo. I modified all the pagination code in the templates because of the latest Scrapy release: the recommended way to create requests is now the response.follow function.
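The convenience of response.follow is that it resolves relative URLs against the current response before building the request, so you can pass an href straight from a selector. The joining semantics can be illustrated with plain urllib (an illustration of the behavior, not Scrapy's actual implementation):

```python
from urllib.parse import urljoin

# response.follow("page/2") from a page at `base` resolves the relative
# link the same way urljoin does:
base = "https://example.com/products/list"
print(urljoin(base, "page/2"))  # https://example.com/products/page/2
print(urljoin(base, "/sale"))   # https://example.com/sale
```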