Today, we’re not gonna talk about Scrapy. Instead, we’re gonna talk about the things you should do before you start scraping a website. It doesn’t matter whether you use a framework like Scrapy or a simple scraping library like BeautifulSoup: these concepts will save you time. Sometimes you don’t need to scrape the website to get the data at all, because you can use a sort of “hidden API” that I talk about in the video. And sometimes you shouldn’t scrape the website for ethical reasons, and you should find another way to get the data you need.
First of all, you should check whether there’s an official API you can use to get your data. It’s 2017, so it wouldn’t be surprising to find an API for almost any website. In some cases, though, the official API isn’t kept up to date, or it’s missing a piece of data you need. Even then, think twice about whether you really need to scrape the site, because getting data from an API is definitely the best way to do it.
So what is this hidden API stuff? Some websites load their data with XHR, which stands for XMLHttpRequest. Essentially, the backend generates an XML or JSON response that contains the data, and the frontend requests it directly. That means we can find this XHR request in the browser inspector (usually in the Network tab) and use it just like an official API. In the video, I show you exactly how to find it.
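Here’s a rough sketch of what using such an endpoint could look like in Python, with only the standard library. Everything specific in it is an assumption for illustration: the `XHR_URL` address, the `fetch_json` and `product_names` helpers, and the shape of the JSON payload are all made up, and the real values are whatever you find in the browser inspector. Some sites also check the `X-Requested-With` header, some don’t.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint: replace it with the request URL you see in the inspector.
XHR_URL = "https://example.com/api/products?page=1"


def fetch_json(url):
    """Request the endpoint roughly the way the page's JavaScript would and decode the JSON."""
    request = Request(url, headers={
        "Accept": "application/json",
        # Assumption: some backends only answer requests that carry this header.
        "X-Requested-With": "XMLHttpRequest",
    })
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def product_names(payload):
    """Pull the names out of a payload shaped like {"products": [{"name": ...}]}."""
    return [item["name"] for item in payload.get("products", [])]


# Offline stand-in for a real response, so the parsing step runs without a network call.
sample_payload = json.loads(
    '{"products": [{"name": "Widget", "price": 9.5}, {"name": "Gadget", "price": 12.0}]}'
)
print(product_names(sample_payload))
```

Against a real site you would call `product_names(fetch_json(XHR_URL))` instead of using the sample payload, and adjust the key names to match the actual response.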
Websites use sitemaps to make it easier for search engines to index their pages. We can take advantage of this too by looking up product URLs, or whatever else we need to scrape, in the sitemap. Getting URLs from a sitemap is much faster than gathering them with our scraper.
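As a minimal sketch of pulling URLs out of a sitemap, assuming it follows the standard sitemaps.org XML format: the `extract_urls` helper, the sample sitemap, and the `/products/` URL pattern are all hypothetical and only there to show the idea.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml):
    """Return the text of every <loc> element found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Made-up sitemap content; a real one usually lives at a path like /sitemap.xml.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/1</loc></url>
  <url><loc>https://example.com/products/2</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

# Keep only the pages we care about (the "/products/" pattern is an assumption).
product_urls = [url for url in extract_urls(SAMPLE_SITEMAP) if "/products/" in url]
print(product_urls)
```

Big sites often split their sitemap into a sitemap index that points to several smaller sitemaps. The same `<loc>` lookup works on the index too, you just have to download each listed file and run it again.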
Almost every website has a robots.txt file. This text file contains rules that web crawlers and site indexers should follow: things like which pages you must not visit with your bot, or how often you’re allowed to make requests. If you don’t follow the rules defined here, probably nothing will happen to you, but I urge you to be ethical and follow them.
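Python’s standard library can read these rules for you through `urllib.robotparser`. The sketch below parses a made-up robots.txt, so the rules, the `my-bot` user agent, and the example.com URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules; a real file is served at the site's /robots.txt path.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# Check whether our (hypothetical) bot may visit a page before requesting it.
allowed_product = parser.can_fetch("my-bot", "https://example.com/products/1")
allowed_checkout = parser.can_fetch("my-bot", "https://example.com/checkout/cart")

# Seconds to wait between requests, or None if the file does not say.
delay = parser.crawl_delay("my-bot")

print(allowed_product, allowed_checkout, delay)
```

For a live site you would call `parser.set_url(...)` with the robots.txt address and then `parser.read()` instead of `parse()`, and sleep for the crawl delay between requests when one is set.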
In the video, I show you a specific website whose TOS says that you should not scrape or gather data from the site. They probably caught someone scraping the site, didn’t like it, and added this to the TOS. I’m not a lawyer, but as far as I know, having a link to the TOS at the bottom of the page is not enough to protect the site from crawling, so you would probably be okay if you scraped the website anyway. But again, I’m not a lawyer; I know jack shit about the legal stuff and I only speak from my experience. One thing I am sure of: you’d be acting unethically if you scraped a site like that.