As I always say, web scraping is really useful and sometimes inevitable, and turning raw web data into something useful matters more than ever. If you've followed my Scrapy tutorial series, you already know how to scrape hundreds of thousands of pages with Scrapy. (If you haven't, click the link.)
Another great thing about web scraping is that you can run it fully automatically on a server and have it deliver data periodically. If you're using Scrapy, Scrapinghub's Scrapy Cloud is the natural choice: it provides a convenient way to run, schedule, and track your scrapers on a remote server.
What is Scrapy Cloud? It's a platform for deploying your Scrapy spiders and scaling them as needed. You can watch your spiders while they run, read their logs in real time, review the scraped items in your browser, and schedule periodic jobs. The data is stored on Scrapinghub's servers, and you can interact with your spiders and fetch the structured data through a simple HTTP API.
Deploy Your Spider
Once your web scraper works as expected on your machine, it's time to deploy it to Scrapy Cloud. If you don't have a Scrapy project handy to try Scrapy Cloud with, here's a sample project you can deploy.
Create Project on Scrapy Cloud
First, register on Scrapinghub. It's free. Then create your first project by giving it a name and selecting that it's a Scrapy project (not Portia).
After clicking Create, you land on the project's Job Dashboard, where you can check your completed, running, and scheduled jobs.
On the sidebar, click Code & Deploys. There you'll find your API key and the ID of this project; you'll need both in order to deploy your spider. Next, install shub, the command-line tool that lets you deploy and interact with your spiders:
pip install shub
After you've successfully installed the scrapinghub command-line tool, cd into your Scrapy project folder and run `shub deploy`.
The first time you deploy, it will ask you to provide your API key and project ID.
Voilà! If you did everything right, you've just deployed your first Scrapy project to the cloud. Now it's up to you what your scraper does and when.
On the Job Dashboard you can test your scraper right away: just click Run and watch the results.
Maybe you want to add a periodic job so your scraper will run regularly:
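Periodic jobs are configured in the dashboard UI, but you can also trigger runs from your own script or cron job over HTTP. Here's a minimal sketch, assuming Scrapinghub's `schedule.json` endpoint with HTTP Basic auth (API key as username) and placeholder project/spider values:

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_KEY = "YOUR_API_KEY"  # placeholder -- substitute your own API key

def schedule_payload(project_id, spider_name):
    """Build the form-encoded body for the job-scheduling request."""
    return urlencode({"project": project_id, "spider": spider_name}).encode()

def schedule_run(project_id, spider_name, api_key=API_KEY):
    """POST a run request; authenticates with the API key via Basic auth."""
    req = Request(
        "https://app.scrapinghub.com/api/schedule.json",
        data=schedule_payload(project_id, spider_name),
    )
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urlopen(req) as resp:  # real network call -- needs valid credentials
        return resp.read()
```

Dropping something like `schedule_run("12345", "myspider")` into a cron job gives you scheduling without touching the dashboard at all.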
OK, now you have all the data you need on a remote server, but how do you access it?
You can read your data as JSON through a simple HTTP API, like this:
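As a sketch of what that looks like from Python, the snippet below builds an Items API URL (assuming the `storage.scrapinghub.com/items/<project>/<spider>/<job>` pattern) with placeholder IDs and key, then shows where the actual fetch would go:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder values -- replace with your own IDs from the dashboard.
API_KEY = "YOUR_API_KEY"
PROJECT_ID = "12345"
SPIDER_ID = "1"
JOB_ID = "6"

def items_url(project_id, spider_id, job_id, api_key, fmt="json"):
    """Build the Items API URL for a job's scraped items."""
    base = "https://storage.scrapinghub.com/items"
    query = urlencode({"apikey": api_key, "format": fmt})
    return f"{base}/{project_id}/{spider_id}/{job_id}?{query}"

url = items_url(PROJECT_ID, SPIDER_ID, JOB_ID, API_KEY)
# items = json.load(urlopen(url))  # uncomment with real credentials
```

Once the IDs and key are real, the commented line returns your scraped items as a plain Python list of dicts, ready to feed into whatever pipeline comes next.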