Crawling with Scrapy – Scrapy Cloud


As I always say, web scraping is really useful and sometimes inevitable. Making raw web data useful is very important nowadays. If you've followed my Scrapy tutorial series, you already know how to scrape hundreds of thousands of pages with Scrapy. (If you haven't, click the link.)

Another great thing about web scraping is that you can run it fully automatically on a server and have it deliver data periodically. When using Scrapy, you should use Scrapinghub's Scrapy Cloud. It provides a convenient way to run, schedule, and track your scrapers on a remote server.

Scrapy Cloud

What is Scrapy Cloud? It's a platform that lets you deploy your Scrapy spiders and scale them if you need to. You can watch your spiders run, then review the collected data. The main features include monitoring and tracking spiders while they're running, reading the logs in real time, reviewing the scraped items in your browser, and scheduling periodic jobs. The data is stored on Scrapinghub's servers, and you can interact with your spiders and fetch structured data through a simple HTTP API.

ScrapingAuthority readers get an exclusive 50% discount on Scrapy Cloud units for 6 months using this discount code: SASOCCER

Deploy Your Spider

After you've coded your web scraper and it works as expected on your machine, it's time to deploy it to Scrapy Cloud. If you don't have a Scrapy project handy to try Scrapy Cloud with, here's a sample project you can deploy.
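Alternatively, a minimal spider like the sketch below is enough to try the deploy flow. The site, spider name, and fields here are just illustrative, not taken from the sample project above:

    # quotes_spider.py - a minimal spider to test deployment with
    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # yield one item per quote block on the page
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }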

Create Project on Scrapy Cloud

First, you have to register at Scrapinghub. It's free. Then you can create your first project by giving it a name and specifying that it's a Scrapy project (not Portia).


If you click Create, you land on the project's Job Dashboard. Here you can check your completed, running, and scheduled jobs.

Deploy

On the sidebar, click Code & Deploys. Here you'll find your API key and the ID of this project; you'll need both to deploy your spider. You also have to install shub, the Scrapinghub command-line tool, which lets you deploy and interact with your spiders.
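shub is distributed on PyPI, so installing it should be a single pip command (assuming you have a working Python and pip setup):

    pip install shub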

After you've successfully installed the Scrapinghub command-line tool, cd into your Scrapy project folder and run this:
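    shub deploy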

The first time you deploy, it will ask you to provide your API key and project ID.
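shub remembers the project ID by writing a scrapinghub.yml file to the project root (the API key itself is stored separately in your home directory), so later deploys won't prompt again. A minimal version of that file looks roughly like this; the project ID below is a placeholder:

    # scrapinghub.yml - created on first deploy
    project: 12345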

Voilà! If you did everything right, you've just deployed your first Scrapy project to the cloud. Now it's up to you what your scraper does and when.


On the Job Dashboard you can test your scraper right away: just click Run and watch the results come in.


Maybe you want to add a periodic job so your scraper runs regularly. You can set this up right from the dashboard.
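Periodic jobs live in the dashboard UI, but if you'd rather kick off a run programmatically, the Scrapinghub API also exposes a run endpoint. Here's a sketch, assuming the run.json endpoint from the API docs; the IDs and spider name are placeholders:

    # run_spider.py - trigger a one-off job via the Scrapinghub HTTP API
    import requests

    API_KEY = "YOUR_API_KEY"   # the key from Code & Deploys
    PROJECT_ID = "12345"       # placeholder project ID
    SPIDER = "quotes"          # your spider's name as deployed

    resp = requests.post(
        "https://app.scrapinghub.com/api/run.json",
        auth=(API_KEY, ""),    # API key as Basic auth username, empty password
        data={"project": PROJECT_ID, "spider": SPIDER},
    )
    print(resp.json())         # includes the new job's ID on success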


Scrapinghub API

OK, now you have all the data you need on a remote server, but how do you access it?

You can use a simple HTTP API to read your data as a JSON file, like this:
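Here's a sketch of what that can look like in Python, using the items endpoint of Scrapinghub's storage API; the job ID below (project/spider/job) is a placeholder:

    # fetch_items.py - read a finished job's scraped items as JSON
    import requests

    API_KEY = "YOUR_API_KEY"   # the key from Code & Deploys
    JOB_ID = "12345/1/7"       # placeholder <project>/<spider>/<job>

    resp = requests.get(
        f"https://storage.scrapinghub.com/items/{JOB_ID}",
        auth=(API_KEY, ""),         # API key as Basic auth username
        params={"format": "json"},  # jl, csv, and xml also work
    )
    items = resp.json()             # a list of the scraped items
    print(len(items), "items fetched")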


So this is the basic gist of Scrapy Cloud. I urge you to read further about Scrapy Cloud here and the Scrapinghub API here.
