Crawling data from ecommerce websites is common nowadays. There’s so much data on ecommerce sites that it’s tempting to make use of it. Scraped product fields like name, price, category and brand form the basis of competitor monitoring and intelligence solutions. Manufacturers, web stores and similar businesses are in constant need of ecommerce data.
Technically, an ecommerce site is no different from any other kind of website you can find on the internet. In this post I want to go through some commonalities these sites share: a few shortcuts and patterns you can quickly check before you start developing your scraper.
Check meta and hidden tags to select fields
When I first learnt this it really made my job easier and quicker. When we set out to scrape a page, the first thing we look for is the fields. On an ecommerce site, one of those fields is almost certainly price. Another might be stock info.
<meta itemprop="price" content="619">
<meta itemprop="priceCurrency" content="GBP">
<meta itemprop="availability" content="in_stock">
In the example above, we can find the price, currency and stock info of a product. The cool thing is that when the website changes its design (sooner or later it will), there’s a really high chance we don’t have to modify the scraper, because these meta tags are totally independent of the layout. We can easily fetch the element by its itemprop attribute and then read the value of content. Selecting data fields using meta tags like these is a great way to make our scraper a little more robust.
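To make the itemprop lookup concrete, here is a minimal sketch using only Python's standard library. The class name ItempropMetaParser, the helper extract_itemprops and the surrounding sample page are made up for illustration; only the three meta tags come from the example above. A real project would more likely use a scraping framework or a dedicated HTML library, so treat this as one possible approach rather than the recommended one.

```python
from html.parser import HTMLParser


class ItempropMetaParser(HTMLParser):
    """Collects <meta itemprop="..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("itemprop")
        content = attr_map.get("content")
        if name and content is not None:
            # Keep the first occurrence if a property appears more than once.
            self.fields.setdefault(name, content)


def extract_itemprops(html):
    """Hypothetical helper: return every itemprop meta field found in html."""
    parser = ItempropMetaParser()
    parser.feed(html)
    return parser.fields


# Sample page: the layout markup is invented, the meta tags are from the post.
sample_html = """
<html><head>
<meta itemprop="price" content="619">
<meta itemprop="priceCurrency" content="GBP">
<meta itemprop="availability" content="in_stock">
</head><body><div class="some-layout">...</div></body></html>
"""

product = extract_itemprops(sample_html)
print(product)
```

Because the lookup only depends on the itemprop and content attributes, changing the div structure in the sample page would not change the result.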
If you’re interested in what other meta tags you should be looking for on sites check out this.
Fetch links from sitemap
If your project involves crawling the whole website, it’s generally a good idea to look for a sitemap first. A sitemap sometimes contains all the urls of a site and sometimes not. When it comes to ecommerce sites, you should be able to get at least the urls of all product pages. In most cases we don’t need anything other than those anyway.
If the ecommerce site you are after has tens of thousands of products, definitely check first whether it has a proper sitemap. Also, some sites like this one have multiple sitemap xml files that contain product urls.
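As a rough sketch of how product urls could be pulled out of a sitemap, the snippet below parses sitemap XML with Python's standard library. It assumes the site follows the standard sitemaps.org format, where a sitemap index lists child sitemap files and each child lists page urls. The file names, the sample XML and the "/products/" filter are assumptions for illustration; a real site may name and structure things differently, and the XML would be downloaded rather than hard-coded.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return (child_sitemaps, page_urls) found in one sitemap document."""
    root = ET.fromstring(xml_text)
    child_sitemaps = [
        el.text.strip()
        for el in root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
        if el.text
    ]
    page_urls = [
        el.text.strip()
        for el in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if el.text
    ]
    return child_sitemaps, page_urls


# Hypothetical sitemap index pointing at two product sitemap files.
index_xml = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://superwebstore.com/sitemap-products-1.xml</loc></sitemap>
  <sitemap><loc>https://superwebstore.com/sitemap-products-2.xml</loc></sitemap>
</sitemapindex>"""

# Hypothetical child sitemap mixing product and non-product pages.
products_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://superwebstore.com/products/125</loc></url>
  <url><loc>https://superwebstore.com/about</loc></url>
</urlset>"""

child_sitemaps, _ = parse_sitemap(index_xml)
_, page_urls = parse_sitemap(products_xml)

# Assumption: product pages can be recognized by "/products/" in the path.
product_urls = [url for url in page_urls if "/products/" in url]
print(child_sitemaps)
print(product_urls)
```

In a real crawl you would fetch each url returned in child_sitemaps and run parse_sitemap on it in turn.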
Recognize URL patterns
This tip is not always usable, but when it is, it comes in handy. I’m sure you’ve already noticed that some websites, not just ecommerce sites, use a really simple scheme to generate dynamic urls. Something like this: superwebstore.com/products/125
Be sure to check that the number at the end of the url is just a basic auto-increment-like value and not any sort of product id. If it is auto-incremented, you can just write a scraper that loops over an increasing number and appends it to the end of the url every time. If it’s a unique product id, you probably won’t be able to use it to make scraping easier.
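The loop idea could look something like the sketch below. It assumes the url scheme from the example (superwebstore.com/products/<number>) and that a missing product answers with a non-200 status. The function names, the max_misses cutoff and the fake_status stand-in are all made up for illustration. The demo at the bottom runs offline against fake_status, while fetch_status shows what a real request might look like with urllib.

```python
import urllib.error
import urllib.request

# Assumed url scheme, taken from the example in the post.
BASE_URL = "https://superwebstore.com/products/{}"


def fetch_status(url):
    """Return the HTTP status code for url (real network call, unused in the demo)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def crawl_by_id(start, stop, get_status=fetch_status, max_misses=3):
    """Walk ids in [start, stop) and collect the urls that answer with 200.

    Stops early after max_misses consecutive non-200 responses, on the
    assumption that a long run of gaps means we are past the last product.
    """
    found = []
    misses = 0
    for product_id in range(start, stop):
        url = BASE_URL.format(product_id)
        if get_status(url) == 200:
            found.append(url)
            misses = 0
        else:
            misses += 1
            if misses >= max_misses:
                break
    return found


def fake_status(url):
    """Offline stand-in for a web server: only ids 125, 126 and 128 exist."""
    existing = {"125", "126", "128"}
    return 200 if url.rsplit("/", 1)[-1] in existing else 404


urls = crawl_by_id(125, 200, get_status=fake_status)
print(urls)
```

Tolerating a few consecutive misses matters because auto-incremented ids often have gaps where products were deleted. In a real crawl you would also add a delay between requests.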