Web Scraping in Java with Jsoup

When I was starting out as a programmer and as a web scraper, I was addicted to Java. I didn’t care that other languages existed. I was so stubborn that in my hobby projects I literally used Java for everything: desktop applications, web applications and web scrapers. It was cool because I gained great knowledge of Java, and I learnt the basics of web scraping in Java too. The first web scraping/HTML parsing library I ever used was Jsoup. Jsoup is awesome. In many cases you need no more than Jsoup.


How to Scrape a Website with Jsoup

This post is just a quick overview of what Jsoup can do for you. I will cover the main web scraping tasks you may encounter in your projects.

In the examples below I use my own user agent, but you should use your own or spoof one.

So first, obviously, you need to open the web page you are going to scrape:
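
Something like this minimal sketch should do it; the URL is a placeholder for whatever page you want to scrape, and the user agent string is just an example:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class OpenPage {

    public static void main(String[] args) throws IOException {
        // Placeholder URL and user agent; swap in your own
        Document doc = Jsoup.connect("https://example.com/countries")
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36")
                .get();

        System.out.println(doc.title());
    }
}
```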

Selectors

In Jsoup there are two ways to navigate the HTML and select the elements we need to fetch or manipulate: DOM methods and CSS selectors. Jsoup doesn’t support XPath (though you can check out XSoup, which does).

This snippet shows how to select only the country names from the example page:
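
A sketch using Jsoup’s DOM-style methods; the URL and the country-name class are assumptions about how the example page is marked up:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CountryNames {

    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://example.com/countries").get();

        // DOM-style navigation: grab every element with the assumed class name
        Elements countryNames = doc.getElementsByClass("country-name");
        for (Element name : countryNames) {
            System.out.println(name.text());
        }
    }
}
```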

Now we’re gonna use CSS selectors to select country capital, population and area:
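
Continuing from the snippet above (doc is the parsed page), something like this works; the selectors are assumptions about the page’s markup:

```java
// CSS selectors can target exactly the nodes you need;
// the class names below are assumptions about the example page
Elements capitals = doc.select("div.country span.country-capital");
Elements populations = doc.select("div.country span.country-population");
Elements areas = doc.select("div.country span.country-area");
```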

You can see that with CSS selectors your selection can be much more precise about what you really need from the page.

You can easily extract the text of these HTML elements as Strings, like this:
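
For example, with the Elements selected above, Element.text() returns the visible text of each node as a String:

```java
// text() strips the tags and returns the element's visible text
for (Element capital : capitals) {
    String capitalName = capital.text();
    System.out.println(capitalName);
}
```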

Forms

Jsoup makes it super easy to work with submittable forms. In our example, let’s consider this page. There you can see a bunch of hockey teams and their stats. Let’s say you need to search for the Calgary Flames’ stats only. To submit a form correctly with our scraper, you should analyze the page source a bit to get an idea of what data you need to POST or GET in order to reach what you need.


Inspecting the form’s source shows that you need to send the request to the form’s action URL with the “q” parameter added to your GET request. Here it is:
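
A sketch of that request; the action URL and the result-table selector are assumptions, while “q” is the parameter found in the form’s source:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SearchTeam {

    public static void main(String[] args) throws Exception {
        // The action URL is an assumption; "q" carries the search text
        Document results = Jsoup.connect("https://example.com/hockey/search")
                .data("q", "Calgary Flames")
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36")
                .get();

        // Print every row of the results table (the selector is an assumption)
        for (Element row : results.select("table tr")) {
            System.out.println(row.text());
        }
    }
}
```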

That’s all. It works the same as if you had typed the text and clicked the Search button in a browser.

Login

Logging in to a website is pretty similar to submitting a form, but you have to take care of cookies. By handling cookies you can avoid having to log in again and again when you want to scrape different pages. As I mentioned above, you should inspect the source code of the page to learn what exactly happens when it logs you in. So this is the page I’m going to use to log in with Jsoup and store cookies. Inspect its login form first to see which fields it submits.

It looks really similar to the other form above, except that now we need to send a POST after the initial GET, and handle cookies at the same time. Here is how you can do this:
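
A sketch of that flow; the login URL and the username/password field names are assumptions you would confirm in the inspector:

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.util.Map;

public class Login {

    public static void main(String[] args) throws Exception {
        String loginUrl = "https://example.com/login"; // assumption

        // GET the login page first to pick up any session cookies it sets
        Connection.Response loginForm = Jsoup.connect(loginUrl)
                .method(Connection.Method.GET)
                .execute();

        // POST the credentials together with those cookies;
        // the field names are assumptions taken from the form's source
        Connection.Response loginResponse = Jsoup.connect(loginUrl)
                .cookies(loginForm.cookies())
                .data("username", "myUser")
                .data("password", "myPassword")
                .method(Connection.Method.POST)
                .execute();

        // Keep the session cookies in a simple Map for later requests
        Map<String, String> cookies = loginResponse.cookies();
        System.out.println("Stored " + cookies.size() + " cookies");
    }
}
```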

You can see that we store the cookies in a simple Map, and now you are good to go. You can open pages which require you to be logged in, and you stay logged in by setting those cookies on every page you are going to scrape:
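
For example, with the cookies Map from the login step (the member-only URL is a placeholder):

```java
// Send the stored cookies along so the server treats us as logged in
Document memberPage = Jsoup.connect("https://example.com/members/stats")
        .cookies(cookies)
        .get();

System.out.println(memberPage.title());
```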

Pagination

In Jsoup, like everything else, pagination is very simple to do. Here’s an example for this page:
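
A sketch, assuming the site takes a page_num query parameter and that ten pages exist; both the URL pattern and the row selector are assumptions:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Pagination {

    public static void main(String[] args) throws Exception {
        for (int page = 1; page <= 10; page++) {
            // The URL pattern and the page_num parameter are assumptions
            Document doc = Jsoup.connect("https://example.com/hockey/?page_num=" + page)
                    .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36")
                    .get();

            // Run the same extraction code on every page (the selector is an assumption)
            for (Element row : doc.select("table tr")) {
                System.out.println(row.text());
            }
        }
    }
}
```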

This way you can iterate over any number of pages and run the same scraping code on each of them.

Jsoup can do much more, so I advise you to check out Jsoup.org to learn more about the library. Also, if you are interested in web scraping/HTML parsing libraries like Jsoup in other languages, check out The Ultimate Resource Guide To HTML Parsers.


If you struggle with scraping a web page, comment below and I will help you out.