Web Scraping in Java in 2024: The Complete Guide

May 10, 2024 · 14 min read

In this Java web scraping tutorial, you'll learn everything you need to know about web scraping in Java, from the basics to the most advanced aspects. Follow this step-by-step guide, and you'll become a web scraping expert.

Let's not waste any more time! Learn how to build a web scraper in Java: a script able to crawl an entire website and automatically extract data from it. Cool, isn't it?

Can You Web Scrape With Java?

The short answer is "Yes, you can!"

Java is one of the most reliable and popular object-oriented programming languages, and it can count on a wide ecosystem of libraries. This means there are several Java web scraping libraries you can choose from.

Two examples are Jsoup and Selenium. These libraries allow you to connect to a web page and come with many functions to help you extract the data you're interested in. In this Java web scraping tutorial, you'll learn how to use both.

How Do You Scrape a Page in Java?

You can scrape a web page in Java just as you can in any other programming language. You need a web scraping Java library that allows you to visit a web page, retrieve HTML elements, and extract data from them.

You can easily install a Java web scraping library with Maven or Gradle, the two most popular Java dependency management tools. Follow this web scraping Java tutorial and learn how to do web scraping in Java.

Getting Started

Before starting to build your Java web scraper, you need to meet the following requirements:

  • Java LTS 8+: any LTS (Long-Term Support) version of Java greater than or equal to 8 will do. In detail, this Java web scraping tutorial refers to Java 21, the latest LTS version at the time of writing.
  • Gradle or Maven: choose one of the two build automation tools. You'll need its dependency management features to install your Java web scraping library.
  • A Java IDE: any IDE that supports Java and integrates with Maven or Gradle will do. IntelliJ IDEA is one of the best options available.

If you don't meet these prerequisites, follow the links above to download and install Java, Gradle or Maven, and a Java IDE. If you encounter problems, follow the official installation guides. Then, verify that everything went as expected with the following terminal command:

Terminal
java -version

This should return something like this:

Output
openjdk version "21" 2023-09-19 LTS
OpenJDK Runtime Environment Temurin-21+35 (build 21+35-LTS)
OpenJDK 64-Bit Server VM Temurin-21+35 (build 21+35-LTS, mixed mode, sharing)

As you can see, this output shows the version of Java installed on your machine.

Then, if you're a Gradle user, type in your terminal:

Terminal
gradle -v

Again, this should return the version of Gradle you installed, as follows:

Output
 ------------------------------------------------------------
Gradle 8.4
------------------------------------------------------------

Build time:   2023-10-04 20:52:13 UTC
Revision:     e9251e572c9bd1d01e503a0dfdf43aedaeecdc3f

Kotlin:       1.9.10
Groovy:       3.0.17
Ant:          Apache Ant(TM) version 1.10.13 compiled on January 4 2023
JVM:          21 (Eclipse Adoptium 21+35-LTS)
OS:           Linux 6.2.0-34-generic amd64

Or, if you're a Maven user, launch the command below:

Terminal
mvn -v

If the Maven installation process worked as expected, this should return something like this:

Output
Apache Maven 3.9.5 (57804ffe001d7215b5e7bcb531cf83df38f93546)

You're now ready to follow this step-by-step web scraping Java tutorial. In detail, you're going to learn how to perform web scraping in Java on ScrapingCourse.com, a demo site for web scrapers with real e-commerce features.

Have a look at the target product page.

ScrapingCourse.com Ecommerce homepage

Note that the ScrapingCourse.com e-commerce website is just a simple paginated list of different products. The goal of the Java web scraper will be to crawl the entire website and retrieve all product data.

Now, open your IDE and create a new Java project.

New project on IDE

Select Maven or Gradle based on the build automation tool you installed or want to use. Then, click "Create" to initialize your Java project. Wait for the setup process to end, and you should now have access to the following Java web scraping project:

Scraping project on IDE

Let's now learn the basics of web scraping using Java!

Basic Web Scraping in Java

The first thing you need to learn is how to scrape a static website in Java. You can think of a static website as a collection of pre-built HTML documents, each with its own CSS and JavaScript files. A static website relies on server-side rendering.

In a static web page, the content is embedded in the HTML document returned by the server. So, you don't need a web browser to extract data from it. In detail, static web scraping is about:

  1. Downloading a web page.
  2. Parsing the HTML document retrieved from the server.
  3. Selecting the HTML elements containing the data of interest.
  4. Extracting the data from them.

In Java, scraping web pages isn't difficult, especially when it comes to static web scraping. Let's now learn the basics of web scraping with Java.

Step #1: Install Jsoup

First, you need a web scraping Java library. Jsoup is a Java library that makes web scraping easy. In detail, Jsoup comes with an advanced Java web scraping API that allows you to connect to a web page through its URL, select HTML elements with CSS selectors, and extract data from them.

In other terms, Jsoup offers you almost everything you need to perform static web scraping in Java. If you're a Gradle user, add jsoup to the dependencies section of your build.gradle file:

build.gradle
implementation "org.jsoup:jsoup:1.16.1"

For Maven users, add the following inside the <dependencies> tag of your pom.xml file:

pom.xml
<dependency> 
	<groupId>org.jsoup</groupId> 
	<artifactId>jsoup</artifactId> 
	<version>1.16.1</version> 
</dependency>

Then, if you're an IntelliJ user, don't forget to click on the Gradle/Maven reload button shown below to install the new dependencies:

Reload to install dependencies

Jsoup is now installed and ready to use. Import it in your Scraper.java file as follows:

Scraper.java
import org.jsoup.*; 
import org.jsoup.nodes.*; 
import org.jsoup.select.*;

Let's write your first web scraping Java script!


Step #2: Connect to your target website

You can use Jsoup to connect to a website using its URL with the following lines:

Scraper.java
// initializing the HTML Document page variable 
Document doc; 
 
try { 
	// fetching the target website 
	doc = Jsoup.connect("https://www.scrapingcourse.com/ecommerce/").get(); 
} catch (IOException e) { 
	throw new RuntimeException(e); 
}

This snippet uses the connect() method from Jsoup to connect to the target website. Note that if the connection fails, Jsoup throws an IOException. This is why you need the try ... catch logic. Then, the get() method returns a Jsoup HTML Document object you can use to explore the DOM.

Keep in mind that many websites automatically block requests that don't come with a set of expected HTTP headers. This is one of the most basic anti-scraping systems. Thus, you can often avoid being blocked simply by setting these HTTP headers manually.

In general, the most important header you should always set is the User-Agent header. This is a string that helps the server identify the application, operating system, and vendor the HTTP request comes from.

You can set the User-Agent header and other HTTP headers in Jsoup as follows:

Scraper.java
doc = Jsoup 
	.connect("https://www.scrapingcourse.com/ecommerce/") 
	.userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36") 
	.header("Accept-Language", "*") 
	.get();

Specifically, you can set the User-Agent with the Jsoup userAgent() method. Similarly, you can specify any other HTTP header with header().

Step #3: Select the HTML elements of interest

Open your target web page in the browser and identify the HTML elements of interest. In this case, you want to scrape all product HTML elements. Right-click on a product HTML element and select the "Inspect" option. This should open the DevTools window below:

scrapingcourse ecommerce homepage inspect first product li

As you can see, a product is a li.product HTML element. This includes:

  • An a HTML element: contains the URL associated with the product.
  • An img HTML element: contains the product image.
  • An h2 HTML element: contains the product name.
  • A span HTML element: contains the product price.

Now, let's learn how to extract data from a product HTML element with web scraping in Java.

Step #4: Extract data from the HTML elements

First, you need a Java object in which to store the scraped data. Create a data folder in the main package and define a Product.java class as follows:

Product.java
package com.zenrows.data; 
 
public class Product { 
	private String url; 
	private String image; 
	private String name; 
	private String price; 
 
	// getters and setters omitted for brevity... 
 
	@Override 
	public String toString() { 
		return "{ \"url\":\"" + url + "\", " 
				+ " \"image\": \"" + image + "\", " 
				+ "\"name\":\"" + name + "\", " 
				+ "\"price\": \"" + price + "\" }"; 
	} 
}

Note that the toString() method produces a string in JSON format. This will come in handy later.

Now, let's retrieve the list of li.product HTML elements on the target web page. You can achieve this with Jsoup as below:

Scraper.java
Elements productElements = doc.select("li.product");

What this snippet does is simple. The Jsoup select() function applies the CSS selector strategy to retrieve all li.product elements on the web page. In detail, Elements extends ArrayList, so you can easily iterate over it.

Thus, you can iterate over productElements to extract the info of interest and store it in Product objects:

Scraper.java
// initializing the list of Java objects to store 
// the scraped data 
List<Product> products = new ArrayList<>(); 
 
// retrieving the list of product HTML elements 
Elements productElements = doc.select("li.product"); 
 
// iterating over the list of HTML products 
for (Element productElement : productElements) { 
	Product product = new Product(); 
 
	// extracting the data of interest from the product HTML element 
	// and storing it in product
	product.setUrl(productElement.selectFirst("a").attr("href")); 
	product.setImage(productElement.selectFirst("img").attr("src")); 
	product.setName(productElement.selectFirst("h2").text()); 
	product.setPrice(productElement.selectFirst("span").text()); 
 
	// adding product to the list of the scraped products 
	products.add(product); 
}

This logic uses the Java web scraping API offered by Jsoup to extract all the data of interest from each product HTML element. Then, it initializes a Product with this data and adds it to the list of scraped products.

Congrats! You just learned how to scrape data from a web page with Jsoup. Let's now convert this data into a more useful format.

Step #5: Export the data to JSON

Don't forget that the toString() method of Product returns a JSON string. So, simply call toString() on the List<Product> object and print the result:

Scraper.java
String json = products.toString();
System.out.println(json);

toString() on an ArrayList calls the toString() method on each element of the list. Then, it wraps the results in square brackets.

In other words, this will produce the following JSON data:

Output
[ 
	{ 
		"url": "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/", 
		"image": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg", 
		"name": "Abominable Hoodie", 
		"price": "$69.00" 
	}, 
 
	// ... 
 
	{ 
		"url": "https://www.scrapingcourse.com/ecommerce/product/artemis-running-short/", 
		"image": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wsh04-black_main-324x324.jpg", 
		"name": "Artemis Running Short", 
		"price": "$49.00" 
	} 
]
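
If you also want to persist this output, here's a minimal sketch that writes the JSON string to a products.json file (the file name is just an example) using java.nio.file.Files, available since Java 11:

Scraper.java
// writing the scraped data to a products.json file 
// (requires java.nio.file.Files, java.nio.file.Paths, and java.nio.charset.StandardCharsets) 
try { 
	Files.writeString(Paths.get("products.json"), products.toString(), StandardCharsets.UTF_8); 
} catch (IOException e) { 
	throw new RuntimeException(e); 
}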

Et voilà! You just performed web scraping using Java! Yet, the website consists of several web pages. Let's see how to scrape them all.

Web Crawling in Java

Let's now retrieve the list of all pagination links to scrape the entire website. This is what web crawling is about. Right-click on a pagination number HTML element and choose the "Inspect" option.

scrapingcourse ecommerce homepage inspect

The browser should open the DevTools section and highlight the selected DOM element, as below:

scrapingcourse ecommerce homepage devtools

From here, note that you can extract all the pagination number HTML elements with the a.page-numbers CSS selector. These elements contain the links you want to scrape. You can retrieve them all with Jsoup as below:

Scraper.java
Elements paginationElements = doc.select("a.page-numbers");

If you want to scrape all web pages, you have to implement some crawling logic. You'll also need some lists and sets to avoid scraping a web page twice. You can implement web crawling logic that visits a limited number of web pages, controlled by limit, as follows:

Scraper.java
package com.zenrows; 
 
import com.zenrows.data.Product; 
import org.jsoup.*; 
import org.jsoup.nodes.*; 
import org.jsoup.select.*; 
import java.io.IOException; 
import java.util.*; 
 
public class Scraper { 
	public static void scrapeProductPage( 
			List<Product> products, 
			Set<String> pagesDiscovered, 
			List<String> pagesToScrape 
	) { 
		// the current web page is about to be scraped and 
		// should no longer be part of the scraping queue 
		String url = pagesToScrape.remove(0); 
 
		pagesDiscovered.add(url); 
 
		// ... scraping logic omitted for brevity

		Elements paginationElements = doc.select("a.page-numbers");
 
		// iterating over the pagination HTML elements 
		for (Element pageElement : paginationElements) { 
			// the new link discovered 
			String pageUrl = pageElement.attr("href"); 
 
			// if the web page discovered is new and should be scraped 
			if (!pagesDiscovered.contains(pageUrl) && !pagesToScrape.contains(pageUrl)) { 
				pagesToScrape.add(pageUrl); 
			} 
 
			// adding the link just discovered 
			// to the set of pages discovered so far 
			pagesDiscovered.add(pageUrl); 
		} 
	} 
 
	public static void main(String[] args) { 
		// initializing the list of Java objects to store 
		// the scraped data 
		List<Product> products = new ArrayList<>(); 
 
		// initializing the set of web page urls 
		// discovered while crawling the target website 
		Set<String> pagesDiscovered = new HashSet<>(); 
 
		// initializing the queue of urls to scrape 
		List<String> pagesToScrape = new ArrayList<>(); 
		// initializing the scraping queue with the 
		// first pagination page 
		pagesToScrape.add("https://www.scrapingcourse.com/ecommerce/page/1/"); 
 
		// the number of iterations executed 
		int i = 0; 
		// to limit the number of pages to scrape to 12 
		int limit = 12; 
 
		while (!pagesToScrape.isEmpty() && i < limit) { 
			scrapeProductPage(products, pagesDiscovered, pagesToScrape); 
			// incrementing the iteration number 
			i++; 
		} 
 
		System.out.println(products.size()); 
 
		// writing the scraped data to a db or export it to a file... 
	} 
}

scrapeProductPage() scrapes a web page, discovers new links, and adds their URLs to the scraping queue. You've set the limit to 12 because the target website has 12 pages. So, at the end of the while loop, pagesToScrape will be empty and pagesDiscovered will contain all 12 pagination URLs.

Here's the full code:

Scraper.java
package com.zenrows;

import com.zenrows.data.Product;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
import java.io.IOException;
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Scraper {
    public static void scrapeProductPage(
            List<Product> products,
            Set<String> pagesDiscovered,
            List<String> pagesToScrape
    ) {
        if (!pagesToScrape.isEmpty()) {
            // the current web page is about to be scraped and
            // should no longer be part of the scraping queue
            String url = pagesToScrape.remove(0);

            pagesDiscovered.add(url);

            // initializing the HTML Document page variable
            Document doc;

            try {
                // fetching the target website
                doc = Jsoup
                        .connect(url)
                        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")
                        .get();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }

            // retrieving the list of product HTML elements
            // in the target page
            Elements productElements = doc.select("li.product");

            // iterating over the list of HTML products
            for (Element productElement : productElements) {
                Product product = new Product();

                // extracting the data of interest from the product HTML element
                // and storing it in product
                product.setUrl(productElement.selectFirst("a").attr("href"));
                product.setImage(productElement.selectFirst("img").attr("src"));
                product.setName(productElement.selectFirst("h2").text());
                product.setPrice(productElement.selectFirst("span").text());

                // adding product to the list of the scraped products
                products.add(product);
            }

            // retrieving the list of pagination HTML element
            Elements paginationElements = doc.select("a.page-numbers");

            // iterating over the pagination HTML elements
            for (Element pageElement : paginationElements) {
                // the new link discovered
                String pageUrl = pageElement.attr("href");

                // if the web page discovered is new and should be scraped
                if (!pagesDiscovered.contains(pageUrl) && !pagesToScrape.contains(pageUrl)) {
                    pagesToScrape.add(pageUrl);
                }

                // adding the link just discovered
                // to the set of pages discovered so far
                pagesDiscovered.add(pageUrl);
            }

            // logging the end of the scraping operation
            System.out.println(url + " -> page scraped");
        }
    }

    public static void main(String[] args) { 
		// initializing the list of Java objects to store 
		// the scraped data 
		List<Product> products = new ArrayList<>(); 
 
		// initializing the set of web page urls 
		// discovered while crawling the target website 
		Set<String> pagesDiscovered = new HashSet<>(); 
 
		// initializing the queue of urls to scrape 
		List<String> pagesToScrape = new ArrayList<>(); 
		// initializing the scraping queue with the 
		// first pagination page 
		pagesToScrape.add("https://www.scrapingcourse.com/ecommerce/page/1/"); 
 
		// the number of iterations executed 
		int i = 0; 
		// to limit the number of pages to scrape to 12 
		int limit = 12; 
 
		while (!pagesToScrape.isEmpty() && i < limit) { 
			scrapeProductPage(products, pagesDiscovered, pagesToScrape); 
			// incrementing the iteration number 
			i++; 
		} 
 
		System.out.println(products.size());
        // writing the scraped data to a db or export it to a file...
    }
}

If you're an IntelliJ IDEA user, click on the run icon to run the web scraping Java example and wait for the process to end. This will take a few seconds. At the end of the process, products will contain all 188 products.

Congratulations! You just extracted all the product data automatically!

Parallel Web Scraping in Java

Web scraping in Java can become a time-consuming process. This is especially true if your target website consists of many web pages and/or the server takes time to respond. Also, Java isn't known for being a particularly fast programming language.

At the same time, Java 8 introduced a lot of features that make parallelism easier. So, transforming your Java web scraper to work in parallel takes only a few updates. Let's see how you can perform parallel web scraping in Java:

Scraper.java
package com.zenrows; 
 
import com.zenrows.data.Product; 
import org.jsoup.*; 
import org.jsoup.nodes.*; 
import org.jsoup.select.*; 
import java.io.IOException; 
import java.util.*; 
import java.util.concurrent.ExecutorService; 
import java.util.concurrent.Executors; 
import java.util.concurrent.TimeUnit; 
 
public class Scraper { 
	public static void scrapeProductPage( 
			List<Product> products, 
			Set<String> pagesDiscovered, 
			List<String> pagesToScrape 
	) { 
		//... omitted for brevity
	} 
 
	public static void main(String[] args) throws InterruptedException { 
		// initializing the list of Java objects to store 
		// the scraped data 
		List<Product> products = Collections.synchronizedList(new ArrayList<>()); 
 
		// initializing the set of web page urls 
		// discovered while crawling the target website 
		Set<String> pagesDiscovered = Collections.synchronizedSet(new HashSet<>()); 
 
		// initializing the queue of urls to scrape 
		List<String> pagesToScrape = Collections.synchronizedList(new ArrayList<>()); 
		// initializing the scraping queue with the 
		// first pagination page 
		pagesToScrape.add("https://www.scrapingcourse.com/ecommerce/page/1/"); 
 
		// initializing the ExecutorService to run the 
		// web scraping process in parallel on 4 pages at a time 
		ExecutorService executorService = Executors.newFixedThreadPool(4); 
 
		// launching the web scraping process to discover some 
		// urls and take advantage of the parallelization process 
		scrapeProductPage(products, pagesDiscovered, pagesToScrape); 
 
		// the number of iterations executed 
		int i = 1; 
		// to limit the number of pages to scrape to 12 
		int limit = 12; 
 
		while (!pagesToScrape.isEmpty() && i < limit) { 
			// registering the web scraping task 
			executorService.execute(() -> scrapeProductPage(products, pagesDiscovered, pagesToScrape)); 
 
			// adding a 200ms delay to avoid overloading the server 
			TimeUnit.MILLISECONDS.sleep(200); 
 
			// incrementing the iteration number 
			i++; 
		} 
 
		// waiting up to 300 seconds for all pending tasks to end 
		executorService.shutdown(); 
		executorService.awaitTermination(300, TimeUnit.SECONDS); 
 
		System.out.println(products.size()); 
	} 
}

See the complete code below:

Scraper.java
package com.zenrows;

import com.zenrows.data.Product;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
import java.io.IOException;
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Scraper {
    public static void scrapeProductPage(
            List<Product> products,
            Set<String> pagesDiscovered,
            List<String> pagesToScrape
    ) {
        if (!pagesToScrape.isEmpty()) {
            // the current web page is about to be scraped and
            // should no longer be part of the scraping queue
            String url = pagesToScrape.remove(0);

            pagesDiscovered.add(url);

            // initializing the HTML Document page variable
            Document doc;

            try {
                // fetching the target website
                doc = Jsoup
                        .connect(url)
                        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")
                        .get();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }

            // retrieving the list of product HTML elements
            // in the target page
            Elements productElements = doc.select("li.product");

            // iterating over the list of HTML products
            for (Element productElement : productElements) {
                Product product = new Product();

                // extracting the data of interest from the product HTML element
                // and storing it in product
                product.setUrl(productElement.selectFirst("a").attr("href"));
                product.setImage(productElement.selectFirst("img").attr("src"));
                product.setName(productElement.selectFirst("h2").text());
                product.setPrice(productElement.selectFirst("span").text());

                // adding product to the list of the scraped products
                products.add(product);
            }

            // retrieving the list of pagination HTML element
            Elements paginationElements = doc.select("a.page-numbers");

            // iterating over the pagination HTML elements
            for (Element pageElement : paginationElements) {
                // the new link discovered
                String pageUrl = pageElement.attr("href");

                // if the web page discovered is new and should be scraped
                if (!pagesDiscovered.contains(pageUrl) && !pagesToScrape.contains(pageUrl)) {
                    pagesToScrape.add(pageUrl);
                }

                // adding the link just discovered
                // to the set of pages discovered so far
                pagesDiscovered.add(pageUrl);
            }

            // logging the end of the scraping operation
            System.out.println(url + " -> page scraped");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // initializing the list of Java objects to store
        // the scraped data
        List<Product> products = Collections.synchronizedList(new ArrayList<>());

        // initializing the set of web page urls
        // discovered while crawling the target website
        Set<String> pagesDiscovered = Collections.synchronizedSet(new HashSet<>());

        // initializing the queue of urls to scrape
        List<String> pagesToScrape = Collections.synchronizedList(new ArrayList<>());
        // initializing the scraping queue with the
        // first pagination page
        pagesToScrape.add("https://www.scrapingcourse.com/ecommerce/page/1/");

        // initializing the ExecutorService to run the
        // web scraping process in parallel on 4 pages at a time
        ExecutorService executorService = Executors.newFixedThreadPool(4);

        // launching the web scraping process to discover some
        // urls and take advantage of the parallelization process
        scrapeProductPage(products, pagesDiscovered, pagesToScrape);

        // the number of iterations executed
        int i = 1;
        // to limit the number of pages to scrape to 12
        int limit = 12;

        while (!pagesToScrape.isEmpty() && i < limit) {
            // registering the web scraping task
            executorService.execute(() -> scrapeProductPage(products, pagesDiscovered, pagesToScrape));

            // adding a 200ms delay to avoid overloading the server
            TimeUnit.MILLISECONDS.sleep(200);

            // incrementing the iteration number
            i++;
        }

        // waiting up to 300 seconds for all pending tasks to end
        executorService.shutdown();
        executorService.awaitTermination(300, TimeUnit.SECONDS);

        System.out.println(products.size());

        // writing the scraped data to a db or export it to a file...
    }
}

Keep in mind that ArrayList and HashSet aren't thread-safe in Java. This is why you need to wrap your collections with Collections.synchronizedList() and Collections.synchronizedSet(), respectively. These methods turn them into thread-safe collections you can safely share across threads.

Then, you can use an ExecutorService to run tasks asynchronously. Thanks to ExecutorService, you can execute and manage parallel tasks with little effort. Specifically, newFixedThreadPool() initializes an Executor that can simultaneously run as many threads as the number passed to the method.

You don't want to overload the target server or your local machine. This is why you add a few hundred milliseconds of delay between task submissions with sleep(). Your goal is to perform web scraping, not a DoS attack.

Then, always remember to shut down your ExecutorService and release its resources. Since some tasks may still be running when the code exits the while loop, you should use the awaitTermination() method.

You must call this method after a shutdown request. In detail, awaitTermination() blocks and waits for all tasks to complete within the time interval passed as a parameter.

Run this Java web scraping example script, and you'll experience a noticeable increase in performance compared to before. You just learned how to speed up web scraping with Java.

Well done! You now know how to do parallel web scraping with Java! But there are still a few lessons to learn!

Scraping Dynamic Content Websites in Java

Don't forget that a web page is more than its corresponding HTML document. Web pages can perform HTTP requests in the browser via AJAX. This mechanism allows web pages to retrieve data asynchronously and update the content shown to the user accordingly.

Most websites now rely on frontend API requests to retrieve data. These requests are AJAX calls. So, these API calls provide valuable data you can't ignore when it comes to web scraping. You can sniff these calls, replicate them in your scraping script, and retrieve this data.

To sniff an AJAX call, use the DevTools of your browser. Right-click on a web page, choose "Inspect", and select the "Network" tab. In the "Fetch/XHR" tab, you'll find the list of AJAX calls the web page executed, as below.

POST AJAX call

Here, you can retrieve all the info you need to replicate these calls in your web scraping script. Yet, this isn't the best approach.
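
For reference, here's a minimal sketch of what replicating one of these sniffed calls could look like with Java's built-in HttpClient (Java 11+). The endpoint URL below is a placeholder for whatever you find in DevTools, not a real ScrapingCourse.com API:

ApiScraper.java
import java.net.URI; 
import java.net.http.HttpClient; 
import java.net.http.HttpRequest; 
import java.net.http.HttpResponse; 

public class ApiScraper { 
	public static void main(String[] args) throws Exception { 
		HttpClient client = HttpClient.newHttpClient(); 

		// replicating a sniffed GET call (placeholder URL) 
		HttpRequest request = HttpRequest.newBuilder() 
				.uri(URI.create("https://www.example.com/api/products?page=1")) 
				.header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36") 
				.GET() 
				.build(); 

		// the response body is typically JSON you can parse 
		// with a library such as Jackson or Gson 
		HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString()); 
		System.out.println(response.body()); 
	} 
}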

Web Scraping With a Headless Browser

Web pages perform most of their AJAX calls in response to user interaction. This is why you need a tool that can load a web page in a browser and replicate user interaction. This is what a headless browser is for.

In detail, a headless browser is a web browser with no GUI that enables you to control a web page programmatically. In other terms, a headless browser allows you to instruct a web browser to perform tasks for you.

Thanks to a headless browser, you can interact with a web page through JavaScript as a human being would. One of the most popular Java libraries offering headless browser functionality is Selenium WebDriver.

Note that ZenRows API comes with headless browser capabilities. Learn more about how to extract dynamically loaded data.

If you use Gradle, add selenium-java with the line below in the dependencies section of your build.gradle file:

build.gradle
implementation "org.seleniumhq.selenium:selenium-java:4.14.1"

Otherwise, if you use Maven, insert the following lines in your pom.xml file:

pom.xml
<dependency> 
	<groupId>org.seleniumhq.selenium</groupId> 
	<artifactId>selenium-java</artifactId> 
	<version>4.14.1</version> 
</dependency>

Make sure to install the new dependency by running the update command from the terminal or in your IDE, and ensure you have a recent version of Chrome installed. The browser driver used to require a manual setup, but Selenium 4 and later handles it automatically. If you use Gradle, check the dependencies section of your build.gradle file; if you use Maven, check your pom.xml file. You're now ready to start using Selenium.

You can replicate the web scraping logic seen above on a single page with the following script:

Scraper.java
import org.openqa.selenium.*; 
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.chrome.ChromeDriver; 
import org.openqa.selenium.support.ui.WebDriverWait; 
import java.util.*; 
import com.zenrows.data.Product; 

 
public class Scraper { 
	public static void main(String[] args) {
		// defining the options to run Chrome in headless mode 
		ChromeOptions options = new ChromeOptions(); 
		options.addArguments("--headless"); 
 
		// initializing a Selenium WebDriver ChromeDriver instance 
		// to run Chrome in headless mode 
		WebDriver driver = new ChromeDriver(options); 
 
		// connecting to the target web page 
		driver.get("https://www.scrapingcourse.com/ecommerce/"); 
 
		// initializing the list of Java object to store 
		// the scraped data 
		List<Product> products = new ArrayList<>(); 
		 
		// retrieving the list of product HTML elements 
		List<WebElement> productElements = driver.findElements(By.cssSelector("li.product")); 
		 
		// iterating over the list of HTML products 
		for (WebElement productElement : productElements) { 
			Product product = new Product(); 
		 
			// extracting the data of interest from the product HTML element 
			// and storing it in product 
			product.setUrl(productElement.findElement(By.tagName("a")).getAttribute("href"));
            product.setImage(productElement.findElement(By.tagName("img")).getAttribute("src"));
            product.setName(productElement.findElement(By.tagName("h2")).getText());
            product.setPrice(productElement.findElement(By.tagName("span")).getText());

			// adding product to the list of the scraped products 
			products.add(product); 
		} 

		// ... 
		driver.quit();
	} 
}

As you can see, the web scraping logic isn't that different from the one seen before. What truly changes is that Selenium runs the web scraping logic in the browser. This means that Selenium has access to all the features offered by a browser.

For example, you can click on a pagination element to navigate directly to a new page, as below:

Scraper.java
WebElement paginationElement = driver.findElement(By.cssSelector("a.page-numbers")); 
// navigating to a new web page 
paginationElement.click(); 
 
// wait for the page to load... 
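// illustrative sketch: explicitly wait (up to 10 seconds) for the new page title, 
// assuming java.time.Duration and 
// org.openqa.selenium.support.ui.ExpectedConditions are also imported 
new WebDriverWait(driver, Duration.ofSeconds(10)) 
		.until(ExpectedConditions.titleContains("ScrapingCourse")); 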
 
System.out.println(driver.getTitle()); // "Ecommerce Test Site to Learn Web Scraping – ScrapingCourse.com"

In other words, Selenium allows you to perform web crawling by interacting with the elements on a web page, just like a human being would. This makes a web scraper based on a headless browser harder to detect and block. Learn more about how to perform web scraping without getting blocked.

Other Web Scraping Libraries For Java

Other useful Java libraries for web scraping are:

  • HtmlUnit: a GUI-less (headless) browser for Java. HtmlUnit can perform all browser-specific operations on a web page. Like Selenium, it was born for testing, but you can use it for web crawling and scraping.
  • Playwright: an end-to-end testing library for web apps developed by Microsoft. Again, it enables you to instruct a browser, so you can use it for web scraping just like Selenium. See the sketch right after this list.
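
For instance, here's a minimal Playwright for Java sketch (assuming the com.microsoft.playwright:playwright dependency is installed) that prints the product names from the target page:

PlaywrightScraper.java
import com.microsoft.playwright.*; 

public class PlaywrightScraper { 
	public static void main(String[] args) { 
		try (Playwright playwright = Playwright.create()) { 
			// launching a headless Chromium instance 
			Browser browser = playwright.chromium().launch(); 
			Page page = browser.newPage(); 

			// connecting to the target web page 
			page.navigate("https://www.scrapingcourse.com/ecommerce/"); 

			// selecting all product HTML elements and printing their names 
			for (Locator product : page.locator("li.product").all()) { 
				System.out.println(product.locator("h2").textContent()); 
			} 
		} 
	} 
}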

Conclusion

In this web scraping Java tutorial, you learned everything you need to know to perform professional web scraping with Java. In detail, you saw:

  1. Why Java is a good programming language when it comes to web scraping
  2. How to perform basic web scraping in Java with Jsoup
  3. How to crawl an entire website in Java
  4. Why you might need a headless browser
  5. How to use Selenium to perform scraping in Java on dynamic content websites

What you should never forget is that your web scraper needs to be able to bypass anti-scraping systems. This is why you need a complete web scraping Java API. ZenRows offers that and much more.

In detail, ZenRows is a tool that offers many services to help you perform web scraping. ZenRows also gives you access to a headless browser with a simple API call. Try ZenRows for free and start scraping data from the web with no effort.
