Are you looking to scrape a paginated website in Java? HtmlUnit is an ideal choice! It's lightweight, fast, and has built-in JavaScript support, making it a powerful tool for scraping dynamic content across multiple pages.
In this article, you'll learn how to implement pagination scraping with HtmlUnit in Java. Let's jump right in!
Scrape With a Navigation Bar In HtmlUnit
Paginated websites with a navigation bar often feature clickable next/previous buttons and page numbers to switch between pages.
An example is the E-commerce Challenge page below, which has 12 product pages:

A good way to scrape such websites with your HtmlUnit Java web scraper is to follow the next page link and extract content continuously until the last page. This technique simplifies pagination scraping, minimizing maintenance effort and eliminating the need for URL construction.
In this HtmlUnit pagination tutorial, you'll follow all 12 product pages on the above website to extract product names, prices, and image URLs.
Let's start with the first page.
Create a Scraper Java class to obtain the target URL's full-page HTML. Select the parent element containing all products and iterate through each one to extract the target product data using their class names. Then, store the scraped data in a Product object:
package org.example;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlElement;

import java.util.ArrayList;
import java.util.List;

public class Scraper {
    private static final String URL = "https://www.scrapingcourse.com/ecommerce/";

    // output handler
    public static void main(String[] args) {
        List<Product> products = scrapeProducts(URL);
        for (Product product : products) {
            System.out.println(product);
        }
    }

    public static List<Product> scrapeProducts(String url) {
        List<Product> products = new ArrayList<>();
        try (WebClient webClient = new WebClient()) {
            // disable CSS and JavaScript to speed up scraping
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setJavaScriptEnabled(false);
            // don't fail if any scripts error out
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            // request the target URL to get its full-page HTML
            HtmlPage page = webClient.getPage(url);

            // select all product parent nodes (getByXPath returns List<?>)
            List<?> productNodes = page.getByXPath("//li[contains(@class, 'product')]");

            // extract product data from each parent node
            for (Object node : productNodes) {
                HtmlElement productCard = (HtmlElement) node;
                String name = getTextContent(productCard, ".//h2[contains(@class, 'product-name')]");
                String price = getTextContent(productCard, ".//span[contains(@class, 'product-price')]");
                String imageUrl = getImageSrc(productCard, ".//img[contains(@class, 'product-image')]");
                products.add(new Product(name, price, imageUrl));
            }
        } catch (Exception e) {
            // print full stack trace for debugging
            e.printStackTrace();
        }
        return products;
    }

    private static String getTextContent(HtmlElement parent, String xpath) {
        HtmlElement element = parent.getFirstByXPath(xpath);
        return (element != null) ? element.asNormalizedText().trim() : "N/A";
    }

    private static String getImageSrc(HtmlElement parent, String xpath) {
        HtmlElement element = parent.getFirstByXPath(xpath);
        return (element != null) ? element.getAttribute("src") : "N/A";
    }
}

// store the extracted product data
class Product {
    private final String name;
    private final String price;
    private final String imageUrl;

    public Product(String name, String price, String imageUrl) {
        this.name = name;
        this.price = price;
        this.imageUrl = imageUrl;
    }

    @Override
    public String toString() {
        return String.format("{\"name\": \"%s\", \"price\": \"%s\", \"imageUrl\": \"%s\"}", name, price, imageUrl);
    }
}
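A quick note on dependencies: the imports above target HtmlUnit 2.x (the net.sourceforge.htmlunit:htmlunit artifact). If you're on HtmlUnit 3.x (org.htmlunit:htmlunit), the package root changed from com.gargoylesoftware.htmlunit to org.htmlunit, but the classes and methods used in this tutorial otherwise work largely the same way.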
The above code returns the product data for the first page:
{
    "name": "Abominable Hoodie",
    "price": "$69.00",
    "imageUrl": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main.jpg"
}

// ... other products omitted for brevity

{
    "name": "Artemis Running Short",
    "price": "$45.00",
    "imageUrl": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wsh04-black_main.jpg"
}
You're now extracting data from the target website's first page. That's the first step toward building an HtmlUnit pagination scraper in Java.
The current scraper is limited to the first page because it doesn't include any pagination logic. To paginate, you'll need to modify the code to follow the next page link on every page.
First, inspect the next page element in the developer console. You'll see it's an a tag with the class name next page-numbers:

Next, update the previous scraper to extract the next page link. Check whether that element exists in the DOM and follow its link only if it does. Then, recursively call the scrapeProducts function to extract data from the next page:
public class Scraper {
    private static final String URL = "https://www.scrapingcourse.com/ecommerce/";

    // ... output handler

    public static List<Product> scrapeProducts(String url) {
        // ...
        try (WebClient webClient = new WebClient()) {
            // ...

            // look for the "Next" button
            HtmlElement nextButton = page.getFirstByXPath("//a[contains(@class, 'next')]");
            if (nextButton != null) {
                // follow the next page link
                String nextPageUrl = nextButton.getAttribute("href");
                if (!nextPageUrl.startsWith("http")) {
                    // convert relative to absolute URL
                    nextPageUrl = URL + nextPageUrl;
                }

                // recursively scrape the next page
                List<Product> nextPageProducts = scrapeProducts(nextPageUrl);
                products.addAll(nextPageProducts);
            }
        } catch (Exception e) {
            // error handling
        }
        // ...
    }

    // ...
}
// ... Product object storage logic
Extend the previous scraper with the above modification, and you'll get the following complete code:
package org.example;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlElement;

import java.util.ArrayList;
import java.util.List;

public class Scraper {
    private static final String URL = "https://www.scrapingcourse.com/ecommerce/";

    // output handler
    public static void main(String[] args) {
        List<Product> products = scrapeProducts(URL);
        for (Product product : products) {
            System.out.println(product);
        }
    }

    public static List<Product> scrapeProducts(String url) {
        List<Product> products = new ArrayList<>();
        try (WebClient webClient = new WebClient()) {
            // disable CSS and JavaScript to speed up scraping
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setJavaScriptEnabled(false);
            // don't fail if any scripts error out
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            // request the target URL to get its full-page HTML
            HtmlPage page = webClient.getPage(url);

            // select all product parent nodes (getByXPath returns List<?>)
            List<?> productNodes = page.getByXPath("//li[contains(@class, 'product')]");

            // extract product data from each parent node
            for (Object node : productNodes) {
                HtmlElement productCard = (HtmlElement) node;
                String name = getTextContent(productCard, ".//h2[contains(@class, 'product-name')]");
                String price = getTextContent(productCard, ".//span[contains(@class, 'product-price')]");
                String imageUrl = getImageSrc(productCard, ".//img[contains(@class, 'product-image')]");
                products.add(new Product(name, price, imageUrl));
            }

            // look for the "Next" button
            HtmlElement nextButton = page.getFirstByXPath("//a[contains(@class, 'next')]");
            if (nextButton != null) {
                // follow the next page link
                String nextPageUrl = nextButton.getAttribute("href");
                if (!nextPageUrl.startsWith("http")) {
                    // convert relative to absolute URL
                    nextPageUrl = URL + nextPageUrl;
                }

                // recursively scrape the next page
                List<Product> nextPageProducts = scrapeProducts(nextPageUrl);
                products.addAll(nextPageProducts);
            }
        } catch (Exception e) {
            // print full stack trace for debugging
            e.printStackTrace();
        }
        return products;
    }

    private static String getTextContent(HtmlElement parent, String xpath) {
        HtmlElement element = parent.getFirstByXPath(xpath);
        return (element != null) ? element.asNormalizedText().trim() : "N/A";
    }

    private static String getImageSrc(HtmlElement parent, String xpath) {
        HtmlElement element = parent.getFirstByXPath(xpath);
        return (element != null) ? element.getAttribute("src") : "N/A";
    }
}

// store the extracted product data
class Product {
    private final String name;
    private final String price;
    private final String imageUrl;

    public Product(String name, String price, String imageUrl) {
        this.name = name;
        this.price = price;
        this.imageUrl = imageUrl;
    }

    @Override
    public String toString() {
        return String.format("{\"name\": \"%s\", \"price\": \"%s\", \"imageUrl\": \"%s\"}", name, price, imageUrl);
    }
}
The code now extracts products from all 12 pages, as shown:
{
    "name": "Abominable Hoodie",
    "price": "$69.00",
    "imageUrl": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main.jpg"
}

// ... 186 products omitted for brevity

{
    "name": "Zoltan Gym Tee",
    "price": "$29.00",
    "imageUrl": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/ms06-blue_main.jpg"
}
Nicely done! Your HtmlUnit pagination scraper now extracts products from all pages.
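The recursive approach works well here, but if a site has hundreds of pages, a plain loop avoids deep call stacks. Below is a minimal, optional variant of the same logic. It reuses the Product class and helper methods defined above, and it resolves the next link with HtmlPage's getFullyQualifiedUrl() instead of string concatenation, which handles relative hrefs more robustly:

public static List<Product> scrapeAllProducts(String startUrl) {
    List<Product> products = new ArrayList<>();
    try (WebClient webClient = new WebClient()) {
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setJavaScriptEnabled(false);

        String nextUrl = startUrl;
        // keep following the "Next" link until no more pages exist
        while (nextUrl != null) {
            HtmlPage page = webClient.getPage(nextUrl);

            // reuse the same per-page extraction as before
            for (Object node : page.getByXPath("//li[contains(@class, 'product')]")) {
                HtmlElement productCard = (HtmlElement) node;
                products.add(new Product(
                        getTextContent(productCard, ".//h2[contains(@class, 'product-name')]"),
                        getTextContent(productCard, ".//span[contains(@class, 'product-price')]"),
                        getImageSrc(productCard, ".//img[contains(@class, 'product-image')]")
                ));
            }

            // resolve the next page link against the current page's URL
            HtmlElement nextButton = page.getFirstByXPath("//a[contains(@class, 'next')]");
            nextUrl = (nextButton != null)
                    ? page.getFullyQualifiedUrl(nextButton.getAttribute("href")).toString()
                    : null;
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return products;
}

Both versions produce the same result on this site; pick whichever reads better to you.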
Scrape JavaScript-based Pagination In HtmlUnit
Websites with JavaScript-based pagination load content dynamically using JavaScript. Unlike static pagination, where each page's content is already present in the server-rendered HTML, dynamic pagination requires JavaScript execution to retrieve and display data.
JavaScript-based pagination can display more pages using infinite scrolling or a "Load More" button.
In infinite scrolling, content automatically loads as you scroll down the page. An example is the Infinite Scrolling Challenge page below:

With "Load More" pagination, you click a button to display more content as you scroll down the page. See how it works in the Load More Challenge page demonstration below:

These dynamic websites load content asynchronously, often using AJAX, Fetch API, or WebSockets to retrieve data after the initial page load. Extracting such content requires simulating user interactions, scrolling, clicking, or dispatching custom JavaScript events.
A Java headless browser like HtmlUnit lets you scrape such sites easily. This tool supports JavaScript execution, allowing you to simulate human interactions, such as scrolling, clicking, and more.
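To give you a concrete starting point, here's a minimal sketch of how you could approach both patterns with HtmlUnit. The class name, the "Load more" XPath, and the wait times are assumptions you'd adjust to the actual page; the URL points at the Infinite Scrolling Challenge page mentioned above. Keep in mind that HtmlUnit doesn't render layout, so scroll-triggered loaders don't always fire exactly as they would in a real browser:

package org.example;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class DynamicPaginationSketch {
    public static void main(String[] args) {
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // JavaScript must stay enabled to load dynamic content
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            // wait for AJAX calls triggered by scrolling or clicking to finish
            webClient.setAjaxController(new NicelyResynchronizingAjaxController());

            HtmlPage page = webClient.getPage("https://www.scrapingcourse.com/infinite-scrolling");

            // infinite scrolling: run a scroll script a few times, waiting for background JS each time
            for (int i = 0; i < 5; i++) {
                page.executeJavaScript("window.scrollTo(0, document.body.scrollHeight);");
                webClient.waitForBackgroundJavaScript(2_000);
            }

            // "Load More" pagination: click the button instead (this XPath is an assumption)
            HtmlElement loadMoreButton = page.getFirstByXPath("//button[contains(., 'Load more')]");
            if (loadMoreButton != null) {
                page = loadMoreButton.click();
                webClient.waitForBackgroundJavaScript(2_000);
            }

            // the page's HTML now includes whatever content the scripts loaded
            System.out.println(page.asXml());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}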
Avoid Getting Blocked While Scraping Multiple Pages
Pagination scraping increases the risk of being blocked due to multiple requests. Websites use detection methods like rate limiting, geo-restrictions, and CAPTCHA to block scrapers. You need to bypass these blocks to scrape successfully.
For example, the current HtmlUnit pagination scraper won't work on a protected site like the Antibot Challenge page. Try it out with the following simple request:
package org.example;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Scraper {
    public static void main(String[] args) {
        // create a WebClient instance to simulate a browser
        WebClient webClient = new WebClient();
        try {
            // disable JavaScript and CSS
            webClient.getOptions().setJavaScriptEnabled(false);
            webClient.getOptions().setCssEnabled(false);

            // request the target URL
            HtmlPage page = webClient.getPage("https://www.scrapingcourse.com/antibot-challenge");

            // get the full page HTML as a string
            String pageHtml = page.asXml();

            // print the HTML content
            System.out.println(pageHtml);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // close the WebClient to release resources
            webClient.close();
        }
    }
}
The scraper gets blocked with the following 403 Forbidden error:
403 Forbidden for https://www.scrapingcourse.com/antibot-challenge
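Why an error rather than a page? By default, HtmlUnit throws a FailingHttpStatusCodeException for any non-2xx response. If you'd prefer to inspect the blocked response yourself, you can optionally relax that behavior before calling getPage() in the snippet above:

// optional: don't throw on 4xx/5xx responses so the page object is returned
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
HtmlPage page = webClient.getPage("https://www.scrapingcourse.com/antibot-challenge");
System.out.println(page.getWebResponse().getStatusCode()); // prints 403 when blocked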
You can increase your scraper's stealth by adding custom scraping request headers from a real browser or using web scraping proxies. Unfortunately, these solutions are insufficient against sophisticated anti-bot measures.
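For reference, here's roughly how those tweaks look in HtmlUnit: emulate a real browser with BrowserVersion.CHROME, attach extra browser-like request headers, and route traffic through a proxy. This is a minimal sketch; the class name, header values, proxy host, and port are placeholders you'd replace with your own:

package org.example;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class StealthierScraper {
    public static void main(String[] args) {
        // emulate a real Chrome browser (sends Chrome's default User-Agent)
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setJavaScriptEnabled(false);

            // add extra browser-like headers to every request
            webClient.addRequestHeader("Accept-Language", "en-US,en;q=0.9");
            webClient.addRequestHeader("Referer", "https://www.google.com/");

            // route traffic through a proxy (host and port are placeholders)
            webClient.getOptions().getProxyConfig().setProxyHost("proxy.example.com");
            webClient.getOptions().getProxyConfig().setProxyPort(8080);

            HtmlPage page = webClient.getPage("https://www.scrapingcourse.com/ecommerce/");
            System.out.println(page.getTitleText());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}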
The best way to avoid blocks while scraping is to use a web scraping solution, such as the ZenRows Universal Scraper API. ZenRows is a user-friendly, all-in-one scraping toolkit for bypassing even the most complex anti-bot systems. It features premium proxy rotation, JavaScript rendering support, advanced fingerprint evasion, anti-bot auto-bypass, and more.
ZenRows also has headless browser features, making it an excellent solution for scraping JavaScript-based paginations.
Let's see how the ZenRows Universal Scraper API works by scraping the Antibot Challenge page that blocked you previously.
Sign up and go to the ZenRows Request Builder. Paste the target URL in the address box and activate Premium Proxies and JS Rendering.

Choose Java as your programming language and select the API connection mode. Then, copy the generated code into your Java file. Here's what it looks like:
import org.apache.hc.client5.http.fluent.Request;

public class APIRequest {
    public static void main(final String... args) throws Exception {
        String apiUrl = "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.scrapingcourse.com%2Fantibot-challenge&js_render=true&premium_proxy=true";
        String response = Request.get(apiUrl)
                .execute().returnContent().asString();
        System.out.println(response);
    }
}
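To run this snippet, add the Apache HttpClient Fluent library (the org.apache.httpcomponents.client5:httpclient5-fluent artifact) to your project and replace <YOUR_ZENROWS_API_KEY> with your actual ZenRows API key.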
The code accesses the protected website and extracts its full-page HTML. See the output below:
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>
Congratulations! 🎉 You just built a Java scraper with reliable anti-bot bypass capabilities.
Conclusion
You've learned to handle pagination and scrape multiple pages with HtmlUnit in Java.
However, remember that large-scale scraping without the correct tooling can get you blocked. We recommend using a web scraping solution like ZenRows to bypass CAPTCHAs and other anti-bot measures at any scale. All it takes is a single API call, and ZenRows handles all the complex tasks behind the scenes.