How to Do Web Scraping with HtmlUnit in Java

May 3, 2024 · 12 min read

HtmlUnit web scraping is a powerful way to extract content from HTML documents in Java, thanks to the library's rich DOM exploration API.

In this tutorial, you'll learn the basics of web scraping with HtmlUnit in Java and explore more complex use cases.

You'll see:

  • What HtmlUnit is.
  • How to scrape a static e-commerce page with it, step by step.
  • Examples of extracting specific data in common scenarios.
  • How to handle JavaScript-based pages.

Let's dive in!

What Is HtmlUnit?

HtmlUnit is a GUI-less browser for Java built around the Rhino JavaScript engine. It can visit web pages, parse their HTML documents, and simulate browsers such as Chrome and Firefox.

Compared to other HTML parsers in Java, e.g., jsoup, HtmlUnit is a more complete solution. It can simulate user interactions, such as clicks, form submissions, and others. Since it doesn't rely on an external browser, it's also more efficient than Selenium.

However, HtmlUnit's ability to execute JavaScript isn't as well-rounded as that of modern browsers. Still, it's a reliable tool for web scraping in Java.
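
For a quick taste of those interaction capabilities, here's a minimal sketch of filling in and submitting a login form. The URL, form name, and field names are hypothetical, and the snippet assumes an existing webClient instance and a method that declares throws IOException:

Example
// hypothetical login page and form/field names, for illustration only
final HtmlPage loginPage = webClient.getPage("https://example.com/login");
final HtmlForm loginForm = loginPage.getFormByName("loginForm");
// type the credentials into the form fields
loginForm.getInputByName("username").type("user");
loginForm.getInputByName("password").type("secret");
// click the submit button and get the page loaded after the submission
final HtmlPage afterLogin = loginForm.getButtonByName("submit").click();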

How to Scrape With HtmlUnit

Let’s see how to start web scraping in Java with HtmlUnit. The target site will be ScrapingCourse.com, a static e-commerce site with fake products:

Target Site

The scraping objective is to extract data from each product on that page. Follow the instructions below and learn how to achieve that goal with HtmlUnit.

Step 1: Install HtmlUnit in Java

Before diving into coding, ensure you have the latest JDK and Gradle or Maven installed locally. In this tutorial, we'll use Maven because it's more popular in the Java ecosystem.

Set up a Java Maven project. Open a terminal in the folder you want to use for your HtmlUnit web scraping project. Then, initialize a Maven project inside it with the following command:

Terminal
mvn archetype:generate -DgroupId="com.zenrows.scraper" -DartifactId="htmlunit-scraper" -DarchetypeArtifactId="maven-archetype-quickstart" -DarchetypeVersion="1.4" -DinteractiveMode="false"

Wait for Maven to download the required tools and create your project. The htmlunit-scraper directory will now contain a Maven project.

Time to add HtmlUnit to your project's dependencies. Open the pom.xml file created by Maven and install HtmlUnit by adding the following lines inside the <dependencies> tag. Note that this tutorial targets the HtmlUnit 2.x line, whose classes live under com.gargoylesoftware.htmlunit; newer 3.x releases moved to the org.htmlunit groupId and packages:

pom.xml
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.70.0</version>
</dependency>

Launch the command below to update the dependencies. It will download, install, and configure the HtmlUnit package:

Terminal
mvn dependency:resolve

Load the project folder in your Java IDE. IntelliJ IDEA Community Edition or Visual Studio Code with the Java extension will do. Import HtmlUnit by adding these two lines to the App.java file in the nested folders inside /src:

app.java
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;

App.java, the entry script of your web scraping project, will now contain:

app.java
package com.zenrows.scraper;

import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;

public class App {
    public static void main( String[] args ) {
        System.out.println( "Hello World!" );
    }
}

Build your Java application with the following Maven command:

Terminal
mvn package

You can then execute the .jar file that will appear in the /target folder with:

Terminal
java -cp target/htmlunit-scraper-1.0-SNAPSHOT.jar com.zenrows.scraper.App

Alternatively, run the application directly with your IDE's run button.

That script will produce the output below in the terminal:

Output
Hello World!

You now have a Java Maven project in place. Get ready to turn it into an HtmlUnit scraping application!

Step 2: Initialize a Java Class to Fetch the HTML Code

Open the App.java file and initialize a WebClient object in a try-with-resources as shown below. The WebClient class simulates a browser and exposes a complete web scraping API:

app.java
try (final WebClient webClient = new WebClient()) {
    // ...
}

This try syntax ensures the webClient resource gets closed at the end of the statement.
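
For illustration, that's roughly equivalent to the manual try/finally version below:

Example
// what try-with-resources does for you behind the scenes
final WebClient webClient = new WebClient();
try {
    // ... scraping logic ...
} finally {
    // always release the simulated browser
    webClient.close();
}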

The target site doesn't need JavaScript and CSS rendering, so disable them to save resources:

app.java
webClient.getOptions().setJavaScriptEnabled(false);
webClient.getOptions().setCssEnabled(false);

Then, use the getPage() method to load the target page in the HtmlUnit browser:

app.java
final HtmlPage page = webClient.getPage("https://scrapingcourse.com/ecommerce/");

If an I/O error occurs, getPage() raises a checked IOException you must handle. Since the try statement doesn't have a catch section, add a throws declaration to the main function:

app.java
public static void main(String[] args) throws IOException {
    // ....
}

Don't forget to add the following import on top of your App.java file:

app.java
import java.io.IOException;
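
If you prefer handling the error locally instead of declaring throws, wrapping the logic in a try/catch also works. A minimal sketch:

Example
try (final WebClient webClient = new WebClient()) {
    final HtmlPage page = webClient.getPage("https://scrapingcourse.com/ecommerce/");
    // ... scraping logic ...
} catch (IOException e) {
    // log the failure instead of propagating the checked exception
    System.err.println("Failed to load the page: " + e.getMessage());
}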

Use the following lines to access the raw HTML of the page from the server response and print it in the terminal:

app.java
final String html = page.getWebResponse().getContentAsString();
System.out.println(html);

Put it all together, and you'll get:

app.java
package com.zenrows.scraper;

import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import java.io.IOException;

public class App {
    public static void main(String[] args) throws IOException {
        try (final WebClient webClient = new WebClient()) {
            // to disable JS rendering
            webClient.getOptions().setJavaScriptEnabled(false);
            // to disable CSS rendering
            webClient.getOptions().setCssEnabled(false);

            // connect to the target page
            final HtmlPage page = webClient.getPage("https://scrapingcourse.com/ecommerce/");

            // get the HTML source of the page
            // and print it in the terminal
            final String html = page.getWebResponse().getContentAsString();
            System.out.println(html);
        }
    }
}

Launch the script, and it'll print:

Output
<!doctype html>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="profile" href="https://gmpg.org/xfn/11">
<link rel="pingback" href="https://scrapingcourse.com/ecommerce/xmlrpc.php">
 
<title>An ecommerce store to scrape &#8211; Scraping Laboratory</title>
<meta name='robots' content='max-image-preview:large' />
<link rel='dns-prefetch' href='//stats.wp.com' />
<link rel='dns-prefetch' href='//scrapingcourse.com' />
<link rel='dns-prefetch' href='//fonts.googleapis.com' />
<!-- omitted for brevity... -->

Awesome! The HtmlUnit web scraping script connects to the target page as intended.

However, keep in mind that ScrapingCourse.com is just a scraping sandbox. Real-world sites can use anti-bot systems to stop requests from automated scripts, in which case your scraper's requests would fail.

One way to avoid that is to use a web scraping API, such as ZenRows. The tool bypasses all anti-scraping measures for you, getting the HTML content of any web page. You can then feed it to HtmlUnit or any other Java HTML parsing library and start extracting data from it.

Step 3: Extract a Single Element

The HtmlPage object returned by getPage() has several methods to select HTML nodes on the page:

  • getFirstByXPath(): Evaluates the specified XPath expression and returns the first matching HTML element. It returns null if no node matches the specified XPath expression.
  • getByXPath(): Evaluates the specified XPath expression and returns the matching HTML elements.
  • querySelector(): Returns the first element within the document that matches the specified CSS selector. It returns null if no element matches the given CSS selector.
  • querySelectorAll(): Retrieves all HTML nodes matching the specified CSS selector.

Note that HtmlUnit supports both XPath expressions and CSS selectors. If you don't know which element selection language is best for you, read our article on XPath vs CSS selector.
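
As a quick comparison, both languages can target the same node. For instance, here's how you could select the first product name on the page with either approach (a sketch based on the product markup inspected later in this tutorial):

Example
// XPath: the <h2> inside the first product <li>
final HtmlElement nameByXPath = page.getFirstByXPath("//li[contains(@class, 'product')]//h2");
// CSS selector: the same element
final HtmlElement nameByCss = page.querySelector("li.product h2");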

Let's keep things simple and go for CSS selectors. See querySelector() in action to select the single <title> element on the page. Then, print its content with getTextContent():

app.java
final HtmlElement titleHTMLElement = page.querySelector("title");
final String title = titleHTMLElement.getTextContent();
System.out.println(title);

The snippet above will produce the following output:

Output
An ecommerce store to scrape – Scraping Laboratory

Great! This is exactly the title of the page that you can also see in your browser.

Extracting the title from a page is a popular task. So, HtmlUnit provides the getTitleText() utility for that:

app.java
final String title = page.getTitleText();
System.out.println(title);

These two lines of code produce the same result as the snippet seen earlier.

Step 4: Extract Multiple Elements

It's time to build on what you've just learned and extract the name, image, and price data from a product HTML element on the page.

Inspect a product element to figure out the best node selection strategy. Open ScrapingCourse.com in the browser, right-click on a product HTML node, and select "Inspect":

Inspecting with DevTools

Expand the HTML code and notice that the first product node is a <li> element with a product class. Inside that HTML node, there are:

  • An <h2> element with the product name.
  • An <img> element containing the product image.
  • A <span> element with a price class storing the product price.

Follow the instructions below and learn how to scrape data from those HTML elements.

Select the first product node on the page:

app.java
final HtmlElement productHTMLElement = page.querySelector("li.product");

The HtmlElement class exposes the XPath and CSS selector node selection methods as well. Specifically, these work on the descendants of the current node. Use three querySelector() calls to select the nodes of interest:

app.java
final HtmlElement productNameHTMLElement = productHTMLElement.querySelector("h2");
final HtmlElement productImageHTMLElement = productHTMLElement.querySelector("img");
final HtmlElement productPriceHTMLElement = productHTMLElement.querySelector("span.price");

Retrieve the desired data from their text or attributes with getTextContent() and getAttribute():

app.java
final String productName = productNameHTMLElement.getTextContent();
final String productImage = productImageHTMLElement.getAttribute("src");
final String productPrice = productPriceHTMLElement.getTextContent();

Print the scraped information with the following System.out.println() instruction:

app.java
System.out.println(
    "Product{name='" + productName +
    "', image='" + productImage +
    "', price='" + productPrice +
    "'}"
);

Your App.java class will now contain:

app.java
package com.zenrows.scraper;

import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import java.io.IOException;

public class App {
    public static void main(String[] args) throws IOException {
        try (final WebClient webClient = new WebClient()) {
            // to disable JS rendering
            webClient.getOptions().setJavaScriptEnabled(false);
            // to disable CSS rendering
            webClient.getOptions().setCssEnabled(false);

            // connect to the target page
            final HtmlPage page = webClient.getPage("https://scrapingcourse.com/ecommerce/");

            // select the first product node on the page
            final HtmlElement productHTMLElement = page.querySelector("li.product");

            // select the inner product nodes
            // to scrape data from
            final HtmlElement productNameHTMLElement = productHTMLElement.querySelector("h2");
            final HtmlElement productImageHTMLElement = productHTMLElement.querySelector("img");
            final HtmlElement productPriceHTMLElement = productHTMLElement.querySelector("span.price");

            // data extraction logic
            final String productName = productNameHTMLElement.getTextContent();
            final String productImage = productImageHTMLElement.getAttribute("src");
            final String productPrice = productPriceHTMLElement.getTextContent();

            // log the scraped data
            System.out.println(
                "Product{name='" + productName +
                "', image='" + productImage +
                "', price='" + productPrice +
                "'}"
            );
        }
    }
}

Run the HtmlUnit web scraping script, and it'll produce this output in the terminal:

Output
Product{name='Abominable Hoodie', image='https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg', price='$69.00'}

Amazing! You’ve just learned how to parse data from a single HTML element.

Step 5: Extract All Matching Elements from a Page

You now know how to collect data from a single product element. However, the destination page contains several products! Let's see how to scrape them all.

Retrieve all product HTML nodes with querySelectorAll(). Next, iterate over the product nodes in a for loop and apply the data extraction logic to each of them:

app.java
// select all product nodes on the page
final DomNodeList<DomNode> productHTMLElements = page.querySelectorAll("li.product");

// iterate over all product HTML nodes on the page
// and apply the scraping logic on each of them
for (DomNode productHTMLElement : productHTMLElements) {
    // select the inner product nodes
    // to scrape data from
    final HtmlElement productNameHTMLElement = productHTMLElement.querySelector("h2");
    final HtmlElement productImageHTMLElement = productHTMLElement.querySelector("img");
    final HtmlElement productPriceHTMLElement = productHTMLElement.querySelector("span.price");

    // data extraction logic
    final String productName = productNameHTMLElement.getTextContent();
    final String productImage = productImageHTMLElement.getAttribute("src");
    final String productPrice = productPriceHTMLElement.getTextContent();

    // log the scraped data
    System.out.println(
            "Product{name='" + productName +
                    "', image='" + productImage +
                    "', price='" + productPrice +
                    "'}");
}

Note that querySelectorAll() returns a list of DomNode, not HtmlElement as before. That isn't a problem, as DomNode is a superclass of HtmlElement and exposes the same node selection methods.
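
If you ever need an HtmlElement-specific method on one of those nodes, such as getAttribute(), a cast does the trick. A quick sketch:

Example
// the product <li> nodes are HTML elements, so the cast is safe here
final HtmlElement firstProduct = (HtmlElement) productHTMLElements.get(0);
System.out.println(firstProduct.getAttribute("class"));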

Integrate the logic into the App.java file and get the final HtmlUnit scraper:

app.java
package com.zenrows.scraper;

import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import java.io.IOException;

public class App {
    public static void main(String[] args) throws IOException {
        try (final WebClient webClient = new WebClient()) {
            // to disable JS rendering
            webClient.getOptions().setJavaScriptEnabled(false);
            // to disable CSS rendering
            webClient.getOptions().setCssEnabled(false);

            // connect to the target page
            final HtmlPage page = webClient.getPage("https://scrapingcourse.com/ecommerce/");

            // select all product nodes on the page
            final DomNodeList<DomNode> productHTMLElements = page.querySelectorAll("li.product");

            // iterate over all product HTML nodes on the page
            // and apply the scraping logic on each of them
            for (DomNode productHTMLElement : productHTMLElements) {
                // select the inner product nodes
                // to scrape data from
                final HtmlElement productNameHTMLElement = productHTMLElement.querySelector("h2");
                final HtmlElement productImageHTMLElement = productHTMLElement.querySelector("img");
                final HtmlElement productPriceHTMLElement = productHTMLElement.querySelector("span.price");

                // data extraction logic
                final String productName = productNameHTMLElement.getTextContent();
                final String productImage = productImageHTMLElement.getAttribute("src");
                final String productPrice = productPriceHTMLElement.getTextContent();

                // log the scraped data
                System.out.println(
                        "Product{name='" + productName +
                                "', image='" + productImage +
                                "', price='" + productPrice +
                                "'}");
            }
        }
    }
}

Launch the HtmlUnit web scraping script to get the following result:

Output
Product{name='Abominable Hoodie', image='https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg', price='$69.00'}
// ...
Product{name='Artemis Running Short', image='https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wsh04-black_main-324x324.jpg', price='$45.00'}

Et voilà! You've learned how to use HtmlUnit to extract data from every product on a page. To go through each paginated page on the site and scrape all products, you must build a crawler, as sketched below. Find out more in our guide on Java web crawling.
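
As a minimal sketch of that idea, you could keep following the pagination link until it disappears, applying the same product extraction logic to every page. The a.next selector below assumes a WooCommerce-style "next page" link and isn't verified against the target site; the code also relies on the existing throws IOException declaration:

Example
HtmlPage currentPage = webClient.getPage("https://scrapingcourse.com/ecommerce/");
while (currentPage != null) {
    // ... apply the product scraping logic to currentPage ...

    // follow the "next page" link, if any
    final HtmlAnchor nextLink = currentPage.querySelector("a.next");
    if (nextLink != null) {
        currentPage = nextLink.click();
    } else {
        currentPage = null;
    }
}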

Still, these aren't all the use cases supported by HtmlUnit. Let’s explore more in the next section!

Examples of Extracting Specific Data

The scraping API offered by HtmlUnit is so powerful that it can cover most scenarios. Here's a table showcasing the methods from HtmlUnit along with their descriptions:

  • getElementByName(): Returns the HTML element with the specified name. Example: HtmlElement element = page.getElementByName("username");
  • getElementsByIdAndOrName(): Returns the HTML elements with the given name or id attribute. Example: List<HtmlElement> elements = page.getElementsByIdAndOrName("password");
  • getElementsByName(): Returns the HTML elements with the specified name attribute. Example: List<HtmlElement> elements = page.getElementsByName("subject");
  • getFocusedElement(): Returns the HTML element currently on focus or null if no element has the focus. Example: HtmlElement focusedElement = page.getFocusedElement();
  • getFormByName(): Returns the first form element that matches the specified name. Example: HtmlForm form = page.getFormByName("loginForm");
  • getForms(): Returns a list of all the <form> elements in the page. Example: List<HtmlForm> forms = page.getForms();
  • getHead(): Returns the <head> element. Example: HtmlHead head = page.getHead();
  • getHtmlElementById(): Returns the HTML element with the specified id attribute. Example: HtmlElement element = page.getHtmlElementById("product-14");
  • getFirstByXPath(): Evaluates the specified XPath expression and returns the first matching HTML element. Returns null if no node matches the specified XPath expression. Example: HtmlElement element = page.getFirstByXPath("//div[@class='article']");
  • getByXPath(): Evaluates the specified XPath expression and returns the matching HTML elements. Example: List<HtmlElement> elements = page.getByXPath("//div[@class='article']");
  • querySelector(): Returns the first element within the document that matches the specified CSS selector. It returns null if no element matches the given CSS selector. Example: DomNode element = page.querySelector(".article");
  • querySelectorAll(): Retrieves all HTML nodes matching the specified CSS selector. Example: DomNodeList<DomNode> elements = page.querySelectorAll(".article");

Let's take a look at how to use these methods in common data extraction scenarios through quick examples!

Find Nodes by Text

Sometimes, defining an effective node selection strategy isn't easy. In such scenarios, you need to search for HTML elements based on their internal text.

Suppose you want to find all HTML elements that contain the "Select options" string. You can select them all with a plain XPath expression:

app.java
final List<HtmlElement> buttonHTMLElements = page.getByXPath("//*[contains(text(), 'Select options')]");

This instruction returns all HTML nodes whose own text contains "Select options".

You might consider going through all the nodes on the page and keeping only those whose text content contains the desired string, but that approach wouldn't work well. Since a node's text content includes the text of all its descendants, it would also match top-level nodes such as <body> and <html>.

Get All Link URLs

Retrieving the URLs of all the links on a page is essential in web crawling. Find out more in our comparison guide on web crawling vs. web scraping.

Select all <a> nodes with a CSS selector and then extract their URLs with a stream. This and the following stream-based snippets assume import java.util.List; and import java.util.stream.Collectors; at the top of App.java:

app.java
final List<String> urls = page
        .querySelectorAll("a")
        .stream()
        .map(a -> ((HtmlElement) a).getAttribute("href"))
        .collect(Collectors.toList());

The resulting list will also contain anchor and relative links. Filter them out and get absolute links only by keeping all URLs that start with "https":

app.java
final List<String> absoluteUrls = page
        .querySelectorAll("a")
        .stream()
        .map(a -> ((HtmlElement) a).getAttribute("href"))
        .filter(url -> url.startsWith("https:"))
        .collect(Collectors.toList());
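
Alternatively, instead of discarding relative links, you can resolve them against the page URL with the getFullyQualifiedUrl() method of HtmlPage. A sketch (it assumes java.util.ArrayList is imported; MalformedURLException is a subclass of IOException, so the existing throws declaration covers it):

Example
final List<String> resolvedUrls = new ArrayList<>();
for (DomNode a : page.querySelectorAll("a")) {
    final String href = ((HtmlElement) a).getAttribute("href");
    if (!href.isEmpty()) {
        // resolve relative and anchor links against the page URL
        resolvedUrls.add(page.getFullyQualifiedUrl(href).toString());
    }
}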

Select Elements by Attribute Value

Targeting specific HTML attributes is one of the most effective ways to select nodes on a page. HtmlUnit allows you to accomplish that with three different approaches.

Assume your target page has many card nodes with custom HTML attributes, as shown in the snippet below. Your goal is to select only the cards whose data-category attribute is set to "shoes".

Example
<div class="card" data-category="shoes">Trailblaze Trekking Shoes</div>
<div class="card" data-category="t-shirt">Black XL T-Shirt</div>
<div class="card" data-category="shoes">Urban Hiker Sneakers</div>
<div class="card" data-category="other">Gold Necklace</div>

The first option is a CSS attribute selector:

app.java
final DomNodeList<DomNode> shoesHTMLElements = page.querySelectorAll("[data-category='shoes']");

You can also use an XPath expression:

app.java
final List<HtmlElement> shoesHTMLElements = page.getByXPath("//*[@data-category='shoes']");

Otherwise, get all .card nodes and then filter them using the Java 8+ stream filtering capabilities:

app.java
final List<HtmlElement> shoesHTMLElements = page
        .querySelectorAll(".card")
        .stream()
        .map(HtmlElement.class::cast)
        .filter(card -> "shoes".equals(card.getAttribute("data-category")))
        .collect(Collectors.toList());

All three code snippets above will select the same nodes.

Retrieve All Elements in a List

The HtmlUnit web scraping API makes it easy to retrieve data from an HTML list. As you’ll see here, it only takes a few lines of code!

Assume your target page has the following bulleted list of books:

Example
<ul class="book-list">
    <li>Book 1</li>
    <li>Book 2</li>
    <li>Book 3</li>
</ul>

The goal is to retrieve the name of each element in the list and store it in a Java array. Achieve that with this snippet:

app.java
final List<String> books = page
        .querySelectorAll(".book-list li")
        .stream()
        .map(bookHTMLElement -> bookHTMLElement.getTextContent())
        .collect(Collectors.toList());

Once again, Java stream capabilities are critical to achieving the goal. books will contain the following values:

Output
["Book 1", "Book 2", "Book 3"]

Scrape Data From a Table

Scraping tables is one of the most common scenarios in web scraping with HtmlUnit. Consider the HTML code of the <table> below:

Example
<table class="seasons">
    <thead>
        <tr>
            <th>#</th>
            <th>Title</th>
            <th>Episodes</th>
            <th>Airdates</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>1</td>
            <td>Season 1: Prelude</td>
            <td>28</td>
            <td>2024-06-01 - 2024-31-03</td>
        </tr>
        <tr>
            <td>2</td>
            <td>Season 2: A New Mystery</td>
            <td>24</td>
            <td>2025-03-02 - 2025-09-08</td>
        </tr>
        <!-- Other rows... -->
    </tbody>
</table>

Scraping all data from this table requires a few steps:

  1. Define a custom class Episode that matches the values in the table columns.
  2. Initialize a list of Episode where to store each entry in the table.
  3. Use the .seasons tr CSS selector in HtmlUnit to select each row in the table.
  4. Iterate over each row HTML node.
  5. Instantiate a new Episode instance by collecting data from the <td> elements.
  6. Add the Episode instance to the list of scraped data.

The snippet below uses the above algorithm to extract data from a table with HtmlUnit:

app.java
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import java.util.ArrayList;
import java.util.List;

public class HtmlUnitExample {
    static class Episode {
        private String number;
        private String title;
        private Integer episodes;
        private String airdate;
    
        // getters and setters omitted for brevity...
    }

    public static void main(String[] args) throws Exception {
        try (final WebClient webClient = new WebClient()) {
            // navigate to the target page
            final HtmlPage page = webClient.getPage("https://YOUR-DOMAIN.com/season-table");
            // where to store the scraped data
            List<Episode> episodes = new ArrayList<>();

            // select all rows in the table
            DomNodeList<DomNode> rowHTMLElements = page.querySelectorAll(".seasons tr");

            // iterate over each row in the table
            for (DomNode rowHTMLElement: rowHTMLElements) {
                // select all cells in the row
                DomNodeList<DomNode> cellHTMLElements = rowHTMLElement.querySelectorAll("td");
                if (cellHTMLElements.size() >= 4) {
                    // create a new Episode object
                    // and populate it with some data 
                    Episode episode = new Episode();
                    episode.setNumber(cellHTMLElements.get(0).getTextContent());
                    episode.setTitle(cellHTMLElements.get(1).getTextContent());
                    episode.setEpisodes(Integer.parseInt(cellHTMLElements.get(2).getTextContent()));
                    episode.setAirdate(cellHTMLElements.get(3).getTextContent());

                    // add the new episode to the list
                    episodes.add(episode);
                }
            }
        }
    }
}

Handle JavaScript-Based Pages

Some web pages use JavaScript for rendering purposes or dynamically loading data, such as the infinite scrolling demo page shown below. The page makes new AJAX calls to retrieve new products as the user scrolls down:

Infinite Scroll Demo

Your new scraping goal is to extract data from the first 50 products on this dynamic content page.

Enable the CSS and JS rendering and connect to the target page in Chrome mode:

app.java
package com.zenrows.scraper;

import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import java.io.IOException;

public class App {
    public static void main(String[] args) throws IOException {
        try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {

            // enable JS rendering
            webClient.getOptions().setJavaScriptEnabled(true);
            // enable CSS rendering
            webClient.getOptions().setCssEnabled(true);

            // connect to the target page
            final HtmlPage page = webClient.getPage("https://scrapingcourse.com/infinite-scrolling");
        }
    }
}

You'll see a lot of warnings in the terminal:

Output
WARNING: CSS error: 'https://scrapingcourse.com/build/assets/app-OVTGD5UP.css' [1:14363] Error in expression. (Invalid token "var(". Was expecting one of: <S>, <NUMBER>, "-", <PLUS>, <PERCENTAGE>.)
WARNING: CSS error: 'https://scrapingcourse.com/build/assets/app-OVTGD5UP.css' [1:14518] Error in expression. (Invalid token "var(". Was expecting one of: <S>, <NUMBER>, "-", <PLUS>, <PERCENTAGE>.)
WARNING: CSS error: 'https://scrapingcourse.com/build/assets/app-OVTGD5UP.css' [1:15070] Error in expression. (Invalid token "0". Was expecting one of: <S>, ")", <COMMA>.)

The problem is that HtmlUnit's CSS and JavaScript engine lags far behind modern browsers. On some pages, you'll even get a ScriptException that crashes your script. Silencing these errors won't help, as HtmlUnit still won't be able to render the page properly.
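
For reference, these are the options usually tried to mute that noise. They only hide the warnings and script exceptions; the page still won't render correctly:

Example
// don't throw when a JavaScript error occurs
webClient.getOptions().setThrowExceptionOnScriptError(false);
// swallow CSS parser warnings
webClient.setCssErrorHandler(new SilentCssErrorHandler());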

The real solution for scraping dynamic pages in Java is to use a headless browser. Selenium is the most popular browser automation tool used to simulate user interaction. Install it by adding the selenium-java package to your pom.xml:

pom.xml
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.19.1</version>
</dependency>

Update the project's dependencies with:

Terminal
mvn dependency:resolve

Next, import Selenium into your script. Use it to instruct headless Chrome to connect to the target page:

app.java
package com.zenrows.scraper;

import org.openqa.selenium.*;
import org.openqa.selenium.chrome.*;
import org.openqa.selenium.support.ui.*;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import java.io.IOException;
import java.time.Duration;

public class App {
    public static void main(String[] args) throws IOException {
        // define the options to run Chrome in headless mode
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");

        // initialize a Selenium instance
        // to run Chrome in headless mode
        WebDriver driver = new ChromeDriver(options);

        // connect to the target web page
        driver.get("https://scrapingcourse.com/infinite-scrolling");

        // interaction and scraping logic...

        // release the web driver resources
        driver.quit();
    }
}

Selenium doesn't provide a high-level method for infinite scrolling. Thus, you need to simulate the interaction with a custom JS script:

app.java
const delay = ms => new Promise(res => setTimeout(res, ms));
const scrolls = 20;
let scrollCount = 0;

const scrollInterval = setInterval(async () => {
  window.scrollTo(0, 0);
  await delay(1000);
  window.scrollTo(0, document.body.scrollHeight);
  scrollCount++;

  if (scrollCount === scrolls) {
    clearInterval(scrollInterval);
  }
}, 1500);

Execute it with executeScript() and wait for the last product to be on the page before proceeding:

app.java
// script to perform the scroll interaction
String jsScrollingScript = "const delay = ms => new Promise(res => setTimeout(res, ms));\n" +
        "const scrolls = 20;\n" +
        "let scrollCount = 0;\n" +
        "\n" +
        "const scrollInterval = setInterval(async () => {\n" +
        "  window.scrollTo(0, 0);\n" +
        "  await delay(1000);\n"+
        "  window.scrollTo(0, document.body.scrollHeight);\n" +
        "  scrollCount++;\n" +
        "\n" +
        "  if (scrollCount === scrolls) {\n" +
        "    clearInterval(scrollInterval);\n" +
        "  }\n" +
        "}, 1500);";

// execute the JavaScript script on the page
((JavascriptExecutor) driver).executeScript(jsScrollingScript);

// wait for the 50th product to be visible
WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(60));
wait.until(ExpectedConditions.presenceOfElementLocated(
        By.cssSelector("#product-container > .flex.flex-col.items-center.rounded-lg:nth-child(50)")));

Now, you have two choices. You can implement the scraping logic directly with Selenium. Or, you can parse the actual HTML of the page with HtmlUnit. Let's follow the second option:

app.java
// extract the page HTML
String html = driver.getPageSource();

// parse the HTML of the current page with HtmlUnit
try (WebClient webClient = new WebClient()) {
    // disable CSS and JS rendering
    webClient.getOptions().setJavaScriptEnabled(false);
    webClient.getOptions().setCssEnabled(false);

    // inject the current HTML into
    // the HtmlUnit browser window
    HtmlPage page = webClient.loadHtmlCodeIntoCurrentWindow(html);
}

Next, apply the scraping logic in HtmlUnit:

app.java
// select all product nodes on the page
final DomNodeList<DomNode> productHTMLElements = page
        .querySelectorAll("#product-container > .flex.flex-col.items-center.rounded-lg");

// iterate over all product HTML nodes on the page
// and apply the scraping logic on each of them
for (DomNode productHTMLElement : productHTMLElements) {
    // select the inner product nodes
    // to scrape data from
    final HtmlElement productNameHTMLElement = productHTMLElement.querySelector("span");
    final HtmlElement productImageHTMLElement = productHTMLElement.querySelector("img");
    final HtmlElement productPriceHTMLElement = productHTMLElement.querySelector("span.text-slate-600");

    // data extraction logic
    final String productName = productNameHTMLElement.getTextContent();
    final String productImage = productImageHTMLElement.getAttribute("src");
    final String productPrice = productPriceHTMLElement.getTextContent();

    // log the scraped data
    System.out.println(
            "Product{name='" + productName +
                    "', image='" + productImage +
                    "', price='" + productPrice +
                    "'}");
}

This is what the complete Selenium HtmlUnit web scraping script will look like:

app.java
package com.zenrows.scraper;

import org.openqa.selenium.*;
import org.openqa.selenium.chrome.*;
import org.openqa.selenium.support.ui.*;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import java.io.IOException;
import java.time.Duration;

public class App {
    public static void main(String[] args) throws IOException {
        // define the options to run Chrome in headless mode
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");

        // initialize a Selenium instance
        // to run Chrome in headless mode
        WebDriver driver = new ChromeDriver(options);

        // connect to the target web page
        driver.get("https://scrapingcourse.com/infinite-scrolling");

        // script to perform the scroll interaction
        String jsScrollingScript = "const delay = ms => new Promise(res => setTimeout(res, ms));\n" +
                "const scrolls = 20;\n" +
                "let scrollCount = 0;\n" +
                "\n" +
                "const scrollInterval = setInterval(async () => {\n" +
                "  window.scrollTo(0, 0);\n" +
                "  await delay(1000);\n" +
                "  window.scrollTo(0, document.body.scrollHeight);\n" +
                "  scrollCount++;\n" +
                "\n" +
                "  if (scrollCount === scrolls) {\n" +
                "    clearInterval(scrollInterval);\n" +
                "  }\n" +
                "}, 1500);";

        // execute the JavaScript script on the page
        ((JavascriptExecutor) driver).executeScript(jsScrollingScript);

        // wait for the 50th product to be visible
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(60));
        wait.until(ExpectedConditions.presenceOfElementLocated(
                By.cssSelector("#product-container > .flex.flex-col.items-center.rounded-lg:nth-child(50)")));

        // extract the page HTML
        String html = driver.getPageSource();

        // parse the HTML of the current page with HtmlUnit
        try (WebClient webClient = new WebClient()) {
            // disable CSS and JS rendering
            webClient.getOptions().setJavaScriptEnabled(false);
            webClient.getOptions().setCssEnabled(false);

            // inject the current HTML into
            // the HtmlUnit browser window
            HtmlPage page = webClient.loadHtmlCodeIntoCurrentWindow(html);

            // select all product nodes on the page
            final DomNodeList<DomNode> productHTMLElements = page
                    .querySelectorAll("#product-container > .flex.flex-col.items-center.rounded-lg");

            // iterate over all product HTML nodes on the page
            // and apply the scraping logic on each of them
            for (DomNode productHTMLElement : productHTMLElements) {
                // select the inner product nodes
                // to scrape data from
                final HtmlElement productNameHTMLElement = productHTMLElement.querySelector("span");
                final HtmlElement productImageHTMLElement = productHTMLElement.querySelector("img");
                final HtmlElement productPriceHTMLElement = productHTMLElement.querySelector("span.text-slate-600");

                // data extraction logic
                final String productName = productNameHTMLElement.getTextContent();
                final String productImage = productImageHTMLElement.getAttribute("src");
                final String productPrice = productPriceHTMLElement.getTextContent();

                // log the scraped data
                System.out.println(
                        "Product{name='" + productName +
                                "', image='" + productImage +
                                "', price='" + productPrice +
                                "'}");
            }
        }

        // release the web driver resources
        driver.quit();
    }
}

Launch it, and it'll print over 50 products:

Output
Product{name='Chaz Kangeroo Hoodie', image='https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh01-gray_main.jpg', price='$52'}
Product{name='Teton Pullover Hoodie', image='https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh02-black_main.jpg', price='$70'}
// omitted for brevity...
Product{name='Helios EverCool&trade; Tee', image='https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/ms05-blue_main.jpg', price='$24'}
Product{name='Zoltan Gym Tee', image='https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/ms06-blue_main.jpg', price='$29'}

Well done! You can now scrape modern dynamic pages with HtmlUnit.

Conclusion

In this HtmlUnit tutorial, you’ve learned the fundamentals of parsing HTML documents. You addressed the basics and explored more advanced scenarios to become a scraping expert.

Now you know:

  • What the HtmlUnit Java library is.
  • How to use it to retrieve data from an HTML document.
  • How to use it in common scraping use cases through real-world examples.
  • The advanced features for JavaScript execution, and more.

No matter how good your web data parser is, anti-bot measures will still catch you. Bypass them all with ZenRows, a web scraping API with headless browser capabilities, IP rotation, and an AI-powered built-in toolkit to avoid any anti-scraping measures. Try ZenRows for free!
