HtmlUnit web scraping is a powerful way to extract content from HTML documents in Java. The tool’s power lies in the DOM exploration API.
In this tutorial, you'll learn the basics of web scraping with HtmlUnit in Java and explore more complex use cases.
You'll see:
- What HtmlUnit is and when to use it.
- How to scrape a static e-commerce page with it, step by step.
- Examples of extracting specific data, such as links, lists, and tables.
- How to handle JavaScript-based pages.
Let's dive in!
What Is HtmlUnit?
HtmlUnit is a GUI-less browser for Java built around the Rhino JavaScript engine. It can visit web pages, parse their HTML documents, and simulate Chrome and Firefox.
Compared to other HTML parsers in Java, e.g., jsoup, HtmlUnit is a more complete solution. It can simulate user interactions, such as clicks and form submissions. Since it doesn't rely on an external browser, it's also more efficient than Selenium.
However, HtmlUnit's ability to execute JavaScript isn't as well-rounded as that of modern browsers. Still, it's a reliable tool for web scraping in Java.
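For example, you can pick which browser HtmlUnit should emulate when creating the client. Here's a minimal sketch; the BrowserVersion constants are part of HtmlUnit's public API, and the printed output is just the emulated browser's description:

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;

public class BrowserChoice {
    public static void main(String[] args) {
        // emulate Firefox instead of the default browser profile;
        // BrowserVersion.CHROME works the same way
        try (WebClient webClient = new WebClient(BrowserVersion.FIREFOX)) {
            System.out.println(webClient.getBrowserVersion());
        }
    }
}
```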
How to Scrape With HtmlUnit
Let’s see how to start web scraping in Java with HtmlUnit. The target site will be ScrapingCourse.com, a static e-commerce site with fake products:
The scraping objective is to extract data from each product on that page. Follow the instructions below and learn how to achieve that goal with HtmlUnit.
Step 1: Install HtmlUnit in Java
Before diving into coding, ensure you have the latest JDK and Gradle or Maven installed locally. In this tutorial, we'll use Maven because it's more popular in the Java ecosystem.
Set up a Java Maven project. Open a terminal in the folder you want to use for your HtmlUnit web scraping project. Then, initialize a Maven project inside it with the following command:
mvn archetype:generate -DgroupId="com.zenrows.scraper" -DartifactId="htmlunit-scraper" -DarchetypeArtifactId="maven-archetype-quickstart" -DarchetypeVersion="1.4" -DinteractiveMode="false"
Wait for Maven to download the required tools and create your project. The htmlunit-scraper directory will now contain a Maven project.
Time to add HtmlUnit to your project's dependencies. Open the pom.xml file created by Maven. Install HtmlUnit by adding the following lines inside the <dependencies> tag:
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>2.70.0</version>
</dependency>
Launch the command below to update the dependencies. It will download, install, and configure the HtmlUnit package:
mvn dependency:resolve
Load the project folder in your Java IDE. IntelliJ IDEA Community Edition or Visual Studio Code with the Java extension will do. Import HtmlUnit by adding these two lines to the App.java file in the nested folders inside /src:
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
App.java, the entry script of your web scraping project, will now contain:
package com.zenrows.scraper;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
public class App {
public static void main( String[] args ) {
System.out.println( "Hello World!" );
}
}
Build your Java application with the following Maven command:
mvn package
You can then execute the .jar file that will appear in the /target folder with:
java -cp target/htmlunit-scraper-1.0-SNAPSHOT.jar com.zenrows.scraper.App
Alternatively, execute the Java application with the run button of your IDE.
That script will produce the output below in the terminal:
Hello World!
You now have a Java Maven project in place. Get ready to turn it into an HtmlUnit scraping application!
Step 2: Initialize a Java Class to Fetch the HTML Code
Open the App.java file and initialize a WebClient object in a try-with-resources as shown below. The WebClient class simulates a browser and exposes a complete web scraping API:
try (final WebClient webClient = new WebClient()) {
// ...
}
This try syntax ensures the webClient resource gets closed at the end of the statement.
The target site doesn't need JavaScript and CSS rendering, so disable them to save resources:
webClient.getOptions().setJavaScriptEnabled(false);
webClient.getOptions().setCssEnabled(false);
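While you're at it, WebClientOptions exposes a few other settings that often come in handy when scraping. The snippet below is just a sketch, and the values are illustrative rather than recommendations:

```java
// fail fast instead of hanging on slow responses (value in milliseconds)
webClient.getOptions().setTimeout(10_000);
// don't throw on 4xx/5xx responses; inspect the status code yourself instead
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
```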
Then, use the getPage() method to load the target page in the HtmlUnit browser:
final HtmlPage page = webClient.getPage("https://scrapingcourse.com/ecommerce/");
If an I/O error occurs, getPage() raises a checked IOException you must handle. Since the try statement doesn't have a catch section, add a throws declaration to the main method:
public static void main(String[] args) throws IOException {
// ....
}
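If you'd rather not propagate the exception from main, the same try-with-resources statement can also take a catch block. A minimal sketch:

```java
try (final WebClient webClient = new WebClient()) {
    final HtmlPage page = webClient.getPage("https://scrapingcourse.com/ecommerce/");
    // scraping logic...
} catch (IOException e) {
    // the request failed: log the error instead of crashing
    System.err.println("Failed to load the page: " + e.getMessage());
}
```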
Don't forget to add the following import at the top of your App.java file:
import java.io.IOException;
Use the following lines to access the raw HTML of the page from the server response and print it in the terminal:
final String html = page.getWebResponse().getContentAsString();
System.out.println(html);
Put it all together, and you'll get:
package com.zenrows.scraper;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import java.io.IOException;
public class App {
public static void main(String[] args) throws IOException {
try (final WebClient webClient = new WebClient()) {
// to disable JS rendering
webClient.getOptions().setJavaScriptEnabled(false);
// to disable CSS rendering
webClient.getOptions().setCssEnabled(false);
// connect to the target page
final HtmlPage page = webClient.getPage("https://scrapingcourse.com/ecommerce/");
// get the HTML source of the page
// and print it in the terminal
final String html = page.getWebResponse().getContentAsString();
System.out.println(html);
}
}
}
Launch the script, and it'll print:
<!doctype html>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="profile" href="https://gmpg.org/xfn/11">
<link rel="pingback" href="https://scrapingcourse.com/ecommerce/xmlrpc.php">
<title>An ecommerce store to scrape – Scraping Laboratory</title>
<meta name='robots' content='max-image-preview:large' />
<link rel='dns-prefetch' href='//stats.wp.com' />
<link rel='dns-prefetch' href='//scrapingcourse.com' />
<link rel='dns-prefetch' href='//fonts.googleapis.com' />
<!-- omitted for brevity... -->
Awesome! The HtmlUnit web scraping script connects to the target page as intended.
However, keep in mind that ScrapingCourse.com is just a scraping sandbox. Real-world sites can use anti-bot systems to stop requests from automated scripts, in which case your scraper's requests would fail.
One way to avoid that is to use a web scraping API, such as ZenRows. The tool bypasses all anti-scraping measures for you, getting the HTML content of any web page. You can then feed it to HtmlUnit or any other Java HTML parsing library and start extracting data from it.
Step 3: Extract a Single Element
The HtmlPage object returned by getPage() has several methods to select HTML nodes on the page:
- getFirstByXPath(): Evaluates the specified XPath expression and returns the first matching HTML element. It returns null if no node matches the specified XPath expression.
- getByXPath(): Evaluates the specified XPath expression and returns the matching HTML elements.
- querySelector(): Returns the first element within the document that matches the specified CSS selector. It returns null if no element matches the given CSS selector.
- querySelectorAll(): Retrieves all HTML nodes matching the specified CSS selector.
Note that HtmlUnit supports both XPath expressions and CSS selectors. If you don't know which element selection language is best for you, read our article on XPath vs CSS selector.
Let's keep things simple and go for CSS selectors. See querySelector() in action to select the single <title> element on the page. Then, print its content with getTextContent():
final HtmlElement titleHTMLElement = page.querySelector("title");
final String title = titleHTMLElement.getTextContent();
System.out.println(title);
The snippet above will produce the following output:
An ecommerce store to scrape - Scraping Laboratory
Great! This is exactly the title of the page that you can also see in your browser.
Extracting the title from a page is a popular task. So, HtmlUnit provides the getTitleText() utility for that:
final String title = page.getTitleText();
System.out.println(title);
These two lines of code produce the same result as the snippet seen earlier.
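The same element can also be reached with an XPath expression through getFirstByXPath(). A quick sketch, reusing the page object created earlier:

```java
// CSS selector
final HtmlElement titleByCss = page.querySelector("title");
// equivalent XPath expression
final HtmlElement titleByXPath = page.getFirstByXPath("//title");
// both variables point to the same <title> node
System.out.println(titleByCss.getTextContent().equals(titleByXPath.getTextContent())); // true
```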
Step 4: Extract Multiple Elements
It's time to build on what you've just learned and extract the name, image, and price data from a product HTML element on the page.
Inspect a product element to figure out the best node selection strategy. Open ScrapingCourse.com in the browser, right-click on a product HTML node, and select "Inspect":
Expand the HTML code and notice that the first product node is a <li> element with a product class. Inside that HTML node, there are:
- An <h2> element with the product name.
- An <img> element containing the product image.
- A <span> element with a price class storing the product price.
Follow the instructions below and learn how to scrape data from those HTML elements.
Select the first product node on the page:
final HtmlElement productHTMLElement = page.querySelector("li.product");
The HtmlElement class exposes the XPath and CSS selector node selection methods as well. Specifically, these operate on the descendants of the current node. Use three querySelector() calls to select the nodes of interest:
final HtmlElement productNameHTMLElement = productHTMLElement.querySelector("h2");
final HtmlElement productImageHTMLElement = productHTMLElement.querySelector("img");
final HtmlElement productPriceHTMLElement = productHTMLElement.querySelector("span.price");
Retrieve the desired data from their text or attributes with getTextContent() and getAttribute():
final String productName = productNameHTMLElement.getTextContent();
final String productImage = productImageHTMLElement.getAttribute("src");
final String productPrice = productPriceHTMLElement.getTextContent();
Print the scraped information with the following System.out.println() instruction:
System.out.println(
"Product{name='" + productName +
"', image='" + productImage +
"', price='" + productPrice +
"'}"
);
Your App.java class will now contain:
package com.zenrows.scraper;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import java.io.IOException;
public class App {
public static void main(String[] args) throws IOException {
try (final WebClient webClient = new WebClient()) {
// to disable JS rendering
webClient.getOptions().setJavaScriptEnabled(false);
// to disable CSS rendering
webClient.getOptions().setCssEnabled(false);
// connect to the target page
final HtmlPage page = webClient.getPage("https://scrapingcourse.com/ecommerce/");
// select the first product node on the page
final HtmlElement productHTMLElement = page.querySelector("li.product");
// select the inner product nodes
// to scrape data from
final HtmlElement productNameHTMLElement = productHTMLElement.querySelector("h2");
final HtmlElement productImageHTMLElement = productHTMLElement.querySelector("img");
final HtmlElement productPriceHTMLElement = productHTMLElement.querySelector("span.price");
// data extraction logic
final String productName = productNameHTMLElement.getTextContent();
final String productImage = productImageHTMLElement.getAttribute("src");
final String productPrice = productPriceHTMLElement.getTextContent();
// log the scraped data
System.out.println(
"Product{name='" + productName +
"', image='" + productImage +
"', price='" + productPrice +
"'}"
);
}
}
}
Run the HtmlUnit web scraping script, and it'll produce this output in the terminal:
Product{name='Abominable Hoodie', image='https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg', price='$69.00'}
Amazing! You’ve just learned how to parse data from a single HTML element.
Step 5: Extract All Matching Elements from a Page
You now know how to collect data from a single product element. However, the destination page contains several products! Let's see how to scrape them all.
Retrieve all product HTML nodes with querySelectorAll(). Next, iterate over the product nodes in a for loop and apply the data extraction logic to each of them:
// select all product nodes on the page
final DomNodeList<DomNode> productHTMLElements = page.querySelectorAll("li.product");
// iterate over all product HTML nodes on the page
// and apply the scraping logic on each of them
for (DomNode productHTMLElement : productHTMLElements) {
// select the inner product nodes
// to scrape data from
final HtmlElement productNameHTMLElement = productHTMLElement.querySelector("h2");
final HtmlElement productImageHTMLElement = productHTMLElement.querySelector("img");
final HtmlElement productPriceHTMLElement = productHTMLElement.querySelector("span.price");
// data extraction logic
final String productName = productNameHTMLElement.getTextContent();
final String productImage = productImageHTMLElement.getAttribute("src");
final String productPrice = productPriceHTMLElement.getTextContent();
// log the scraped data
System.out.println(
"Product{name='" + productName +
"', image='" + productImage +
"', price='" + productPrice +
"'}");
}
Note that querySelectorAll() returns a list of DomNode, and not HtmlElement as it did before. That isn't a problem, as DomNode is the parent class of HtmlElement and exposes the node selection methods.
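If you ever need an element-level method, such as getAttribute(), directly on one of those DomNode results, a cast to HtmlElement does the trick. A minimal sketch:

```java
// DomNode doesn't expose getAttribute(), so cast when you need element-level methods
for (DomNode node : page.querySelectorAll("li.product img")) {
    if (node instanceof HtmlElement) {
        HtmlElement img = (HtmlElement) node;
        System.out.println(img.getAttribute("src"));
    }
}
```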
Integrate the logic into the App.java file and get the final HtmlUnit scraper:
package com.zenrows.scraper;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import java.io.IOException;
public class App {
public static void main(String[] args) throws IOException {
try (final WebClient webClient = new WebClient()) {
// to disable JS rendering
webClient.getOptions().setJavaScriptEnabled(false);
// to disable CSS rendering
webClient.getOptions().setCssEnabled(false);
// connect to the target page
final HtmlPage page = webClient.getPage("https://scrapingcourse.com/ecommerce/");
// select all product nodes on the page
final DomNodeList<DomNode> productHTMLElements = page.querySelectorAll("li.product");
// iterate over all product HTML nodes on the page
// and apply the scraping logic on each of them
for (DomNode productHTMLElement : productHTMLElements) {
// select the inner product nodes
// to scrape data from
final HtmlElement productNameHTMLElement = productHTMLElement.querySelector("h2");
final HtmlElement productImageHTMLElement = productHTMLElement.querySelector("img");
final HtmlElement productPriceHTMLElement = productHTMLElement.querySelector("span.price");
// data extraction logic
final String productName = productNameHTMLElement.getTextContent();
final String productImage = productImageHTMLElement.getAttribute("src");
final String productPrice = productPriceHTMLElement.getTextContent();
// log the scraped data
System.out.println(
"Product{name='" + productName +
"', image='" + productImage +
"', price='" + productPrice +
"'}");
}
}
}
}
Launch the HtmlUnit web scraping script to achieve the following result:
Product{name='Abominable Hoodie', image='https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg', price='$69.00'}
// ...
Product{name='Artemis Running Short', image='https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wsh04-black_main-324x324.jpg', price='$45.00'}
Et voilà! You've learned to use HtmlUnit to extract data from all the products on a single page. To go through each product page on the site and scrape all products, you must build a crawler. Find out how in our guide on Java web crawling.
Still, these aren't all the use cases supported by HtmlUnit. Let’s explore more in the next section!
Examples of Extracting Specific Data
The scraping API offered by HtmlUnit is so powerful that it can cover most scenarios. Here's a table showcasing HtmlUnit's main node selection methods along with their descriptions:
| Method | Description | Example |
| --- | --- | --- |
| getElementByName() | Returns the HTML element with the specified name. | HtmlElement element = page.getElementByName("username"); |
| getElementsByIdAndOrName() | Returns the HTML elements with the given name or id attribute. | List<HtmlElement> elements = page.getElementsByIdAndOrName("password"); |
| getElementsByName() | Returns the HTML elements with the specified name attribute. | List<HtmlElement> elements = page.getElementsByName("subject"); |
| getFocusedElement() | Returns the HTML element currently in focus, or null if no element has the focus. | HtmlElement focusedElement = page.getFocusedElement(); |
| getFormByName() | Returns the first form element that matches the specified name. | HtmlForm form = page.getFormByName("loginForm"); |
| getForms() | Returns a list of all the <form> elements in the page. | List<HtmlForm> forms = page.getForms(); |
| getHead() | Returns the <head> element. | HtmlHead head = page.getHead(); |
| getHtmlElementById() | Returns the HTML element with the specified id attribute. | HtmlElement element = page.getHtmlElementById("product-14"); |
| getFirstByXPath() | Evaluates the specified XPath expression and returns the first matching HTML element. Returns null if no node matches the specified XPath expression. | HtmlElement element = page.getFirstByXPath("//div[@class='article']"); |
| getByXPath() | Evaluates the specified XPath expression and returns the matching HTML elements. | List<HtmlElement> elements = page.getByXPath("//div[@class='article']"); |
| querySelector() | Returns the first element within the document that matches the specified CSS selector. It returns null if no element matches the given CSS selector. | DomNode element = page.querySelector(".article"); |
| querySelectorAll() | Retrieves all HTML nodes matching the specified CSS selector. | DomNodeList<DomNode> elements = page.querySelectorAll(".article"); |
Let's take a look at how to use these methods in common data extraction scenarios through quick examples!
Find Nodes by Text
Sometimes, defining an effective node selection strategy isn't easy. In such scenarios, you need to search for HTML elements based on their internal text.
Suppose you want to find all HTML elements that contain the "Select options" string. You can select them all with a plain XPath expression:
final List<HtmlElement> buttonHTMLElements = page.getByXPath("//*[contains(text(), 'Select options')]");
This instruction returns all lower-level HTML nodes with "Select options" in their text.
You might consider going through all the nodes on the page and keeping only those containing the desired text, but unfortunately, this approach wouldn't work. That's because it would also match top-level nodes such as <body> and <html>.
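To verify what the expression matched, you can iterate over the returned elements and print their tag names and text. A quick sketch, reusing the buttonHTMLElements list from above (the sample output line is only illustrative):

```java
for (HtmlElement element : buttonHTMLElements) {
    // prints something like "a: Select options" for each match
    System.out.println(element.getTagName() + ": " + element.getTextContent().trim());
}
```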
Get All Link URLs
Retrieving the URLs of all the links on a page is essential in web crawling. Find out more in our comparison guide on web crawling vs. web scraping.
Select all <a> nodes with a CSS selector and then extract their URLs with a stream:
final List<String> urls = page
.querySelectorAll("a")
.stream()
.map(a -> ((HtmlElement) a).getAttribute("href"))
.collect(Collectors.toList());
The resulting list will also contain anchor and relative links. Filter them out and get absolute links only by keeping the URLs that start with "https":
final List<String> absoluteUrls = page
.querySelectorAll("a")
.stream()
.map(a -> ((HtmlElement) a).getAttribute("href"))
.filter(url -> url.startsWith("https:"))
.collect(Collectors.toList());
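Instead of discarding relative links, you could also resolve them to absolute URLs against the page's own address using standard java.net classes. The sketch below assumes page.getUrl() returns the URL the page was loaded from and requires the java.net.URL, java.net.MalformedURLException, java.util.Objects, and java.util.stream.Collectors imports:

```java
final URL baseUrl = page.getUrl();
final List<String> resolvedUrls = page
    .querySelectorAll("a")
    .stream()
    .map(a -> ((HtmlElement) a).getAttribute("href"))
    .filter(href -> !href.isEmpty() && !href.startsWith("#"))
    .map(href -> {
        try {
            // resolve relative hrefs (e.g., "/cart") against the page URL
            return new URL(baseUrl, href).toString();
        } catch (MalformedURLException e) {
            return null;
        }
    })
    .filter(Objects::nonNull)
    .collect(Collectors.toList());
```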
Select Elements by Attribute Value
Targeting specific HTML attributes is one of the most effective ways to select nodes on a page. HtmlUnit allows you to accomplish that with three different approaches.
Assume your target page has many card nodes with custom HTML attributes, as shown in the snippet below. Your goal is to select only the cards whose data-category attribute is set to "shoes".
<div class="card" data-category="shoes">Trailblaze Trekking Shoes</div>
<div class="card" data-category="t-shirt">Black XL T-Shirt</div>
<div class="card" data-category="shoes">Urban Hiker Sneakers</div>
<div class="card" data-category="other">Gold Necklace</div>
The most straightforward option is a CSS attribute selector:
final DomNodeList<DomNode> shoesHTMLElements = page.querySelectorAll("[data-category='shoes']");
You can also use an XPath expression:
final List<HtmlElement> shoesHTMLElements = page.getByXPath("//*[@data-category='shoes']");
Otherwise, get all .card nodes and then filter them using the Java 8+ stream filtering capabilities:
final List<HtmlElement> shoesHTMLElements = page
    .querySelectorAll(".card")
    .stream()
    .map(HtmlElement.class::cast)
    .filter(card -> "shoes".equals(card.getAttribute("data-category")))
    .collect(Collectors.toList());
All three code snippets above will select the same nodes.
Retrieve All Elements in a List
The HtmlUnit web scraping API makes it easy to retrieve data from an HTML list. As you’ll see here, it only takes a few lines of code!
Assume your target page has the following bulleted list of books:
<ul class="book-list">
<li>Book 1</li>
<li>Book 2</li>
<li>Book 3</li>
</ul>
The goal is to retrieve the name of each element in the list and store it in a Java List. Achieve that with this snippet:
final List<String> books = page
    .querySelectorAll(".book-list li")
    .stream()
    .map(bookHTMLElement -> bookHTMLElement.getTextContent())
    .collect(Collectors.toList());
Once again, Java stream capabilities are critical to achieving the goal. books will contain the following values:
["Book 1", "Book 2", "Book 3"]
Scrape Data From a Table
Scraping tables is one of the most common scenarios in web scraping with HtmlUnit. Consider the HTML code of the <table> below:
<table class="seasons">
<thead>
<tr>
<th>#</th>
<th>Title</th>
<th>Episodes</th>
<th>Airdates</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Season 1: Prelude</td>
<td>28</td>
<td>2024-06-01 - 2024-31-03</td>
</tr>
<tr>
<td>2</td>
<td>Season 2: A New Mystery</td>
<td>24</td>
<td>2025-03-02 - 2025-09-08</td>
</tr>
<!-- Other rows... -->
</tbody>
</table>
Scraping all data from this table requires a few steps:
- Define a custom class Episode that matches the values in the table columns.
- Initialize a list of Episode objects to store each entry in the table.
- Use the .seasons tr CSS selector in HtmlUnit to select each row in the table.
- Iterate over each row HTML node.
- Instantiate a new Episode instance by collecting data from the <td> elements.
- Add the Episode instance to the list of scraped data.
The snippet below uses the above algorithm to extract data from a table with HtmlUnit:
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;

import java.util.ArrayList;
import java.util.List;

public class HtmlUnitExample {
    static class Episode {
        private String number;
        private String title;
        private Integer episodes;
        private String airdate;

        // getters and setters omitted for brevity...
    }

    public static void main(String[] args) throws Exception {
        try (final WebClient webClient = new WebClient()) {
            // navigate to the target page
            final HtmlPage page = webClient.getPage("https://YOUR-DOMAIN.com/season-table");

            // where to store the scraped data
            List<Episode> episodes = new ArrayList<>();

            // select all rows in the table
            DomNodeList<DomNode> rowHTMLElements = page.querySelectorAll(".seasons tr");

            // iterate over each row in the table
            for (DomNode rowHTMLElement : rowHTMLElements) {
                // select all cells in the row
                DomNodeList<DomNode> cellHTMLElements = rowHTMLElement.querySelectorAll("td");

                if (cellHTMLElements.size() >= 4) {
                    // create a new Episode object
                    // and populate it with the row data
                    Episode episode = new Episode();
                    episode.setNumber(cellHTMLElements.get(0).getTextContent());
                    episode.setTitle(cellHTMLElements.get(1).getTextContent());
                    episode.setEpisodes(Integer.parseInt(cellHTMLElements.get(2).getTextContent()));
                    episode.setAirdate(cellHTMLElements.get(3).getTextContent());

                    // add the new episode to the list
                    episodes.add(episode);
                }
            }
        }
    }
}
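For completeness, the Episode class referenced above is just a plain data holder. Its getters and setters could look like this (standard POJO boilerplate, nothing HtmlUnit-specific):

```java
static class Episode {
    private String number;
    private String title;
    private Integer episodes;
    private String airdate;

    public String getNumber() { return number; }
    public void setNumber(String number) { this.number = number; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public Integer getEpisodes() { return episodes; }
    public void setEpisodes(Integer episodes) { this.episodes = episodes; }

    public String getAirdate() { return airdate; }
    public void setAirdate(String airdate) { this.airdate = airdate; }
}
```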
Handle JavaScript-Based Pages
Some web pages use JavaScript for rendering or to load data dynamically, such as the infinite scrolling demo page shown below. The page makes new AJAX calls to retrieve more products as the user scrolls down:
Your new scraping goal is to extract data from the first 50 products on this dynamic content page.
Enable CSS and JS rendering and connect to the target page in Chrome mode:
package com.zenrows.scraper;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import java.io.IOException;
public class App {
public static void main(String[] args) throws IOException {
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
// enable JS rendering
webClient.getOptions().setJavaScriptEnabled(true);
// enable CSS rendering
webClient.getOptions().setCssEnabled(true);
// connect to the target page
final HtmlPage page = webClient.getPage("https://scrapingcourse.com/infinite-scrolling");
}
}
}
You'll see a lot of warnings in the terminal:
WARNING: CSS error: 'https://scrapingcourse.com/build/assets/app-OVTGD5UP.css' [1:14363] Error in expression. (Invalid token "var(". Was expecting one of: <S>, <NUMBER>, "-", <PLUS>, <PERCENTAGE>.)
WARNING: CSS error: 'https://scrapingcourse.com/build/assets/app-OVTGD5UP.css' [1:14518] Error in expression. (Invalid token "var(". Was expecting one of: <S>, <NUMBER>, "-", <PLUS>, <PERCENTAGE>.)
WARNING: CSS error: 'https://scrapingcourse.com/build/assets/app-OVTGD5UP.css' [1:15070] Error in expression. (Invalid token "0". Was expecting one of: <S>, ")", <COMMA>.)
The problem is that the CSS and JavaScript engine in HtmlUnit hasn't received significant updates in years. On some pages, you'll get a ScriptException that directly crashes your script. Silencing these errors won't help, as HtmlUnit still won't be able to load the page properly.
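If you still want to run HtmlUnit against such pages, you can at least keep the noise out of the terminal and stop script errors from killing the run. The options below exist in HtmlUnit, but they only hide the symptoms; the page may still render incompletely. The logging line assumes HtmlUnit's logging is routed to java.util.logging, which is the default when no other logging backend is on the classpath:

```java
// don't throw when the page's JavaScript raises an error
webClient.getOptions().setThrowExceptionOnScriptError(false);
// swallow CSS parsing errors instead of logging each one
webClient.setCssErrorHandler(new SilentCssErrorHandler());
// silence HtmlUnit's logger entirely
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit")
    .setLevel(java.util.logging.Level.OFF);
```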
The real solution for scraping dynamic pages in Java is to use a headless browser. Selenium is the most popular browser automation tool used to simulate user interaction. Install it by adding the selenium-java package to your pom.xml:
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-java</artifactId>
<version>4.19.1</version>
</dependency>
Update the project's dependencies with:
mvn dependency:resolve
Next, import Selenium into your script. Use it to instruct headless Chrome to connect to the target page:
package com.zenrows.scraper;
import org.openqa.selenium.*;
import org.openqa.selenium.chrome.*;
import org.openqa.selenium.support.ui.*;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import java.io.IOException;
import java.time.Duration;
public class App {
public static void main(String[] args) throws IOException {
// define the options to run Chrome in headless mode
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless");
// initialize a Selenium instance
// to run Chrome in headless mode
WebDriver driver = new ChromeDriver(options);
// connect to the target web page
driver.get("https://scrapingcourse.com/infinite-scrolling");
// interaction and scraping logic...
// release the web driver resources
driver.quit();
}
}
Selenium doesn't provide a dedicated method for infinite scrolling. Thus, you need to simulate the interaction with a custom JS script:
const delay = ms => new Promise(res => setTimeout(res, ms));
const scrolls = 20;
let scrollCount = 0;
const scrollInterval = setInterval(async () => {
window.scrollTo(0, 0);
await delay(1000);
window.scrollTo(0, document.body.scrollHeight);
scrollCount++;
if (scrollCount === scrolls) {
clearInterval(scrollInterval);
}
}, 1500);
Execute it with executeScript() and wait for the last product to be on the page before proceeding:
// script to perform the scroll interaction
String jsScrollingScript = "const delay = ms => new Promise(res => setTimeout(res, ms));\n" +
"const scrolls = 20;\n" +
"let scrollCount = 0;\n" +
"\n" +
"const scrollInterval = setInterval(async () => {\n" +
" window.scrollTo(0, 0);\n" +
" await delay(1000);\n"+
" window.scrollTo(0, document.body.scrollHeight);\n" +
" scrollCount++;\n" +
"\n" +
" if (scrollCount === scrolls) {\n" +
" clearInterval(scrollInterval);\n" +
" }\n" +
"}, 1500);";
// execute the JavaScript script on the page
((JavascriptExecutor) driver).executeScript(jsScrollingScript);
// wait for the 50th product to be visible
WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(60));
wait.until(ExpectedConditions.presenceOfElementLocated(
By.cssSelector("#product-container > .flex.flex-col.items-center.rounded-lg:nth-child(50)")));
Now, you have two choices. You can implement the scraping logic directly with Selenium. Or, you can parse the actual HTML of the page with HtmlUnit. Let's follow the second option:
// extract the page HTML
String html = driver.getPageSource();
// parse the HTML of the current page with HtmlUnit
try (WebClient webClient = new WebClient()) {
// disable CSS and JS rendering
webClient.getOptions().setJavaScriptEnabled(false);
webClient.getOptions().setCssEnabled(false);
// inject the current HTML into
// the HtmlUnit browser window
HtmlPage page = webClient.loadHtmlCodeIntoCurrentWindow(html);
}
Next, apply the scraping logic in HtmlUnit:
// select all product nodes on the page
final DomNodeList<DomNode> productHTMLElements = page
.querySelectorAll("#product-container > .flex.flex-col.items-center.rounded-lg");
// iterate over all product HTML nodes on the page
// and apply the scraping logic on each of them
for (DomNode productHTMLElement : productHTMLElements) {
// select the inner product nodes
// to scrape data from
final HtmlElement productNameHTMLElement = productHTMLElement.querySelector("span");
final HtmlElement productImageHTMLElement = productHTMLElement.querySelector("img");
final HtmlElement productPriceHTMLElement = productHTMLElement.querySelector("span.text-slate-600");
// data extraction logic
final String productName = productNameHTMLElement.getTextContent();
final String productImage = productImageHTMLElement.getAttribute("src");
final String productPrice = productPriceHTMLElement.getTextContent();
// log the scraped data
System.out.println(
"Product{name='" + productName +
"', image='" + productImage +
"', price='" + productPrice +
"'}");
}
This is what the complete Selenium HtmlUnit web scraping script will look like:
package com.zenrows.scraper;
import org.openqa.selenium.*;
import org.openqa.selenium.chrome.*;
import org.openqa.selenium.support.ui.*;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import java.io.IOException;
import java.time.Duration;
public class App {
public static void main(String[] args) throws IOException {
// define the options to run Chrome in headless mode
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless");
// initializing a Selenium instance
// to run Chrome in headless mode
WebDriver driver = new ChromeDriver(options);
// connect to the target web page
driver.get("https://scrapingcourse.com/infinite-scrolling");
// script to perform the scroll interaction
String jsScrollingScript = "const delay = ms => new Promise(res => setTimeout(res, ms));\n" +
"const scrolls = 20;\n" +
"let scrollCount = 0;\n" +
"\n" +
"const scrollInterval = setInterval(async () => {\n" +
" window.scrollTo(0, 0);\n" +
" await delay(1000);\n" +
" window.scrollTo(0, document.body.scrollHeight);\n" +
" scrollCount++;\n" +
"\n" +
" if (scrollCount === scrolls) {\n" +
" clearInterval(scrollInterval);\n" +
" }\n" +
"}, 1500);";
// execute the JavaScript script on the page
((JavascriptExecutor) driver).executeScript(jsScrollingScript);
// wait for the 50th product to be visible
WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(60));
wait.until(ExpectedConditions.presenceOfElementLocated(
By.cssSelector("#product-container > .flex.flex-col.items-center.rounded-lg:nth-child(50)")));
// extract the page HTML
String html = driver.getPageSource();
// parse the HTML of the current page with HtmlUnit
try (WebClient webClient = new WebClient()) {
// disable CSS and JS rendering
webClient.getOptions().setJavaScriptEnabled(false);
webClient.getOptions().setCssEnabled(false);
// inject the current HTML into
// the HtmlUnit browser window
HtmlPage page = webClient.loadHtmlCodeIntoCurrentWindow(html);
// select all product nodes on the page
final DomNodeList<DomNode> productHTMLElements = page
.querySelectorAll("#product-container > .flex.flex-col.items-center.rounded-lg");
// iterate over all product HTML nodes on the page
// and apply the scraping logic on each of them
for (DomNode productHTMLElement : productHTMLElements) {
// select the inner product nodes
// to scrape data from
final HtmlElement productNameHTMLElement = productHTMLElement.querySelector("span");
final HtmlElement productImageHTMLElement = productHTMLElement.querySelector("img");
final HtmlElement productPriceHTMLElement = productHTMLElement.querySelector("span.text-slate-600");
// data extraction logic
final String productName = productNameHTMLElement.getTextContent();
final String productImage = productImageHTMLElement.getAttribute("src");
final String productPrice = productPriceHTMLElement.getTextContent();
// log the scraped data
System.out.println(
"Product{name='" + productName +
"', image='" + productImage +
"', price='" + productPrice +
"'}");
}
}
// release the web driver resources
driver.quit();
}
}
Launch it, and it'll print over 50 products:
Product{name='Chaz Kangeroo Hoodie', image='https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh01-gray_main.jpg', price='$52'}
Product{name='Teton Pullover Hoodie', image='https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh02-black_main.jpg', price='$70'}
// omitted for brevity...
Product{name='Helios EverCool™ Tee', image='https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/ms05-blue_main.jpg', price='$24'}
Product{name='Zoltan Gym Tee', image='https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/ms06-blue_main.jpg', price='$29'}
Well done! You can now scrape modern dynamic pages with HtmlUnit.
Conclusion
In this HtmlUnit tutorial, you’ve learned the fundamentals of parsing HTML documents. You addressed the basics and explored more advanced scenarios to become a scraping expert.
Now you know:
- What the HtmlUnit Java library is.
- How to use it to retrieve data from an HTML document.
- How to use it in common scraping use cases through real-world examples.
- The advanced features for JavaScript execution, and more.
No matter how good your web data parser is, anti-bot measures will still catch you. Bypass them all with ZenRows, a web scraping API with headless browser capabilities, IP rotation, and an AI-powered built-in toolkit to avoid any anti-scraping measures. Try ZenRows for free!