In this Java web scraping tutorial, you'll learn everything you need to know to scrape the web with Java, from the basics to the most advanced techniques.
Let's not waste any more time! You'll build a Java web scraper that can crawl an entire website and automatically extract data from it. Cool, isn't it?
Can You Web Scrape With Java?
The short answer is "Yes, you can!"
Java is one of the most reliable and widely used object-oriented programming languages, and it comes with a vast ecosystem of libraries. That means there are several Java web scraping libraries to choose from.
Two examples are Jsoup and Selenium. These libraries let you connect to a web page and provide many functions to help you extract the data you're interested in. In this tutorial, you'll learn how to use both.
How Do You Scrape a Page in Java?
You can scrape a web page in Java just as you would in any other programming language. You need a Java web scraping library that lets you visit a web page, retrieve HTML elements, and extract data from them.
You can easily install a Java web scraping library with Maven or Gradle, the two most popular Java dependency management tools. Follow the steps below to learn how to do web scraping in Java.
Getting Started
Before starting to build your Java web scraper, you need to meet the following requirements:
- Java LTS 8+: any LTS (Long-Term Support) version of Java greater than or equal to 8 will do. This tutorial uses Java 21, the latest LTS version at the time of writing.
- Gradle or Maven: choose one of the two build automation tools. You'll need one of them for its dependency management features to install your Java web scraping library.
- A Java IDE: any IDE that supports Java and can integrate with Maven and Gradle will do. IntelliJ IDEA is one of the best options available.
If you don't meet these prerequisites, follow the links above to download and install Java, Gradle or Maven, and a Java IDE. If you run into problems, follow the official installation guides. Then, verify that everything went as expected with the following terminal command:
java -version
This should return something like this:
openjdk version "21" 2023-09-19 LTS
OpenJDK Runtime Environment Temurin-21+35 (build 21+35-LTS)
OpenJDK 64-Bit Server VM Temurin-21+35 (build 21+35-LTS, mixed mode, sharing)
That's the version info for the Java installation on your machine.
Then, if you're a Gradle user, type in your terminal:
gradle -v
Again, this should return the version of Gradle you installed, as follows:
------------------------------------------------------------
Gradle 8.4
------------------------------------------------------------
Build time: 2023-10-04 20:52:13 UTC
Revision: e9251e572c9bd1d01e503a0dfdf43aedaeecdc3f
Kotlin: 1.9.10
Groovy: 3.0.17
Ant: Apache Ant(TM) version 1.10.13 compiled on January 4 2023
JVM: 21 (Eclipse Adoptium 21+35-LTS)
OS: Linux 6.2.0-34-generic amd64
Or, if you're a Maven user, launch the command below:
mvn -v
If the Maven installation process worked as expected, this should return something like this:
Apache Maven 3.9.5 (57804ffe001d7215b5e7bcb531cf83df38f93546)
You're now ready to follow this step-by-step Java web scraping tutorial. The target will be ScrapingCourse.com, a demo website built for web scraping practice with real ecommerce features.
Have a look at the target product page.

Note that the ScrapingCourse.com e-commerce website is just a simple paginated list of different products. The goal of the Java web scraper will be to crawl the entire website and retrieve all product data.

In your IDE, create a new Java project and select Maven or Gradle based on the build automation tool you installed or prefer to use. Then, click "Create" to initialize your Java project. Wait for the setup process to end, and you'll have access to the following Java web scraping project:

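A freshly initialized project follows the standard Maven/Gradle layout. It should look roughly like this, where the root folder name is arbitrary and com.zenrows is the base package used throughout this tutorial:

java-web-scraper/
├── build.gradle (or pom.xml)
└── src/
    └── main/
        └── java/
            └── com/
                └── zenrows/
                    └── Scraper.java (the class where you'll write the scraping logic)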
Let's now learn the basics of web scraping using Java!
Basic Web Scraping in Java
The first thing you need to learn is how to scrape a static website in Java. You can think of a static website as a collection of pre-built HTML documents, each with its own CSS and JavaScript files. A static website relies on server-side rendering.
In a static web page, the content is embedded in the HTML document returned by the server. So, you don't need a web browser to extract data from it. In detail, static web scraping involves:
- Downloading a web page.
- Parsing the HTML document retrieved from the server.
- Selecting the HTML elements containing the data of interest.
- Extracting the data from them.
Scraping web pages in Java isn't difficult, especially when it comes to static websites. Let's get started.
Step #1: Install Jsoup
First, you need a Java web scraping library. Jsoup is a Java library that makes web scraping easy. It comes with an advanced web scraping API that lets you connect to a web page via its URL, select HTML elements with CSS selectors, and extract data from them.
In other words, Jsoup offers almost everything you need to perform static web scraping with Java. If you're a Gradle user, add jsoup to the dependencies section of your build.gradle file:
implementation "org.jsoup:jsoup:1.16.1"
For Maven users, add the following to the dependencies section of your pom.xml file:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.16.1</version>
</dependency>
Then, if you're an IntelliJ user, don't forget to click on the Gradle/Maven reload button below to install the new dependencies:

Jsoup is now installed and ready to use. Import it in your Scraper.java file as follows:
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

import java.io.IOException;
import java.util.*;
Let's write your first web scraping Java script!
Step #2: Connect to your target website
You can use Jsoup to connect to a website using its URL with the following lines:
// initializing the HTML Document page variable
Document doc;
try {
    // fetching the target website
    doc = Jsoup.connect("https://www.scrapingcourse.com/ecommerce/").get();
} catch (IOException e) {
    throw new RuntimeException(e);
}
This snippet uses the Jsoup connect() method to create a connection to the target website. Note that if the request fails, Jsoup throws an IOException, which is why you need the try ... catch logic. The get() method then performs the GET request and returns a Jsoup HTML Document object you can use to explore the DOM.
Keep in mind that many websites automatically block requests that lack the expected HTTP headers. This is one of the most basic anti-scraping measures, and you can often avoid being blocked simply by setting these headers manually.
In general, the most important header you should always set is the User-Agent header. This is a string that helps the server identify the application, operating system, and vendor the HTTP request comes from.
You can set the User-Agent header and other HTTP headers in Jsoup as follows:
doc = Jsoup
    .connect("https://www.scrapingcourse.com/ecommerce/")
    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")
    .header("Accept-Language", "*")
    .get();
Specifically, you can set the User-Agent header with the Jsoup userAgent() method. Similarly, you can specify any other HTTP header with header().
Step #3: Select the HTML elements of interest
Open your target web page in the browser and identify the HTML elements of interest. In this case, you want to scrape all product HTML elements. Right-click on a product HTML element and select the "Inspect" option. This should open the DevTools window below:

As you can see, a product is a li.product HTML element. This includes:
- An a HTML element: contains the URL associated with the product.
- An img HTML element: contains the product image.
- An h2 HTML element: contains the product name.
- A span HTML element: contains the product price.
Now, let's learn how to extract data from a product HTML element with web scraping in Java.
Step #4: Extract data from the HTML elements
First, you need a Java object in which to store the scraped data. Create a data folder in the main package and define a Product.java class as follows:
package com.zenrows.data;

public class Product {
    private String url;
    private String image;
    private String name;
    private String price;

    // getters and setters omitted for brevity...

    @Override
    public String toString() {
        return "{ \"url\":\"" + url + "\", "
                + " \"image\": \"" + image + "\", "
                + "\"name\":\"" + name + "\", "
                + "\"price\": \"" + price + "\" }";
    }
}
Note that the toString() method produces a string in JSON format. This will come in handy later.
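The getters and setters are standard JavaBean boilerplate. In case you want the class to compile as-is, here's what the omitted part looks like:

public String getUrl() { return url; }
public void setUrl(String url) { this.url = url; }

public String getImage() { return image; }
public void setImage(String image) { this.image = image; }

public String getName() { return name; }
public void setName(String name) { this.name = name; }

public String getPrice() { return price; }
public void setPrice(String price) { this.price = price; }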
Now, let's retrieve the list of li.product HTML elements on the target web page. You can achieve this with Jsoup as below:
Elements productElements = doc.select("li.product");
What this snippet does is simple. The Jsoup select() method applies the given CSS selector to retrieve all li.product elements on the web page. Note that Elements extends ArrayList, so you can easily iterate over it.
Thus, you can iterate over productElements to extract the info of interest and store it in Product objects:
// initializing the list of Java objects to store
// the scraped data
List<Product> products = new ArrayList<>();

// retrieving the list of product HTML elements
Elements productElements = doc.select("li.product");

// iterating over the list of HTML products
for (Element productElement : productElements) {
    Product product = new Product();

    // extracting the data of interest from the product HTML element
    // and storing it in product
    product.setUrl(productElement.selectFirst("a").attr("href"));
    product.setImage(productElement.selectFirst("img").attr("src"));
    product.setName(productElement.selectFirst("h2").text());
    product.setPrice(productElement.selectFirst("span").text());

    // adding product to the list of the scraped products
    products.add(product);
}
This logic uses the Java web scraping API offered by Jsoup to extract all the data of interest from each product HTML element. Then, it initializes a Product with this data and adds it to the list of scraped products.
Congrats! You just learned how to scrape data from a web page with Jsoup. Let's now convert this data into a more useful format.
Step #5: Export the data to JSON
Don't forget that the toString() method of Product returns a JSON string. So, you can simply print the List<Product> object, which calls toString() on it under the hood:

System.out.println(products);
Calling toString() on an ArrayList calls the toString() method on each element of the list and embeds the result in square brackets.
In other words, this will produce the following JSON data:
[
    {
        "url": "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/",
        "image": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg",
        "name": "Abominable Hoodie",
        "price": "$69.00"
    },
    // ...
    {
        "url": "https://www.scrapingcourse.com/ecommerce/product/artemis-running-short/",
        "image": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wsh04-black_main-324x324.jpg",
        "name": "Artemis Running Short",
        "price": "$49.00"
    }
]
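If you also want to persist the output, here's a minimal sketch that writes the JSON string to a file (the products.json file name is an arbitrary choice):

// exporting the scraped data to a JSON file
try (java.io.FileWriter writer = new java.io.FileWriter("products.json")) {
    writer.write(products.toString());
} catch (IOException e) {
    System.err.println("could not write the output file: " + e.getMessage());
}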
Et voilà! You just performed web scraping using Java! Yet, the website consists of several web pages. Let's see how to scrape them all.
Web Crawling in Java
Let's now retrieve the list of all pagination links to scrape the entire website. This is what web crawling is about. Right-click on the pagination number HTML element and choose the "Inspect" option.

The browser should open the DevTools section and highlight the selected DOM element, as below:

From here, note that you can extract all the pagination number HTML elements with the a.page-numbers CSS selector. These elements contain the links you want to scrape. You can retrieve them all with Jsoup as below:

Elements paginationElements = doc.select("a.page-numbers");
To scrape all pages, modify the previous scraper to continuously follow the next-page element by introducing a while loop. The while loop ensures that the scrapeProducts function keeps running until there are no more next-page elements in the HTML.
// ...

public class Scraper {
    // ...

    public static List<Product> scrapeProducts(String url) {
        // ...

        while (url != null) {
            try {
                // ...

                Element nextButton = doc.selectFirst("a.next");
                if (nextButton != null) {
                    String nextPageUrl = nextButton.attr("href");
                    if (!nextPageUrl.startsWith("http")) {
                        nextPageUrl = url + nextPageUrl.replaceFirst("^/", "");
                    }
                    url = nextPageUrl; // update URL for next iteration
                } else {
                    url = null; // no more pages, exit loop
                }
            } catch (IOException e) {
                // ...error handling
                break; // stop on error
            }
        }

        return products;
    }
}
Update the Scraper class with the above changes. Here's the complete code:
package com.zenrows;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.zenrows.data.Product;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Scraper {
    private static final String URL = "https://www.scrapingcourse.com/ecommerce/";

    public static void main(String[] args) {
        List<Product> products = scrapeProducts(URL);

        // print all scraped products
        System.out.println(products.toString());
    }

    public static List<Product> scrapeProducts(String url) {
        List<Product> products = new ArrayList<>();

        while (url != null) {
            try {
                // connect to the website and retrieve the HTML document
                Document doc = Jsoup.connect(url).get();

                // select the list of product elements
                Elements productElements = doc.select("li.product");

                // iterate over each product element
                for (Element productElement : productElements) {
                    Product product = new Product();

                    // extracting product details safely
                    Element linkElement = productElement.selectFirst(".woocommerce-LoopProduct-link");
                    Element imgElement = productElement.selectFirst(".product-image");
                    Element nameElement = productElement.selectFirst(".product-name");
                    Element priceElement = productElement.selectFirst(".price");

                    product.setUrl(linkElement != null ? linkElement.attr("href") : "N/A");
                    product.setImage(imgElement != null ? imgElement.attr("src") : "N/A");
                    product.setName(nameElement != null ? nameElement.text() : "N/A");
                    product.setPrice(priceElement != null ? priceElement.text() : "N/A");

                    // add the product to the list
                    products.add(product);
                }

                Element nextButton = doc.selectFirst("a.next");
                if (nextButton != null) {
                    String nextPageUrl = nextButton.attr("href");
                    if (!nextPageUrl.startsWith("http")) {
                        nextPageUrl = url + nextPageUrl.replaceFirst("^/", "");
                    }
                    url = nextPageUrl; // update URL for next iteration
                } else {
                    url = null; // no more pages, exit loop
                }
            } catch (IOException e) {
                System.err.println("Error fetching page: " + e.getMessage());
                break; // stop on error
            }
        }

        return products;
    }
}
If you're an IntelliJ IDEA user, click the run icon to launch the Java web scraping example. The process will take a few seconds to complete. At the end, products will contain all 188 products.
Congratulations! You just extracted all the product data automatically!
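If you prefer the terminal over your IDE, you can also launch the scraper from the command line. For example, assuming the Gradle application plugin is configured with com.zenrows.Scraper as the main class, or that you rely on the Maven exec plugin:

# with Gradle (requires the application plugin with mainClass set to com.zenrows.Scraper)
gradle run

# with Maven (uses the exec plugin; pass the main class explicitly)
mvn compile exec:java -Dexec.mainClass="com.zenrows.Scraper"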
Parallel Web Crawling in Java
Web crawling in Java can become a time-consuming process, especially if your target website consists of many web pages or the server takes time to respond. Also, Java isn't known for being the fastest programming language.
At the same time, Java 8 introduced a lot of features to make parallelism easier. So, transforming your Java web scraper to work in parallel takes only a few updates. Let's see how you can perform parallel web scraping in Java:
The following updated code implements parallel scraping in Java. Pay attention to the new scrapeProductsConcurrently function, which implements concurrency with Java's ExecutorService. scrapeProducts retains the previous scraping logic but now returns the absolute URL of the next page:
package com.zenrows;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.zenrows.data.Product;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class Scraper {
    private static final String URL = "https://www.scrapingcourse.com/ecommerce/";
    private static final int THREAD_POOL_SIZE = 5; // number of parallel threads

    public static void main(String[] args) {
        List<Product> products = scrapeProductsConcurrently(URL);
        System.out.println(products);
    }

    public static List<Product> scrapeProductsConcurrently(String startUrl) {
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_POOL_SIZE);
        List<Product> products = new CopyOnWriteArrayList<>();
        List<Future<String>> futures = new ArrayList<>();
        ConcurrentSkipListSet<String> visitedPages = new ConcurrentSkipListSet<>();

        // add the first page to the task queue
        visitedPages.add(startUrl);
        futures.add(executor.submit(() -> scrapeProducts(startUrl, products, visitedPages)));

        while (!futures.isEmpty()) {
            List<Future<String>> newFutures = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    String nextUrl = future.get(); // wait for page scrape to complete
                    if (nextUrl != null && visitedPages.add(nextUrl)) {
                        newFutures.add(executor.submit(() -> scrapeProducts(nextUrl, products, visitedPages)));
                    }
                } catch (InterruptedException | ExecutionException e) {
                    System.err.println("error executing task: " + e.getMessage());
                }
            }
            futures = newFutures;
        }

        executor.shutdown();
        try {
            executor.awaitTermination(60, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            System.err.println("executor shutdown interrupted: " + e.getMessage());
        }

        return products;
    }

    private static String scrapeProducts(String url, List<Product> products, ConcurrentSkipListSet<String> visitedPages) {
        try {
            // connect to the website and retrieve the HTML document
            Document doc = Jsoup.connect(url).get();

            // select the list of product elements
            Elements productElements = doc.select("li.product");

            // iterate over each product element
            for (Element productElement : productElements) {
                Product product = new Product();

                // extracting product details safely
                Element linkElement = productElement.selectFirst(".woocommerce-LoopProduct-link");
                Element imgElement = productElement.selectFirst(".product-image");
                Element nameElement = productElement.selectFirst(".product-name");
                Element priceElement = productElement.selectFirst(".price");

                product.setUrl(linkElement != null ? linkElement.absUrl("href") : "N/A");
                product.setImage(imgElement != null ? imgElement.absUrl("src") : "N/A");
                product.setName(nameElement != null ? nameElement.text() : "N/A");
                product.setPrice(priceElement != null ? priceElement.text() : "N/A");

                // add the product to the list
                products.add(product);
            }

            // look for the "next" button and return its absolute URL
            Element nextButton = doc.selectFirst("a.next");
            return (nextButton != null) ? nextButton.absUrl("href") : null;
        } catch (IOException e) {
            System.err.println("error fetching page: " + e.getMessage());
        }
        return null;
    }
}
Keep in mind that ArrayList and HashSet are not thread-safe in Java. For thread-safe operations in concurrent environments, it's better to use CopyOnWriteArrayList rather than Collections.synchronizedList(). Then, you can use ExecutorService to run and manage tasks asynchronously.
You don't want to overload the target server or your local machine. Unlike adding manual delays with sleep(), limiting the thread pool size ensures a balanced workload while preventing excessive requests that might trigger anti-bot mechanisms or cause a denial-of-service (DoS) effect.
Remember to shut down your ExecutorService properly to free up resources. Since some tasks may still be running when the loop exits, use shutdown() to request termination. If you need to ensure all tasks finish before proceeding, call awaitTermination(), which blocks until all threads complete or the specified time limit expires.
Run this Java web scraping script, and you'll experience a noticeable performance boost compared to a sequential approach. You've just learned how to scrape multiple pages concurrently in Java!
Well done! You now know how to perform parallel web scraping efficiently. But there's still more to explore! 🚀
Scraping Dynamic Content Websites in Java
Don't forget that a web page is more than its corresponding HTML document. Web pages can perform HTTP requests in the browser via AJAX. This mechanism allows web pages to retrieve data asynchronously and update the content shown to the user accordingly.
Most websites now rely on frontend API requests to retrieve data. These requests are AJAX calls. So, these API calls provide valuable data you can't ignore when it comes to web scraping. You can sniff these calls, replicate them in your scraping script, and retrieve this data.
To sniff an AJAX call, use your browser's DevTools. Right-click on the web page, choose "Inspect", and select the "Network" tab. In the "Fetch/XHR" section, you'll find the list of AJAX calls the web page executed, as below.

Here, you can retrieve all the info you need to replicate these calls in your web scraping script. Yet, this isn't the best approach.
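For completeness, here's a minimal sketch of how you could replicate a sniffed call with Java's built-in HttpClient (available since Java 11). The endpoint URL and headers below are hypothetical placeholders you'd copy from the DevTools "Network" tab:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ApiScraper {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // replicate the sniffed AJAX call (URL and headers copied from DevTools)
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.example.com/api/products?page=1")) // hypothetical endpoint
                .header("Accept", "application/json")
                .GET()
                .build();

        // send the request and print the raw JSON body
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}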
Web Scraping With a Headless Browser
Web pages perform most of the AJAX calls in response to user interaction. This is why you need a tool to load a web page in a browser and replicate user interaction. This is what a headless browser is about.
In detail, a headless browser is a web browser without a GUI that lets you programmatically control a web page, instructing the browser to perform tasks for you.
Thanks to a headless browser, you can interact with a web page through JavaScript as a human being would. One of the most popular libraries in Java offering headless browser functionality is Selenium WebDriver.
Note that ZenRows API comes with headless browser capabilities. Learn more about how to extract dynamically loaded data.
If you use Gradle, add selenium-java with the line below in the dependencies section of your build.gradle file:
implementation "org.seleniumhq.selenium:selenium-java:4.14.1"
Otherwise, if you use Maven, insert the following lines in your pom.xml file:
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.14.1</version>
</dependency>
Make sure to install the new dependency by running the dependency update command from the terminal or reloading the project in your IDE: if you use Gradle, check the dependencies section of your build.gradle file; if you use Maven, check your pom.xml file. Also, ensure you have a recent version of Chrome installed. You no longer need to set up a WebDriver executable manually, as recent Selenium 4 releases handle driver management automatically. You're now ready to start using Selenium.
You can replicate the web scraping logic seen above on a single page with the following script:
package com.zenrows;

import org.openqa.selenium.*;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.util.*;

import com.zenrows.data.Product;

public class Scraper {
    public static void main(String[] args) {
        // defining the options to run Chrome in headless mode
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");

        // initializing a Selenium WebDriver ChromeDriver instance
        // to run Chrome in headless mode
        WebDriver driver = new ChromeDriver(options);

        // connecting to the target web page
        driver.get("https://www.scrapingcourse.com/ecommerce/");

        // initializing the list of Java objects to store
        // the scraped data
        List<Product> products = new ArrayList<>();

        // retrieving the list of product HTML elements
        List<WebElement> productElements = driver.findElements(By.cssSelector("li.product"));

        // iterating over the list of HTML products
        for (WebElement productElement : productElements) {
            Product product = new Product();

            // extracting the data of interest from the product HTML element
            // and storing it in product
            product.setUrl(productElement.findElement(By.tagName("a")).getAttribute("href"));
            product.setImage(productElement.findElement(By.tagName("img")).getAttribute("src"));
            product.setName(productElement.findElement(By.tagName("h2")).getText());
            product.setPrice(productElement.findElement(By.tagName("span")).getText());

            // adding product to the list of the scraped products
            products.add(product);
        }

        // ...

        driver.quit();
    }
}
As you can see, the web scraping logic isn't that different from that seen before. What truly changes is that Selenium runs the web scraping logic in the browser. This means that Selenium has access to all features offered by a browser.
For example, you can click on a pagination element to directly navigate to a new page as below:
WebElement paginationElement = driver.findElement(By.cssSelector("a.page-numbers"));
// navigating to a new web page
paginationElement.click();
// wait for the page to load...
System.out.println(driver.getTitle()); // "Ecommerce Test Site to Learn Web Scraping – ScrapingCourse.com"
In other words, Selenium allows you to perform web crawling by interacting with the elements on a web page, just like a human being would. This makes a web scraper based on a headless browser harder to detect and block. Learn more about how to perform web scraping without getting blocked.
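The "wait for the page to load" comment in the snippet above is where an explicit wait fits. Here's a minimal sketch using Selenium's WebDriverWait, where the 10-second timeout is an arbitrary choice:

import java.time.Duration;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

// wait up to 10 seconds for the product list on the new page to appear
WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
wait.until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("li.product")));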
Other Web Scraping Libraries For Java
Other useful Java libraries for web scraping are:
- HtmlUnit: a GUI-less/headless browser for Java. HtmlUnit can perform all browser-specific operations on a web page. Like Selenium, it was born for testing, but you can use it for web crawling and scraping.
- Playwright: an end-to-end testing library for web apps developed by Microsoft. Again, it enables you to instruct a browser, so you can use it for web scraping just like Selenium. See the short sketch below.
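To give you a feel for the API, here's a minimal, hedged Playwright sketch. It assumes you've added the com.microsoft.playwright:playwright dependency to your project:

import com.microsoft.playwright.*;

public class PlaywrightScraper {
    public static void main(String[] args) {
        try (Playwright playwright = Playwright.create()) {
            // launch a headless Chromium instance and open the target page
            Browser browser = playwright.chromium().launch();
            Page page = browser.newPage();
            page.navigate("https://www.scrapingcourse.com/ecommerce/");

            // count the product elements as a quick sanity check
            System.out.println(page.locator("li.product").count());

            browser.close();
        }
    }
}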
Conclusion
In this web scraping Java tutorial, you learned everything you should know about performing professional web scraping with Java. In detail, you saw:
- Why Java is a good programming language when it comes to web scraping
- How to perform basic web scraping in Java with Jsoup
- How to crawl an entire website in Java
- Why you might need a headless browser
- How to use Selenium to perform scraping in Java on dynamic content websites
What you should never forget is that your web scraper needs to be able to bypass anti-scraping systems. This is why you need a complete web scraping Java API. ZenRows offers that and much more.
In detail, ZenRows is a tool that offers many services to help you perform web scraping. ZenRows also gives access to a headless browser with just a simple API call. Try ZenRows for free and start scraping data from the web with no effort.