How to Parse HTML With Java and Jsoup: 2024 Tutorial

Yuvraj Chandra
September 24, 2024 · 4 min read

Parsing HTML helps present raw data in a structured format and is a fundamental part of any web scraping process. If you web scrape with Java, you can use Jsoup, a popular Java HTML parser, to help you with the task.

In this tutorial, you'll learn the most efficient way of parsing HTML in Java with Jsoup. We'll guide you through setting up Jsoup in your Java project, parsing basic HTML, and advancing to more complex use cases with real-world examples.

Let's roll!

What Is Jsoup?

Jsoup is an open-source Java library that provides an intuitive API for fetching URLs and extracting and manipulating data using DOM API methods. It parses HTML to the same DOM (Document Object Model) as modern browsers.

As a versatile library, Jsoup supports CSS selectors and XPath, which are powerful options for identifying and selecting elements in an HTML document. This flexibility allows you to choose the best method for the task at hand.

Beyond that, Jsoup's ability to handle malformed HTML, such as those containing invalid or incomplete tags, makes it a valuable tool for extracting data from all kinds of websites.
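
Here's a quick sketch of both points (the broken HTML string is made up for illustration): Jsoup parses a fragment with unclosed tags into a valid DOM tree, which you can then query with either a CSS selector or an XPath expression.

Example
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LenientParse {
    public static void main(String[] args) {
        // a malformed fragment: none of the tags are closed
        String broken = "<ul><li>First item<li>Second item";

        // Jsoup normalizes it into a valid DOM tree
        Document doc = Jsoup.parse(broken);

        // select with a CSS selector...
        System.out.println(doc.select("li").first().text()); // First item

        // ...or with an XPath expression (supported since Jsoup 1.14.3)
        System.out.println(doc.selectXpath("//li").last().text()); // Second item
    }
}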

How to Parse HTML With Jsoup in Java?

To parse HTML using Jsoup, load the fetched data into a Document object, which presents the HTML in a DOM tree. Then, navigate through the document and select the desired information.
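
In its simplest form, that flow looks like this minimal, self-contained sketch, which parses an inline HTML string instead of a fetched page:

Example
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class QuickParse {
    public static void main(String[] args) {
        // load an HTML string into a Document (the DOM tree)
        Document doc = Jsoup.parse("<html><body><p>Hello, Jsoup!</p></body></html>");

        // navigate to the first <p> element and print its text content
        System.out.println(doc.selectFirst("p").text()); // Hello, Jsoup!
    }
}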

Below is a step-by-step Jsoup tutorial on how to parse HTML in Java. As an exercise, you'll extract data from Scraping Course's eCommerce Test Site, a sample website for testing web scrapers.

ScrapingCourse.com Ecommerce homepage

Step #1: Install Jsoup

First, include Jsoup in your project. Depending on your project requirements, there are different ways to do this. However, the most common approach is adding the Jsoup library as a dependency to your build configuration (such as Maven or Gradle).

To add Jsoup to a Maven project, include the following XML snippet in your pom.xml <dependencies> section.

pom.xml
<dependency>
  <!-- jsoup HTML parser library @ https://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.17.2</version>
</dependency>

Alternatively, if you're using Gradle, add the following line to your build.gradle file.

build.gradle
// jsoup HTML parser library @ https://jsoup.org/
implementation 'org.jsoup:jsoup:1.17.2'

All done! Let's extract some data.

Step #2: Extract HTML

To extract the HTML source file that you'll parse in this tutorial, make a GET request to the target web page (https://www.scrapingcourse.com/ecommerce/) using HttpClient, Java's built-in HTTP client, and retrieve the response.

Example
package com.example;
 
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
 
public class Main {
    public static void main(String[] args) {
        // create an HttpClient instance
        HttpClient client = HttpClient.newHttpClient();
 
        // build an HttpRequest
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.scrapingcourse.com/ecommerce/"))
                .build();
 
        // send an asynchronous GET request and handle the response
        client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                // extract body as string
                .thenApply(HttpResponse::body)
                // retrieve extracted body
                .thenAccept(htmlContent -> {
                    System.out.println(htmlContent);
                })
                .join();
    }
}

This code sends an asynchronous GET request to the target website and retrieves its raw HTML content as a string.
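
As a side note, Jsoup can also fetch the page for you, letting you skip HttpClient entirely. Here's a minimal sketch of that approach (the user agent string is just an example value):

Example
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFetch {
    public static void main(String[] args) throws IOException {
        // fetch and parse the page in a single call using Jsoup's built-in connection
        Document doc = Jsoup.connect("https://www.scrapingcourse.com/ecommerce/")
                .userAgent("Mozilla/5.0") // example value; adjust as needed
                .get();

        System.out.println(doc.title());
    }
}

We'll stick with HttpClient in this tutorial so the fetching and parsing steps stay separate.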

Now that you have your HTML source file, the next step is extracting some more valuable information.

Step #3: Parse Your First Data

As stated earlier, Jsoup allows you to select elements using CSS selectors or XPath. While CSS selectors are often preferred for their ease of use and simpler syntax, your choice depends on your project's needs and use cases.

If you need help deciding which method to use, read our XPath vs. CSS selectors guide to learn more.

In this tutorial, we'll keep things simple by using CSS selectors to extract the first product's title on the page.

To parse your first piece of data, you need to identify the HTML attributes of the element you're after so you can define a CSS selector that navigates to it.

You can inspect the target web page in a browser using DevTools. To access DevTools, navigate to the website, right-click anywhere on the page, and select Inspect.

You'll find that each product card is an <li> element and that the product title is contained in the only <h2> within that <li>.

ScrapingCourse Products Inspection

Using this information, let's define a CSS selector for the first product's title.

First, import the required Jsoup modules. Then, in your thenAccept block, load the HTML file into a Document object to parse it into a DOM tree.

Example
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
 
//...
 
.thenAccept(htmlContent -> {
 
    // parse the HTML content using Jsoup
    Document doc = Jsoup.parse(htmlContent);
 
})

Then, use doc.select() to define a CSS selector for the first product's title and retrieve its text content.

Example
//...
 
.thenAccept(htmlContent -> {
 
    //...
 
    // select the first h2 using CSS selector and extract its text content
    Element titleElement = doc.select("h2").first();
    String productTitle = titleElement.text();
    System.out.println("Product Title: " + productTitle);
 
})

Combine these snippets with the code from Step 2 to get the following complete result:

Example
package com.example;
 
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
 
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
 
public class Main {
    public static void main(String[] args) {
        // create an HttpClient instance
        HttpClient client = HttpClient.newHttpClient();
 
        // build an HttpRequest
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.scrapingcourse.com/ecommerce/"))
                .build();
 
        // send an asynchronous GET request and handle the response
        client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                // extract body as string
                .thenApply(HttpResponse::body)
                // retrieve extracted body
                .thenAccept(htmlContent -> {
 
                    // parse the HTML content using Jsoup
                    Document doc = Jsoup.parse(htmlContent);
 
                    // select the first h2 using CSS selector and extract its text content
                    Element titleElement = doc.select("h2").first();
                    String productTitle = titleElement.text();
                    System.out.println("Product Title: " + productTitle);
            
                })
                .join();
    }
}

Run it, and it'll return the product title of the first item.

Output
Product Title: Abominable Hoodie

Congratulations! You've created your first Java HTML parser using Jsoup.
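
One caveat worth knowing: select("h2").first() returns null when nothing matches, so calling text() on the result would throw a NullPointerException on a page without an <h2>. A slightly more defensive sketch uses selectFirst() with a null check:

Example
// selectFirst() is equivalent to select(...).first() but reads more clearly
Element titleElement = doc.selectFirst("h2");
if (titleElement != null) {
    System.out.println("Product Title: " + titleElement.text());
} else {
    System.out.println("No product title found on the page");
}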

Step #4: Extract More Data

Now that you know the basics, let's take on a task closer to a real-world use case and extract more than just the first product title.

For this exercise, we'll extract each item's product title, image URL, and link on the page.

You'll need to select every item on the page and iterate through them, extracting the desired details with selectors as in the previous exercise.

Let's put this into practice.

Start by selecting all product elements. You may need to inspect the page in a browser to identify the CSS selector (.product).

Example
.thenAccept(htmlContent -> {
 
    //...
 
    // select all product elements
    Elements productElements = doc.select(".product");
 
})

Next, iterate through each item and retrieve the product title, image URL, and link. You can include a print statement to verify that your code works.

Example
.thenAccept(htmlContent -> {
 
    //...
 
    // iterate through each product
    for (Element productElement : productElements) {
        // retrieve the product title
        String productTitle = productElement.select("h2").text();
        // retrieve the image URL
        String imageUrl = productElement.select("img[src]").attr("src");
        // retrieve the link
        String link = productElement.select("a[href]").attr("href");
 
        System.out.println("Product Title: " + productTitle + "\nImage URL: " + imageUrl + "\nLink: " + link + "\n");
 
    }
 
})
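
The links and image URLs on this page are already absolute. If you ever scrape a page that uses relative URLs instead, Jsoup can resolve them for you: pass the page URL as a base URI when parsing and read attributes with the abs: prefix. A minimal sketch:

Example
// parse with a base URI so relative URLs can be resolved
Document doc = Jsoup.parse(htmlContent, "https://www.scrapingcourse.com/ecommerce/");

// the abs: prefix returns the absolute form of an attribute value
String link = productElement.select("a[href]").attr("abs:href");
String imageUrl = productElement.select("img[src]").attr("abs:src");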

Put everything together. Your complete code should look like this:

Example
package com.example;
 
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
 
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
 
public class Main {
    public static void main(String[] args) {
        // create an HttpClient instance
        HttpClient client = HttpClient.newHttpClient();
 
        // build an HttpRequest
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.scrapingcourse.com/ecommerce/"))
                .build();
 
        // send an asynchronous GET request and handle the response
        client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                // extract body as string
                .thenApply(HttpResponse::body)
                // retrieve extracted body
                .thenAccept(htmlContent -> {
 
                    // parse the HTML content using Jsoup
                    Document doc = Jsoup.parse(htmlContent);
 
                    // select all product elements
                    Elements productElements = doc.select(".product");
                    
                    // iterate through each product
                    for (Element productElement : productElements) {
                        // retrieve the product title
                        String productTitle = productElement.select("h2").text();
                        // retrieve the image URL
                        String imageUrl = productElement.select("img[src]").attr("src");
                        // retrieve the link
                        String link = productElement.select("a[href]").attr("href");
 
                        System.out.println("Product Title: " + productTitle + "\nImage URL: " + imageUrl + "\nLink: " + link + "\n");
 
                    }
            
                })
                .join();
    }
}

Run this code, and you'll get the product title, image URL, and link for each item on the page.

Output
Product Title: Abominable Hoodie
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg
Link: https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/
 
Product Title: Adrienne Trek Jacket
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_main-324x324.jpg
Link: https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/
 
//... truncated for brevity ...//

Step #5: Export Data to CSV

Printing results to your terminal isn't helpful for later analysis. Exporting the data to CSV makes it usable for further processing. Let's learn how to do it.

Add opencsv to your project by including the following XML snippet in the <dependencies> section of your pom.xml file.

pom.xml
<!-- https://mvnrepository.com/artifact/com.opencsv/opencsv -->
<dependency>
    <groupId>com.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>5.9</version>
</dependency>

Next, import opencsv's CSVWriter, along with java.io.FileWriter and java.io.IOException. Then, after selecting all product elements, initialize a CSV writer using a FileWriter and write the header row.

Example
import java.io.FileWriter;
import java.io.IOException;

import com.opencsv.CSVWriter;

//...

.thenAccept(htmlContent -> {

    //...

    // initialize the CSV writer; try-with-resources closes it automatically
    try (CSVWriter csvWriter = new CSVWriter(new FileWriter("products.csv"))) {
        // write the header row
        csvWriter.writeNext(new String[]{"Product Title", "Image URL", "Link"});

        // ... the product loop from Step 4 goes here ...

    } catch (IOException e) {
        e.printStackTrace();
    }

})

Next, within your loop that iterates through each product, write product details to the CSV file.

Example
// write data to CSV
csvWriter.writeNext(new String[]{productTitle, imageUrl, link});

To get a working script, update the previous script with these two snippets.

Example
package com.example;
 
import java.io.FileWriter;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
 
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
 
import com.opencsv.CSVWriter;
 
public class Main {
    public static void main(String[] args) {
        // create an HttpClient instance
        HttpClient client = HttpClient.newHttpClient();
 
        // build an HttpRequest
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.scrapingcourse.com/ecommerce/"))
                .build();
 
        // send an asynchronous GET request and handle the response
        client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                // extract body as string
                .thenApply(HttpResponse::body)
                // retrieve extracted body
                .thenAccept(htmlContent -> {
 
                    // parse the HTML content using Jsoup
                    Document doc = Jsoup.parse(htmlContent);
 
                    // select all product elements
                    Elements productElements = doc.select(".product");
 
                    // initialize CSV writer
                    try (CSVWriter csvWriter = new CSVWriter(new FileWriter("products.csv"))) {
                        // write the header row
                        csvWriter.writeNext(new String[]{"Product Title", "Image URL", "Link"});
 
                        // iterate through each product
                        for (Element productElement : productElements) {
                            // retrieve the product title
                            String productTitle = productElement.select("h2").text();
                            // retrieve the image URL
                            String imageUrl = productElement.select("img[src]").attr("src");
                            // retrieve the link
                            String link = productElement.select("a[href]").attr("href");
 
                            // write data to CSV
                            csvWriter.writeNext(new String[]{productTitle, imageUrl, link});
                        }
 
                        System.out.println("Data successfully exported to CSV");
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                })
                .join();
    }
}

Your result should be a products.csv file, like in the image below.

Extracted Product Data in CSV File

Well done!

Conclusion

A Java HTML parser like Jsoup makes it easy to parse HTML and retrieve the desired data using CSS selectors or XPath.

Following this tutorial, you've gone from parsing your first data to more complex scenarios. Yet, there's more to web scraping in Java. For example, most web pages contain JavaScript elements, which require extra steps to scrape. To learn how to deal with such scenarios and even more advanced topics, check out this guide on web crawling in Java.
