Parsing HTML turns raw markup into structured data and is a fundamental part of any web scraping process. If you're web scraping with Java, you can use Jsoup, a popular Java HTML parser, to help you with the task.
In this tutorial, you'll learn the most efficient way of parsing HTML in Java with Jsoup. We'll guide you through setting up Jsoup in your Java project, parsing basic HTML, and advancing to more complex use cases with real-world examples.
Let's roll!
What Is Jsoup?
Jsoup is an open-source Java library that provides an intuitive API for fetching URLs and extracting and manipulating data using DOM API methods. It parses HTML to the same DOM (Document Object Model) as modern browsers.
As a versatile library, Jsoup supports CSS selectors and XPath, which are powerful options for identifying and selecting elements in an HTML document. This flexibility allows you to choose the best method for the task at hand.
Beyond that, Jsoup's ability to handle malformed HTML, such as documents with invalid or incomplete tags, makes it a valuable tool for extracting data from all kinds of websites.
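To see this in action, here's a minimal, self-contained sketch (the broken HTML string and the class name are made up for illustration; selectXpath() requires Jsoup 1.16.1 or later):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class MalformedHtmlDemo {
    public static void main(String[] args) {
        // an intentionally broken fragment: unclosed <p> and <li> tags
        String badHtml = "<html><body><p>First<p>Second<ul><li>One<li>Two</body>";

        // Jsoup normalizes the fragment into a well-formed DOM tree
        Document doc = Jsoup.parse(badHtml);

        // select elements with a CSS selector...
        System.out.println(doc.select("li").text()); // One Two

        // ...or with an XPath expression
        System.out.println(doc.selectXpath("//p").text()); // First Second
    }
}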
How to Parse HTML With Jsoup in Java
To parse HTML using Jsoup, load the fetched data into a Document object, which represents the HTML as a DOM tree. Then, navigate through the document and select the desired information.
Below is a step-by-step Jsoup tutorial on how to parse HTML in Java. As an exercise, you'll extract data from Scraping Course's eCommerce Test Site, a sample website for testing web scrapers.
Step #1: Install Jsoup
First, include Jsoup in your project. Depending on your project requirements, there are different ways to do this. However, the most common approach is adding the Jsoup library as a dependency to your build configuration (such as Maven or Gradle).
To add Jsoup to a Maven project, include the following XML snippet in the <dependencies> section of your pom.xml file.
<dependency>
    <!-- jsoup HTML parser library @ https://jsoup.org/ -->
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
Alternatively, if you're using Gradle, add the following line to your build.gradle file.
// jsoup HTML parser library @ https://jsoup.org/
implementation 'org.jsoup:jsoup:1.17.2'
All done! Let's extract some data.
Step #2: Extract HTML
To extract the HTML source that you'll parse in this tutorial, make a GET request to the target web page (https://www.scrapingcourse.com/ecommerce/) using HttpClient, Java's built-in HTTP client (available since Java 11), and retrieve the response.
package com.example;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class Main {
    public static void main(String[] args) {
        // create an HttpClient instance
        HttpClient client = HttpClient.newHttpClient();

        // build an HttpRequest
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.scrapingcourse.com/ecommerce/"))
                .build();

        // send an asynchronous GET request and handle the response
        client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                // extract the body as a string
                .thenApply(HttpResponse::body)
                // print the extracted body
                .thenAccept(htmlContent -> {
                    System.out.println(htmlContent);
                })
                .join();
    }
}
This code sends an asynchronous GET request to the target website and retrieves its raw HTML content as a string.
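As an aside, Jsoup can also fetch the page itself with its built-in connect() method, combining the request and parsing into a single step. A minimal sketch:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFetchDemo {
    public static void main(String[] args) throws IOException {
        // fetch and parse the page in one step using Jsoup's built-in HTTP client
        Document doc = Jsoup.connect("https://www.scrapingcourse.com/ecommerce/").get();
        System.out.println(doc.title());
    }
}

We'll keep using HttpClient in this tutorial so the fetching and parsing steps stay clearly separated.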
If you ever encounter a target website that blocks your request, use ZenRows, a web scraping API, to bypass any anti-bot system and retrieve your desired data. It fully supports Java and provides the complete toolkit to scrape any website.
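As a rough sketch only (the api.zenrows.com endpoint and its apikey and url query parameters are assumptions based on ZenRows' documentation, and YOUR_API_KEY is a placeholder), routing the same request through ZenRows could look like this:

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class ZenRowsDemo {
    public static void main(String[] args) throws Exception {
        // hypothetical sketch: pass the target URL to ZenRows' API endpoint
        String target = URLEncoder.encode("https://www.scrapingcourse.com/ecommerce/", StandardCharsets.UTF_8);
        String apiUrl = "https://api.zenrows.com/v1/?apikey=YOUR_API_KEY&url=" + target;

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder().uri(URI.create(apiUrl)).build();

        // a synchronous send keeps the example short
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}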
Now that you have the HTML source, the next step is parsing it to extract the valuable information.
Step #3: Parse Your First Data
As stated earlier, Jsoup allows you to select elements using CSS selectors or XPath. While CSS selectors are often preferred for their ease of use and simpler syntax, your choice depends on your project's needs and use cases.
If you need help deciding which method to use, read our XPath vs. CSS selectors guide to learn more.
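As a quick illustration of the trade-off, here's a minimal sketch that selects the same element both ways (the tiny HTML string is made up for illustration):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorComparison {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<html><body><h2>Abominable Hoodie</h2></body></html>");

        // CSS selector: concise and familiar from front-end work
        Element viaCss = doc.selectFirst("h2");

        // XPath: more verbose, but can express relationships CSS can't
        Element viaXpath = doc.selectXpath("//h2").first();

        System.out.println(viaCss.text().equals(viaXpath.text())); // true
    }
}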
In this tutorial, we'll keep things simple by using CSS selectors to extract the first product's title on the page.
To parse your first piece of data, inspect the HTML attributes of the element you're after so you can define a CSS selector that navigates to it.
You can inspect the target web page on a browser using DevTools. To access DevTools, navigate to the website, right-click anywhere on the page, and select Inspect.
You'll find that each product card is an <li> element and that the product title is contained in the only <h2> within that <li>.
Using this information, let's define a CSS selector for the first product's title.
First, import the required Jsoup classes. Then, in your thenAccept block, load the HTML content into a Document object to parse it into a DOM tree.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
//...
.thenAccept(htmlContent -> {
    // parse the HTML content using Jsoup
    Document doc = Jsoup.parse(htmlContent);
})
Then, use doc.select() to define a CSS selector for the first product's title and retrieve its text content.
//...
.thenAccept(htmlContent -> {
    //...
    // select the first h2 using a CSS selector and extract its text content
    Element titleElement = doc.select("h2").first();
    String productTitle = titleElement.text();
    System.out.println("Product Title: " + productTitle);
})
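One caveat: select("h2").first() returns null when no element matches, so calling text() on it would throw a NullPointerException if the page layout ever changes. A slightly more defensive sketch using Jsoup's equivalent selectFirst():

// selectFirst("h2") is equivalent to select("h2").first(); guard against empty results
Element titleElement = doc.selectFirst("h2");
if (titleElement != null) {
    System.out.println("Product Title: " + titleElement.text());
} else {
    System.out.println("No h2 element found on the page");
}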
Combine these snippets with the code from Step 2 to get the following complete result:
package com.example;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) {
        // create an HttpClient instance
        HttpClient client = HttpClient.newHttpClient();

        // build an HttpRequest
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.scrapingcourse.com/ecommerce/"))
                .build();

        // send an asynchronous GET request and handle the response
        client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                // extract the body as a string
                .thenApply(HttpResponse::body)
                // parse the extracted body
                .thenAccept(htmlContent -> {
                    // parse the HTML content using Jsoup
                    Document doc = Jsoup.parse(htmlContent);

                    // select the first h2 using a CSS selector and extract its text content
                    Element titleElement = doc.select("h2").first();
                    String productTitle = titleElement.text();
                    System.out.println("Product Title: " + productTitle);
                })
                .join();
    }
}
Run it, and it'll return the product title of the first item.
Product Title: Abominable Hoodie
Congratulations! You've created your first Java HTML parser using Jsoup.
Step #4: Extract More Data
Now that you know the basics, let's take up a task closer to a real-world use case and extract more than the first product title.
For this exercise, we'll extract each item's product title, image URL, and link on the page.
To do this, you'll select every product element on the page and iterate through them, extracting each item's details with CSS selectors, as in the previous exercise.
Let's put this into practice.
Start by selecting all product elements. You may need to inspect the page in a browser to identify the right CSS selector (.product).
.thenAccept(htmlContent -> {
    //...
    // select all product elements
    Elements productElements = doc.select(".product");
})
Next, iterate through each item and retrieve the product title, image URL, and link. You can include a print statement to verify that your code works.
.thenAccept(htmlContent -> {
    //...
    // iterate through each product
    for (Element productElement : productElements) {
        // retrieve the product title
        String productTitle = productElement.select("h2").text();
        // retrieve the image URL
        String imageUrl = productElement.select("img[src]").attr("src");
        // retrieve the product link
        String link = productElement.select("a[href]").attr("href");
        System.out.println("Product Title: " + productTitle + "\nImage URL: " + imageUrl + "\nLink: " + link + "\n");
    }
})
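Note that attr("href") returns the attribute value exactly as written in the HTML. The links on this page are already absolute, but if a site uses relative URLs, pass the page URL as a base URI when parsing and use Jsoup's abs: prefix to resolve them:

// supply a base URI so relative links can be resolved
Document doc = Jsoup.parse(htmlContent, "https://www.scrapingcourse.com/ecommerce/");

// the abs: prefix resolves the attribute against that base URI
String link = productElement.select("a[href]").attr("abs:href");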
Put everything together. Your complete code should look like this:
package com.example;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) {
        // create an HttpClient instance
        HttpClient client = HttpClient.newHttpClient();

        // build an HttpRequest
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.scrapingcourse.com/ecommerce/"))
                .build();

        // send an asynchronous GET request and handle the response
        client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                // extract the body as a string
                .thenApply(HttpResponse::body)
                // parse the extracted body
                .thenAccept(htmlContent -> {
                    // parse the HTML content using Jsoup
                    Document doc = Jsoup.parse(htmlContent);

                    // select all product elements
                    Elements productElements = doc.select(".product");

                    // iterate through each product
                    for (Element productElement : productElements) {
                        // retrieve the product title
                        String productTitle = productElement.select("h2").text();
                        // retrieve the image URL
                        String imageUrl = productElement.select("img[src]").attr("src");
                        // retrieve the product link
                        String link = productElement.select("a[href]").attr("href");
                        System.out.println("Product Title: " + productTitle + "\nImage URL: " + imageUrl + "\nLink: " + link + "\n");
                    }
                })
                .join();
    }
}
Run this code, and you'll get the product title, image URL, and link for each item on the page.
Product Title: Abominable Hoodie
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg
Link: https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/
Product Title: Adrienne Trek Jacket
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_main-324x324.jpg
Link: https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/
//... truncated for brevity ...//
Step #5: Export Data to CSV
Printing results to your terminal isn't very useful for later analysis. Exporting the data to CSV makes it easy to store, share, and process further. Let's learn how to do it.
Add opencsv to your project by including the following XML snippet in the pom.xml file.
<!-- https://mvnrepository.com/artifact/com.opencsv/opencsv -->
<dependency>
    <groupId>com.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>5.9</version>
</dependency>
Next, import the opencsv library (plus FileWriter and IOException from java.io) so you can create a CSV writer. Then, after selecting all product elements, initialize a CSV writer using a FileWriter and write the header row.
import java.io.FileWriter;
import java.io.IOException;

import com.opencsv.CSVWriter;

//...
.thenAccept(htmlContent -> {
    //...
    // initialize the CSV writer (try-with-resources closes it automatically)
    try (CSVWriter csvWriter = new CSVWriter(new FileWriter("products.csv"))) {
        // write the header row
        csvWriter.writeNext(new String[]{"Product Title", "Image URL", "Link"});
    } catch (IOException e) {
        e.printStackTrace();
    }
})
Next, move the loop that iterates through each product inside the try block (so the writer is still open) and write each product's details to the CSV file.
// write the data row to the CSV
csvWriter.writeNext(new String[]{productTitle, imageUrl, link});
To get a working script, update the previous code with these two snippets. Your complete script should look like this:
package com.example;

import java.io.FileWriter;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.opencsv.CSVWriter;

public class Main {
    public static void main(String[] args) {
        // create an HttpClient instance
        HttpClient client = HttpClient.newHttpClient();

        // build an HttpRequest
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.scrapingcourse.com/ecommerce/"))
                .build();

        // send an asynchronous GET request and handle the response
        client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                // extract the body as a string
                .thenApply(HttpResponse::body)
                // parse the extracted body
                .thenAccept(htmlContent -> {
                    // parse the HTML content using Jsoup
                    Document doc = Jsoup.parse(htmlContent);

                    // select all product elements
                    Elements productElements = doc.select(".product");

                    // initialize the CSV writer (try-with-resources closes it automatically)
                    try (CSVWriter csvWriter = new CSVWriter(new FileWriter("products.csv"))) {
                        // write the header row
                        csvWriter.writeNext(new String[]{"Product Title", "Image URL", "Link"});

                        // iterate through each product
                        for (Element productElement : productElements) {
                            // retrieve the product title
                            String productTitle = productElement.select("h2").text();
                            // retrieve the image URL
                            String imageUrl = productElement.select("img[src]").attr("src");
                            // retrieve the product link
                            String link = productElement.select("a[href]").attr("href");
                            // write the data row to the CSV
                            csvWriter.writeNext(new String[]{productTitle, imageUrl, link});
                        }

                        System.out.println("Data successfully exported to CSV");
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                })
                .join();
    }
}
Your result should be a products.csv file containing the extracted data, like the sample below.
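Based on the output from Step #4, and given that opencsv wraps every field in quotes by default, the first few rows should look roughly like this:

"Product Title","Image URL","Link"
"Abominable Hoodie","https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg","https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"
"Adrienne Trek Jacket","https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_main-324x324.jpg","https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/"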
Well done!
Conclusion
A Java HTML parser like Jsoup makes it easy to parse HTML and retrieve the desired data using CSS selectors or XPath.
Following this tutorial, you've gone from parsing your first piece of data to handling more complex scenarios. Yet there's more to web scraping in Java. For example, many web pages render content with JavaScript, which requires extra steps to scrape. To learn how to deal with such scenarios and even more advanced topics, check out this guide on web crawling in Java.