In this Java web scraping tutorial, you'll learn everything you need to know about web scraping in Java. Follow this step-by-step guide and you'll master both the basics and the most advanced aspects of web scraping.
Let's not waste any more time! Learn how to build a web scraper in Java. This script will be able to crawl an entire website and automatically extract data from it. Cool, isn't it?
Can You Web Scrape With Java?
The short answer is "Yes, you can!"
Java is one of the most reliable object-oriented programming languages available and can count on a wide ecosystem of libraries. This means there are several Java web scraping libraries you can choose from.
Two examples are Jsoup and Selenium. These libraries allow you to connect to a web page, and they come with many functions to help you extract the data that interests you. In this Java web scraping tutorial, you'll learn how to use both.
How Do You Scrape a Page in Java?
You can scrape a web page in Java just as you can in any other programming language. You need a web scraping Java library that allows you to visit a web page, retrieve HTML elements, and extract data from them.
You can easily install a Java web scraping library with Maven or Gradle, the two most popular Java dependency management tools. Follow this web scraping Java tutorial to learn more about how to do web scraping in Java.
Getting Started
Before starting to build your Java web scraper, you need to meet the following requirements:
- Java LTS 8+: any LTS (Long-Term Support) version of Java greater than or equal to 8 will do. Specifically, this Java web scraping tutorial refers to Java 21, the latest LTS version at the time of writing.
- Gradle or Maven: choose one of the two build automation tools. You'll need one of them for its dependency management features to install your Java web scraping library.
- A Java IDE: any IDE that supports Java and can integrate with Maven and Gradle will do. IntelliJ IDEA is one of the best options available.
If you don't meet these prerequisites, follow the links above. Download and install Java, Gradle or Maven, and a Java IDE, in that order. If you encounter problems, follow the official installation guides. Then, verify everything went as expected with the following terminal command:
java -version
This should return something like this:
openjdk version "21" 2023-09-19 LTS
OpenJDK Runtime Environment Temurin-21+35 (build 21+35-LTS)
OpenJDK 64-Bit Server VM Temurin-21+35 (build 21+35-LTS, mixed mode, sharing)
As you can see, this output shows the version of Java installed on your machine.
Then, if you're a Gradle user, type in your terminal:
gradle -v
Again, this would return the version of Gradle you installed, as follows:
------------------------------------------------------------
Gradle 8.4
------------------------------------------------------------
Build time: 2023-10-04 20:52:13 UTC
Revision: e9251e572c9bd1d01e503a0dfdf43aedaeecdc3f
Kotlin: 1.9.10
Groovy: 3.0.17
Ant: Apache Ant(TM) version 1.10.13 compiled on January 4 2023
JVM: 21 (Eclipse Adoptium 21+35-LTS)
OS: Linux 6.2.0-34-generic amd64
Or, if you're a Maven user, launch the command below:
mvn -v
If the Maven installation process worked as expected, this should return something like this:
Apache Maven 3.9.5 (57804ffe001d7215b5e7bcb531cf83df38f93546)
You're ready to follow this step-by-step web scraping Java tutorial. Specifically, you're going to learn how to perform web scraping in Java on ScrapingCourse.com, a demo site with real e-commerce features dedicated to learning web scraping.
Have a look at the target product page.
Note that the ScrapingCourse.com e-commerce website is just a simple paginated list of products. The goal of the Java web scraper will be to crawl the entire website and retrieve all product data.
In your IDE, create a new Java project. Select Maven or Gradle based on the build automation tool you installed or want to use. Then, click "Create" to initialize your Java project. Wait for the setup process to end, and you'll have access to the following Java web scraping project:
Let's now learn the basics of web scraping using Java!
Basic Web Scraping in Java
The first thing you need to learn is how to scrape a static website in Java. You can think of a static website as a collection of pre-built HTML documents. Each of these HTML pages will have its CSS and JavaScript files. A static website relies on server-side rendering.
In a static web page, the content is embedded in the HTML document provided by the server. So, you don't need a web browser to extract data from it. In detail, static web scraping is about:
- Downloading a web page.
- Parsing the HTML document retrieved from the server.
- Selecting the HTML elements containing the data of interest from the web page.
- Extracting the data from them.
In Java, scraping web pages isn't difficult, especially when it comes to static web scraping. Let's now learn the basics of web scraping with Java.
Step #1: Install Jsoup
First, you need a web scraping Java library. Jsoup is a Java library that makes web scraping easy. In detail, Jsoup comes with an advanced Java web scraping API that allows you to connect to a web page with its URL, select HTML elements with CSS selectors, and extract data from them.
In other terms, Jsoup offers you almost everything you need to perform static web scraping with Java. If you're a Gradle user, add jsoup to the dependencies section of your build.gradle file:
implementation "org.jsoup:jsoup:1.16.1"
For Maven users, add the following inside the dependencies tag of your pom.xml file:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.16.1</version>
</dependency>
Then, if you're an IntelliJ user, don't forget to click on the Gradle/Maven reload button below to install the new dependencies:
Jsoup is now installed and ready to use. Import it in your Scraper.java file as follows:
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
// also import IOException to handle connection errors in the next snippets
import java.io.IOException;
Let's write your first web scraping Java script!
Step #2: Connect to your target website
You can use Jsoup to connect to a website using its URL with the following lines:
// initializing the HTML Document page variable
Document doc;
try {
    // fetching the target website
    doc = Jsoup.connect("https://www.scrapingcourse.com/ecommerce/").get();
} catch (IOException e) {
    throw new RuntimeException(e);
}
This snippet uses the connect() method from Jsoup to connect to the target website. Note that if the connection fails, Jsoup throws an IOException. This is why you need the try ... catch logic. Then, the get() method returns a Jsoup HTML Document object you can use to explore the DOM.
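As a quick sanity check, you might print the page title from the Document you just retrieved. This step is optional and only meant to verify the connection worked:
// printing the content of the <title> tag of the fetched page
System.out.println(doc.title());
// this should print something like:
// "Ecommerce Test Site to Learn Web Scraping – ScrapingCourse.com"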
Keep in mind that many websites automatically block requests that don't come with a set of expected HTTP headers. This is one of the most basic anti-scraping systems. Thus, you can often avoid getting blocked simply by setting these HTTP headers manually.
In general, the most important header you should always set is the User-Agent header. This is a string that helps the server identify the application, operating system, and vendor the HTTP request comes from.
You can set the User-Agent header and other HTTP headers in Jsoup as follows:
doc = Jsoup
.connect("https://www.scrapingcourse.com/ecommerce/")
.userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")
.header("Accept-Language", "*")
.get();
Specifically, you can set the User-Agent with the Jsoup userAgent() method. Similarly, you can specify any other HTTP header with header().
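Jsoup's Connection object also exposes a few other request settings you may find handy. Here's a sketch with some of them, to be wrapped in the same try ... catch block as before (the specific values are just examples):
doc = Jsoup
.connect("https://www.scrapingcourse.com/ecommerce/")
.userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")
// any other header can be set with header()
.header("Accept-Language", "en-US,en;q=0.9")
// the page the request appears to come from
.referrer("https://www.google.com/")
// give slow pages up to 10 seconds before failing
.timeout(10000)
.get();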
Step #3: Select the HTML elements of interest
Open your target web page in the browser and identify the HTML elements of interest. In this case, you want to scrape all product HTML elements. Right-click on a product HTML element and select the "Inspect" option. This should open the DevTools window below:
As you can see, a product is a li.product HTML element. This includes:
- An a HTML element: contains the URL associated with the product.
- An img HTML element: contains the product image.
- An h2 HTML element: contains the product name.
- A span HTML element: contains the product price.
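If you prefer to verify this structure from your code rather than DevTools, here's a quick sketch that prints the markup of the first product with Jsoup:
// selecting the first product element and printing its markup
Element firstProduct = doc.selectFirst("li.product");
if (firstProduct != null) {
    System.out.println(firstProduct.outerHtml());
}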
Now, let's learn how to extract data from a product HTML element with web scraping in Java.
Step #4: Extract data from the HTML elements
First, you need a Java object to store the scraped data. Create a data folder in the main package and define a Product.java class as follows:
package com.zenrows.data;

public class Product {
    private String url;
    private String image;
    private String name;
    private String price;

    // getters and setters omitted for brevity...

    @Override
    public String toString() {
        return "{ \"url\":\"" + url + "\", "
                + " \"image\": \"" + image + "\", "
                + "\"name\":\"" + name + "\", "
                + "\"price\": \"" + price + "\" }";
    }
}
Note that the toString() method produces a string in JSON format. This will come in handy later.
Now, let's retrieve the list of li.product HTML elements on the target web page. You can achieve this with Jsoup as below:
Elements productElements = doc.select("li.product");
What this snippet does is simple. The Jsoup select() function applies the CSS selector strategy to retrieve all li.product elements on the web page. In detail, Elements extends ArrayList, so you can easily iterate over it.
Thus, you can iterate over productElements to extract the info of interest and store it in Product objects:
// initializing the list of Java object to store
// the scraped data
List<Product> products = new ArrayList<>();
// retrieving the list of product HTML elements
Elements productElements = doc.select("li.product");
// iterating over the list of HTML products
for (Element productElement : productElements) {
    Product product = new Product();
    // extracting the data of interest from the product HTML element
    // and storing it in product
    product.setUrl(productElement.selectFirst("a").attr("href"));
    product.setImage(productElement.selectFirst("img").attr("src"));
    product.setName(productElement.selectFirst("h2").text());
    product.setPrice(productElement.selectFirst("span").text());
    // adding product to the list of the scraped products
    products.add(product);
}
This logic uses the Java web scraping API offered by Jsoup to extract all the data of interest from each product HTML element. Then, it initializes a Product with this data and adds it to the list of scraped products.
Congrats! You just learned how to scrape data from a web page with Jsoup. Let's now convert this data into a more useful format.
Step #5: Export the data to JSON
Don't forget that the toString() method of Product returns a JSON string. So, you can simply call toString() on the List<Product> object and print the result:
String json = products.toString();
System.out.println(json);
Calling toString() on an ArrayList invokes the toString() method on each element of the list. Then, it wraps the result in square brackets.
In other words, this will produce the following JSON data:
[
    {
        "url": "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/",
        "image": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg",
        "name": "Abominable Hoodie",
        "price": "$69.00"
    },
    // ...
    {
        "url": "https://www.scrapingcourse.com/ecommerce/product/artemis-running-short/",
        "image": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wsh04-black_main-324x324.jpg",
        "name": "Artemis Running Short",
        "price": "$49.00"
    }
]
Et voilà! You just performed web scraping using Java! Yet, the website consists of several web pages. Let's see how to scrape them all.
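Before that, note that you may want to persist this JSON output to disk instead of just printing it. Here's a minimal sketch using java.nio (the products.json file name is just an example):
// requires: import java.nio.file.Files; import java.nio.file.Path;
// writing the JSON string produced by toString() to a local file
try {
    Files.writeString(Path.of("products.json"), products.toString());
} catch (IOException e) {
    throw new RuntimeException(e);
}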
Web Crawling in Java
Let's now retrieve the list of all pagination links to scrape the entire website. This is what web crawling is about. Right-click on the pagination number HTML element and choose the "Inspect" option.
The browser should open the DevTools section and highlight the selected DOM element, as below:
From here, note that you can extract all the pagination number HTML elements with the a.page-numbers CSS selector. These elements contain the links you want to scrape. You can retrieve them all with Jsoup as below:
Elements paginationElements = doc.select("a.page-numbers");
If you want to scrape all web pages, you have to implement some crawling logic. Also, you'll need to rely on some lists and sets to avoid scraping a web page twice. You can implement web crawling logic that visits a limited number of web pages, controlled by limit, as follows:
package com.zenrows;
import com.zenrows.data.Product;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
import java.io.IOException;
import java.util.*;
public class Scraper {
public static void scrapeProductPage(
List<Product> products,
Set<String> pagesDiscovered,
List<String> pagesToScrape
) {
// the current web page is about to be scraped and
// should no longer be part of the scraping queue
String url = pagesToScrape.remove(0);
pagesDiscovered.add(url);
// ... scraping logic omitted for brevity
Elements paginationElements = doc.select("a.page-numbers");
// iterating over the pagination HTML elements
for (Element pageElement : paginationElements) {
// the new link discovered
String pageUrl = pageElement.attr("href");
// if the web page discovered is new and should be scraped
if (!pagesDiscovered.contains(pageUrl) && !pagesToScrape.contains(pageUrl)) {
pagesToScrape.add(pageUrl);
}
// adding the link just discovered
// to the set of pages discovered so far
pagesDiscovered.add(pageUrl);
}
}
public static void main(String[] args) {
// initializing the list of Java object to store
// the scraped data
List<Product> products = new ArrayList<>();
// initializing the set of web page urls
// discovered while crawling the target website
Set<String> pagesDiscovered = new HashSet<>();
// initializing the queue of urls to scrape
List<String> pagesToScrape = new ArrayList<>();
// initializing the scraping queue with the
// first pagination page
pagesToScrape.add("https://www.scrapingcourse.com/ecommerce/page/1/");
// the number of iteration executed
int i = 0;
// to limit the number of pages to scrape to 12
int limit = 12;
while (!pagesToScrape.isEmpty() && i < limit) {
scrapeProductPage(products, pagesDiscovered, pagesToScrape);
// incrementing the iteration number
i++;
}
System.out.println(products.size());
// writing the scraped data to a db or export it to a file...
}
}
scrapeProductPage() scrapes a web page, discovers new links to scrape, and adds their URLs to the scraping queue. You've set the limit to 12 because the target website has 12 pages. So, at the end of the while loop, pagesToScrape will be empty and pagesDiscovered will contain all 12 pagination URLs.
Here's the full code:
package com.zenrows;
import com.zenrows.data.Product;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
import java.io.IOException;
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class Scraper {
public static void scrapeProductPage(
List<Product> products,
Set<String> pagesDiscovered,
List<String> pagesToScrape
) {
if (!pagesToScrape.isEmpty()) {
// the current web page is about to be scraped and
// should no longer be part of the scraping queue
String url = pagesToScrape.remove(0);
pagesDiscovered.add(url);
// initializing the HTML Document page variable
Document doc;
try {
// fetching the target website
doc = Jsoup
.connect(url)
.userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")
.get();
} catch (IOException e) {
throw new RuntimeException(e);
}
// retrieving the list of product HTML elements
// in the target page
Elements productElements = doc.select("li.product");
// iterating over the list of HTML products
for (Element productElement : productElements) {
Product product = new Product();
// extracting the data of interest from the product HTML element
// and storing it in product
product.setUrl(productElement.selectFirst("a").attr("href"));
product.setImage(productElement.selectFirst("img").attr("src"));
product.setName(productElement.selectFirst("h2").text());
product.setPrice(productElement.selectFirst("span").text());
// adding product to the list of the scraped products
products.add(product);
}
// retrieving the list of pagination HTML element
Elements paginationElements = doc.select("a.page-numbers");
// iterating over the pagination HTML elements
for (Element pageElement : paginationElements) {
// the new link discovered
String pageUrl = pageElement.attr("href");
// if the web page discovered is new and should be scraped
if (!pagesDiscovered.contains(pageUrl) && !pagesToScrape.contains(pageUrl)) {
pagesToScrape.add(pageUrl);
}
// adding the link just discovered
// to the set of pages discovered so far
pagesDiscovered.add(pageUrl);
}
// logging the end of the scraping operation
System.out.println(url + " -> page scraped");
}
}
public static void main(String[] args) {
// initializing the list of Java object to store
// the scraped data
List<Product> products = new ArrayList<>();
// initializing the set of web page urls
// discovered while crawling the target website
Set<String> pagesDiscovered = new HashSet<>();
// initializing the queue of urls to scrape
List<String> pagesToScrape = new ArrayList<>();
// initializing the scraping queue with the
// first pagination page
pagesToScrape.add("https://www.scrapingcourse.com/ecommerce/page/1/");
// the number of iteration executed
int i = 0;
// to limit the number of pages to scrape to 12
int limit = 12;
while (!pagesToScrape.isEmpty() && i < limit) {
scrapeProductPage(products, pagesDiscovered, pagesToScrape);
// incrementing the iteration number
i++;
}
System.out.println(products.size());
// writing the scraped data to a db or export it to a file...
}
}
If you're an IntelliJ IDEA user, click on the run icon to run the web scraping Java example. Wait for the process to end; it will take a few seconds. At the end of the process, products will contain all 188 products.
Congratulations! You just extracted all the product data automatically!
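Before moving on, you might also want to export the scraped products to a CSV file, which is often handier than JSON for e-commerce data. Here's a minimal sketch; it assumes the Product getters omitted earlier and doesn't escape commas inside values:
// requires: import java.io.PrintWriter;
// exporting the scraped products to a CSV file
try (PrintWriter writer = new PrintWriter("products.csv")) {
    // writing the header row
    writer.println("url,image,name,price");
    // writing one row per scraped product
    for (Product product : products) {
        writer.println(product.getUrl() + "," + product.getImage() + "," + product.getName() + "," + product.getPrice());
    }
} catch (IOException e) {
    throw new RuntimeException(e);
}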
Parallel Web Scraping in Java
Web scraping in Java can become a time-consuming process. This is especially true if your target website consists of many web pages and/or the server takes time to respond. Also, Java isn't known for being a particularly fast language.
At the same time, Java 8 introduced many features that make parallelism easier. So, transforming your Java web scraper to work in parallel takes only a few updates. Let's see how you can perform parallel web scraping in Java:
package com.zenrows;
import com.zenrows.data.Product;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
import java.io.IOException;
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class Scraper {
public static void scrapeProductPage(
List<Product> products,
Set<String> pagesDiscovered,
List<String> pagesToScrape
) {
//... omitted for brevity
}
public static void main(String[] args) throws InterruptedException {
// initializing the list of Java object to store
// the scraped data
List<Product> products = Collections.synchronizedList(new ArrayList<>());
// initializing the set of web page urls
// discovered while crawling the target website
Set<String> pagesDiscovered = Collections.synchronizedSet(new HashSet<>());
// initializing the queue of urls to scrape
List<String> pagesToScrape = Collections.synchronizedList(new ArrayList<>());
// initializing the scraping queue with the
// first pagination page
pagesToScrape.add("https://www.scrapingcourse.com/ecommerce/page/1/");
// initializing the ExecutorService to run the
// web scraping process in parallel on 4 pages at a time
ExecutorService executorService = Executors.newFixedThreadPool(4);
// launching the web scraping process to discover some
// urls and take advantage of the parallelization process
scrapeProductPage(products, pagesDiscovered, pagesToScrape);
// the number of iterations executed
int i = 1;
// to limit the number of pages to scrape to 12
int limit = 12;
while (!pagesToScrape.isEmpty() && i < limit) {
// registering the web scraping task
executorService.execute(() -> scrapeProductPage(products, pagesDiscovered, pagesToScrape));
// adding a 200ms delay to avoid overloading the server
TimeUnit.MILLISECONDS.sleep(200);
// incrementing the iteration number
i++;
}
// waiting up to 300 seconds for all pending tasks to end
executorService.shutdown();
executorService.awaitTermination(300, TimeUnit.SECONDS);
System.out.println(products.size());
}
}
See the complete code below:
package com.zenrows;
import com.zenrows.data.Product;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
import java.io.IOException;
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class Scraper {
public static void scrapeProductPage(
List<Product> products,
Set<String> pagesDiscovered,
List<String> pagesToScrape
) {
if (!pagesToScrape.isEmpty()) {
// the current web page is about to be scraped and
// should no longer be part of the scraping queue
String url = pagesToScrape.remove(0);
pagesDiscovered.add(url);
// initializing the HTML Document page variable
Document doc;
try {
// fetching the target website
doc = Jsoup
.connect(url)
.userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")
.get();
} catch (IOException e) {
throw new RuntimeException(e);
}
// retrieving the list of product HTML elements
// in the target page
Elements productElements = doc.select("li.product");
// iterating over the list of HTML products
for (Element productElement : productElements) {
Product product = new Product();
// extracting the data of interest from the product HTML element
// and storing it in product
product.setUrl(productElement.selectFirst("a").attr("href"));
product.setImage(productElement.selectFirst("img").attr("src"));
product.setName(productElement.selectFirst("h2").text());
product.setPrice(productElement.selectFirst("span").text());
// adding product to the list of the scraped products
products.add(product);
}
// retrieving the list of pagination HTML element
Elements paginationElements = doc.select("a.page-numbers");
// iterating over the pagination HTML elements
for (Element pageElement : paginationElements) {
// the new link discovered
String pageUrl = pageElement.attr("href");
// if the web page discovered is new and should be scraped
if (!pagesDiscovered.contains(pageUrl) && !pagesToScrape.contains(pageUrl)) {
pagesToScrape.add(pageUrl);
}
// adding the link just discovered
// to the set of pages discovered so far
pagesDiscovered.add(pageUrl);
}
// logging the end of the scraping operation
System.out.println(url + " -> page scraped");
}
}
public static void main(String[] args) throws InterruptedException {
// initializing the list of Java object to store
// the scraped data
List<Product> products = Collections.synchronizedList(new ArrayList<>());
// initializing the set of web page urls
// discovered while crawling the target website
Set<String> pagesDiscovered = Collections.synchronizedSet(new HashSet<>());
// initializing the queue of urls to scrape
List<String> pagesToScrape = Collections.synchronizedList(new ArrayList<>());
// initializing the scraping queue with the
// first pagination page
pagesToScrape.add("https://www.scrapingcourse.com/ecommerce/page/1/");
// initializing the ExecutorService to run the
// web scraping process in parallel on 4 pages at a time
ExecutorService executorService = Executors.newFixedThreadPool(4);
// launching the web scraping process to discover some
// urls and take advantage of the parallelization process
scrapeProductPage(products, pagesDiscovered, pagesToScrape);
// the number of iterations executed
int i = 1;
// to limit the number of pages to scrape to 12
int limit = 12;
while (!pagesToScrape.isEmpty() && i < limit) {
// registering the web scraping task
executorService.execute(() -> scrapeProductPage(products, pagesDiscovered, pagesToScrape));
// adding a 200ms delay to avoid overloading the server
TimeUnit.MILLISECONDS.sleep(200);
// incrementing the iteration number
i++;
}
// waiting up to 300 seconds for all pending tasks to end
executorService.shutdown();
executorService.awaitTermination(300, TimeUnit.SECONDS);
System.out.println(products.size());
// writing the scraped data to a db or export it to a file...
}
}
Keep in mind that ArrayList and HashSet are not thread-safe in Java. This is why you need to wrap your collections with Collections.synchronizedList() and Collections.synchronizedSet(), respectively. These methods turn them into thread-safe collections you can then use across threads.
Then, you can use ExecutorService to run tasks asynchronously. Thanks to ExecutorService, you can execute and manage parallel tasks with no effort. Specifically, newFixedThreadPool() allows you to initialize an executor that can simultaneously run as many threads as the number passed to the initialization method.
You don't want to overload the target server or your local machine. This is why you need to add a few milliseconds of delay between task submissions with sleep(). Your goal is to perform web scraping, not a DoS attack.
Then, always remember to shut down your ExecutorService and release its resources. Since some tasks may still be running when the code exits the while loop, you should use the awaitTermination() method.
You must call this method after a shutdown request. In detail, awaitTermination() blocks and waits for all tasks to complete within the time interval passed as a parameter.
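In isolation, the ExecutorService lifecycle boils down to four calls. Here's a stripped-down sketch of the same pattern used in the script above (run it from a method that declares throws InterruptedException, like main()):
// creating a pool that runs at most 4 tasks at the same time
ExecutorService executorService = Executors.newFixedThreadPool(4);
// submitting a task to the pool
executorService.execute(() -> System.out.println("scraping a page..."));
// asking the pool to stop accepting new tasks
executorService.shutdown();
// waiting up to 300 seconds for the submitted tasks to finish
executorService.awaitTermination(300, TimeUnit.SECONDS);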
Run this Java web scraping example script and you'll experience a noticeable increase in performance compared to before. You just learned how to speed up web scraping in Java with parallelism.
Well done! You now know how to do parallel web scraping with Java! But there are still a few lessons to learn!
Scraping Dynamic Content Websites in Java
Don't forget that a web page is more than its corresponding HTML document. Web pages can perform HTTP requests in the browser via AJAX. This mechanism allows them to retrieve data asynchronously and update the content shown to the user accordingly.
Most websites now rely on frontend API requests to retrieve data. These requests are AJAX calls. So, these API calls provide valuable data you can't ignore when it comes to web scraping. You can sniff these calls, replicate them in your scraping script, and retrieve this data.
To sniff an AJAX call, use the DevTools of your browser. Right-click on a web page, choose "Inspect", and select the "Network" tab. In the "Fetch/XHR" tab, you'll find the list of AJAX calls the web page executed, as below.
Here, you can retrieve all the info you need to replicate these calls in your web scraping script. Yet, this isn't the best approach.
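For reference, replicating a sniffed call usually boils down to a plain HTTP request plus some JSON parsing. Here's a minimal sketch using Java's built-in HttpClient; the endpoint below is hypothetical, and you'd replace it with the URL found in the "Network" tab:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ApiScraper {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // building a GET request to the (hypothetical) endpoint sniffed in DevTools
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.example.com/api/products?page=1"))
                .header("Accept", "application/json")
                .build();
        // sending the request and printing the raw JSON response body
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}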
Web Scraping With a Headless Browser
Web pages perform most of their AJAX calls in response to user interaction. This is why you need a tool that can load a web page in a browser and replicate user interaction. This is what a headless browser is about.
In detail, a headless browser is a web browser with no GUI that enables you to control a web page programmatically. In other terms, a headless browser allows you to instruct a web browser to perform some tasks.
Thanks to a headless browser, you can interact with a web page through JavaScript as a human being would. One of the most popular libraries in Java offering headless browser functionality is Selenium WebDriver.
Note that the ZenRows API comes with headless browser capabilities. Learn more about how to extract dynamically loaded data.
If you use Gradle, add selenium-java with the line below in the dependencies section of your build.gradle file:
implementation "org.seleniumhq.selenium:selenium-java:4.14.1"
Otherwise, if you use Maven, insert the following lines in your pom.xml file:
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.14.1</version>
</dependency>
Make sure to install the new dependency by running the dependency update command from the terminal or in your IDE. Also, ensure you have a recent version of Chrome installed. WebDriver previously needed its own setup, but it's handled automatically in Selenium 4 and later. If you use Gradle, check the dependencies section of your build.gradle file; if you use Maven, check your pom.xml file. You're now ready to start using Selenium.
You can replicate the web scraping logic seen above on a single page with the following script:
import org.openqa.selenium.*;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.WebDriverWait;
import java.util.*;
import com.zenrows.data.Product;

public class Scraper {
    public static void main(String[] args) {
        // defining the options to run Chrome in headless mode
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");
        // initializing a Selenium WebDriver ChromeDriver instance
        // to run Chrome in headless mode
        WebDriver driver = new ChromeDriver(options);
        // connecting to the target web page
        driver.get("https://www.scrapingcourse.com/ecommerce/");
        // initializing the list of Java object to store
        // the scraped data
        List<Product> products = new ArrayList<>();
        // retrieving the list of product HTML elements
        List<WebElement> productElements = driver.findElements(By.cssSelector("li.product"));
        // iterating over the list of HTML products
        for (WebElement productElement : productElements) {
            Product product = new Product();
            // extracting the data of interest from the product HTML element
            // and storing it in product
            product.setUrl(productElement.findElement(By.tagName("a")).getAttribute("href"));
            product.setImage(productElement.findElement(By.tagName("img")).getAttribute("src"));
            product.setName(productElement.findElement(By.tagName("h2")).getText());
            product.setPrice(productElement.findElement(By.tagName("span")).getText());
            // adding product to the list of the scraped products
            products.add(product);
        }
        // ...
        driver.quit();
    }
}
As you can see, the web scraping logic isn't that different from what you've seen before. What truly changes is that Selenium runs the web scraping logic in the browser. This means that Selenium has access to all the features offered by a browser.
For example, you can click on a pagination element to directly navigate to a new page, as below:
WebElement paginationElement = driver.findElement(By.cssSelector("a.page-numbers"));
// navigating to a new web page
paginationElement.click();
// wait for the page to load...
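// for example, you could wait up to 10 seconds for the browser to land
// on the new pagination URL (a sketch; it also needs java.time.Duration
// and org.openqa.selenium.support.ui.ExpectedConditions)
new WebDriverWait(driver, Duration.ofSeconds(10))
.until(ExpectedConditions.urlContains("/page/"));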
System.out.println(driver.getTitle()); // "Ecommerce Test Site to Learn Web Scraping – ScrapingCourse.com"
In other words, Selenium allows you to perform web crawling by interacting with the elements on a web page, just like a human being would. This makes a web scraper based on a headless browser harder to detect and block. Learn more about how to perform web scraping without getting blocked.
Other Web Scraping Libraries For Java
Other useful Java libraries for web scraping are:
- HtmlUnit: a GUI-less/headless browser for Java. HtmlUnit can perform all browser-specific operations on a web page. Like Selenium, it was born for testing, but you can use it for web crawling and scraping.
- Playwright: an end-to-end testing library for web apps developed by Microsoft. Again, it enables you to instruct a browser, so you can use it for web scraping just like Selenium (a minimal example follows below).
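As a taste of the Playwright option, here's a minimal sketch. It assumes you've installed the com.microsoft.playwright:playwright dependency, and it simply prints the product names on the first page of the target site:
import com.microsoft.playwright.*;

public class PlaywrightScraper {
    public static void main(String[] args) {
        try (Playwright playwright = Playwright.create()) {
            // launching a headless Chromium instance
            Browser browser = playwright.chromium().launch();
            Page page = browser.newPage();
            // connecting to the target web page
            page.navigate("https://www.scrapingcourse.com/ecommerce/");
            // printing the name of each product on the page
            for (Locator productName : page.locator("li.product h2").all()) {
                System.out.println(productName.textContent());
            }
        }
    }
}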
Conclusion
In this web scraping Java tutorial, you learned everything you need to know to perform professional web scraping with Java. In detail, you saw:
- Why Java is a good programming language when it comes to web scraping
- How to perform basic web scraping in Java with Jsoup
- How to crawl an entire website in Java
- Why you might need a headless browser
- How to use Selenium to perform scraping in Java on dynamic content websites
What you should never forget is that your web scraper needs to be able to bypass anti-scraping systems. This is why you need a complete web scraping Java API. ZenRows offers that and much more.
In detail, ZenRows is a tool that offers many services to help you perform web scraping. ZenRows also gives you access to a headless browser with a simple API call. Try ZenRows for free and start scraping data from the web with no effort.