Web crawlers are at the heart of data extraction tools and many other analysis applications. If you're looking to build a web crawler in Java, you've come to the right place.
This guide will take you step by step through the process of creating a web crawler in Java. From project setup to development and optimization, you'll learn how to efficiently discover URLs and extract data at scale.
Before we begin, let's cover some background information.
What Is Web Crawling?
Web crawling is the process of navigating the web to discover specific information (usually URLs and page links) for different purposes, such as web scraping, archiving, or indexing, as with search engines like Google and Bing.
Although web crawling is often used interchangeably with web scraping, it's important to understand the difference between the two terms:
- Web scraping focuses on extracting data from one or more websites.
- Web crawling is about discovering URLs and page links.
In most large-scale data extraction projects, both processes are used together. For example, you might first crawl a target domain to identify relevant links and then scrape those links to extract the desired information.
For a deeper dive, check out our web crawling vs. web scraping comparison guide.
Build Your First Java Web Crawler
The best way to start your web crawling journey is through hands-on experience with a real-world example. For this tutorial, we'll crawl the ScrapingCourse e-commerce test site.
This website contains numerous pages, including paginated product listings, cart, and checkout pages. After crawling the site for all usable URLs, we'll visit some of those links and extract valuable product information.
If you're new to data extraction using Java, or you'd like a quick refresher on the topic, check out our guide on web scraping with Java.
In the meantime, let's progressively build a Java web crawler capable of discovering every URL on a target domain and retrieving the necessary data.
Step 1: Prerequisites for Building a Java Web Crawler
To follow along in this tutorial, ensure you meet the following requirements.
- Java Development Kit (JDK): Install the latest JDK. You can download it from Oracle or other providers, such as OpenJDK.
- Your preferred IDE: In this tutorial, we'll use Visual Studio Code, but you can use any IDE you choose.
- JSoup: You'll require this library to fetch and parse HTML.
There are different ways to add the JSoup library to your project. However, the most common approach involves using dependency managers like Maven and Gradle.
To add JSoup to a Maven project, include the following snippet in the <dependencies> section of your pom.xml file.
<dependency>
<!-- jsoup HTML parser library @ https://jsoup.org/ -->
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.18.3</version>
</dependency>
Alternatively, if you're using Gradle, add the following line to your build.gradle file.
// jsoup HTML parser library @ https://jsoup.org/
implementation 'org.jsoup:jsoup:1.18.3'
For more information on the JSoup library, check out our JSoup web scraping blog.
That's it. You're all set up.
Now, navigate to the directory where you'd like to store your code, create a Crawler Java class, and get ready to write some code.
Step 2: Follow all the Links on a Website
Let's start with the most basic functionality of a Java web crawler: making a GET request to the target website (also known as the seed URL) and retrieving its HTML.
Here's a function to retrieve the HTML.
It's good practice to catch and handle exceptions so that a single failed request doesn't crash your crawler. Thus, we've added a try-catch block to handle possible errors when fetching the HTML.
package com.example;
// import the required modules
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
public class Crawler {
public static void main(String[] args) {
// URL of the target webpage
String seedUrl = "https://www.scrapingcourse.com/ecommerce/";
// retrieve HTML
Document doc = retrieveHTML(seedUrl);
// check if the document was successfully retrieved
if (doc != null) {
System.out.println("HTML successfully retrieved!");
}
}
// define function to retrieve HTML
private static Document retrieveHTML(String url) {
try {
// download HTML document using JSoup's connect class
Document doc = Jsoup.connect(url).get();
// log the HTML content
System.out.println(doc.html());
// return the HTML document
return doc;
} catch (IOException e) {
// handle exceptions
System.err.println("Unable to fetch HTML of: " + url);
}
return null;
}
}
This code retrieves and logs the HTML content, as seen in the result below.
<html lang="en">
<head>
<!-- ... -->
<title>
Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com
</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<div class="beta site-title">
<a href="https://www.scrapingcourse.com/ecommerce/" rel="home">
Ecommerce Test Site to Learn Web Scraping
</a>
</div>
<!-- other content omitted for brevity -->
</body>
</html>
HTML successfully retrieved!
The next step is to modify the crawler to find links on the target page, visit those links, and extract links from them.
You'll need to track visited URLs to avoid crawling the same URL multiple times, which can result in an infinite loop. Then, recursively crawl found links to discover more links.
Here's a step-by-step guide to do this.
In Java, the Set data structure automatically handles duplicates, ensuring you store each URL only once. So, initialize a visitedUrls Set and prepare to create a recursive crawl() function.
A recursive function, in this context, is one that calls itself to continuously crawl pages until it reaches a point or a condition where the recursion stops.
To control how far your crawler goes from the starting page, you must set a depth limit. This prevents indefinite crawling, which could waste resources or trigger anti-bot restrictions.
Here's a practical example:
- Depth = 1: Crawl only the starting page and collect its links.
- Depth = 2: Also follow those links and crawl the linked pages.
Without setting this limit, you could unintentionally be crawling an endless list of pages.
In that case, your crawl() function will take two arguments: the URL to crawl and the current depth. Remember to initialize a max depth.
Within this function, check whether the URL is valid (starts with http:// or https://). Since you'll follow all the links on the page, you'll most likely encounter links with invalid protocols, and you'll want to skip them to avoid errors.
After that, check whether the crawler has already visited this URL. If not, call the retrieveHTML() function to get the HTML of the current URL. Then, add the crawling logic to follow all the links on the page. You'll see how to create this logic shortly.
// import the required modules
// ...
import java.util.HashSet;
import java.util.Set;
public class Crawler {
// initialize a set to store visited URLs
private static Set<String> visitedUrls = new HashSet<>();
// initialize max depth
private static int maxDepth = 2;
// ...
// define the recursive crawl function
private static void crawl(String url, int depth) {
// check if the URL is valid (starts with http or https)
if (!url.startsWith("http://") && !url.startsWith("https://")) {
System.err.println("Skipping invalid URL: " + url);
return;
}
// check if you've reached maximum depth or URL has already been visited
if (depth > maxDepth || visitedUrls.contains(url)) {
return;
}
// log current URL
System.out.println("Crawling: " + url);
// add URL to visitedUrls set
visitedUrls.add(url);
// call the retrieveHTML function to fetch HTML of the current page
Document doc = retrieveHTML(url);
if (doc != null){
// crawling logic to recursively follow all links on the page goes here.
}
}
}
It's time to create the crawling logic. You want to find all the links on the current page and follow them to extract more links until there are no more links or the max depth is reached. This is where the recursion comes in.
Website links are often defined in anchor tags. Therefore, you can find all the links on a web page by selecting every anchor tag with an href attribute.
JSoup provides a doc.select() method that lets you select all HTML elements matching a given CSS selector. Using this method, find all the links and follow them by recursively calling the crawl function with an incremented depth.
Remember to get the absolute URLs. This ensures that even if the href attributes are relative (they don't contain the base URL, for example, /product-url), they'll be resolved against the base URL to form complete links.
//import the required libraries
// ...
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Crawler {
// ...
private static void crawl(String url, int depth) {
// ...
if (doc != null){
// find all links on the page
Elements links = doc.select("a[href]");
for (Element link : links) {
// get absolute URL
String nextUrl = link.absUrl("href");
// check if nextUrl exists and link hasn't been visited
if (!nextUrl.isEmpty() && !visitedUrls.contains(nextUrl)) {
// recursively call the crawl function
crawl(nextUrl, depth + 1);
}
}
}
}
}
That's it.
Now, combine all the steps and call the crawl function to begin the crawling process. You'll get the following complete code:
package com.example;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
public class Crawler {
// initialize a set to store visited URLs
private static Set<String> visitedUrls = new HashSet<>();
// initialize max depth
private static int maxDepth = 2;
public static void main(String[] args) {
// URL of the target webpage
String seedUrl = "https://www.scrapingcourse.com/ecommerce/";
// start crawling from the seed URL
crawl(seedUrl, 1);
}
// define function to retrieve HTML
private static Document retrieveHTML(String url) {
try {
// download HTML document using JSoup's connect class
Document doc = Jsoup.connect(url).get();
// log the page title
System.out.println("Page Title: " + doc.title());
// return the HTML document
return doc;
} catch (IOException e) {
// handle exceptions
System.err.println("Unable to fetch HTML of: " + url);
}
return null;
}
// define the recursive crawl function
private static void crawl(String url, int depth) {
// check if the URL is valid (starts with http or https)
if (!url.startsWith("http://") && !url.startsWith("https://")) {
System.err.println("Skipping invalid URL: " + url);
return;
}
// check if you've reached maximum depth or URL has been visited
if (depth > maxDepth || visitedUrls.contains(url)) {
return;
}
// log current URL
System.out.println("Crawling: " + url);
// add URL to visitedUrls set
visitedUrls.add(url);
// call the retrieveHTML function to fetch HTML of the current page
Document doc = retrieveHTML(url);
if (doc != null){
// find all links on the page
Elements links = doc.select("a[href]");
for (Element link : links) {
// get absolute URL
String nextUrl = link.absUrl("href");
// check if nextUrl exists and link hasn't been visited
if (!nextUrl.isEmpty() && !visitedUrls.contains(nextUrl)) {
// recursively call the crawl function
crawl(nextUrl, depth + 1);
}
}
}
}
}
This code will find and follow all the links on the target page. Here's what your console would look like:
Crawling: https://www.scrapingcourse.com/ecommerce/
Page Title: Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com
Crawling: https://www.scrapingcourse.com/ecommerce/#site-navigation
Page Title: Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com
// ... omitted for brevity ... //
Crawling: https://www.scrapingcourse.com/ecommerce/page/2/#content
Page Title: Ecommerce Test Site to Learn Web Scraping - Page 2 - ScrapingCourse.com
// ... omitted for brevity ... //
However, to save time and boost overall crawler performance, most data extraction projects focus on specific data, such as pagination links.
Here's how you can modify your code to crawl specific links. In this example, we'll crawl the pagination links on the seed URL.
Start by inspecting the page. Right-click on a pagination element and select Inspect.
This will open the DevTools window.
Here, you'll find that there are 12 product pages and that all pagination elements share the same page-numbers class.
Using this information, modify your crawling logic to target only pagination links.
package com.example;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
public class Crawler {
// initialize a set to store visited URLs
private static Set<String> visitedUrls = new HashSet<>();
// initialize max depth
private static int maxDepth = 2;
public static void main(String[] args) {
// URL of the target webpage
String seedUrl = "https://www.scrapingcourse.com/ecommerce/";
// start crawling from the seed URL
crawl(seedUrl, 1);
}
// define function to retrieve HTML
private static Document retrieveHTML(String url) {
try {
// download HTML document using JSoup's connect class
Document doc = Jsoup.connect(url).get();
// log the page title
System.out.println("Page Title: " + doc.title());
// return the HTML document
return doc;
} catch (IOException e) {
// handle exceptions
System.err.println("Unable to fetch HTML of: " + url);
}
return null;
}
// define the recursive crawl function
private static void crawl(String url, int depth) {
// check if the URL is valid (starts with http or https)
if (!url.startsWith("http://") && !url.startsWith("https://")) {
System.err.println("Skipping invalid URL: " + url);
return;
}
// check if you've reached maximum depth or URL has been visited
if (depth > maxDepth || visitedUrls.contains(url)) {
return;
}
// log current URL
System.out.println("Crawling: " + url);
// add URL to visitedUrls set
visitedUrls.add(url);
// call the retrieveHTML function to fetch HTML of the current page
Document doc = retrieveHTML(url);
if (doc != null){
// select all pagination links
Elements paginationLinks = doc.select("a.page-numbers");
for (Element link : paginationLinks) {
// get absolute URL
String nextUrl = link.absUrl("href");
// check if nextUrl exists and link hasn't been visited
if (!nextUrl.isEmpty() && !visitedUrls.contains(nextUrl)) {
// recursively call the crawl function
crawl(nextUrl, depth + 1);
}
}
}
}
}
This will crawl only the pagination links and output the following result.
//... truncated for brevity ... //
Crawling: https://www.scrapingcourse.com/ecommerce/page/10/
Page Title: Ecommerce Test Site to Learn Web Scraping - Page 10 - ScrapingCourse.com
Crawling: https://www.scrapingcourse.com/ecommerce/page/11/
Page Title: Ecommerce Test Site to Learn Web Scraping - Page 11 - ScrapingCourse.com
Crawling: https://www.scrapingcourse.com/ecommerce/page/12/
Page Title: Ecommerce Test Site to Learn Web Scraping - Page 12 - ScrapingCourse.com
Congratulations! You've created your first Java web crawler.
Step 3: Extract Data From Your Crawler
In this section, you'll learn how to further enhance your Java web crawler to extract specific product details from the crawled pagination links. But before we begin, let's set up the basics.
Once the crawler navigates to each pagination link, we'll extract the following data points:
- Product name.
- Product price.
- Product image.
Let's begin!
Inspect the page to identify the CSS selectors for each data point.
You'll notice that each product is a list item with the class product, and the data points are as follows:
- Product name: <h2> tag with the class product-name.
- Product price: span element with the class product-price.
- Product image: <img> tag with the class product-image.
Using this information, create a scraping logic to select all product items on the current page, loop through them, and extract the product name, price, and image URL.
We recommend abstracting this scraping logic into a function so you can easily apply it in the crawl() function. This will keep your code clean and modular.
public class Crawler {
// ...
// function to extract product details from the current page
private static void extractProductData(Document document) {
// select all product items on the current page
Elements products = document.select("li.product");
// loop through each item
for (Element product : products) {
// extract product name, price, and image URL
String productName = product.select(".product-name").text();
String price = product.select(".product-price").text();
String imageUrl = product.select(".product-image").attr("src");
// log the result
System.out.println("product-name: " + productName);
System.out.println("product-price: " + price);
System.out.println("product-image: " + imageUrl);
}
}
}
That's it.
Now combine all the steps above and call the extractProductData() function in the crawl() method just after the crawler navigates to the current page.
You'll have the following complete code:
package com.example;
// import the required modules
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
public class Crawler {
// initialize a set to store visited URLs
private static Set<String> visitedUrls = new HashSet<>();
// initialize an empty list to store scraped product data
private static List<String[]> productData = new ArrayList<>();
// initialize max depth
private static int maxDepth = 2;
public static void main(String[] args) {
// URL of the target webpage
String seedUrl = "https://www.scrapingcourse.com/ecommerce/";
// start crawling from the seed URL
crawl(seedUrl, 1);
}
// define function to retrieve HTML
private static Document retrieveHTML(String url) {
try {
// download HTML document using JSoup's connect class
Document doc = Jsoup.connect(url).get();
// log the page title
System.out.println("Page Title: " + doc.title());
// return the HTML document
return doc;
} catch (IOException e) {
// handle exceptions
System.err.println("Unable to fetch HTML of: " + url);
}
return null;
}
// define the recursive crawl function
private static void crawl(String url, int depth) {
// check if the URL is valid (starts with http or https)
if (!url.startsWith("http://") && !url.startsWith("https://")) {
System.err.println("Skipping invalid URL: " + url);
return;
}
// check if you've reached maximum depth or URL has been visited
if (depth > maxDepth || visitedUrls.contains(url)) {
return;
}
// log current URL
System.out.println("Crawling: " + url);
// add URL to visitedUrls set
visitedUrls.add(url);
// call the retrieveHTML function to fetch HTML of the current page
Document doc = retrieveHTML(url);
if (doc != null){
// extract product data
extractProductData(doc);
// select all pagination links
Elements paginationLinks = doc.select("a.page-numbers");
for (Element link : paginationLinks) {
// get absolute URL
String nextUrl = link.absUrl("href");
// check if nextUrl exists and link hasn't been visited
if (!nextUrl.isEmpty() && !visitedUrls.contains(nextUrl)) {
// recursively call the crawl function
crawl(nextUrl, depth + 1);
}
}
}
}
// function to extract product details from the current page
private static void extractProductData(Document document) {
// select all product items on the current page
Elements products = document.select("li.product");
// loop through each item
for (Element product : products) {
// extract product name, price, and image URL
String productName = product.select(".product-name").text();
String price = product.select(".product-price").text();
String imageUrl = product.select(".product-image").attr("src");
// log the result
System.out.println("product-name: " + productName);
System.out.println("product-price: " + price);
System.out.println("product-image: " + imageUrl);
}
}
}
This code crawls all the pagination links and extracts the product details on each page. Here's what your terminal would look like.
Crawling: https://www.scrapingcourse.com/ecommerce/
Page Title: Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com
product-name: Abominable Hoodie
product-price: $69.00
product-image: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main.jpg
// ... other content omitted brevity ... //
Step 4: Export the Scraped Data to CSV
Storing data in a structured format is often essential for easy analysis. You can do this in Java using the built-in FileWriter class.
Like in the previous section, let's abstract this functionality into a reusable method. To achieve that, start by initializing an empty list to store the scraped data.
public class Crawler {
// ...
// initialize an empty list to store scraped product data
private static List<String[]> productData = new ArrayList<>();
// ...
}
Then, modify the extractProductData() function to add the scraped data for each page to the list.
public class Crawler {
// ...
// function to extract product details from the current page
private static void extractProductData(Document document) {
// ...
// store the product details in the data list
productData.add(new String[]{productName, price, imageUrl});
}
}
After that, create a function to export the scraped data to CSV. Within this function, create a FileWriter instance. Then, write the CSV headers and populate the rows with the scraped data.
public class Crawler {
// ...
// method to save data to a CSV file
private static void exportDataToCsv(String filePath) {
try (FileWriter writer = new FileWriter(filePath)) {
// write headers
writer.append("Product Name,Price,Image URL\n");
// write data rows
for (String[] row : productData) {
writer.append(String.join(",", row));
writer.append("\n");
}
System.out.println("Data saved to " + filePath);
} catch (IOException e) {
e.printStackTrace();
}
}
}
That's it!
Now, combine the steps above. Also, call the exportDataToCsv() function in main().
You'll have the following complete code.
package com.example;
// import the required modules
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
public class Crawler {
// initialize a set to store visited URLs
private static Set<String> visitedUrls = new HashSet<>();
// initialize an empty list to store scraped product data
private static List<String[]> productData = new ArrayList<>();
// initialize max depth
private static int maxDepth = 2;
public static void main(String[] args) {
// URL of the target webpage
String seedUrl = "https://www.scrapingcourse.com/ecommerce/";
// start crawling from the seed URL
crawl(seedUrl, 1);
// export scraped data to CSV
exportDataToCsv("product_data.csv");
}
// define function to retrieve HTML
private static Document retrieveHTML(String url) {
try {
// download HTML document using JSoup's connect class
Document doc = Jsoup.connect(url).get();
// log the page title
System.out.println("Page Title: " + doc.title());
// return the HTML document
return doc;
} catch (IOException e) {
// handle exceptions
System.err.println("Unable to fetch HTML of: " + url);
}
return null;
}
// define the recursive crawl function
private static void crawl(String url, int depth) {
// check if the URL is valid (starts with http or https)
if (!url.startsWith("http://") && !url.startsWith("https://")) {
System.err.println("Skipping invalid URL: " + url);
return;
}
// check if you've reached maximum depth or URL has been visited
if (depth > maxDepth || visitedUrls.contains(url)) {
return;
}
// log current URL
System.out.println("Crawling: " + url);
// add URL to visitedUrls set
visitedUrls.add(url);
// call the retrieveHTML function to fetch HTML of the current page
Document doc = retrieveHTML(url);
if (doc != null){
// extract product data
extractProductData(doc);
// select all pagination links
Elements paginationLinks = doc.select("a.page-numbers");
for (Element link : paginationLinks) {
// get absolute URL
String nextUrl = link.absUrl("href");
// check if nextUrl exists and link hasn't been visited
if (!nextUrl.isEmpty() && !visitedUrls.contains(nextUrl)) {
// recursively call the crawl function
crawl(nextUrl, depth + 1);
}
}
}
}
// function to extract product details from the current page
private static void extractProductData(Document document) {
// select all product items on the current page
Elements products = document.select("li.product");
// loop through each item
for (Element product : products) {
// extract product name, price, and image URL
String productName = product.select(".product-name").text();
String price = product.select(".product-price").text();
String imageUrl = product.select(".product-image").attr("src");
// store the product details in the data list
productData.add(new String[]{productName, price, imageUrl});
}
}
// method to save data to a CSV file
private static void exportDataToCsv(String filePath) {
try (FileWriter writer = new FileWriter(filePath)) {
// write headers
writer.append("Product Name,Price,Image URL\n");
// write data rows
for (String[] row : productData) {
writer.append(String.join(",", row));
writer.append("\n");
}
System.out.println("Data saved to " + filePath);
} catch (IOException e) {
e.printStackTrace();
}
}
}
This code creates a new CSV file and exports the scraped data. You'll find this file in your project's root directory.
Awesome! You now know how to crawl links and extract data from your crawler.
Optimize Your Web Crawler
The following are key areas to consider when optimizing your web crawler.
Avoid Duplicate Links
Duplicate links can cause your web crawler to revisit the same URL multiple times, leading to inefficiency and wasted time and resources. This often happens due to inconsistent URL formats or the presence of multiple identical links on a page.
To prevent this, ensure each link is visited only once. One effective approach is to use a HashSet to track visited URLs: in Java, a HashSet rejects duplicates automatically (its add() method returns false when the element is already present), ensuring efficient and streamlined crawling. Normalizing URLs before adding them (for example, stripping fragments and trailing slashes) further reduces accidental revisits, as shown in the sketch below.
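To make this concrete, here's a minimal, self-contained sketch. The UrlDeduplicator class and its normalizeUrl() helper are illustrative additions (they're not part of the crawler built above); they show how normalizing a URL before adding it to a HashSet keeps variants like a trailing slash or a #fragment from being treated as new pages.

import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Set;

// hypothetical helper class, shown only to illustrate deduplication with normalization
public class UrlDeduplicator {
    private final Set<String> visitedUrls = new HashSet<>();

    // normalize a URL: lowercase the scheme and host, drop the #fragment, trim a trailing slash
    static String normalizeUrl(String url) {
        try {
            URI uri = new URI(url).normalize();
            String path = uri.getPath() == null ? "" : uri.getPath();
            if (path.endsWith("/") && path.length() > 1) {
                path = path.substring(0, path.length() - 1);
            }
            URI cleaned = new URI(
                    uri.getScheme() == null ? null : uri.getScheme().toLowerCase(),
                    uri.getAuthority() == null ? null : uri.getAuthority().toLowerCase(),
                    path,
                    uri.getQuery(),
                    null); // null fragment drops any #...
            return cleaned.toString();
        } catch (URISyntaxException e) {
            return url; // fall back to the raw URL if it can't be parsed
        }
    }

    // returns true only the first time a (normalized) URL is seen
    boolean markVisited(String url) {
        return visitedUrls.add(normalizeUrl(url));
    }

    public static void main(String[] args) {
        UrlDeduplicator dedup = new UrlDeduplicator();
        System.out.println(dedup.markVisited("https://www.scrapingcourse.com/ecommerce/"));         // true
        System.out.println(dedup.markVisited("https://www.scrapingcourse.com/ecommerce"));          // false
        System.out.println(dedup.markVisited("https://www.scrapingcourse.com/ecommerce/#content")); // false
    }
}

In the crawler above, you'd apply the same normalization right before the visitedUrls.contains() check and the visitedUrls.add() call.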
Prioritize Specific Pages
Prioritizing specific pages can optimize your web crawler, as it allows you to focus on relevant pages. In our current crawler, we use CSS selectors to target pagination links, which saves time and resources.
However, that approach crawls only pagination links. If you're also interested in other links, you can maintain separate queues for pagination links and other links, then process the pagination links first.
To implement prioritization in the current crawler, create two queues for pagination links and other links and store both link categories in their respective queues. Then, define a function to process the queues (crawl pagination links first).
// import the required modules
// ...
import java.util.LinkedList;
import java.util.Queue;
public class Crawler {
// ...
// initialize a queue for pagination links
private static Queue<String> paginationQueue = new LinkedList<>();
// initialize a queue for other links
private static Queue<String> otherLinksQueue = new LinkedList<>();
public static void main(String[] args) {
// define seed URL
String seedUrl = "https://www.scrapingcourse.com/ecommerce/";
// start crawling from the seed URL
crawl(seedUrl, 1);
// process the queues
processQueues();
}
// define the recursive crawl function
private static void crawl(String url, int depth) {
// ...
// call the retrieveHTML function to fetch HTML of the current page
Document doc = retrieveHTML(url);
if (doc != null) {
// select all pagination links
Elements paginationLinks = doc.select("a.page-numbers");
for (Element link : paginationLinks) {
// get absolute URL
String nextUrl = link.absUrl("href");
// check if nextUrl exists and link hasn't been visited
if (!nextUrl.isEmpty() && !visitedUrls.contains(nextUrl)) {
// add to pagination queue
paginationQueue.add(nextUrl);
}
}
// select other links
Elements otherLinks = doc.select("a[href]");
for (Element link : otherLinks) {
// get absolute URL
String nextUrl = link.absUrl("href");
// check if nextUrl exists and link hasn't been visited
if (!nextUrl.isEmpty() && !visitedUrls.contains(nextUrl) && !paginationQueue.contains(nextUrl)) {
// add to other links queue
otherLinksQueue.add(nextUrl);
}
}
}
}
// define function to process queues
private static void processQueues() {
// Process pagination queue first
while (!paginationQueue.isEmpty()) {
String nextUrl = paginationQueue.poll();
crawl(nextUrl, maxDepth);
}
// Process other links queue
while (!otherLinksQueue.isEmpty()) {
String nextUrl = otherLinksQueue.poll();
crawl(nextUrl, maxDepth);
}
}
}
Maintain a Single Crawl Session
A session is a persistent connection to a target website, often preserved using session parameters, such as cookies, headers, and authentication. By maintaining a single session for the entire crawling process, you can significantly boost your web crawler's efficiency.
This is particularly useful when crawling websites that employ rate-limiting technologies to mitigate excess traffic.
In JSoup, you can use the Connection class to create and maintain a session with shared settings such as headers.
public class Crawler {
    // ...
    // create a single reusable session (requires: import org.jsoup.Connection;)
    private static Connection session = Jsoup.newSession()
        .timeout(5000)
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36");

    public static void main(String[] args) {
        String seedUrl = "https://www.scrapingcourse.com/ecommerce/";
        crawl(seedUrl, 1);
        // ...
    }

    // then, in retrieveHTML(), reuse the session for every request:
    // Document doc = session.newRequest().url(url).get();
}
However, optimizing your crawler is one thing, and getting access to a target website is another.
Most modern websites implement sophisticated anti-bot measures that can block your requests. You must overcome these obstacles to take advantage of your optimized crawler's efficiency and performance.
In the next section, we'll show you how to handle these anti-bot measures and crawl seamlessly.
Avoid Getting Blocked While Crawling With Java
Anti-bot solutions employ various techniques to mitigate bot traffic. One of these techniques involves tracking request behaviors and looking for distinguishable patterns between humans and bots.
To make matters worse, web crawlers are easily detected as they follow a systematic pattern, making multiple requests in a bot-like manner.
That said, you can configure your crawler to appear human to the target server. Common measures to this end include proxy rotation, request header spoofing, and reducing request frequency.
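For illustration, here's a minimal sketch of what those patches can look like with JSoup. The proxy hosts, ports, and delay are placeholder values (proxy1.example.com and friends don't exist), and this kind of manual patching alone won't defeat advanced anti-bot systems.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.List;

// illustrative fetcher with placeholder proxies; not a bypass for advanced anti-bot systems
public class PoliteFetcher {
    // placeholder proxy list: replace with real proxy hosts and ports
    private static final List<String[]> PROXIES = List.of(
            new String[]{"proxy1.example.com", "8080"},
            new String[]{"proxy2.example.com", "8080"});
    private static int proxyIndex = 0;

    static Document politeGet(String url) throws IOException, InterruptedException {
        // rotate proxies in round-robin fashion
        String[] proxy = PROXIES.get(proxyIndex);
        proxyIndex = (proxyIndex + 1) % PROXIES.size();
        // reduce request frequency with a fixed delay (tune it to the target site)
        Thread.sleep(2000);
        return Jsoup.connect(url)
                // spoof a real browser's request headers
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36")
                .header("Accept-Language", "en-US,en;q=0.9")
                // route the request through the current proxy
                .proxy(proxy[0], Integer.parseInt(proxy[1]))
                .timeout(10000)
                .get();
    }
}

You could then call politeGet() in place of Jsoup.connect(url).get() inside retrieveHTML().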
However, like most manual configurations, they can get tedious to implement, especially when scaling your web crawling or dealing with advanced anti-bot systems.
The ZenRows Scraper API offers the easiest and most reliable solution for scalable web crawling.
This tool empowers you with everything you need to crawl any website without getting blocked. Some of its features include advanced anti-bot bypass out of the box, geo-located requests, fingerprinting evasion, actual user spoofing, request header management, and more.
To use ZenRows, sign up to get your free API key.
You'll be redirected to the Request Builder page, where you can find your ZenRows API key at the top right.
Input your target URL and activate Premium Proxies and JS Rendering boost mode. Let's use the ScrapingCourse Antibot Challenge page as the target URL for this example.
Then, select the Java language and choose the API option. ZenRows works with any language and provides ready-to-use snippets for the most popular ones.
Lastly, copy the generated code on the right to your editor for testing.
Your code should look like this:
import org.apache.hc.client5.http.fluent.Request;
public class APIRequest {
public static void main(final String... args) throws Exception {
String apiUrl = "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.scrapingcourse.com%2Fantibot-challenge&js_render=true&premium_proxy=true";
String response = Request.get(apiUrl)
.execute().returnContent().asString();
System.out.println(response);
}
}
Remember to add the Apache HttpClient Fluent dependency to your pom.xml file, as shown below.
<dependency>
<groupId>org.apache.httpcomponents.client5</groupId>
<artifactId>httpclient5-fluent</artifactId>
<version>5.4.1</version>
</dependency>
This code bypasses the anti-bot solution, makes a GET request to the page, and prints its response. Here's the result:
<html lang="en">
<head>
<!-- ... -->
<title>Antibot Challenge - ScrapingCourse.com</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<h2>
You bypassed the Antibot challenge! :D
</h2>
<!-- other content omitted for brevity -->
</body>
</html>
Congratulations! You're now well-equipped to crawl any website without getting blocked.
Web Crawling Tools for Java
The right web crawling tools can significantly impact the outcome of your data extraction projects. Here are some tools to consider when creating a Java web crawler.
- ZenRows: An all-in-one web scraping API that provides everything you need to crawl without getting blocked. Its headless browser functionality also means it can handle dynamic content, making it a valuable web crawling tool.
- Selenium: Another valuable web crawling tool for Java. While it's popular for its browser automation capabilities, Selenium is also a great web crawling tool, allowing you to interact with web pages like a natural user.
- JSoup: A popular Java library for fetching and parsing HTML documents. This tool can handle malformed HTML, allowing you to parse complex real-world HTML files, which are often broken.
Java Web Crawling Best Practices and Considerations
The following recommended best practices can enhance your crawler's efficiency and performance.
Parallel Crawling and Concurrency
Crawling multiple pages sequentially can be inefficient and time-consuming as your crawler spends most of its time waiting for HTTP responses. However, you can significantly reduce your overall crawl time by using parallel crawling and Java's concurrency features.
Java's ExecutorService framework provides a way to manage concurrency. Keep in mind that when crawling in parallel, shared structures such as visitedUrls and productData must be thread-safe (for example, ConcurrentHashMap.newKeySet() for the set and a synchronized list for the data). Here's a code snippet showing how to implement parallel crawling for the crawler you built earlier.
package com.example;
// import the required modules
// ...
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class Crawler {
// ...
// initialize thread count
private static final int THREAD_COUNT = 10;
private static ExecutorService executorService = Executors.newFixedThreadPool(THREAD_COUNT);
public static void main(String[] args) {
// ...
// shutdown executor service when all tasks are complete
shutdownExecutorService();
// ...
}
// ...
// define the recursive crawl function
private static void crawl(String url, int depth) {
// ...
// submit a task to the executor service
executorService.submit(() -> {
Document doc = retrieveHTML(url);
if (doc != null) {
// extract product data
extractProductData(doc);
// ...
}
});
}
// ...
// function to shutdown the executor service
private static void shutdownExecutorService() {
executorService.shutdown();
try {
if (!executorService.awaitTermination(60, TimeUnit.SECONDS)) {
executorService.shutdownNow();
}
} catch (InterruptedException e) {
executorService.shutdownNow();
}
}
}
Crawling JavaScript Rendered Pages in Java
Your current crawler, using JSoup, cannot crawl JavaScript-rendered pages. Rather, it is designed to fetch and parse static HTML in the server's response, and JavaScript-rendered pages are not present in a website's static HTML.
To crawl dynamic content, you need headless browsers that render JavaScript and enable you to interact with page elements. For example, the Selenium WebDriver.
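For context, here's a minimal sketch that loads a page in headless Chrome with Selenium WebDriver and hands the rendered HTML to JSoup for parsing. It assumes you've added the selenium-java dependency and have a compatible Chrome installation.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class DynamicCrawler {
    public static void main(String[] args) {
        // run Chrome in headless mode
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");
        WebDriver driver = new ChromeDriver(options);
        try {
            // load the page and let the browser execute its JavaScript
            driver.get("https://www.scrapingcourse.com/ecommerce/");
            // grab the rendered HTML and parse it with JSoup as before
            Document doc = Jsoup.parse(driver.getPageSource());
            System.out.println("Page Title: " + doc.title());
        } finally {
            // always close the browser
            driver.quit();
        }
    }
}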
Distributed Web Crawling in Java
Distributed web crawling is an optimization technique that distributes the workload across multiple machines or processes. This is particularly useful when scaling your web crawling or dealing with large-scale data extraction projects.
To learn how to build a distributed crawler architecture, check out our distributed web crawling guide.
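As a rough illustration of the pattern, here's a single-process sketch in which several worker threads pull URLs from a shared frontier queue. In a real distributed crawler, the frontier queue and the visited set would live in an external store (such as Redis or a message broker) shared across machines; the worker loop, however, looks much the same.

import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class CrawlWorkerDemo {
    // in-memory stand-ins for what would be shared, external services in production
    private static final BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
    private static final Set<String> visited = ConcurrentHashMap.newKeySet();

    public static void main(String[] args) throws InterruptedException {
        frontier.put("https://www.scrapingcourse.com/ecommerce/");
        // start a few workers; on separate machines, each would connect to the shared queue instead
        for (int i = 0; i < 3; i++) {
            final int workerId = i;
            Thread worker = new Thread(() -> runWorker(workerId));
            worker.setDaemon(true);
            worker.start();
        }
        Thread.sleep(10_000); // let the demo run briefly, then exit
    }

    private static void runWorker(int id) {
        while (true) {
            try {
                String url = frontier.take(); // block until a URL is available
                if (!visited.add(url)) {
                    continue; // another worker already handled this URL
                }
                System.out.println("Worker " + id + " crawling: " + url);
                // here you'd fetch the page (e.g., with retrieveHTML from the crawler above),
                // extract its links, and put any new ones back onto the frontier queue
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}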
Conclusion
You've learned how to build a Java web crawler from scratch. Remember that while building a web crawler to navigate web pages is a great starting point, you must overcome anti-bot measures to gain access to modern websites.
You've learned that manual configurations, such as proxy rotation, are mostly insufficient, and ZenRows provides the easiest, most reliable, scalable solution. For easy-to-implement web crawling without getting blocked, use ZenRows. Sign up for free to get started today.