How to Build a Web Crawler in Java

Idowu Omisola
January 8, 2025 · 10 min read

Web crawlers are the heart of data extraction tools and various other analysis applications. If you're looking to build a web crawler in Java, you've come to the right place.

This guide will take you step by step through the process of creating a web crawler in Java. From project setup to development and optimization, you'll learn how to efficiently discover URLs and extract data at scale.

Before we begin, let's cover some background information.

What Is Web Crawling?

Web crawling is the process of navigating the web to discover specific information (usually URLs and page links) for different purposes, such as web scraping, archiving, or indexing, as with search engines like Google and Bing.

Although web crawling is often used interchangeably with web scraping, it's important to understand the difference between the two:

  • Web scraping focuses on extracting data from one or more websites.
  • Web crawling is about discovering URLs and page links.

In most large-scale data extraction projects, both processes are used together. For example, you might first crawl a target domain to identify relevant links and then scrape those links to extract the desired information.

For a deeper dive, check out our web crawling vs. web scraping comparison guide.

Build Your First Java Web Crawler

The best way to start your web crawling journey is through hands-on experience with a real-world example. For this tutorial, we'll crawl the ScrapingCourse e-commerce test site.

ScrapingCourse.com Ecommerce homepage

This website contains numerous pages, including paginated product listings, a cart, and a checkout page. After crawling the site for all usable URLs, we'll visit some of those links and extract valuable product information.

If you're new to data extraction using Java, or you'd like a quick refresher on the topic, check out our guide on web scraping with Java.

In the meantime, let's progressively build a Java web crawler capable of discovering every URL on a target domain and retrieving the necessary data.


Step 1: Prerequisites for Building a Java Web Crawler

To follow along in this tutorial, ensure you meet the following requirements.

  • Java Development Kit (JDK): Install the latest JDK. You can download it from Oracle or other providers, such as OpenJDK.
  • Your preferred IDE: In this tutorial, we'll use Visual Studio Code, but you can use any IDE you choose.
  • JSoup: You'll require this library to fetch and parse HTML.

There are different ways to add the JSoup library to your project. However, the most common approach involves using dependency managers like Maven and Gradle.

To add JSoup to a Maven project, include the following XML snippet in your pom.xml <dependencies> section.

pom.xml
<dependency>
  <!-- jsoup HTML parser library @ https://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.18.3</version>
</dependency>

Alternatively, if you're using Gradle, add the following code line in your build.gradle file.

build.gradle
// jsoup HTML parser library @ https://jsoup.org/
implementation 'org.jsoup:jsoup:1.18.3'

For more information on the JSoup library, check out our JSoup web scraping blog.

That's it. You're all set up.

Now, navigate to the directory where you'd like to store your code, create a Crawler Java class, and get ready to write some code.

Step 2: Retrieve the HTML and Follow Links

Let's start with the most basic functionality of a Java web crawler: making a GET request to the target website (also known as the seed URL) and retrieving its HTML.

Here's a function to retrieve the HTML.

It's good practice to catch and handle exceptions so that a single failed request doesn't crash your crawler. That's why we've wrapped the fetch in a try-catch block to handle possible errors when fetching the HTML.

Crawler.java
package com.example;

// import the required modules
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class Crawler {
    public static void main(String[] args) {
        // URL of the target webpage
        String seedUrl = "https://www.scrapingcourse.com/ecommerce/";
       
        // retrieve HTML
        Document doc = retrieveHTML(seedUrl);

        // check if the document was successfully retrieved
        if (doc != null) {
            System.out.println("HTML successfully retrieved!");
        }
    }
    // define function to retrieve HTML
    private static Document retrieveHTML(String url) {
        try {
            // download the HTML document using JSoup's connect() method
            Document doc = Jsoup.connect(url).get();

            // log the HTML content
            System.out.println(doc.html());

            // return the HTML document
            return doc;
        } catch (IOException e) {
            // handle exceptions
            System.err.println("Unable to fetch HTML of: " + url);
        }
        return null;
    }
}

This code retrieves and logs the HTML content, as seen in the result below.

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>
        Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com
    </title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <div class="beta site-title">
        <a href="https://www.scrapingcourse.com/ecommerce/" rel="home">
            Ecommerce Test Site to Learn Web Scraping
        </a>
    </div>
    <!-- other content omitted for brevity -->
</body>
</html>
HTML successfully retrieved!

The next step is to modify the crawler to find links on the target page, visit those links, and extract links from them.

You'll need to track visited URLs to avoid crawling the same URL multiple times, which can result in an infinite loop. Then, recursively crawl found links to discover more links.

Here's a step-by-step guide to do this.

In Java, the Set data structure automatically handles duplicates, ensuring you store URLs only once. So, initialize a visitedUrls Set and prepare to create a recursive crawl() function.

A recursive function, in this context, is one that calls itself to continuously crawl pages until it reaches a point or a condition where the recursion stops.

To control how far your crawler goes from the starting page, you must set a depth limit. This prevents indefinite crawling, which wastes resources and can trigger anti-bot restrictions.

Here's a practical example:

  • Depth = 1: Crawl only the starting page (find its links but don't follow them).
  • Depth = 2: Crawl the starting page and the pages it links to.

Without setting this limit, you could unintentionally be crawling an endless list of pages.

In that case, your crawl() function will take two arguments: the URL to crawl and the current depth. Remember to initialize a max depth as well.

Within this function, check if the URL is valid (starts with http:// or https://). Since you'll follow all the links on the page, you'll most likely encounter links with invalid protocols. You'll want to skip them to avoid errors.

After that, check whether the crawler has already visited this URL. If not, call the retrieveHTML() function to get the HTML of the current URL. Then, add crawling logic to follow all the links on the page. You'll see how to create this logic shortly.

Crawler.java
// import the required modules

// ...
import java.util.HashSet;
import java.util.Set;

public class Crawler {
    // initialize a set to store visited URLs
    private static Set<String> visitedUrls = new HashSet<>();
    
    // initialize max depth
    private static int maxDepth = 2;

    // ...
    
    // define the recursive crawl function
    private static void crawl(String url, int depth) {
        // check if the URL is valid (starts with http or https)
        if (!url.startsWith("http://") && !url.startsWith("https://")) {
            System.err.println("Skipping invalid URL: " + url);
            return;
        }

        // check if you've reached maximum depth or URL has already been visited
        if (depth > maxDepth || visitedUrls.contains(url)) {
            return;
        }
       
        // log current URL
        System.out.println("Crawling: " + url);
       
        // add URL to visitedUrls set
        visitedUrls.add(url);

        // call the retrieveHTML function to fetch HTML of the current page
        Document doc = retrieveHTML(url);
        if (doc != null){
            // crawling logic to recursively follow all links on the page goes here.
        }
    }
}

It's time to create the crawling logic. You want to find all links on the current page and follow them to extract more links until there are no more links or max depth is reached. This is where the recursion comes in.

Website links are often defined in anchor tags. Therefore, you can find all the links on a web page by selecting the href attribute of all anchor tags on the page.

JSoup provides a doc.select() method that allows you to select all HTML elements with a specific CSS attribute or selector. Using this method, find all links and follow them by recursively calling the crawl function and incrementing the depth.

Remember to get the absolute URLs. This ensures that even if the href attributes are relative (they do not contain the base URL, for example, /product-url), they'll be automatically concatenated to form a complete URL.

Crawler.java
//import the required libraries

// ...
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Crawler {
    // ...   

    private static void crawl(String url, int depth) {
        // ...
        
        if (doc != null){
            // find all links on the page
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                // get absolute URL
                String nextUrl = link.absUrl("href");
                // check if nextUrl exists and link hasn't been visited 
                if (!nextUrl.isEmpty() && !visitedUrls.contains(nextUrl)) {
                    // recursively call the crawl function
                    crawl(nextUrl, depth + 1);
                }
            }

        } 
    }
}

That's it.

Now, combine all the steps and call the crawl function to begin the crawling process. You'll get the following complete code:

Crawler.java
package com.example;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class Crawler {
    // initialize a set to store visited URLs
    private static Set<String> visitedUrls = new HashSet<>();

    // initialize max depth
    private static int maxDepth = 2;

    public static void main(String[] args) {
        // URL of the target webpage
        String seedUrl = "https://www.scrapingcourse.com/ecommerce/";
       
        // start crawling from the seed URL
        crawl(seedUrl, 1);
    }
    
    // define function to retrieve HTML
    private static Document retrieveHTML(String url) {
        try {
            // download the HTML document using JSoup's connect() method
            Document doc = Jsoup.connect(url).get();

            // log the page title
            System.out.println("Page Title: " + doc.title());

            // return the HTML document
            return doc;
        } catch (IOException e) {
            // handle exceptions
            System.err.println("Unable to fetch HTML of: " + url);
        }
        return null;
    }
    
    // define the recursive crawl function
    private static void crawl(String url, int depth) {
        // check if the URL is valid (starts with http or https)
        if (!url.startsWith("http://") && !url.startsWith("https://")) {
            System.err.println("Skipping invalid URL: " + url);
            return;
        }

        // check if you've reached maximum depth or URL has been visited
        if (depth > maxDepth || visitedUrls.contains(url)) {
            return;
        }
       
        // log current URL
        System.out.println("Crawling: " + url);
       
        // add URL to visitedUrls set
        visitedUrls.add(url);

        // call the retrieveHTML function to fetch HTML of the current page
        Document doc = retrieveHTML(url);
        if (doc != null){
            // find all links on the page
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                // get absolute URL
                String nextUrl = link.absUrl("href");
                // check if nextUrl exists and link hasn't been visited 
                if (!nextUrl.isEmpty() && !visitedUrls.contains(nextUrl)) {
                    // recursively call the crawl function
                    crawl(nextUrl, depth + 1);
                }
            }
        }
    }
}

This code will find and follow all the links on the target page. Here's what your console would look like:

Output
Crawling: https://www.scrapingcourse.com/ecommerce/
Page Title: Ecommerce Test Site to Learn Web Scraping – ScrapingCourse.com
Crawling: https://www.scrapingcourse.com/ecommerce/#site-navigation
Page Title: Ecommerce Test Site to Learn Web Scraping – ScrapingCourse.com
// ... omitted for brevity ... //
Crawling: https://www.scrapingcourse.com/ecommerce/page/2/#content
Page Title: Ecommerce Test Site to Learn Web Scraping – Page 2 – ScrapingCourse.com

// ... omitted for brevity ... //

However, to save time and boost overall crawler performance, most data extraction projects focus on specific links, such as pagination links.

Here's how you can modify your code to crawl specific links. In this example, we'll crawl the pagination links on the seed URL.

Start by inspecting the page. Right-click on a pagination element and select Inspect.

scrapingcourse ecommerce homepage inspect

This will open the DevTools window.

scrapingcourse ecommerce homepage devtools

Here, you'll find that there are 12 product pages, and all pagination elements share the same page-numbers class.

Using this information, modify your crawling logic to target only pagination links.

Crawler.java
package com.example;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class Crawler {
    // initialize a set to store visited URLs
    private static Set<String> visitedUrls = new HashSet<>();

    // initialize max depth
    private static int maxDepth = 2;

    public static void main(String[] args) {
        // URL of the target webpage
        String seedUrl = "https://www.scrapingcourse.com/ecommerce/";
       
        // start crawling from the seed URL
        crawl(seedUrl, 1);
    }
    
    // define function to retrieve HTML
    private static Document retrieveHTML(String url) {
        try {
            // download the HTML document using JSoup's connect() method
            Document doc = Jsoup.connect(url).get();

            // log the page title
            System.out.println("Page Title: " + doc.title());

            // return the HTML document
            return doc;
        } catch (IOException e) {
            // handle exceptions
            System.err.println("Unable to fetch HTML of: " + url);
        }
        return null;
    }
    
    // define the recursive crawl function
    private static void crawl(String url, int depth) {
        // check if the URL is valid (starts with http or https)
        if (!url.startsWith("http://") && !url.startsWith("https://")) {
            System.err.println("Skipping invalid URL: " + url);
            return;
        }

        // check if you've reached maximum depth or URL has been visited
        if (depth > maxDepth || visitedUrls.contains(url)) {
            return;
        }
       
        // log current URL
        System.out.println("Crawling: " + url);
       
        // add URL to visitedUrls set
        visitedUrls.add(url);

        // call the retrieveHTML function to fetch HTML of the current page
        Document doc = retrieveHTML(url);
        if (doc != null){
            // select all pagination links
            Elements paginationLinks = doc.select("a.page-numbers");
            for (Element link : paginationLinks) {
                // get absolute URL
                String nextUrl = link.absUrl("href");
                // check if nextUrl exists and link hasn't been visited
                if (!nextUrl.isEmpty() && !visitedUrls.contains(nextUrl)) {
                    // recursively call the crawl function
                    crawl(nextUrl, depth + 1);
                }
            }
        } 
    }
}

This will crawl only the pagination links and output the following result.

Output
//... truncated for brevity ... //
Crawling: https://www.scrapingcourse.com/ecommerce/page/10/
Page Title: Ecommerce Test Site to Learn Web Scraping – Page 10 – ScrapingCourse.com
Crawling: https://www.scrapingcourse.com/ecommerce/page/11/
Page Title: Ecommerce Test Site to Learn Web Scraping – Page 11 – ScrapingCourse.com
Crawling: https://www.scrapingcourse.com/ecommerce/page/12/
Page Title: Ecommerce Test Site to Learn Web Scraping – Page 12 – ScrapingCourse.com

Congratulations! You've created your first Java web crawler.

Step 3: Extract Data From Your Crawler

In this section, you'll learn how to further enhance your Java web crawler to extract specific product details from the crawled pagination links. But before we begin, let's set up the basics.

Once the crawler navigates to each pagination link, we'll extract the following data points:

  • Product name.
  • Product price.
  • Product image.

Let's begin!

Inspect the page to identify the CSS selectors for each data point.

scrapingcourse ecommerce homepage inspect first product li

You'll notice that each product is a list item with class product, and the data points are as follows:

  • Product name: an <h2> tag with the class product-name.
  • Product price: a <span> tag with the class product-price.
  • Product image: an <img> tag with the class product-image.

Using this information, create a scraping logic to select all product items on the current page, loop through them, and extract the product name, price, and image URL.

We recommend abstracting this scraping logic into a function so you can easily apply it to the crawl() function. This will keep your code clean and modular.

Crawler.java
public class Crawler {
    // ...

    // function to extract product details from the current page
    private static void extractProductData(Document document) {
        // select all product items on the current page
        Elements products = document.select("li.product");

        // loop through each item
        for (Element product : products) {
            // extract product name, price, and image URL
            String productName = product.select(".product-name").text();
            String price = product.select(".product-price").text();
            String imageUrl = product.select(".product-image").attr("src");

            // log the result
            System.out.println("product-name: " + productName);
            System.out.println("product-price: " + price);
            System.out.println("product-image: " + imageUrl);

        }
    }
}

That's it.

Now combine all the steps above and call the extractProductData() function in the crawl() method just after the crawler navigates to the current page.

You'll have the following complete code:

Crawler.java
package com.example;

// import the required modules
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Crawler {
    // initialize a set to store visited URLs
    private static Set<String> visitedUrls = new HashSet<>();

    // initialize an empty list to store scraped product data
    private static List<String[]> productData = new ArrayList<>();

    // initialize max depth
    private static int maxDepth = 2;

    public static void main(String[] args) {
        // URL of the target webpage
        String seedUrl = "https://www.scrapingcourse.com/ecommerce/";
       
        // start crawling from the seed URL
        crawl(seedUrl, 1);
    }
    
    // define function to retrieve HTML
    private static Document retrieveHTML(String url) {
        try {
            // download the HTML document using JSoup's connect() method
            Document doc = Jsoup.connect(url).get();

            // log the page title
            System.out.println("Page Title: " + doc.title());

            // return the HTML document
            return doc;
        } catch (IOException e) {
            // handle exceptions
            System.err.println("Unable to fetch HTML of: " + url);
        }
        return null;
    }
    
    // define the recursive crawl function
    private static void crawl(String url, int depth) {
        // check if the URL is valid (starts with http or https)
        if (!url.startsWith("http://") && !url.startsWith("https://")) {
            System.err.println("Skipping invalid URL: " + url);
            return;
        }

        // check if you've reached maximum depth or URL has been visited
        if (depth > maxDepth || visitedUrls.contains(url)) {
            return;
        }
       
        // log current URL
        System.out.println("Crawling: " + url);
       
        // add URL to visitedUrls set
        visitedUrls.add(url);

        // call the retrieveHTML function to fetch HTML of the current page
        Document doc = retrieveHTML(url);
        if (doc != null){
            // extract product data
            extractProductData(doc);
            
            // select all pagination links
            Elements paginationLinks = doc.select("a.page-numbers");
            for (Element link : paginationLinks) {
                // get absolute URL
                String nextUrl = link.absUrl("href");
                // check if nextUrl exists and link hasn't been visited
                if (!nextUrl.isEmpty() && !visitedUrls.contains(nextUrl)) {
                    // recursively call the crawl function
                    crawl(nextUrl, depth + 1);
                }
            }
        } 
    }

    // function to extract product details from the current page
    private static void extractProductData(Document document) {
        // select all product items on the current page
        Elements products = document.select("li.product");

        // loop through each item
        for (Element product : products) {
            // extract product name, price, and image URL
            String productName = product.select(".product-name").text();
            String price = product.select(".product-price").text();
            String imageUrl = product.select(".product-image").attr("src");

            // log the result
            System.out.println("product-name: " + productName);
            System.out.println("product-price: " + price);
            System.out.println("product-image: " + imageUrl);
        }
    }
}

This code crawls all the pagination links and extracts the product details on each page. Here's what your terminal would look like.

Output
Crawling: https://www.scrapingcourse.com/ecommerce/
Page Title: Ecommerce Test Site to Learn Web Scraping – ScrapingCourse.com
product-name: Abominable Hoodie
product-price: $69.00
product-image: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main.jpg

// ... other content omitted for brevity ... //

Step 4: Export the Scraped Data to CSV

Storing data in a structured format is often essential for easy analysis. You can do this in Java using the built-in FileWriter class.

As in the previous section, let's abstract this functionality into a reusable method. To achieve that, start by initializing an empty list to store the scraped data.

Crawler.java
public class Crawler {
    // ...

    // initialize an empty list to store scraped product data
    private static List<String[]> productData = new ArrayList<>();

    // ...
}

Then, modify the extractProductData() function to add scraped data for each page to the empty list.

Crawler.java
public class Crawler {
    // ...

    // function to extract product details from the current page
    private static void extractProductData(Document document) {
        // ... (inside the loop over product items)

        // store the product details in the data list
        productData.add(new String[]{productName, price, imageUrl});
    }
}

After that, create a function to export scraped data to CSV. Within this function, initialize a FileWriter class. Then, write the CSV headers and populate the rows with the scraped data.

Crawler.java
public class Crawler {
    // ...

    // method to save data to a CSV file
    private static void exportDataToCsv(String filePath) {
        try (FileWriter writer = new FileWriter(filePath)) {
            // write headers
            writer.append("Product Name,Price,Image URL\n");
           
            // write data rows
            for (String[] row : productData) {
                writer.append(String.join(",", row));
                writer.append("\n");
            }
            System.out.println("Data saved to " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
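
One caveat: String.join(",", row) doesn't escape commas or quotes that might appear inside a field value (a product name, for example). If your data can contain them, a small quoting helper like the hypothetical sketch below keeps the CSV well-formed.

Example
    // hypothetical helper: wrap a field in quotes and double any embedded quotes
    private static String escapeCsvField(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // inside exportDataToCsv(), build each row from escaped fields:
    // for (String[] row : productData) {
    //     writer.append(escapeCsvField(row[0])).append(",")
    //           .append(escapeCsvField(row[1])).append(",")
    //           .append(escapeCsvField(row[2])).append("\n");
    // }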

That's it!

Now, combine the steps above and call the exportDataToCsv() function in main().

You'll have the following complete code.

Crawler.java
package com.example;

// import the required modules
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Crawler {
    // initialize a set to store visited URLs
    private static Set<String> visitedUrls = new HashSet<>();

    // initialize an empty list to store scraped product data
    private static List<String[]> productData = new ArrayList<>();

    // initialize max depth
    private static int maxDepth = 2;

    public static void main(String[] args) {
        // URL of the target webpage
        String seedUrl = "https://www.scrapingcourse.com/ecommerce/";
       
        // start crawling from the seed URL
        crawl(seedUrl, 1);
        
        // export scraped data to CSV
        exportDataToCsv("product_data.csv");

    }
    
    // define function to retrieve HTML
    private static Document retrieveHTML(String url) {
        try {
            // download the HTML document using JSoup's connect() method
            Document doc = Jsoup.connect(url).get();

            // log the page title
            System.out.println("Page Title: " + doc.title());

            // return the HTML document
            return doc;
        } catch (IOException e) {
            // handle exceptions
            System.err.println("Unable to fetch HTML of: " + url);
        }
        return null;
    }
    
    // define the recursive crawl function
    private static void crawl(String url, int depth) {
        // check if the URL is valid (starts with http or https)
        if (!url.startsWith("http://") && !url.startsWith("https://")) {
            System.err.println("Skipping invalid URL: " + url);
            return;
        }

        // check if you've reached maximum depth or URL has been visited
        if (depth > maxDepth || visitedUrls.contains(url)) {
            return;
        }
       
        // log current URL
        System.out.println("Crawling: " + url);
       
        // add URL to visitedUrls set
        visitedUrls.add(url);

        // call the retrieveHTML function to fetch HTML of the current page
        Document doc = retrieveHTML(url);
        if (doc != null){
            // extract product data
            extractProductData(doc);
            
            // select all pagination links
            Elements paginationLinks = doc.select("a.page-numbers");
            for (Element link : paginationLinks) {
                // get absolute URL
                String nextUrl = link.absUrl("href");
                // check if nextUrl exists and link hasn't been visited
                if (!nextUrl.isEmpty() && !visitedUrls.contains(nextUrl)) {
                    // recursively call the crawl function
                    crawl(nextUrl, depth + 1);
                }
            }
        } 
    }
    // function to extract product details from the current page
    private static void extractProductData(Document document) {
        // select all product items on the current page
        Elements products = document.select("li.product");

        // loop through each item
        for (Element product : products) {
            // extract product name, price, and image URL
            String productName = product.select(".product-name").text();
            String price = product.select(".product-price").text();
            String imageUrl = product.select(".product-image").attr("src");

            // store the product details in the data list
            productData.add(new String[]{productName, price, imageUrl});
        }
    }
    // method to save data to a CSV file
    private static void exportDataToCsv(String filePath) {
        try (FileWriter writer = new FileWriter(filePath)) {
            // write headers
            writer.append("Product Name,Price,Image URL\n");
           
            // write data rows
            for (String[] row : productData) {
                writer.append(String.join(",", row));
                writer.append("\n");
            }
            System.out.println("Data saved to " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This code creates a new CSV file and exports the scraped data. You'll find this file in your project's root directory.

Product Data CSV

Awesome! You now know how to crawl links and extract data from your crawler.

Optimize Your Web Crawler

The following are key areas to consider when optimizing your web crawler.

Avoid Duplicate Links

Duplicate links can cause your web crawler to revisit the same URL multiple times, wasting time and resources. This often happens due to inconsistent URL formats or multiple identical links on the same page.

To prevent this, ensure each link is visited only once. One effective approach is to use a HashSet to track visited URLs. In Java, a HashSet automatically checks for duplicates before adding a new URL, ensuring efficient and streamlined crawling.
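
For example, URLs that differ only by a fragment or a trailing slash (such as /page/2/ and /page/2#content) usually point to the same content. The sketch below adds a hypothetical normalizeUrl() helper so the HashSet treats such variants as one entry; it isn't part of the crawler above, but you could call markVisited(url) in crawl() instead of adding to visitedUrls directly.

Example
// ...
import java.net.URI;
import java.net.URISyntaxException;

public class Crawler {
    // ...

    // hypothetical helper: drop fragments and trailing slashes so URL variants dedupe
    private static String normalizeUrl(String url) {
        try {
            URI uri = new URI(url);
            String path = (uri.getPath() == null) ? "" : uri.getPath().replaceAll("/+$", "");
            // rebuild the URL without its fragment
            return new URI(uri.getScheme(), uri.getAuthority(), path, uri.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            // fall back to the raw URL if it can't be parsed
            return url;
        }
    }

    // Set.add() returns false when the (normalized) URL is already in the set
    private static boolean markVisited(String url) {
        return visitedUrls.add(normalizeUrl(url));
    }
}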

Prioritize Specific Pages

Prioritizing specific pages optimizes your web crawler by letting it focus on the most relevant content. In our current crawler, we use CSS selectors to target pagination links, which saves time and resources.

However, that approach only crawled pagination links. If you're also interested in other links, you can maintain separate queues for pagination links and other links. Then, process pagination links first.

To implement prioritization in the current crawler, create two queues (one for pagination links, one for other links), store each discovered link in the appropriate queue, and then define a function that processes the queues, crawling pagination links first.

Crawler.java
// import the required modules

// ...
import java.util.LinkedList;
import java.util.Queue;

public class Crawler {
    // ...

    // initialize a queue for pagination links
    private static Queue<String> paginationQueue = new LinkedList<>();

    // initialize a queue for other links
    private static Queue<String> otherLinksQueue = new LinkedList<>();

    public static void main(String[] args) {
        // define seed URL
        String seedUrl = "https://www.scrapingcourse.com/ecommerce/";

        // start crawling from the seed URL
        crawl(seedUrl, 1);

        // process the queues
        processQueues();
    }

    // define the recursive crawl function
    private static void crawl(String url, int depth) {
        // ...

        // call the retrieveHTML function to fetch HTML of the current page
        Document doc = retrieveHTML(url);
        if (doc != null) {
            // select all pagination links
            Elements paginationLinks = doc.select("a.page-numbers");
            for (Element link : paginationLinks) {
                // get absolute URL
                String nextUrl = link.absUrl("href");
                // check if nextUrl exists and link hasn't been visited
                if (!nextUrl.isEmpty() && !visitedUrls.contains(nextUrl)) {
                    // add to pagination queue
                    paginationQueue.add(nextUrl);
                }
            }

            // select other links
            Elements otherLinks = doc.select("a[href]");
            for (Element link : otherLinks) {
                // get absolute URL
                String nextUrl = link.absUrl("href");
                // check if nextUrl exists and link hasn't been visited
                if (!nextUrl.isEmpty() && !visitedUrls.contains(nextUrl) && !paginationQueue.contains(nextUrl)) {
                    // add to other links queue
                    otherLinksQueue.add(nextUrl);
                }
            }
        }
    }

    // define function to process queues
    private static void processQueues() {
        // Process pagination queue first
        while (!paginationQueue.isEmpty()) {
            String nextUrl = paginationQueue.poll();
            crawl(nextUrl, maxDepth);
        }

        // Process other links queue
        while (!otherLinksQueue.isEmpty()) {
            String nextUrl = otherLinksQueue.poll();
            crawl(nextUrl, maxDepth);
        }
    }
}

Maintain a Single Crawl Session

A session is a persistent connection to a target website, often preserved using session parameters, such as cookies, headers, and authentication. By maintaining a single session for the entire crawling process, you can significantly boost your web crawler's efficiency.

This is particularly useful when crawling websites that employ rate-limiting technologies to mitigate excess traffic.

In JSoup, you can create a single reusable session with Jsoup.newSession(), which returns a Connection whose settings, such as headers, cookies, and timeouts, are shared by every request you make from it.

Crawler.java
// add this import at the top of the file: import org.jsoup.Connection;

public class Crawler {
    // ...

    // create a single reusable session for all requests
    private static Connection session = Jsoup.newSession()
            .timeout(5000)
            .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36");

    public static void main(String[] args) {
        String seedUrl = "https://www.scrapingcourse.com/ecommerce/";
        crawl(seedUrl, 1);
        // ...
    }

    // in retrieveHTML(), reuse the session instead of creating a new connection:
    // Document doc = session.newRequest().url(url).get();
}

However, optimizing your crawler is one thing, and getting access to a target website is another.

Most modern websites implement sophisticated anti-bot measures that can block your requests. You must overcome these obstacles to take advantage of your optimized crawler's efficiency and performance.

In the next section, we'll show you how to handle these anti-bot measures and crawl seamlessly.

Avoid Getting Blocked While Crawling With Java

Anti-bot solutions employ various techniques to mitigate bot traffic. One of these techniques involves tracking request behaviors and looking for distinguishable patterns between humans and bots.

To make matters worse, web crawlers are easily detected as they follow a systematic pattern, making multiple requests in a bot-like manner.

That said, you can configure your crawler to appear human to the target server. Common fixes include rotating proxies, spoofing request headers, and reducing your request frequency, as sketched below.
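
Here's a minimal sketch of what those patches can look like with JSoup. The proxy host and port are placeholders (in practice, you'd rotate through a pool of real proxies), and the politeFetch() helper is hypothetical rather than part of the crawler above.

Example
public class Crawler {
    // ...

    // a rough sketch of manual anti-bot patches; the proxy values below are placeholders
    private static Document politeFetch(String url) throws IOException, InterruptedException {
        // reduce request frequency: pause between requests
        Thread.sleep(2000);

        return Jsoup.connect(url)
                // spoof browser-like request headers
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36")
                .header("Accept-Language", "en-US,en;q=0.9")
                .referrer("https://www.google.com/")
                // route the request through a proxy (hypothetical host and port)
                .proxy("proxy.example.com", 8080)
                .timeout(10000)
                .get();
    }
}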

However, like most manual configurations, they can get tedious to implement, especially when scaling your web crawling or dealing with advanced anti-bot systems.

The ZenRows Scraper API offers the easiest and most reliable solution for scalable web crawling.

This tool empowers you with everything you need to crawl any website without getting blocked. Some of its features include advanced anti-bot bypass out of the box, geo-located requests, fingerprinting evasion, actual user spoofing, request header management, and more.

To use ZenRows, sign up to get your free API key.

You'll be redirected to the Request Builder page, where you can find your ZenRows API key at the top right.

Input your target URL and activate Premium Proxies and JS Rendering boost mode. Let's use the ScrapingCourse Antibot Challenge page as the target URL for this example.

Then, select the Java language and choose the API option. ZenRows works with any language and provides ready-to-use snippets for the most popular ones.

building a scraper with zenrows

Lastly, copy the generated code on the right to your editor for testing.

Your code should look like this:

APIRequest.java
import org.apache.hc.client5.http.fluent.Request;

public class APIRequest {
    public static void main(final String... args) throws Exception {
        String apiUrl = "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.scrapingcourse.com%2Fantibot-challenge&js_render=true&premium_proxy=true";
        String response = Request.get(apiUrl)
                .execute().returnContent().asString();

        System.out.println(response);
    }
}

Remember to add the Apache HttpClient Fluent dependency to your pom.xml file, as shown below.

pom.xml
<dependency>
    <groupId>org.apache.httpcomponents.client5</groupId>
    <artifactId>httpclient5-fluent</artifactId>
    <version>5.4.1</version>
</dependency>

This code bypasses the anti-bot solution, makes a GET request to the page, and prints its response. Here's the result:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>

Congratulations! You're now well-equipped to crawl any website without getting blocked.  

Web Crawling Tools for Java

The right web crawling tools can significantly impact the outcome of your data extraction projects. Here are some tools to consider when creating a Java web crawler.

  • ZenRows: An all-in-one web scraping API that provides everything you need to crawl without getting blocked. Its headless browser functionality also means it can handle dynamic content, making it a valuable web crawling tool.
  • Selenium: Best known for browser automation, Selenium also works well for crawling, as it renders JavaScript and lets you interact with web pages like a real user.
  • JSoup: A popular Java library for fetching and parsing HTML documents. This tool can handle malformed HTML, allowing you to parse complex real-world HTML files, which are often broken.

Java Web Crawling Best Practices and Considerations

The following recommended best practices can enhance your crawler's efficiency and performance.

Parallel Crawling and Concurrency

Crawling multiple pages sequentially can be inefficient and time-consuming as your crawler spends most of its time waiting for HTTP responses. However, you can significantly reduce your overall crawl time by using parallel crawling and Java's concurrency features.

Java's ExecutorService framework provides a way to manage concurrency. Here's a code snippet showing how to implement parallel crawling for the crawler you built earlier.

Example
package com.example;

// import the required modules

// ...
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Crawler {
    // ...

    // initialize thread count
    private static final int THREAD_COUNT = 10;
    private static ExecutorService executorService = Executors.newFixedThreadPool(THREAD_COUNT);

    public static void main(String[] args) {
        // ...

        // shutdown executor service when all tasks are complete
        shutdownExecutorService();

        // ...
    }
    
    // ...

    // define the recursive crawl function
    private static void crawl(String url, int depth) {
        // ...

        // submit a task to the executor service
        executorService.submit(() -> {
            Document doc = retrieveHTML(url);
            if (doc != null) {
                // extract product data
                extractProductData(doc);

                // ...
            }
        });
    }
   
    // ...

    // function to shutdown the executor service
    private static void shutdownExecutorService() {
        executorService.shutdown();
        try {
            if (!executorService.awaitTermination(60, TimeUnit.SECONDS)) {
                executorService.shutdownNow();
            }
        } catch (InterruptedException e) {
            executorService.shutdownNow();
        }
    }
}

Crawling JavaScript Rendered Pages in Java

Your current crawler, built with JSoup, cannot crawl JavaScript-rendered pages. JSoup only fetches and parses the static HTML returned by the server, and content rendered by JavaScript isn't present in that static HTML.

To crawl dynamic content, you need a headless browser that renders JavaScript and lets you interact with page elements, such as Selenium WebDriver.
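
Here's a rough, standalone sketch of that approach: Selenium drives headless Chrome, loads the page so its JavaScript executes, and hands the rendered HTML to JSoup so you can reuse the parsing logic from earlier. It assumes the selenium-java dependency is in your pom.xml; recent Selenium versions resolve a matching browser driver automatically.

Example
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class DynamicCrawler {
    public static void main(String[] args) {
        // start headless Chrome
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");
        WebDriver driver = new ChromeDriver(options);

        try {
            // load the page and let the browser execute its JavaScript
            driver.get("https://www.scrapingcourse.com/ecommerce/");

            // hand the rendered HTML to JSoup and reuse your existing parsing logic
            Document doc = Jsoup.parse(driver.getPageSource());
            System.out.println("Page Title: " + doc.title());
        } finally {
            // always close the browser
            driver.quit();
        }
    }
}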

Distributed Web Crawling in Java

Distributed web crawling is an optimization technique that distributes the workload across multiple machines or processes. This is particularly useful when scaling your web crawling or dealing with large-scale data extraction projects.

To learn how to build a distributed crawler architecture, check out our distributed web crawling guide.

Conclusion

You've learned how to build a Java web crawler from scratch. Remember that while building a web crawler to navigate web pages is a great starting point, you must overcome anti-bot measures to gain access to modern websites.

You've learned that manual configurations, such as proxy rotation, are mostly insufficient, and ZenRows provides the easiest, most reliable, scalable solution. For easy-to-implement web crawling without getting blocked, use ZenRows. Sign up for free to get started today.
