Crawler4j is an open-source Java framework for quick and efficient web crawling. Its intuitive interface allows you to set up a multi-threaded web crawler in a few minutes.
Let's see how.
This tutorial provides a step-by-step guide on crawling websites using Crawler4j. You'll learn how to discover links, follow them, and extract data using Java web scraping techniques.
Let's begin!
Prerequisites
Before we begin: if you're interested in learning more about web scraping using Java, be sure to check out our Java web scraping guide.
To follow along in this tutorial, ensure you meet the following requirements for Crawler4j:
- Java Development Kit (JDK) 11 or newer. (A quick way to check your Java version is shown after this list.)
- The Maven dependency manager: modern Java IDEs, such as IntelliJ and NetBeans, have built-in support for Maven and Gradle. Simply select the Maven template when creating your Java project to use Maven.
- Your preferred IDE. We'll be using IntelliJ in this tutorial.
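If you'd like to confirm which Java version your environment runs on, a quick check like the one below works. This is only an illustrative snippet, not part of the tutorial project.

public class VersionCheck {
    public static void main(String[] args) {
        // prints the runtime's Java version, e.g., "11.0.22"
        System.out.println(System.getProperty("java.version"));
    }
}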
Build Your First Crawler4j Web Crawler
Real-world examples are the best learning tools. Therefore, in this tutorial, we'll crawl the ScrapingCourse E-commerce test site.

We'll find and follow product links on the website and also scrape valuable information (product name, price, and image URL) as the crawler navigates each product page.
Step 1: Set up Crawler4j
Before you set up your project, here's some background information about the tool.
At its core, Crawler4j consists of two main components: the WebCrawler class, which you must extend in your own Crawler class, and the Controller class, which specifies the crawl configuration.
The WebCrawler class decides which URLs to crawl and handles downloaded pages using two methods, shouldVisit() and visit(), both of which should be overridden.
- shouldVisit(): This method specifies which URLs to crawl. You can implement your own logic to filter out domains or extensions, such as .css, .gif, .js, etc.
- visit(): This method processes downloaded pages. Crawler4j provides basic methods for accessing data from fetched HTML. You can query for the URL, text, links, and even the page's unique ID.
Similarly, the Controller class defines the seed URL (the target website), sets crawl configurations, and starts your crawler.
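To make the crawler side concrete, here's a minimal skeleton of a WebCrawler subclass with both methods overridden. It's only an illustrative sketch (the class name MyCrawler and the filter logic are placeholders); the full versions are built step by step below.

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // decide whether a discovered URL should be crawled
        return url.getURL().startsWith("https://www.scrapingcourse.com/");
    }

    @Override
    public void visit(Page page) {
        // process each downloaded page (log it, parse it, store data, etc.)
        System.out.println("Visited: " + page.getWebURL().getURL());
    }
}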
Now, let's set up Crawler4j.
Navigate to the directory where you'd like to store your code and create your Java project. Then, add Crawler4j to your project by including the XML snippet below in your pom.xml file.
<dependencies>
<dependency>
<groupId>edu.uci.ics</groupId>
<artifactId>crawler4j</artifactId>
<version>4.4.0</version>
</dependency>
</dependencies>
You're ready to go!
Step 2: Access the Target Website
Let's start with a basic Crawler4j script to access the target website and retrieve its HTML. This script makes a GET request to the target website and downloads the page.
Later on, we'll extend this script to follow links and extract product information.
// import the required classes
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.regex.Pattern;
// create a Crawler class that extends WebCrawler
public class Crawler extends WebCrawler {
// define unwanted extensions
private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|mp4|zip|gz))$");
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
String href = url.getURL().toLowerCase();
/**
* only crawl URLs that don't match the filtered extensions and
* starts with the scraping course domain
*/
return !FILTERS.matcher(href).matches()
&& href.startsWith("https://www.scrapingcourse.com/");
}
@Override
public void visit(Page page) {
String url = page.getWebURL().getURL();
System.out.println("Visiting: " + url);
// check if the page contains HTML data
if (page.getParseData() instanceof HtmlParseData) {
// parse HTML
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
// log HTML
String html = htmlParseData.getHtml();
System.out.println(html);
}
}
}
Remember, you need to implement a Controller class to configure and start your crawler. This class sets the maximum crawling depth to zero to limit crawling to the seed URL. It also sets the number of threads and other configurations necessary to run your crawler.
The max depth controls how far your crawler goes from the seed URL. Setting this value is important for defining the crawl target and avoiding infinite crawling.
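To illustrate, here's a hedged sketch of how the depth setting behaves, along with a couple of other common CrawlConfig options (politeness delay and user agent) that we don't use in this tutorial. The values shown are examples only.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public class ConfigSketch {
    public static void main(String[] args) {
        CrawlConfig config = new CrawlConfig();
        // 0 = crawl only the seed URLs, 1 = seed plus directly linked pages,
        // 2 = two hops away, and so on; -1 means no depth limit
        config.setMaxDepthOfCrawling(2);
        // cap the total number of pages fetched (-1 for no limit)
        config.setMaxPagesToFetch(100);
        // wait between consecutive requests, in milliseconds
        config.setPolitenessDelay(500);
        // identify your crawler with a custom user agent string
        config.setUserAgentString("my-study-crawler/1.0");
    }
}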
You can set numerous other configurations in the Controller class depending on your project needs. For more information, refer to the Crawler4j documentation. However, for this tutorial, we'll stick to the basics. Here's the Controller class for the crawler above.
// import the required classes
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
public class Controller {
public static void main(String[] args) throws Exception {
// initialize crawl config
CrawlConfig config = new CrawlConfig();
// define folder to store intermediate crawl data
config.setCrawlStorageFolder("src/main/resources/");
// set maximum crawling depth
config.setMaxDepthOfCrawling(0);
// set number of pages to fetch
config.setMaxPagesToFetch(1);
// instantiate the controller for this crawl.
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
// disable robots.txt handling
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
// set number of threads to use during crawling
int numberOfCrawlers = 1;
// add seed URL
controller.addSeed("https://www.scrapingcourse.com/ecommerce/");
// initialize the factory, which creates instances of crawlers.
CrawlController.WebCrawlerFactory<Crawler> factory = Crawler::new;
// start the crawl.
controller.start(factory, numberOfCrawlers);
}
}
Run this code, and it'll log the HTML content of the page, as seen below.
<html lang="en">
<head>
<!-- ... -->
<title>
Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com
</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<div class="beta site-title">
<a href="https://www.scrapingcourse.com/ecommerce/" rel="home">
Ecommerce Test Site to Learn Web Scraping
</a>
</div>
<!-- other content omitted for brevity -->
</body>
</html>
That was quite fast!
Step 3: Follow Links With Crawler4j
Let's extend the initial Crawler class to find and follow links on the page. We'll focus only on the pagination links (the numbered product listing pages) to keep things simple and practical.
By default, if you don't define which links to visit, Crawler4j attempts to follow every link it finds, starting from the seed URL. Therefore, you must specify the links you want to crawl using the shouldVisit() method.
To achieve this, inspect the page to identify the pagination pattern. Navigate to the target page in a browser, right-click a pagination element, and select Inspect. This will open the Developer Tools window, as seen in the image below:

Here, you'll find that the pagination links all end with the format /page/{number}/. Using this information, modify your shouldVisit() method to only visit the pagination links.
There are different options for achieving this:
- The Pattern.compile() method, which takes a regex (regular expression) pattern that matches the target links.
- The startsWith() method, which takes a string that matches what the target URLs start with.
We recommend using the Pattern.compile() method so you can be as precise as possible and avoid crawling unwanted links, such as those with additional parameters (for example, /?add-to-cart/).
In that case, the regex pattern matching the pagination links is .*/page/\\d+/$.
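If you want to sanity-check a pattern like this before wiring it into the crawler, a small standalone test helps. This is just an illustrative sketch with made-up example URLs.

import java.util.List;
import java.util.regex.Pattern;

public class PatternCheck {
    public static void main(String[] args) {
        Pattern pagination = Pattern.compile(".*/page/\\d+/$");
        List<String> urls = List.of(
            "https://www.scrapingcourse.com/ecommerce/page/2/",       // matches
            "https://www.scrapingcourse.com/ecommerce/page/12/",      // matches
            "https://www.scrapingcourse.com/ecommerce/?add-to-cart=1" // doesn't match
        );
        for (String url : urls) {
            // matches() checks the whole URL against the pagination pattern
            System.out.println(url + " -> " + pagination.matcher(url).matches());
        }
    }
}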
A full introduction to regular expressions is beyond the scope of this article, but if you're unfamiliar with them, you can find numerous online resources to guide you.
That said, follow the steps below to modify your shouldVisit() method accordingly.
Define the pattern for pagination links using the regex pattern above.
// ...
public class Crawler extends WebCrawler {
// ...
// define the pattern for pagination links (/page/number/)
private static final Pattern PAGINATION_PATTERN = Pattern.compile(".*/page/\\d+/$");
}
Next, apply the pagination pattern to your shouldVisit() method.
public class Crawler extends WebCrawler {
// ...
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
// ...
/**
* only crawl URLs that don't match the filtered extensions, but
* matches pagination pattern
*/
return !FILTERS.matcher(href).matches() && PAGINATION_PATTERN.matcher(href).matches();
}
}
Now, put everything together, and here's your complete Crawler class. We've added some print statements to track crawling progress.
// import the required classes
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.regex.Pattern;
import java.util.Set;
// create a Crawler class that extends WebCrawler
public class Crawler extends WebCrawler {
// define unwanted extensions
private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|mp4|zip|gz))$");
// define the pattern for pagination links (/page/number/)
private static final Pattern PAGINATION_PATTERN = Pattern.compile(".*/page/\\d+/$");
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
String href = url.getURL().toLowerCase();
/**
* only crawl URLs that don't match the filtered extensions, but
* matches pagination pattern
*/
return !FILTERS.matcher(href).matches()
&& PAGINATION_PATTERN.matcher(href).matches();
}
@Override
public void visit(Page page) {
String url = page.getWebURL().getURL();
System.out.println("Visiting: " + url);
// check if the page contains HTML data
if (page.getParseData() instanceof HtmlParseData) {
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String html = htmlParseData.getHtml();
Set<WebURL> links = htmlParseData.getOutgoingUrls();
System.out.println("-------------------------------------");
System.out.println("Html length: " + html.length());
System.out.println("Number of outgoing links: " + links.size());
System.out.println("-------------------------------------");
}
}
}
In your Controller class, set the maximum crawl depth to 2 so Crawler4j can follow the pagination links, and increase or remove the setMaxPagesToFetch(1) limit from Step 2 so the crawler isn't capped at a single page.
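Only two configuration calls in the Controller's main() method need to change; here's a sketch of the updated lines, using -1 to remove the page cap.

// in Controller's main(), update the crawl configuration
config.setMaxDepthOfCrawling(2); // follow links up to two hops from the seed URL
config.setMaxPagesToFetch(-1);   // lift the single-page limit (-1 means no cap)

Run your code. Here's our result: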
Visiting: https://www.scrapingcourse.com/ecommerce/page/5/
-------------------------------------
Html length: 84167
Number of outgoing links: 86
-------------------------------------
Visiting: https://www.scrapingcourse.com/ecommerce/page/6/
-------------------------------------
Html length: 84595
Number of outgoing links: 86
-------------------------------------
// ... omitted for brevity ... //
Step 4: Extract Data From Collected Links
The next step is to extract data from the links we visit. Once the crawler navigates to a product page, we want to extract the product name, price, and image URL of each product on the page.
However, Crawler4j doesn't provide a built-in way to query specific HTML elements. So, we need to integrate a library like Jsoup to navigate the HTML document and select elements using CSS selectors.
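If you haven't used Jsoup before, here's a quick, self-contained sketch of its parse-and-select API. The HTML string is a made-up example; the real selectors for the target site are identified below.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {
    public static void main(String[] args) {
        // a tiny made-up HTML snippet to demonstrate CSS selection
        String html = "<ul><li class=\"product\"><h2 class=\"product-name\">Sample Hoodie</h2></li></ul>";
        Document document = Jsoup.parse(html);
        // select the first element matching a CSS selector and read its text
        Element name = document.selectFirst("li.product .product-name");
        System.out.println(name.text()); // prints: Sample Hoodie
    }
}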
Follow the steps below to combine Crawler4j with Jsoup to extract data from collected links.
Add Jsoup to your project by including the following XML snippet in your pom.xml file.
<dependency>
<!-- jsoup HTML parser library @ https://jsoup.org/ -->
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.18.3</version>
</dependency>
Next, import the required Jsoup classes and, in your visit() method, parse the HTML using Jsoup.
// import the required classes
// ...
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// create a Crawler class that extends WebCrawler
public class Crawler extends WebCrawler {
@Override
public void visit(Page page) {
// ...
// check if the page contains HTML data
if (page.getParseData() instanceof HtmlParseData) {
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String html = htmlParseData.getHtml();
// parse HTML content using Jsoup
Document document = Jsoup.parse(html);
}
}
}
After that, inspect the page to identify the right CSS selectors for the aforementioned data points.

You'll find that each product is a list item with the class product. The following HTML elements within the list items represent each data point:
- Product name: an <h2> tag with the class product-name.
- Product price: a span element with the class product-price.
- Product image: an <img> tag with the class product-image.
Use this information to create scraping logic that selects all product items on the current page, loops through them, and extracts the product name, price, and image URL.
Let's abstract this scraping logic into a function so you can easily apply it to your visit() method. This will keep your code clean and modular.
//...
//import the required classes
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// create a Crawler class that extends WebCrawler
public class Crawler extends WebCrawler {
// ...
// function to extract product details from the current page
private void extractProductData(Document document) {
// select all product items on the current page
Elements products = document.select("li.product");
// loop through each item
for (Element product : products) {
// extract product name, price, and image URL
String productName = product.select(".product-name").text();
String price = product.select(".product-price").text();
String imageUrl = product.select(".product-image").attr("src");
// log the result
System.out.println("product-name: " + productName);
System.out.println("product-price: " + price);
System.out.println("product-image: " + imageUrl);
}
}
}
That's it.
Now, call the extractProductData() function in the visit() method and combine all the steps to get the following complete code:
// import the required classes
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// create a Crawler class that extends WebCrawler
public class Crawler extends WebCrawler {
// define unwanted extensions
private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|mp4|zip|gz))$");
// define the pattern for pagination links (/page/number/)
private static final Pattern PAGINATION_PATTERN = Pattern.compile(".*/page/\\d+/$");
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
String href = url.getURL().toLowerCase();
/**
* only crawl URLs that don't match the filtered extensions, but
* matches pagination pattern
*/
return !FILTERS.matcher(href).matches()
&& PAGINATION_PATTERN.matcher(href).matches();
}
@Override
public void visit(Page page) {
String url = page.getWebURL().getURL();
System.out.println("Visiting: " + url);
// check if the page contains HTML data
if (page.getParseData() instanceof HtmlParseData) {
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String html = htmlParseData.getHtml();
// parse HTML content using Jsoup
Document document = Jsoup.parse(html);
// extract product data
extractProductData(document);
}
}
// function to extract product details from the current page
private void extractProductData(Document document) {
// select all product items on the current page
Elements products = document.select("li.product");
// loop through each item
for (Element product : products) {
// extract product name, price, and image URL
String productName = product.select(".product-name").text();
String price = product.select(".product-price").text();
String imageUrl = product.select(".product-image").attr("src");
// log the result
System.out.println("product-name: " + productName);
System.out.println("product-price: " + price);
System.out.println("product-image: " + imageUrl);
}
}
}
This code visits all pagination links and extracts each product's name, price, and image URL on each page.
Here's the result:
Visiting: https://www.scrapingcourse.com/ecommerce/
product-name: Abominable Hoodie
product-price: $69.00
product-image: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main.jpg
product-name: Adrienne Trek Jacket
product-price: $57.00
product-image: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_main.jpg
product-name: Aeon Capri
product-price: $48.00
product-image: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wp07-black_main.jpg
// ... omitted for brevity ... //
Step 5: Export the Scraped Data to CSV
Exporting scraped data to CSV is often essential for turning raw data into valuable insights. You can achieve this in Java using the built-in FileWriter class.
Here's a step-by-step guide:
Start by initializing an empty list to store scraped data.
// import the required classes
// ...
import java.util.ArrayList;
import java.util.List;
// create a Crawler class that extends WebCrawler
public class Crawler extends WebCrawler {
// ...
// initialize an empty list to store scraped product data
private static List<String[]> productData = new ArrayList<>();
// ...
}
After that, modify the extractProductData() function to add the scraped data to the list.
// create a Crawler class that extends WebCrawler
public class Crawler extends WebCrawler {
// ...
// function to extract product details from the current page
private void extractProductData(Document document) {
// ...
// store the product details in the data list
productData.add(new String[]{productName, price, imageUrl});
}
}
Next, create a function to write scraped data to CSV. Within this function, initialize a FileWriter class, write the CSV headers, and populate the rows with the scraped data.
// import the required libraries
// ...
import java.io.IOException;
import java.io.FileWriter;
// create a Crawler class that extends WebCrawler
public class Crawler extends WebCrawler {
// ...
// method to save data to a CSV file
private static void exportDataToCsv(String filePath) {
// initialize a FileWriter class
try (FileWriter writer = new FileWriter(filePath)) {
// write headers
writer.append("Product Name,Price,Image URL\n");
// populate data rows with scraped data
for (String[] row : productData) {
writer.append(String.join(",", row));
writer.append("\n");
}
System.out.println("Data saved to " + filePath);
} catch (IOException e) {
e.printStackTrace();
}
}
}
That's it.
Now, combine the steps above and call the exportDataToCsv() function in your visit() method.
Your new code should look like this:
// import the required classes
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.ArrayList;
import java.util.List;
import java.io.IOException;
import java.io.FileWriter;
// create a Crawler class that extends WebCrawler
public class Crawler extends WebCrawler {
// define unwanted extensions
private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|mp4|zip|gz))$");
// define the pattern for pagination links (/page/number/)
private static final Pattern PAGINATION_PATTERN = Pattern.compile(".*/page/\\d+/$");
// initialize an empty list to store scraped product data
private static List<String[]> productData = new ArrayList<>();
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
String href = url.getURL().toLowerCase();
/**
* only crawl URLs that don't match the filtered extensions, but
* matches pagination pattern
*/
return !FILTERS.matcher(href).matches()
&& PAGINATION_PATTERN.matcher(href).matches();
}
@Override
public void visit(Page page) {
String url = page.getWebURL().getURL();
System.out.println("Visiting: " + url);
// check if the page contains HTML data
if (page.getParseData() instanceof HtmlParseData) {
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String html = htmlParseData.getHtml();
// parse HTML content using Jsoup
Document document = Jsoup.parse(html);
// extract product data
extractProductData(document);
// export scraped data to CSV
exportDataToCsv("product_data.csv");
}
}
// function to extract product details from the current page
private void extractProductData(Document document) {
// select all product items on the current page
Elements products = document.select("li.product");
// loop through each item
for (Element product : products) {
// extract product name, price, and image URL
String productName = product.select(".product-name").text();
String price = product.select(".product-price").text();
String imageUrl = product.select(".product-image").attr("src");
// store the product details in the data list
productData.add(new String[]{productName, price, imageUrl});
}
}
// method to save data to a CSV file
private static void exportDataToCsv(String filePath) {
// initialize a FileWriter class
try (FileWriter writer = new FileWriter(filePath)) {
// write headers
writer.append("Product Name,Price,Image URL\n");
// populate data rows with scraped data
for (String[] row : productData) {
writer.append(String.join(",", row));
writer.append("\n");
}
System.out.println("Data saved to " + filePath);
} catch (IOException e) {
e.printStackTrace();
}
}
}
This code creates a new product_data.csv file in your project's root directory and populates the rows with the scraped data.
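One caveat: the rows are joined with plain commas, so a product name that itself contains a comma would spill into an extra column. If that happens with your data, a small quoting helper is enough. Here's a hedged sketch; escapeCsv is our own addition, not part of the tutorial code, and you'd wrap each value with it inside exportDataToCsv().

public class CsvEscapeSketch {
    // wrap a value in quotes and double any embedded quotes (RFC 4180 style)
    static String escapeCsv(String value) {
        if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    public static void main(String[] args) {
        // example: a product name containing a comma stays in a single CSV column
        System.out.println(escapeCsv("Hoodie, Blue") + "," + escapeCsv("$69.00"));
        // prints: "Hoodie, Blue",$69.00
    }
}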
Here's a sample screenshot of the result.

Congratulations! You've done it.
Not only do you know how to crawl websites using Crawler4j, but you now have a Crawler4j crawler that can follow links, scrape pages, and export data to CSV.
Avoid Getting Blocked While Crawling With Crawler4j
Getting blocked is a common web crawling challenge, as modern websites employ anti-bot measures, such as browser fingerprinting, rate limiting, and IP reputation checks, designed to detect web crawler traffic patterns.
Here's a real-world example.
Let's try to crawl the Antibot Challenge page, an anti-bot-protected website, using your Crawler4j crawler.
// import the required classes
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
public class Controller {
public static void main(String[] args) throws Exception {
// initialize crawl config
CrawlConfig config = new CrawlConfig();
// define folder to store intermediate crawl data
config.setCrawlStorageFolder("src/main/resources/");
// set maximum crawling depth
config.setMaxDepthOfCrawling(0);
// set number of pages to fetch
config.setMaxPagesToFetch(1);
// instantiate the controller for this crawl.
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
// disable robots.txt handling
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
// set number of threads to use during crawling
int numberOfCrawlers = 1;
// add seed URL
controller.addSeed("https://www.scrapingcourse.com/antibot-challenge");
// initialize the factory, which creates instances of crawlers.
CrawlController.WebCrawlerFactory<Crawler> factory = Crawler::new;
// start the crawl.
controller.start(factory, numberOfCrawlers);
}
}
You'll get a 403 error, indicating that the website refuses to fulfill your request. This happens because the anti-bot solution detects your web crawling activity and blocks your crawler.
"HTTP/1.1 403 Forbidden[\r][\n]"
To overcome this challenge, you can try common best practices, such as rotating proxies and setting custom user agents. However, these methods often fail against advanced anti-bot solutions.
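For reference, here's a hedged sketch of what those basics look like with Crawler4j's CrawlConfig, assuming a single static proxy; the host, port, and user-agent values are placeholders. This can help on simpler sites but won't defeat advanced protections.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public class BasicEvasionSketch {
    public static void main(String[] args) {
        CrawlConfig config = new CrawlConfig();
        // mimic a regular browser's user agent (placeholder string)
        config.setUserAgentString("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
        // route requests through a proxy (placeholder host and port)
        config.setProxyHost("proxy.example.com");
        config.setProxyPort(8080);
        // pass this config to PageFetcher and CrawlController as in the earlier steps
    }
}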
You'll need to completely emulate natural user behavior to avoid getting blocked when web crawling. ZenRows' Universal Scraper API provides an easy way to do that.
ZenRows is an all-in-one web scraping API that makes it easy to emulate real user behavior and fly under the radar when web crawling. It handles all the technicalities under the hood, allowing you to focus on extracting your desired data.
Some of ZenRows' features include advanced anti-bot bypass out of the box, CAPTCHA bypass, premium proxies, geo-located requests, actual user spoofing, request header management, and more.
Here's ZenRows in action against the same Antibot Challenge page where Crawler4j failed.
To follow along, sign up to get your free API key.
You'll be redirected to the Request Builder page, where you can find your ZenRows API key at the top right.

Input your target URL and activate Premium Proxies and JS Rendering boost mode.
Then, select the Java language and choose the API option. ZenRows works with any language and provides ready-to-use snippets for the most popular ones.
Lastly, copy the generated code on the right to your editor for testing.
Your code should look like this:
import org.apache.hc.client5.http.fluent.Request;
public class APIRequest {
public static void main(final String... args) throws Exception {
String apiUrl = "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.scrapingcourse.com%2Fantibot-challenge&js_render=true&premium_proxy=true";
String response = Request.get(apiUrl)
.execute().returnContent().asString();
System.out.println(response);
}
}
Remember to add the Apache HttpClient Fluent dependency to your pom.xml file, as shown below.
<dependency>
<groupId>org.apache.httpcomponents.client5</groupId>
<artifactId>httpclient5-fluent</artifactId>
<version>5.4.1</version>
</dependency>
Run the code, and you'll get the following result:
<html lang="en">
<head>
<!-- ... -->
<title>Antibot Challenge - ScrapingCourse.com</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<h2>
You bypassed the Antibot challenge! :D
</h2>
<!-- other content omitted for brevity -->
</body>
</html>
This code bypasses the anti-bot solution and retrieves the HTML of the target page.
Congratulations! You now know how to crawl any website without getting blocked.
Conclusion
Phew! That was quite the ride. If you stuck around to the end, there's no doubt you're now equipped to take on real-world Java web crawling projects.
Let's do a quick recap. You've learned how to:
- Find and follow links using Crawler4j.
- Extract data from crawled links.
- Export scraped data to CSV.
Just remember that while these steps are a great starting point, you must overcome anti-bot challenges to access most target websites.
Thankfully, ZenRows makes all this easy by offering a reliable and scalable solution that enables you to crawl without getting blocked.