Apache Nutch is an open-source, production-ready web crawler with an extensible interface that lets you fetch, parse, store, and index web pages for easy searching and querying.
It is pluggable, modular, and easy to maintain: you can quickly find hyperlinks, check for broken links, and handle duplicates using basic commands.
This tutorial will guide you through crawling websites using Apache Nutch. You'll learn how to discover links, follow them, and extract valuable data as your crawler navigates each page. Let's get started!
Prerequisites
To follow along in this tutorial, ensure you meet the following requirements:
- Java Development Kit (JDK) 11 or newer.
- Unix environment, or Windows Cygwin environment.
- Your preferred IDE. We'll be using Visual Studio Code in this tutorial.
Build Your First Apache Nutch Web Crawler
To demonstrate how to crawl websites using Apache Nutch, we'll use the ScrapingCourse E-commerce test site as a target page.

By the end of this tutorial, you'll have a functional Apache Nutch web crawler that can discover product links, follow them, and extract product information (product name, price, and image URL).
Step 1: Set Up Apache Nutch
Before we dive in, let's take a step back to understand how the tool works.
Apache Nutch is a batch-based crawler that relies on plugins for custom implementation. At its core, it consists of two main components:
- Crawldb: This stands for the Crawl database, which stores and tracks all URLs, whether crawled or not. It also contains the metadata of these links.
- Segments: These are directories containing the content fetched during each crawl, including links and parsed text.
You'll better understand the role of these components when we get hands-on. For now, here's a quick overview of the Apache Nutch crawl cycle:
- A list of seed URLs is injected into the Crawldb.
- Nutch visits these URLs and fetches their content.
- Then, it parses the retrieved response into various fields stored in the segment directory.
- It pushes discovered links to the Crawldb for another crawl cycle.
You can automate all this from the command line.
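For reference, here's a rough sketch of how you could script that cycle in a shell. It uses the same commands and directory layout (crawl/crawldb, crawl/segments, urls) we'll walk through step by step below.
# sketch: run a few crawl rounds with Nutch's core commands
bin/nutch inject crawl/crawldb urls
for round in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments
    segment=$(ls -d crawl/segments/2* | tail -1)
    bin/nutch fetch $segment
    bin/nutch parse $segment
    bin/nutch updatedb crawl/crawldb $segment
done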
Now that that's out of the way, let's set up Apache Nutch and begin crawling.
First, navigate to a directory where you'd like to store your code and download the Apache Nutch binary package (apache-nutch-1.X-bin.zip).
Unzip this binary package.
unzip apache-nutch-1.20-bin.zip
You'll find a folder with the format apache-nutch-1.X. Change directory (cd) into this folder and enter the following command to verify your installation.
bin/nutch
If done correctly, this command will output the version and an overview of the Nutch commands and their use cases, as shown below.
nutch 1.20
Usage: nutch COMMAND [-Dproperty=value]... [command-specific args]...
where COMMAND is one of:
readdb read / dump crawl db
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
generate generate new segments to fetch from crawl db
freegen generate new segments to fetch from text files
fetch fetch a segment's pages
parse parse a segment's pages
readseg read / dump segment data
mergesegs merge several segments, with optional filtering and slicing
updatedb update crawl db from segments after fetching
# ... truncated for brevity ... #
Lastly, note that Nutch 1.x is built on Apache Hadoop data structures. On Linux and macOS, the bundled Hadoop libraries run in local mode without extra setup, but on Windows, the Hadoop native binaries must be available on your machine to avoid errors like the one below.
java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset
Follow the steps below to set up Hadoop on Windows.
Download winutils from a trusted source. This is a supplementary tool that gives Windows the Hadoop native binaries it needs. We'll use the cdarlint GitHub project, which provides winutils.exe.
git clone https://github.com/cdarlint/winutils/
This command clones the entire repository, which contains multiple Hadoop versions. Choose one and set the HADOOP_HOME environment variable to the root path of that version, for example, C:\winutils\hadoop-3.3.6.
Also, add the bin directory of that Hadoop version (C:\winutils\hadoop-3.3.6\bin) to your system PATH environment variable.
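If you prefer the command line to the Environment Variables dialog, you can set HADOOP_HOME from a regular Windows Command Prompt with setx, as sketched below (adjust the path to the version you picked; setx only takes effect in new terminal sessions).
setx HADOOP_HOME "C:\winutils\hadoop-3.3.6"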
That's it!
Step 2: Access the Target Website
Apache Nutch requires some configurations before you can begin crawling.
First, you must customize your crawl properties. The conf/nutch-default.xml file contains the default crawl properties, which you can modify or use as is, depending on your project needs. At the same time, Nutch provides the conf/nutch-site.xml file for defining custom crawl properties that override those in nutch-default.xml.
For the most basic implementation, you only need to set the http.agent.name property. Thus, add the following XML snippet to your nutch-site.xml file.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>http.agent.name</name>
        <value>Nutch Crawler</value>
    </property>
</configuration>
While configuring other HTTP agent settings could be helpful, this is the only required one.
If you have permission to scrape your target website, you could configure Apache Nutch to bypass its robots.txt rules using the http.robot.rules.allowlist property. That said, it's important to respect website rules and scrape ethically.
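You can also add a few optional properties alongside http.agent.name to keep the crawl polite and scoped to the target site. The snippet below is only a sketch; the property names are taken from conf/nutch-default.xml, so confirm them (and their defaults) against your Nutch version.
<property>
    <name>fetcher.server.delay</name>
    <value>2.0</value>
</property>
<property>
    <name>db.ignore.external.links</name>
    <value>true</value>
</property>
Here, fetcher.server.delay spaces out successive requests to the same host, and db.ignore.external.links keeps links outside the seed domain out of the Crawldb.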
After that, define your seed URL(s). Apache Nutch requires you to create a text file (urls/seed.txt) containing the list of target URLs, one per line.
Run the following command to create the urls/ directory.
mkdir -p urls
Then, create a seed.txt file and write the target URL to it.
echo "https://www.scrapingcourse.com/ecommerce/" >> urls/seed.txt
Lastly, navigate to conf/regex-urlfilter.txt. This is where you define which URLs Nutch should include in or exclude from your crawl.
Each rule is a regex pattern prefixed with a plus (+) sign to include matching URLs or a minus (-) sign to exclude them.
At this stage of the tutorial, we simply want to access the website. So, add the following regex pattern to your regex-urlfilter.txt file.
# accept only URLs under the target site
+^https://www\.scrapingcourse\.com/ecommerce/.*
This tells Nutch to limit crawling to the target site and ignore links that do not match. If your copy of the file still ends with the default catch-all rule (+.), remove or comment it out so it doesn't accept everything.
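For context, here's a hypothetical set of rules, modeled on the ones shipped in the default regex-urlfilter.txt, showing how exclude and include patterns combine. Rules are evaluated from top to bottom, and the first match decides whether a URL is kept.
# skip common binary and asset extensions
-\.(gif|jpg|png|css|js|zip|exe)$
# skip URLs containing characters that usually indicate queries or sessions
-[?*!@=]
# accept anything under the target site
+^https://www\.scrapingcourse\.com/ecommerce/.*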
Now, you can begin crawling.
Run the command below to inject your seed URL into the crawl database.
bin/nutch inject crawl/crawldb urls
This creates a crawl/crawldb directory, which acts as a web database containing your seed URL.
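You can confirm the injection with the readdb command, which prints statistics such as the total number of URLs and their fetch status.
bin/nutch readdb crawl/crawldb -stats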
Next, generate a fetch list from the database.
bin/nutch generate crawl/crawldb crawl/segments
This command queues the seed URLs for crawling and places the fetch list in a new segment directory (under crawl/segments/) named after its timestamp.
You'll need to reference this directory in subsequent commands. We recommend saving the name of this segment in a shell variable for reusability.
s1=`ls -d crawl/segments/2* | tail -1`
This saves the current segment's path as s1.
After that, fetch the web page of the current segment using the fetch command.
bin/nutch fetch $s1
Lastly, parse the fetched web page to retrieve various fields, including the HTML content, text, and hyperlinks.
bin/nutch parse $s1
The results are stored in the segment as different data fields: content, parse_data, parse_text, crawl_fetch, and crawl_parse.
- content: stores the HTML content.
- parse_data: contains the metadata for each URL.
- parse_text: holds the HTML text content.
- crawl_fetch: contains the fetch status and HTTP response data.
- crawl_parse: includes parse metadata.
That's it! You've made your first crawl using Apache Nutch.
Run the read segment command below to access and view your results.
bin/nutch readseg -dump $s1 output
This command exports the crawled data from the segment into a new folder named output. The dump file it creates contains all the data fields mentioned above.
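If you only want a quick overview of the segment rather than the full dump, the readseg command also supports a -list option that prints summary information, such as the number of fetched and parsed URLs.
bin/nutch readseg -list $s1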
Here's the HTML content for reference.
<html lang="en">
<head>
<!-- ... -->
<title>
Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com
</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<div class="beta site-title">
<a href="https://www.scrapingcourse.com/ecommerce/" rel="home">
Ecommerce Test Site to Learn Web Scraping
</a>
</div>
<!-- other content omitted for brevity -->
</body>
</html>
Step 3: Follow Links With Apache Nutch
Now, let's scale your crawler to find and follow specific links. For this step, we'll focus on only pagination links, as they contain product information we'll scrape later.
Do you recall the regex-urlfilter.txt file where you defined a regex pattern to limit crawling to links associated with the target URL?
If you want Nutch to crawl only the pagination links, you must modify this file with a new regex pattern that matches the pagination structure.
To achieve this, inspect the page to identify the pagination format. Visit the target website on a Chrome browser, right-click on a pagination element, and select Inspect. This will open the Developer Tools window, as shown in the image below.

You'll notice that there are 12 pagination links, all ending with the format /page/{number}/. Using this information, create a custom URL filter (regex pattern) instructing Nutch to crawl only pagination links.
^https://www\.scrapingcourse\.com/ecommerce/page/[0-9]+/$
Add this rule to your regex-urlfilter.txt file. Note that URL filter rules are evaluated top to bottom, and the first matching rule decides whether a URL is accepted.
# accept pagination links
+^https://www\.scrapingcourse\.com/ecommerce/page/[0-9]+/$
Now, generate a new fetch list to reflect this change and save this segment in a new shell variable (s2).
bin/nutch generate crawl/crawldb crawl/segments
s2=`ls -d crawl/segments/2* | tail -1`
Then, following the same steps as before, fetch and parse the web page.
bin/nutch fetch $s2
bin/nutch parse $s2
This parses the results into the different fields mentioned earlier. However, at this point, we're only interested in the links. Thus, run this command to exclude other fields and view only the crawled links.
bin/nutch readseg -dump $s2 outputdir -nocontent -nofetch -nogenerate -noparse -noparsetext
You'll get an outputdir directory with a dump file containing the crawled links, including duplicates.
Outlinks: 16
outlink: toUrl: https://www.scrapingcourse.com/ecommerce/ anchor: Skip to content
outlink: toUrl: https://www.scrapingcourse.com/ecommerce/page/2/ anchor: 2
outlink: toUrl: https://www.scrapingcourse.com/ecommerce/page/3/ anchor: 3
outlink: toUrl: https://www.scrapingcourse.com/ecommerce/page/4/ anchor: 4
outlink: toUrl: https://www.scrapingcourse.com/ecommerce/page/10/ anchor: 10
outlink: toUrl: https://www.scrapingcourse.com/ecommerce/page/11/ anchor: 11
outlink: toUrl: https://www.scrapingcourse.com/ecommerce/page/12/ anchor: 12
# ... truncated for brevity ... #
Here, you'll notice that pages 5, 6, 7, 8, and 9 are absent. Why is that?
If you open the target page in a browser, you'll see that the HTML displayed does not include these pages (5, 6, 7, 8, and 9).

Therefore, Apache Nutch couldn't find these links in the first crawl. You need to update Crawldb with the found links and crawl until it covers the entire pagination chain.
Run the command below to update Crawldb with the result of the first crawl.
bin/nutch updatedb crawl/crawldb $s2
Now that the database contains new entries, generate a new fetch list and save the segment in a new shell variable (s3). Then, repeat the fetch and parse process.
bin/nutch generate crawl/crawldb crawl/segments
s3=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s3
bin/nutch parse $s3
You should now have all the pagination links. Run the read segment command on the new segment ($s3) to view your result, this time dumping into a fresh directory so it doesn't clash with the earlier dump.
bin/nutch readseg -dump $s3 outputdir2 -nocontent -nofetch -nogenerate -noparse -noparsetext
You'll notice that the dump file contains all the pagination links but is a bit verbose. Let's isolate the pagination links using text-processing command-line tools (grep and awk).
Enter the following command to isolate pagination links from the dump file.
cat outputdir2/dump | awk '{print $3}' | awk '!seen[$0]++' | grep -E "/ecommerce/page/[0-9]+/$" > pagination_links.txt
This command filters the result and saves the pagination links in a new text file named pagination_links.txt.
Here's an overview of what each command does:
- awk '{print $3}': prints only the third column of each line. In this case, it strips keys such as outlink: and toUrl:, leaving only the URL column.
- awk '!seen[$0]++': removes duplicate lines.
- grep -E "/ecommerce/page/[0-9]+/$": selects links that end with the format /ecommerce/page/{number}/.
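Print the file to confirm the extracted links.
cat pagination_links.txt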
Your terminal should look like this:
# ... omitted for brevity ... #
https://www.scrapingcourse.com/ecommerce/page/10/
https://www.scrapingcourse.com/ecommerce/page/11/
https://www.scrapingcourse.com/ecommerce/page/12/
Awesome! You've crawled specific links using Apache Nutch.
Step 4: Extract Data From Collected Links
The next step is to extract product information from the crawled links.
Apache Nutch offers the parsefilter-regex plugin, which allows you to customize your parse result using regular expressions. You can define these rules in a separate file or directly in the nutch-site.xml file.
The rule format is as follows: <name> | <source> | <regex>, where <name> is your field of interest, <source> is either html or text, and <regex> is the regular expression that extracts the desired data from the source.
For example:
<!-- ... -->
<property>
    <name>parsefilter.regex.rules</name>
    <value>
        product_name html <regex>
    </value>
</property>
After setting these rules, you'll need to rerun your crawl and access the result in the dump file as in the previous steps.
However, processing the output with this approach can get tedious and requires a lot of manual configuration.
In that case, we recommend integrating with a Java HTML-parsing library like Jsoup. We'll navigate each pagination link and extract the product name, price, and image URL.
Below is a step-by-step guide:
To begin, create a Java project in your project directory and add Jsoup as a dependency. We'll use Maven to manage dependencies, so include the following XML snippet in the <dependencies> section of your pom.xml.
<dependency>
    <!-- jsoup HTML parser library @ https://jsoup.org/ -->
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.18.3</version>
</dependency>
Next, create a Parser Java class (Parser.java) and prepare to write your code.
In your Java class, open the pagination_links.txt file using Java's BufferedReader class. Then, loop through the file, read each line, and fetch the HTML document of each URL using Jsoup.
// import the required classes
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Parser {
    public static void main(String[] args) {
        // open the pagination links file
        try (BufferedReader br = new BufferedReader(new FileReader("path\\to\\pagination_links.txt"))) {
            String url;
            // loop through the file and read links line by line
            while ((url = br.readLine()) != null) {
                // log progress
                System.out.println("Crawling: " + url);
                // fetch and parse the HTML content using Jsoup
                Document document = Jsoup.connect(url).get();
            }
        } catch (IOException e) {
            System.err.println("Error reading the file");
            e.printStackTrace();
        }
    }
}
Next, create a scraping logic to extract each product's name, price, and image URL from each page.
To achieve this, inspect a product card to identify the right selectors for the desired data points.

You'll notice that each product is a list item with the class product. The following HTML elements within the list items represent each data point:
- Product name: <h2> with the class product-name.
- Product price: <span> element with the class product-price.
- Product image: <img> tag with the class product-image.
Using this information, instruct Jsoup to select all product cards, loop through them, and extract their product name, price, and image URL.
We recommend abstracting this logic for a cleaner and modular code.
// import the required classes
// ...
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Parser {
    // ...

    // function to extract product details from the current page
    private static void extractProductData(Document document) {
        // select all product items on the current page
        Elements products = document.select("li.product");
        // loop through each item
        for (Element product : products) {
            // extract the product name, price, and image URL
            String productName = product.select(".product-name").text();
            String price = product.select(".product-price").text();
            String imageUrl = product.select(".product-image").attr("src");
            // log the result
            System.out.println("product-name: " + productName);
            System.out.println("product-price: " + price);
            System.out.println("product-image: " + imageUrl);
        }
    }
}
That's it!
Now, call the extractProductData() function in main() and combine all the steps to get the following complete code:
// import the required classes
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Parser {
    public static void main(String[] args) {
        // open the pagination links file
        try (BufferedReader br = new BufferedReader(new FileReader("path\\to\\pagination_links.txt"))) {
            String url;
            // loop through the file and read links line by line
            while ((url = br.readLine()) != null) {
                // log progress
                System.out.println("Crawling: " + url);
                // fetch and parse the HTML content using Jsoup
                Document document = Jsoup.connect(url).get();
                // extract product information
                extractProductData(document);
            }
        } catch (IOException e) {
            System.err.println("Error reading the file");
            e.printStackTrace();
        }
    }

    // function to extract product details from the current page
    private static void extractProductData(Document document) {
        // select all product items on the current page
        Elements products = document.select("li.product");
        // loop through each item
        for (Element product : products) {
            // extract the product name, price, and image URL
            String productName = product.select(".product-name").text();
            String price = product.select(".product-price").text();
            String imageUrl = product.select(".product-image").attr("src");
            // log the result
            System.out.println("product-name: " + productName);
            System.out.println("product-price: " + price);
            System.out.println("product-image: " + imageUrl);
        }
    }
}
This extracts each product's name, price, and image URL. Here's what your terminal would look like.
product-name: Karissa V-Neck Tee
product-price: $32.00
product-image: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/ws10-red_main.jpg
product-name: Karmen Yoga Pant
product-price: $39.00
product-image: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wp01-gray_main.jpg
// ... truncated for brevity ... //
Step 5: Export the Scraped Data to CSV
Exporting scraped data to file formats, such as CSV, is often essential for quick and easy analysis. You can do this in Java using the built-in FileWriter class.
Follow the steps below to do this.
Start by initializing an empty list to store scraped data.
// import the required classes
// ...
import java.util.ArrayList;
import java.util.List;

public class Parser {
    // ...

    // initialize an empty list to store scraped product data
    private static List<String[]> productData = new ArrayList<>();
    // ...
}
After that, modify the extractProductData() function to add the scraped data to this list.
public class Parser {
    // ...

    // function to extract product details from the current page
    private static void extractProductData(Document document) {
        // ...
        // store the product details in the data list
        productData.add(new String[]{productName, price, imageUrl});
    }
}
Next, create a function to write the scraped data to CSV. Within this function, initialize a FileWriter, write the CSV headers, and populate the rows with the scraped data.
// import the required libraries
// ...
import java.io.FileWriter;

public class Parser {
    // ...

    // method to save data to a CSV file
    private static void exportDataToCsv(String filePath) {
        // initialize a FileWriter
        try (FileWriter writer = new FileWriter(filePath)) {
            // write the headers
            writer.append("Product Name,Price,Image URL\n");
            // populate the data rows with the scraped data
            for (String[] row : productData) {
                writer.append(String.join(",", row));
                writer.append("\n");
            }
            System.out.println("Data saved to " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
That's it.
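One caveat: String.join(",", row) produces broken columns if a field ever contains a comma, quote, or newline. If that can happen with your data, a small escaping helper like the hypothetical sketch below (not part of the tutorial code) keeps the CSV valid; you'd wrap each field with escapeCsv() before joining.
// hypothetical helper: quote a CSV field and escape embedded quotes
private static String escapeCsv(String field) {
    if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }
    return field;
}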
Now, combine all the steps above and call the exportDataToCsv() function in main() once the crawl loop completes.
// import the required classes
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Parser {
    // initialize an empty list to store scraped product data
    private static List<String[]> productData = new ArrayList<>();

    public static void main(String[] args) {
        // open the pagination links file
        try (BufferedReader br = new BufferedReader(new FileReader("path\\to\\pagination_links.txt"))) {
            String url;
            // loop through the file and read links line by line
            while ((url = br.readLine()) != null) {
                // log progress
                System.out.println("Crawling: " + url);
                // fetch and parse the HTML content using Jsoup
                Document document = Jsoup.connect(url).get();
                // extract product information
                extractProductData(document);
            }
            // export the scraped data to CSV once all pages are processed
            exportDataToCsv("product_data.csv");
        } catch (IOException e) {
            System.err.println("Error reading the file");
            e.printStackTrace();
        }
    }

    // function to extract product details from the current page
    private static void extractProductData(Document document) {
        // select all product items on the current page
        Elements products = document.select("li.product");
        // loop through each item
        for (Element product : products) {
            // extract the product name, price, and image URL
            String productName = product.select(".product-name").text();
            String price = product.select(".product-price").text();
            String imageUrl = product.select(".product-image").attr("src");
            // store the product details in the data list
            productData.add(new String[]{productName, price, imageUrl});
        }
    }

    // method to save the data to a CSV file
    private static void exportDataToCsv(String filePath) {
        // initialize a FileWriter
        try (FileWriter writer = new FileWriter(filePath)) {
            // write the headers
            writer.append("Product Name,Price,Image URL\n");
            // populate the data rows with the scraped data
            for (String[] row : productData) {
                writer.append(String.join(",", row));
                writer.append("\n");
            }
            System.out.println("Data saved to " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This creates a new product_data.csv file in your project's root directory and populates its rows with the scraped data.
Here's a sample screenshot of the result.

Congratulations! You now know how to use Apache Nutch to crawl websites and export scraped data to CSV.
Avoid Getting Blocked While Crawling With Apache Nutch
Getting blocked is a common challenge when web crawling. This is because web crawlers often exhibit bot-like patterns that are easily flagged by anti-bot solutions.
See for yourself.
In an attempt to crawl an Antibot Challenge page using Apache Nutch, we get the following error response.
...fetch of https://www.scrapingcourse.com/antibot-challenge failed with: Http code=403, url=https://www.scrapingcourse.com/antibot-challenge
This error code (403 Forbidden) means the target server understood your request but refused to fulfill it because the site's anti-bot protection flagged Apache Nutch as an automated client.
Common recommendations for overcoming this challenge include rotating proxies and setting custom user agents. However, these measures do not work against advanced anti-bot solutions.
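For example, you could point Nutch at a proxy and use a more browser-like agent string in nutch-site.xml, as sketched below (the property names come from conf/nutch-default.xml, and the proxy values are placeholders). Against a protected page like this one, though, the crawler still gets detected and blocked.
<property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (compatible; MyCrawler/1.0)</value>
</property>
<property>
    <name>http.proxy.host</name>
    <value>your.proxy.host</value>
</property>
<property>
    <name>http.proxy.port</name>
    <value>8080</value>
</property>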
To guarantee you can crawl any website without getting blocked, consider ZenRows' Universal Scraper API, the most reliable solution for scalable web crawling.
ZenRows is a complete web scraping toolkit that handles every anti-bot solution for you, allowing you to focus on extracting your desired data. Some of its features include advanced anti-bot bypass out of the box, geo-located requests, fingerprinting evasion, actual user spoofing, request header management, and more.
Here's ZenRows in action against the same anti-bot challenge where Apache Nutch failed.
To follow along in this example, sign up to get your free API key.
Completing your sign-up will take you to the Request Builder page, where you'll find your API key at the top right.

Input your target URL and activate Premium Proxies and JS Rendering boost mode.
Next, select the Java language and choose the API option. ZenRows works with any language and provides ready-to-use snippets for the most popular ones.
Copy the generated code on the right to your editor for testing.
Your code should look like this:
import org.apache.hc.client5.http.fluent.Request;

public class APIRequest {
    public static void main(final String... args) throws Exception {
        String apiUrl = "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.scrapingcourse.com%2Fantibot-challenge&js_render=true&premium_proxy=true";
        String response = Request.get(apiUrl)
                .execute().returnContent().asString();
        System.out.println(response);
    }
}
This code bypasses the anti-bot challenge and retrieves the HTML.
Remember to add the Apache HttpClient Fluent dependency to your pom.xml file, as shown below.
<dependency>
    <groupId>org.apache.httpcomponents.client5</groupId>
    <artifactId>httpclient5-fluent</artifactId>
    <version>5.4.1</version>
</dependency>
Run the code, and you'll get the following result:
<html lang="en">
<head>
<!-- ... -->
<title>Antibot Challenge - ScrapingCourse.com</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<h2>
You bypassed the Antibot challenge! :D
</h2>
<!-- other content omitted for brevity -->
</body>
</html>
Congratulations! You're now well-equipped to crawl any website without getting blocked.
Conclusion
You've learned how to crawl websites using Apache Nutch. From setting up your project to integrating with other Java frameworks, here's a quick recap of your progress.
You now know how to:
- Crawl specific links.
- Extract data from collected links.
- Export scraped data to CSV.
Bear in mind that to take advantage of your crawling skills, you must first overcome anti-bot challenges. Nutch is a useful crawling tool. However, advanced anti-bot solutions will always block your Nutch crawler.
Consider ZenRows, an easy-to-implement and scalable solution to crawl any website without getting blocked.