Apache Nutch is an open-source, production-ready web crawler with an extensible interface that lets you fetch, parse, store, and index web pages for easy searching and querying.
It is pluggable, modular, and easy to maintain: you can quickly find hyperlinks, check for broken links, and handle duplicates using basic commands.
This tutorial will guide you through crawling websites using Apache Nutch. You'll learn how to discover links, follow them, and extract valuable data as your crawler navigates each page. Let's get started!
Prerequisites
To follow along in this tutorial, ensure you meet the following requirements:
- Java Development Kit (JDK) 11 or newer.
- Unix environment, or Windows Cygwin environment.
- Your preferred IDE. We'll be using Visual Studio Code in this tutorial.
Build Your First Apache Nutch Web Crawler
To demonstrate how to crawl websites using Apache Nutch, we'll use the ScrapingCourse E-commerce test site as a target page.

By the end of this tutorial, you'll have a functional Apache Nutch web crawler that can discover product links, follow them, and extract product information (product name, price, and image URL).
Step 1: Set Up Apache Nutch
Before we dive in, let's take a step back to understand how the tool works.
Apache Nutch is a batch-based crawler that relies on plugins for custom implementation. At its core, it consists of two main components:
- Crawldb: This stands for the Crawl database, which stores and tracks all URLs, whether crawled or not. It also contains the metadata of these links.
- Segments: These are directories containing the content fetched during each crawl, including links and parsed text.
You'll better understand the role of these components when we get hands-on. For now, here's a quick overview of the Apache Nutch crawl cycle:
- A list of seed URLs is injected into the Crawldb.
- Nutch visits these URLs and fetches their content.
- Then, it parses the retrieved response into various fields stored in the segment directory.
- It pushes discovered links to the Crawldb for another crawl cycle.
You can automate all this from the command line.
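For reference, here's a rough sketch of how you could script that cycle in a shell. It uses the same commands and directory layout (crawl/crawldb, crawl/segments, urls) we'll walk through step by step below.
# sketch: run a few crawl rounds with Nutch's core commands
bin/nutch inject crawl/crawldb urls
for round in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments
    segment=$(ls -d crawl/segments/2* | tail -1)
    bin/nutch fetch $segment
    bin/nutch parse $segment
    bin/nutch updatedb crawl/crawldb $segment
done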
Now that that's out of the way, let's set up Apache Nutch and begin crawling.
First, navigate to a directory where you'd like to store your code and download the Apache Nutch binary package (apache-nutch-1.X-bin.zip).
Unzip this binary package.
unzip apache-nutch-1.20-bin.zip
You'll find a folder with the format apache-nutch-1.X. Change directory (cd) into this folder and enter the following command to verify your installation.
bin/nutch
If done correctly, this command will output the version and an overview of the Nutch commands and their use cases, as shown below.
nutch 1.20
Usage: nutch COMMAND [-Dproperty=value]... [command-specific args]...
where COMMAND is one of:
readdb read / dump crawl db
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
generate generate new segments to fetch from crawl db
freegen generate new segments to fetch from text files
fetch fetch a segment's pages
parse parse a segment's pages
readseg read / dump segment data
mergesegs merge several segments, with optional filtering and slicing
updatedb update crawl db from segments after fetching
# ... truncated for brevity ... #
Lastly, note that Nutch 1.x is built on Apache Hadoop data structures. On Linux and macOS, the bundled Hadoop libraries run in local mode without extra setup, but on Windows, the Hadoop native binaries must be available on your machine to avoid errors like the one below.
java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset
Follow the steps below to set up Hadoop on Windows.
Download winutils from a trusted source. This is a supplementary tool that gives Windows the Hadoop native binaries it needs. We'll use the cdarlint GitHub project, which provides winutils.exe.
git clone https://github.com/cdarlint/winutils/
This command clones the entire repository, which contains multiple Hadoop versions. Choose one and set the HADOOP_HOME environment variable to the root path of that version, for example, C:\winutils\hadoop-3.3.6.
Also, add the bin directory of that Hadoop version (C:\winutils\hadoop-3.3.6\bin) to your system PATH environment variable.
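If you prefer the command line to the Environment Variables dialog, you can set HADOOP_HOME from a regular Windows Command Prompt with setx, as sketched below (adjust the path to the version you picked; setx only takes effect in new terminal sessions).
setx HADOOP_HOME "C:\winutils\hadoop-3.3.6"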
That's it!
Step 2: Access the Target Website
Apache Nutch requires some configurations before you can begin crawling.
First, you must customize your crawl properties. The conf/nutch-default.xml file contains the default crawl properties, which you can modify or use as is, depending on your project needs. At the same time, Nutch provides the conf/nutch-site.xml file for defining custom crawl properties that override those in nutch-default.xml.
For the most basic implementation, you only need to set the http.agent.name property. Thus, add the following XML snippet to your nutch-site.xml file.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>http.agent.name</name>
        <value>Nutch Crawler</value>
    </property>
</configuration>
While configuring other HTTP agent settings could be helpful, this is the only required one.
If you have permission to scrape your target website, you could configure Apache Nutch to bypass its robots.txt rules using the http.robot.rules.allowlist property. That said, it's important to respect website rules and scrape ethically.
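You can also add a few optional properties alongside http.agent.name to keep the crawl polite and scoped to the target site. The snippet below is only a sketch; the property names are taken from conf/nutch-default.xml, so confirm them (and their defaults) against your Nutch version.
<property>
    <name>fetcher.server.delay</name>
    <value>2.0</value>
</property>
<property>
    <name>db.ignore.external.links</name>
    <value>true</value>
</property>
Here, fetcher.server.delay spaces out successive requests to the same host, and db.ignore.external.links keeps links outside the seed domain out of the Crawldb.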
After that, define your seed URL(s). Apache Nutch requires you to create a text file (urls/seed.txt) containing the list of target URLs, one per line.
Run the following command to create the urls/ directory.
mkdir -p urls
Then, create a seed.txt file and write the target URL to it.
echo "https://www.scrapingcourse.com/ecommerce/" >> urls/seed.txt
Lastly, navigate to conf/regex-urlfilter.txt. This is where you define which URLs Nutch should include in or exclude from your crawl.
Each rule is a regex pattern prefixed with a plus (+) sign to include matching URLs or a minus (-) sign to exclude them.
At this stage of the tutorial, we simply want to access the website. So, add the following regex pattern to your regex-urlfilter.txt file.
# accept only URLs under the target site
+^https://www\.scrapingcourse\.com/ecommerce/.*
This tells Nutch to limit crawling to the target site and ignore links that do not match. If your copy of the file still ends with the default catch-all rule (+.), remove or comment it out so it doesn't accept everything.
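For context, here's a hypothetical set of rules, modeled on the ones shipped in the default regex-urlfilter.txt, showing how exclude and include patterns combine. Rules are evaluated from top to bottom, and the first match decides whether a URL is kept.
# skip common binary and asset extensions
-\.(gif|jpg|png|css|js|zip|exe)$
# skip URLs containing characters that usually indicate queries or sessions
-[?*!@=]
# accept anything under the target site
+^https://www\.scrapingcourse\.com/ecommerce/.*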
Now, you can begin crawling.
Run the command below to inject your seed URL into the crawl database.
bin/nutch inject crawl/crawldb urls
This creates a crawl/crawldb directory, which acts as a web database containing your seed URL.
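You can confirm the injection with the readdb command, which prints statistics such as the total number of URLs and their fetch status.
bin/nutch readdb crawl/crawldb -stats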
Next, generate a fetch list from the database.
bin/nutch generate crawl/crawldb crawl/segments
This command queues the seed URLs for crawling and places the fetch list in a new segment directory (under crawl/segments/) named after its timestamp.
You'll need to reference this directory in subsequent commands. We recommend saving the name of this segment in a shell variable for reusability.
s1=`ls -d crawl/segments/2* | tail -1`
This saves the current segment's path as s1.
After that, fetch the web page of the current segment using the fetch command.
bin/nutch fetch $s1
Lastly, parse the fetched web page to retrieve various fields, including the HTML content, text, and hyperlinks.
bin/nutch parse $s1
The results are stored in the segment as different data fields: content, parse_data, parse_text, crawl_fetch, and crawl_parse.
- content: stores the HTML content.
- parse_data: contains the metadata for each URL.
- parse_text: holds the HTML text content.
- crawl_fetch: contains the fetch status and HTTP response data.
- crawl_parse: includes parse metadata.
That's it! You've made your first crawl using Apache Nutch.
Run the read segment command below to access and view your results.
bin/nutch readseg -dump $s1 output
This command exports the crawled data from the segment into a new folder named output. The dump file it creates contains all the data fields mentioned above.
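If you only want a quick overview of the segment rather than the full dump, the readseg command also supports a -list option that prints summary information, such as the number of fetched and parsed URLs.
bin/nutch readseg -list $s1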
Here's the HTML content for reference.
<html lang="en">
<head>
<!-- ... -->
<title>
Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com
</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<div class="beta site-title">
<a href="https://www.scrapingcourse.com/ecommerce/" rel="home">
Ecommerce Test Site to Learn Web Scraping
</a>
</div>
<!-- other content omitted for brevity -->
</body>
</html>
Step 3: Follow Links With Apache Nutch
Now, let's scale your crawler to find and follow specific links. For this step, we'll focus on only pagination links, as they contain product information we'll scrape later.
Do you recall the regex-urlfilter.txt file where you defined a regex pattern to limit crawling to links associated with the target URL?
If you want Nutch to crawl only the pagination links, you must modify this file with a new regex pattern that matches the pagination structure.
To achieve this, inspect the page to identify the pagination format. Visit the target website on a Chrome browser, right-click on a pagination element, and select Inspect. This will open the Developer Tools window, as shown in the image below.

You'll notice that there are 12 pagination links, all ending with the format /page/{number}/. Using this information, create a custom URL filter (regex pattern) instructing Nutch to crawl only pagination links.
^https://www\.scrapingcourse\.com/ecommerce/page/[0-9]+/$
Add this rule to your regex-urlfilter.txt file. Note that URL filter rules are evaluated top to bottom, and the first matching rule decides whether a URL is accepted.
# accept pagination links
+^https://www\.scrapingcourse\.com/ecommerce/page/[0-9]+/$
Now, generate a new fetch list to reflect this change and save this segment in a new shell variable (s2).
bin/nutch generate crawl/crawldb crawl/segments
s2=`ls -d crawl/segments/2* | tail -1`
Then, following the same steps as before, fetch and parse the web page.
bin/nutch fetch $s2
bin/nutch parse $s2
This parses the results into the different fields mentioned earlier. However, at this point, we're only interested in the links. Thus, run this command to exclude other fields and view only the crawled links.
bin/nutch readseg -dump $s2 outputdir -nocontent -nofetch -nogenerate -noparse -noparsetext
You'll get an outputdir directory with a dump file containing the crawled links, including duplicates.
Outlinks: 16
outlink: toUrl: https://www.scrapingcourse.com/ecommerce/ anchor: Skip to content
outlink: toUrl: https://www.scrapingcourse.com/ecommerce/page/2/ anchor: 2
outlink: toUrl: https://www.scrapingcourse.com/ecommerce/page/3/ anchor: 3
outlink: toUrl: https://www.scrapingcourse.com/ecommerce/page/4/ anchor: 4
outlink: toUrl: https://www.scrapingcourse.com/ecommerce/page/10/ anchor: 10
outlink: toUrl: https://www.scrapingcourse.com/ecommerce/page/11/ anchor: 11
outlink: toUrl: https://www.scrapingcourse.com/ecommerce/page/12/ anchor: 12
# ... truncated for brevity ... #
Here, you'll notice that pages 5, 6, 7, 8, and 9 are absent. Why is that?
If you open the target page in a browser, you'll see that the HTML displayed does not include these pages (5, 6, 7, 8, and 9).

Therefore, Apache Nutch couldn't find these links in the first crawl. You need to update Crawldb with the found links and crawl until it covers the entire pagination chain.
Run the command below to update Crawldb with the result of the first crawl.
bin/nutch updatedb crawl/crawldb $s2
Now that the database contains new entries, generate a new fetch list and save the segment in a new shell variable (s3). Then, repeat the fetch and parse process.
bin/nutch generate crawl/crawldb crawl/segments
s3=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s3
bin/nutch parse $s3
You should now have all the pagination links. Run the read segment command on the new segment ($s3) to view your result, this time dumping into a fresh directory so it doesn't clash with the earlier dump.
bin/nutch readseg -dump $s3 outputdir2 -nocontent -nofetch -nogenerate -noparse -noparsetext
You'll notice that the dump file contains all the pagination links but is a bit verbose. Let's isolate the pagination links using text-processing command-line tools (grep and awk).
Enter the following command to isolate pagination links from the dump file.
cat outputdir2/dump | awk '{print $3}' | awk '!seen[$0]++' | grep -E "/ecommerce/page/[0-9]+/$" > pagination_links.txt
This command filters the result and saves the pagination links in a new text file named pagination_links.txt.
Here's an overview of what each command does:
- awk '{print $3}': prints only the third column of each line. In this case, it strips keys such as outlink: and toUrl:, leaving only the URL column.
- awk '!seen[$0]++': removes duplicate lines.
- grep -E "/ecommerce/page/[0-9]+/$": selects links that end with the format /ecommerce/page/{number}/.
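Print the file to confirm the extracted links.
cat pagination_links.txt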
Your terminal should look like this:
# ... omitted for brevity ... #
https://www.scrapingcourse.com/ecommerce/page/10/
https://www.scrapingcourse.com/ecommerce/page/11/
https://www.scrapingcourse.com/ecommerce/page/12/
Awesome! You've crawled specific links using Apache Nutch.
Step 4: Extract Data From Collected Links
The next step is to extract product information from the crawled links.
Apache Nutch offers the parsefilter-regex plugin, which allows you to customize your parse result using regular expressions. You can define these rules in a separate file or directly in the nutch-site.xml file.
The rule format is as follows: <name> | <source> | <regex>, where <name> is your field of interest, <source> is either html or text, and <regex> is the regular expression that extracts the desired data from the source.
For example:
<!-- ... -->
<property>
    <name>parsefilter.regex.rules</name>
    <value>
        product_name html <regex>
    </value>
</property>
After setting these rules, you'll need to rerun your crawl and access the result in the dump file as in the previous steps.
However, processing the output with this approach can get tedious and requires a lot of manual configuration.
In that case, we recommend integrating with a Java HTML-parsing library like Jsoup. We'll navigate each pagination link and extract the product name, price, and image URL.
Below is a step-by-step guide:
To begin, create a Java project in your project directory and add Jsoup as a dependency. We'll use Maven to manage dependencies, so include the following XML snippet in the <dependencies> section of your pom.xml.
<dependency>
    <!-- jsoup HTML parser library @ https://jsoup.org/ -->
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.18.3</version>
</dependency>
Next, create a Parser Java class (Parser.java) and prepare to write your code.
In your Java class, open the pagination_links.txt file using Java's BufferedReader class. Then, loop through the file, read each line, and fetch the HTML document of each URL using Jsoup.
// import the required classes
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Parser {
    public static void main(String[] args) {
        // open the pagination links file
        try (BufferedReader br = new BufferedReader(new FileReader("path\\to\\pagination_links.txt"))) {
            String url;
            // loop through the file and read links line by line
            while ((url = br.readLine()) != null) {
                // log progress
                System.out.println("Crawling: " + url);
                // fetch and parse the HTML content using Jsoup
                Document document = Jsoup.connect(url).get();
            }
        } catch (IOException e) {
            System.err.println("Error reading the file");
            e.printStackTrace();
        }
    }
}
Next, create a scraping logic to extract each product's name, price, and image URL from each page.
To achieve this, inspect a product card to identify the right selectors for the desired data points.

You'll notice that each product is a list item with the class product. The following HTML elements within the list items represent each data point:
- Product name: <h2> with the class product-name.
- Product price: <span> element with the class product-price.
- Product image: <img> tag with the class product-image.
Using this information, instruct Jsoup to select all product cards, loop through them, and extract their product name, price, and image URL.
We recommend abstracting this logic for a cleaner and modular code.
// import the required classes
// ...
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Parser {
    // ...

    // function to extract product details from the current page
    private static void extractProductData(Document document) {
        // select all product items on the current page
        Elements products = document.select("li.product");
        // loop through each item
        for (Element product : products) {
            // extract the product name, price, and image URL
            String productName = product.select(".product-name").text();
            String price = product.select(".product-price").text();
            String imageUrl = product.select(".product-image").attr("src");
            // log the result
            System.out.println("product-name: " + productName);
            System.out.println("product-price: " + price);
            System.out.println("product-image: " + imageUrl);
        }
    }
}
That's it!
Now, call the extractProductData() function in main() and combine all the steps to get the following complete code:
// import the required classes
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Parser {
    public static void main(String[] args) {
        // open the pagination links file
        try (BufferedReader br = new BufferedReader(new FileReader("path\\to\\pagination_links.txt"))) {
            String url;
            // loop through the file and read links line by line
            while ((url = br.readLine()) != null) {
                // log progress
                System.out.println("Crawling: " + url);
                // fetch and parse the HTML content using Jsoup
                Document document = Jsoup.connect(url).get();
                // extract product information
                extractProductData(document);
            }
        } catch (IOException e) {
            System.err.println("Error reading the file");
            e.printStackTrace();
        }
    }

    // function to extract product details from the current page
    private static void extractProductData(Document document) {
        // select all product items on the current page
        Elements products = document.select("li.product");
        // loop through each item
        for (Element product : products) {
            // extract the product name, price, and image URL
            String productName = product.select(".product-name").text();
            String price = product.select(".product-price").text();
            String imageUrl = product.select(".product-image").attr("src");
            // log the result
            System.out.println("product-name: " + productName);
            System.out.println("product-price: " + price);
            System.out.println("product-image: " + imageUrl);
        }
    }
}
This extracts each product's name, price, and image URL. Here's what your terminal would look like.
product-name: Karissa V-Neck Tee
product-price: $32.00
product-image: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/ws10-red_main.jpg
product-name: Karmen Yoga Pant
product-price: $39.00
product-image: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wp01-gray_main.jpg
// ... truncated for brevity ... //
Step 5: Export the Scraped Data to CSV
Exporting scraped data to file formats, such as CSV, is often essential for quick and easy analysis. You can do this in Java using the built-in FileWriter class.
Follow the steps below to do this.
Start by initializing an empty list to store scraped data.
// import the required classes
// ...
import java.util.ArrayList;
import java.util.List;

public class Parser {
    // ...

    // initialize an empty list to store scraped product data
    private static List<String[]> productData = new ArrayList<>();
    // ...
}
After that, modify the extractProductData() function to add the scraped data to this list.
public class Parser {
    // ...

    // function to extract product details from the current page
    private static void extractProductData(Document document) {
        // ...
        // store the product details in the data list
        productData.add(new String[]{productName, price, imageUrl});
    }
}
Next, create a function to write the scraped data to CSV. Within this function, initialize a FileWriter, write the CSV headers, and populate the rows with the scraped data.
// import the required libraries
// ...
import java.io.FileWriter;

public class Parser {
    // ...

    // method to save data to a CSV file
    private static void exportDataToCsv(String filePath) {
        // initialize a FileWriter
        try (FileWriter writer = new FileWriter(filePath)) {
            // write the headers
            writer.append("Product Name,Price,Image URL\n");
            // populate the data rows with the scraped data
            for (String[] row : productData) {
                writer.append(String.join(",", row));
                writer.append("\n");
            }
            System.out.println("Data saved to " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
That's it.
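One caveat: String.join(",", row) produces broken columns if a field ever contains a comma, quote, or newline. If that can happen with your data, a small escaping helper like the hypothetical sketch below (not part of the tutorial code) keeps the CSV valid; you'd wrap each field with escapeCsv() before joining.
// hypothetical helper: quote a CSV field and escape embedded quotes
private static String escapeCsv(String field) {
    if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }
    return field;
}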
Now, combine all the steps above and call the exportDataToCsv() function in main() once the crawl loop completes.
// import the required classes
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Parser {
    // initialize an empty list to store scraped product data
    private static List<String[]> productData = new ArrayList<>();

    public static void main(String[] args) {
        // open the pagination links file
        try (BufferedReader br = new BufferedReader(new FileReader("path\\to\\pagination_links.txt"))) {
            String url;
            // loop through the file and read links line by line
            while ((url = br.readLine()) != null) {
                // log progress
                System.out.println("Crawling: " + url);
                // fetch and parse the HTML content using Jsoup
                Document document = Jsoup.connect(url).get();
                // extract product information
                extractProductData(document);
            }
            // export the scraped data to CSV once all pages are processed
            exportDataToCsv("product_data.csv");
        } catch (IOException e) {
            System.err.println("Error reading the file");
            e.printStackTrace();
        }
    }

    // function to extract product details from the current page
    private static void extractProductData(Document document) {
        // select all product items on the current page
        Elements products = document.select("li.product");
        // loop through each item
        for (Element product : products) {
            // extract the product name, price, and image URL
            String productName = product.select(".product-name").text();
            String price = product.select(".product-price").text();
            String imageUrl = product.select(".product-image").attr("src");
            // store the product details in the data list
            productData.add(new String[]{productName, price, imageUrl});
        }
    }

    // method to save the data to a CSV file
    private static void exportDataToCsv(String filePath) {
        // initialize a FileWriter
        try (FileWriter writer = new FileWriter(filePath)) {
            // write the headers
            writer.append("Product Name,Price,Image URL\n");
            // populate the data rows with the scraped data
            for (String[] row : productData) {
                writer.append(String.join(",", row));
                writer.append("\n");
            }
            System.out.println("Data saved to " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This creates a new product_data.csv file in your project's root directory and populates its rows with the scraped data.
Here's a sample screenshot of the result.

Congratulations! You now know how to use Apache Nutch to crawl websites and export scraped data to CSV.
Avoid Getting Blocked While Crawling With Apache Nutch
Getting blocked is a common challenge when web crawling. This is because web crawlers often exhibit bot-like patterns that are easily flagged by anti-bot solutions.
See for yourself.
In an attempt to crawl an Antibot Challenge page using Apache Nutch, we get the following error response.
...fetch of https://www.scrapingcourse.com/antibot-challenge failed with: Http code=403, url=https://www.scrapingcourse.com/antibot-challenge
This error code (403 Forbidden) means the target server understood your request but refused to fulfill it because the site's anti-bot protection flagged Apache Nutch as an automated client.
Common recommendations for overcoming this challenge include rotating proxies and setting custom user agents. However, these measures do not work against advanced anti-bot solutions.
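For example, you could point Nutch at a proxy and use a more browser-like agent string in nutch-site.xml, as sketched below (the property names come from conf/nutch-default.xml, and the proxy values are placeholders). Against a protected page like this one, though, the crawler still gets detected and blocked.
<property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (compatible; MyCrawler/1.0)</value>
</property>
<property>
    <name>http.proxy.host</name>
    <value>your.proxy.host</value>
</property>
<property>
    <name>http.proxy.port</name>
    <value>8080</value>
</property>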
To guarantee you can crawl any website without getting blocked, consider ZenRows' Universal Scraper API, the most reliable solution for scalable web crawling.
ZenRows is a complete web scraping toolkit that handles every anti-bot solution for you, allowing you to focus on extracting your desired data. Some of its features include advanced anti-bot bypass out of the box, geo-located requests, fingerprinting evasion, actual user spoofing, request header management, and more.
Here's ZenRows in action against the same anti-bot challenge where Apache Nutch failed.
To follow along in this example, sign up to get your free API key.
Completing your sign-up will take you to the Request Builder page, where you'll find your API key at the top right.

Input your target URL and activate Premium Proxies and JS Rendering boost mode.
Next, select the Java language and choose the API option. ZenRows works with any language and provides ready-to-use snippets for the most popular ones.
Copy the generated code on the right to your editor for testing.
Your code should look like this:
import org.apache.hc.client5.http.fluent.Request;

public class APIRequest {
    public static void main(final String... args) throws Exception {
        String apiUrl = "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.scrapingcourse.com%2Fantibot-challenge&js_render=true&premium_proxy=true";
        String response = Request.get(apiUrl)
                .execute().returnContent().asString();
        System.out.println(response);
    }
}
This code bypasses the anti-bot challenge and retrieves the HTML.
Remember to add the Apache HttpClient Fluent dependency to your pom.xml file, as shown below.
<dependency>
    <groupId>org.apache.httpcomponents.client5</groupId>
    <artifactId>httpclient5-fluent</artifactId>
    <version>5.4.1</version>
</dependency>
Run the code, and you'll get the following result:
<html lang="en">
<head>
<!-- ... -->
<title>Antibot Challenge - ScrapingCourse.com</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<h2>
You bypassed the Antibot challenge! :D
</h2>
<!-- other content omitted for brevity -->
</body>
</html>
Congratulations! You're now well-equipped to crawl any website without getting blocked.
Conclusion
You've learned how to crawl websites using Apache Nutch. From setting up your project to integrating with other Java frameworks, here's a quick recap of your progress.
You now know how to:
- Crawl specific links.
- Extract data from collected links.
- Export scraped data to CSV.
Bear in mind that to take advantage of your crawling skills, you must first overcome anti-bot challenges. Nutch is a useful crawling tool. However, advanced anti-bot solutions will always block your Nutch crawler.
Consider ZenRows, an easy-to-implement and scalable solution to crawl any website without getting blocked.