How to Crawl a Website With Apache StormCrawler

Sergio Nonide
December 24, 2024 · 7 min read

Do you want to crawl multiple pages with Java? StormCrawler, a feature-rich web crawling tool, can help you build a robust and efficient solution for web crawling projects of any scale.

In this tutorial, you'll learn what Apache StormCrawler is and how to use it for web crawling through a step-by-step guide.

Let's go!

What Is Apache StormCrawler?

StormCrawler is an open-source software development toolkit (SDK) designed for building large-scale and customizable web crawlers and web scrapers in Java. It leverages Apache Storm, a free and scalable real-time computational system, to efficiently manage distributed crawling and data processing tasks.

Apache StormCrawler has many components. The key ones include the crawl topology, spouts, and bolts. Let's briefly explain what each component does:

  • Spout: A component that feeds URLs into the topology as a stream of tuples. StormCrawler ships with spouts that read URLs from seed files (such as UTF-8 .txt files) or directly from memory.
  • Bolts: Components that process data streams, performing tasks such as HTML parsing, data extraction, filtering, and storing results in formats like CSV, JSON, or databases. Bolts typically process data emitted by spouts or other bolts within a crawl topology.
  • Crawl Topology: Defines the workflow for processing crawl tasks by connecting other components (e.g., spouts and bolts) in a directed path. It determines how data flows within the Apache Storm platform.

Let's go through the requirements for building a web crawler with StormCrawler and Apache Storm.

Prerequisites

Before starting this Apache StormCrawler tutorial, ensure you have the following tools and environments:

  • Java Development Kit (JDK): This tutorial uses JDK 17 to compile and run StormCrawler projects.
  • Python: Apache Storm's command-line client (storm.py) is a Python script, so you'll run Storm commands through Python. Download and install the latest Python version and add it to your system's PATH.
  • A Java Integrated Development Environment (IDE): Choose a suitable Java development environment, such as IntelliJ IDEA or Visual Studio Code. This tutorial uses IntelliJ IDEA.
  • Dependency Manager: This tutorial uses Maven, the recommended build tool for setting up StormCrawler.

Ready to crawl a website with StormCrawler and Apache Storm? Let's build your first StormCrawler bot!

Build Your First StormCrawler Web Crawler

Your StormCrawler spider will crawl all the pages of the ScrapingCourse.com e-commerce demo page and extract specific data, including product names, prices, and image URLs.

See the target website layout below:

ScrapingCourse.com Ecommerce homepage

You'll scrape data from all 12 pages of this target site.

Step 1: Set up Apache Storm

The first step is to get the latest version of Apache Storm. Download the Apache Storm tar.gz file from the official download page and extract it to a preferred location on your machine.

Look inside Apache Storm's bin directory, and you'll see several files, including a storm.py file. This Python script is Apache Storm's command-line client, which you'll use to run your topology.

Open a terminal in the bin directory and run the following command to confirm the installation:

Terminal
python storm.py version

The above command should return your Apache Storm version information similar to the one below:

Output
# ... 
Storm 2.7.1
URL https://ghp_...e4834a017d
Branch v2.7.1
Compiled by rui on 2024-11-18T18:08Z
From source with checksum a1bbdc50252215dfa9c6c01c691d117f

As mentioned, Apache Storm provides features like distributed stream processing, real-time data management, and fault tolerance. However, in this tutorial, you'll focus on running your StormCrawler crawler locally using Apache Storm.

Are you ready? Let's build a StormCrawler project from the source.

Step 2: Build StormCrawler From the Source

Generating your StormCrawler project from the official Maven archetype gives you the standard codebase structure. Create a new crawler project folder and open it via the command line.

Run the following command to build a StormCrawler project interactively using Maven:

Terminal
mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-archetype -DarchetypeVersion=3.1.0

The above command will open an interactive console to configure your project. Although you can use your preferred setup, this tutorial uses the following configuration:

Example
Define value for property 'http-agent-name' (should match expression '^[a-zA-Z_\-]+$'): MyCrawler
Define value for property 'http-agent-version': 1.0
Define value for property 'http-agent-description': A StormCrawler bot for collecting publicly available data
Define value for property 'http-agent-url': https://github.com/<YOUR_USERNAME>/
Define value for property 'http-agent-email' (should match expression '^\S+@\S+\.\S+$'): example@gmail.com
Define value for property 'groupId': com.tutorial
Define value for property 'artifactId': stormcrawler-tutorial
Define value for property 'version' 1.0-SNAPSHOT: 1.0
Define value for property 'package' com.tutorial: com.tutorial

Once the build concludes, open the project via your IDE. You should now see the following default files and subfolders.

StormCrawler Project Directory

Create a seeds.txt file in your project root directory. The StormCrawler spout will emit the URLs in this file for parsing. Add the target URL to this file as shown below:

seeds.txt
https://www.scrapingcourse.com/ecommerce/

Now, open the crawler-conf.yaml file and add the full path to the seeds.txt file:

crawler-conf.yaml
# ...
config:
  # ...
  urlSpout.seeds.file: '<FULL_PROJECT_PATH>/seeds.txt'

Go to the src/main/java/com.tutorial directory containing the CrawlTopology.java class. Then, create the com.tutorial.spouts and com.tutorial.bolts packages. To create those packages, right-click com.tutorial, go to New, select Package, and type your package name.

Create a URLSpout.java class inside com.tutorial.spouts (right-click the spouts package and go to New, then click Java class and enter your class name). Next, open com.tutorial.bolts and add the CSVExportBolt.java and ParseBolt.java classes.

Your project directory should look like the following after the above modifications:

StormCrawler Final Project Directory

Let's now load URLs with the URLSpout class.

Step 3: Load Target URLs With Spout

The next step is to emit URLs from seeds.txt. Let's update the URLSpout class to achieve that.

Open the URLSpout class and import the following packages:

URLSpout.java
package com.tutorial.spouts;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

Update the URLSpout class to extend Storm's BaseRichSpout. This class declares a BufferedReader and an output collector, which it uses to read and emit URLs:

URLSpout.java
//...
public class URLSpout extends BaseRichSpout {
   private static final Logger logger = Logger.getLogger(URLSpout.class.getName());

   private SpoutOutputCollector collector;
   private BufferedReader reader;
   private boolean eofReached;
}

Declare an open method to set up resources and connect to the seeds.txt file path (replace it with your project path):

URLSpout.java
//...
public class URLSpout extends BaseRichSpout {
   //...
   @Override
   public void open(Map<String, Object> config, TopologyContext context, SpoutOutputCollector collector) {
       this.collector = collector;
       try {
           // initialize the BufferedReader with the seed file path
           // replace the placeholder with your project root full path
           reader = new BufferedReader(new FileReader("<FULL_PROJECT_PATH>/seeds.txt"));
           eofReached = false; // initialize EOF flag
       } catch (IOException e) {
           logger.log(Level.SEVERE, "Error opening file: {0}", e.getMessage());
       }
   }
}
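
Note that the seed file path is hardcoded here for simplicity. If you'd rather reuse the urlSpout.seeds.file entry you added to crawler-conf.yaml, a small optional tweak (not part of the generated project) is to read the path from the config map that Storm passes to open:

Example
// inside open(): read the seed file path from the topology configuration
// instead of hardcoding it (matches the urlSpout.seeds.file key in crawler-conf.yaml)
String seedsFile = (String) config.get("urlSpout.seeds.file");
reader = new BufferedReader(new FileReader(seedsFile));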

Define a nextTuple method to implement the URL emitter. Apache Storm calls this method continuously to fetch the next URL from seeds.txt. Once every URL has been emitted, the method sets eofReached to true and stops emitting:

URLSpout.java
//...
public class URLSpout extends BaseRichSpout {
   //...
   @Override
   public void nextTuple() {
       if (eofReached) {
           return; // stop emitting tuples once EOF is reached
       }

       try {
           // read a line (URL) from the seed file
           String url = reader.readLine();
           if (url != null) {
               // log and emit the URL as a tuple
               logger.log(Level.INFO, "Emitting URL: {0}", url); // Log the emitted URL
               collector.emit(new Values(url));
           } else {
               // reached end of file
               logger.info("No more URLs to emit.");
               eofReached = true;  // set EOF flag
           }
       } catch (IOException e) {
           logger.log(Level.SEVERE, "Error reading file: {0}", e.getMessage());
       }
   }
}

Finally, implement a close method that shuts down the spout and releases its resources. Then, declare the output field the spout emits:

URLSpout.java
public class URLSpout extends BaseRichSpout {
//...

   @Override
   public void close() {
       try {
           // close the BufferedReader
           if (reader != null) {
               reader.close();
           }
       } catch (IOException e) {
           logger.log(Level.SEVERE, "Error closing file: {0}", e.getMessage());
       }
   }

   @Override
   public void declareOutputFields(OutputFieldsDeclarer declarer) {
       // declare the output field for the spout
       declarer.declare(new Fields("url"));
   }
}

Merge all the snippets, and you'll get the following final code:

URLSpout.java
package com.tutorial.spouts;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class URLSpout extends BaseRichSpout {
   private static final Logger logger = Logger.getLogger(URLSpout.class.getName());

   private SpoutOutputCollector collector;
   private BufferedReader reader;
   private boolean eofReached;

   @Override
   public void open(Map<String, Object> config, TopologyContext context, SpoutOutputCollector collector) {
       this.collector = collector;
       try {
           // initialize the BufferedReader with the seed file path
           // replace the placeholder with your project root full path
           reader = new BufferedReader(new FileReader("<FULL_PROJECT_PATH>/seeds.txt"));
           eofReached = false; // initialize EOF flag
       } catch (IOException e) {
           logger.log(Level.SEVERE, "Error opening file: {0}", e.getMessage());
       }
   }

   @Override
   public void nextTuple() {
       if (eofReached) {
           return; // stop emitting tuples once EOF is reached
       }

       try {
           // read a line (URL) from the seed file
           String url = reader.readLine();
           if (url != null) {
               // log and emit the URL as a tuple
               logger.log(Level.INFO, "Emitting URL: {0}", url); // Log the emitted URL
               collector.emit(new Values(url));
           } else {
               // reached end of file
               logger.info("No more URLs to emit.");
               eofReached = true;  // set EOF flag
           }
       } catch (IOException e) {
           logger.log(Level.SEVERE, "Error reading file: {0}", e.getMessage());
       }
   }

   @Override
   public void close() {
       try {
           // close the BufferedReader
           if (reader != null) {
               reader.close();
           }
       } catch (IOException e) {
           logger.log(Level.SEVERE, "Error closing file: {0}", e.getMessage());
       }
   }

   @Override
   public void declareOutputFields(OutputFieldsDeclarer declarer) {
       // declare the output field for the spout
       declarer.declare(new Fields("url"));
   }
}
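
One small, optional refinement: Storm calls nextTuple in a tight loop, so once the seed file is exhausted, this spout keeps spinning without doing any work. A common pattern is to back off briefly whenever there's nothing left to emit, for example:

Example
// inside nextTuple(), when there is nothing left to emit:
if (eofReached) {
    // pause briefly so the spout doesn't busy-wait
    org.apache.storm.utils.Utils.sleep(1000);
    return;
}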

You've set up your crawler to read URLs from seeds.txt. Let's parse data from the emitted URLs.

Step 4: Create a Parsing Bolt to Extract Data

The parser bolt retrieves URLs emitted by the URLSpout class for HTML parsing. You'll use this class to extract product names, prices, and image URLs from the target website. 

Let's start by parsing the HTML from the current URL in seeds.txt.

Open the ParseBolt class in the com.tutorial.bolts package and import the following packages:

ParseBolt.java
package com.tutorial.bolts;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Update the ParseBolt class to extend BaseBasicBolt and declare an execute method to parse HTML from the emitted URL:

ParseBolt.java
//...
public class ParseBolt extends BaseBasicBolt {
   @Override
   @SuppressWarnings("CallToPrintStackTrace")
   public void execute(Tuple tuple, BasicOutputCollector collector) {
       // retrieve the URL emitted by the URLSpout
       String url = tuple.getStringByField("url");
      
       try {
           // fetch the page content using Jsoup
           Document doc = Jsoup.connect(url).get();

       } catch (IOException e) {
           // handle error in case the URL is not accessible
           e.printStackTrace();
       }
   }

   @Override
   public void declareOutputFields(OutputFieldsDeclarer declarer) {

   }
}

Now, let's update the above code to extract specific product data. 

Since we aim to extract data from all 12 product pages, update the seeds.txt file with the URLs of all the pages. This modification ensures that the URLSpout class emits all the pages on the website. 

Here's the modified seeds.txt file:

seeds.txt
https://www.scrapingcourse.com/ecommerce/page/1/
https://www.scrapingcourse.com/ecommerce/page/2/
https://www.scrapingcourse.com/ecommerce/page/3/
https://www.scrapingcourse.com/ecommerce/page/4/
https://www.scrapingcourse.com/ecommerce/page/5/
https://www.scrapingcourse.com/ecommerce/page/6/
https://www.scrapingcourse.com/ecommerce/page/7/
https://www.scrapingcourse.com/ecommerce/page/8/
https://www.scrapingcourse.com/ecommerce/page/9/
https://www.scrapingcourse.com/ecommerce/page/10/
https://www.scrapingcourse.com/ecommerce/page/11/
https://www.scrapingcourse.com/ecommerce/page/12/
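
Typing these URLs out by hand is fine for 12 pages, but if you prefer to generate the file programmatically, here's a minimal optional helper (SeedsGenerator is a hypothetical class name, not part of the generated project) that writes the same list:

Example
import java.io.FileNotFoundException;
import java.io.PrintWriter;

public class SeedsGenerator {
    public static void main(String[] args) throws FileNotFoundException {
        // write the 12 paginated URLs to seeds.txt in the current directory
        try (PrintWriter writer = new PrintWriter("seeds.txt")) {
            for (int page = 1; page <= 12; page++) {
                writer.println("https://www.scrapingcourse.com/ecommerce/page/" + page + "/");
            }
        }
    }
}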

Modify the execute method of the ParseBolt class with the scraping logic that extracts the target data from each product page:

ParseBolt.java
//...
public class ParseBolt extends BaseBasicBolt {
   //...
   public void execute(Tuple tuple, BasicOutputCollector collector) {
       //...
       try {
            //...
           // select all product elements
           Elements products = doc.select(".product");
           List<Map<String, Object>> productList = new ArrayList<>();

           for (Element product : products) {
               // extract individual product details
               String productName = product.select(".product-name").text();
               String productPrice = product.select(".price").text();
               String productImage = product.select("img").attr("src");

               // add placeholders if values are missing
               if (productName.isEmpty()) productName = "Unknown Product";
               if (productPrice.isEmpty()) productPrice = "N/A";
               if (productImage.isEmpty()) productImage = "http://example.com/default.jpg";

               // create a map for the product details
               Map<String, Object> productMap = new HashMap<>();
               productMap.put("Name", productName);
               productMap.put("Price", productPrice);
               productMap.put("Image URL", productImage);

               // add the product map to the product list
               productList.add(productMap);

               // emit the product details to the next bolt
               collector.emit(new Values(productName, productPrice, productImage));
           }

           // log the complete product list for debugging
           System.out.println("Extracted Products: " + productList);


       } catch (IOException e) {
           // handle error in case the URL is not accessible
           e.printStackTrace();
       }
   }
   //... declareOutputFields method omitted
}

Update the declareOutputFields with the expected fields:

Example
declarer.declare(new Fields("productName", "productPrice", "productImage"));

Combine the snippets, and you'll get the following complete code:

ParseBolt.java
package com.tutorial.bolts;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseBolt extends BaseBasicBolt {
   @Override
   @SuppressWarnings("CallToPrintStackTrace")
   public void execute(Tuple tuple, BasicOutputCollector collector) {
       // retrieve the URL emitted by the URLSpout
       String url = tuple.getStringByField("url");
      
       try {
           // fetch the page content using Jsoup
           Document doc = Jsoup.connect(url).get();

           // select all product elements
           Elements products = doc.select(".product");
           List<Map<String, Object>> productList = new ArrayList<>();

           for (Element product : products) {
               // extract individual product details
               String productName = product.select(".product-name").text();
               String productPrice = product.select(".price").text();
               String productImage = product.select("img").attr("src");

               // add placeholders if values are missing
               if (productName.isEmpty()) productName = "Unknown Product";
               if (productPrice.isEmpty()) productPrice = "N/A";
               if (productImage.isEmpty()) productImage = "http://example.com/default.jpg";

               // create a map for the product details
               Map<String, Object> productMap = new HashMap<>();
               productMap.put("Name", productName);
               productMap.put("Price", productPrice);
               productMap.put("Image URL", productImage);

               // add the product map to the product list
               productList.add(productMap);

               // emit the product details to the next bolt
               collector.emit(new Values(productName, productPrice, productImage));
           }

           // log the complete product list for debugging
           System.out.println("Extracted Products: " + productList);


       } catch (IOException e) {
           // handle error in case the URL is not accessible
           e.printStackTrace();
       }
   }

   @Override
   public void declareOutputFields(OutputFieldsDeclarer declarer) {
       // declare the fields that this bolt will emit
       declarer.declare(new Fields("productName", "productPrice", "productImage"));
   }
}
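
As a minor, optional refinement, Jsoup lets you set an explicit user agent and request timeout on the connection, which can make fetches more predictable (the user-agent string below is just an illustrative placeholder):

Example
// optional: replace Jsoup.connect(url).get() with a more explicit request
Document doc = Jsoup.connect(url)
        .userAgent("MyCrawler/1.0")
        .timeout(10_000) // 10-second timeout, in milliseconds
        .get();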

Your StormCrawler bot will now parse data from URLs emitted from the spout. Let's export the scraped data to a CSV file.

Step 5: Export the Scraped Data to CSV

Data storage is essential for further processing, referencing, sharing, and more. Let's modify the CSV bolt to write the extracted data into a CSV file.

Open the CSVExportBolt class and import the following packages:

CSVExportBolt.java
package com.tutorial.bolts;

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

Update this class to extend the BaseBasicBolt class and specify the expected CSV file path (use your project's full path). Write the CSV header once, then insert the extracted product data as rows:

CSVExportBolt.java
//...
public class CSVExportBolt extends BaseBasicBolt {

   // define the CSV file path
   // replace the placeholder with your project root full path
   private static final String CSV_FILE_PATH = "<FULL_PROJECT_PATH>/products.csv";
   private boolean isFirstWrite = true;

   @Override
   @SuppressWarnings("CallToPrintStackTrace")
   // retrieve the scraped data information
   public void execute(Tuple tuple, BasicOutputCollector collector) {
       String productName = tuple.getStringByField("productName");
       String productPrice = tuple.getStringByField("productPrice");
       String productImage = tuple.getStringByField("productImage");

       try (PrintWriter writer = new PrintWriter(new FileWriter(CSV_FILE_PATH, !isFirstWrite))) {
           // write header only once
           if (isFirstWrite) {
               writer.println("\"Name\",\"Price\",\"Image URL\"");
               isFirstWrite = false;
           }
           writer.println(String.format("\"%s\",\"%s\",\"%s\"", productName, productPrice, productImage));
           System.out.printf("Written to CSV: %s, %s, %s%n", productName, productPrice, productImage);
       } catch (IOException e) {
           e.printStackTrace();
       }
   }

   @Override
   public void declareOutputFields(OutputFieldsDeclarer declarer) {
   }
}

Here's the complete code after merging the snippets:

CSVExportBolt.java
package com.tutorial.bolts;

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class CSVExportBolt extends BaseBasicBolt {

   // define the CSV file path
   private static final String CSV_FILE_PATH = "<FULL_PROJECT_PATH>/products.csv";
   private boolean isFirstWrite = true;

   @Override
   @SuppressWarnings("CallToPrintStackTrace")
   // retrieve the scraped data information
   public void execute(Tuple tuple, BasicOutputCollector collector) {
       String productName = tuple.getStringByField("productName");
       String productPrice = tuple.getStringByField("productPrice");
       String productImage = tuple.getStringByField("productImage");

       try (PrintWriter writer = new PrintWriter(new FileWriter(CSV_FILE_PATH, !isFirstWrite))) {
           // write header only once
           if (isFirstWrite) {
               writer.println("\"Name\",\"Price\",\"Image URL\"");
               isFirstWrite = false;
           }
           writer.println(String.format("\"%s\",\"%s\",\"%s\"", productName, productPrice, productImage));
           System.out.printf("Written to CSV: %s, %s, %s%n", productName, productPrice, productImage);
       } catch (IOException e) {
           e.printStackTrace();
       }
   }

   @Override
   public void declareOutputFields(OutputFieldsDeclarer declarer) {
   }
}
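
One caveat: the simple quoting above assumes the scraped values never contain double quotes themselves. If that ever becomes a concern, a minimal escaping helper (an optional sketch, not part of the generated project) could double embedded quotes, following the usual CSV convention:

Example
public class CsvEscapeExample {
    // wrap a raw value in quotes and double any embedded double quotes
    static String toCsvField(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // prints: "Hoodie ""Limited"" Edition"
        System.out.println(toCsvField("Hoodie \"Limited\" Edition"));
    }
}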

You're almost there! It's time to run the crawl topology.

Step 6: Run the Crawl Topology

The final step is to submit the crawl topology locally and execute the web crawling task. 

Open the CrawlTopology class inside the com.tutorial package and import the required packages, including the bolts and spout you created previously:

CrawlTopology.java
package com.tutorial;

import org.apache.storm.topology.TopologyBuilder;
import org.apache.stormcrawler.ConfigurableTopology;

import com.tutorial.bolts.CSVExportBolt;
import com.tutorial.bolts.ParseBolt;
import com.tutorial.spouts.URLSpout;

Modify the default CrawlTopology class to link the bolts and spout using the TopologyBuilder object. Then, submit the topology:

CrawlTopology.java
//...
public class CrawlTopology extends ConfigurableTopology {

   public static void main(String[] args) throws Exception {
      ConfigurableTopology.start(new CrawlTopology(), args);
   }

   @Override
   protected int run(String[] args) {
      TopologyBuilder builder = new TopologyBuilder();

      // call spout: URLSpout
      builder.setSpout("url-spout", new URLSpout());

      // call bolt: ParseBolt
      builder.setBolt("parse-bolt", new ParseBolt())
            .shuffleGrouping("url-spout");

      // call bolt: CSVExportBolt
      builder.setBolt("csv-export", new CSVExportBolt())
            .shuffleGrouping("parse-bolt");

      // submit the crawl topology
      return submit("crawl-topology", conf, builder);
   }
}

The complete CrawlTopology class looks like this after merging the snippets:

CrawlTopology.java
package com.tutorial;

import org.apache.storm.topology.TopologyBuilder;
import org.apache.stormcrawler.ConfigurableTopology;

import com.tutorial.bolts.CSVExportBolt;
import com.tutorial.bolts.ParseBolt;
import com.tutorial.spouts.URLSpout;

public class CrawlTopology extends ConfigurableTopology {

   public static void main(String[] args) throws Exception {
      ConfigurableTopology.start(new CrawlTopology(), args);
   }

   @Override
   protected int run(String[] args) {
      TopologyBuilder builder = new TopologyBuilder();

      // call spout: URLSpout
      builder.setSpout("url-spout", new URLSpout());

      // call bolt: ParseBolt
      builder.setBolt("parse-bolt", new ParseBolt())
            .shuffleGrouping("url-spout");

      // call bolt: CSVExportBolt
      builder.setBolt("csv-export", new CSVExportBolt())
            .shuffleGrouping("parse-bolt");

      // submit the crawl topology
      return submit("crawl-topology", conf, builder);
   }
}
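
If you later want to scale out, setSpout and setBolt also accept a parallelism hint that runs multiple executors of a component. A quick illustration (note that CSVExportBolt, as written, assumes a single instance because it writes to one file):

Example
// run two ParseBolt executors in parallel; tuples are shuffled between them
builder.setBolt("parse-bolt", new ParseBolt(), 2)
      .shuffleGrouping("url-spout");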

Build the project using mvn:

Terminal
mvn clean package

Now, open the bin directory of Apache Storm via the command line and execute the StormCrawler bot with the storm.py command. Ensure you replace the path placeholder with your project's full path:

Terminal
python storm.py local <FULL_PROJECT_PATH>\target\stormcrawler-tutorial-1.0.jar --local-ttl 60 com.tutorial.CrawlTopology -- -conf <FULL_PROJECT_PATH>\crawler-conf.yaml

The above command executes the crawl topology and generates a products.csv file containing the extracted data. You'll find the CSV file inside your project root folder:

Extracted Data in CSV  File

Great job! 👏 You've now created a StormCrawler bot that extracts data from multiple pages.

That said, anti-bots are a major challenge you'll often encounter during web crawling. Let's find out how to avoid these anti-bot measures in the next section.

Avoid Getting Blocked While Crawling With StormCrawler

Web crawlers send many requests in quick succession, which makes them prone to anti-bot detection and often results in blocking.

For instance, the current StormCrawler web crawler won't work with a protected site like the anti-bot challenge page. You can try it out by replacing the URLs in seeds.txt with the URL of the protected web page:

seeds.txt
https://www.scrapingcourse.com/antibot-challenge

Remember to comment out the CSVExportBolt registration in the CrawlTopology class, since the modified ParseBolt will no longer emit the fields that bolt expects. See the snippet below.
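
For reference, the run method in CrawlTopology would then look roughly like this, with the CSV bolt registration commented out:

CrawlTopology.java
//...
   @Override
   protected int run(String[] args) {
      TopologyBuilder builder = new TopologyBuilder();

      builder.setSpout("url-spout", new URLSpout());

      builder.setBolt("parse-bolt", new ParseBolt())
            .shuffleGrouping("url-spout");

      // temporarily disabled: the modified ParseBolt below no longer emits
      // the fields this bolt expects
      // builder.setBolt("csv-export", new CSVExportBolt())
      //       .shuffleGrouping("parse-bolt");

      return submit("crawl-topology", conf, builder);
   }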

Modify the ParseBolt class to scrape the website's full-page HTML:

ParseBolt.java
package com.tutorial.bolts;

import java.io.IOException;

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseBolt extends BaseBasicBolt {
   @Override
   @SuppressWarnings("CallToPrintStackTrace")
   public void execute(Tuple tuple, BasicOutputCollector collector) {
       // retrieve the URL emitted by the URLSpout
       String url = tuple.getStringByField("url");
      
       try {
           // fetch the page content using Jsoup
           Document doc = Jsoup.connect(url).get();
          
           // print the full-page HTML
           System.out.println("Extracted HTML: " + doc);

       } catch (IOException e) {
           // handle error in case the URL is not accessible
           e.printStackTrace();
       }
   }

   @Override
   public void declareOutputFields(OutputFieldsDeclarer declarer) {

   }
}

The crawler gets blocked with the following 403 Forbidden error:

Output
HTTP error fetching URL. Status=403, URL=[https://www.scrapingcourse.com/antibot-challenge]

Although StormCrawler features a proxy manager for setting up web scraping proxies, this solution doesn't offer full-scale stealth.
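
Also note that this tutorial's ParseBolt fetches pages with Jsoup rather than StormCrawler's built-in fetcher, so a basic proxy would have to be set on the Jsoup connection itself. Here's a minimal sketch, assuming placeholder proxy host and port values:

Example
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ProxyFetchExample {
    public static void main(String[] args) throws IOException {
        // <PROXY_HOST> and 8080 are placeholders; substitute your own proxy details
        Document doc = Jsoup.connect("https://www.scrapingcourse.com/ecommerce/")
                .proxy("<PROXY_HOST>", 8080)
                .get();
        System.out.println(doc.title());
    }
}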

The best way to bypass anti-bots while scraping multiple pages is to use a web scraping API such as ZenRows. The ZenRows Scraper API provides all the tooling required for efficient web crawling, handling premium proxy rotation, request header management, advanced fingerprint spoofing, JavaScript execution, anti-bot auto-bypass, and more with a single API call.

Let's see how it works by scraping the anti-bot challenge page that blocked your crawler.

Sign up to open the ZenRows Request Builder. Paste the target URL in the link box, and activate Premium Proxies and JS Rendering.

Choose Java as your programming language and select the API connection mode. Copy and paste the generated code into your crawler file:

building a scraper with zenrows

Here's the generated code:

Example
import org.apache.hc.client5.http.fluent.Request;

public class APIRequest {
   public static void main(final String... args) throws Exception {
       String apiUrl = "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.scrapingcourse.com%2Fantibot-challenge&js_render=true&premium_proxy=true";
       String response = Request.get(apiUrl)
               .execute().returnContent().asString();

       System.out.println(response);
   }
}

The above code accesses the protected web page and extracts its HTML, as shown:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>

Congratulations! 🎉 You just bypassed an anti-bot protection using ZenRows.

Conclusion

You've seen how to crawl an entire website using Apache StormCrawler and Apache Storm. You now know how to:

  • Set up Apache Storm and build StormCrawler from the source.
  • Create a spout and use it to emit URLs from seeds.txt.
  • Use a parser bolt to parse HTML from emitted URLs and extract specific data.
  • Export scraped data to CSV using a CSV exporter bolt.
  • Create a custom crawl topology and execute the crawling task with Apache Storm. 

However, remember that despite being an excellent web crawling tool, StormCrawler can't handle anti-bot measures effectively. The easiest way to crawl any website at scale without getting blocked is to use ZenRows, an all-in-one web scraping solution.

Try ZenRows for free—no credit card required!
