Do you want to crawl multiple pages with Java? StormCrawler, a feature-rich web crawling tool, can help you build a robust and efficient solution for web crawling projects of any scale.
In this tutorial, you'll learn what Apache StormCrawler is and how to use it for web crawling through a step-by-step guide.
- Step 1: Set up Apache Storm.
- Step 2: Build StormCrawler from the source.
- Step 3: Load the target URLs with Spout.
- Step 4: Create parsing bolt to extract data.
- Step 5: Export the scraped data to CSV.
- Step 6: Run the crawl topology.
Let's go!
What Is Apache StormCrawler?
StormCrawler is an open-source software development toolkit (SDK) designed for building large-scale and customizable web crawlers and web scrapers in Java. It leverages Apache Storm, a free and scalable real-time computational system, to efficiently manage distributed crawling and data processing tasks.
Apache StormCrawler has many components. The key ones include the crawl topology, spouts, and bolts. Let's briefly explain what each component does:
- Spout: A class designed to read URLs from files with configurable encodings, such as UTF-8 .txt files. It also includes a sub-component for reading URLs directly from memory.
- Bolts: Components that process data streams, performing tasks such as HTML parsing, data extraction, filtering, and storing results in formats like CSV, JSON, or databases. Bolts typically process data emitted by spouts or other bolts within a crawl topology.
- Crawl Topology: Defines the workflow for processing crawl tasks by connecting other components (e.g., spouts and bolts) in a directed path. It determines how data flows within the Apache Storm platform.
Let's go through the requirements for building a web crawler with StormCrawler and Apache Storm.
Prerequisites
Before starting this Apache StormCrawler tutorial, ensure you have the following tools and environments:
- Java Development Kit (JDK): This tutorial uses JDK 17 to compile and run StormCrawler projects.
- Python: Apache Storm's command-line client is a Python script, so download and install the latest Python version and add it to your system's PATH (see the version-check commands after this list).
- A Java Integrated Development Environment (IDE): Choose a suitable Java development environment, such as IntelliJ IDEA or Visual Studio Code. This tutorial uses IntelliJ IDEA for its built-in Java and Maven support.
- Dependency Manager: This tutorial uses Maven, the recommended build tool for setting up StormCrawler.
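To confirm the toolchain is ready, you can check each tool's version from a terminal (exact output varies by machine):
java -version
python --version
mvn -v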
Ready to crawl a website with StormCrawler and Apache Storm? Let's build your first StormCrawler bot!
Build Your First StormCrawler Web Crawler
Your StormCrawler spider will crawl all the pages of the e-commerce challenge page and extract specific data, including product names, prices, and image URLs.
See the target website layout below:
You'll scrape data from all 12 pages of this target site.
The source code for this tutorial is also available on GitHub.
Step 1: Set up Apache Storm
The first step is to get the latest version of Apache Storm. Download the Apache Storm tar.gz file from the official download page and extract it to a preferred location on your machine.
Look inside Apache Storm's bin directory, and you'll see a couple of files, including a storm.py file. This Python script is Apache Storm's command-line client, which you'll use to manage and run crawls.
Open the command line in the bin directory and run the following command to confirm a successful setup:
python storm.py version
The above command should return your Apache Storm version information similar to the one below:
# ...
Storm 2.7.1
URL https://ghp_...e4834a017d
Branch v2.7.1
Compiled by rui on 2024-11-18T18:08Z
From source with checksum a1bbdc50252215dfa9c6c01c691d117f
As mentioned, Apache Storm provides features like distributed stream processing, real-time data management, and fault tolerance. However, in this tutorial, you'll focus on running your StormCrawler crawler locally using Apache Storm.
Are you ready? Let's build a StormCrawler project from the source.
Step 2: Build StormCrawler From the Source
Building StormCrawler from the source generates the standard codebase structure for your project. Create a new crawler project folder and open it via the command line.
Run the following command to build a StormCrawler project interactively using Maven:
mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler -DarchetypeArtifactId=stormcrawler-archetype -DarchetypeVersion=3.1.0
The above command will open an interactive console to configure your project. Although you can use your preferred setup, this tutorial uses the following configuration:
Define value for property 'http-agent-name' (should match expression '^[a-zA-Z_\-]+$'): MyCrawler
Define value for property 'http-agent-version': 1.0
Define value for property 'http-agent-description': A StormCrawler bot for collecting publicly available data
Define value for property 'http-agent-url': https://github.com/<YOUR_USERNAME>/
Define value for property 'http-agent-email' (should match expression '^\S+@\S+\.\S+$'): example@gmail.com
Define value for property 'groupId': com.tutorial
Define value for property 'artifactId': stormcrawler-tutorial
Define value for property 'version' 1.0-SNAPSHOT: 1.0
Define value for property 'package' com.tutorial: com.tutorial
Once the build concludes, open the project via your IDE. You should now see the following default files and subfolders.
Create a seeds.txt file in your project root directory. The StormCrawler spout will emit the URLs in this file for parsing. Add the target URL to this file as shown below:
https://www.scrapingcourse.com/ecommerce/
Now, open the crawler-conf.yaml file and add the full path to the seeds.txt file:
# ...
config:
# ...
urlSpout.seeds.file: '<FULL_PROJECT_PATH>/seeds.txt'
Go to the src/main/java/com.tutorial directory containing the CrawlTopology.java class. Then, create the com.tutorial.spouts and com.tutorial.bolts packages. To create those packages, right-click com.tutorial, go to New, select Package, and type your package name.
Create a URLSpout.java class inside com.tutorial.spouts (right-click the spouts package, go to New, then click Java Class and enter your class name). Next, open com.tutorial.bolts and add the CSVExportBolt.java and ParseBolt.java classes.
Your project directory should look like the following after the above modifications:
Let's now load URLs with the URLSpout class.
Step 3: Load Target URLs With Spout
The next step is to emit URLs from seeds.txt. Let's update the URLSpout class to achieve that.
Open the URLSpout class and import the following packages:
package com.tutorial.spouts;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
Update the URLSpout class to extend Storm's BaseRichSpout. The class declares a BufferedReader and a collector for reading and emitting URLs:
//...
public class URLSpout extends BaseRichSpout {
private static final Logger logger = Logger.getLogger(URLSpout.class.getName());
private SpoutOutputCollector collector;
private BufferedReader reader;
private boolean eofReached;
}
Declare an open method to set up resources and open the seeds.txt file (replace the placeholder with your project path):
//...
public class URLSpout extends BaseRichSpout {
//...
@Override
public void open(Map<String, Object> config, TopologyContext context, SpoutOutputCollector collector) {
this.collector = collector;
try {
// initialize the BufferedReader with the seed file path
// replace the placeholder with your project root full path
reader = new BufferedReader(new FileReader("<FULL_PROJECT_PATH>/seeds.txt"));
eofReached = false; // initialize EOF flag
} catch (IOException e) {
logger.log(Level.SEVERE, "Error opening file: {0}", e.getMessage());
}
}
}
Define a nextTuple method to implement the URL emitter. Apache Storm runs this method continuously to fetch the next URL from seeds.txt. Once every URL has been emitted, the method sets eofReached to true and stops emitting:
//...
public class URLSpout extends BaseRichSpout {
//...
@Override
public void nextTuple() {
if (eofReached) {
return; // stop emitting tuples once EOF is reached
}
try {
// read a line (URL) from the seed file
String url = reader.readLine();
if (url != null) {
// log and emit the URL as a tuple
logger.log(Level.INFO, "Emitting URL: {0}", url); // Log the emitted URL
collector.emit(new Values(url));
} else {
// reached end of file
logger.info("No more URLs to emit.");
eofReached = true; // set EOF flag
}
} catch (IOException e) {
logger.log(Level.SEVERE, "Error reading file: {0}", e.getMessage());
}
}
}
Finally, implement a close method that shuts down the spout and releases its resources. Then, declare the output field:
public class URLSpout extends BaseRichSpout {
//...
@Override
public void close() {
try {
// close the BufferedReader
if (reader != null) {
reader.close();
}
} catch (IOException e) {
logger.log(Level.SEVERE, "Error closing file: {0}", e.getMessage());
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
// declare the output field for the spout
declarer.declare(new Fields("url"));
}
}
Merge all the snippets, and you'll get the following final code:
package com.tutorial.spouts;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
public class URLSpout extends BaseRichSpout {
private static final Logger logger = Logger.getLogger(URLSpout.class.getName());
private SpoutOutputCollector collector;
private BufferedReader reader;
private boolean eofReached;
@Override
public void open(Map<String, Object> config, TopologyContext context, SpoutOutputCollector collector) {
this.collector = collector;
try {
// initialize the BufferedReader with the seed file path
// replace the placeholder with your project root full path
reader = new BufferedReader(new FileReader("<FULL_PROJECT_PATH>/seeds.txt"));
eofReached = false; // initialize EOF flag
} catch (IOException e) {
logger.log(Level.SEVERE, "Error opening file: {0}", e.getMessage());
}
}
@Override
public void nextTuple() {
if (eofReached) {
return; // stop emitting tuples once EOF is reached
}
try {
// read a line (URL) from the seed file
String url = reader.readLine();
if (url != null) {
// log and emit the URL as a tuple
logger.log(Level.INFO, "Emitting URL: {0}", url); // Log the emitted URL
collector.emit(new Values(url));
} else {
// reached end of file
logger.info("No more URLs to emit.");
eofReached = true; // set EOF flag
}
} catch (IOException e) {
logger.log(Level.SEVERE, "Error reading file: {0}", e.getMessage());
}
}
@Override
public void close() {
try {
// close the BufferedReader
if (reader != null) {
reader.close();
}
} catch (IOException e) {
logger.log(Level.SEVERE, "Error closing file: {0}", e.getMessage());
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
// declare the output field for the spout
declarer.declare(new Fields("url"));
}
}
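One optional refinement: open() above hardcodes the seed file path, which means the urlSpout.seeds.file entry you added to crawler-conf.yaml isn't actually consumed. If you'd rather drive the path from the config, a minimal variation of open() could look like this (assuming the YAML file is passed to the topology at submission time, as done in Step 6):
// optional variation: read the seed file path from the topology configuration
// instead of hardcoding it (assumes urlSpout.seeds.file is set in crawler-conf.yaml)
@Override
public void open(Map<String, Object> config, TopologyContext context, SpoutOutputCollector collector) {
    this.collector = collector;
    Object seedsPath = config.get("urlSpout.seeds.file");
    try {
        reader = new BufferedReader(new FileReader(String.valueOf(seedsPath)));
        eofReached = false;
    } catch (IOException e) {
        logger.log(Level.SEVERE, "Error opening file: {0}", e.getMessage());
    }
}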
You've set up your crawler to read URLs from seeds.txt. Let's parse data from the emitted URLs.
Step 4: Create a Parsing Bolt to Extract Data
The parser bolt retrieves URLs emitted by the URLSpout class for HTML parsing. You'll use this class to extract product names, prices, and image URLs from the target website.
Let's start by parsing the HTML from the current URL in seeds.txt.
Open the ParseBolt class in the com.tutorial.bolts package and import the following packages:
package com.tutorial.bolts;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
Update the ParseBolt class to extend BaseBasicBolt and declare an execute method to parse HTML from the emitted URL:
//...
public class ParseBolt extends BaseBasicBolt {
@Override
@SuppressWarnings("CallToPrintStackTrace")
public void execute(Tuple tuple, BasicOutputCollector collector) {
// retrieve the URL emitted by the URLSpout
String url = tuple.getStringByField("url");
try {
// fetch the page content using Jsoup
Document doc = Jsoup.connect(url).get();
} catch (IOException e) {
// handle error in case the URL is not accessible
e.printStackTrace();
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
}
Now, let's update the above code to extract specific product data.
Since we aim to extract data from all 12 product pages, update the seeds.txt file with the URLs of all the pages. This modification ensures that the URLSpout class emits the URL of every page on the website (you can also generate this file programmatically, as shown after the list).
Here's the modified seeds.txt file:
https://www.scrapingcourse.com/ecommerce/page/1/
https://www.scrapingcourse.com/ecommerce/page/2/
https://www.scrapingcourse.com/ecommerce/page/3/
https://www.scrapingcourse.com/ecommerce/page/4/
https://www.scrapingcourse.com/ecommerce/page/5/
https://www.scrapingcourse.com/ecommerce/page/6/
https://www.scrapingcourse.com/ecommerce/page/7/
https://www.scrapingcourse.com/ecommerce/page/8/
https://www.scrapingcourse.com/ecommerce/page/9/
https://www.scrapingcourse.com/ecommerce/page/10/
https://www.scrapingcourse.com/ecommerce/page/11/
https://www.scrapingcourse.com/ecommerce/page/12/
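Typing the 12 URLs by hand works fine, but you can also generate the file with a short throwaway program. Here's a minimal sketch (the page count of 12 matches the target site at the time of writing):
import java.io.FileNotFoundException;
import java.io.PrintWriter;
public class GenerateSeeds {
    public static void main(String[] args) throws FileNotFoundException {
        // write one paginated URL per line into seeds.txt in the current directory
        try (PrintWriter writer = new PrintWriter("seeds.txt")) {
            for (int page = 1; page <= 12; page++) {
                writer.println("https://www.scrapingcourse.com/ecommerce/page/" + page + "/");
            }
        }
    }
}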
Modify the ParseBolt class with the scraping logic that extracts the target data from the product page:
//...
public class ParseBolt extends BaseBasicBolt {
//...
try {
//...
// select all product elements
Elements products = doc.select(".product");
List<Map<String, Object>> productList = new ArrayList<>();
for (Element product : products) {
// extract individual product details
String productName = product.select(".product-name").text();
String productPrice = product.select(".price").text();
String productImage = product.select("img").attr("src");
// add placeholders if values are missing
if (productName.isEmpty()) productName = "Unknown Product";
if (productPrice.isEmpty()) productPrice = "N/A";
if (productImage.isEmpty()) productImage = "http://example.com/default.jpg";
// create a map for the product details
Map<String, Object> productMap = new HashMap<>();
productMap.put("Name", productName);
productMap.put("Price", productPrice);
productMap.put("Image URL", productImage);
// add the product map to the product list
productList.add(productMap);
// emit the product details to the next bolt
collector.emit(new Values(productName, productPrice, productImage));
}
// log the complete product list for debugging
System.out.println("Extracted Products: " + productList);
} catch (IOException e) {
// handle error in case the URL is not accessible
e.printStackTrace();
}
}
//... declareOutputFields method omitted
}
Update the declareOutputFields method with the expected fields:
declarer.declare(new Fields("productName", "productPrice", "productImage"));
Combine the snippets, and you'll get the following complete code:
package com.tutorial.bolts;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ParseBolt extends BaseBasicBolt {
@Override
@SuppressWarnings("CallToPrintStackTrace")
public void execute(Tuple tuple, BasicOutputCollector collector) {
// retrieve the URL emitted by the URLSpout
String url = tuple.getStringByField("url");
try {
// fetch the page content using Jsoup
Document doc = Jsoup.connect(url).get();
// select all product elements
Elements products = doc.select(".product");
List<Map<String, Object>> productList = new ArrayList<>();
for (Element product : products) {
// extract individual product details
String productName = product.select(".product-name").text();
String productPrice = product.select(".price").text();
String productImage = product.select("img").attr("src");
// add placeholders if values are missing
if (productName.isEmpty()) productName = "Unknown Product";
if (productPrice.isEmpty()) productPrice = "N/A";
if (productImage.isEmpty()) productImage = "http://example.com/default.jpg";
// create a map for the product details
Map<String, Object> productMap = new HashMap<>();
productMap.put("Name", productName);
productMap.put("Price", productPrice);
productMap.put("Image URL", productImage);
// add the product map to the product list
productList.add(productMap);
// emit the product details to the next bolt
collector.emit(new Values(productName, productPrice, productImage));
}
// log the complete product list for debugging
System.out.println("Extracted Products: " + productList);
} catch (IOException e) {
// handle error in case the URL is not accessible
e.printStackTrace();
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
// declare the fields that this bolt will emit
declarer.declare(new Fields("productName", "productPrice", "productImage"));
}
}
Your StormCrawler bot will now parse data from URLs emitted from the spout. Let's export the scraped data to a CSV file.
Step 5: Export the Scraped Data to CSV
Data storage is essential for further processing, referencing, sharing, and more. Let's modify the CSV bolt to write the extracted data into a CSV file.
Open the CSVExportBolt class and import the following packages:
package com.tutorial.bolts;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;
Update this class to extend the BaseBasicBolt class and specify the expected CSV file path (use your project's full path). Write the CSV headers and insert the extracted product data into rows:
//...
public class CSVExportBolt extends BaseBasicBolt {
// define the CSV file path
// replace the placeholder with your project root full path
private static final String CSV_FILE_PATH = "<FULL_PROJECT_PATH>/products.csv";
private boolean isFirstWrite = true;
@Override
@SuppressWarnings("CallToPrintStackTrace")
// retrieve the scraped data information
public void execute(Tuple tuple, BasicOutputCollector collector) {
String productName = tuple.getStringByField("productName");
String productPrice = tuple.getStringByField("productPrice");
String productImage = tuple.getStringByField("productImage");
try (PrintWriter writer = new PrintWriter(new FileWriter(CSV_FILE_PATH, !isFirstWrite))) {
// write header only once
if (isFirstWrite) {
writer.println("\"Name\",\"Price\",\"Image URL\"");
isFirstWrite = false;
}
writer.println(String.format("\"%s\",\"%s\",\"%s\"", productName, productPrice, productImage));
System.out.printf("Written to CSV: %s, %s, %s%n", productName, productPrice, productImage);
} catch (IOException e) {
e.printStackTrace();
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
}
Here's the complete code after merging the snippets:
package com.tutorial.bolts;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;
public class CSVExportBolt extends BaseBasicBolt {
// define the CSV file path
private static final String CSV_FILE_PATH = "<FULL_PROJECT_PATH>/products.csv";
private boolean isFirstWrite = true;
@Override
@SuppressWarnings("CallToPrintStackTrace")
// retrieve the scraped data information
public void execute(Tuple tuple, BasicOutputCollector collector) {
String productName = tuple.getStringByField("productName");
String productPrice = tuple.getStringByField("productPrice");
String productImage = tuple.getStringByField("productImage");
try (PrintWriter writer = new PrintWriter(new FileWriter(CSV_FILE_PATH, !isFirstWrite))) {
// write header only once
if (isFirstWrite) {
writer.println("\"Name\",\"Price\",\"Image URL\"");
isFirstWrite = false;
}
writer.println(String.format("\"%s\",\"%s\",\"%s\"", productName, productPrice, productImage));
System.out.printf("Written to CSV: %s, %s, %s%n", productName, productPrice, productImage);
} catch (IOException e) {
e.printStackTrace();
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
}
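One caveat: the writer above wraps values in double quotes but doesn't escape quotes that appear inside a value. If you expect such characters, a small helper like the following (illustrative, not part of the tutorial code) keeps the rows valid:
// minimal CSV escaping helper: double any embedded quotes and wrap the value
private static String csvEscape(String value) {
    return "\"" + value.replace("\"", "\"\"") + "\"";
}
// usage inside execute():
// writer.println(String.join(",", csvEscape(productName), csvEscape(productPrice), csvEscape(productImage)));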
You're almost there! It's time to run the crawl topology.
Step 6: Run the Crawl Topology
The final step is to submit the crawl topology locally and execute the web crawling task.
Open the CrawlTopology class inside the com.tutorial package and import the required packages, including the bolts and spout you created previously:
package com.tutorial;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.stormcrawler.ConfigurableTopology;
import com.tutorial.bolts.CSVExportBolt;
import com.tutorial.bolts.ParseBolt;
import com.tutorial.spouts.URLSpout;
Modify the default CrawlTopology class to link the bolts and spout using the TopologyBuilder object. Then, submit the topology:
//...
public class CrawlTopology extends ConfigurableTopology {
public static void main(String[] args) throws Exception {
ConfigurableTopology.start(new CrawlTopology(), args);
}
@Override
protected int run(String[] args) {
TopologyBuilder builder = new TopologyBuilder();
// call spout: URLSpout
builder.setSpout("url-spout", new URLSpout());
// call bolt: ParseBolt
builder.setBolt("parse-bolt", new ParseBolt())
.shuffleGrouping("url-spout");
// call bolt: CSVExportBolt
builder.setBolt("csv-export", new CSVExportBolt())
.shuffleGrouping("parse-bolt");
// submit the crawl topology
return submit("crawl-topology", conf, builder);
}
}
The complete CrawlTopology class looks like this after merging the snippets:
package com.tutorial;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.stormcrawler.ConfigurableTopology;
import com.tutorial.bolts.CSVExportBolt;
import com.tutorial.bolts.ParseBolt;
import com.tutorial.spouts.URLSpout;
public class CrawlTopology extends ConfigurableTopology {
public static void main(String[] args) throws Exception {
ConfigurableTopology.start(new CrawlTopology(), args);
}
@Override
protected int run(String[] args) {
TopologyBuilder builder = new TopologyBuilder();
// call spout: URLSpout
builder.setSpout("url-spout", new URLSpout());
// call bolt: ParseBolt
builder.setBolt("parse-bolt", new ParseBolt())
.shuffleGrouping("url-spout");
// call bolt: CSVExportBolt
builder.setBolt("csv-export", new CSVExportBolt())
.shuffleGrouping("parse-bolt");
// submit the crawl topology
return submit("crawl-topology", conf, builder);
}
}
Build the project using mvn:
mvn clean package
Now, open Apache Storm's bin directory via the command line and execute the StormCrawler bot with the storm.py command. Ensure you replace the path placeholders with your project's full path:
python storm.py local <FULL_PROJECT_PATH>\target\stormcrawler-tutorial-1.0.jar --local-ttl 60 com.tutorial.CrawlTopology -- -conf <FULL_PROJECT_PATH>\crawler-conf.yaml
The above command executes the crawl topology and generates a products.csv file containing the extracted data. You'll find the CSV file inside your project root folder:
Great job! 👏 You've now created a StormCrawler bot that extracts data from multiple pages.
That said, anti-bots are a major challenge you'll often encounter during web crawling. Let's find out how to avoid these anti-bot measures in the next section.
Avoid Getting Blocked While Crawling With StormCrawler
Web crawlers send many requests in quick succession, making them easy targets for anti-bot detection and frequent blocking.
For instance, the current StormCrawler web crawler won't work with a protected site like the anti-bot challenge page. You can try it out by replacing the URLs in seeds.txt with the URL of the protected web page:
https://www.scrapingcourse.com/antibot-challenge
Remember to comment out the CSVExportBolt call in the CrawlTopology class to avoid a null reference.
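Concretely, that means temporarily disabling these two lines in the run() method of CrawlTopology:
// in CrawlTopology.run(): temporarily disable the CSV export stage
// builder.setBolt("csv-export", new CSVExportBolt())
//         .shuffleGrouping("parse-bolt");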
Modify the ParseBolt class to scrape the website's full-page HTML:
package com.tutorial.bolts;
import java.io.IOException;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class ParseBolt extends BaseBasicBolt {
@Override
@SuppressWarnings("CallToPrintStackTrace")
public void execute(Tuple tuple, BasicOutputCollector collector) {
// retrieve the URL emitted by the URLSpout
String url = tuple.getStringByField("url");
try {
// fetch the page content using Jsoup
Document doc = Jsoup.connect(url).get();
// print the full-page HTML
System.out.println("Extracted HTML: " + doc);
} catch (IOException e) {
// handle error in case the URL is not accessible
e.printStackTrace();
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
}
The crawler got blocked with the following 403 Forbidden error:
HTTP error fetching URL. Status=403, URL=[https://www.scrapingcourse.com/antibot-challenge]
Although StormCrawler features a proxy manager for setting up web scraping proxies, this solution doesn't offer full-scale stealth.
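For reference, since this tutorial fetches pages with Jsoup inside ParseBolt rather than through StormCrawler's own protocol layer, the simplest way to route requests through a proxy here is Jsoup's proxy() option. A minimal sketch, where the host and port are placeholders for your provider's details:
// minimal sketch: route the ParseBolt fetch through an HTTP proxy
// "proxy.example.com" and 8080 are placeholder values, not real proxy details
Document doc = Jsoup.connect(url)
        .proxy("proxy.example.com", 8080)
        .userAgent("MyCrawler/1.0") // reuse the agent name you configured earlier
        .get();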
The best way to bypass anti-bots while scraping multiple pages is to use a web scraping API such as ZenRows. The ZenRows Scraper API provides the full toolkit required for efficient web crawling: premium proxy rotation, request header management, advanced fingerprint spoofing, JavaScript execution, anti-bot auto-bypass, and more, all with a single API call.
Let's see how it works by scraping the anti-bot challenge page that blocked your crawler.
Sign up to open the ZenRows Request Builder. Paste the target URL in the link box, and activate Premium Proxies and JS Rendering.
Choose Java as your programming language and select the API connection mode. Copy and paste the generated code into your crawler file:
Here's the generated code:
import org.apache.hc.client5.http.fluent.Request;
public class APIRequest {
public static void main(final String... args) throws Exception {
String apiUrl = "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.scrapingcourse.com%2Fantibot-challenge&js_render=true&premium_proxy=true";
String response = Request.get(apiUrl)
.execute().returnContent().asString();
System.out.println(response);
}
}
The above code accesses the protected web page and extracts its HTML, as shown:
<html lang="en">
<head>
<!-- ... -->
<title>Antibot Challenge - ScrapingCourse.com</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<h2>
You bypassed the Antibot challenge! :D
</h2>
<!-- other content omitted for brevity -->
</body>
</html>
Congratulations! 🎉 You just bypassed an anti-bot protection using ZenRows.
You can also copy and paste the apiUrl string from the generated code into StormCrawler's seeds.txt file for the same result.
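Alternatively, you can keep the topology intact and wrap each target URL with the ZenRows API endpoint inside ParseBolt before fetching it. Here's a minimal sketch that assumes your API key is stored in a ZENROWS_API_KEY environment variable (a hypothetical name used for illustration):
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
// inside ParseBolt.execute(): build the ZenRows API URL around the emitted URL
String apiKey = System.getenv("ZENROWS_API_KEY"); // assumption: key stored in an environment variable
String apiUrl = "https://api.zenrows.com/v1/?apikey=" + apiKey
        + "&url=" + URLEncoder.encode(url, StandardCharsets.UTF_8)
        + "&js_render=true&premium_proxy=true";
Document doc = Jsoup.connect(apiUrl).get();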
Conclusion
You've seen how to crawl an entire website using Apache StormCrawler and Apache Storm. You now know how to:
- Set up Apache Storm and build StormCrawler from the source.
- Create a spout and use it to emit URLs from seeds.txt.
- Use a parser bolt to parse HTML from emitted URLs and extract specific data.
- Export scraped data to CSV using a CSV exporter bolt.
- Create a custom crawl topology and execute the crawling task with Apache Storm.
However, remember that despite being an excellent web crawling tool, StormCrawler can't handle anti-bot measures effectively. The easiest way to crawl any website at scale without getting blocked is to use ZenRows, an all-in-one web scraping solution.
Try ZenRows for free—no credit card required!