WebMagic is a Java-based framework for extracting data from websites. Designed with a focus on simplicity, this tool offers an intuitive API that allows you to handle the full web crawler lifecycle (downloading HTML, URL management, content extraction, and persistence).
This tutorial will guide you through using WebMagic to crawl websites, follow links, and extract product data. By the end, you'll have a fully functional web crawler capable of discovering links and extracting data at scale. Let's begin.
Prerequisites
To follow along in this tutorial, ensure you meet the following requirements:
- Latest Java Development Kit (JDK).
- The Maven dependency manager: Modern Java IDEs like IntelliJ and NetBeans have built-in dependency manager support (Maven and Gradle). Simply select the Maven template when creating your Java project to use Maven.
- Your preferred IDE. We'll be using IntelliJ in this tutorial.
- WebMagic.
Build Your First WebMagic Web Crawler
WebMagic's extensive documentation and ready-to-use code samples lay the foundation for getting started.
However, real-world examples are the best way to begin any learning journey. Thus, for this tutorial, we'll crawl the ScrapingCourse E-commerce Test site.

The goal here is to follow all links on the page and scrape product information (product name, price, and image URL) once the crawler navigates to a product page.
Step 1: Set up WebMagic
Before we begin writing code, here's some background information about the tool.
WebMagic is inspired by Scrapy, and as such, they share a similar architecture, which is divided into four main components: Downloader, PageProcessor, Scheduler, and Pipeline.
While not every component runs in its own thread, the WebMagic architecture is designed to keep crawling as efficient as possible: the Downloader and PageProcessor can run across multiple threads, while the Pipeline runs in the main thread. Processed pages are handed to the Pipeline asynchronously, so it can consume results independently of the crawling threads.
Here's a summary of WebMagic's workflow:
- A Spider defines what to crawl (the start URL) and how to crawl it.
- The Downloader fetches the HTML content of each queued URL.
- The PageProcessor processes the page according to your instructions; this is also where you can discover more links to crawl.
- The Scheduler manages those links, automatically deduplicating them so each URL is crawled only once.
- The Pipeline handles the extracted results.
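To make these roles concrete, here's a minimal sketch (assuming webmagic-core 0.7.x) that wires the default implementations explicitly. WebMagic falls back to equivalent defaults when you don't set them, so this is illustrative rather than required, but it shows where each component plugs in.
// minimal sketch: explicitly wiring WebMagic's default components (webmagic-core 0.7.x)
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.QueueScheduler;

public class ComponentsSketch {
    public static void main(String[] args) {
        // the PageProcessor defines how each page is processed
        PageProcessor processor = new PageProcessor() {
            @Override
            public void process(Page page) {
                // extraction logic goes here; this example just grabs the page title
                page.putField("title", page.getHtml().css("title", "text").toString());
            }

            @Override
            public Site getSite() {
                return Site.me().setSleepTime(1000);
            }
        };

        Spider.create(processor)                           // what and how to crawl
                .setDownloader(new HttpClientDownloader()) // Downloader: fetches the HTML
                .setScheduler(new QueueScheduler())        // Scheduler: queues and deduplicates URLs
                .addPipeline(new ConsolePipeline())        // Pipeline: consumes the extracted results
                .addUrl("https://www.scrapingcourse.com/ecommerce/")
                .run();
    }
}
We'll rely on those defaults for the rest of the tutorial.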
Now, let's prepare to write some code.
Navigate to a directory where you'd like to store your code and create your Java project. Then, add WebMagic to your project by including the following XML snippet in your pom.xml's <dependencies> section.
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.5</version>
</dependency>
You're all set up!
Step 2: Access the Target Website
As a first step, here's a basic WebMagic script that accesses the target website and retrieves its HTML content.
Later on, we'll extend this script to follow links on the page and extract product information.
For now, this code will make a GET request to the target URL and retrieve its HTML.
The Spider.create() method initializes a new crawler using the Crawler class. This class implements the PageProcessor interface to define how pages should be processed; in this case, it simply logs the HTML to the console.
// import the required classes
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class Main {
    public static void main(String[] args) {
        // initialize a Spider
        Spider.create(new Crawler())
                // define start URL
                .addUrl("https://www.scrapingcourse.com/ecommerce/")
                // set number of threads
                .thread(3)
                .run();
    }

    // PageProcessor implementation
    public static class Crawler implements PageProcessor {
        // set crawling configuration
        private final Site site = Site.me()
                .setRetryTimes(3)
                .setSleepTime(1000);

        // process crawled page
        @Override
        public void process(Page page) {
            // retrieve and log HTML content
            page.putField("html", page.getHtml().toString());
        }

        // provide site settings to the crawler framework
        @Override
        public Site getSite() {
            return site;
        }
    }
}
Here's the result.
<html lang="en">
<head>
<!-- ... -->
<title>
Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com
</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<div class="beta site-title">
<a href="https://www.scrapingcourse.com/ecommerce/" rel="home">
Ecommerce Test Site to Learn Web Scraping
</a>
</div>
<!-- other content omitted for brevity -->
</body>
</html>
Step 3: Follow Links With WebMagic
Now, let's extend our initial basic script to find and follow links on the page.
WebMagic provides a built-in method, page.addTargetRequests(), for this task, allowing you to discover URLs and queue them for crawling without writing that logic from scratch.
page.addTargetRequests() takes a list of the links you want to follow as its argument.
To build this list, inspect the target page to identify the right selectors.
To keep the scope manageable, let's focus only on the pagination links, which lead to the remaining product pages.
Here's what the pagination section of the web page looks like:

Right-click on this section and select Inspect. This will open up the Developer Tools window, showing the page's HTML structure, as seen in the image below:

You'll find that there are 12 product pages, all of which share the same page-numbers class.
In the process() method, use this information to extract all pagination links. Then, add them to the crawl queue using page.addTargetRequests().
WebMagic offers the Selectable API, which allows you to chain extraction directly on page elements. In this case, from the page HTML to the pagination links, as shown below.
// import the required classes
//...
import java.util.List;

// ...
// process crawled page
@Override
public void process(Page page) {
    // extract pagination links
    List<String> paginationLinks = page.getHtml().css(".page-numbers").links().all();

    // add pagination links to crawl queue
    page.addTargetRequests(paginationLinks);
}
To test your code, modify the initial basic script with the code snippet above. You'll get the following WebMagic script:
// import the required classes
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

import java.util.List;

public class Main {
    public static void main(String[] args) {
        // initialize a Spider
        Spider.create(new Crawler())
                // define start URL
                .addUrl("https://www.scrapingcourse.com/ecommerce/")
                // set number of threads
                .thread(3)
                .run();
    }

    // PageProcessor implementation
    public static class Crawler implements PageProcessor {
        // set crawling configuration
        private final Site site = Site.me()
                .setRetryTimes(3)
                .setSleepTime(1000);

        // process crawled pages
        @Override
        public void process(Page page) {
            // extract pagination links
            List<String> paginationLinks = page.getHtml().css(".page-numbers").links().all();

            // add pagination links to crawl queue
            page.addTargetRequests(paginationLinks);
        }

        // provide site settings to the crawler framework
        @Override
        public Site getSite() {
            return site;
        }
    }
}
This code finds and follows all the pagination links on the page and automatically logs the following progress report.
get page: https://www.scrapingcourse.com/ecommerce/
get page: https://www.scrapingcourse.com/ecommerce/page/2/
get page: https://www.scrapingcourse.com/ecommerce/page/4/
get page: https://www.scrapingcourse.com/ecommerce/page/3/
// ... omitted for brevity ... //
Step 4: Extract Data From Collected Links
Now that your crawler has successfully followed the pagination links, the next step is to extract data from them.
Once WebMagic navigates to a product page, we want to extract the product name, price, and image URL.
To achieve that, inspect a product card to identify the right selectors for each data point.

You'll find that each product is a list item with the class product. The following HTML elements within each list item represent the data points:
- Product name: an <h2> with the class product-name.
- Product price: a span element with the class product-price.
- Product image: an <img> tag with the class product-image.
Using this information, modify the process() method to select all product items, loop through them, and extract the desired data points from each product page.
Note that the WebMagic CSS selector won't extract the actual price value if you only specify the span class. The price text is nested inside multiple span elements, so selecting by class alone returns the immediate text of the outer span, which is empty.
Therefore, you need a more precise CSS selector: .product-price bdi. However, this only selects the numeric price value without the currency symbol. To get the full price, select the symbol and the numeric value separately, then concatenate both results.
The currency symbol is a span element with the class woocommerce-Price-currencySymbol.
// import the required classes
//...
import us.codecraft.webmagic.selector.Selectable;

// ...
// process crawled page
@Override
public void process(Page page) {
    // ...
    // select all product items
    List<Selectable> products = page.getHtml().css("li.product").nodes();

    // loop through all product items and extract product details
    for (Selectable product : products) {
        // extract product name
        String productName = product.$(".product-name", "text").toString();

        // extract product price
        String currencySymbol = product.$(".woocommerce-Price-currencySymbol", "text").toString();
        String numericPrice = product.$(".product-price bdi", "text").toString();
        String price = currencySymbol + numericPrice;

        // extract product image URL
        String imageUrl = product.$(".product-image", "src").toString();

        // log the result
        System.out.println("Product Name: " + productName);
        System.out.println("Product Price: " + price);
        System.out.println("Image URL: " + imageUrl);
    }
}
This code extracts the product name, price, and image URL of each product as the crawler navigates each pagination link.
To verify the code works, update the previous script's process() method with the snippet above. The complete script looks like this:
// import the required classes
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

import java.util.List;

public class Main {
    public static void main(String[] args) {
        // initialize a Spider
        Spider.create(new Crawler())
                // define start URL
                .addUrl("https://www.scrapingcourse.com/ecommerce/")
                // set number of threads
                .thread(3)
                .run();
    }

    // PageProcessor implementation
    public static class Crawler implements PageProcessor {
        // set crawling configuration
        private final Site site = Site.me()
                .setRetryTimes(3)
                .setSleepTime(1000);

        // process crawled pages
        @Override
        public void process(Page page) {
            // extract pagination links
            List<String> paginationLinks = page.getHtml().css(".page-numbers").links().all();

            // add pagination links to crawl queue
            page.addTargetRequests(paginationLinks);

            // select all product items
            List<Selectable> products = page.getHtml().css("li.product").nodes();

            // loop through all product items and extract product details
            for (Selectable product : products) {
                // extract product name
                String productName = product.$(".product-name", "text").toString();

                // extract product price
                String currencySymbol = product.$(".woocommerce-Price-currencySymbol", "text").toString();
                String numericPrice = product.$(".product-price bdi", "text").toString();

                // concatenate currency symbol and numeric price
                String price = currencySymbol + numericPrice;

                // extract product image URL
                String imageUrl = product.$(".product-image", "src").toString();

                // log the result
                System.out.println("Product Name: " + productName);
                System.out.println("Product Price: " + price);
                System.out.println("Image URL: " + imageUrl);
            }
        }

        // provide site settings to the crawler framework
        @Override
        public Site getSite() {
            return site;
        }
    }
}
Here's the result:
Product Name: Abominable Hoodie
Product Price: $69.00
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main.jpg
Product Name: Adrienne Trek Jacket
Product Price: $57.00
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_main.jpg
// ... omitted for brevity ... //
Congratulations! You now have a WebMagic crawler capable of following links and extracting data from crawled links.
Step 5: Export the Scraped Data to CSV
Your crawler runs perfectly, but what if you want to save or process your result? WebMagic offers multiple Pipeline classes for managing results.
However, the built-in FilePipeline and JsonFilePipeline persist results in their own formats (plain text and JSON, respectively). If you want to save scraped data in a different format, such as CSV, you'll need a custom Pipeline, as sketched below.
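For reference, here's a minimal sketch of what such a custom Pipeline could look like. It assumes, purely for illustration, that the PageProcessor stores each page's product rows as a List<String[]> under a hypothetical "products" field via page.putField("products", rows), and that you register the pipeline with .addPipeline(new CsvPipeline("products.csv")).
// a minimal custom Pipeline sketch; the "products" field name and CsvPipeline class are
// illustrative assumptions, not part of this tutorial's final script
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

public class CsvPipeline implements Pipeline {
    private final String filePath;

    public CsvPipeline(String filePath) {
        this.filePath = filePath;
    }

    @Override
    public synchronized void process(ResultItems resultItems, Task task) {
        // expects a List<String[]> stored by the PageProcessor under the "products" key
        List<String[]> rows = resultItems.get("products");
        if (rows == null) {
            return; // nothing extracted on this page
        }
        try (FileWriter writer = new FileWriter(filePath, true)) { // append to the CSV file
            for (String[] row : rows) {
                writer.append(String.join(",", row)).append("\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This keeps persistence inside WebMagic's own result-handling stage, but it adds boilerplate you don't strictly need here.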
That said, for this tutorial we'll simply use Java's built-in FileWriter class, which gives you full control over the output format.
Here's the step-by-step process.
We'll abstract this functionality into a reusable method for cleaner code and easy maintenance.
But first, initialize an empty list to store scraped data.
// import the required classes
// ...
import java.util.ArrayList;
import java.util.Collections;

public class Main {
    // initialize a thread-safe list to store scraped product data
    // (process() runs on multiple threads, so a plain ArrayList isn't safe here)
    private static final List<String[]> productData =
            Collections.synchronizedList(new ArrayList<>());
    // ...
}
Next, modify the process() method to add the scraped data for each page to this list.
// import the required classes
// ...
// process crawled page
@Override
public void process(Page page) {
    // ...
    // store the product details in the data list
    productData.add(new String[]{productName, price, imageUrl});
}
Now, create a method to export the data to CSV using the FileWriter class. This method writes the CSV header and populates the rows with the scraped data.
// import the required classes
// ...
import java.io.FileWriter;
import java.io.IOException;

public class Main {
    // ...

    // method to save data to a CSV file
    private static void exportDataToCsv(String filePath) {
        try (FileWriter writer = new FileWriter(filePath)) {
            // write headers
            writer.append("Product Name,Price,Image URL\n");

            // write data rows
            for (String[] row : productData) {
                writer.append(String.join(",", row));
                writer.append("\n");
            }
            System.out.println("Data saved to " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
That's it.
Combine the steps above and call the exportDataToCsv() method after crawling finishes to get your complete WebMagic script.
// import the required classes
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Main {
    // initialize a thread-safe list to store scraped product data
    // (process() runs on multiple threads)
    private static final List<String[]> productData =
            Collections.synchronizedList(new ArrayList<>());

    public static void main(String[] args) {
        // initialize a Spider
        Spider.create(new Crawler())
                // define start URL
                .addUrl("https://www.scrapingcourse.com/ecommerce/")
                // set number of threads
                .thread(3)
                .run();

        // export data to CSV after crawling finishes
        exportDataToCsv("product_data.csv");
    }

    // PageProcessor implementation
    public static class Crawler implements PageProcessor {
        // set crawling configuration
        private final Site site = Site.me()
                .setRetryTimes(3)
                .setSleepTime(1000);

        // process crawled pages
        @Override
        public void process(Page page) {
            // extract pagination links
            List<String> paginationLinks = page.getHtml().css(".page-numbers").links().all();

            // add pagination links to crawl queue
            page.addTargetRequests(paginationLinks);

            // select all product items
            List<Selectable> products = page.getHtml().css("li.product").nodes();

            // loop through all product items and extract product details
            for (Selectable product : products) {
                // extract product name
                String productName = product.$(".product-name", "text").toString();

                // extract product price
                String currencySymbol = product.$(".woocommerce-Price-currencySymbol", "text").toString();
                String numericPrice = product.$(".product-price bdi", "text").toString();

                // concatenate currency symbol and numeric price
                String price = currencySymbol + numericPrice;

                // extract product image URL
                String imageUrl = product.$(".product-image", "src").toString();

                // store the product details in the data list
                productData.add(new String[]{productName, price, imageUrl});
            }
        }

        // provide site settings to the crawler framework
        @Override
        public Site getSite() {
            return site;
        }
    }

    // method to save data to a CSV file
    private static void exportDataToCsv(String filePath) {
        try (FileWriter writer = new FileWriter(filePath)) {
            // write headers
            writer.append("Product Name,Price,Image URL\n");

            // write data rows
            for (String[] row : productData) {
                writer.append(String.join(",", row));
                writer.append("\n");
            }
            System.out.println("Data saved to " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This code creates a product_data.csv file in your project's root directory and writes the scraped data to it.
Here's a sample screenshot for reference.

Awesome! You now have a WebMagic crawler that follows links, extracts data, and saves it to a CSV file.
Avoid Getting Blocked While Crawling With WebMagic
Getting blocked is a common challenge when web crawling. This is because web crawlers often exhibit patterns that are easily flagged by anti-bot solutions.
See for yourself.
Here's our initial WebMagic crawler attempting to crawl the ScrapingCourse Antibot Challenge page, an anti-bot-protected website.
// import the required classes
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class Main {
    public static void main(String[] args) {
        // initialize a Spider
        Spider.create(new Crawler())
                // define start URL
                .addUrl("https://www.scrapingcourse.com/antibot-challenge")
                // set number of threads
                .thread(3)
                .run();
    }

    // PageProcessor implementation
    public static class Crawler implements PageProcessor {
        // set crawling configuration
        private final Site site = Site.me()
                .setRetryTimes(3)
                .setSleepTime(1000);

        // process crawled page
        @Override
        public void process(Page page) {
            // retrieve and log HTML content
            page.putField("html", page.getHtml().toString());
        }

        // provide site settings to the crawler framework
        @Override
        public Site getSite() {
            return site;
        }
    }
}
This code yields an empty result. The website blocks the script, so WebMagic can't fetch the page, and page.getHtml() returns an empty string.
To overcome this challenge, you can try recommended practices, such as rotating proxies and setting custom user agents. However, these measures may not always suffice against advanced anti-bot solutions.
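For context, here's a hedged sketch of what those practices can look like in WebMagic 0.7.x: a custom user agent on the Site configuration and rotating proxies via HttpClientDownloader. The user-agent string and proxy endpoints below are placeholders, not working values.
// sketch only: custom user agent and rotating proxies in WebMagic (placeholder values)
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.proxy.Proxy;
import us.codecraft.webmagic.proxy.SimpleProxyProvider;

public class HardenedCrawler implements PageProcessor {
    // set a browser-like user agent in the Site configuration
    private final Site site = Site.me()
            .setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36") // placeholder UA
            .setRetryTimes(3)
            .setSleepTime(1000);

    @Override
    public void process(Page page) {
        // retrieve and log HTML content
        page.putField("html", page.getHtml().toString());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        // rotate requests across proxies with a custom Downloader
        HttpClientDownloader downloader = new HttpClientDownloader();
        downloader.setProxyProvider(SimpleProxyProvider.from(
                new Proxy("proxy-host-1", 8080),   // placeholder proxy
                new Proxy("proxy-host-2", 8080))); // placeholder proxy

        Spider.create(new HardenedCrawler())
                .setDownloader(downloader)
                .addUrl("https://www.scrapingcourse.com/antibot-challenge")
                .thread(3)
                .run();
    }
}
Even with these tweaks, advanced anti-bot systems like the one protecting the challenge page will often still block the crawler.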
To avoid getting blocked when crawling with WebMagic, consider ZenRows' Universal Scraper API, the most reliable solution for scalable web crawling. This tool handles everything you need to avoid detection, allowing you to focus on extracting your desired data.
Some of ZenRows' features include advanced anti-bot bypass out of the box, anti-CAPTCHA, premium proxies, geo-located requests, fingerprinting evasion, actual user spoofing, request header management, and more.
Let's see ZenRows in action against the same Antibot Challenge page where WebMagic failed.
To follow along, sign up to get your free API key.
You'll be redirected to the Request Builder page, where you can find your ZenRows API key at the top right.

Input your target URL and activate Premium Proxies and JS Rendering boost mode.
Then, select the Java language and choose the API option. ZenRows works with any language and provides ready-to-use snippets for the most popular ones.
Lastly, copy the generated code on the right to your editor for testing.
Your code should look like this:
import org.apache.hc.client5.http.fluent.Request;

public class APIRequest {
    public static void main(final String... args) throws Exception {
        String apiUrl = "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.scrapingcourse.com%2Fantibot-challenge&js_render=true&premium_proxy=true";
        String response = Request.get(apiUrl)
                .execute().returnContent().asString();
        System.out.println(response);
    }
}
Remember to add the Apache HttpClient Fluent dependency to your pom.xml file, as shown below.
<dependency>
    <groupId>org.apache.httpcomponents.client5</groupId>
    <artifactId>httpclient5-fluent</artifactId>
    <version>5.4.1</version>
</dependency>
Run the code, and you'll get the following result:
<html lang="en">
<head>
<!-- ... -->
<title>Antibot Challenge - ScrapingCourse.com</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<h2>
You bypassed the Antibot challenge! :D
</h2>
<!-- other content omitted for brevity -->
</body>
</html>
This code bypasses the anti-bot solution, makes a GET request to the page, and prints its response.
Congratulations! You're now well-equipped to crawl any website without getting blocked.
Conclusion
Here's a quick recap of what you've learned. After setting up your WebMagic project, you now know how to:
- Find and follow links using WebMagic.
- Extract data from crawled links.
- Export scraped data to CSV.
Just remember that while these steps let you build a working web crawler, you still need to avoid getting blocked to put your new tool to use. WebMagic is a great crawling framework, but in real-world use cases, you'll most likely be denied access by anti-bot protections.
To crawl without getting blocked, consider ZenRows, an easy-to-implement, reliable, and scalable solution.
Sign up now to try ZenRows for free.