Pyspider is an open-source web crawling framework for collecting data from websites at scale. It stands out for features such as task prioritization, a token bucket algorithm for traffic control, and a WebUI that lets you manage everything about your project.
In this guide, we'll cover the steps to build a Pyspider web crawler and how to avoid getting blocked when web scraping with Python.
Prerequisites
To follow along in this tutorial, ensure you meet the following requirements.
- Python.
- Pip3: this package manager comes with your Python installation.
- Your preferred IDE. We'll be using Visual Studio Code in this tutorial.
- Pyspider and its dependencies.
Before we get started, don't forget to check out our complete guide on web scraping with Python.
Build Your First Pyspider Web Crawler
Pyspider's intuitive interface and well-structured documentation make it easy to build your first web crawler.
But before we begin, let's set up some basics. We'll be crawling the ScrapingCourse e-commerce test website.

Our goal is to find and follow all product links on the target website to scrape product information, including product name, product price, and product image.
Step 1: Set up Pyspider
Before we proceed with the setup, let's take a step back and understand how this web crawler works.
Pyspider has a robust yet simple architecture that divides the work into components (Scheduler, Fetcher, Processor, and Output). Each component runs in its own thread, allowing multiple tasks to run simultaneously. You can even distribute tasks across multiple machines to scale your web crawling efforts if needed.
At its core, the Pyspider data flow begins with an on_start() callback:
- The Scheduler receives the on_start() task and dispatches it to the Fetcher according to a predefined schedule.
- The Fetcher makes the request and feeds the response to the Processor.
- The Processor processes the task and, in most cases, generates new URLs to crawl. It informs the Scheduler of finished tasks and sends it the new tasks, and the cycle repeats until there are none left.
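To make this flow concrete, below is a minimal sketch mapping each stage to a handler callback. self.crawl() and response.doc() are part of Pyspider's documented handler API; the URL and class name are illustrative.
from pyspider.libs.base_handler import *

class FlowHandler(BaseHandler):
    def on_start(self):
        # seed task: the Scheduler queues it and dispatches it to the Fetcher
        self.crawl("https://example.com/", callback=self.index_page)

    def index_page(self, response):
        # the Processor runs this callback on the fetched response;
        # each self.crawl() call sends a new task back to the Scheduler
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # a callback that issues no new self.crawl() calls ends this branch of the crawl
        return {"url": response.url, "title": response.doc("title").text()}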
Now, let's see Pyspider in action. Setup is straightforward: follow the steps below to create your Pyspider project.
Create and navigate to a new directory where you'd like to store your code.
mkdir Pyspider_crawler && cd Pyspider_crawler
Install Pyspider and its dependencies by running the command below:
pip3 install pyspider
After that, run the following command to start Pyspider.
pyspider
This will start the WebUI on port 5000. To view your dashboard, open a web browser and navigate to http://localhost:5000.
Click the create button, enter your project name and starting URL, and prepare to write some code.
Step 2: Access the Target Website
Starting with the most basic Pyspider functionality, let's connect to the target website and retrieve its HTML.
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # request the target page
        self.crawl("https://www.scrapingcourse.com/ecommerce/", callback=self.detail_page)

    def detail_page(self, response):
        # log the HTML content
        self.logger.info(response.text)
This code makes a GET request to the target website and logs the response. Here's what each method does.
- @every: The every decorator schedules the on_start() callback to run at a specified interval. For example, with minutes=24 * 60, it runs every 24 hours. This is particularly useful for projects involving periodic tasks, such as monitoring website changes.
- on_start(): This callback fires when you run the code. It makes a GET request to the target website and calls the detail_page() function to handle the response.
- detail_page(): This function processes the response from the on_start() callback. In this case, it logs the HTML content.
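Putting these pieces together, here's a hedged sketch of a fuller handler aimed at the goal we set earlier: following product links and returning each product's name, price, and image. self.crawl() and response.doc() (a PyQuery object) are part of Pyspider's handler API, but the CSS selectors are assumptions about the target site's markup and may need adjusting.
from pyspider.libs.base_handler import *

class ProductHandler(BaseHandler):
    crawl_config = {}

    def on_start(self):
        self.crawl("https://www.scrapingcourse.com/ecommerce/", callback=self.index_page)

    def index_page(self, response):
        # follow every product link on the listing page;
        # the selector below is an assumption about the site's markup
        for each in response.doc("a.woocommerce-LoopProduct-link").items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # extract the product details (these selectors are assumptions, too)
        return {
            "name": response.doc("h1.product_title").text(),
            "price": response.doc("p.price").text(),
            "image": response.doc(".woocommerce-product-gallery img").attr.src,
        }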
Ideally, this code should log the page's HTML content. However, Pyspider has been deprecated and is no longer maintained: it hasn't received updates or bug fixes in years. As a result, it isn't compatible with current Python versions, and your code might not run.
You can try downgrading to an older Python version, like Python 2.7, which Pyspider was originally built for.
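If you want to try that route, one option is a version manager such as pyenv. The commands below are a sketch assuming pyenv is already installed; note that Python 2.7 itself is end-of-life.
# install and activate a legacy Python 2.7 interpreter for this project
pyenv install 2.7.18
pyenv local 2.7.18

# install Pyspider under the legacy interpreter
pip install pyspider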
That said, whether you've managed to tweak Pyspider to work for you or switched to a different tool, you still need to avoid getting blocked to access your target website.
In the next section, you'll learn how to avoid detection when web crawling.
Avoid Getting Blocked While Crawling With Pyspider
Getting blocked is a common web crawling challenge because websites frequently employ anti-bot technologies that flag and restrict bot-like traffic. Crawlers exhibit patterns, such as sending requests at high frequency, that quickly mark them as automated clients, making it easy for websites to block your requests.
Some recommended best practices for overcoming this challenge include rotating proxies and setting custom user agents.
However, these techniques are difficult to scale and become even more challenging to implement against advanced anti-bot solutions.
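For reference, here's a minimal sketch of how both techniques look in a Pyspider project. The headers and proxy options are documented self.crawl parameters that crawl_config applies to every request; the proxy address and user-agent string below are placeholders you'd replace with your own values.
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    # applied to every request this project makes
    crawl_config = {
        "headers": {
            # placeholder user-agent string
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        },
        # HTTP proxy in username:password@hostname:port format (placeholder)
        "proxy": "user:pass@proxy.example.com:8080",
    }
Even then, keeping a proxy pool healthy and headers believable is a moving target, especially at scale.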
To avoid getting blocked while web crawling, you can use the ZenRows Scraper API, an all-in-one web scraping toolkit that provides the most reliable solution for scalable web crawling.
With features such as premium proxies, anti-CAPTCHA, JavaScript rendering, and much more, ZenRows provides everything you need to crawl without getting blocked.
Here's an example showing ZenRows' ability to bypass any anti-bot restriction. We'll use the ScrapingCourse Antibot Challenge page as the target URL.
To follow along, sign up to get your free API key. This will take you to the Request Builder page, where your ZenRows API key is at the top right.

Input your target URL and activate Premium Proxies and JS Rendering boost mode.
Next, select the Python language and choose the API option. ZenRows works with any language and provides ready-to-use snippets for the most popular ones.
Copy the generated code on the right into your editor for testing. You can use any Python HTTP client; the generated code uses the Requests library, which you can install by running the following command.
pip3 install requests
Your code should look like this:
# import the required library
import requests

# target page and ZenRows API credentials
url = 'https://www.scrapingcourse.com/antibot-challenge'
apikey = '<YOUR_ZENROWS_API_KEY>'

# enable JavaScript rendering and premium proxies
params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'premium_proxy': 'true',
}

# send the request through the ZenRows Scraper API
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
This code bypasses the anti-bot challenge and prints the web page's HTML.
<html lang="en">
<head>
<!-- ... -->
<title>Antibot Challenge - ScrapingCourse.com</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<h2>
You bypassed the Antibot challenge! :D
</h2>
<!-- other content omitted for brevity -->
</body>
</html>
Congratulations! You're now well-equipped to crawl any website without getting blocked.
To learn more, check out our in-depth guide on web crawling with Python.
Conclusion
To wrap things up, let's quickly recap the key concepts we've explored in this guide. From learning how Pyspider works to building your first web crawler, you now know that:
- Pyspider has been deprecated and is no longer maintained.
- Getting blocked is a common challenge that you must overcome when web crawling.
Most importantly, you've seen that ZenRows offers the most reliable and scalable web crawling solution. Thus, rather than trying your luck with manual configurations that'll most likely fail, use ZenRows for hassle-free web crawling. Try ZenRows for free today!