Pyspider is an open-source web crawling framework for collecting data from websites at scale. It stands out for features such as task prioritization, a token bucket algorithm for traffic control, and a WebUI that lets you manage everything about your project.
In this guide, we'll cover the steps to build a Pyspider web crawler and how to avoid getting blocked when web scraping with Python.
Prerequisites
To follow along in this tutorial, ensure you meet the following requirements.
- Python.
- Pip3: this package manager comes with your Python installation.
- Your preferred IDE. We'll be using Visual Studio Code in this tutorial.
- Pyspider and its dependencies.
Before we get started, don't forget to check out our complete guide on web scraping with Python.
Build Your First Pyspider Web Crawler
Pyspider's intuitive interface and well-structured documentation make it easy to build your first web crawler.
But before we begin, let's set up some basics. We'll be crawling the ScrapingCourse e-commerce test website.

Our goal is to find and follow all product links on the target website to scrape product information, including product name, product price, and product image.
Step 1: Set up Pyspider
Before we proceed with the setup, let's take a step back and understand how this web crawler works.
Pyspider has a robust yet simple architecture that divides the work into components (Scheduler, Fetcher, Processor, and Output). Each component runs in its own thread, allowing multiple tasks to run simultaneously. You can even distribute tasks across multiple machines to scale your web crawling efforts if needed.
At its core, the Pyspider data flow begins with an on_start() callback:
- The Scheduler receives the on_start() task and dispatches it to the Fetcher according to a predefined schedule.
- The Fetcher makes the request and feeds the response to the Processor.
- The Processor processes the task and, in most cases, generates new URLs to crawl. It informs the Scheduler of finished tasks and sends it the new tasks, and the cycle repeats until there are none left.
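To make this flow concrete, below is a minimal sketch mapping each stage to a handler callback. self.crawl() and response.doc() are part of Pyspider's documented handler API; the URL and class name are illustrative.
from pyspider.libs.base_handler import *

class FlowHandler(BaseHandler):
    def on_start(self):
        # seed task: the Scheduler queues it and dispatches it to the Fetcher
        self.crawl("https://example.com/", callback=self.index_page)

    def index_page(self, response):
        # the Processor runs this callback on the fetched response;
        # each self.crawl() call sends a new task back to the Scheduler
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # a callback that issues no new self.crawl() calls ends this branch of the crawl
        return {"url": response.url, "title": response.doc("title").text()}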
Now, let's see Pyspider in action. Setup is straightforward: follow the steps below to create your Pyspider project.
Create and navigate to a new directory where you'd like to store your code.
mkdir Pyspider_crawler && cd Pyspider_crawler
Install Pyspider and its dependencies by running the command below:
pip3 install pyspider
After that, run the following command to start Pyspider.
pyspider
This will start the WebUI on port 5000. To view your dashboard, open a web browser and navigate to http://localhost:5000.
Click the create button, enter your project name and starting URL, and prepare to write some code.
Step 2: Access the Target Website
Starting with the most basic Pyspider functionality, let's connect to the target website and retrieve its HTML.
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # request the target page
        self.crawl("https://www.scrapingcourse.com/ecommerce/", callback=self.detail_page)

    def detail_page(self, response):
        # log the HTML content
        self.logger.info(response.text)
This code makes a GET request to the target website and logs the response. Here's what each method does.
- @every: The every decorator schedules the on_start() callback to run at a specified interval. For example, with minutes=24 * 60, it runs every 24 hours. This is particularly useful for projects involving periodic tasks, such as monitoring website changes.
- on_start(): This callback fires when you run the code. It makes a GET request to the target website and calls the detail_page() function to handle the response.
- detail_page(): This function processes the response from the on_start() callback. In this case, it logs the HTML content.
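Putting these pieces together, here's a hedged sketch of a fuller handler aimed at the goal we set earlier: following product links and returning each product's name, price, and image. self.crawl() and response.doc() (a PyQuery object) are part of Pyspider's handler API, but the CSS selectors are assumptions about the target site's markup and may need adjusting.
from pyspider.libs.base_handler import *

class ProductHandler(BaseHandler):
    crawl_config = {}

    def on_start(self):
        self.crawl("https://www.scrapingcourse.com/ecommerce/", callback=self.index_page)

    def index_page(self, response):
        # follow every product link on the listing page;
        # the selector below is an assumption about the site's markup
        for each in response.doc("a.woocommerce-LoopProduct-link").items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # extract the product details (these selectors are assumptions, too)
        return {
            "name": response.doc("h1.product_title").text(),
            "price": response.doc("p.price").text(),
            "image": response.doc(".woocommerce-product-gallery img").attr.src,
        }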
Ideally, this code should log the page's HTML content. However, Pyspider has been deprecated and is no longer maintained: it hasn't received updates or bug fixes in years. As a result, it isn't compatible with current Python versions, and your code might not run.
You can try downgrading to an older Python version, like Python 2.7, which Pyspider was originally built for.
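If you want to try that route, one option is a version manager such as pyenv. The commands below are a sketch assuming pyenv is already installed; note that Python 2.7 itself is end-of-life.
# install and activate a legacy Python 2.7 interpreter for this project
pyenv install 2.7.18
pyenv local 2.7.18

# install Pyspider under the legacy interpreter
pip install pyspider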
That said, whether you've managed to tweak Pyspider to work for you or switched to a different tool, you still need to avoid getting blocked to access your target website.
In the next section, you'll learn how to avoid detection when web crawling.
Avoid Getting Blocked While Crawling With Pyspider
Getting blocked is a common web crawling challenge because websites frequently employ anti-bot technologies that flag and restrict bot-like traffic. Crawlers exhibit patterns, such as sending requests at high frequency, that quickly mark them as automated clients, making it easy for websites to block your requests.
Some recommended best practices for overcoming this challenge include rotating proxies and setting custom user agents.
However, these techniques are difficult to scale and become even more challenging to implement against advanced anti-bot solutions.
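For reference, here's a minimal sketch of how both techniques look in a Pyspider project. The headers and proxy options are documented self.crawl parameters that crawl_config applies to every request; the proxy address and user-agent string below are placeholders you'd replace with your own values.
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    # applied to every request this project makes
    crawl_config = {
        "headers": {
            # placeholder user-agent string
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        },
        # HTTP proxy in username:password@hostname:port format (placeholder)
        "proxy": "user:pass@proxy.example.com:8080",
    }
Even then, keeping a proxy pool healthy and headers believable is a moving target, especially at scale.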
To avoid getting blocked while web crawling, you can use the ZenRows Scraper API, an all-in-one web scraping toolkit that provides the most reliable solution for scalable web crawling.
With features such as premium proxies, anti-CAPTCHA, JavaScript rendering, and much more, ZenRows provides everything you need to crawl without getting blocked.
Here's an example showing ZenRows' ability to bypass any anti-bot restriction. We'll use the ScrapingCourse Antibot Challenge page as the target URL.
To follow along, sign up to get your free API key. This will take you to the Request Builder page, where your ZenRows API key is at the top right.

Input your target URL and activate Premium Proxies and JS Rendering boost mode.
Next, select the Python language and choose the API option. ZenRows works with any language and provides ready-to-use snippets for the most popular ones.
Copy the generated code on the right into your editor for testing. You can use any Python HTTP client; the generated code uses the Requests library, which you can install by running the following command.
pip3 install requests
Your code should look like this:
# import the required library
import requests

# target page and ZenRows API credentials
url = 'https://www.scrapingcourse.com/antibot-challenge'
apikey = '<YOUR_ZENROWS_API_KEY>'

# enable JavaScript rendering and premium proxies
params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'premium_proxy': 'true',
}

# send the request through the ZenRows Scraper API
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
This code bypasses the anti-bot challenge and prints the web page's HTML.
<html lang="en">
<head>
<!-- ... -->
<title>Antibot Challenge - ScrapingCourse.com</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<h2>
You bypassed the Antibot challenge! :D
</h2>
<!-- other content omitted for brevity -->
</body>
</html>
Congratulations! You're now well-equipped to crawl any website without getting blocked.
To learn more, check out our in-depth guide on web crawling with Python.
Conclusion
To wrap things up, let's quickly recap the key concepts we've explored in this guide. From learning how Pyspider works to building your first web crawler, you now know that:
- Pyspider has been deprecated and is no longer maintained.
- Getting blocked is a common challenge that you must overcome when web crawling.
Most importantly, you've seen that ZenRows offers the most reliable and scalable web crawling solution. Thus, rather than trying your luck with manual configurations that'll most likely fail, use ZenRows for hassle-free web crawling. Try ZenRows for free today!