Scrapy is a powerful web crawling framework, but managing several spiders at scale for continuous data collection can quickly become challenging. It's easy to lose track of which crawl to schedule next.
Whether you want to refresh your data pipeline at intervals, run periodic cron jobs, or scrape multiple pages concurrently from several Scrapy spiders at once, Gerapy comes to the rescue. We'll show you how Gerapy works and how to use it to schedule and distribute your Scrapy scraping tasks for scalability.
- Step 1: Install the required dependencies.
- Step 2: Initialize and configure Gerapy.
- Step 3: Set up a Scrapyd server.
- Step 4: Create and run a single client in Gerapy.
- Step 5: Schedule scraping jobs concurrently for multiple Scrapy spiders.
- Step 6: Create and import Scrapy projects in Gerapy.
What Is Gerapy and How Does It Work?
Gerapy lets you manage and distribute Scrapy crawling tasks on Scrapyd, a dedicated server for deploying Scrapy projects locally or remotely. Built with Django, Gerapy provides a user interface to schedule, queue, monitor, create, update, and delete scraping jobs.
When you deploy a Scrapy project to a running Scrapyd server, you can connect that server to Gerapy as a client and start managing it from a central dashboard.
Gerapy also has built-in crawl scheduling and supports Redis through Scrapy-Redis. This makes scheduling more flexible and helps you avoid double crawls when scraping many URLs at once.
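If you go the Scrapy-Redis route, the wiring lives in your project's settings.py. Below is a minimal sketch, assuming a Redis instance running on localhost and scrapy-redis installed via pip3 install scrapy-redis:
# settings.py (sketch): share one scheduling queue and dupefilter via Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # keep the queue and seen-URL set between runs
REDIS_URL = "redis://localhost:6379"
With this in place, multiple instances of the same spider pull from one shared queue, so a URL that has already been seen isn't crawled twice.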
How exactly does distributed crawling work on Gerapy?
Imagine you've deployed a Scrapy project with many spiders, and each spider targets specific pages. You can deploy each spider to Gerapy as a separate task and schedule them as batched, queued jobs.
For instance, when scraping paginated sites with Scrapy, you can distribute the pages among your spiders, where each scrapes 5 pages at a time. When you deploy your Scrapy project via Scrapyd, connecting it to Gerapy allows you to schedule and queue each spider.
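Here's a minimal sketch of that idea (the spider names, URL, and selector below are hypothetical, not from this tutorial's project): two spiders share the same parsing logic, but each owns its own slice of pages.
import scrapy


class ProductPageSpider(scrapy.Spider):
    """Shared parsing logic; subclasses only differ in the page slice they crawl."""

    def parse(self, response):
        # extract one field per product card (selector is an assumption)
        for title in response.css("h2.product-title::text").getall():
            yield {"title": title}


class PagesOneToFiveSpider(ProductPageSpider):
    name = "pages_1_5"
    start_urls = [f"https://example.com/products?page={n}" for n in range(1, 6)]


class PagesSixToTenSpider(ProductPageSpider):
    name = "pages_6_10"
    start_urls = [f"https://example.com/products?page={n}" for n in range(6, 11)]
Deploy the project once, and each named spider shows up in Gerapy as a separate task you can schedule and queue independently.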
In the next section, we'll show you how to manage a Scrapy project on Gerapy, focusing on distributed crawling, job scheduling, and queuing.
Setting Up and Scaling Gerapy for Distributed Web Scraping
In this part, you'll learn to connect Gerapy with a Scrapyd server and schedule scraping jobs across multiple nodes.
This tutorial assumes you already have a running Scrapy project. Otherwise, quickly set one up with our Scrapy web scraping guide.
Let's first set up Gerapy and Scrapyd.
Step 1: Install Required Dependencies
To start, install Gerapy, Scrapyd, and the scrapyd-client package (which provides the scrapyd-deploy command you'll use later) with pip:
pip3 install gerapy scrapyd scrapyd-client
All done? You're now ready to deploy your Scrapy project for managed distributed crawling.
Step 2: Initialize and Configure Gerapy
Open your command line to any suitable folder and initiate a new Gerapy workspace:
gerapy init
The above command creates a new gerapy folder in the current directory. Next, move into the gerapy folder and set up its SQLite database:
cd gerapy
gerapy migrate
Next, create a superuser using the command below. Follow the prompts to create a username, and add your email address and password:
gerapy createsuperuser
Finally, start the gerapy server:
gerapy runserver
The above command initiates a local server at port 8000:
http://127.0.0.1:8000/
Open that URL via your browser and log in with the username and password you set earlier:

Once logged in, you should see the following interface:

You're done setting up Gerapy. Let's go ahead and link it with a Scrapyd server.
Step 3: Set Up a Scrapyd Server
To start a Scrapyd server, open the command line to your Scrapy project root folder and run:
scrapyd
The above command initiates a local Scrapyd server. This runs on port 6800 by default:
Site starting on 6800
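As an optional sanity check, you can hit Scrapyd's daemonstatus.json endpoint to confirm the server is up and see its pending, running, and finished job counts:
curl http://localhost:6800/daemonstatus.json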
The next part is to connect your Scrapy project with Scrapyd. Replace the configuration inside the scrapy.cfg file with the following:
[settings]
default = product_scraper.settings
[deploy:local]
url = http://localhost:6800/
project = product_scraper
Deploy the project using the following base command, where <target_name> is the deploy target defined in scrapy.cfg (e.g., local) and <your_project_name> is your Scrapy project's name:
scrapyd-deploy <target_name> -p <your_project_name>
For instance, to deploy the product_scraper project used in this tutorial with the local target, run the following command:
scrapyd-deploy local -p product_scraper
Now, open http://localhost:6800/ in your browser. You'll see a page listing your Scrapy project.

You can deploy multiple projects to a single Scrapyd server. You only need to connect to that server via each project's scrapy.cfg file.
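If you prefer the command line over the browser, Scrapyd's listprojects.json endpoint returns every project currently deployed to the server:
curl http://localhost:6800/listprojects.json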
Step 4: Create and Run a Single Client in Gerapy
Next, let's deploy and track a Scrapy project and its spiders via Gerapy.
- Open the Gerapy dashboard and click "Clients" on the left sidebar.
- Click "+ create" at the top-right.

- Give the client a suitable name in the "Name" field. Paste the local Scrapyd server IP address in the "IP" field (127.0.0.1). Then, fill in the "Port" field with the Scrapyd server port number (6800).
If you've hosted Scrapyd on a remote service like Azure or AWS, specify the provider's IP address or your domain name in the "IP" field. Then, toggle the "Auth" button and enter the relevant authentication credentials (see the scrapyd.conf sketch at the end of this step).
- Click "create".

- The deployed client now appears in the "Clients" table. To run it individually, click "Schedule" under the "Operations" column.

- Click "Run" under the "Operations" column to execute a scraping job. You can even click it multiple times to queue more tasks. You'll see all running and queued tasks on this page.

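As mentioned above, if your Scrapyd server runs on a remote host, one way to expose and protect it is a scrapyd.conf file placed next to where you run the server. The snippet below is only a sketch with placeholder credentials; the username and password options assume Scrapyd 1.3 or later:
[scrapyd]
# listen on all interfaces so Gerapy can reach the server remotely
bind_address = 0.0.0.0
http_port = 6800
# basic HTTP auth; mirror these credentials in Gerapy's "Auth" fields
username = <your_username>
password = <your_password>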
That's it! You just ran your first scraping job with Gerapy. Let's see how to schedule scraping jobs in the next section.
Step 5: Schedule Scraping Jobs Concurrently for Multiple Scrapy Spiders
Gerapy lets you schedule a scraping job for a specific date so you don't have to remember to run it manually. It also provides a cron option for automating repetitive crawls, plus an interval option, which is handy for refreshing your data pipeline within a particular time frame (e.g., every 10 minutes).
Let's see how to schedule scraping jobs using the interval option.
- Go to "Tasks" on the left sidebar.
- Click "create" in the top-right corner.

- Next, enter a suitable name for your task in the "Name" field. Fill in the "Project" and "Spider" fields with your Scrapy project and Spider names, respectively. Then, select the connected client from the "Clients" dropdown.
- Select the "Interval" option from the Trigger dropdown.

- Now, set your chosen interval for the scraping job in the provided fields. For instance, the following configuration runs the crawling task at a 4-second interval for 1 hour (between 16:00:00 and 17:00:00).

- Click "create" to activate your schedules.
Great job! You've activated interval schedules with Gerapy. You're now updating your data pipeline with new data.
That said, Gerapy automatically queues each job to avoid overlaps and memory overhead, so you'll see both pending and running jobs when the interval is shorter than a job's runtime.
If your Scrapy project has multiple spiders, you can schedule and queue them all with Gerapy. Just repeat the above scheduling process, but change the task name and ensure you update the Spider field with the target spider's name.
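Under the hood, Gerapy queues these runs through Scrapyd's job API. If you ever want to trigger a spider outside the Gerapy UI, say from a quick shell script, you can call Scrapyd's schedule.json endpoint directly (replace the placeholders with your own project and spider names):
curl http://localhost:6800/schedule.json -d project=<your_project_name> -d spider=<your_spider_name>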
The image below shows how Gerapy runs multiple spiders. Here, the product_scraper and screenshot_spiders run concurrently within the same timeframe and interval. You can also cancel a job if you want:

Nice! You've distributed scraping jobs concurrently among different Scrapy spiders, saving significant crawl time.
Step 6: Create and Import Scrapy Projects in Gerapy
You can also import your Scrapy project and manage it directly inside Gerapy. The platform has a real-time code editor for updating your code, and any changes you make to the project in Gerapy are reflected in your Scrapyd deployment.
To import a Scrapy project:
- Go to "Projects" in Gerapy and click "+ create" at the top right.

- If your Scrapy project is available locally, click "Upload" and upload it as a zipped folder. Then, click "Finish". If it's hosted in a version control system like Git, select the "Clone" option, enter the remote URL in the provided field, and click "Clone".

- Next, click "deploy" under Operations.

- On the next page, fill in the "Description" field inside the "Build Project" section with a build description.
- Click "build".
- Go to the table at the top and click "deploy" under "Operations".

- To view the uploaded files, return to "Projects" and click "edit" under "Operations".

- You can edit the project directly via the code editor, add new spiders and project files, or update existing ones.

Good job! You're all set and ready to distribute crawling tasks with Gerapy.
Scale Up Your Scraping With ZenRows
Despite Gerapy's ability to distribute scraping jobs, anti-bot measures like CAPTCHAs, JavaScript challenges, and IP bans can still block your Scrapy spiders. Getting blocked remains one of the biggest challenges of large-scale web scraping.
The best way to scrape without being blocked is to use a web scraping solution like the ZenRows Universal Scraper API. ZenRows bypasses all anti-bot measures behind the scenes with a single API call. This lets you focus on getting your desired data rather than spending huge resources and time solving anti-scraping blocks.
It also has headless browser features to automate user interactions and scrape dynamic content. And if you're trying to scrape geo-restricted data, ZenRows' geo-targeting comes in handy.
ZenRows integrates easily with Scrapy via the scrapy-zenrows middleware, bringing you all the benefits of the Universal Scraper API.
Let's see how it works by scraping a heavily protected website like this Anti-bot Challenge page.
Sign up and go to the Request Builder. Then, copy your ZenRows API key.

Next, install the scrapy-zenrows middleware with pip:
pip3 install scrapy-zenrows
Add the middleware and your ZenRows API key to your settings.py file and set ROBOTSTXT_OBEY to False to gain access to ZenRows' API:
# ...
ROBOTSTXT_OBEY = False

DOWNLOADER_MIDDLEWARES = {
    # enable scrapy-zenrows middleware
    "scrapy_zenrows.middleware.ZenRowsMiddleware": 543,
}

# ZenRows API key
ZENROWS_API_KEY = "<YOUR_ZENROWS_API_KEY>"
Import ZenRowsRequest into your scraper spider and add ZenRows' params to your start_requests function, including the JS Rendering and Premium Proxy features:
# pip3 install scrapy-zenrows
import scrapy
from scrapy_zenrows import ZenRowsRequest


class Scraper(scrapy.Spider):
    name = "scraper"
    allowed_domains = ["www.scrapingcourse.com"]
    start_urls = ["https://www.scrapingcourse.com/antibot-challenge"]

    def start_requests(self):
        # use ZenRowsRequest for customization
        for url in self.start_urls:
            yield ZenRowsRequest(
                url=url,
                params={
                    "js_render": "true",
                    "premium_proxy": "true",
                },
                callback=self.parse,
            )

    def parse(self, response):
        self.log(response.text)
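Assuming this spider file lives in your project's spiders folder (a standard Scrapy layout), run it as usual:
scrapy crawl scraper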
The above Scrapy spider returns the protected site's full-page HTML, showing you bypassed the anti-bot challenge:
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>You bypassed the Antibot challenge! :D</h2>
    <!-- other content omitted for brevity -->
</body>
</html>
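In a real pipeline, you'd parse the response instead of logging raw HTML. As a minimal sketch, assuming the middleware hands back a regular HtmlResponse (so Scrapy selectors work as usual), you could swap the parse method to yield just the confirmation heading:
# inside the Scraper class: replace the existing parse method
def parse(self, response):
    # extract the challenge confirmation heading (assumed selector)
    yield {"message": response.css("h2::text").get()}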
Congratulations! You just used ZenRows to bypass an anti-bot challenge in Scrapy using the scrapy-zenrows middleware. You can now scale your crawling jobs on Gerapy without limitations.
Conclusion
You've learned to deploy a Scrapy project with Scrapyd and manage it at scale using Gerapy's distributed scraping feature. You can now schedule concurrent scraping jobs for different Scrapy spiders via Gerapy.
Remember that anti-bot measures are always there to block your requests, preventing you from accessing your target site. ZenRows helps you bypass this limitation, giving you unlimited access to your desired data. It's your one-stop scraping solution for scalable data collection with minimal effort.