How to Scale Your Web Scraping With Gerapy

Idowu Omisola
May 8, 2025 · 6 min read

Scrapy is a powerful web crawling tool. But managing several spiders at scale for continuous data collection can prove challenging; you can easily lose track of which crawl to schedule next.

Whether you want to refresh your data pipeline at intervals, run periodic cron jobs, or scrape multiple pages concurrently from several Scrapy spiders at once, Gerapy comes to the rescue. We'll show you how Gerapy works and how to use it to schedule and distribute your Scrapy scraping tasks for scalability.

What Is Gerapy and How Does It Work?

Gerapy lets you manage and distribute Scrapy crawling tasks on Scrapyd, a dedicated server for deploying Scrapy projects locally or remotely. Built with Django, Gerapy provides a user interface to schedule, queue, monitor, create, update, and delete scraping jobs.

When you deploy a Scrapy project to a running Scrapyd server, you can connect that server to Gerapy as a client and start managing it from a central dashboard.

Gerapy also has built-in crawl scheduling and supports Redis through Scrapy-Redis. This makes scheduling more flexible and helps you avoid duplicate crawls when scraping several URLs at once.
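
If you want that Redis-backed setup, scrapy-redis is enabled with a handful of Scrapy settings. Here's a minimal sketch, assuming you've installed scrapy-redis and have a Redis instance running locally on the default port:

settings.py
# pip3 install scrapy-redis
# Route requests through a shared Redis queue so multiple spiders
# (or multiple machines) never crawl the same URL twice.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the queue and the seen-URLs set between runs.
SCHEDULER_PERSIST = True
# Assumes a Redis server running locally on the default port.
REDIS_URL = "redis://localhost:6379"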

How exactly does distributed crawling work on Gerapy?

Imagine you've deployed a Scrapy project with many spiders, and each spider targets specific pages. You can deploy each spider to Gerapy as a separate task and schedule them as batched, queued jobs.

For instance, when scraping paginated sites with Scrapy, you can distribute the pages among your spiders, where each scrapes 5 pages at a time. When you deploy your Scrapy project via Scrapyd, connecting it to Gerapy allows you to schedule and queue each spider.
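
As an illustration, here's a minimal sketch of that idea: two spiders in the same project, each covering its own five-page slice of a hypothetical paginated site (the URL pattern and field names are placeholders):

Example
import scrapy

class PagesOneToFiveSpider(scrapy.Spider):
    # covers the first five pages of the paginated listing
    name = "pages_1_5"
    start_urls = [f"https://example.com/products?page={n}" for n in range(1, 6)]

    def parse(self, response):
        # extract whatever fields you need; the title is just an example
        yield {"url": response.url, "title": response.css("title::text").get()}

class PagesSixToTenSpider(scrapy.Spider):
    # covers the next five pages; Gerapy can schedule both spiders as separate tasks
    name = "pages_6_10"
    start_urls = [f"https://example.com/products?page={n}" for n in range(6, 11)]

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}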

In the next section, we'll show you how to manage a Scrapy project on Gerapy, focusing on distributed crawling, job scheduling, and queuing.


Setting Up and Scaling Gerapy for Distributed Web Scraping

In this part, you'll learn to connect Gerapy with a Scrapyd server and schedule scraping jobs across multiple nodes.

This tutorial assumes you already have a running Scrapy project. Otherwise, quickly set one up with our Scrapy web scraping guide.

So, let's first set up Gerapy and Scrapyd.

Step 1: Install Required Dependencies

To start, install Gerapy, Scrapyd, and scrapyd-client (the package that provides the scrapyd-deploy command you'll use later) with pip:

Terminal
pip3 install gerapy scrapyd scrapyd-client

All done? You're now ready to deploy your Scrapy project for managed distributed crawling.

Step 2: Initialize and Configure Gerapy

Open a terminal in any suitable folder and initialize a new Gerapy workspace:

Terminal
gerapy init

The above command creates a new gerapy folder in the current directory. Next, move into the gerapy folder and run the database migrations to set up Gerapy's SQLite database:

Terminal
cd gerapy
gerapy migrate

Next, create a superuser using the command below. Follow the prompts to create a username, and add your email address and password:

Terminal
gerapy createsuperuser

Finally, start the Gerapy server:

Terminal
gerapy runserver

The above command starts a local server on port 8000:

Terminal
http://127.0.0.1:8000/
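
By default, this server only listens on your local machine. If you need to reach the dashboard from another device, Gerapy's runserver accepts a Django-style host and port argument:

Terminal
gerapy runserver 0.0.0.0:8000

For the local setup in this tutorial, the default address above is all you need.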

Open http://127.0.0.1:8000/ in your browser and log in with the username and password you set earlier.

Once logged in, you'll land on the Gerapy dashboard.

You're done setting up Gerapy. Let's go ahead and link it with a Scrapyd server.

Step 3: Set Up a Scrapyd Server

To start a Scrapyd server, open a terminal in your Scrapy project's root folder and run:

Terminal
scrapyd

The above command starts a local Scrapyd server, which listens on port 6800 by default:

Terminal
Site starting on 6800

The next step is to connect your Scrapy project with Scrapyd. Replace the configuration inside your project's scrapy.cfg file with the following, swapping in your own project name (this tutorial uses product_scraper):

scrapy.cfg
[settings]
default = product_scraper.settings

[deploy:local]
url = http://localhost:6800/
project = product_scraper
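
If you later spread crawls across remote machines, you can declare extra deploy targets in the same file. The host below is just a placeholder for a remote Scrapyd server:

scrapy.cfg
[deploy:remote]
url = http://your-remote-server:6800/
project = product_scraper

You'd then deploy to it with scrapyd-deploy remote -p product_scraper and add it to Gerapy as another client.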

Deploy the project using the following base command, where <target_name> is the deploy target you defined in scrapy.cfg (e.g., local) and <your_project_name> is your Scrapy project's name (e.g., product_scraper):

Terminal
scrapyd-deploy <target_name> -p <your_project_name>

For instance, to deploy the product_scraper project using the local target, run the following command:

Terminal
scrapyd-deploy local -p product_scraper

Now, open http://localhost:6800/ in your browser. You'll see a page listing your Scrapy project.

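You can also confirm the deployment from the command line through Scrapyd's JSON API. Here's a minimal sketch using the requests library (install it with pip if you don't already have it):

Example
# pip3 install requests
import requests

# list the projects deployed to the local Scrapyd server
projects = requests.get("http://localhost:6800/listprojects.json").json()
print(projects)

# list the spiders registered under the deployed project
spiders = requests.get(
    "http://localhost:6800/listspiders.json",
    params={"project": "product_scraper"},
).json()
print(spiders)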

Step 4: Create and Run a Single Client in Gerapy

Next, let's deploy and track a Scrapy project and its spider via Gerapy.

  1. Open the Gerapy dashboard and click "Clients" on the left sidebar.
  2. Click "+ create" at the top-right.
  3. Give the client a suitable name in the "Name" field. Paste the local Scrapyd server IP address in the "IP" field (127.0.0.1). Then, fill in the "Port" field with the Scrapyd server port number (6800).
  4. Click "create".
  5. The deployed client now appears in the "Clients" table. To run it individually, click "Schedule" under the "Operations" column.
  6. Click "Run" under "Operations" to execute a scraping job. You can even click it multiple times to queue the tasks. You'll see all running and queued tasks on this page, and you can trigger the same job from a script, as shown in the sketch below.
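
Under the hood, the "Run" button queues a job on the Scrapyd server. If you ever want to trigger the same job from a script, you can post to Scrapyd's schedule.json endpoint directly; here's a minimal sketch (replace the spider name with one of your own):

Example
# pip3 install requests
import requests

# queue a crawl on the local Scrapyd server, roughly what Gerapy's "Run" button does
resp = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "product_scraper", "spider": "product_scraper"},  # use your spider's name
)
print(resp.json())  # the response includes a "jobid" for the queued task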

That's it! You just ran your first scraping job with Gerapy. Let's see how to schedule scraping jobs in the next section.

Step 5: Schedule Scraping Jobs Concurrently for Multiple Scrapy Spiders

Gerapy lets you schedule scraping for a specific date in case you forget to run the job manually. It also provides an option for cron jobs, enabling you to automate repetitive crawling. And there's an interval option, which is handy for updating your data pipeline within a particular time frame (e.g., every 10 minutes).

Let's see how to schedule scraping jobs using the interval option.

  1. Go to "Tasks" on the left sidebar.
  2. Click "create" in the top-right corner.
  3. Next, enter a suitable name for your task in the "Name" field. Fill in the "Project" and "Spider" fields with your Scrapy project and spider names, respectively. Then, select the connected client from the "Clients" dropdown.
  4. Select the "Interval" option from the "Trigger" dropdown.
  5. Now, set your chosen interval for the scraping jobs in the provided fields. For instance, you could run the crawling tasks at a 4-second interval for 1 hour (between 16:00:00 and 17:00:00).
  6. Click "create" to activate your schedule.

Great job! You've activated an interval schedule with Gerapy, and your data pipeline now refreshes with new data automatically.

That said, Gerapy automatically queues each job to avoid overlaps and memory overhead. So, you'll see pending and running jobs even if the interval is short.

Gerapy handles multiple spiders the same way: here, the product_scraper and screenshot_spider spiders run concurrently within the same timeframe and interval. You can also cancel a job if you want.

Nice! You've distributed scraping jobs concurrently among different Scrapy spiders, saving as much time as possible.

Step 6: Create and Import Scrapy Projects in Gerapy

You can also import your Scrapy project and manage it directly inside Gerapy. The platform has a real-time code editor to update your code. Any changes you make to the project on Gerapy will reflect in your Scrapyd deployment.

To import a Scrapy project:

  1. Go to "Projects" in Gerapy and click "+ create" at the top right.
  2. If your Scrapy project is available locally, click "Upload" and upload it as a zipped folder. Then, click "Finish". If it's on version control like Git, select the "Clone" option, enter the remote URL in the provided field, and click "Clone".
  3. Next, click "deploy" under "Operations".
  4. On the next page, fill in the "Description" field inside the "Build Project" section with a build description.
  5. Click "build".
  6. Go to the table at the top and click "deploy" under "Operations".
  7. To view the uploaded files, return to "Projects" and click "edit" under "Operations".
  8. You can edit the project directly via the code editor, add new spiders and project files, or update existing ones.

Good job! You're all set and ready to distribute crawling tasks with Gerapy.

Scale Up Your Scraping With ZenRows

Despite Gerapy's ability to distribute scraping jobs, anti-bot measures like CAPTCHAs, JavaScript challenges, IP bans, and more can still block your Scrapy spiders. Getting blocked remains one of the biggest challenges of large-scale web scraping.

The best way to scrape without being blocked is to use a web scraping solution like the ZenRows Universal Scraper API. ZenRows bypasses all anti-bot measures behind the scenes with a single API call. This lets you focus on getting your desired data rather than spending huge resources and time solving anti-scraping blocks.

It also has headless browser features to automate user interactions and scrape dynamic content. And if you're trying to scrape geo-restricted data, ZenRows' geo-targeting comes in handy.

ZenRows integrates easily with Scrapy via the scrapy-zenrows middleware, bringing you all the benefits of the Universal Scraper API.

Let's see how it works by scraping a heavily protected website like this Anti-bot Challenge page.

Sign up and go to the Request Builder. Then, copy your ZenRows API key.


Next, install the scrapy-zenrows middleware with pip:

Terminal
pip3 install scrapy-zenrows 

Add the middleware and your ZenRows API key to your settings.py file, and set ROBOTSTXT_OBEY to False so Scrapy doesn't block requests routed through ZenRows' API:

settings.py
# ...
ROBOTSTXT_OBEY = False

DOWNLOADER_MIDDLEWARES = {
	# enable scrapy-zenrows middleware
	"scrapy_zenrows.middleware.ZenRowsMiddleware": 543,
}
# ZenRows API key
ZENROWS_API_KEY = "<YOUR_ZENROWS_API_KEY>"

Import ZenRowsRequest into your scraper spider and add ZenRows' params to your start_requests function, including the JS Rendering and Premium Proxy features:

Example
# pip3 install scrapy-zenrows
import scrapy
from scrapy_zenrows import ZenRowsRequest

class Scraper(scrapy.Spider):
    name = "scraper"
    allowed_domains = ["www.scrapingcourse.com"]
    start_urls = ["https://www.scrapingcourse.com/antibot-challenge"]

    def start_requests(self):
        # use ZenRowsRequest for customization
        for url in self.start_urls:
            yield ZenRowsRequest(
                url=url,
                params={
                    "js_render": "true",
                    "premium_proxy": "true",
                },
                callback=self.parse,
            )

    def parse(self, response):
        self.log(response.text)
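
Run the spider from your Scrapy project's root folder as usual:

Terminal
scrapy crawl scraper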

The above Scrapy spider returns the protected site's full-page HTML, showing you bypassed the anti-bot challenge:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>

Congratulations! You just used the scrapy-zenrows middleware to bypass an anti-bot challenge in Scrapy. You can now scale your crawling jobs on Gerapy without limitations.

Conclusion

You've learned to deploy a Scrapy project with Scrapyd and manage it at scale using Gerapy's distributed scraping feature. You can now schedule concurrent scraping jobs for different Scrapy spiders via Gerapy.

Remember that anti-bot measures are always there to block your requests, preventing you from accessing your target site. ZenRows helps you bypass this limitation, giving you unlimited access to your desired data. It's your one-stop scraping solution for scalable data collection with minimal effort.

Try ZenRows for free!
