Cloudscraper With Scrapy: How to Use + Alternatives

Idowu Omisola
October 2, 2024 · 4 min read

Scrapy is a powerful Python library explicitly designed for extracting data from web pages at scale. However, Cloudflare-protected websites can identify Scrapy's fingerprint and block your requests.

Fortunately, one of Scrapy's superpowers is its customizability. You can integrate with external solutions, such as Cloudscraper, to bypass Cloudflare.

But do these tools truly live up to expectations?

In this article, we'll review the viability of using Cloudscraper with Scrapy and also recommend more efficient alternatives.

How to Integrate Cloudscraper With Scrapy

Cloudflare-protected web pages are among the most challenging to scrape. If Cloudflare suspects bot-like activity, it responds with the infamous "I'm Under Attack" page, presenting various challenges that require you to prove your legitimacy.

What if we told you vanilla Scrapy never makes it past this page?

You probably already know that, and this is where Cloudscraper could play a key role.

It provides a lightweight API for solving Cloudflare challenges and bypassing the "I'm Under Attack Mode" (IUAM). Scrapy allows you to integrate with this tool while maintaining its scraping architecture.

Let's explore how to combine both solutions to build a Cloudflare scraper.

Scrapy offers the downloader middleware framework that lets you customize its requests/response processing. By injecting Cloudscraper into this middleware, you can configure Scrapy to pass requests through Cloudscraper.

In a nutshell, to integrate Cloudscraper with Scrapy, activate a middleware class that makes requests using Cloudscraper. This middleware will intercept your requests, handle Cloudflare challenges, and return the response to Scrapy.

Here's a step-by-step guide.

To follow along, install Cloudscraper and Scrapy using the following commands.

Terminal
pip3 install scrapy cloudscraper

Step 1: Create a Custom Middleware

In your middlewares.py file, import Cloudscraper and HtmlResponse.

middlewares.py
from scrapy.http import HtmlResponse
import cloudscraper
from scrapy import signals

The HtmlResponse class is a specialized Scrapy response type for handling HTML responses. Since Scrapy expects responses to be instances of its response objects, you must convert Cloudscraper's response to an HtmlResponse object.

Then, add the following code:

middlewares.py
#...
class CloudscraperMiddleware:
    def __init__(self):
        # create a Cloudscraper instance
        self.scraper = cloudscraper.create_scraper()

    @classmethod
    def from_crawler(cls, crawler):
        # this method is used by Scrapy to create the middleware instance
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # make the request using Cloudscraper (note: only the URL is forwarded;
        # headers, cookies, and non-GET methods from the Scrapy request are ignored)
        response = self.scraper.get(request.url)

        # return a Scrapy HtmlResponse object with the content from Cloudscraper
        return HtmlResponse(
            url=request.url,
            status=response.status_code,
            body=response.content,
            encoding='utf-8',
            request=request
        )

    def spider_opened(self, spider):
        spider.logger.info(f'Spider {spider.name} opened: CloudscraperMiddleware active')

The code above creates a Cloudscraper middleware class that defines two main methods:

  • The from_crawler() class method receives a crawler instance and initializes middleware with the crawler object. This serves as the main entry point for the Cloudscraper middleware, as it gives access to the entire Scrapy crawler object, including settings.
  • The process_request() method makes each request using Cloudscraper and returns an HtmlResponse object. Scrapy calls this method for every request that goes through the downloader middleware.
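If you only want some requests to go through Cloudscraper, you can gate the middleware on a request.meta flag: returning None from process_request() tells Scrapy to continue with its default download handling. Here's a minimal, self-contained sketch of that gating logic. Note that the Request class below is a stub standing in for scrapy.Request, and "use_cloudscraper" is a hypothetical meta key, not a Scrapy built-in.

```python
# Sketch: opt-in Cloudscraper routing via request.meta.
# The Request class is a stub for scrapy.Request so the gating logic
# runs standalone; "use_cloudscraper" is a hypothetical meta key.

class Request:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}

class SelectiveCloudscraperMiddleware:
    def process_request(self, request, spider):
        if not request.meta.get("use_cloudscraper"):
            # returning None lets Scrapy's default downloader handle the request
            return None
        # in the real middleware, this is where you'd call self.scraper.get()
        # and wrap the result in an HtmlResponse, as in Step 1
        return f"cloudscraper:{request.url}"

mw = SelectiveCloudscraperMiddleware()
print(mw.process_request(Request("https://example.com"), None))  # None
print(mw.process_request(Request("https://example.com", {"use_cloudscraper": True}), None))  # cloudscraper:https://example.com
```

In a real spider, you'd set the flag per request with scrapy.Request(url, meta={"use_cloudscraper": True}).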

Step 2: Activate your Cloudscraper Middleware

In your settings.py file, add the Cloudscraper middleware in the DOWNLOADER_MIDDLEWARES setting.

settings.py
DOWNLOADER_MIDDLEWARES = {
    "<project_name>.middlewares.CloudscraperMiddleware": 543,
}

<project_name>.middlewares.CloudscraperMiddleware is the import path of the Cloudscraper middleware (replace <project_name> with your Scrapy project's name), and 543 is the priority that determines where the middleware sits in the processing chain.

If you're unsure which priority to use, check out Scrapy's default middleware orders. That said, 543 is a reasonable choice, as it places your middleware roughly in the middle of the request/response chain.
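To see how priorities translate into execution order, here's a quick sketch that sorts a middleware dict the way Scrapy does. The built-in priorities shown mirror Scrapy's DOWNLOADER_MIDDLEWARES_BASE in recent versions, and myproject is a placeholder project name.

```python
# Scrapy sorts downloader middlewares by priority: lower numbers run their
# process_request() earlier (closer to the engine), higher numbers closer to
# the downloader. Values mirror Scrapy's DOWNLOADER_MIDDLEWARES_BASE defaults.
middlewares = {
    "scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
    "myproject.middlewares.CloudscraperMiddleware": 543,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 590,
}

for path, priority in sorted(middlewares.items(), key=lambda item: item[1]):
    print(priority, path.rsplit(".", 1)[-1])
```

Running this prints RobotsTxtMiddleware first and HttpCompressionMiddleware last, with CloudscraperMiddleware in between at 543.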

That's it. You've integrated Cloudscraper with Scrapy.

Step 3: Scrape the Cloudflare-protected Web Page

Now, to the crux of the matter.

Can Cloudscraper bypass Cloudflare?

Let's put it to the test.

For this example, we'll use a Cloudflare Challenge page as the target website.

First, create a spider that makes a request to the target website and prints the response.

scraper.py
import scrapy

class ScraperSpider(scrapy.Spider):
    name = "scraper"
    start_urls = ["https://www.scrapingcourse.com/cloudflare-challenge"]

    def parse(self, response):
        self.log(response.text)

Run the spider with the scrapy crawl scraper command. The Cloudscraper middleware intercepts the Spider's request and attempts to bypass Cloudflare. However, we get the following response:

Output
xx.xx.xx  [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.scrapingcourse.com:443
xx.xx.xx [urllib3.connectionpool] DEBUG: https://www.scrapingcourse.com:443 "GET /cloudflare-challenge HTTP/1.1" 403 None
xx.xx.xx [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.scrapingcourse.com/cloudflare-challenge> (referer: None)

The result above shows that your Scrapy Spider received the 403 Forbidden error, which indicates that the target page understood your request but refused to fulfill it.

This could occur for different reasons, the most notable being that Cloudflare continuously updates its detection mechanisms, and open-source tools like Cloudscraper simply can't keep up.

You have your answer: Cloudscraper can no longer bypass Cloudflare.

However, you can still bypass Cloudflare with Scrapy.

Below are some working alternatives. 

Working Alternatives to Cloudscraper + Scrapy

To bypass Cloudflare, your request must completely imitate a natural user. Here are three ways to achieve that, some more effective than others. 

Method #1: ZenRows: Web Scraping API

Manually emulating natural user behavior requires deep technical expertise, and even then, it's nearly impossible to plug every hole.

What if you could do it using a single API call?

Web scraping APIs like ZenRows offer the most effective and only surefire way to bypass Cloudflare and any anti-bot system.

Not only does ZenRows handle all the technical aspects of emulating natural browsing behavior, but it also evolves with Cloudflare's frequent updates and changes.

ZenRows provides numerous features out of the box, including auto-rotating user agents, premium proxies, anti-CAPTCHA, and more. Its headless browser functionality allows you to render dynamic pages and interact with web page elements like in a regular browser.

All this and more makes ZenRows the most effective option for bypassing Cloudflare and web scraping at scale.

To back up these claims, let's put ZenRows to the test using the same Cloudflare Challenge page that blocked Cloudscraper.

To follow along, sign up for free to get your API key. You'll be redirected to the Request Builder page.

building a scraper with zenrows

Input the target URL and activate Premium Proxies and the JS Rendering mode. In some cases, ZenRows automatically activates these parameters.

Select the Python language option on the right and choose the API mode. This will generate your request code.

Copy the code and use your preferred HTTP client to make a request to the ZenRows API. The code below uses Python Requests.

Example
# pip install requests
import requests

url = 'https://www.scrapingcourse.com/cloudflare-challenge'
apikey = '<YOUR_ZENROWS_API_KEY>'
params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'premium_proxy': 'true',
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)

Here's the result:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Cloudflare Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Cloudflare challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>

Congratulations! You've bypassed Cloudflare.

Method #2: Selenium

Browser automation tools like Selenium can also emulate natural browsing behavior by rendering JavaScript and automating web page interactions, just like a regular browser.

This adds a human touch to your request, allowing you to load dynamic pages and handle JavaScript challenges, a significant part of Cloudflare's detection mechanisms.

You can also integrate Selenium with Scrapy if you'd like to maintain Scrapy's crawling architecture.

Unlike Cloudscraper, Selenium receives regular updates and boasts one of the largest developer communities among browser automation tools.

To get started with Selenium, check out this guide on Selenium web scraping.

However, you should know that Selenium is memory-intensive and challenging to scale, especially when running multiple browser instances in parallel. Also, it doesn't guarantee 100% success, as advanced anti-bot systems could detect its automation properties and block your requests.

Method #3: Undetected ChromeDriver (+ Premium Proxies)

Undetected ChromeDriver (UC) is a modified version of the standard ChromeDriver that enables you to leverage Selenium's functionalities without triggering anti-bot mechanisms.

This makes it harder for websites to detect Selenium's automation properties, allowing you to fly under the radar.

For more details on how to implement this technique, check out this blog on Undetected Chromedriver with Python.

However, Undetected Chromedriver isn't foolproof, as Cloudflare protection varies according to target website settings. Advanced Cloudflare protection can block your UC request.

That said, you can improve your chances of avoiding detection by supercharging Undetected ChromeDriver with premium proxies.

Conclusion

Although integrating Cloudscraper with Scrapy doesn't get you over the hump, other options exist for bypassing Cloudflare. The three discussed in this article (ZenRows, Selenium, and Undetected Chromedriver) are powerful alternatives.

However, the most reliable method to prevent being blocked is to utilize web scraping APIs such as ZenRows.

Try ZenRows now for free!
