Scrapy Impersonate: Advanced Tutorial for 2024

Idowu Omisola
September 9, 2024 · 5 min read

Do you want to improve your Scrapy spider's ability to bypass blocks during web scraping? Scrapy-impersonate can help you do that without the overhead of running a browser.

In this article, you'll learn how scrapy-impersonate works and how to use it to take full advantage of its anti-bot bypass mechanisms.

Let's go!

Why Use Scrapy Impersonate?

Scrapy-impersonate is a Scrapy download handler that spoofs a browser's Transport Layer Security (TLS) signature to improve your spider's ability to bypass anti-bots. Scrapy Impersonate supports Chrome, Edge, and Safari.

In addition to browser spoofing, it mimics various platforms, including macOS, Windows, iOS, and Android, giving you more stealth while web scraping with Scrapy.

However, scrapy-impersonate isn't another headless browser that lets you execute JavaScript during web scraping. Like curl_cffi, it only imitates a browser's network-level fingerprint and doesn't provide a full-fledged browser interface like automation tools such as Selenium.

Check out our tutorial on Scrapy Playwright or Scrapy Splash if you want to add headless browser support to your Scrapy project.

Before learning how the library works, let's see the requirements to get started.


Prerequisites

Scrapy-impersonate works with Python 3+. So, download and install the latest Python version from python.org. 

Then, install scrapy-impersonate using pip. This process also installs Scrapy:

Terminal
pip install scrapy-impersonate

Once installed, run the following command to start a Scrapy project. Let's call it my_scraper:

Terminal
scrapy startproject my_scraper
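
This command generates a project skeleton similar to the one below (the exact files may vary slightly between Scrapy versions):

Example
my_scraper/
    scrapy.cfg
    my_scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py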

Now, open your project folder with any code editor, e.g., VS Code.

Step 1: Set Up Scrapy-impersonate

The next step is configuring Scrapy to use scrapy-impersonate's features. As mentioned, the library acts as a Scrapy download handler, the component responsible for fetching content from web pages.

Its configuration only requires pointing Scrapy to the scrapy-impersonate download handler in the settings file.

Open your project's settings.py file and add the following code lines:

settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_impersonate.ImpersonateDownloadHandler",
    "https": "scrapy_impersonate.ImpersonateDownloadHandler",
}
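
The scrapy-impersonate handler runs on asyncio, so your project also needs Twisted's asyncio reactor enabled. If your settings file doesn't set it already, add the line below (based on the library's documented setup; check the scrapy-impersonate README for your installed version):

settings.py
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"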

Let's see how scrapy-impersonate works in the next section.

Step 2: Set a Browser to Scrape a Page

Scraping a web page with scrapy-impersonate requires setting a browser option on your spider's requests via the meta dictionary.

You'll learn how the library works by scraping Browserleaks, a test website that returns a brief JSON response of your browser's TLS fingerprint details. 

First, let's see what a simple spider request without scrapy-impersonate gives us. 

Create a new scraper.py file in your project's spiders directory and configure your spider class as shown:

scraper.py
# import scrapy
import scrapy

class ScraperSpider(scrapy.Spider):
    # specify the spider's name
    name = "scraper"
    start_urls = ["https://tls.browserleaks.com/json"]

    # parse the response
    def parse(self, response):
        yield response.json()

Run the spider with the following command:

Terminal
scrapy crawl scraper
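
If you'd rather inspect the scraped data in a file than in the console log, Scrapy's feed export flag works here too (result.json is just an example filename):

Terminal
scrapy crawl scraper -O result.json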

The above code returns the fingerprint details below. The ja3_hash is a fixed value for Scrapy's default client: run the scraper a few more times, and you'll see that the ja3_hash doesn't change.

Additionally, Akamai's fingerprint parameters, including akamai_hash and akamai_text, are missing, indicating that the spider failed Akamai's browser fingerprinting test. It also reveals that the request uses Scrapy's User Agent, which can get you blocked easily due to its bot-like appearance:

Output
{
    'user_agent': 'Scrapy/2.11.2 (+https://scrapy.org)',
    'ja3_hash': '5cc600468c246704e1699c12f51eb3ab',
    'ja3_text': '771,4866-4867-4865-49196-49200-159-52393-...',
    'ja3n_hash': '41b57c95e90f19a8d418248b79dab8e4',
    'ja3n_text': '771,4866-4867-4865-49196-49200-159-52393-...',
    'akamai_hash': '',
    'akamai_text': ''
}

Let's run another check to see Scrapy's default request headers. 

To check your spider's complete request headers, replace the target URL in the previous code with https://httpbin.io/headers. 
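
For reference, here's that variant of the spider. It's a minimal sketch; only the target URL changes:

scraper.py
import scrapy

class ScraperSpider(scrapy.Spider):
    # same spider as before, pointed at httpbin to inspect the outgoing headers
    name = "scraper"
    start_urls = ["https://httpbin.io/headers"]

    def parse(self, response):
        yield response.json()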

See the output below.

Output
{
    'headers':
    {
        'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
        'Accept-Encoding': ['gzip, deflate, br, zstd'],
        'Accept-Language': ['en'],
        'Host': ['httpbin.io'],
        'User-Agent': ['Scrapy/2.11.2 (+https://scrapy.org)']
    }
}

Essential header information, such as the platform name, browser version, and secure client hints, is missing, making your spider more vulnerable to anti-bot detection. The User-Agent is also a Scrapy string, confirming the Browserleaks result.

Let's modify these TLS fingerprints and request headers with scrapy-impersonate using Browserleaks as a test website.

Modify your spider class like this:

scraper.py
class ScraperSpider(scrapy.Spider):

    # specify the spider's name
    name = "scraper"

Define a start_requests method that requests the target page and sets the browser option in the request's meta dictionary:

scraper.py
class ScraperSpider(scrapy.Spider):
   
    # ...

    def start_requests(self):
       
        # request the page with a browser impersonation option
        yield scrapy.Request(
            "https://tls.browserleaks.com/json",

            # specify a browser version
            meta={
                "impersonate": "chrome119",
            },
        )

Finally, parse the content in a parser function:

scraper.py
class ScraperSpider(scrapy.Spider):

    # ...

    def parse(self, response):
        yield response.json()

Combine the snippets. Here's the full spider code:

scraper.py
# import scrapy
import scrapy

class ScraperSpider(scrapy.Spider):
    # specify the spider's name
    name = "scraper"

    def start_requests(self):
       
        # request the page with a browser impersonation option
        yield scrapy.Request(
            "https://tls.browserleaks.com/json",

            # specify a browser version
            meta={
                "impersonate": "chrome119",
            },
        )

    def parse(self, response):
        yield {"response": response.json()}

The above spider outputs the following TLS fingerprint details:

Output
{
    'user_agent': 'Scrapy/2.11.2 (+https://scrapy.org)',
    'ja3_hash': '0dc98a37b899c02cf946eda176979e52',
    'ja3_text': '771,4867-4865-4866-49195-49199-49196-...',
    'ja3n_hash': 'd87d800723dc75d90d46b7894734d303',
    'ja3n_text': '771,4867-4865-4866-49195-49199-49196-...',
    'akamai_hash': '52d84b11737d980aef856699f885ca86',
    'akamai_text': '1:65536;2:0;4:6291456;6:262144|15663105|0|m,a,s,p'
}

It contains the Akamai hash and text. Additionally, the ja3_hash mimics a real browser and changes per request, boosting your scraper's chances of evading anti-bot security checks. The only disadvantage of this result is that your spider still uses Scrapy's User-Agent.

Now, replace the Browserleaks URL with https://httpbin.io/headers to see how scrapy-impersonate has modified your request headers. 

Below is the result for https://httpbin.io/headers, showing notable request headers such as the secure client hints (Sec-Ch-Ua, Sec-Ch-Ua-Platform) and the Sec-Fetch-* headers (Sec-Fetch-Mode, Sec-Fetch-User, and others). These make the request look like it comes from a real browser.

However, like the previous test, this one also shows that scrapy-impersonate doesn't patch the User-Agent while modifying the TLS signature:

Output
{
    'headers':
    {
        'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
        'Accept-Encoding': ['gzip, deflate, br, zstd'],
        'Accept-Language': ['en'],
        'Host': ['httpbin.io'],
        'Sec-Ch-Ua': ['"Google Chrome";v="119", "Chromium";v="119", "Not?A_Brand";v="24"'],
        'Sec-Ch-Ua-Mobile': ['?0'],
        'Sec-Ch-Ua-Platform': ['"macOS"'],
        'Sec-Fetch-Dest': ['document'],
        'Sec-Fetch-Mode': ['navigate'],
        'Sec-Fetch-Site': ['none'],
        'Sec-Fetch-User': ['?1'],
        'Upgrade-Insecure-Requests': ['1'],
        'User-Agent': ['Scrapy/2.11.2 (+https://scrapy.org)']
    }
}

To improve scrapy-impersonate's patch and enhance its anti-bot evasion capability, change Scrapy's User-Agent to a custom browser string.

Hint: Since you've spoofed Chrome 119 and the platform (Sec-Ch-Ua-Platform) header shows macOS, ensure you use a macOS Chrome 119 User Agent. This consistency is essential because anti-bots also block inconsistent header values.

Add the following line to the settings.py file to modify the User-Agent globally across all spiders:

settings.py
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"

Now, repeat your TLS test on Browserleaks by running the previous scraper. You'll get the following TLS signature details, showing a real browser's User-Agent:

Output
{
    'user_agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
    'ja3_hash': '011bd77a82977acd5de9b01227c4d451',
    'ja3_text': '771,4867-4865-4866-49195-49199-49196-...',
    'ja3n_hash': '6274bd95c69dbbaa058e5e066834bf35',
    'ja3n_text': '771,4867-4865-4866-49195-49199-49196-...',
    'akamai_hash': '52d84b11737d980aef856699f885ca86',
    'akamai_text': '1:65536;2:0;4:6291456;6:262144|15663105|0|m,a,s,p'
}

Awesome! You just enhanced your spider for anti-bot bypass.

However, scrapy-impersonate also has some limitations you should know before adding it to your Scrapy stack.

Limitations and Solutions of Scrapy Impersonate

Scrapy-impersonate is a simple solution to mimic human behavior in Scrapy. However, its limitations make it an incomplete scraping toolset.

Scrapy-impersonate has a low user base and almost no documentation describing how to set it up or use it. This downside makes it less beginner-friendly, leaving you with little to no resources to solve related problems.

While it supports various browsers, it has browser version limitations. For instance, it doesn't support Chrome 121+. Even Chrome versions 119 and 120, the latest on its list of supported browsers, are only available for the macOS User Agent. So, a desired platform may constrain you to an old browser version, which may not be suitable for scraping some modern websites. 
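
If an available profile still fits your target site, one partial workaround is to rotate across the profiles your installed version does provide. Here's a minimal sketch; the profile names in the list are examples, so verify them against the scrapy-impersonate/curl_cffi documentation for your version:

Example
import random

import scrapy

class RotatingSpider(scrapy.Spider):
    name = "rotating"
    # example impersonation profiles; verify against your installed version
    browsers = ["chrome110", "chrome116", "chrome119", "chrome120", "safari15_5", "edge101"]

    def start_requests(self):
        urls = ["https://tls.browserleaks.com/json"]
        for url in urls:
            # pick a different browser fingerprint for each request
            yield scrapy.Request(
                url,
                meta={"impersonate": random.choice(self.browsers)},
                dont_filter=True,
            )

    def parse(self, response):
        yield response.json()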

On top of that, our anti-bot bypass test shows that scrapy-impersonate can't handle sophisticated anti-bot measures, including Cloudflare, Akamai, and DataDome.

But the good news is that you can overcome all these limitations with a web scraping API like ZenRows, an all-in-one scraping solution that integrates seamlessly with Scrapy. ZenRows automatically fixes fingerprinting mismatches, modifies your request headers, acts as a headless browser, and bypasses CAPTCHAs and any other anti-bot at scale without putting you through rigorous setups.

To see how ZenRows works, let's use it to scrape the full-page HTML of G2 Reviews, a website heavily protected by DataDome.

Sign up to open the Request Builder. Paste the target URL in the link box, activate Premium Proxies, and select JS Rendering. Choose Python as your preferred language and select the API connection mode. Copy and paste the generated code into your Python script.

Building a scraper with ZenRows (Request Builder screenshot)

Here's the generated code:

Example
# pip install requests
import requests

url = "https://www.g2.com/products/asana/reviews"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)

The code accesses the protected website and scrapes its full-page HTML. See the output below:

Output
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
</head>
<body>
    <!-- other content omitted for brevity -->
</body>
</html>

Congratulations! You've just bypassed DataDome, an advanced anti-bot protection, using ZenRows.

Conclusion

You've seen how scrapy-impersonate mimics a browser and how to use this feature to patch your Scrapy spider for anti-bot bypass. You've learned how to:

  • Configure your Scrapy project to use scrapy-impersonate.
  • Add a browser feature to Scrapy with scrapy-impersonate.
  • Perform basic fingerprinting tests to understand how scrapy-impersonate works.

Keep in mind that while scrapy-impersonate increases your scraper's ability to evade blocks, it can't handle advanced anti-bots and sometimes even struggles with basic blocks. We recommend using ZenRows, an efficient anti-bot bypass API, to avoid all blocks and scrape any website without limitations.

Try ZenRows for free today without a credit card!
