Do you want to enhance your Scrapy spider's ability to bypass blocks during web scraping? Scrapy-impersonate can help you achieve that without the overhead of a full browser.
In this article, you'll learn how scrapy-impersonate works and how to use it to take full advantage of its anti-bot bypass mechanisms.
Let's go!
Why Use Scrapy Impersonate?
Scrapy-impersonate is a Scrapy download handler that spoofs a browser's Transport Layer Security (TLS) signature to improve your spider's ability to bypass anti-bots. It supports Chrome, Edge, and Safari.
In addition to browser spoofing, it mimics various platforms, including macOS, Windows, iOS, and Android, giving you more stealth while web scraping with Scrapy.
However, scrapy-impersonate isn't another headless browser that lets you execute JavaScript during web scraping. Like curl_cffi, it only mimics a browser's network-level fingerprint and doesn't provide a full-fledged browser interface, as seen with automation tools like Selenium.
Check out our tutorial on Scrapy Playwright or Scrapy Splash if you want to add headless browser support to your Scrapy project.
Before learning how the library works, let's see the requirements to get started.
Prerequisites
Scrapy-impersonate works with Python 3+. So, download and install the latest Python version from python.org.
Then, install scrapy-impersonate using pip. This process also installs Scrapy:
pip install scrapy-impersonate
Once installed, run the following command to start a Scrapy project. Let's call it my_scraper:
scrapy startproject my_scraper
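The command generates a standard Scrapy project skeleton similar to the one below (the exact files can vary slightly with your Scrapy version):

my_scraper/
    scrapy.cfg
    my_scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py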
Now, open your project folder with any code editor, e.g., VS Code.
Step 1: Set Up Scrapy-impersonate
The next step is configuring Scrapy to use scrapy-impersonate's features. As mentioned, the library acts as a Scrapy download handler, the component that fetches content from web pages.
Its configuration only requires pointing Scrapy to the scrapy-impersonate download handler in the settings file.
Open your project's settings.py file and add the following lines:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_impersonate.ImpersonateDownloadHandler",
    "https": "scrapy_impersonate.ImpersonateDownloadHandler",
}
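The handler is asynchronous, so your project also needs Scrapy's asyncio reactor enabled. Recent Scrapy versions include this setting in new projects by default; if your settings.py doesn't have it, add it alongside the handlers:

# enable the asyncio-based Twisted reactor required by async download handlers
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"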
Let's see how scrapy-impersonate works in the next section.
Step 2: Set a Browser to Scrape a Page
Scraping a web page with scrapy-impersonate requires intercepting your spider's request and attaching a browser option to it.
You'll learn how the library works by scraping Browserleaks, a test website that returns a brief JSON response with your browser's TLS fingerprint details.
First, let's see what a simple spider request without scrapy-impersonate gives us.
Create a new scraper.py file in your spiders directory and configure your spider class as shown:
# import scrapy
import scrapy


class ScraperSpider(scrapy.Spider):
    # specify the spider's name
    name = "scraper"
    start_urls = ["https://tls.browserleaks.com/json"]

    # parse the response
    def parse(self, response):
        yield response.json()
Run the spider with the following command:
scrapy crawl scraper
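If you also want to store the yielded items instead of only printing them to the console, Scrapy's output flag writes them to a file, for example:

scrapy crawl scraper -O output.json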
The above code returns the fingerprint details below. The ja3_hash is Scrapy's fixed value. Run the scraper a few more times, and you'll see that the ja3_hash doesn't change.
Additionally, Akamai's fingerprint parameters, including akamai_hash and akamai_text, are missing, indicating that the spider failed Akamai's browser fingerprinting test. It also reveals that the request uses Scrapy's User Agent, which can get you blocked easily due to its bot-like appearance:
{
    'user_agent': 'Scrapy/2.11.2 (+https://scrapy.org)',
    'ja3_hash': '5cc600468c246704e1699c12f51eb3ab',
    'ja3_text': '771,4866-4867-4865-49196-49200-159-52393-...',
    'ja3n_hash': '41b57c95e90f19a8d418248b79dab8e4',
    'ja3n_text': '771,4866-4867-4865-49196-49200-159-52393-...',
    'akamai_hash': '',
    'akamai_text': ''
}
Let's run another check to see Scrapy's default request headers.
To check your spider's complete request headers, replace the target URL in the previous code with https://httpbin.io/headers.
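For clarity, here's the same spider with only the target URL swapped:

import scrapy


class ScraperSpider(scrapy.Spider):
    name = "scraper"
    # same spider as before, only the target URL changes
    start_urls = ["https://httpbin.io/headers"]

    def parse(self, response):
        yield response.json()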
See the output below.
{
    'headers': {
        'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
        'Accept-Encoding': ['gzip, deflate, br, zstd'],
        'Accept-Language': ['en'],
        'Host': ['httpbin.io'],
        'User-Agent': ['Scrapy/2.11.2 (+https://scrapy.org)']
    }
}
Essential header information, such as the platform name, browser version, and secure client hints, is missing, making your spider more vulnerable to anti-bot detection. The User-Agent is also a Scrapy string, confirming the Browserleaks result.
Let's modify these TLS fingerprints and request headers with scrapy-impersonate using Browserleaks as a test website.
Modify your spider class like this:
class ScraperSpider(scrapy.Spider):
    # specify the spider's name
    name = "scraper"
Define a Scrapy request interceptor that requests the target page and sets the browser option in the request's meta dictionary:
class ScraperSpider(scrapy.Spider):
    # ...

    def start_requests(self):
        # intercept the request
        yield scrapy.Request(
            "https://tls.browserleaks.com/json",
            # specify a browser version
            meta={
                "impersonate": "chrome119",
            },
        )
Finally, parse the content in a parser function:
class ScraperSpider(scrapy.Spider):
    # ...

    def parse(self, response):
        yield response.json()
Combine the snippets. Here's the full spider code:
# import scrapy
import scrapy


class ScraperSpider(scrapy.Spider):
    # specify the spider's name
    name = "scraper"

    def start_requests(self):
        # intercept the request
        yield scrapy.Request(
            "https://tls.browserleaks.com/json",
            # specify a browser version
            meta={
                "impersonate": "chrome119",
            },
        )

    # parse the response
    def parse(self, response):
        yield response.json()
The above spider outputs the following TLS fingerprint details:
{
    'user_agent': 'Scrapy/2.11.2 (+https://scrapy.org)',
    'ja3_hash': '0dc98a37b899c02cf946eda176979e52',
    'ja3_text': '771,4867-4865-4866-49195-49199-49196-...',
    'ja3n_hash': 'd87d800723dc75d90d46b7894734d303',
    'ja3n_text': '771,4867-4865-4866-49195-49199-49196-...',
    'akamai_hash': '52d84b11737d980aef856699f885ca86',
    'akamai_text': '1:65536;2:0;4:6291456;6:262144|15663105|0|m,a,s,p'
}
It contains the Akamai hash and text. Additionally, the ja3_hash mimics a real browser and changes per request, boosting your scraper's chances of evading anti-bot security checks. The only disadvantage of this result is that your spider still uses Scrapy's User-Agent.
Now, replace the Browserleaks URL with https://httpbin.io/headers to see how scrapy-impersonate has modified your request headers.
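Only the URL inside start_requests changes; the impersonate option stays the same:

class ScraperSpider(scrapy.Spider):
    # ...

    def start_requests(self):
        yield scrapy.Request(
            # same request, now pointing at the header test endpoint
            "https://httpbin.io/headers",
            meta={
                "impersonate": "chrome119",
            },
        )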
Below is the result for https://httpbin.io/headers, showing notable request headers such as the secure client hint User Agent (Sec-Ch-Ua), platform, Sec-Fetch-User, and Sec-Fetch-Mode. All these make the request look like it comes from a legitimate browser.
However, like the previous test, this one also shows that scrapy-impersonate doesn't patch the User-Agent while modifying the TLS signature:
{
    'headers': {
        'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
        'Accept-Encoding': ['gzip, deflate, br, zstd'],
        'Accept-Language': ['en'],
        'Host': ['httpbin.io'],
        'Sec-Ch-Ua': ['"Google Chrome";v="119", "Chromium";v="119", "Not?A_Brand";v="24"'],
        'Sec-Ch-Ua-Mobile': ['?0'],
        'Sec-Ch-Ua-Platform': ['"macOS"'],
        'Sec-Fetch-Dest': ['document'],
        'Sec-Fetch-Mode': ['navigate'],
        'Sec-Fetch-Site': ['none'],
        'Sec-Fetch-User': ['?1'],
        'Upgrade-Insecure-Requests': ['1'],
        'User-Agent': ['Scrapy/2.11.2 (+https://scrapy.org)']
    }
}
To improve scrapy-impersonate's patch and enhance its anti-bot evasion capability, change Scrapy's User-Agent to a custom browser string.
Hint: Since you've spoofed Chrome 119 and the platform (Sec-Ch-Ua-Platform) header shows macOS, ensure you use a macOS Chrome 119 User Agent. This consistency is essential because anti-bots also block inconsistent header values.
Add the following line to the settings.py file to modify the User-Agent globally across all spiders:
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
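If you prefer to keep the override local to one spider rather than the whole project, Scrapy's custom_settings attribute works as well. A minimal sketch:

class ScraperSpider(scrapy.Spider):
    name = "scraper"
    # per-spider override of the User-Agent setting
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    }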
Now, repeat your TLS test on Browserleaks by running the previous scraper. You'll get the following TLS signature details, showing a real browser's User-Agent:
{
    'user_agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
    'ja3_hash': '011bd77a82977acd5de9b01227c4d451',
    'ja3_text': '771,4867-4865-4866-49195-49199-49196-...',
    'ja3n_hash': '6274bd95c69dbbaa058e5e066834bf35',
    'ja3n_text': '771,4867-4865-4866-49195-49199-49196-...',
    'akamai_hash': '52d84b11737d980aef856699f885ca86',
    'akamai_text': '1:65536;2:0;4:6291456;6:262144|15663105|0|m,a,s,p'
}
Awesome! You just enhanced your spider for anti-bot bypass.
However, scrapy-impersonate is selective about the browser versions it supports, and it has other limitations you should know before adding it to your Scrapy stack.
Limitations and Solutions of Scrapy Impersonate
Scrapy-impersonate is a simple way to make your Scrapy requests look like they come from a real browser. However, its limitations make it an incomplete scraping toolset.
Scrapy-impersonate has a low user base and almost no documentation describing how to set it up or use it. This downside makes it less beginner-friendly, leaving you with little to no resources to solve related problems.
While it supports various browsers, it has browser version limitations. For instance, it doesn't support Chrome 121+. Even Chrome versions 119 and 120, the latest on its list of supported browsers, are only available for the macOS User Agent. So, a desired platform may constrain you to an old browser version, which may not be suitable for scraping some modern websites.
On top of that, our anti-bot bypass test shows that scrapy-impersonate can't handle sophisticated anti-bot measures, including Cloudflare, Akamai, and DataDome.
But the good news is that you can overcome all these limitations with a web scraping API like ZenRows, an all-in-one scraping solution that integrates seamlessly with Scrapy. ZenRows automatically fixes fingerprinting mismatches, modifies your request headers, acts as a headless browser, and bypasses CAPTCHAs and any other anti-bot at scale without putting you through rigorous setups.
To see how ZenRows works, let's use it to scrape the full-page HTML of G2 Reviews, a website heavily protected by DataDome.
Sign up to open the Request Builder. Paste the target URL in the link box, activate Premium Proxies, and select JS Rendering. Choose Python as your preferred language and select the API connection mode. Copy and paste the generated code into your Python script.
Here's the generated code:
# pip install requests
import requests

url = "https://www.g2.com/products/asana/reviews"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)
The code accesses the protected website and scrapes its full-page HTML. See the output below:
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
</head>
<body>
    <!-- other content omitted for brevity -->
</body>
</html>
Congratulations! You've just bypassed DataDome, an advanced anti-bot protection, using ZenRows.
Conclusion
You've seen how scrapy-impersonate mimics a browser and how to use this feature to patch your Scrapy spider for anti-bot bypass. You've learned how to:
- Configure your Scrapy project to use scrapy-impersonate.
- Add a browser feature to Scrapy with scrapy-impersonate.
- Perform basic fingerprinting tests to understand how scrapy-impersonate works.
Keep in mind that while scrapy-impersonate increases your scraper's ability to evade blocks, it can't handle advanced anti-bots and even struggles with basic blocks sometimes. We recommend using ZenRows, an efficient anti-bot bypass API, to avoid all blocks and scrape any website without limitations.
Try ZenRows for free today without a credit card!