What Is Scrapy Code 403?
Scrapy's error code 403 is a common web scraping error that corresponds to the HTTP status code 403 Forbidden.
This is an example of how it looks in your logs:
[scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.g2.com> (referer: None)
[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.g2.com>: HTTP status code is not handled or not allowed
The web scraping 403 error means the target server understood your request but refused to fulfill it. This can happen for different reasons, which fall under two main explanations:
- You're not authorized to access the target web resource. That can result from targeted restrictions like IP banning.
- You're flagged as a bot and denied access.
403 errors are especially common when your target URL is a web page protected by Cloudflare, as it often returns a 403 status code for requests it considers unacceptable.
Regardless of the case, these are seven actionable techniques to solve Scrapy's error code 403:
- Use a web scraping API (the easiest solution).
- Space out your requests.
- Use proxies to randomize your IP.
- Change your User Agent.
- Complete your headers.
- Render JavaScript.
- Use middleware to get around Cloudflare.
How Do I Fix 403 Forbidden in Scrapy?
Here are several solutions that can help you overcome the 403: Forbidden error.
1. Easy Solution to Bypass Scrapy 403
The most effective way to bypass error 403 in Scrapy is through a web scraping API, which acts as an intermediary between your Scrapy spider and target server and bypasses all anti-bot measures for you.
ZenRows is a popular web scraping API that integrates seamlessly with Scrapy. You only need to send your requests through the ZenRows API endpoint.
To try it, create a function that generates the ZenRows API URL, which you'll use in your Scrapy spider to make your requests.
The function takes two arguments: the target URL and your API key (we'll set them in the next step). For higher effectiveness, we recommend activating the Premium Proxies and JS Rendering parameters. The function also needs to encode the target URL and the other parameters, which you can do with urlencode.
import scrapy
from urllib.parse import urlencode

def get_zenrows_api_url(url, api_key):
    # Creates a ZenRows API URL for the given target URL using the provided API key.
    payload = {
        'url': url,
        'js_render': 'true',
        'premium_proxy': 'true'
    }
    api_url = f'https://api.zenrows.com/v1/?apikey={api_key}&{urlencode(payload)}'
    return api_url
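To see why the encoding step matters, consider a target URL that carries its own query string. The sketch below uses a hypothetical example.com URL and a placeholder key, and repeats the URL construction under the assumed name build_api_url so it runs on its own. Without urlencode, the target's ? and & characters would be read as part of the API query rather than as part of the target URL.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def build_api_url(target_url, api_key):
    # Same construction as get_zenrows_api_url, repeated so the sketch is self-contained
    payload = {'url': target_url, 'js_render': 'true', 'premium_proxy': 'true'}
    return f'https://api.zenrows.com/v1/?apikey={api_key}&{urlencode(payload)}'

# Hypothetical target URL that has its own query parameters
target = 'https://www.example.com/search?q=scrapy&page=2'
api_url = build_api_url(target, 'PLACEHOLDER_KEY')
print(api_url)

# Decoding the API URL's query string recovers the target URL intact
params = parse_qs(urlsplit(api_url).query)
print(params['url'][0] == target)
```

The target's own q and page parameters stay inside the single url parameter instead of leaking into the API call.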
Now, use the function above in your spider to make your request and retrieve data. For that, define your target URL (we'll scrape https://www.g2.com/) and specify your API key (sign up to get yours for free).
class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        urls = [
            'https://www.g2.com',
        ]
        api_key = '<YOUR_ZENROWS_API_KEY>'
Then, make a GET request using the ZenRows API URL:
        # ...
        for url in urls:
            # Make a GET request using the ZenRows API URL
            api_url = get_zenrows_api_url(url, api_key)
            yield scrapy.Request(api_url, callback=self.parse)
Lastly, retrieve your desired data from the response object. We'll print the page's title tag to verify that our code works. The complete code should look like this.
import scrapy
from urllib.parse import urlencode

def get_zenrows_api_url(url, api_key):
    # Creates a ZenRows API URL for the given target URL using the provided API key.
    payload = {
        'url': url,
        'js_render': 'true',
        'premium_proxy': 'true'
    }
    # Construct the API URL by appending the encoded payload to the base URL with the API key
    api_url = f'https://api.zenrows.com/v1/?apikey={api_key}&{urlencode(payload)}'
    return api_url

class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        urls = [
            'https://www.g2.com',
        ]
        api_key = '<YOUR_ZENROWS_API_KEY>'
        for url in urls:
            # Make a GET request using the ZenRows API URL
            api_url = get_zenrows_api_url(url, api_key)
            yield scrapy.Request(api_url, callback=self.parse)

    def parse(self, response):
        # Extract and log the title tag
        title = response.css('title::text').get()
        self.logger.info(f'Title: {title}')
You can see the result on line seven:
//..
[scrapy.core.engine] INFO: Spider opened
[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[test] INFO: Spider opened: test
[scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com&js_render=true&premium_proxy=true> (referer: None)
[test] INFO: Title: Business Software and Services Reviews | G2
[scrapy.core.engine] INFO: Closing spider (finished)
Congrats, you've bypassed Scrapy's error 403 by integrating Scrapy with ZenRows.
We'll see other techniques below in this article, but ZenRows applies them and many more automatically without any custom development or needing dozens of dependencies.
2. Space Out Your Requests
Making too many requests within a short time frame can result in an IP address ban, causing you to face Scrapy's error 403. Rate limiting and IP blocking are common measures that websites use to mitigate bot traffic.
Fortunately, you can mitigate this risk using Scrapy's DOWNLOAD_DELAY setting together with RANDOMIZE_DOWNLOAD_DELAY. The latter randomizes the delay between requests, making them appear more natural.
To enable RANDOMIZE_DOWNLOAD_DELAY in Scrapy, follow the steps below:
- Navigate to your project's settings file (settings.py).
- Locate the DOWNLOAD_DELAY setting and remove the hash (#) to uncomment the line. By default, this setting is commented out (inactive), meaning that requests are sent consecutively without any delay between them.
- Since we want randomized behavior rather than a fixed time between requests, add a new line after DOWNLOAD_DELAY and type RANDOMIZE_DOWNLOAD_DELAY = True. Scrapy then waits a random interval between 0.5 and 1.5 times DOWNLOAD_DELAY before each request.
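As a rough illustration of the steps above, the relevant lines in settings.py could look like the sketch below. The 3-second base delay is an arbitrary value chosen for illustration. The effective_delay helper is hypothetical (it isn't part of Scrapy) and only mimics how the randomized wait is drawn from the range of 0.5 to 1.5 times DOWNLOAD_DELAY.

```python
import random

# settings.py (sketch): base delay in seconds; 3 is an arbitrary example value
DOWNLOAD_DELAY = 3
# Randomize each wait around the base delay instead of using a fixed interval
RANDOMIZE_DOWNLOAD_DELAY = True

def effective_delay(base_delay, randomize=True):
    """Hypothetical helper: mimics the wait Scrapy picks before a request."""
    if randomize:
        # Random value between 0.5x and 1.5x the base delay
        return random.uniform(0.5 * base_delay, 1.5 * base_delay)
    return base_delay

# With a 3-second base, each wait falls somewhere between 1.5 and 4.5 seconds
sample_wait = effective_delay(DOWNLOAD_DELAY, RANDOMIZE_DOWNLOAD_DELAY)
```

A larger base delay widens the range and lowers the request rate, at the cost of a slower crawl.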
Spacing out your requests helps when the reason for an IP ban is too many requests in a short time. However, it's more of a complement than a real solution, since websites may block your IP address for other reasons. Randomizing the delay alone is therefore not enough. For better results, consider the next solution.
3. Use Proxies to Randomize Your IP
Proxies allow you to route your requests through different IP addresses.
There are two main types of proxies used in web scraping: datacenter and residential proxies. Datacenter proxies are servers provided by data centers or cloud providers, which are easily detected and not recommended. Residential proxies are IPs assigned to real devices, such as home internet connections, and are generally considered the best web scraping proxies since they're considerably harder to spot.
In addition to getting a proxy, you need to randomize your IP address per request so that your traffic looks like it comes from different human users.
Check out our guide on how to use a proxy with Scrapy to integrate one.
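As a minimal sketch of per-request IP rotation, a small downloader middleware can assign a random proxy to each outgoing request through request.meta['proxy'], which Scrapy's built-in HttpProxyMiddleware honors. The proxy URLs and the RandomProxyMiddleware name below are placeholders for illustration, not real endpoints or an existing library class.

```python
import random

# Hypothetical proxy endpoints; replace with the ones from your provider
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

class RandomProxyMiddleware:
    """Downloader middleware sketch: picks a random proxy for every request."""

    def __init__(self, proxies=PROXY_POOL):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware reads the proxy from request.meta['proxy']
        request.meta["proxy"] = random.choice(self.proxies)
        # Returning None lets Scrapy continue processing the request normally
        return None
```

To try it, you would register the class in DOWNLOADER_MIDDLEWARES under your own module path (for example, 'myproject.middlewares.RandomProxyMiddleware': 350, where the path is an assumption) so it runs before the built-in HttpProxyMiddleware.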
4. Change Your User Agent to Solve Scrapy 403
A User Agent (UA) is a string in the HTTP request headers that informs the web server of the client (in this case, your Scrapy spider) making the request. Websites often use this information to distinguish between browsers and bots.
The default Scrapy User Agent is Scrapy/2.9.0 (+https://scrapy.org). Meanwhile, a real browser uses one like this:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36
Obviously, websites will easily recognize you as a scraper, and you'll get Error 403 from Scrapy if you don't set a new User Agent (UA).
You can check out our complete guide on User Agents in Scrapy and grab some real ones from our list of top UAs for web scraping.
As a quick tutorial, you can change it using one of these two methods:
Handle a Manual Setup
In your settings.py file, locate the USER_AGENT setting and remove the hash (#) to uncomment the line. Then, replace the value with a browser's User Agent string. For example:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
You can send GET requests to https://httpbin.io/user-agent to test your new User Agent. This returns a JSON object containing the User Agent that was sent in the request headers.
Use the scrapy-fake-useragent Middleware
The scrapy-fake-useragent middleware automatically sets and rotates multiple User Agents for you. To use it, install it using the following command:
pip install scrapy-fake-useragent
Then, in your settings.py file, activate RandomUserAgentMiddleware and RetryUserAgentMiddleware, and disable Scrapy's default UA and retry middlewares by setting useragent.UserAgentMiddleware and retry.RetryMiddleware to None. Your settings should look like this:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
}
You can make multiple requests to https://httpbin.io/user-agent to test your User Agent rotator. Each request should produce a different UA string.
Great! Changing your UA will improve your scraper's reliability to fight against Scrapy's error code 403. Yet, you need to consider the other HTTP request headers.
5. Complete Your Headers
While the User Agent is the most important HTTP header string, it isn't the only one you need to account for to appear like a regular user.
These are Scrapy's main default headers. The first two come from the DEFAULT_REQUEST_HEADERS setting, which you can find in your settings.py file:
{
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Scrapy/2.9.0 (+https://scrapy.org)'
}
Meanwhile, a browser's default headers look like this:
{
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    'cookie': 'prov=4568ad3a-2c02-1686-b062-b26204fd5a6a; usr=p=%5b10%7c15%5d%5b160%7c%3bNewest%3b%5d',
    'referer': 'https://www.google.com/',
    'sec-ch-ua': '"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'cross-site',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}
As you can see, Scrapy's default headers don't include all the headers an actual browser sends with its requests, such as referer, cookie, and others specific to the website or user session. Without complete and accurate headers, websites can easily detect your scraper and respond with a 403 error.
As a first approach to set them all, check your browser's headers (e.g., Chrome) and use them.
To do that, navigate to any website and inspect the page. Then, select the Network tab, reload the page if necessary, and choose any network request from the list. You should be able to view your request headers under the Headers section.
Now, to add custom headers in Scrapy, locate and uncomment the DEFAULT_REQUEST_HEADERS setting in settings.py. Then, modify its value by adding headers as necessary. For example:
DEFAULT_REQUEST_HEADERS = {
    'cookie': 'prov=4568ad3a-2c02-1686-b062-b26204fd5a6a; usr=p=%5b10%7c15%5d%5b160%7c%3bNewest%3b%5d',
    'referer': 'https://www.google.com/',
    # Add more custom headers as needed
}
Read our guide on HTTP headers for web scraping to learn more.
6. Render JavaScript to Bypass Scrapy Error 403 Forbidden
Rendering JavaScript is critical to avoid getting error code 403 in Scrapy because websites test your ability to execute JavaScript to determine whether you're using a real browser or sending automated traffic. These tests include techniques like browser fingerprinting and other JavaScript-based anti-bot measures.
Unfortunately, Scrapy doesn't come with built-in JavaScript rendering capabilities. However, you can add middleware like Scrapy Splash (most popular), Scrapy Selenium or Scrapy Playwright to render website content like an actual browser and emulate natural user behavior.
At the same time, these headless browsers come with flags that can spot you as a web scraper, so you'll need to customize them in your project. Also, these tools are resource intensive and often require additional configuration, which may introduce additional complexity.
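For instance, a minimal Scrapy Playwright setup might look like the sketch below. It assumes the scrapy-playwright package and its browsers are already installed (pip install scrapy-playwright, then playwright install). Treat it as a starting point rather than a complete anti-detection setup, since it doesn't address the browser flags mentioned above.

```python
# settings.py (sketch): route HTTP and HTTPS downloads through Playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright needs the asyncio-based Twisted reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In the spider, opt in per request through the "playwright" meta key, e.g.:
#   yield scrapy.Request(url, meta={"playwright": True}, callback=self.parse)
```

Only requests that carry the playwright meta key are rendered in the browser, so you can keep plain requests for pages that don't need JavaScript.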
7. Middleware to Get Around Cloudflare 403 with Scrapy
With many websites using Cloudflare, getting blocked by its bot management system has become frequent in data extraction projects.
Cloudflare is a widely used web security and Content Delivery Network (CDN). It works like a reverse proxy between you and the target web server, intercepting incoming requests, analyzing them, and implementing security measures to distinguish between natural users and bots. If it detects the request comes from a scraper, it instantly denies you access.
The good news is you can use the scrapy-cloudflare-middleware middleware to bypass Cloudflare with Scrapy. The plugin modifies the requests and responses during the scraping process and helps you overcome Cloudflare's challenges.
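Assuming the package is installed (pip install scrapy-cloudflare-middleware), enabling it should be a one-entry change in settings.py. The module path and the priority of 560 below follow the plugin's README, where that priority is chosen so the middleware handles responses just before Scrapy's built-in RetryMiddleware (550). Verify both against the version you install, and keep in mind the plugin may not keep up with Cloudflare's newer challenges.

```python
# settings.py (sketch): enable the Cloudflare middleware
DOWNLOADER_MIDDLEWARES = {
    # 560 places it just ahead of the built-in RetryMiddleware (550) for responses
    'scrapy_cloudflare_middleware.middlewares.CloudFlareMiddleware': 560,
}
```

If you already define DOWNLOADER_MIDDLEWARES (for example, for User Agent rotation), add this entry to the existing dictionary instead of declaring a second one.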
Conclusion
Encountering Scrapy's 403 Forbidden Error when web scraping can be frustrating. While solutions like spacing out requests could benefit testing purposes, real-world scenarios require a combination of more advanced techniques like a well-set-up and fortified headless browser or paid services like residential IP pools.
Fortunately, ZenRows is an all-in-one solution to avoid getting blocked and integrates with Scrapy. You only need to make your request through its API endpoint to retrieve your desired data. Sign up now to get your 1,000 free API credits.