We've all encountered the 403 Forbidden error when web scraping. It's even more frustrating when using a tool like Cloudscraper, a popular open-source library designed to bypass Cloudflare.
If you don't want the Cloudscraper 403 error to halt your web scraping, you're in the right place. In this tutorial, we'll show you the two best ways to solve the 403 Forbidden error when web scraping using Cloudscraper.
But first, here's some background information.
What Is the 403 Forbidden Error When Using Cloudscraper?
The 403 Forbidden error is the HTTP status code you receive when the web server denies you access to its content. It typically means the server understands your request but refuses to fulfill it.
For a basic HTTP client, such as Requests, this usually means the server quickly identified the client as a non-browser and denied it access to the target web page.
However, if this error appears when web scraping using Cloudscraper, it implies that the target website implements advanced bot detection techniques that are beyond Cloudscraper's capabilities.
Here's what the error looks like in your terminal:
HTTPError: 403 Client Error: Forbidden for url: https://www.g2.com/products/asana/reviews
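For reference, here's a minimal sketch of the kind of Cloudscraper request that can produce this error, assuming the target page sits behind advanced Cloudflare protection (we use the same G2 URL as in the error above):
import cloudscraper
# create a Cloudscraper instance
scraper = cloudscraper.create_scraper()
# request a Cloudflare-protected page
response = scraper.get('https://www.g2.com/products/asana/reviews')
# raises HTTPError: 403 Client Error if the anti-bot system rejects the request
response.raise_for_status()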
Most modern websites use sophisticated anti-bot systems that are challenging to bypass with open-source tools. These systems continuously evolve, and solutions like Cloudscraper quickly become obsolete as they struggle to keep up.
If your Cloudscraper-based scraper gets blocked when trying to access Cloudflare-protected pages, you need additional solutions to keep scraping. Below, we recommend the two that work best.
How to Bypass the 403 Forbidden Error in Cloudscraper
Fortunately, the Cloudscraper 403 Forbidden error is solvable with the right tools. While there are ways to manually configure your scraper and try to bypass Cloudflare, they are often time-consuming, unscalable, and still don't guarantee success.
The following two methods hold up best against constantly evolving anti-bot systems, both in terms of success rates and the resources they save.
1. Using Proxies
Proxies act as a bridge between your scraper and the target website, masking your IP address and disguising your web activity.
While Cloudscraper focuses on solving common JavaScript challenges, proxies add an extra layer of anonymity. This makes it much more difficult for websites to flag you as a bot.
To use proxies in Cloudscraper, pass a dictionary to the proxies parameter when making a request.
Here's how to do it:
import cloudscraper
# create a Cloudscraper instance
scraper = cloudscraper.create_scraper()
# define your proxy
proxy = {
    'http': 'http://154.94.5.241:7001',
    'https': 'https://154.94.5.241:7001'
}
# make a request using the proxy
response = scraper.get('https://httpbin.io/ip', proxies=proxy)
# print website content if request is successful
if response.status_code == 200:
    print(response.text)
else:
    print(f'Failed to retrieve content: {response.status_code}')
This code routes a request through a free proxy from the Free Proxy List to HttpBin, an endpoint that returns the requesting agent's IP address.
Here's the result:
{
    "origin": "154.94.5.241:91236"
}
While we used free proxies to demonstrate how to add them to your scraper, they're unreliable and unsuitable for real-world use cases. Websites can easily detect and block them, and they also have a short lifespan.
For consistent performance, you need premium proxies. It's also essential to rotate them to improve your chances of bypassing the 403 error and avoiding IP-based restrictions, such as rate limiting and IP bans.
That said, manually rotating proxies can take considerable time and effort, especially when web scraping at scale.
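For context, here's a minimal sketch of what manual rotation can look like, assuming you maintain your own pool of proxy URLs (the addresses below are placeholders, not real endpoints):
import random
import cloudscraper
# hypothetical pool of premium proxy URLs
proxy_pool = [
    'http://<USERNAME>:<PASSWORD>@proxy1.example.com:8080',
    'http://<USERNAME>:<PASSWORD>@proxy2.example.com:8080',
    'http://<USERNAME>:<PASSWORD>@proxy3.example.com:8080',
]
scraper = cloudscraper.create_scraper()
# pick a different proxy for each request
proxy_url = random.choice(proxy_pool)
proxy = {'http': proxy_url, 'https': proxy_url}
response = scraper.get('https://httpbin.io/ip', proxies=proxy)
print(response.text)
On top of the code, you'd also have to source, test, and retire these proxies yourself, which is where most of the overhead comes from.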
Luckily, ZenRows can help simplify this process. By automatically rotating residential proxies, this solution allows you to focus on extracting data rather than writing long lines of code.
Here's a quick guide on how to rotate premium proxies using ZenRows.
To use ZenRows, sign up to access your dashboard. Select Residential Proxies in the left menu section and create a new proxy user. You'll be directed to the Proxy Generator page.
Copy your proxy URL for use in your Cloudscraper script. You can choose between the auto-rotate option and sticky sessions, which allow you to maintain a proxy for a specified time frame.
Here's the final code using ZenRows premium proxies:
import cloudscraper
# create CloudScraper instance
scraper = cloudscraper.create_scraper()
# define your proxy
proxy = {
    'http': 'http://<PROXY_USERNAME>:<PROXY_PASSWORD>@superproxy.zenrows.com:1337',
    'https': 'https://<PROXY_USERNAME>:<PROXY_PASSWORD>@superproxy.zenrows.com:1338'
}
# make a request using the proxy
response = scraper.get("https://httpbin.io/ip", proxies=proxy)
# print website content if request is successful
if response.status_code == 200:
    print(response.text)
else:
    print(f'Failed to retrieve content: {response.status_code}')
However, proxies alone still don't guarantee bypassing websites with the highest levels of Cloudflare protection.
2. Using a Web Scraping API
Some cases may require much more than proxies as anti-bot systems get even more sophisticated. In these scenarios, use a web scraping API.
The best web scraping APIs, such as ZenRows, automatically handle all anti-bot measures. Under the hood, they emulate natural user behavior, allowing you to scrape any website without getting blocked.
ZenRows works with any programming language and provides numerous features, including premium proxies, CAPTCHA bypass, User Agent rotation, and everything else you need to bypass any anti-bot system.
Below is a step-by-step guide on how to use the ZenRows web scraping API.
Sign up, and you'll be directed to the Request Builder page.
Input the target URL and activate Premium Proxies and the JS Rendering mode. For this example, we'll scrape a Cloudflare-protected G2 Reviews page.
Select any language option on the right, e.g., Python, and choose the API mode. ZenRows will generate your request code.
Copy the code and use your preferred HTTP client to make a request to the ZenRows API. Your script will look like this:
# pip install requests
import requests
url = 'https://www.g2.com/products/asana/reviews'
apikey = '<YOUR_ZENROWS_API_KEY>'
params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'premium_proxy': 'true',
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
Run it, and you'll get the following result:
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
    <!-- ... -->
</head>
<body>
    <!-- other content omitted for brevity -->
</body>
</html>
That's how easy it is to bypass advanced anti-bot protection using ZenRows.
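From there, extracting data from the returned HTML is standard parsing work. As a rough sketch, assuming you keep the response object from the script above, you could pull the page title with BeautifulSoup:
# pip install beautifulsoup4
from bs4 import BeautifulSoup
# 'response' is the ZenRows API response from the previous script
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.get_text())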
Conclusion
Encountering the 403 error when web scraping with Cloudscraper can be frustrating. Fortunately, there are methods to prevent it from appearing.
Rotating premium proxies may work in some scenarios but isn't always effective. The only surefire way of bypassing Cloudflare is using a web scraping API like ZenRows, the best tool to ensure you never run into 403 Forbidden errors.
ZenRows deals with everything behind the scenes and doesn't require manual configurations.
Try ZenRows for free now without a credit card!