For Imperva (Incapsula)-protected websites, error 403 is a common enemy of web scrapers. But don't worry: there are ways to deal with the error and scrape uninterrupted.
If your target website keeps returning the 403 response, here are two tried and tested solutions that'll help you access your desired data:
- Use rotating proxies.
- Use a web scraping API.
In this tutorial, we'll show you how to apply both methods step by step. But first, let's explain this error code and why you're getting it.
What Is 403 Forbidden Error in Imperva (Incapsula)?
The 403 Forbidden error is an HTTP response status code that means the server received your request and understood it but refuses to fulfill it. For an average user, this indicates insufficient permissions to access the website content.
However, if this error appears when scraping an Imperva-protected web page, it typically implies that the anti-bot system flags your web scraper as an unwanted bot and blocks your request.
Here's what the response looks like in your console:
HTTPError: 403 Client Error: Forbidden for url: https://www.carehome.co.uk/
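In a Python Requests script, this error typically surfaces when you call raise_for_status() on the response. Here's a minimal sketch of that pattern (fetch_page is a hypothetical helper, and the URL is the article's example target):

```python
import requests

def fetch_page(url: str) -> str:
    """Fetch a page, raising HTTPError on 4xx/5xx responses (e.g., 403)."""
    response = requests.get(url, timeout=10)
    # raise_for_status() converts a 403 response into an HTTPError
    response.raise_for_status()
    return response.text

# Against an Imperva-protected page, this call is likely to raise:
# try:
#     html = fetch_page("https://www.carehome.co.uk/")
# except requests.HTTPError as err:
#     print(err)
```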
Before we dive into the methods, let's take a step back to understand how Imperva works.
The basic architecture of an Imperva-protected website is represented in the image below.
As you can see, Imperva's web application firewall (WAF) is a barrier between you (the client) and the target web application. When your scraper tries to access the web page, the WAF intercepts your request and analyzes it to determine the source: human or bot. Imperva also uses its Advanced Bot Protection service for further analysis.
Based on the result, the service instructs the WAF to either block the request, grant access, or require further verification using a CAPTCHA. Imperva will likely block the request if your scraper is identified as a bot, resulting in the 403 Forbidden error.
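From the client's point of view, those three outcomes can be modeled with a simplified response classifier. This is an illustrative sketch, not Imperva's actual decision logic:

```python
def classify_response(status_code: int, body: str) -> str:
    """Map a response to the three WAF outcomes described above (simplified)."""
    if status_code == 403:
        return "blocked"      # the WAF refused the request outright
    if "captcha" in body.lower():
        return "challenged"   # further verification is required
    return "allowed"          # the request passed the analysis
```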
Imperva uses three main detection approaches, all working together to mitigate bot traffic.
- Signature-based detection techniques: Imperva maintains an evolving database of known bot patterns. The tool compares your requests against them to determine their source. Some signature-based detection techniques include request header analysis and IP reputation analysis.
- Behavior-based detection techniques: Imperva's bot management system uses the client's behavioral data to identify bot-like traits. It defines rules and heuristics that capture user behavior and flag unnatural activity, such as high-frequency requests or robotic mouse movements. Most of these behavioral checks (request analysis, page navigation analysis, client interaction analysis, and others) occur on the server side.
- Client fingerprinting techniques: Bots have unique fingerprints, which Imperva uses to differentiate them from human traffic. By gathering specific data from different touchpoints in your request, this anti-bot system can create fingerprints to track your scraper. Based on the touchpoints and tools used to gather the necessary information, some examples are device fingerprinting, browser fingerprinting, TLS fingerprinting, JavaScript challenges, and more.
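As a concrete example of lowering your signature-based footprint, a scraper can reuse a session with browser-like headers instead of sending Requests' defaults. This is a minimal sketch; the header values are illustrative, so copy up-to-date ones from a real browser session:

```python
import requests

# Illustrative browser-like headers; grab current values from your browser
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def build_session() -> requests.Session:
    # A Session keeps cookies across requests, which reads more like a
    # real browser than a fresh connection every time
    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)
    return session
```

Note that headers alone won't defeat fingerprinting techniques like TLS analysis; they only address the header-analysis portion of signature-based detection.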
For more details on how this anti-bot system works, check out this guide on how to bypass Incapsula.
How to Bypass 403 Forbidden in Imperva (Incapsula)?
Emulating natural user behavior increases your chances of getting past the analysis stage and ultimately bypassing the 403 Forbidden error. Below are the two best methods to achieve that.
Method #1: Use Proxies
Proxies allow you to configure your requests to originate from a different IP address, essentially increasing anonymity and disguising your web activity.
To use proxies, set up your script to make requests through a proxy server. Let's see how to do it:
import requests

# define your proxy server address
proxy_url = 'http://189.240.60.166:9090'

# create a dictionary with the proxy configuration
proxies = {
    'http': proxy_url,
    'https': proxy_url,
}

# make a request using the proxy
response = requests.get('https://httpbin.io/ip', proxies=proxies)

# verify it works
print(response.text)
You'll get the following output on running this code:
{
    "origin": "189.240.60.166:9090"
}
This code shows how to implement a proxy in Python Requests. It routes a request to HTTPBin, a website that returns the client's IP address, through a proxy from Free Proxy List.
A free proxy can hide your IP address and works well as a learning tool. However, it's unreliable, and connections are short-lived. Moreover, free proxies often come with security risks and aren't suitable for web scraping tasks.
For consistent performance and to avoid potential security issues, you need premium proxies. That said, it's also crucial to rotate between them to avoid IP-based restrictions such as rate limiting and IP bans.
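A simple way to rotate is to pick a random proxy from a pool on each request. Here's a sketch of that approach; the proxy URLs are placeholders, so substitute the endpoints and credentials from your provider:

```python
import random
import requests

# Placeholder endpoints; replace with your premium proxy credentials
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

def pick_proxy() -> dict:
    # Choosing a different proxy per request spreads traffic across IPs,
    # reducing the chance of rate limiting or an IP ban
    proxy_url = random.choice(PROXY_POOL)
    return {"http": proxy_url, "https": proxy_url}

def fetch(url: str) -> requests.Response:
    return requests.get(url, proxies=pick_proxy(), timeout=10)
```

Premium providers often expose a single rotating endpoint that does this server-side, in which case you can keep the single-proxy setup shown earlier.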
Check out this list of the best web scraping proxies to choose the right fit for your project. And if you're on a budget, pick an option from our list of cheap residential proxies.
Method #2: Use a Web Scraping API
The best web scraping APIs automatically implement everything you need to emulate natural user behavior, including rotating premium proxies.
ZenRows provides a complete web scraping toolkit and an intuitive interface that enables you to bypass any anti-bot system with a single API call. Some of its features include AI web unblocker, user agent rotator, auto-rotating premium proxies, anti-CAPTCHA, and more.
Additionally, the tool offers headless browser functionality, which allows you to scrape dynamic websites and interact with page elements.
To help you get started, below is a simple step-by-step guide on bypassing an Imperva-protected website using ZenRows.
Sign up, and you'll be directed to the Request Builder page.
Input the target URL and activate Premium Proxies and the JS Rendering mode.
Select any language option on the right, e.g., Python, and choose the API mode. ZenRows will generate your request code.
Copy the code and use your preferred HTTP client to make a request to the ZenRows API. Your script will look like this:
# pip install requests
import requests
url = 'https://www.carehome.co.uk/'
apikey = '<YOUR_ZENROWS_API_KEY>'
params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'premium_proxy': 'true',
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
Run it, and you'll get the page's HTML content.
<html class="no-js" lang="en" xml:lang="en">
  <head>
    <!-- ... -->
    <title>Care Homes & Nursing Homes UK - Care Home Reviews & Nursing Home Reviews</title>
    <!-- ... -->
  </head>
  <body>
    <!-- ... -->
    <h1>Reviews for Care Homes, Residential Homes & Nursing Homes</h1>
    <!-- ... -->
  </body>
</html>
And done! Bypassing Imperva/Incapsula's 403 Forbidden error is that easy with ZenRows.
Conclusion
Solving the Incapsula 403 error requires completely emulating natural user behavior. As Imperva's security system continuously evolves, manually setting up a script to mimic a human user can be quite a hassle.
Both techniques discussed in this article can help you streamline your web scraping. However, only web scraping APIs, such as ZenRows, can bypass the 403 Forbidden error in every case. As a top scraper API, it handles everything behind the scenes and requires no manual configuration.