Are you frustrated by running into the PerimeterX 403 Forbidden error while web scraping?
This common error prevents you from retrieving content from the target web page and, in the worst case, can even earn you a permanent block.
But there are ways to bypass PerimeterX altogether. In this article, you’ll see four practical techniques that will help you avoid the 403 forbidden error, along with a step-by-step guide on implementing them. Let's go!
What Is PerimeterX 403?
The PerimeterX 403 error is a variant of the standard 403 Forbidden HTTP status code, a common response to requests the server refuses to fulfill.
This error occurs when the PerimeterX anti-bot system flags your request as suspicious. It then displays an error message indicating that it understood your request but refused to fulfill it.
This is what the error looks like in your console or terminal.
HTTPError: 403 Client Error: Forbidden for url: https://www.ssense.com/en-ca
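For context, a bare Python Requests call like the following minimal sketch is typically enough to trigger it on a PerimeterX-protected page (the same target URL is used later in this article):
# minimal sketch: an unmodified request to a PerimeterX-protected page
import requests

response = requests.get('https://www.ssense.com/en-ca')
# raise_for_status() converts the 403 response into the HTTPError shown above
response.raise_for_status()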
And if you're scraping with a browser automation tool such as Selenium, your response may be similar to the image below:
How to Bypass 403 Forbidden in PerimeterX?
Let's explore four actionable techniques to avoid the 403 Forbidden error when scraping a PerimeterX-protected web page.
Method #1: Use a Web Scraping API
The only surefire way to bypass the PerimeterX 403 error is to mimic natural browsing behavior. Web scraping APIs offer the easiest and most effective solution to achieve this.
A web scraping API such as ZenRows handles all the technicalities needed to emulate human behavior under the hood, allowing you to bypass any anti-bot system with a single API call.
With features like JavaScript rendering, premium proxies, User Agent rotation, anti-CAPTCHAs, and more, ZenRows provides everything you need to scrape any web page without getting blocked.
Let's take a look at how to use ZenRows to bypass a PerimeterX-protected website (https://www.ssense.com/en-ca).
Sign up for free, and you'll be redirected to the Request Builder page:
Input the target URL (in this case, https://www.ssense.com/en-ca), and activate Premium Proxies and the JS rendering mode.
That'll generate your request code on the right. Copy it, and use your preferred HTTP client. In this example, let's go with Python Requests, which you can install using the following command:
pip install requests
Your code should be similar to this:
# import the necessary module
import requests

# target page and ZenRows credentials
url = 'https://www.ssense.com/en-ca'
apikey = '<YOUR_ZENROWS_API_KEY>'

# enable JavaScript rendering and premium proxies for the ZenRows request
params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'premium_proxy': 'true',
}

# send the request through the ZenRows API and print the returned HTML
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
Run it, and you'll get the HTML content of your target web page:
<html lang="en-ca">
<head>
<meta name="language" content="en">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Luxury fashion & independent designers | SSENSE Canada</title>
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=no,minimal-ui">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="author" content="SSENSE">
<meta name="description" content="Shop from 500+ luxury labels, emerging designers and streetwear brands for both men and women. Gucci, Off-White, Acne Studios, and more. Shipping globally." data-vue-meta="1">
<!-- ... -->
</head>
Congratulations, you've just retrieved content from a PerimeterX-protected website with no blocks in sight.
Method #2: Get Premium Proxies
The PerimeterX 403 Forbidden error can also stem from IP-based restrictions, which limit or block access based on the requesting IP address.
In such cases, proxies can help you fly under the radar. Proxies act as intermediaries between your scraper and the target server, and they provide anonymity by routing your requests through a different server. Your request will seem to originate from a different region or an actual user, especially if you're using residential premium proxies (proxies assigned to real devices or users).
Use a headless browser, such as Playwright, to fully leverage the benefits of proxies. This way, you can emulate natural user behavior while avoiding IP-based restrictions.
Setting a Playwright proxy involves passing your proxy details as a parameter when launching the browser. Below is a sample code snippet showing how to set a proxy in Playwright:
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as playwright:
        # launch Chromium and route all traffic through the proxy server
        browser = await playwright.chromium.launch(
            proxy={
                'server': '181.129.43.3:8080',
            },
        )
        context = await browser.new_context()
        page = await context.new_page()
        # visit a test endpoint that echoes the requesting IP address
        await page.goto('https://httpbin.org/ip')
        html_content = await page.content()
        print(html_content)
        await context.close()
        await browser.close()

asyncio.run(main())
The code above uses a proxy from the Free Proxy List. However, free proxies are only useful for tutorials and testing since they're unreliable, short-lived, and don't stand a chance against powerful anti-bot solutions such as PerimeterX.
This is why it's best to use high-quality premium proxies that provide auto-rotation. They'll help you avoid rate limiting and IP bans and guarantee a stable performance. If you'd like to choose the best proxies for your project, take a look at this ranking of premium web scraping proxies.
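As an illustration, here's a minimal sketch of plugging an authenticated premium proxy into the same Playwright launch call; the proxy endpoint and credentials below are placeholders, so substitute your own provider's details:
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as playwright:
        # placeholder premium proxy endpoint and credentials
        browser = await playwright.chromium.launch(
            proxy={
                'server': 'http://proxy.example.com:8080',
                'username': '<YOUR_PROXY_USERNAME>',
                'password': '<YOUR_PROXY_PASSWORD>',
            },
        )
        page = await browser.new_page()
        # the echoed IP should belong to the proxy, not your machine
        await page.goto('https://httpbin.org/ip')
        print(await page.content())
        await browser.close()

asyncio.run(main())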
Method #3: Use Headless Browsers With Anti-Bot Plugins
Headless browsers are helpful for simulating natural user behavior. They're especially valuable when scraping websites that rely on JavaScript to display content. Some of the most popular ones are Selenium and Puppeteer.
However, the ability to render JavaScript like an actual browser usually isn't enough to overcome the PerimeterX 403 error. This is because some headless browser properties (for example, navigator.webdriver) are easily detectable by anti-bot systems.
Fortunately, you can use special “stealth” plugins to mask these properties and ultimately increase your chances of success. You'll find such plugins for all the most popular headless browsers, for example:
- For Selenium, there's Undetected Chromedriver, which patches Selenium's automation properties, making it harder for websites to detect (see the sketch after this list).
- For Puppeteer, Puppeteer Stealth uses various evasion modules to hide the automation properties. Each evasion module plugs a particular leak.
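For instance, below is a minimal sketch of the Undetected Chromedriver approach in Python, assuming you've installed the undetected-chromedriver package (pip install undetected-chromedriver):
# minimal sketch using the undetected-chromedriver package
import undetected_chromedriver as uc

# launch a patched Chrome instance with the usual automation leaks masked
driver = uc.Chrome()
driver.get('https://www.ssense.com/en-ca')
# print the rendered HTML, then close the browser
print(driver.page_source)
driver.quit()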
Still, since the detection of automation properties is only one of many techniques PerimeterX uses to detect bots, this method alone may be insufficient.
Method #4: Customize Your User Agent
HTTP headers are metadata sent along with every HTTP request, used by websites to tailor responses.
Headless browsers' HTTP headers are quite different from those of a regular browser, which makes them easy for anti-bot systems to detect.
See for yourself. Here's what Selenium's default request headers look like:
"headers": {
"Accept": [
"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"
],
"Accept-Encoding": [
"gzip, deflate, br"
],
"Connection": [
"keep-alive"
],
"Host": [
"httpbin.io"
],
"Sec-Ch-Ua": [
"\"Chromium\";v=\"122\", \"Not(A:Brand\";v=\"24\", \"HeadlessChrome\";v=\"122\""
],
"Sec-Ch-Ua-Mobile": [
"?0"
],
"Sec-Ch-Ua-Platform": [
"\"Windows\""
],
"Sec-Fetch-Dest": [
"document"
],
"Sec-Fetch-Mode": [
"navigate"
],
"Sec-Fetch-Site": [
"none"
],
"Sec-Fetch-User": [
"?1"
],
"Upgrade-Insecure-Requests": [
"1"
],
"User-Agent": [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/122.0.6261.95 Safari/537.36"
]
}
Meanwhile, a regular browser's default headers look something like this:
{
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'accept-language': 'en-US,en;q=0.9',
'cache-control': 'max-age=0',
'cookie': 'prov=4568ad3a-2c02-1686-b062-b26204fd5a6a; usr=p=%5b10%7c15%5d%5b160%7c%3bNewest%3b%5d',
'referer': 'https://www.google.com/',
'sec-ch-ua': '"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'cross-site',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}
Comparing the samples above, Selenium's headers are missing components such as Accept-Language, Referer, and Cookie, and both its Sec-Ch-Ua and User-Agent values openly advertise HeadlessChrome.
The most critical component of these HTTP headers is the User Agent (UA), which informs the target server of the requesting client. If you get identified as a non-browser client, you will most likely receive the PerimeterX 403 error response code. Therefore, customizing your User Agent to mirror that of a regular browser will lower the probability of detection.
One way to ensure your headers are accurate is to visit the target website using a regular browser, copy the headers sent by the browser, and use them as custom headers in your Selenium script.
Additionally, check the order of request headers, as PerimeterX can flag your web scraper based on the default arrangement of non-browser headers.
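As an illustration, here's a minimal sketch of overriding Selenium's default User Agent through Chrome options; the UA string below is just an example copied from the regular-browser sample above and should match a current browser release:
# minimal sketch: override Selenium's default User Agent via Chrome options
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# example UA string for a regular (non-headless) Chrome on Windows
custom_ua = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
)
options = Options()
options.add_argument(f'--user-agent={custom_ua}')

driver = webdriver.Chrome(options=options)
# httpbin.io/headers echoes the headers the server actually received
driver.get('https://httpbin.io/headers')
print(driver.page_source)
driver.quit()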
If you’d like to know more about how to optimize headers for web scraping, check out our guide!
However, keep in mind that setting the right User Agent alone won't always work against PerimeterX. Websites can also detect discrepancies between HTTP headers, so it's essential to keep them consistent. For example, if your User Agent claims to be Chrome on Windows, client-hint headers such as Sec-Ch-Ua and Sec-Ch-Ua-Platform should report the same browser and platform.
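To make that concrete, here's a minimal sketch of a consistent Chrome-on-Windows header set sent with Python Requests, mirroring the regular-browser sample above:
# minimal sketch: a consistent Chrome-on-Windows header set
import requests

headers = {
    'User-Agent': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
    ),
    # the client hints agree with the browser and platform claimed in the UA
    'Sec-Ch-Ua': '"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
    'Sec-Ch-Ua-Mobile': '?0',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Accept-Language': 'en-US,en;q=0.9',
    'Upgrade-Insecure-Requests': '1',
}
response = requests.get('https://httpbin.io/headers', headers=headers)
print(response.text)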
Conclusion
Overcoming the PerimeterX 403 error can be challenging. It often requires using multiple manual solutions at once, which is difficult to set up and maintain.
For web scraping at scale, it’s best to use a web scraping API, such as ZenRows. ZenRows handles all these technicalities automatically, allowing you to bypass any anti-bot system and focus on extracting the desired data. Try ZenRows for free.