Does your scraper keep hitting the Imperva anti-bot screen? Imperva Incapsula is among the most popular anti-scraping measures on the internet, so bypassing it has become necessary to extract data successfully.
We've got you covered! In this guide, you'll learn how to bypass Imperva protection using four different tested and trusted methods:
- Method #1: Use a web scraping API.
- Method #2: Implement fortified headless browsers.
- Method #3: Scrape archived or cached pages.
- Method #4: Use smart proxies to get past Imperva Incapsula.
We'll use Harvey Norman, an Imperva Incapsula-protected website, to show how each method works. But first, let's learn more about the system itself.
What Is Imperva (Incapsula)?
Imperva Incapsula is a web application firewall (WAF) that uses advanced web security measures to protect websites against attacks, such as DDoS, blocking traffic that doesn't seem human.
Unfortunately, that includes all sorts of bots regardless of their intentions. The Imperva firewall acts as an intermediary between your browser/scraper and the target website's server.
Common Imperva Block Page Messages
Imperva typically displays an anti-bot page to block web scraping attempts, similar to other WAFs like Akamai and PerimeterX. If you're scraping with an HTTP client, the block page often comes with an error status code, such as 403 Forbidden. However, you might also get a 200 OK status code since the block page itself is a valid HTML response.
Here are common block markers that indicate you've been blocked by Imperva (you can check for them programmatically, as shown in the sketch after this list):
- `Incapsula incident ID` embedded in an iframe.
- `Powered by Imperva` text returned with a CAPTCHA.
- `x-cdn: Imperva` in the response headers.
- `_Incapsula_Resource` in the script and iframe tags.
- `subject=WAF Block Page` in the response HTML.
- `visid_incap_` and `incap_ses` in the Set-Cookie header field.
- `X-Iinfo` in the response headers.
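As a quick sanity check, here's a minimal sketch (assuming the Requests library and our example target URL) that scans a response for these markers; the helper name `is_imperva_blocked` is only illustrative:

```python
# pip3 install requests
import requests

# strings that commonly appear in Imperva/Incapsula block-page HTML
BLOCK_MARKERS = [
    "Incapsula incident ID",
    "Powered by Imperva",
    "_Incapsula_Resource",
    "subject=WAF Block Page",
]

def is_imperva_blocked(response):
    # header-based markers from the list above
    if response.headers.get("x-cdn", "").lower() == "imperva":
        return True
    if "X-Iinfo" in response.headers:
        return True
    cookies = response.headers.get("Set-Cookie", "")
    if "visid_incap_" in cookies or "incap_ses" in cookies:
        return True
    # body-based markers: known block-page strings
    return any(marker in response.text for marker in BLOCK_MARKERS)

response = requests.get("https://www.harveynorman.com.au/")
print("Blocked by Imperva:", is_imperva_blocked(response))
```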
Bypassing Imperva Incapsula is possible. But first, you need to understand its detection techniques.
How Does Imperva Incapsula Detect Bots?
When a user tries to access an Incapsula-protected website, the WAF receives and analyzes the request before getting the content from the source server. Imperva then returns a trust score based on the results of this analysis.
However, due to advanced bot detection techniques, web scrapers rarely make it past this initial analysis stage. Let's discuss Imperva's detection mechanisms below.
HTTP Request Analysis
Scanning the request headers is one of Imperva's initial detection methods. Header fields, such as the User-Agent, contain information that tells the server whether a client is a human.
The web application firewall (WAF) scans incoming requests against a database of known bot signatures or based on the website's header policies. Any deviation from the expected header values can result in detection and subsequent blocking. Browsers typically send headers in a specific order. If your request header strings deviate from the expected order, it can expose you as a web scraper.
Additionally, the anti-bot checks your HTTP version. Since most modern browsers rely on HTTP/2 or HTTP/3 protocols, using an outdated one like HTTP 1.0 or 1.1 can signal bot-like activity.
To reduce the chances of detection via HTTP analysis, use the recommended request headers for web scraping. Then, use HTTP clients that support HTTP/2+ protocols.
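As an illustration, here's a minimal sketch using httpx, an HTTP client that supports HTTP/2 (installed with the `http2` extra). The header values mimic a recent Chrome release and are only examples; keep them in sync with a current browser:

```python
# pip3 install "httpx[http2]"
import httpx

# browser-like headers; values are examples and should match a current browser
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}

# http2=True negotiates HTTP/2 instead of the default HTTP/1.1
with httpx.Client(http2=True, headers=headers) as client:
    response = client.get("https://www.harveynorman.com.au/")
    print(response.http_version, response.status_code)
```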
IP Fingerprinting
Incapsula collects IP data from website visitors and compares it to a known database of malicious IPs. If your address has a history of hostile attacks or is associated with botnets, it'll gain a poor reputation, and subsequent requests from it will be banned.
The anti-bot also analyzes traffic data, such as the source and request rate and frequency, to identify unnatural user behavior. So, sending multiple requests within a short period or regularly violating rate limits can result in an IP ban, which can be temporary or permanent.
Using proxies to mask your IP address helps you avoid IP-based blocks. However, avoid data center or shared IPs, as they tend to have a poor reputation.
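For illustration, here's a minimal sketch with the Requests library that routes traffic through a single proxy and throttles the request rate. The proxy address is a placeholder to replace with your provider's credentials:

```python
# pip3 install requests
import random
import time

import requests

# placeholder proxy endpoint: replace with credentials from your provider
proxy = "http://<USERNAME>:<PASSWORD>@<PROXY_HOST>:<PROXY_PORT>"
proxies = {"http": proxy, "https": proxy}

for _ in range(3):
    # the target server sees the proxy's IP instead of yours
    response = requests.get("https://www.harveynorman.com.au/", proxies=proxies)
    print(response.status_code)
    # random pauses keep the request rate below typical limits
    time.sleep(random.uniform(2, 5))
```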
Behavior-Based Detection Techniques
Behavior-based detection methods involve behavioral analysis performed on the server and client sides.
The server-side behavioral analysis approach involves page navigation checks to monitor page interaction timing, patterns, and frequency. The client-side method checks browser/client-based user interactions, such as mouse clicks and movements, keyboard inputs, scrolling patterns, etc.
Imperva collects this behavioral data in real time using obfuscated JavaScript challenges injected into the page and analyzes it on its servers. Once it spots unusual behavior patterns, it blocks the request.
You can reduce behavioral detection using headless browser automation tools like Selenium, Playwright, or Puppeteer.
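For example, here's a minimal Playwright sketch that adds human-like mouse movement and incremental scrolling with random pauses. It's only an illustration of the idea; on its own, it won't defeat Imperva's client-side checks:

```python
# pip3 install playwright
# playwright install
import asyncio
import random

from playwright.async_api import async_playwright

async def human_like_visit():
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://www.harveynorman.com.au/")
        # glide the mouse through a few random points instead of jumping
        for _ in range(5):
            await page.mouse.move(
                random.randint(0, 800), random.randint(0, 600), steps=20
            )
            await asyncio.sleep(random.uniform(0.3, 1.0))
        # scroll in small increments with pauses, like a human reader
        for _ in range(3):
            await page.mouse.wheel(0, random.randint(300, 700))
            await asyncio.sleep(random.uniform(0.5, 1.5))
        await browser.close()

asyncio.run(human_like_visit())
```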
Browser Fingerprinting
Imperva also uses browser fingerprinting as part of its detection techniques to create a unique fingerprint for each client by collecting specific information. Information gathered includes operating system type and version, browser type, vendor, installed plugins, language, hardware concurrency, screen resolution, etc.
Clients typically present slight differences in their fingerprints, which makes each unique. Imperva leverages the differences between these data points to identify each client and fingerprint them for subsequent requests.
The security system further scans each fingerprint against a database of known fingerprints, including those of known bots. If your web scraper has fingerprint traits similar to those of known bots, Imperva will block you.
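To get a feel for the data involved, here's a minimal sketch that uses Playwright to read a few of the attributes fingerprinting scripts typically collect, showing what your own headless browser exposes (note the `navigator.webdriver` flag, a classic automation giveaway):

```python
# pip3 install playwright
# playwright install
import asyncio

from playwright.async_api import async_playwright

async def inspect_fingerprint():
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch()
        page = await browser.new_page()
        # read attributes commonly combined into a browser fingerprint
        fingerprint = await page.evaluate(
            """() => ({
                userAgent: navigator.userAgent,
                platform: navigator.platform,
                language: navigator.language,
                hardwareConcurrency: navigator.hardwareConcurrency,
                plugins: navigator.plugins.length,
                webdriver: navigator.webdriver,
                screen: `${screen.width}x${screen.height}`,
            })"""
        )
        print(fingerprint)
        await browser.close()

asyncio.run(inspect_fingerprint())
```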
TLS Fingerprinting
TLS (Transport Layer Security) fingerprinting is another detection technique that Imperva uses to analyze and fingerprint server-client communication. TLS fingerprinting starts with a TLS handshake, where the client sends a "Client Hello" message to the server.
During the "Client Hello" phase, the client provides supported parameters, including the TLS version, cipher suites, extensions, digital signatures, etc.
Imperva uses the details in the "Client Hello" message to generate a hash or fingerprint. This fingerprint can then be matched against a database of known fingerprints to identify the client type or detect unusual patterns.
TLS fingerprinting is more advanced than browser fingerprinting. For instance, even if you spoof HTTP headers like the User-Agent to mimic a real browser, the underlying TLS fingerprint often remains unchanged unless explicitly configured using custom TLS bypass libraries.
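One way to address this is a client that mimics a real browser's TLS handshake. Here's a minimal sketch using the curl_cffi library; the `impersonate` target name is an example, and the available targets depend on the installed version:

```python
# pip3 install curl_cffi
from curl_cffi import requests

# impersonate makes the TLS Client Hello (cipher suites, extensions, etc.)
# resemble a real Chrome build instead of a default Python client
response = requests.get(
    "https://www.harveynorman.com.au/",
    impersonate="chrome",
)
print(response.status_code)
```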
You now know how Incapsula detects your scraper. Let's see the four ways to bypass it.
Method #1: Use a Web Scraping API for Incapsula Bypass
Using a web scraping API is the easiest and most effective way to bypass Imperva Incapsula. It handles the technical aspects of emulating natural user behavior with proxy rotation, JavaScript rendering, and anti-bot auto-bypass features.
ZenRows is one of the top web scraping APIs for extracting data from any website, regardless of the security level or your project's scale. You only need to make a single API call using any programming language, and ZenRows will help you bypass Incapsula Imperva.
Let's see how ZenRows works by scraping an Incapsula-protected website like Harvey Norman.
Sign up to open the ZenRows Request Builder. Input your target URL in the link box and activate Premium Proxies and JS Rendering. Select your programming language (Python, in this case) and choose the API connection mode.
Copy and paste the generated code into your scraper file.
Since we've selected Python, you'll need to install the Requests library using pip:
pip3 install requests
The generated Python code should look like this:
# pip install requests
import requests
url = "https://www.harveynorman.com.au/"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
"url": url,
"apikey": apikey,
"js_render": "true",
"premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)
Here's the response, showing the website's title with omitted content:
<html>
<head>
<!-- ... -->
<title>Computers, Electrical, Furniture & Bedding | Harvey Norman</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
</body>
</html>
Perfect! You just bypassed Imperva Incapsula using ZenRows.
Still, it's worth exploring the other methods to get the full picture, as each has its own strengths and limitations.
Method #2: Implement Fortified Headless Browsers
This method is suitable if your scraping task involves complex automation requirements, such as interacting with dynamic page elements, and you want to use headless browser automation tools.
Here's the thing: base headless browsers can render JavaScript and emulate user behavior, but they can't bypass anti-bot measures on their own without fortification.
Open-source fortified headless browsers, such as Playwright Stealth, are available. Although they patch some detectable bot-like characteristics, they still leak details that expose automation and are unreliable, especially when dealing with sophisticated anti-bots like Incapsula.
For example, the previous Incapsula-protected website (Harvey Norman) blocks Playwright despite adding the stealth plugin.
To try it yourself, install Playwright and its stealth plugin. Then download its browser binaries:
pip3 install playwright playwright-stealth
playwright install
Now, import those libraries and try accessing the protected page with the following code that screenshots the home page:
# pip3 install playwright playwright-stealth
# playwright install
import asyncio
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async def scraper():
    # launch the Playwright instance
    async with async_playwright() as playwright:
        # launch the browser
        browser = await playwright.chromium.launch()
        # create a new page
        page = await browser.new_page()
        # apply stealth to the page
        await stealth_async(page)
        # navigate to the desired URL
        await page.goto("https://www.harveynorman.com.au/")
        # wait for any dynamic content to load
        await page.wait_for_load_state("networkidle")
        # take a screenshot of the page
        await page.screenshot(path="screenshot.png")
        # close the browser
        await browser.close()

# run the main function
asyncio.run(scraper())
The scraper got blocked with the following Incapsula protection page:
Playwright Stealth got blocked because its patches are incomplete, leaving gaps in the browser fingerprint. How can we fortify Playwright better to bypass Imperva during web scraping?
That's where the ZenRows Scraping Browser comes in handy. It fortifies your Playwright scraper with essential browser fingerprints and pre-integrated residential proxies, significantly increasing your chances of bypassing the Incapsula anti-bot. It's also highly scalable, running the browser instance in the cloud without extra memory usage from your local machine.
To use it, sign up to load the ZenRows Request Builder. Then, go to the Scraping Browser dashboard and copy your Browser URL.
Connect Playwright's Chromium over the Chrome DevTools Protocol (CDP) using the copied browser connection URL. Then, screenshot the home page after opening the target URL. Here's the updated Playwright scraper:
# pip3 install playwright
# playwright install
import asyncio
from playwright.async_api import async_playwright

async def main():
    # launch the Playwright instance
    async with async_playwright() as p:
        # set the connection URL
        connection_url = "wss://browser.zenrows.com?apikey=<YOUR_ZENROWS_API_KEY>"
        # launch the browser with the connection URL
        browser = await p.chromium.connect_over_cdp(connection_url)
        # create a new page
        page = await browser.new_page()
        # navigate to the desired URL
        await page.goto("https://www.harveynorman.com.au/")
        # wait for any dynamic content to load
        await page.wait_for_load_state("networkidle")
        # take a screenshot of the page
        await page.screenshot(path="screenshot.png")
        # close the browser
        await browser.close()

# run the main function
asyncio.run(main())
The ZenRows-fortified Playwright scraper accesses the protected page as shown in the screenshot below:
That works! Let's move to the other techniques.
Method #3: Scrape Archived or Cached Pages
Anti-bot systems like Imperva Incapsula are typically triggered in real-time. However, you can bypass the protection altogether by scraping your target's archived version, which doesn't have the anti-bot measure.
Although Google Cache has stopped offering cache services, you can still access snapshot versions of websites via the Internet Archive's Wayback Machine. This service stores snapshots of various pages taken on different days and times.
Selecting any of those snapshots brings up a previously accessed page that doesn't open directly through the Incapsula Imperva content delivery network (CDN).
For instance, to scrape the previous target page archive, open the Internet Archive. Then, enter the target URL into the search bar at the top and hit Enter.
You'll see snapshots of different dates highlighted in colored dots. Hover over any of them to load the snapshot times for that day. Select the most recent snapshot date and time to reduce the chance of getting outdated data. Click a snapshot period from the options to load the target website's archive.
The loaded archive returns a snapshot of the protected website, as shown:
Once the above archive loads, copy the snapshot URL from the address bar. Open that URL and extract its data with your scraper. The URL looks something like this:
https://web.archive.org/web/20240920195434/https://www.harveynorman.com.au/
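You can also locate the most recent snapshot programmatically via the Internet Archive's availability API and then scrape it. Here's a minimal sketch with the Requests library:

```python
# pip3 install requests
import requests

target = "https://www.harveynorman.com.au/"

# ask the Wayback Machine availability API for the closest snapshot
data = requests.get(
    "https://archive.org/wayback/available", params={"url": target}
).json()
snapshot = data.get("archived_snapshots", {}).get("closest")

if snapshot:
    # fetch the archived copy instead of the live, Imperva-protected page
    archived = requests.get(snapshot["url"])
    print(snapshot["url"], archived.status_code)
else:
    print("No snapshot available for this URL")
```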
While the above method works sometimes, one limitation is that you might end up with outdated data if the website's content has changed since the last snapshot. The archive website may also implement an anti-bot measure to block your scraper from accessing snapshots.
Another way to bypass Incapsula is to use a smart proxy.
Method #4: Use Smart Proxies to Get Past Incapsula Imperva
Some websites only trigger the Imperva anti-bot if the request comes from a geo-restricted IP, a suspicious one, or when an IP exceeds the permissible request limit.
A proxy routes your request through another IP, making it appear as if it's from a different location or machine. You can use free or premium proxies for web scraping. However, free ones have a short lifespan and are unreliable.
The most reliable proxies for web scraping are premium residential ones. These proxies distribute traffic over a pool of IPs assigned to daily internet users by network providers.
This IP distribution model lets you mimic different users and reduces the chance of hitting an IP-triggered Incapsula anti-bot during web scraping. Read our guide on the best proxy providers for web scraping to see a list of top options.
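As an illustration, here's a minimal sketch that rotates requests through a small pool of residential proxies. The endpoints are placeholders; many premium providers also offer a single rotating gateway that handles rotation for you:

```python
# pip3 install requests
import itertools

import requests

# placeholder residential proxies: replace with endpoints from your provider
proxy_pool = itertools.cycle([
    "http://<USER>:<PASS>@<RESIDENTIAL_PROXY_1>:<PORT>",
    "http://<USER>:<PASS>@<RESIDENTIAL_PROXY_2>:<PORT>",
    "http://<USER>:<PASS>@<RESIDENTIAL_PROXY_3>:<PORT>",
])

for _ in range(3):
    proxy = next(proxy_pool)
    # each request exits through a different residential IP
    response = requests.get(
        "https://www.harveynorman.com.au/",
        proxies={"http": proxy, "https": proxy},
    )
    print(response.status_code)
```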
The limitation of using only proxies is that you can still get blocked by advanced anti-bot measures, especially those using multiple bot detection techniques beyond IP reputation. You need extra measures to bypass anti-bots.
Conclusion
This step-by-step guide showed you how Imperva Incapsula works and how to bypass it using four approaches:
- Use a web scraping API: The most reliable method to bypass the Incapsula anti-bot page.
- Implement a fortified headless browser: Recommended if your scraping task requires complex automation.
- Scrape the target website's archive: Retrieves content snapshots, which may result in outdated data.
- Integrate smart proxies: Helps avoid IP-triggered Incapsula CAPTCHAs but doesn't work against advanced Incapsula implementations.
ZenRows, an all-in-one web scraping solution, is the most reliable way to bypass Imperva Incapsula at scale. It offers many benefits and features, including anti-bot bypass, JavaScript rendering, proxy rotation, super-fortified scraping browsers with advanced fingerprint management, and more.
Try ZenRows for free now without a credit card!