You found a website you want to scrape, coded your scraper, and ran it, only to realize PerimeterX has blocked you. You're not alone in this struggle!
PerimeterX (now called HUMAN) uses sophisticated server and client-side techniques to detect and block bots like your web scraper. However, you'll be able to bypass PerimeterX and retrieve the data you need by following the methods outlined in this article.
Here are some of the approaches that'll get you through:
- Method #1: Use a Web Scraping API for PerimeterX Bypass.
- Method #2: Use Fortified Headless Browsers.
- Method #3: Employ Smart Proxies to Bypass PerimeterX.
- Method #4: Reverse Engineer the PerimeterX JavaScript Challenge.
Let's get started!
What is PerimeterX?
PerimeterX, now known as HUMAN, is a sophisticated Web Application Firewall (WAF) designed to detect automated threats and bot activity, such as web scraping attempts. It employs advanced algorithms such as behavioral analysis and machine learning techniques to distinguish between legitimate human users and bots.
When PerimeterX detects suspicious activity, it typically responds by blocking access to the website with a PerimeterX 403 error or a "Press & Hold" CAPTCHA.
If you encounter such a block page, it means PerimeterX has detected behavior that doesn't match typical human patterns. To overcome the "Press & Hold to confirm you are a human" obstacle, we first need to understand how PerimeterX detects bots.
How Does PerimeterX Detect Bots
PerimeterX employs a multi-layered approach to identify and block bot traffic. From analyzing network patterns and HTTP headers to evaluating browser fingerprints and genuine user behavior, it utilizes a complex web of detection mechanisms to catch automated access attempts.
The list below covers the most aggressive defenses PerimeterX deploys. We'll explain how each works and then focus on ways to overcome them.
1. IP Filtering
Security companies like PerimeterX usually have vast lists of IPs known to be used by bots. They monitor IPs that belong to data centers, proxies, or VPN providers. Like other WAFs such as DataDome, PerimeterX assigns a score or reputation to each IP that tries to access the protected website. If the IP your bot is using has a bad reputation, you will probably get blocked.
While IP monitoring and filtering are powerful tools in PerimeterX's arsenal, it's worth noting that sophisticated bots often use techniques like IP rotation and residential proxies to circumvent this defense. As a result, PerimeterX typically combines IP filtering with other detection methods to provide a more robust defense against automated access.
2. Checking HTTP Headers
One of the simplest yet effective methods PerimeterX uses to detect bots is by examining HTTP headers. Many automated tools and libraries, such as Python's Requests or Node.js's Axios, often send requests with a set of headers that differ from those typically sent by web browsers.
PerimeterX's system is trained to recognize these patterns and can quickly identify requests that don't match expected browser behavior. To bypass this check, it's crucial to ensure that your requests include a full set of headers that closely mimic those of a real browser.
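As a rough illustration of the header check, the sketch below builds a request that carries a browser-like header set using only Python's standard library. The header values are illustrative assumptions modeled on the general shape of a Chrome request, not a list PerimeterX is known to require, and no request is actually sent.

```python
from urllib.request import Request

# Illustrative, browser-like headers (assumed values, not a verified allowlist).
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
}


def build_browser_like_request(url):
    """Return an unsent request object that carries the browser-like headers."""
    return Request(url, headers=BROWSER_HEADERS)


request = build_browser_like_request("https://www.zillow.com/")
for name, value in request.header_items():
    print(f"{name}: {value}")
```

Headers alone rarely suffice, since the other checks described in this section still apply, but a missing or default header set is one of the easiest signals to remove.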
3. CAPTCHAs and Behavioral Analysis
PerimeterX takes pride in its advanced machine-learning algorithms that analyze user behavior patterns to distinguish between human and bot activity. This sophisticated approach goes beyond simple header checks and examines how visitors interact with a website over time.
It employs its own CAPTCHA system, called HUMAN Challenge, as part of its multi-layered defense strategy. This proprietary CAPTCHA serves as a barrier against automated access and allows PerimeterX to gain deep insights into user behavior. When a user's activity is deemed suspicious, the HUMAN Challenge is presented.
PerimeterX's behavioral analysis is comprehensive and examines various aspects of user interaction. It analyzes request and navigation patterns, mouse movements, session duration, page interactions, the timing of site accesses, and more. To bypass advanced behavioral checks like the HUMAN Challenge, it's crucial to mimic human-like behavior as closely as possible.
4. Fingerprinting
PerimeterX employs sophisticated fingerprinting techniques to create unique identifiers for visitors. This process combines various data points to form a distinct "digital signature" for each user or bot, allowing PerimeterX to recognize patterns across different sessions.
The fingerprinting process examines a wide array of factors. These include browser and device characteristics, screen resolution and color depth, installed plugins and fonts, WebGL rendering information, JavaScript execution environment details, and more. Even subtle differences in how a browser processes and renders content can contribute to this unique fingerprint.
PerimeterX uses these fingerprints to build profiles of typical human users and flag anomalies that indicate bot activity. To effectively bypass PerimeterX's fingerprinting, you'll need to present a coherent and realistic digital signature. This involves ensuring that your fingerprint is both unique and indistinguishable from that of a genuine user.
Now that we understand how PerimeterX detects bots, let's explore various working methods to bypass these defenses.
Method #1: Use a Web Scraping API for PerimeterX Bypass
A web scraping API like ZenRows is the most effective and reliable way to bypass PerimeterX. ZenRows continually adapts to PerimeterX's updates, takes care of all the technical complexities involved in mimicking natural browsing behavior, and provides numerous features out of the box, including auto-rotating User Agents, premium proxies, anti-CAPTCHA, and more.
Its advanced headless browser functionality enables you to render dynamic pages and interact with web elements just as a regular browser. All these capabilities, combined with its scalability and ease of use, make ZenRows the most effective option for bypassing PerimeterX and conducting web scraping operations at any scale.
Let's put ZenRows to the test and scrape Zillow, a PerimeterX-protected website.
Sign up for free to get your API key. You'll be redirected to the Request Builder page.
Input the target website URL and activate Premium Proxies and the JS Rendering mode. Select your preferred programming language (e.g., Python) on the right and choose the API mode.
That'll generate your request script on the right. Copy it, and use your preferred HTTP client. This example uses Python's Requests library, which you can install using the following command:
pip install requests
Your script will look like this:
# pip install requests
import requests
url = "https://www.zillow.com/"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
"url": url,
"apikey": apikey,
"js_render": "true",
"premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)
Run it, and you'll get the HTML content of your target web page:
<html lang="en">
<head>
<!-- ... -->
<title>Zillow: Real Estate, Apartments, Mortgages & Home Values</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<h1>
Agents. Tours. Loans. Homes.
</h1>
<!-- other content omitted for brevity -->
</body>
</html>
That's how easy it is to bypass PerimeterX's advanced anti-bot protection using ZenRows.
Method #2: Use Fortified Headless Browsers
While headless browsers were initially designed for testing, they've evolved into essential web scraping tools. They're capable of executing JavaScript, rendering dynamic content, and interacting with web elements. However, they possess automation traits that make them easily identifiable by anti-bot systems like PerimeterX. A common one is the navigator.webdriver property, for instance.
But, by using fortified versions of these browsers, you can significantly improve your chances of bypassing PerimeterX's detection mechanisms. Fortified headless browsers are modified versions of standard headless browsers that aim to mimic real browser behavior more closely. They typically include additional features and modifications that help mask their automated nature and provide more consistent browser fingerprints. Some popular options include:
- Undetected ChromeDriver and SeleniumBase for Selenium.
- Puppeteer Extra Plugin Stealth for Puppeteer.
- Playwright Stealth for Playwright.
While fortified headless browsers can be effective, it's important to note that they're not foolproof. PerimeterX continually updates its detection methods, and being open-source tools, these browsers struggle to keep pace with the latest anti-bot techniques.
Additionally, they consume significant CPU, memory, and bandwidth resources. So, headless browsing inevitably increases scraping costs and can cause performance issues at scale.
Method #3: Employ Smart Proxies to Bypass PerimeterX
Smart proxies offer a powerful solution for bypassing PerimeterX. Unlike traditional proxies that simply mask your IP address, smart proxies incorporate advanced features designed to overcome detection methods employed by anti-bot systems like PerimeterX.
These proxies automatically route and rotate your requests through a diverse network of residential and mobile IP addresses. This approach makes your traffic appear to originate from genuine internet users spread across various locations rather than from easily identifiable data centers. By mimicking natural user behavior, smart proxies significantly reduce the risk of triggering PerimeterX's IP-based detection mechanisms.
However, while smart proxies provide a strong foundation for avoiding detection, they are most effective when used as part of a comprehensive anti-detection strategy.
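To make the rotation idea concrete, here is a minimal client-side sketch of round-robin proxy rotation. A smart proxy service normally does this for you behind a single endpoint; the proxy URLs below are hypothetical placeholders, and the yielded mapping simply follows the shape HTTP clients such as Requests accept for their proxies argument.

```python
from itertools import cycle

# Hypothetical proxy endpoints; replace them with your provider's addresses.
PROXY_POOL = [
    "http://user:pass@proxy-1.example.com:8000",
    "http://user:pass@proxy-2.example.com:8000",
    "http://user:pass@proxy-3.example.com:8000",
]


def proxy_rotation(pool):
    """Yield proxy mappings in round-robin order, one per outgoing request."""
    for proxy_url in cycle(pool):
        yield {"http": proxy_url, "https": proxy_url}


rotation = proxy_rotation(PROXY_POOL)
first_four = [next(rotation)["https"] for _ in range(4)]
print(first_four)  # the fourth entry wraps around to the first proxy
```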
Learn more in our guide about the best web scraping proxies to choose the best one for your project!
Method #4: Reverse Engineer the PerimeterX JavaScript Challenge
One way to bypass the PerimeterX Bot Defender is to reverse engineer its checks and challenges. These are the steps:
- Analyze the network logs.
- Deobfuscate the PerimeterX JavaScript challenge script.
- Analyze the deobfuscated script and the subsequent checks.
How to Create a PerimeterX Bypass
It's essential to understand the firewall internals to reverse engineer it. We'll use mainly JavaScript, but the techniques in this tutorial will allow you to code your PerimeterX bypass in Python or any other language.
In our example, we'll analyze the anti-bot implementation on SSENSE. This target website is a good example because many e-commerce sites use PerimeterX.
Ready?
Step 1: Analyze the Network Log
First, open up the developer tools for the web browser of your choice and switch to the "Network" tab.
Next, leave the developer tools open and navigate to SSENSE. As the page loads, you'll notice many requests appearing in the Network log. The important ones to take note of, in chronological order, are as follows:
An initial GET request to https://www.ssense.com/en-ca. Looking at the response, you'll see a Set-Cookie header for _pxhd. This is an important cookie: it acts as a session indicator and will be used in future requests. Your PerimeterX bypass will need some data from this cookie to calculate the correct values that will be sent for validation to the server.
Also check that the response body's HTML contains a <script> tag, which fetches the PerimeterX challenge script:
<script type="text/javascript">
(function () {
window._pxAppId = "PX58Asv359";
if (window._pxAppId) {
var p = document.getElementsByTagName("script")[0],
s = document.createElement("script");
s.async = 1;
s.src = "/" + window._pxAppId.substring(2) + "/init.js";
p.parentNode.insertBefore(s, p);
}
})();
</script>
A GET request to /<_pxAppId>/init.js (where <_pxAppId> is the value of window._pxAppId). This returns the script PerimeterX uses for client-side bot detection. It's obfuscated and minified, so you won't be able to understand much for now.
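The loader snippet above builds the script path from window._pxAppId with the first two characters removed. A small sketch of the same derivation in Python follows; the regular expression and the function names are assumptions made for illustration and only cover the inline-script layout shown here.

```python
import re


def extract_app_id(html):
    """Pull the value assigned to window._pxAppId out of an HTML page, if present."""
    match = re.search(r'window\._pxAppId\s*=\s*"([^"]+)"', html)
    return match.group(1) if match else None


def init_script_path(app_id):
    """Mirror s.src = "/" + window._pxAppId.substring(2) + "/init.js"."""
    return "/" + app_id[2:] + "/init.js"


sample_html = '<script>window._pxAppId = "PX58Asv359";</script>'
app_id = extract_app_id(sample_html)
print(app_id, init_script_path(app_id))
```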
Then, a POST request to /<_pxAppId>/xhr/api/v2/collector happens. The request payload is a string with content-type application/x-www-form-urlencoded, and contains the following data:
- <payload> is an encrypted and Base64 encoded string.
- <appId> is the previously defined value of window._pxAppId.
- <tag> is a version tag (static per site), ex. v8.0.2.
- <uuid> is a randomly generated UUID, ex. 4420aff0-351d-11ed-95d0-c137f4896ca9.
- <ft> is an integer (static per site), ex. 278.
- <seq> has the value 0.
- <en> has the value NTA.
- <pc> is an integer, ex. 3195683956001701.
- <pxhd> is the value of the _pxhd cookie.
- <rsc> has the value 1.
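The field list above can be turned into a form-encoded body with the standard library. This is a sketch only: the payload and pxhd values are placeholders, the field order follows the list, and whether the real script percent-encodes values the same way is an assumption.

```python
from urllib.parse import parse_qs, urlencode


def build_collector_body(payload, app_id, tag, uuid, ft, seq, pc, pxhd, rsc):
    """Serialize the collector fields as application/x-www-form-urlencoded."""
    fields = [
        ("payload", payload),
        ("appId", app_id),
        ("tag", tag),
        ("uuid", uuid),
        ("ft", ft),
        ("seq", seq),
        ("en", "NTA"),
        ("pc", pc),
        ("pxhd", pxhd),
        ("rsc", rsc),
    ]
    return urlencode(fields)


body = build_collector_body(
    payload="<encrypted-payload>",  # placeholder, produced by the cipher
    app_id="PX58Asv359",
    tag="v8.0.2",
    uuid="4420aff0-351d-11ed-95d0-c137f4896ca9",
    ft=278,
    seq=0,
    pc=3195683956001701,
    pxhd="<pxhd-cookie-value>",  # placeholder, taken from the _pxhd cookie
    rsc=1,
)
print(body)
```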
The response body is a JSON object with a single top-level field: do. The do field contains an array of strings. The format is as follows:
{
"do": [
"sid|<sid>", // a string, ex. 4415dfc2-351d-11ed-a66d-7275714f5843
"pnf|cu",
"cls|<cls>", // an integer, ex. 85062563435994268828
"sts|<sts>", // a UNIX timestamp in milliseconds, ex. 1663263533114
"wcs|<wcs>", // a string, ex. cchm6ba3onsi8miotj00
"drc|<drc>", // an integer, ex. 4460
"cts|<cts>|true", // a string, ex. 4415e33e-351d-11ed-a66d-7275714f5843
"cs|<cs>", // a SHA-256 hash, ex. dd2d5dc601445d684b2c4249a4c68f300048446afd4fece93c44ae41f62bdda3
"vid|<vid1>|<vid2>|true", // a string and an integer, ex. 43c15b2f-351d-11ed-97ec-797549415148 and 31536000
"sff|cc|60|<sff>" // a base64-encoded string, ex. U2FtZVNpdGU9TGF4Ow==
]
}
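A response in this format is easy to break apart once you know each entry is pipe-delimited. The sketch below maps each entry's first segment to its remaining segments; it assumes every name appears at most once per response, which holds for the example above but is not guaranteed.

```python
def parse_do(response_json):
    """Map each instruction name to the list of its pipe-separated arguments."""
    parsed = {}
    for instruction in response_json["do"]:
        name, *args = instruction.split("|")
        parsed[name] = args
    return parsed


sample_response = {
    "do": [
        "sid|4415dfc2-351d-11ed-a66d-7275714f5843",
        "pnf|cu",
        "sts|1663263533114",
        "cts|4415e33e-351d-11ed-a66d-7275714f5843|true",
        "vid|43c15b2f-351d-11ed-97ec-797549415148|31536000|true",
    ]
}

values = parse_do(sample_response)
print(values["sid"][0], values["sts"][0], values["vid"])
```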
And a second POST request to /<_pxAppId>/xhr/api/v2/collector. The payload has the same content-type as before and a similar format with a few added fields:
- <payload> is a much longer, encrypted + Base64 encoded string.
- <appId>, <tag>, <uuid>, <ft> and <pxhd> are the same as the previous request.
- <seq> has the value 1.
- <en> has the value NTA.
- <cs> is a SHA-256 hash, ex. dd2d5dc601445d684b2c4249a4c68f300048446afd4fece93c44ae41f62bdda3.
- <pc> is an integer, ex. 1670315818019117.
- <sid> is a string, ex. 4415dfc2-351d-11ed-a66d-7275714f5843.
- <vid> is a string, ex. 43c15b2f-351d-11ed-97ec-797549415148.
- <cts> is a string, ex. 4415e33e-351d-11ed-a66d-7275714f5843.
- <rsc> has the value 2.
If you take a closer look, you'll see that the cs, sid, vid and cts fields are derived directly from the JSON object returned from the first POST request.
Additionally, the values of seq and rsc have each incremented by 1 relative to the first POST request. This behavior is maintained for all following POST requests too, so we can determine that these fields act as some sort of request counter.
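The relationship between consecutive requests can be sketched as a small helper that carries over the static fields, bumps the two counters, and copies the server-issued values. The function and variable names are hypothetical, and payload and pc are left out on purpose because they have to be recomputed for every request.

```python
def next_collector_fields(previous_fields, do_values):
    """Derive the carried-over and counter fields of the next POST request."""
    fields = dict(previous_fields)
    fields["seq"] = previous_fields["seq"] + 1  # request counter
    fields["rsc"] = previous_fields["rsc"] + 1  # request counter
    for name in ("cs", "sid", "vid", "cts"):
        # Each value is the first argument of the matching "do" instruction.
        fields[name] = do_values[name][0]
    return fields


first_fields = {
    "appId": "PX58Asv359",
    "tag": "v8.0.2",
    "uuid": "4420aff0-351d-11ed-95d0-c137f4896ca9",
    "ft": 278,
    "seq": 0,
    "rsc": 1,
}
do_values = {
    "cs": ["dd2d5dc601445d684b2c4249a4c68f300048446afd4fece93c44ae41f62bdda3"],
    "sid": ["4415dfc2-351d-11ed-a66d-7275714f5843"],
    "vid": ["43c15b2f-351d-11ed-97ec-797549415148", "31536000", "true"],
    "cts": ["4415e33e-351d-11ed-a66d-7275714f5843", "true"],
}

second_fields = next_collector_fields(first_fields, do_values)
print(second_fields["seq"], second_fields["rsc"], second_fields["sid"])
```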
PerimeterX sends another JSON object in the response body, once again containing an array of strings:
{
"do": ["bake|_px2|330|<jwt>|true|300", "pnf|cu"]
// where jwt is a JWT Token, ex. eyJ1IjoiNDQyMGFmZjAtMzUxZC0xMWVkLTk1ZDAtYzEzN2Y0ODk2Y2E5IiwidiI6IjQzYzE1YjJmLTM1MWQtMTFlZC05N2VjLTc5NzU0OTQxNTE0OCIsInQiOjE2NjMyNjM4MzQxMjIsImgiOiIwNzUzZDJhYTU1OWEzZDFhYjM5YjcyOGFmZDA0MDUyYWFlNDQ2MmU1NjMxNjZkNjM4MjM0NjZkNmNjMzIwY2ZlIn0=
}
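The example token above can be inspected directly: it decodes as a single Base64-encoded JSON object rather than a three-part signed token. The sketch below decodes it; reading u as the request UUID and v as the vid is an inference from the matching example values, and the meaning of the remaining keys is not confirmed.

```python
import base64
import json

# The example token from the response above, split only for readability.
EXAMPLE_TOKEN = (
    "eyJ1IjoiNDQyMGFmZjAtMzUxZC0xMWVkLTk1ZDAtYzEzN2Y0ODk2Y2E5Iiwi"
    "diI6IjQzYzE1YjJmLTM1MWQtMTFlZC05N2VjLTc5NzU0OTQxNTE0OCIs"
    "InQiOjE2NjMyNjM4MzQxMjIs"
    "ImgiOiIwNzUzZDJhYTU1OWEzZDFhYjM5YjcyOGFmZDA0MDUyYWFlNDQ2MmU1NjMxNjZkNjM4MjM0NjZkNmNjMzIwY2ZlIn0="
)


def decode_bake_value(token):
    """Base64-decode a token and parse the JSON object inside it."""
    padded = token + "=" * (-len(token) % 4)  # tolerate stripped padding
    return json.loads(base64.b64decode(padded))


decoded_token = decode_bake_value(EXAMPLE_TOKEN)
print(sorted(decoded_token))
print(decoded_token["u"], decoded_token["v"])
```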
You may have noticed that none of the POST requests contains a Set-Cookie response header. Typically, once a browser has passed bot-detection checks, an anti-bot system will set special cookies or headers for use in future requests. Then, once a client makes a request to a protected endpoint, those headers/cookies from the request get validated on the server side.
So, how does this work in the case of PerimeterX? If you make a request to an endpoint protected by PerimeterX, you won't see any unusual headers. You will, however, notice what seems to be some PerimeterX-related cookies. For a cleaner overview, you can view all the cookies on the site and filter by the keyword px.
These are PerimeterX's clearance cookies. They are checked on the server side to determine if a request should be blocked or forwarded to the origin. But remember, there's no record of these cookies being set with the Set-Cookie header in the network log. So, where are they coming from?
You might recognize the cookie names and values from the response bodies of the POST requests. This must mean that the cookies are being set directly through JavaScript, which makes sense considering all the PerimeterX cookies lack an HttpOnly flag.
Depending on the security level of a PerimeterX-protected site, your browser and your device, the behavior of the challenge script and its requests may differ slightly. At the time of writing, SSENSE only requires the two above POST requests to /<_pxAppId>/xhr/api/v2/collector. The second POST yields a _px2 cookie, which is the main clearance cookie that grants unblocked access to a site.
Higher security sites may require additional POST requests to /<_pxAppId>/xhr/api/v2/collector to obtain a _px3 cookie. For those sites, _px3 acts as a required clearance cookie. Don't worry, though, since the techniques we discuss here will also be useful to bypass PerimeterX on sites with a high-security level.
Okay, great job! By analyzing the requests, we learned a lot about how PerimeterX behaves. Unfortunately, we're still missing a lot of information. We still don't know what data is contained in the encrypted payload field, how some other fields are generated, and what client-side bot detection checks the script performs. If you want to bypass PerimeterX, that knowledge is crucial.
If we want to answer those remaining questions, we have no choice but to directly consult the PerimeterX challenge script to figure out exactly how it works.
Step 2: Deobfuscate the PerimeterX JavaScript Challenge
To make the script unreadable to reverse engineers, PerimeterX applies obfuscation to its JavaScript challenge. Here's a non-exhaustive list of the techniques it uses:
String Concealing
This technique replaces all references to string literals with a call to a decoder function. In the case of PerimeterX, strings are either Base64 encoded or additionally encrypted with a simple XOR cipher.
// String concealing example from the PerimeterX script
// Creates an empty lookup cache for use in the decoding function
var o = Object.create(null);
/* ... */
// XOR Decryptor function
// Returns the decoded string.
// This function references some external variables and functions.
// The n() and r() functions are related to recording timestamps, and are irrelevant to the decoding function.
// The i() function is a polyfill function for atob (base64 decoding)
// The o variable is defined earlier in the script as a cache.
function c(t) {
// n() is irrelevant to the decoding
var a = n(),
e = o[t];
if (e) u = e; // Try to look up the decoding string in the cache
else {
// i() is a polyfill function for atob
// Base64 decodes the input string
var c = i(t);
var u = "";
// XOR decryption
for (var f = 0; f < c.length; ++f) {
var A = "dDqXfru".charCodeAt(f % 7);
u += String.fromCharCode(A ^ c.charCodeAt(f));
}
// Store the result in the cache
o[t] = u;
}
return r(a), u; // r(a) is irrelevant to the decoding.
}
/* ... */
// Later on in the script, it's used like this:
c("NBxAaVZGQg"); // => "PX11047"
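Since the decoder is just Base64 followed by a repeating-key XOR, it ports to Python in a few lines. This sketch assumes i() behaves like a standard atob and reuses the key from the snippet above; the key is likely to differ between script builds, so treat it as an example value.

```python
import base64

XOR_KEY = "dDqXfru"  # key taken from the snippet above; may vary per build


def decode_px_string(encoded):
    """Base64-decode the input, then XOR every byte with the repeating key."""
    padded = encoded + "=" * (-len(encoded) % 4)  # the script omits padding
    raw = base64.b64decode(padded)
    return "".join(
        chr(byte ^ ord(XOR_KEY[index % len(XOR_KEY)]))
        for index, byte in enumerate(raw)
    )


print(decode_px_string("NBxAaVZGQg"))  # "PX11047", as in the JavaScript example
```

Running a decoder like this over every concealed string is usually the first pass of a custom deobfuscator, because it turns opaque calls back into readable identifiers.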
Proxy Variables/Functions
This technique replaces direct references to a variable/function's identifier with an intermediate variable.
/* Proxy function example */
// Decoding function from above
function c(t) {
/* ... */
}
// Intermediate variable declaration
var r = c;
// Calling r() instead of c() directly
r("NBxAaVZERw"); // => "PX11062"
/* Proxy variable example */
// Intermediate variable for the identifier "window"
var F = window;
// Referencing "F" instead of "window" directly
F.performance.now();
Unary Expressions
Rather than directly using boolean literals or the undefined identifier, this technique takes advantage of the automatic type conversion performed by JavaScript's unary operators.
var o = !0; // equivalent to o = true
var c = !1; // equivalent to c = false
void 0 === this.channels; // equivalent to undefined === this.channels
Though the PerimeterX challenge script's obfuscation may not be as sophisticated as that of other bot detection vendors, it still requires specialized reverse-engineering skills to convert it to a readable state. Simply pasting it in a general JavaScript deobfuscator won't produce easily understandable code.
To deobfuscate the PerimeterX script, you'll need to create a custom deobfuscator. This step can be difficult, but it's essential for creating a PerimeterX bypass!
Try using abstract syntax tree (AST) manipulation.
Once you've deobfuscated the PerimeterX challenge script, you can read it to determine what bot detection checks are performed and how to replicate the challenge-solving behavior. In the next step, we will go over the deobfuscated script and try to extract critical information about its internals.
Step 3: Analyze the Deobfuscated PerimeterX Script
Let's start by figuring out how the payload is encrypted!
PerimeterX's Payload Encryption
To figure out how the payload is encrypted, so we can code our custom PerimeterX bypass, we're going to work backward. First, we find where it's set by searching for the string "payload=" in the deobfuscated script:
var B = {
vid: cn,
tag: ff.Bn,
appID: ff.J,
cu: Uo,
cs: f,
pc: A,
};
var N = Wc(n, B);
var l = [
"payload=" + N,
"appId=" + ff.J,
"tag=" + ff.Bn,
"uuid=" + Uo,
"ft=" + ff.Nn,
"seq=" + Uu++,
"en=NTA",
];
The final value for payload is stored in the variable N. Looking at the definition of N, we can determine that the Wc function is responsible for payload encryption. Wc takes in two parameters:
- n: a JavaScript object that stores the raw payload data.
- B: a JavaScript object that stores some values used as keys in the encryption process.
Let's look up the definition of Wc:
var Wc = function (n, r) {
var t;
var a = n.slice();
t = nc || "1604064986000";
var e = zr(Un(t), 10);
var i = z(a);
a = Un(zr(i, 50));
var c = (function (n, r, t) {
var a, e, i, o, c;
var u = zr(Un(t), 10);
var f = [];
var A = -1;
for (var B = 0; B < n.length; B++) {
/* ... */
}
for (var v = 0; n.length > v; v++) {
/* ... */
}
return f.sort(function (n, r) {
return n - r;
});
})(e, a.length, r[Hc]);
a = (function (n, r, t) {
/* ... */
return (a += r.substring(e));
})(e, a, c);
return a;
};
This is PerimeterX's encryption cipher. The original function is quite long and references many external variables/functions. For the sake of practicality, we've truncated it.
However, there are some important things you can learn about this cipher by looking at the fully deobfuscated PerimeterX script:
- The payload uses two encryption keys: the values of uuid and sts.
- uuid appears in every POST request, while sts appears in the 2nd POST request onwards. In the case of the 1st POST request, where sts is absent, "1604064986000" is used in place of it.
- This is a symmetric-key algorithm. Therefore, as long as you have the original sts and uuid values, you can decrypt any PerimeterX-encrypted payload. This is useful for analyzing the payload that your browser sends since the keys are always sent in the POST request along with the encrypted content.
How PerimeterX Sets Cookies
We previously concluded that all PerimeterX-related cookies were set by the actual script itself. Recall that the raw value of the _px2 cookie first appeared inside of a JSON-formatted response body (as <jwt>):
{
"do": ["bake|_px2|330|<jwt>|true|300", "pnf|cu"]
}
The field name do is quite literal: its corresponding value is an array of instructions. Each string is split on every | into an array. For the first string in the do array, that looks like this:
// The first instruction
var processedInstruction1 = "bake|_px2|330|<jwt>|true|300".split("|"); // => ["bake","_px2","330","<jwt>","true","300"]
The first element of the resulting array determines the function to be executed, while the remaining elements are taken as the arguments for the function. In this case, bake is the name of the function to be executed.
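The split-then-dispatch pattern described above can be sketched in a few lines of Python. The handler table and the placeholder bake handler are hypothetical stand-ins written for illustration; the real handlers live in the challenge script.

```python
def bake(name, max_age, value, use_domain, extra):
    """Placeholder handler: report what a real handler would store."""
    return f"set cookie {name} for {max_age}s"


# Hypothetical handler table: instruction name -> function.
HANDLERS = {"bake": bake}


def run_instruction(instruction, handlers):
    """Split on "|", use the first element as the handler name, pass the rest as arguments."""
    name, *args = instruction.split("|")
    handler = handlers.get(name)
    if handler is None:
        return None  # instructions without a handler are ignored in this sketch
    return handler(*args)


print(run_instruction("bake|_px2|330|<jwt>|true|300", HANDLERS))
print(run_instruction("pnf|cu", HANDLERS))
```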
Searching for bake in the deobfuscated PerimeterX script, we discover the cu object. This cu object holds the handler for the bake instruction:
var cu = {
/**
* @param n = "_px2"
* @param r = "330"
* @param t = "<jwt>"
* @param a = "true"
* @param e = "300"
*/
bake: function (n, r, t, a, e) {
if (ff.J === window._pxAppId) {
wt(n, r, t, a);
}
/* ... */
},
/* ... */
};
The arguments n, r, t, a, and e take on the values "_px2", "330", "<jwt>", "true", and "300", respectively.
The bake method calls a function, wt. Let's look up the definition of that too:
/**
* @param n = "_px2"
* @param r = "330"
* @param t = "<jwt>"
* @param a = "true"
*/
function wt(n, r, t, a) {
/* ...*/
try {
var i;
// Creates the expiry date of the cookie, based on the "r" parameter.
if (r !== null) {
i = new Date(+new Date() + 1000 * r).toUTCString().replace(/GMT$/, "UTC");
}
// Initialize the _px2 cookie string
var o = n + "=" + t + "; expires=" + i + "; path=/";
var c = (a === true || a === "true") && bt();
// Append the site domain to the cookie, and add the cookie to document.cookie
c && (o = o + "; domain=" + c);
document.cookie = o + "; " + e; // e is not defined in this excerpt; presumably set in the elided code above
return true;
} catch (n) {
return false;
}
}
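If you replicate this outside a browser, the same cookie string can be assembled in Python. This is a sketch under assumptions: the domain value is a hypothetical stand-in for whatever bt() returns, and the trailing attributes appended by the original function are omitted because they are not visible in the excerpt.

```python
from datetime import datetime, timedelta, timezone


def bake_cookie(name, max_age, value, domain=None, now=None):
    """Build a cookie string shaped like the one wt() writes to document.cookie."""
    now = now or datetime.now(timezone.utc)
    expires = now + timedelta(seconds=int(max_age))
    # Same shape as toUTCString() with "GMT" replaced by "UTC".
    stamp = expires.strftime("%a, %d %b %Y %H:%M:%S") + " UTC"
    cookie = f"{name}={value}; expires={stamp}; path=/"
    if domain:
        cookie += f"; domain={domain}"
    return cookie


fixed_now = datetime(2022, 9, 15, 17, 38, 53, tzinfo=timezone.utc)
print(bake_cookie("_px2", "330", "<jwt>", domain=".ssense.com", now=fixed_now))
```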
So, it looks like the bake instruction directly sets the _px2 cookie! It's also a play on words, as in baking cookies.
Congrats! You found where in the code their main anti-bot cookie is being set! The next step will be to calculate values for it that make sense to PerimeterX, so your bot does not get flagged as suspicious.
You should note that the cu object contains handlers for all other possible do instructions, too! To bypass PerimeterX, you need to reverse engineer the functionality of each do instruction.
Let's learn how to break some of the security checks you might find inside this do array.
WebGL Fingerprinting
In the snippet below, PerimeterX uses WebGL APIs to create and render an image. The hash of the image is then stored in canvasfp:
// This function creates, renders, and hashes the image to construct "canvasfp".
function A() {
return new T(function (c) {
setTimeout(function () {
try {
a = n.createBuffer();
n.bindBuffer(n.ARRAY_BUFFER, a);
var u = new Float32Array([
-0.2, -0.9, 0, 0.4, -0.26, 0, 0, 0.732134444, 0,
]);
n.bufferData(n.ARRAY_BUFFER, u, n.STATIC_DRAW);
a.itemSize = 3;
a.numItems = 3;
e = n.createProgram();
i = n.createShader(n.VERTEX_SHADER);
n.shaderSource(
i,
"attribute vec2 attrVertex;varying vec2 varyinTexCoordinate;uniform vec2 uniformOffset;void main(){varyinTexCoordinate=attrVertex+uniformOffset;gl_Position=vec4(attrVertex,0,1);}"
);
/* Some more transformations on the canvas image... */
/* ... */
n.drawArrays(n.TRIANGLE_STRIP, 0, a.numItems);
r.canvasfp = n.canvas === null ? "no_fp" : In(n.canvas.toDataURL()); // In() computes a hash of the generated image
r.extensions = n.getSupportedExtensions() || ["no_fp"];
} catch (n) {
r.errors.push("PX10703");
}
/* ... */
}, 1);
});
}
This is useful for fingerprinting because even if instructed to draw the exact same image, slight variations in hardware or low-level software (e.g., the operating system or graphics drivers) will produce a different output (and thus, a different hash). This makes WebGL fingerprinting a good way to classify devices.
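The reason a hash works as a device signature can be shown with a toy example: two pixel buffers that differ in a single bit produce unrelated digests. SHA-256 is used here purely as a stand-in, since the hash computed by In() is not identified in this article, and the pixel data is made up.

```python
import hashlib


def fingerprint(rendered_pixels):
    """Hash raw pixel bytes into a short, comparable identifier."""
    return hashlib.sha256(rendered_pixels).hexdigest()


# Made-up RGBA pixel data standing in for the rendered image on two devices.
device_a = bytes([10, 200, 30, 255] * 4)
device_b = bytearray(device_a)
device_b[5] ^= 1  # a one-bit rendering difference on the second device

print(fingerprint(device_a))
print(fingerprint(bytes(device_b)))
```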
PerimeterX also collects various other WebGL properties to better classify your device. Using machine learning algorithms, they can use this data to detect if you're spoofing WebGL properties/rendering.
The computed canvasfp, along with the additional WebGL properties, is added to the payload object in the snippet below:
// Adding the collected WebGL data to the POST request payload
(function (t) {
a.PX10061 = t.canvasfp;
a.PX11016 = t.webglVendor;
a.PX10529 = t.errors;
a.PX10279 = t.webglRenderer;
a.PX10753 = t.webGLVersion;
a.PX10246 = t.extensions;
a.PX11232 = In(t.extensions);
a.PX10871 = t.webglParameters;
a.PX11231 = In(t.webglParameters);
a.PX11077 = t.unmaskedVendor;
a.PX10165 = t.unmaskedRenderer;
a.PX10244 = t.shadingLangulageVersion;
tt("PX11223");
r(a);
});
Automated Browser
Below, PerimeterX is checking for the existence of automated-browser-specific properties:
try {
n.PX10010 = !!window.emit;
n.PX10225 = !!window.spawn;
n.PX10855 = !!window.fmget_targets;
n.PX11065 = !!window.awesomium;
n.PX10456 = !!window.__nightmare;
n.PX10441 = Xr(window.RunPerfTest);
n.PX10098 = !!window.geb;
n.PX10557 = !!window._Selenium_IDE_Recorder;
n.PX10170 = !!window._phantom || !!window.callPhantom;
n.PX10824 = !!document.__webdriver_script_fn;
n.PX10087 = !!window.domAutomation || !!window.domAutomationController;
n.PX11042 =
window.hasOwnProperty("webdriver") ||
!!window["webdriver"] ||
document.getElementsByTagName("html")[0].getAttribute("webdriver") === "true";
} catch (n) {}
Sandboxing Checks
PerimeterX checks for the existence of Node.js-only APIs to determine if the script is being sandboxed:
var n;
// The process object only exists in NodeJS.
try {
n =
n ||
((typeof process == "undefined" ? "undefined" : A(process)) === "object" &&
String(process) === "[object process]");
} catch (n) {}
try {
n = n || /node|io\.js/.test(process.release.name) === true;
} catch (n) {}
To make sure built-in functions haven't been modified (i.e., monkey-patched), PerimeterX calls typeof and an implicit toString on them:
// A() acts as a wrapper for "typeof"
function A(n) {
A =
typeof Symbol == "function" && typeof Symbol.iterator == "symbol"
? function (n) {
return typeof n;
}
: function (n) {
return n &&
typeof Symbol == "function" &&
n.constructor === Symbol &&
n !== Symbol.prototype
? "symbol"
: typeof n;
};
return A(n);
}
// Xr() reports whether a value is still an unmodified built-in function
function Xr(n) {
// When typeof is called on an unmodified built-in function, it will return "function".
// "" + n is an implicit toString()
// An unmodified built-in function will always include "[native code]" in the result.
return A(n) === "function" && /\{\s*\[native code\]\s*\}/.test("" + n);
}
/* ... */
// Later used like this:
n.PX10213 = Xr(window.EventSource);
n.PX10283 = Xr(Function.prototype.bind);
n.PX10116 = Xr(window.setInterval);
// If they haven't been modified, all the above calls should return true.
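The string test at the heart of Xr() is a plain regular expression, so its effect is easy to reproduce offline. The sketch below applies the same pattern to two sample function sources; the sample strings are assumptions about what a browser typically prints for a native and for a replaced function.

```python
import re

# Same pattern as the JavaScript check above.
NATIVE_CODE = re.compile(r"\{\s*\[native code\]\s*\}")


def looks_native(function_source):
    """Return True when a stringified function still contains the native-code marker."""
    return bool(NATIVE_CODE.search(function_source))


native_source = "function setInterval() { [native code] }"
patched_source = "function setInterval(fn, ms) { return original(fn, ms); }"

print(looks_native(native_source), looks_native(patched_source))
```

Any override you inject into a page therefore needs to survive stringification, not just behave correctly when called.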
User Input Event Tracking
PerimeterX collects behavioral biometrics, such as mouse movements, keyboard presses, and touch movements. The collected data can then be analyzed with machine learning to determine if the inputs are human-like or generated by a bot.
In this snippet, PerimeterX tracks the timing and position of touch events:
{
(function (n, r) {
_i.length < 10 &&
_i.push(
+n.movementX.toFixed(2) + "," + +n.movementY.toFixed(2) + "," + wr(r)
);
if (n && n.movementX && n.movementY) {
if (Pi.length < 50) {
Pi.push(
(function (n) {
var r = n.touches || n.changedTouches;
var t = r && r[0];
var a = +(t ? t.clientX : n.clientX).toFixed(0);
var e = +(t ? t.clientY : n.clientY).toFixed(0);
var i = (function (n) {
return +(n.timestamp || n.timeStamp || 0).toFixed(0);
})(n);
return "".concat(a, ",").concat(e, ",").concat(i);
})(n)
);
}
}
})(n, t);
}
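The inner function above reduces each event to an "x,y,timestamp" string and stops collecting after 50 samples. A rough Python equivalent is sketched below; the function name is hypothetical, and the rounding helper only matches toFixed(0) for non-negative values, which is an accepted simplification here.

```python
import math

MAX_SAMPLES = 50  # same cap as the Pi.length < 50 check above


def round_half_up(value):
    """Round to the nearest integer, halves upward (non-negative inputs only)."""
    return math.floor(value + 0.5)


def record_pointer_sample(samples, client_x, client_y, timestamp_ms):
    """Append an "x,y,timestamp" sample unless the cap has been reached."""
    if len(samples) < MAX_SAMPLES:
        x = round_half_up(client_x)
        y = round_half_up(client_y)
        t = round_half_up(timestamp_ms)
        samples.append(f"{x},{y},{t}")
    return samples


samples = []
record_pointer_sample(samples, 10.2, 20.5, 1000.4)
for step in range(60):
    record_pointer_sample(samples, step, step, 1001 + step)

print(samples[0], len(samples))
```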
Conclusion
Different methods exist to bypass the PerimeterX bot detection system, some being more reliable than others. While reverse engineering the PerimeterX JavaScript challenge is one of the paths, it can get tedious depending on the level of obfuscation.
However, with web scraping APIs like ZenRows, you can easily scrape data from any website with a few lines of code. You can try it by signing up to get your free API key right away.
Keep in mind that PerimeterX frequently updates its challenges, so if you opt for deobfuscation, you must constantly check your web scraper to avoid detection.
Hopefully, this tutorial helped you with your web scraping projects. To learn more techniques for bypassing anti-bot services, check out our detailed guides on Cloudflare bypass and Akamai bypass.