
How to Bypass DataDome: Complete Guide 2024

January 19, 2023 · 14 min read

The sad news is that many websites have implemented advanced anti-bots, like DataDome, to prevent web scraping. The good news is you'll learn how to bypass DataDome in this guide.

Ready? Let's dive in!

What Is DataDome?

DataDome is a complete anti-bot suite that integrates easily with popular frameworks and platforms such as Amazon CloudFront, Nginx, Salesforce and Kubernetes. It's a leading provider of protection against real or apparent cyber threats, including DDoS attacks, scrapers, SQL or XSS injection, and card and payment fraud.

How Does DataDome Detect Bots?

DataDome uses server-side and client-side techniques to detect known and unknown bots: user behavior, network and browser fingerprints, geolocation tracking, etc. DataDome also regularly updates and maintains its dataset to keep up with the ever-evolving bot landscape.

Let's learn the most important techniques of each type as a first step to learning how to bypass DataDome.

Server-side Measures

Server-side measures rely on analyzing the connection to the server, the browsing session and all related metadata. They take advantage of protocol specifications surrounding a browsing session, like HTTP, TCP and TLS, to fingerprint a user and look for inconsistencies and suspicious behavior.

HTTP/2 Fingerprinting

HTTP/2 is a binary protocol that sends data as frames within a stream. Its main goal is to improve the performance of websites and web applications by introducing header field compression and allowing concurrent requests and responses on the same TCP connection.

TCP/IP Fingerprinting

A networked device will indicate its characteristics starting from the first TCP/IP request. A TCP/IP connection is initiated with a TCP SYN packet as part of the TCP three-way handshake. In this specific packet, a client will provide essential parameters such as Time-To-Live (TTL) and support of IP fragmentation.

TLS Fingerprinting

TLS fingerprinting is a server-side technique that web servers use to determine a web client's identity (browsers, CLI tools or scripts) using only the parameters sent in the first packet of the connection, before any application data is exchanged.
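To make the idea concrete, here is a minimal sketch of one widely known TLS fingerprinting scheme, JA3, which joins five fields of the first handshake message into a single string (commonly MD5-hashed afterwards). Whether DataDome uses JA3 specifically is an assumption, and the handshake values below are illustrative rather than captured from a real client:

```javascript
// Build a JA3-style string from parsed handshake fields.
// The `hello` object shape is hypothetical; a real server would parse these
// values from the raw first handshake packet.
function buildJa3String(hello) {
  return [
    String(hello.version),        // TLS version as a decimal number
    hello.ciphers.join("-"),      // offered cipher suites, in order
    hello.extensions.join("-"),   // extension IDs, in order
    hello.curves.join("-"),       // supported elliptic curves
    hello.pointFormats.join("-"), // elliptic curve point formats
  ].join(",");
}

// Illustrative values only
const exampleHello = {
  version: 771,
  ciphers: [4865, 4866, 4867],
  extensions: [0, 23, 65281, 10, 11],
  curves: [29, 23, 24],
  pointFormats: [0],
};
const ja3String = buildJa3String(exampleHello);
```

Because the order of ciphers and extensions differs between browsers, HTTP libraries and CLI tools, two clients claiming the same user agent can still produce different strings.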

Server-side Behavioral

This technique analyzes browsing sessions and navigations based on logs from the server, like requests, operations executed and attempted, IP addresses and interaction with honeypots.

This data is monitored for anomalies and outliers: if the frequency of the requests is too high, it can be rate-limited; if a request's country of origin changes during a browsing session, it indicates the activation of a proxy.
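The two checks described above can be sketched in a few lines. This is only an illustration under assumptions: the log entry shape (`timestamp` in milliseconds, `country` code) and the threshold are hypothetical, not DataDome's actual rules:

```javascript
// Flag a session whose request rate is too high or whose country changes.
// Hypothetical log shape: { timestamp: <ms since epoch>, country: "FR" }.
function findSessionAnomalies(requests, maxRequestsPerMinute) {
  const anomalies = [];
  const sorted = [...requests].sort((a, b) => a.timestamp - b.timestamp);

  // Slide a one-minute window over the sorted timestamps.
  let windowStart = 0;
  for (let i = 0; i < sorted.length; i++) {
    while (sorted[i].timestamp - sorted[windowStart].timestamp >= 60000) {
      windowStart++;
    }
    if (i - windowStart + 1 > maxRequestsPerMinute) {
      anomalies.push("rate-limit");
      break;
    }
  }

  // A country change in the middle of a session hints at a proxy switch.
  const countries = new Set(sorted.map((request) => request.country));
  if (countries.size > 1) anomalies.push("country-change");

  return anomalies;
}
```

For instance, `findSessionAnomalies(sessionRequests, 60)` would return an empty array for a slow, single-country session.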

Client-side Signals

Client-side signals are signals collected from the end-user device. They can be collected in the browser using JavaScript (JS) or an SDK in mobile applications. Some of the techniques used for client-side signals are:

Operating System and Hardware Data

Historically, Operating System fingerprinting has been the most valuable, especially for attackers looking for vulnerable OS versions. It comprises details around CPU, GPU, laptop or phone model, device vendor and manufacturer.

OS details are somewhat static and are less likely to change over time. Taking one step lower than the OS will lead us to hardware data, which is even harder to change.

If you're interested in that data, you can execute the following JavaScript in your Developer Console:

program.js
// Basic OS and memory hints exposed by the browser
console.log("OS: " + navigator.platform);
console.log("Available RAM in GB: " + navigator.deviceMemory);

// Create an off-screen canvas to query WebGL details
var canvas = document.createElement("canvas");
var gl = canvas.getContext("webgl") || canvas.getContext("experimental-webgl");

// The debug extension exposes the unmasked GPU vendor and renderer
var dbgRender = gl.getExtension("WEBGL_debug_renderer_info");
console.log("GL renderer: " + gl.getParameter(gl.RENDERER));
console.log("GL vendor: " + gl.getParameter(gl.VENDOR));
console.log("Unmasked renderer: " + gl.getParameter(dbgRender.UNMASKED_RENDERER_WEBGL));
console.log("Unmasked vendor: " + gl.getParameter(dbgRender.UNMASKED_VENDOR_WEBGL));

The output will be something similar to this:

Output
OS: Linux x86_64 
Available RAM in GB: 2 
GL renderer: WebKit WebGL 
GL vendor: WebKit 
Unmasked renderer: NVIDIA GeForce GTX 775M OpenGL Engine
Unmasked vendor: NVIDIA Corporation

It's worth noting that, in the context of a mobile application and SDKs, OS and hardware fingerprinting are considerably more effective and may have access to globally persistent data such as the MAC address or the International Mobile Equipment Identity (IMEI).

Browser Fingerprinting

Browser fingerprinting is a technique used to uniquely identify a web browser by collecting various information about the browser, like the browser type and version, screen resolution and IP address. This information creates a unique "fingerprint" to track browsers across different websites and sessions.

Some techniques employed in browser fingerprinting are Canvas fingerprinting, Audio fingerprinting, Storage and persistent tracking, and Media devices fingerprinting.

Behavioral Data

These data are generated when users interact with a website or app. They can come from gestures like moving the mouse, clicking, touching the screen, typing quickly or using sensors (like the accelerometer in a device).

So how does DataDome succeed with these methods?


How Does DataDome Use These Techniques?

DataDome's bot detection engine is a system capable of making quick decisions using the abovementioned detection techniques. They're applied in three layers:

  • Verified bots and custom rules: clear rules to allow/block requests based on their direct attributes, like source IP address, source domain or user agent. That is where verified bots, like Google's crawler, are let in without question.
  • Signature-based bot detection: the synthesis of the fingerprinting is reduced to signatures that are checked on-the-fly. This line of defense is where most bots get caught: automated browsers, proxies, virtual machines and emulators.
  • Machine learning detection: this layer learns from the signals it's fed and picks up on subtle indicators of bot behavior. These machine learning algorithms continuously learn from all signals collected and can identify contradictions in how a device presents itself. Even advanced scrapers, built with well-configured automated browsers and using residential IPs, can get caught at this level.

Reverse Engineering the DataDome Anti-bot System

Anti-bot systems today are complex and multi-layered, and DataDome isn't an exception. To do a DataDome bypass, you need to reverse engineer it. We'll examine the following for a comprehensive approach:

  • DataDome's CAPTCHA.
  • DataDome network requests.
  • Reverse engineering DataDome's JavaScript file.

DataDome's CAPTCHA Challenge

To trigger DataDome's anti-bot mechanisms, we'll use Playwright to try to access the Vercel DataDome template. That will trigger a page stating that our browser displays suspicious behavior with a CAPTCHA, and if you take a closer look at the source code, you'll notice it's loaded from GeeTest.

Let's bypass it using PuppeteerJS and some OpenCV.js magic. Compare the images and analyze their outlines, then calculate how much the slider needs to move to solve the puzzle, and lastly, task Puppeteer with moving the slider:

Figure: Automated CAPTCHA bypass fails with static mouse

What about all the behavioral data and mouse movement we covered before? Is this what gives away our automated browser? Let's retry the same thing, but we'll assist our bot and move our mouse somewhat during the solution this time:

Figure: Assisted CAPTCHA bypass with mouse movements

That's it! The idle mouse was what gave away our bot.

But can it be considered efficient if our DataDome bypass solution requires manual intervention to move the mouse? No. The bots should never trigger the CAPTCHA challenge in the first place. That was just an illustration of how modern CAPTCHAs used by DataDome correlate additional data signals with the solution to detect bots.

DataDome Network Requests

The network requests are always a good starting point to inspect how the JavaScript file is fetched and other outgoing requests to DataDome's servers. Launch a browser and open the developer tools to reverse engineer this for our DataDome bypass. Then, switch to the Network tab and navigate to a website protected by DataDome.

The first request related to DataDome is a GET request that fetches the JavaScript file from https://js.datadome.co/tags.js, which returns a sizable script:

Figure: Network request to fetch DataDome's JS file

As part of the obfuscation, such files vary from request to request to confuse anybody trying to understand them. That is achieved using polymorphic JavaScript obfuscation applied at request time, which randomizes the output of the JavaScript obfuscator.

In our case, the script's logic stays the same, but its identifier names vary when it's requested from different sources and at different times. Fortunately, this doesn't count as polymorphic JavaScript obfuscation, just randomly generated names.

The second outgoing network request to DataDome's servers is a POST request to https://api-js.datadome.co/js/ with a considerable payload:

Figure: Payload of DataDome tracking network request

Its payload contains multiple parameters, but the most notable of them are:

  • jsData: used to fingerprint the browser visiting a protected page based on many parameters. DataDome's servers will check if this data is consistent, if it matches a bot's signature, and use it to keep track of navigation and behavior.
  • Events: an array of events captured by the script, such as mouse movements, scrolls and keyboard inputs.
  • eventCounters: it can be used as a data signal on its own or to check the integrity of reported events.
  • ddk: a key that identifies the DataDome client.
  • Referer: the address from which the request originated.

If all goes well, the response to this request will contain a cookie named datadome, which will be supplied as proof of clearance in subsequent requests.

Deobfuscating the JavaScript Challenge

Since client-side measures rely on running a script on the user's device, they must be shipped with the protected website or application. These scripts contain proprietary code and are protected using obfuscation. These measures make reverse engineering slower and more complex, but they don't make it impossible.

Start by using an online JavaScript deobfuscator, like DeObfuscate.io, and head to https://js.datadome.co/tags.js, where DataDome hosts its script. The JavaScript deobfuscator renames variables to more human-friendly formats, abstracts proxy functions, and simplifies expressions.

Some obfuscation techniques aren't easily reversible and will need you to go further than automated tools.

Let's go through these techniques step by step.

String Concealing

Variables and functions are renamed to meaningless names to lower the script's readability, and all references to strings are replaced with hexadecimal representations. Besides renaming and encoding, they're hidden using a string-concealing function.

larone() is an example of such a function in the script: it's called over 1200 times and contains the alphabet characters from "a" to "z" plus three special symbols, which points to character manipulation. Throughout the code, this function is often reassigned to a variable to confuse debugging.

When called with an integer as a parameter, it returns a string that will uncover function calls and variable names. Translating its usage to the corresponding functions will be a massive step in the deobfuscation, so let's see what this function returns for each call in a file.

You can set a breakpoint in the browser's developer console where the said function will be available as window.larone and write a simple loop to generate a mapping from parameters to string outputs:

Figure: String-concealing function deobfuscation dictionary
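A loop of that kind could look like the sketch below. The helper name `buildStringMapping` and the index range are assumptions; `concealFn` stands in for window.larone while execution is paused at the breakpoint:

```javascript
// Build an { index: string } dictionary for a string-concealing function.
// `concealFn` stands in for window.larone; the index range is a guess to adjust.
function buildStringMapping(concealFn, firstIndex, lastIndex) {
  const mapping = {};
  for (let index = firstIndex; index <= lastIndex; index++) {
    try {
      const value = concealFn(index);
      // Keep only indexes that resolve to an actual string.
      if (typeof value === "string") mapping[index] = value;
    } catch (error) {
      // Out-of-range indexes may throw; skip them.
    }
  }
  return mapping;
}
```

In the console, something like `copy(JSON.stringify(buildStringMapping(window.larone, 0, 1000)))` would place the dictionary on the clipboard, with the range widened or narrowed as needed.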

After that, we'll write a script to replace calls to the larone() function in the JS file with their string equivalents. This turns statements like document[_0x3b60ec(132)]("\x63\x61\x6e\x76\x61\x73")["\x67\x65\x74\x43\x6f\x6e\x74\x65\x78\x74"](_0x3b60ec(406)); into more expressive ones like document.createElement("canvas").getContext("webgl");.
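A minimal sketch of such a replacement script follows. It assumes a dictionary of index-to-string pairs has already been extracted, and that the concealing function and its aliases have plain names made of letters, digits and underscores. The function name `inlineConcealedStrings` is hypothetical:

```javascript
// Replace calls like larone(132) with their string value, then decode \xNN escapes.
function inlineConcealedStrings(source, mapping, aliases) {
  // Match alias(<decimal literal>) for any known alias of the concealing function.
  const callPattern = new RegExp("\\b(?:" + aliases.join("|") + ")\\((\\d+)\\)", "g");
  const withStrings = source.replace(callPattern, (match, index) =>
    Object.prototype.hasOwnProperty.call(mapping, index)
      ? JSON.stringify(mapping[index])
      : match // leave unknown indexes untouched
  );
  // Turn hexadecimal escapes back into readable characters.
  return withStrings.replace(/\\x([0-9a-fA-F]{2})/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

const sample = String.raw`document[_0x3b60ec(132)]("\x63\x61\x6e\x76\x61\x73")["\x67\x65\x74\x43\x6f\x6e\x74\x65\x78\x74"](_0x3b60ec(406));`;
const readable = inlineConcealedStrings(
  sample,
  { 132: "createElement", 406: "webgl" },
  ["larone", "_0x3b60ec"]
);
// readable: document["createElement"]("canvas")["getContext"]("webgl");
```

Converting the remaining bracket notation to dot notation is a separate cosmetic step. A regex pass like this is fragile on edge cases (for example, a decoded quote character inside a string), so a parser-based approach may be safer for the full file.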

After converting these hexadecimal identifiers to human-readable versions and replacing the string-concealing function, some of the puzzles we figured out are CAPTCHAs, POST request's payload, DataDome JS Keys and the Chrome ID.

Editing Variables at Runtime

A significant matter in the script is the huge array of cryptic values. Following where it's used leads us to the first self-invoking function in the script. It shuffles the array by moving the first element to the last position until an expression equals the second integer parameter:

program.js
(function (margart, aloura) { 
	var masal = margart(); 
	while (true) { 
		try { 
			var yasameen = 
				(parseInt(larone(392)) / 1) * (-parseInt(larone(713)) / 2) + 
				(-parseInt(larone(390)) / 3) * (-parseInt(larone(606)) / 4) + 
				parseInt(larone(756)) / 5 + 
				(-parseInt(larone(629)) / 6) * (-parseInt(larone(739)) / 7) + 
				(-parseInt(larone(679)) / 8) * (parseInt(larone(426)) / 9) + 
				parseInt(larone(385)) / 10 + 
				(parseInt(larone(362)) / 11) * (-parseInt(larone(699)) / 12); 
			if (yasameen === aloura) break; 
			else masal.push(masal.shift()); 
		} catch (prabh) { 
			masal.push(masal.shift()); 
		} 
	} 
})(jincy, 291953);

By setting a convenient breakpoint and adding a counter to the function above, we see its loop runs about 300 times.

The sole purpose of this function is to edit the global array of encoded strings until the yasameen variable evaluates to the second integer parameter. It reorganizes the array by shifting the first element to the last position. That happens at runtime and may disorient the reader when decrypting a concealed string.
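The rotation trick is easier to follow on toy data. The sketch below uses a four-element array and a made-up position-dependent checksum, loosely modeled on the parseInt(...) / n terms above; none of the values come from DataDome's script:

```javascript
// Rotate `values` in place until checksum(values) equals `target`.
// Returns the number of rotations performed.
function rotateUntil(values, checksum, target) {
  let rotations = 0;
  while (checksum(values) !== target) {
    // After a full cycle the array is back where it started: give up.
    if (rotations >= values.length) throw new Error("target never reached");
    values.push(values.shift()); // move the first element to the end
    rotations++;
  }
  return rotations;
}

const shuffled = ["30", "40", "10", "20"];
// The checksum depends on element positions, so only one rotation satisfies it.
const checksum = (list) => parseInt(list[0]) / 1 + parseInt(list[1]) / 2;
const rotations = rotateUntil(shuffled, checksum, 20);
// rotations is 2 and shuffled is now ["10", "20", "30", "40"]
```

Until the array reaches this position, every index passed to the string-concealing function resolves to the wrong entry, which is why reading the array statically is misleading.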

Control Flow Obfuscation

Developers like to organize their code by its responsibility, like data, business logic, HTTP libraries, encryption, etc. That allows us to separate concerns and follow the flow of execution, especially when debugging.

Good code is readable, and obfuscators know that, so they take advantage of it using control flow obfuscation. This process rearranges instructions to make following the script's logic difficult for humans. If the script is compiled and obfuscated with such techniques, it can even crash decompilers trying to make sense of it.

In our case, the primary function in the script heavily uses recursion and nested calls to blur the path of execution. That is noticeable with the main() function and the executeModule() function it defines. We've also renamed the variables to more expressive names:

program.js
!(function main(allModules, secondArgument, modulesToCall) {
	function executeModule(moduleIndex) {
		// secondArgument acts as a cache of already executed modules
		if (!secondArgument[moduleIndex]) {
			if (!allModules[moduleIndex]) {
				var err = new Error("Cannot find module '" + moduleIndex + "'");
			}
			// ...
			allModules[moduleIndex][0].call(
				// ...
				// Resolve an import key to a module index, then recurse
				function (key) {
					return executeModule(allModules[moduleIndex][1][key] || key);
				},
				// ...
				main,
				modulesToCall
				// ...
			);
		}
		return secondArgument[moduleIndex].exports;
	}
	for (...) executeModule(modulesToCall[i]);
	return executeModule;
})(...)

Figuring out the flow of execution of such functions isn't trivial. We suspect it of orchestrating all the execution and of handling JavaScript modules. Why? Because the error message Cannot find module in the above code gives it away.

Note: This could be due to control flow obfuscation or standard JS module bundling. But it does obfuscate our understanding of the script, so we'll count it as obfuscation.

Analyzing the Deobfuscated Script

So far, we've peeled quite a few obfuscation layers to understand how to bypass DataDome. It's time to step back and take a bird's-eye view of the script. The script contains three categories: the global variables, the string concealing functions and the main function:

program.js
// Global variables, with import paths as keys and small integers as values 
... 
var global3 = { 
	"./common/DataDomeOptions": 1, 
	"./common/DataDomeTools": 2, 
	"./http/DataDomeResponse": 5, 
} 
var global4 = {...} 
// ... 
 
// First self-invoking function to shuffle global array defined below 
(function (first, second) { 
	... 
})(globalArray, 123456) 
 
// String-concealing function 
function concealString(first, second) { 
	// ... 
} 
 
// Global array of 500+ cryptic strings 
globalArray = [ 
	"CMvZB2X2zwrpChrPB25Z", 
	"AM5Oz25VBMTUzwHWzwPQBMvOzwHSBgTSAxbSBwjTAg4", 
	// ... 
] 
 
// Main function 
!(function main(first, second, third) { 
	// Orchestrate JavaScript module passed in first argument 
})( 
	// Nine JavaScript modules, 
	{ 
		1: [ 
			function () {}, 
			global1, 
		], 
		2: [...], 
		// ... 
	},
	{}, 
	[6], // Entrypoint module 
)

Our analysis will focus on the main function. You'll notice it has a substantial first parameter, and most of the file's code is in this parameter alone. The first parameter is an object that follows a specific pattern, and each value of the object is an array of 2 elements: a function with three arguments and a global object.

Most modules are auxiliary to bot detection and deal with events tracking, HTTP requests and string handling. We'll leave those for you to discover on your own. As for bot detection, the modules related to it are:

  • Module 1: It initializes DataDome's options, like this.endpoint, this.isSalesForce, and this.exposeCaptchaFunction. It also defines and runs a function called this.check. This function loads all DataDome options and is called early in the entry point module.
  • Module 3: This is where fingerprinting happens. It defines over 35 functions, with names prefixed by dd_, and runs them asynchronously. Each of these functions gathers a set of fingerprinting signals. For example, the functions this.dd_j() and this.dd_k() check window variables indicating the use of PhantomJS.
  • Module 6: It's the last argument of the main() function. It loads DataDome options (using module 1) and enables tracking of events (using modules 7 and 8).

DataDome's Fingerprinting Data

Let's look at the data collected by DataDome, how it relates to the previously mentioned fingerprinting methods and how it's gathered in the JavaScript file. For that, we'll match the payload of the HTTP POST request with the results of our deobfuscation.

We'll start with the POST payload attributes glrd and glvd, representing the WebGL renderer and vendor information, respectively. That falls under browser fingerprinting data, and this is how DataDome's script collects it:

program.js
webgl_context = document.createElement("canvas").getContext("webgl") 
webgl_info = webgl_context.getExtension("WEBGL_debug_renderer_info") 
jsData.glrd = webgl_context.getParameter(webgl_info.UNMASKED_RENDERER_WEBGL) 
jsData.glvd = webgl_context.getParameter(webgl_info.UNMASKED_VENDOR_WEBGL)

Another data point about the visitor is the browser plugins installed. These are highly personalized from user to user; they don't change often and are empty for automated browsers. This small JavaScript loop gathers the names of the plugins installed and their total number, similar to what DataDome does:

program.js
var plugins = [] 
for (var i = 0; i < window.navigator.plugins.length; i++)
	 plugins.push(window.navigator.plugins[i].name) 
 
jsData.plg = plugins.length 
jsData.plu = plugins.join()

DataDome collects over 30 data points in the jsData payload, depending on the device, the peripherals, the display used, the browser and geolocation. Here's a data sample accompanied by the corresponding JavaScript expressions from DataDome's deobfuscated JavaScript:

program.js
{ 
	"rs_h": 1080, // window.screen.height 
	"rs_w": 1920, // window.screen.width 
	"rs_cd": 24, // window.screen.colorDepth 
	"phe": false, // window._phantom 
	"nm": false, // window.__nightmare 
	"ua": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36", // window.navigator.userAgent 
	"lg": "en-GB", // navigator.language || navigator.userLanguage || navigator.browserLanguage || navigator.systemLanguage || ""; 
	"ars_h": 1053, // screen.availHeight 
	"ars_w": 1848, // screen.availWidth 
	"str_ss": true, // window.sessionStorage 
	"str_ls": true, // window.localStorage 
	"str_idb": true, // window.indexedDB 
	"str_odb": true, // window.openDatabase 
	"mmt": "application/pdf,text/pdf", // window.navigator.mimeTypes 
	"plg": 5, // Number of plugins 
	"eva": 33, // eval.toString().length 
	"ts_tec": false, // touch event create 
	"ts_tsa": false, // "ontouchstart" in window 
	"vnd": "Google Inc.", // window.navigator.vendor 
	"bid": "NA", // window.navigator.buildID || "NA" 
	"tzp": "Europe/Paris", // Intl.DateTimeFormat().resolvedOptions().timeZone 
	"cvs": true, // Boolean(document.createElement("canvas").getContext("2d")) 
	"dcok": ".coingecko.com" // window.location.hostname 
}
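To illustrate how a server might read such a payload, the sketch below flags a few signals covered in this guide: the PhantomJS and Nightmare globals, an empty plugin list, and a headless user agent. The rules and the function name `findFingerprintRedFlags` are assumptions for illustration, not DataDome's actual logic:

```javascript
// Return human-readable reasons why a jsData payload looks automated.
function findFingerprintRedFlags(jsData) {
  const flags = [];
  if (jsData.phe) flags.push("PhantomJS global present");
  if (jsData.nm) flags.push("Nightmare global present");
  // Automated browsers typically report no plugins.
  if (jsData.plg === 0) flags.push("no browser plugins");
  // Headless Chrome advertises itself in its default user agent.
  if (/HeadlessChrome/.test(jsData.ua || "")) flags.push("headless user agent");
  return flags;
}
```

Applied to the sample payload above, none of these four checks would fire, since phe and nm are false, plg is 5, and the user agent is a regular Chrome string.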

Now that we've reverse-engineered its anti-bot system, let's go ahead and create a DataDome bypass solution!

How to Bypass DataDome Bot Detection

To bypass DataDome's protection, we'll combine all the knowledge you've gained from the previous sections. As you know, it uses bot detection techniques on the server and client sides, like TLS fingerprinting and browser fingerprinting. To create a DataDome bypass measure, you'll sneak under the radar of these techniques and extract the data.

Here are some ways to do that:

Automated Browsers

Automated browsers, like Selenium and Puppeteer, are tools created for automation purposes, and due to this, they spend no effort hiding their nature. To bypass DataDome in Python (or any other language), use a masking package, like puppeteer-extra-plugin-stealth for Puppeteer, selenium-stealth for Selenium or playwright-stealth for Playwright.

These browsers and their extensions may differ, but the techniques behind them are all the same. The extensions handle the inconsistencies in browser fingerprints, override browser JavaScript variables and remove global variables specific to automated browsers.
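As a rough illustration of the kind of patching these extensions perform, the sketch below overrides one browser variable and removes two automation-specific globals on a window-like object. It is a simplified stand-in, not the code of any of the packages named above, and real stealth plugins cover many more signals:

```javascript
// Apply a few masking patches to a window-like object.
function applyStealthPatches(win) {
  // Report navigator.webdriver as undefined instead of true.
  Object.defineProperty(win.navigator, "webdriver", {
    get: () => undefined,
    configurable: true,
  });
  // Remove globals left behind by automation frameworks.
  for (const name of ["_phantom", "__nightmare"]) {
    delete win[name];
  }
  return win;
}
```

In a real browser, patches like this need to run before any page script does, otherwise the detection script may read the original values first.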

Quality Proxies

Proxies are essential for web scraping as they conceal and safeguard your IP address, allowing you to bypass DataDome bot detection and extract data. There are different web scraping proxies, but the most efficient for DataDome bypass are datacenter and residential proxies.

It's also vital to consider static and rotating proxies: static proxies provide a constant IP address; therefore, sending too many requests can lead to detection. The best way to bypass DataDome detection is to rotate your proxies.
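Rotation itself can be as simple as cycling through a pool. The sketch below is a minimal round-robin rotator; the proxy URLs are placeholders, and the function name `createProxyRotator` is hypothetical:

```javascript
// Return a function that hands out proxies from the pool in round-robin order.
function createProxyRotator(proxies) {
  if (proxies.length === 0) throw new Error("proxy pool is empty");
  let next = 0;
  return function nextProxy() {
    const proxy = proxies[next];
    next = (next + 1) % proxies.length; // wrap around at the end of the pool
    return proxy;
  };
}

// Placeholder addresses for illustration
const nextProxy = createProxyRotator([
  "http://proxy-a.example.invalid:8080",
  "http://proxy-b.example.invalid:8080",
]);
```

Each outgoing request would call `nextProxy()` to pick its exit address. Production setups usually also drop proxies that start failing or getting blocked, which this sketch leaves out.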

CAPTCHA Bypass

DataDome uses CAPTCHA, and solving it can serve as hard proof that the user is human. There are two types of CAPTCHA-solving services, sometimes a hybrid type combining both:

  • Automated CAPTCHA solvers: capable of providing quick solutions to challenges. They're based on machine learning techniques like optical character recognition (OCR) and object detection.
  • Human workers: a team that manually solves all CAPTCHAs and provides you with the response. It's slow and expensive but a reliable way to solve a CAPTCHA.

But let's be realistic here: letting your scraper hit a CAPTCHA is a bad idea since it slows the scraping process and adds costs. The best way is to avoid DataDome's CAPTCHA entirely by using CAPTCHA proxies. A good CAPTCHA proxy must be fast, scalable and able to support different renderings without compromising speed.

Web Scraping API

While the techniques shared are effective for bypassing DataDome, it's not practical to spend lots of time, energy and money developing and maintaining them. A more straightforward solution to DataDome bypass would be to use a tool capable of avoiding anti-bots, like ZenRows.

ZenRows is a web scraping API and proxy server that quickly scrapes content without going to war with DataDome or other anti-bots. It bypasses DataDome by preventing CAPTCHAs, rotating premium proxies, and using headless browsers.

Conclusion

Whew! That was quite a ride. Congratulations on sticking with us till the end. In this article, we introduced the techniques of the DataDome bot detection system, reverse-engineered the anti-bot and discussed some DataDome bypass techniques.

As a recap, the methods to bypass DataDome bot detection are:

  • Use automated browsers.
  • Use quality proxies.
  • Use CAPTCHA-bypass services.
  • Use a web scraping API.

While these methods are effective, they can be expensive and stressful, and there's still the possibility of getting detected while scaling. The best way to avoid this is to use a web scraping tool that efficiently bypasses DataDome and other anti-bots, like ZenRows, which you can try for free.

Frequent Questions

What Does DataDome Do?

DataDome is a web security platform that identifies, classifies, and blocks automated threats. It uses advanced methods like machine learning to analyze user behavior, detect suspicious patterns, and provide real-time bot protection.

Its services include protection against the following threats:

  • Account takeover.
  • Scraping.
  • Denial of Service.
  • Card cracking.
  • Credential stuffing.
  • Server overload.
  • Fake account creation.
  • Vulnerability scanning.
