How to Bypass DataDome With Puppeteer

May 20, 2024 · 10 min read

Are you using Puppeteer for web scraping but getting blocked by DataDome? We've got you covered!

In this tutorial, you'll learn how DataDome works and the four best methods to bypass it with Puppeteer.

What Is DataDome?

DataDome is a web security service that protects websites against cyber threats, including account takeover, DDoS, online fraud, SQL injection, XSS, and more. It also blocks web scraping activities, preventing you from getting your target data.

Bypassing DataDome during web scraping can be difficult, as it uses advanced detection mechanisms that swiftly detect bot-like activities. It protects various website types, including e-commerce, real estate, news outlets, and social media, among many others.

How Does DataDome Work?

DataDome uses various detection techniques, including invisible JavaScript challenges and CAPTCHA, to differentiate between bots and legitimate users.

Its invisible challenge uses machine learning to analyze bot-like signals on both the client and the server side. The signals it inspects include browser and device fingerprints, geolocation, network details, and user behavior such as scrolling, navigation, and mouse movement patterns.

Once DataDome detects that your request deviates from the typical user behavior, it activates a firewall that blocks it.

Why Is Puppeteer Alone Not Enough to Bypass DataDome?

Puppeteer is a Node.js library that automates the browser. Its ability to interact dynamically with websites and execute JavaScript makes it suitable for content extraction.

While Puppeteer is great for web scraping, some limitations may prevent you from accessing DataDome-protected websites.

The first barrier is that Puppeteer exposes bot-like properties during scraping, such as the WebDriver flag (navigator.webdriver). It also can't handle DataDome's advanced fingerprinting and machine-learning techniques on its own.

For example, Puppeteer gets blocked while scraping a DataDome-protected website like Best Western. See the target page below:

Best Western Homepage

Try it out with the following JavaScript:

scraper.js
// import the required library
const puppeteer = require("puppeteer");

(async () => {

  // start Puppeteer in headless mode and open the target website
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();

  const url = "https://www.bestwestern.com/";
  const response = await page.goto(url);
  
  // check and log the response status
  if (response && response.status() !== 200) {
    console.error(`Failed to load the page. Status: ${response.status()}`);
    await browser.close();
    return;
  }
  
  // wait for the content to load
  await page.waitForSelector("body");
  
  // get the content of the page
  const content = await page.content();
  console.log(response.status(), content);

  await browser.close();
})();

The output of the above code shows that DataDome blocked Puppeteer with a 403 error:

Output
Failed to load the page. Status: 403

That Puppeteer scraper couldn't evade DataDome. Let's see how to solve this problem in the next section.

Best Methods to Bypass DataDome With Puppeteer

DataDome's security system is advanced. This section outlines the four best methods for improving Puppeteer and beating DataDome.

Method #1: Get Premium Proxies

Proxies are services that change your IP address so the server treats your request as coming from a different location.

Routing your scraper through a proxy makes it appear more legitimate and can prevent IP bans due to rate limiting. Puppeteer supports proxy integration to avoid detection.

You can use free proxies, but their short lifespan makes them unreliable. The best option is to use premium web scraping proxies requiring authentication credentials like usernames and passwords.
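
As a minimal sketch of the setup described above, the script below passes the proxy address to Chromium through its --proxy-server flag and supplies the credentials with Puppeteer's page.authenticate(). The PROXY_SERVER, PROXY_USERNAME, and PROXY_PASSWORD environment variables and the buildLaunchOptions helper are hypothetical names chosen for illustration; use whatever your proxy provider gives you. A proxy alone may not be enough to get past DataDome.

```javascript
// npm install puppeteer
// Hypothetical placeholders: set these to the values from your proxy provider,
// e.g. PROXY_SERVER="http://proxy.example.com:8080" (made-up address).
const PROXY_SERVER = process.env.PROXY_SERVER;
const PROXY_USERNAME = process.env.PROXY_USERNAME;
const PROXY_PASSWORD = process.env.PROXY_PASSWORD;

// build launch options that route all browser traffic through the proxy
function buildLaunchOptions(proxyServer) {
  return {
    headless: "new", // same headless setting as the earlier script
    args: [`--proxy-server=${proxyServer}`],
  };
}

async function scrapeThroughProxy(url) {
  // load Puppeteer lazily so the helper above stays usable on its own
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch(buildLaunchOptions(PROXY_SERVER));
  try {
    const page = await browser.newPage();
    // answer the proxy's authentication prompt with your credentials
    await page.authenticate({
      username: PROXY_USERNAME,
      password: PROXY_PASSWORD,
    });
    const response = await page.goto(url);
    console.log("Status:", response ? response.status() : "no response");
  } finally {
    // always release the browser, even if navigation fails
    await browser.close();
  }
}

if (PROXY_SERVER) {
  scrapeThroughProxy("https://www.bestwestern.com/").catch(console.error);
} else {
  console.log("Set PROXY_SERVER, PROXY_USERNAME, and PROXY_PASSWORD to run the scraper.");
}
```

Reading the credentials from the environment keeps them out of the source file, which matters once the script is shared or committed.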

Check out our tutorial on using proxies with Puppeteer to learn more.

Method #2: Use Puppeteer-extra-plugin-stealth

Puppeteer-extra-plugin-stealth is a plugin for Puppeteer Extra that bundles various evasion techniques for bypassing anti-bots like DataDome. It patches some limitations of the vanilla Puppeteer library by removing bot-like properties such as the WebDriver flag.

It also allows you to customize browser properties to mimic various user environments during web scraping. This feature increases your chance of evading DataDome's browser fingerprinting.
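
The plugin registration can be sketched as follows, assuming the puppeteer, puppeteer-extra, and puppeteer-extra-plugin-stealth packages are installed. The isBlocked helper and the command-line URL argument are illustrative additions, not part of the plugin, and the sketch makes no promise that the evasions are enough for DataDome on a given site.

```javascript
// npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth

// illustrative helper: treat the 403 status seen earlier as a block
function isBlocked(status) {
  return status === 403;
}

async function scrapeWithStealth(url) {
  // puppeteer-extra wraps Puppeteer so that plugins can hook into it
  const puppeteer = require("puppeteer-extra");
  const StealthPlugin = require("puppeteer-extra-plugin-stealth");

  // register the stealth evasions before launching the browser
  puppeteer.use(StealthPlugin());

  const browser = await puppeteer.launch({ headless: "new" });
  try {
    const page = await browser.newPage();
    const response = await page.goto(url);
    const status = response ? response.status() : null;
    console.log(`Status: ${status}, blocked: ${isBlocked(status)}`);
    return status;
  } finally {
    await browser.close();
  }
}

// run with: node scraper.js https://www.bestwestern.com/
const targetUrl = process.argv[2];
if (targetUrl) {
  scrapeWithStealth(targetUrl).catch(console.error);
} else {
  console.log("Usage: node scraper.js <url>");
}
```

Apart from the two extra require calls and puppeteer.use(), the script uses the same launch and navigation calls as plain Puppeteer, so an existing scraper needs few changes.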

See our article on using the Puppeteer Stealth plugin for a more detailed tutorial.

Method #3: Use a Web Scraping API (The Easiest)

A web scraping API is a solution for bypassing CAPTCHAs and other anti-bot systems. ZenRows is a leading web scraping API that fixes your request headers, auto-rotates premium proxies, and bypasses CAPTCHAs and other anti-bot detections like DataDome at scale.

It also works as a headless browser with JavaScript instructions for extracting dynamically rendered content like infinite scrolls. A web scraping API is the easiest way to bypass DataDome since it requires little technical setup and is compatible with any programming language.

Let's use ZenRows to scrape the full-page HTML of the previous DataDome-protected page that blocked you with Puppeteer.

Sign up to open the ZenRows Request Builder. Paste the target URL in the link box, enable JS Rendering under Boost mode, and activate Premium Proxies. Then select Node.js as your programming language, choose the API request mode, and copy the generated code into your script:

ZenRows Request Builder

Here's a slightly modified version of the generated code:

scraper.js
// npm install axios
const axios = require("axios");

// define your request parameters and make an axios request
axios({
    url: "https://api.zenrows.com/v1/",
    method: "GET",
    params: {
        "url": "https://www.bestwestern.com/",
        "apikey": "<YOUR_ZENROWS_API_KEY>",
        "js_render": "true",
        "premium_proxy": "true",
    },
})
    .then(response => console.log(response.data))
    .catch(error => console.log(error));

The above code extracts the full-page HTML of the protected website. See the result below, showing the page title with some omitted content:

Output
<html lang="en-us">
<head>
    <title>Best Western Hotels - Book Online For The Lowest Rate</title>
</head>
<body class="bestWesternContent bwhr-brand">
    <header>
        <!-- ... -->
    </header>
    
    <!-- ... -->
    
</body>
</html>

Congratulations! You just scraped a DataDome-protected website using ZenRows and Axios. Let's see one more solution.

Method #4: Fix Your Request Headers

The request headers describe the source of a request and determine how the server will respond to it. Incomplete or wrong request header parameters can expose your scraper as an automated script and trigger DataDome's anti-bot system.

For example, in headless mode, Puppeteer's User-Agent header contains a "HeadlessChrome" token, which makes the scraper easier for anti-bot systems to detect. You can configure Puppeteer's request headers with custom values to avoid appearing like a bot and increase your chance of evading DataDome.
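
One way to apply this, sketched under a few assumptions: read the browser's default User-Agent, swap the "HeadlessChrome" token for "Chrome", and set it with page.setUserAgent(), then add extra headers with page.setExtraHTTPHeaders(). The cleanUserAgent helper and the Accept-Language value are illustrative choices, not recommended settings, and cleaner headers alone may not get you past DataDome.

```javascript
// npm install puppeteer

// replace the headless marker so the User-Agent reads like regular Chrome
function cleanUserAgent(userAgent) {
  return userAgent.replace("HeadlessChrome", "Chrome");
}

// illustrative extra header; adjust the value to the locale you want to present
const EXTRA_HEADERS = {
  "Accept-Language": "en-US,en;q=0.9",
};

async function scrapeWithCustomHeaders(url) {
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch({ headless: "new" });
  try {
    const page = await browser.newPage();

    // start from the browser's own User-Agent so the version stays realistic
    const defaultUserAgent = await browser.userAgent();
    await page.setUserAgent(cleanUserAgent(defaultUserAgent));

    // send the extra headers with every request the page makes
    await page.setExtraHTTPHeaders(EXTRA_HEADERS);

    const response = await page.goto(url);
    console.log("Status:", response ? response.status() : "no response");
  } finally {
    await browser.close();
  }
}

// run with: node scraper.js https://www.bestwestern.com/
const targetUrl = process.argv[2];
if (targetUrl) {
  scrapeWithCustomHeaders(targetUrl).catch(console.error);
} else {
  console.log("Usage: node scraper.js <url>");
}
```

Deriving the User-Agent from the running browser, rather than hardcoding a string, keeps the reported Chrome version in line with the browser Puppeteer actually launches.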

Check our article on customizing Puppeteer's request headers for a detailed tutorial.

Conclusion

In this article, you've seen the four best methods of bypassing DataDome protection with Puppeteer. Setting up premium proxies, using the Puppeteer Stealth plugin, and customizing the request headers are manual methods that work best when combined.

However, these methods can be challenging to maintain at scale. We recommend integrating ZenRows, a web scraping API that features automatic header and proxy configuration and bypasses CAPTCHAs and anti-bots like DataDome, allowing you to scrape any website without limitations. Try ZenRows for free!
