How to Use Puppeteer Stealth: A Plugin for Scraping

October 20, 2023 · 6 min read

Puppeteer is a fantastic headless browser library, yet it can easily be detected and blocked by anti-scraping measures. This is where Puppeteer Extra, with the help of plugins like Stealth, plays a key role.

This tutorial introduces Puppeteer Stealth and how to scrape web pages with it. Let's dive in!

What Is Puppeteer Extra?

Puppeteer Extra is an open-source library built to extend the functionality of the popular Puppeteer headless browser.

Here's a list of some of the main plugins you can use with Puppeteer Extra and what they do:

  • Stealth plugin hides Puppeteer's automation properties by masking the subtle differences between headless and regular Chrome browsers.
  • AdBlocker plugin blocks ads and trackers.
  • User Data Dir plugin maintains consistent browser data and settings between sessions.
  • reCAPTCHA plugin solves reCAPTCHA and hCaptcha challenges automatically.
  • Block Resource plugin intercepts and blocks unwanted resources, including images, fonts, CSS, etc.
  • DevTools plugin creates a secure portal to Chrome DevTools APIs to allow debugging and custom profiling from anywhere.

We'll focus on how to avoid detection with Puppeteer.

What Is Puppeteer Stealth?

Puppeteer Stealth, also known as puppeteer-extra-plugin-stealth, is an extension built on top of Puppeteer Extra that uses different techniques to hide properties that would otherwise flag your request as a bot. That makes it harder for websites to detect your scraper.

Let's see it in action.

What Does Puppeteer Stealth Do?

While web scraping with a headless browser gives you browser-like access, websites also get code execution access. That means they can leverage various browser fingerprinting scripts to gather data that can identify your automated browser.

This is where Puppeteer Stealth comes in. Its goal is to mask default headless giveaways, such as the HeadlessChrome token in the user agent, navigator.webdriver: true, and telltale request headers, so your scraper stays below the radar.
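To make the idea of fingerprinting concrete, here's a minimal sketch of the kind of checks a detection script might run against the navigator object. The function name automationSignals and the specific checks are illustrative assumptions, not taken from any real anti-bot product, which would combine many more signals.

```javascript
// Hypothetical, simplified checks a site's fingerprinting script might run.
// `nav` is a navigator-like object; in a real page it would be `window.navigator`.
function automationSignals(nav) {
  const signals = [];
  if (nav.webdriver === true) {
    signals.push('navigator.webdriver is true');
  }
  if (/HeadlessChrome/.test(nav.userAgent || '')) {
    signals.push('user agent contains HeadlessChrome');
  }
  if (!nav.languages || nav.languages.length === 0) {
    signals.push('navigator.languages is empty');
  }
  if (!nav.plugins || nav.plugins.length === 0) {
    signals.push('navigator.plugins is empty');
  }
  return signals;
}

// Plain objects stand in for the browser's navigator here (illustrative values).
const headlessLike = {
  webdriver: true,
  userAgent: 'Mozilla/5.0 HeadlessChrome/118.0.0.0',
  languages: [],
  plugins: [],
};
const regularLike = {
  webdriver: false,
  userAgent: 'Mozilla/5.0 Chrome/118.0.0.0',
  languages: ['en-US', 'en'],
  plugins: [{ name: 'PDF Viewer' }],
};

console.log(automationSignals(headlessLike).length); // 4
console.log(automationSignals(regularLike).length);  // 0
```

Each evasion module described below targets one signal of this kind, which is why the plugin ships many small patches rather than a single switch.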

That's possible thanks to its built-in evasion modules.

Built-in Evasion Modules

Built-in evasion modules are pre-packaged plugins that drive the Puppeteer Stealth functionality. As stated earlier, base Puppeteer has leaks or properties that flag it as a bot, which the Stealth plugin aims to fix.

Each Puppeteer Stealth evasion module is designed to plug a particular leak. Take a look below:

  • iframe.contentWindow fixes the HEADCHR_iframe detection by modifying window.top and window.frameElement.
  • media.codecs modifies codecs to support what actual Chrome supports.
  • navigator.hardwareConcurrency sets the number of logical processors to four.
  • navigator.languages modifies the languages property to allow custom languages.
  • navigator.plugins emulates navigator.mimeTypes and navigator.plugins with functional mocks to match standard Chrome used by humans.
  • navigator.permissions masks the permissions property to pass the permissions test.
  • navigator.vendor makes it possible to customize the navigator.vendor property.
  • navigator.webdriver masks navigator.webdriver.
  • sourceurl hides the sourceURL attribute of the Puppeteer script.
  • user-agent-override modifies the user-agent components.
  • webgl.vendor changes the WebGL vendor and renderer values from the Google defaults that headless Puppeteer reports.
  • window.outerdimensions adds the missing window.outerWidth or window.outerHeight properties.
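Since each module plugs one leak, you may sometimes want to switch a single module off, for example when you set your own user agent. The sketch below assumes the plugin instance exposes an enabledEvasions Set of module names, as the plugin's README describes. The helper name disableEvasions is hypothetical.

```javascript
// Remove the named evasion modules from a stealth plugin instance.
// Assumption: `stealth.enabledEvasions` is a Set of module names.
function disableEvasions(stealth, names) {
  for (const name of names) {
    stealth.enabledEvasions.delete(name);
  }
  return stealth;
}

// Usage sketch (needs puppeteer-extra and puppeteer-extra-plugin-stealth installed):
//   const puppeteer = require('puppeteer-extra');
//   const StealthPlugin = require('puppeteer-extra-plugin-stealth');
//   puppeteer.use(disableEvasions(StealthPlugin(), ['user-agent-override']));
```

Calling puppeteer.use(StealthPlugin()) without changes keeps every module enabled, which is the default the rest of this tutorial relies on.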

How to Web Scrape with Puppeteer Stealth

Before we dive into Puppeteer in stealth mode, it's essential to explore web scraping with the base headless browser. As a target, we'll use NowSecure, a website that throws anti-bot challenges at every request and displays a "you passed" message if you're successful.

Let's begin!

  1. Make sure Node.js is installed, then install Puppeteer using the following command:
npm install puppeteer
  2. Import Puppeteer and open an async function where you'll write your code.
const puppeteer = require('puppeteer');

(async () => {
	//…
})();
  3. Launch a browser, create a new page, and navigate to your target URL.
(async () => {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();

	// Navigate to target URL
	await page.goto('https://nowsecure.nl/');
})();
  4. Set the screen size, wait for the page to load, take a screenshot, and close your browser.
(async () => {
	//…

	// Set screen size
	await page.setViewport({width: 1280, height: 720});

	// Wait for page to load
	await page.waitForTimeout(30000);

	// Take screenshot
	await page.screenshot({ path: 'nowsecure.png', fullPage: true });

	// Close the browser and all of its pages
	await browser.close();
})();

Putting all of it together, here's your complete code:

const puppeteer = require('puppeteer'); 
 
(async () => {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();

	//navigate to target URL
	await page.goto('https://nowsecure.nl/');
 
	// Set screen size
	await page.setViewport({width: 1280, height: 720});
 
	//wait for page to load
	await page.waitForTimeout(30000); 
 
	// Take screenshot 
	await page.screenshot({ path: 'nowsecure.png', fullPage: true });
 
	// Closes the browser and all of its pages 
	await browser.close(); 
})();
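A note on the fixed wait: page.waitForTimeout is not available in newer Puppeteer releases, so if the call throws in your setup, a small promise-based helper can stand in for it. This is a minimal sketch. The name delay is our own and not part of Puppeteer.

```javascript
// Resolve after `ms` milliseconds; a drop-in stand-in for a fixed wait.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Usage inside the async function, in place of page.waitForTimeout(30000):
//   await delay(30000);
```

A fixed wait is the simplest option for a challenge page whose duration you can't predict, though waiting for a specific element with page.waitForSelector is usually faster when you know what the final page contains.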

And this is the screenshot of the web page:

NowSecure Blocked

The result above shows that our Puppeteer script got blocked since we couldn't bypass anti-bot detection.

Now, let's try scraping the same website using Puppeteer Stealth.

Here are the steps you must take:

Step 1: Install Puppeteer-Stealth

As mentioned earlier, we need the Puppeteer Extra library to use Puppeteer Stealth. So, install both using the following command.

npm install puppeteer-extra puppeteer-extra-plugin-stealth

Step 2: Configure Puppeteer-Stealth

To configure Puppeteer Stealth, start by importing Puppeteer Extra.

const puppeteer = require('puppeteer-extra')

Then add the Stealth plugin and use it in default mode, which ensures your script uses all evasion modules.

// add stealth plugin and use defaults (all evasion techniques)
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())

Next, launch Puppeteer Stealth while specifying the headless option and open an async function where you'll write your code.

puppeteer.launch({ headless: 'new' }).then(async browser => {
    //...
});

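If Puppeteer Extra can't find the browser binary on its own, you can also pass the execution path explicitly. The executablePath helper comes from the puppeteer package.

const { executablePath } = require('puppeteer')

puppeteer.launch({ headless: 'new', executablePath: executablePath() }).then(async browser => {
    //...
});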
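The launch calls in this step use .then(). If you prefer async/await, one option is a small wrapper that always closes the browser, even when a scraping step throws. This is a sketch under assumptions: withBrowser is a hypothetical helper, and launcher stands for any object with a launch() method, such as the puppeteer-extra instance configured above.

```javascript
// Launch a browser, run `task` with it, and close the browser no matter what.
// `launcher` is any object exposing launch(options), e.g. the puppeteer-extra instance.
async function withBrowser(launcher, launchOptions, task) {
  const browser = await launcher.launch(launchOptions);
  try {
    return await task(browser);
  } finally {
    // Runs on success and on error, so no browser process is left behind.
    await browser.close();
  }
}

// Usage sketch:
//   withBrowser(puppeteer, { headless: 'new' }, async (browser) => {
//     const page = await browser.newPage();
//     await page.goto('https://nowsecure.nl/');
//   });
```

The try/finally block matters most in long-running scrapers, where a leaked browser process per failed page adds up quickly.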

Step 3: Take a Screenshot

Like in our base Puppeteer script, create a new page, set the screen size, and navigate to the target website.

// Create a new page 
const page = await browser.newPage(); 
 
// Setting page view 
await page.setViewport({ width: 1280, height: 720 }); 
 
// Go to the website 
await page.goto('https://nowsecure.nl/'); 

Lastly, wait for the page to load and take a screenshot.

// Wait for security check 
await page.waitForTimeout(10000); 
 
await page.screenshot({ path: 'nowsecure.png', fullPage: true });
 
await browser.close(); 

Here's what our result looks like:

NowSecure Success

Congratulations! You prevented bot detection using Puppeteer Extra Stealth.

Let's take it a step further and scrape the page.

Step 4: Scrape the Page

First, right-click on an element you want to scrape and select "Inspect". That will open the Chrome DevTools, and you'll find the selectors in the Elements tab.

NowSecure DevTools

Next, copy the selectors and use each to get its text. The querySelector method works perfectly for this purpose.

//… 
await page.waitForTimeout(10000); 

// Get title text 
const title = await page.evaluate(() => {
    return document.querySelector('body > div.nonhystericalbg > div > header > div > h3').textContent;
});

// Get message text
const msg = await page.evaluate(() => {
    return document.querySelector('body > div.nonhystericalbg > div > main > h1').textContent;
});

// Get state text
const state = await page.evaluate(() => {
    return document.querySelector('body > div.nonhystericalbg > div > main > p:nth-child(2)').textContent;
});

// print out the results 
console.log(title, '\n', msg, '\n', state); 

await browser.close(); 

The script reads each element's textContent property to get its text and stores the result in a variable. You can repeat the same process for any other element you want to extract.

The complete code looks like this:

const puppeteer = require('puppeteer-extra'); 
 
// Add stealth plugin and use defaults 
const StealthPlugin = require('puppeteer-extra-plugin-stealth'); 
 
// Use stealth 
puppeteer.use(StealthPlugin()); 

// Launch puppeteer-stealth
puppeteer.launch({ headless: 'new' }).then(async browser => { 
    // Create a new page 
    const page = await browser.newPage(); 
 
    // Setting page view 
    await page.setViewport({ width: 1280, height: 720 }); 
 
    // Go to the website 
    await page.goto('https://nowsecure.nl/'); 
 
    // Wait for security check 
    await page.waitForTimeout(10000); 

    await page.screenshot({ path: 'nowsecure.png', fullPage: true });

    // Get title text
    const title = await page.evaluate(() => {
        return document.querySelector('body > div.nonhystericalbg > div > header > div > h3').textContent;
    });

    // Get message text
    const msg = await page.evaluate(() => {
        return document.querySelector('body > div.nonhystericalbg > div > main > h1').textContent;
    });

    // Get state text
    const state = await page.evaluate(() => {
        return document.querySelector('body > div.nonhystericalbg > div > main > p:nth-child(2)').textContent;
    });
 
    // print out the results 
    console.log(title, '\n', msg, '\n', state); 
 
    await browser.close(); 
});

After running the script, the output should look like this:

Output

Awesome! You solved your main problem and successfully avoided Puppeteer bot detection.
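One caveat about the selector lookups in Step 4: document.querySelector returns null when nothing matches, so reading .textContent directly throws if the page layout changes or the challenge page is still showing. A minimal sketch of a safer lookup follows. The helper name getText is hypothetical, and it only relies on page.evaluate accepting extra arguments that are passed to the in-page function.

```javascript
// Return the trimmed text of the first element matching `selector`,
// or null if no element matches. `page` is a Puppeteer Page.
async function getText(page, selector) {
  return page.evaluate((sel) => {
    // This function runs inside the page, where `document` is available.
    const el = document.querySelector(sel);
    return el ? el.textContent.trim() : null;
  }, selector);
}

// Usage sketch inside the launch callback:
//   const msg = await getText(page, 'body > div.nonhystericalbg > div > main > h1');
//   if (msg === null) console.log('Message element not found');
```

Returning null instead of throwing lets the script decide whether a missing element means "retry", "blocked", or "layout changed".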

Limitations of puppeteer-extra-plugin-stealth and a Solution

While Puppeteer Stealth does a lot to avoid detection, it has its limitations:

  • It can't avoid advanced anti-bots. For example, your script will easily get detected and blocked if you use Puppeteer Stealth to try to bypass Cloudflare or DataDome.
  • It can get extremely slow and, therefore, difficult to scale.
  • As with other headless browsers, it's difficult to debug.
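Blocks like the ones in the first bullet don't raise an exception in your script. They often show up as an error status on the navigation response instead. The sketch below checks the response that page.goto() resolves to. The helper name isLikelyBlocked and the chosen status codes are assumptions for illustration, and a challenge page served with status 200 would slip past this check.

```javascript
// Status codes treated as "probably blocked". This set is an assumption,
// not an exhaustive or authoritative rule.
const BLOCK_STATUSES = new Set([403, 429, 503]);

// `response` is what page.goto() resolves to: an HTTPResponse or null.
function isLikelyBlocked(response) {
  if (!response) {
    return false; // nothing to inspect, so make no claim
  }
  return BLOCK_STATUSES.has(response.status());
}

// Usage sketch:
//   const response = await page.goto('https://www.g2.com/products/asana/reviews');
//   if (isLikelyBlocked(response)) console.log('Blocked with status', response.status());
```

Logging the status next to the screenshot makes it easier to tell a real block from a slow page load.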

Let's see an example of Puppeteer Stealth against Cloudflare.

We'll try scraping the Asana page on g2.com (https://www.g2.com/products/asana/reviews), a Cloudflare-protected website, using the same code as before.

//...
puppeteer.launch({ headless: 'new' }).then(async browser => { 
    const page = await browser.newPage();  
 
    // Go to the website 
    await page.goto('https://www.g2.com/products/asana/reviews'); 
    await page.screenshot({ path: 'g2.png', fullPage: true }); 
    await browser.close(); 
});

Here's our result:

Result Page

We got blocked straight off! Fortunately, there's a quick solution. With the ZenRows API, you'll bypass even the most complicated anti-bots.

Let's see it in action.

First, sign up to get your free ZenRows API key. Paste the URL to scrape, then enable JavaScript Rendering, Antibot, and Premium Proxies. Click Node.js on the right, and ZenRows will automatically generate a Node.js script for you.

ZenRows Request Builder Page

Then, install Axios by entering the following command in your terminal:

npm install axios

Next, paste the generated code:

const axios = require('axios');

const url = 'https://www.g2.com/products/asana/reviews';
const apikey = '<YOUR_ZENROWS_API_KEY>';
axios({
	url: 'https://api.zenrows.com/v1/',
	method: 'GET',
	params: {
		'url': url,
		'apikey': apikey,
		'js_render': 'true',
		'antibot': 'true',
		'premium_proxy': 'true',
	},
})
    .then(response => console.log(response.data))
    .catch(error => console.log(error));

Here's our result:

How does it feel knowing you can scrape just about any website? Awesome, right?

Conclusion

Puppeteer is a popular web scraping and automation tool. But its default properties make it easy for websites to detect and block your bot. Fortunately, Puppeteer Stealth lets you leverage its evasion modules to stay below the radar.

Yet Puppeteer Stealth can't keep up with frequently evolving anti-bot measures, so it doesn't work against advanced obstacles. For these cases, consider solutions like ZenRows and use its free trial for your next project.
