How to Avoid Detection with Puppeteer

December 21, 2022 · 6 min read

Modern websites use anti-bot systems that are capable of detecting scrapers. The best way to ensure a seamless scraping process is to mask your automation properly, for example with a well-configured headless browser.

Puppeteer controls headless Chrome and can simulate real user behavior to avoid anti-bots while web scraping. So how do you go about it? In this article, we'll discuss the best ways to avoid detection in Puppeteer while scraping.

But before that…

What is Puppeteer?

Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium programmatically. You can easily install Puppeteer using npm or Yarn, and one of its key benefits is its access to the Chrome DevTools Protocol.

Can Puppeteer be Detected by Anti-bots?

Yes, anti-bots are capable of detecting headless browsers like Puppeteer. Let's prove that with a quick scraping example against NowSecure, a test page that tells you whether your browser passed its bot-protection checks.

To follow along, install Node.js first, then install Puppeteer with this command:

npm install puppeteer

Next, create a JavaScript file named index.js with the code below, and run it with node index.js:

const puppeteer = require('puppeteer'); 
 
(async () => { 
	// Initiate the browser 
	const browser = await puppeteer.launch(); 
 
	// Create a new page with the default browser context 
	const page = await browser.newPage(); 
 
	// Setting page view 
	await page.setViewport({ width: 1280, height: 720 }); 
 
	// Go to the target website 
	await page.goto('https://nowsecure.nl/'); 
 
	// Wait for security check 
	await page.waitForTimeout(30000); 
 
	// Take screenshot 
	await page.screenshot({ path: 'image.png', fullPage: true }); 
 
	// Closes the browser and all of its pages 
	await browser.close(); 
})();

So here's what we did in that example: we used the basic Puppeteer setup to create a new browser page, visit the target website, wait for the security check, and take a screenshot.

Here's the screenshot of the web page when we use Puppeteer only:

NowSecure Blocked

As you can see from the result, we didn't pass the check and we were unable to prevent Puppeteer detection on the web page.

6 Tricks to Avoid Detection with Puppeteer

One of the best ways to ensure a smooth crawling process is by making sure to avoid Puppeteer bot detection. Here's how to prevent Puppeteer detection and avoid getting blocked while scraping:

1. Use Proxies

One of the most widely adopted anti-bot strategies is IP tracking, where the detection system monitors the requests a website receives. When a single IP makes many requests in a short period of time, the anti-bot can flag the Puppeteer scraper.

To avoid detection in Puppeteer, you can use proxies, which act as a gateway between you and the internet. Requests are routed through the proxy and responses come back through it, so the target server sees the proxy's IP address instead of yours.

To do this, we can add a proxy to the args parameter when launching Puppeteer, like this:

const puppeteer = require('puppeteer'); 
const proxy = ''; // Add your proxy here 
 
(async () => { 
	// Initiate the browser with a proxy 
	const browser = await puppeteer.launch({ args: [`--proxy-server=${proxy}`] });
 
	// ... continue as before 
})();

That's it! Routing traffic through a proxy makes your Puppeteer scraper much harder to track by IP while scraping web pages.
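If you have access to a pool of proxies, a simple way to rotate them is to pick one at random for each browser launch. Here's a minimal sketch; the proxy URLs are placeholders you'd replace with real ones:

```javascript
// Hypothetical proxy pool: replace these placeholder URLs with real proxies
const proxies = [
  'http://111.111.111.111:8080',
  'http://222.222.222.222:8080',
  'http://333.333.333.333:8080',
];

// Pick a random proxy from the pool for each browser launch
function pickProxy(pool) {
  return pool[Math.floor(Math.random() * pool.length)];
}

const proxy = pickProxy(proxies);
console.log(`--proxy-server=${proxy}`);
// Usage sketch: puppeteer.launch({ args: [`--proxy-server=${proxy}`] });
```

Each launch then goes out through a different IP, spreading your requests across the pool.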

Frustrated that your web scrapers are blocked once and again? ZenRows API handles rotating proxies and headless browsers for you.
Try for FREE

2. Headers

Headers contain context and metadata about an HTTP request, and they help a server tell whether the client is a regular web browser or a bot. Including the right headers in your requests helps prevent detection.

Since Puppeteer runs Headless Chrome by default, you can take it a step further by setting custom headers such as User-Agent, a popular header in web scraping that identifies the application, operating system, vendor and version of the request.

const puppeteer = require('puppeteer'); 
 
(async () => { 
	const browser = await puppeteer.launch(); 
	const page = await browser.newPage(); 
 
	// Add Headers 
	await page.setExtraHTTPHeaders({ 
		'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36', 
		'upgrade-insecure-requests': '1', 
		'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8', 
		'accept-encoding': 'gzip, deflate, br', 
		'accept-language': 'en-US,en;q=0.9' 
	}); 
 
	// ... continue as before 
})();

There are several ways to add headers in Puppeteer, but the easiest is to set them right after opening a new page, as shown above.
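If you scrape at scale, reusing one hard-coded User-Agent is itself a signal, so rotating through a small pool helps. Here's a sketch using a hypothetical list of UA strings and a round-robin generator:

```javascript
// Hypothetical pool of common desktop User-Agent strings
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
];

// Round-robin generator: each call to next() yields the next User-Agent
function* cycle(list) {
  let i = 0;
  while (true) yield list[i++ % list.length];
}

const nextUserAgent = cycle(userAgents);
console.log(nextUserAgent.next().value);
// Usage sketch for each new page:
// await page.setUserAgent(nextUserAgent.next().value);
```

Each new page then presents a different, plausible User-Agent instead of the same one on every request.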

3. Limit Requests

As previously discussed, an anti-bot can track a user's activity through the number of requests sent. And since normal users don't send hundreds of requests per second, limiting the number of requests and taking breaks between requests helps avoid Puppeteer detection.

To do this, you can use page.setRequestInterception() to limit the resources Puppeteer loads.

const puppeteer = require('puppeteer'); 
 
(async () => { 
	const browser = await puppeteer.launch(); 
	const page = await browser.newPage(); 
 
	// Limit requests 
	await page.setRequestInterception(true); 
	page.on('request', async (request) => { 
		if (request.resourceType() === 'image') { 
			await request.abort(); 
		} else { 
			await request.continue(); 
		} 
	}); 
 
	// ... continue as before 
})();

By calling page.setRequestInterception(true) and aborting image requests, we cut down the number of requests Puppeteer makes. As a bonus, the scraper gets faster since there are fewer resources to load and wait for.
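Aborting heavy resources reduces what each page load requests, but you can also pace your own navigations. Here's a minimal sketch of a limiter that enforces a minimum interval between requests; the 2-second interval is an arbitrary assumption you'd tune to the target site:

```javascript
// Returns a function that computes how long (in ms) to wait before the
// next request, so that requests are at least `minIntervalMs` apart
function createLimiter(minIntervalMs) {
  let nextAllowed = 0;
  return function waitTime(now = Date.now()) {
    const wait = Math.max(0, nextAllowed - now);
    nextAllowed = now + wait + minIntervalMs;
    return wait;
  };
}

const limiter = createLimiter(2000); // at most one request every 2 seconds
// Usage sketch before each page.goto():
// const delay = limiter();
// if (delay) await new Promise(r => setTimeout(r, delay));
console.log(limiter(0)); // first request goes through immediately
```

Keeping the limiter's state in a closure means one instance can pace every navigation your scraper makes.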

4. Mimic User Behavior

Machine learning systems study user behavior and compare bots' behavior against real users'. Mimicking human behavior can help bypass bot detectors. One way to achieve this is to wait a random amount of time before clicking or navigating to a new web page in Puppeteer. For example, the snippet below passes a random interval between 5 and 12 seconds to page.waitForTimeout:

const puppeteer = require('puppeteer'); 
 
(async () => { 
	const browser = await puppeteer.launch(); 
	const page = await browser.newPage(); 
 
	// Wait for a random time before navigating to a new web page 
	await page.waitForTimeout((Math.floor(Math.random() * 8) + 5) * 1000); 
 
	// ... continue as before 
})();
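Factoring the range arithmetic into a small helper makes the bounds explicit and easy to test. A sketch, using the same 5-to-12-second range as the example above:

```javascript
// Random whole number of milliseconds between minSec and maxSec seconds (inclusive)
function randomDelayMs(minSec, maxSec) {
  const range = maxSec - minSec + 1;
  return (Math.floor(Math.random() * range) + minSec) * 1000;
}

console.log(randomDelayMs(5, 12));
// Usage sketch: await page.waitForTimeout(randomDelayMs(5, 12));
```

You can then reuse the helper with different bounds for clicks, scrolls and navigations, so each action gets its own human-looking pause.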

5. Puppeteer-stealth

The stealth plugin (puppeteer-extra-plugin-stealth) is helpful for Puppeteer scrapers because it plugs into puppeteer-extra, which exposes the same API as Puppeteer. By removing the minute differences in browser fingerprints between Headless Chrome and a real Chrome browser, it hides the browser's headless status.

Let's see how to prevent Puppeteer detection using puppeteer-stealth!

Step 1: install puppeteer-stealth

Install puppeteer-stealth using this command:

npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth

Step 2: configure puppeteer-stealth

To configure puppeteer-stealth, create a new JavaScript file called index.js and add the following code:

const puppeteer = require('puppeteer-extra') 
 
// Add stealth plugin and use defaults 
const pluginStealth = require('puppeteer-extra-plugin-stealth') 
const {executablePath} = require('puppeteer'); 
 
// Use stealth 
puppeteer.use(pluginStealth()) 
 
// Launch puppeteer-stealth 
puppeteer.launch({ headless: false, executablePath: executablePath() }).then(async browser => { 
	// ... (the scraping logic goes here, as shown in the next step) 
});

Here's what happened in the code above: we imported puppeteer-extra, a lightweight wrapper around Puppeteer, together with puppeteer-extra-plugin-stealth, and enabled the plugin with puppeteer.use(pluginStealth()).

We also imported executablePath from the puppeteer package: it returns the path to the Chromium binary bundled with Puppeteer, which tells puppeteer-extra which browser to launch. Finally, we launched the browser with the option headless: false to see what happens in the browser window.

Step 3: take a screenshot

puppeteer.launch({ headless:false, executablePath: executablePath() }).then(async browser => { 
	// Create a new page 
	const page = await browser.newPage(); 
 
	// Setting page view 
	await page.setViewport({ width: 1280, height: 720 }); 
 
	// Go to the website 
	await page.goto('https://nowsecure.nl/'); 
 
	// Wait for security check 
	await page.waitForTimeout(10000); 
 
	await page.screenshot({ path: 'image.png', fullPage: true }); 
 
	await browser.close(); 
});

After launching the browser, we create a new page with await browser.newPage() and set the viewport. Puppeteer then visits the target website, waits 10 seconds for the security check, takes a screenshot and closes the browser.

And here's what the result should look like:

NowSecure success

Congratulations! You have prevented Puppeteer bot detection. Let's take it a step further and scrape the page.

Step 4: scrape the page πŸ˜€

To find the selector of each element you want to scrape, right-click the element and select Inspect. This opens Chrome DevTools, where you can copy selectors from the Elements tab.

NowSecure on DevTools

Copy the selectors and use each one to get its element's text. The document.querySelector method works perfectly for this purpose:

	await page.goto('https://nowsecure.nl/'); 
	await page.waitForTimeout(10000); 
 
	// Get title text 
	const title = await page.evaluate(() => { 
		return document.querySelector('body > div.nonhystericalbg > div > header > div > h3').textContent; 
	}); 
 
	// Get message text 
	const msg = await page.evaluate(() => { 
		return document.querySelector('body > div.nonhystericalbg > div > main > h1').textContent; 
	}); 
 
	// Get state text 
	const state = await page.evaluate(() => { 
		return document.querySelector('body > div.nonhystericalbg > div > main > p:nth-child(2)').textContent; 
	}); 
 
	// print out the results 
	console.log(title, '\n', msg, '\n', state); 
 
	await browser.close(); 
});

The script uses the .textContent property to get the text of each element; the same process can be repeated for each element and the result saved to a variable. After running the code, this is what the output should look like:

NowSecure Text

Boom! You have solved your main problem and successfully prevented Puppeteer bot detection. Here's the complete code:

const puppeteer = require('puppeteer-extra'); 
 
// Add stealth plugin and use defaults 
const pluginStealth = require('puppeteer-extra-plugin-stealth'); 
const {executablePath} = require('puppeteer'); 
 
// Use stealth 
puppeteer.use(pluginStealth()); 
 
// Launch puppeteer-stealth 
puppeteer.launch({ headless:false, executablePath: executablePath() }).then(async browser => { 
	// Create a new page 
	const page = await browser.newPage(); 
 
	// Setting page view 
	await page.setViewport({ width: 1280, height: 720 }); 
 
	// Go to the website 
	await page.goto('https://nowsecure.nl/'); 
 
	// Wait for security check 
	await page.waitForTimeout(10000); 
 
	// Get title text 
	const title = await page.evaluate(() => { 
		return document.querySelector('body > div.nonhystericalbg > div > header > div > h3').textContent; 
	}); 
 
	// Get message text 
	const msg = await page.evaluate(() => { 
		return document.querySelector('body > div.nonhystericalbg > div > main > h1').textContent; 
	}); 
 
	// Get state text 
	const state = await page.evaluate(() => { 
		return document.querySelector('body > div.nonhystericalbg > div > main > p:nth-child(2)').textContent; 
	}); 
 
	// print out the results 
	console.log(title, '\n', msg, '\n', state); 
 
	await browser.close(); 
});

Limitations of Puppeteer-stealth

Puppeteer-stealth is a good solution to avoid bot detection with Puppeteer, but it has limitations:
  • It can't bypass advanced anti-bots.
  • Running many browser instances is resource-intensive, so it's hard to scale to large amounts of data.
  • Headless browsers like Puppeteer are difficult to debug.

Although puppeteer-stealth is a good way to avoid Puppeteer detection, it fails against websites with advanced anti-bots. For example, we tried to scrape Okta using puppeteer-stealth:

// same as before 
puppeteer.launch({ headless:true, executablePath: executablePath() }).then(async browser => { 
	const page = await browser.newPage(); 
	await page.setViewport({ width: 1920, height: 1080 }); 
 
	// Go to the website 
	await page.goto('https://okta.com/'); 
 
	// Wait for security check 
	await page.waitForTimeout(10000); 
	await page.screenshot({ path: 'image.png', fullPage: true }); 
	await browser.close(); 
});

Here's what we got:

Okta blocked

We got blocked straight away! Fortunately, there's a solution that can bypass advanced anti-bots: ZenRows. This is our next tip to avoid being detected.

6. ZenRows

ZenRows is an all-in-one web scraping tool that handles all anti-bot bypassing with a single API call, including rotating proxies, headless browsers and CAPTCHAs. Using ZenRows to scrape Okta, here's what we got:

ZenRows success on Okta

Conclusion

There are different methods you can use to avoid detection with Puppeteer, and in this article we covered the best and easiest ones. You can use proxies, custom headers, request limits or puppeteer-stealth to get the job done, but each has its limitations.

The common problem with these methods is that they fail when it comes to bypassing advanced anti-bots. ZenRows handles all anti-bot bypass for you, from rotating proxies and headless browsers to CAPTCHAs with a single API call. And you can get started for free.
