
How to Bypass Cloudflare with Puppeteer

January 9, 2023 · 5 min read

The headaches in web scraping start when anti-bot systems, like Cloudflare, detect and block your scrapers. One of the best ways to avoid this is to scrape with a headless browser, for example by using Puppeteer. With Puppeteer, a few tricks can help you evade Cloudflare's bot detection so your web crawlers run smoothly.

In this article, we'll discuss how to bypass Cloudflare with Puppeteer. Let's get started!

What Is Puppeteer?

Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. Through this API, your script remotely controls a Chromium instance and uses it to render a webpage and execute its JavaScript, just as a regular browser would. Puppeteer runs headless by default, but you can also configure it to run in non-headless mode, with a full Chrome or Chromium window.

When a Puppeteer web scraper accesses a website, it first creates a browser instance, which renders the website content before the script navigates to the data it wants to scrape. Cloudflare's bot detection can identify this activity as automated, especially when the browser runs in headless mode. Running Puppeteer in full Chrome mode (headless: false) could grant you access to some Cloudflare-protected websites.

However, Cloudflare is a sophisticated solution with a frequently updated Web Application Firewall (WAF). Now, headless or not, you might still get blocked. The key to bypassing Cloudflare with Puppeteer is understanding how Cloudflare detects bots.

What Is Cloudflare?

Cloudflare's bot protection system enables websites to identify unwanted traffic. However, certain useful bots, such as Googlebot and other search engine crawlers, are allowed through so they can crawl and rank pages. This is possible because Cloudflare maintains a safelist for these bots, and, unfortunately, Puppeteer isn't on this list. So your headless browser scraping endeavors might hit a roadblock at Cloudflare-protected websites.

How Cloudflare Detects Bots

Cloudflare uses various techniques to guard against malicious threats and data intrusion. These include botnet detection, IP address reputation, TLS fingerprinting, CAPTCHAs, canvas fingerprinting, HTTP request header checks, and event tracking, among others. To learn more, check out our guide on how to bypass Cloudflare.
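
As one concrete illustration of a request header check: headless Chrome's default User-Agent string contains the token HeadlessChrome, which a server can look for. The snippet below is only a sketch of that idea. The function name isHeadlessUserAgent is hypothetical, and Cloudflare's real checks are not public and combine many more signals.

```javascript
// Hypothetical server-side check; Cloudflare's actual detection logic is not public.
// Headless Chrome's default User-Agent contains the token "HeadlessChrome".
function isHeadlessUserAgent(userAgent) {
  return /HeadlessChrome/i.test(String(userAgent || ''));
}

const headlessUA =
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/108.0.0.0 Safari/537.36';
const regularUA =
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36';

console.log(isHeadlessUserAgent(headlessUA)); // true
console.log(isHeadlessUserAgent(regularUA)); // false
```

A single header like this is easy to spoof, which is why it is only one signal among the techniques listed above.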

Does Cloudflare Detect Puppeteer?

Yes, Cloudflare Bot Management is capable of detecting Puppeteer. When a Puppeteer web scraper visits a Cloudflare-protected website, it's subjected to security checks using the methods above. These checks run on an interstitial page, often referred to as Cloudflare's waiting room. If the web scraper successfully clears these challenges, it's granted access. Otherwise, it's blocked.
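
To tell programmatically whether a scraper landed on that interstitial page instead of the real content, a heuristic along these lines can help. This is a sketch under assumptions: the markers it checks (a cf-mitigated: challenge header, a 403 or 503 status from a cloudflare server, and a "Just a moment" page title) are commonly observed on challenge responses but may change, and looksLikeCloudflareChallenge is a hypothetical helper name.

```javascript
// Heuristic sketch: the markers below are assumptions based on commonly observed
// Cloudflare challenge responses, and they may change without notice.
function looksLikeCloudflareChallenge({ status, headers = {}, body = '' }) {
  // Normalize header names and values so lookups are case-insensitive.
  const h = {};
  for (const [name, value] of Object.entries(headers)) {
    h[name.toLowerCase()] = String(value).toLowerCase();
  }
  const servedByCloudflare = h['server'] === 'cloudflare';
  const challengeHeader = h['cf-mitigated'] === 'challenge';
  const challengeStatus = status === 403 || status === 503;
  const challengeTitle = /<title>\s*Just a moment/i.test(body);
  // Either the explicit header, or the combination of weaker signals.
  return challengeHeader || (servedByCloudflare && challengeStatus && challengeTitle);
}

const blocked = {
  status: 403,
  headers: { Server: 'cloudflare' },
  body: '<html><head><title>Just a moment...</title></head></html>',
};
console.log(looksLikeCloudflareChallenge(blocked)); // true
```

With Puppeteer, the inputs could come from the response returned by page.goto(), using its status(), headers(), and text() methods.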

Frustrated that your web scrapers are blocked once and again?
ZenRows API handles rotating proxies and headless browsers for you.
Try for FREE

Can Puppeteer Bypass Cloudflare?

Although Puppeteer can bypass some of Cloudflare's bot detection techniques thanks to its similarity to Chrome, it won't get past the more detailed checks. By default, the navigator.webdriver property is set to true in headless Chrome, which allows Cloudflare to detect it as an automated browser. While some minor configurations can mitigate this, Puppeteer still leaves subtle traces in its browser fingerprint that make it detectable as non-human.
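
The navigator.webdriver signal is simple to read from JavaScript running in the page. The following is a minimal sketch of such a check, not Cloudflare's code: isAutomatedBrowser is a hypothetical name, and a plain object stands in for the browser's navigator so the example also runs in Node.js.

```javascript
// Sketch of a client-side check a detection script might run.
// `nav` stands in for the browser's `navigator` object so the example runs in Node.js.
function isAutomatedBrowser(nav) {
  // navigator.webdriver is true when the browser is controlled by automation.
  return nav.webdriver === true;
}

console.log(isAutomatedBrowser({ webdriver: true })); // true
console.log(isAutomatedBrowser({ webdriver: false })); // false
```

In a Puppeteer script, you could inspect the real value with await page.evaluate(() => navigator.webdriver).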

To find out if Puppeteer bypasses Cloudflare, let's try scraping CoinTracker, a Cloudflare-protected website. You can confirm this by visiting the website in an actual browser and inspecting the network tab.

CoinTracker

We attempted to use the following script to access our target website and take a screenshot:

const puppeteer = require('puppeteer'); 
 
(async () => { 
	//instantiate browser 
	const browser = await puppeteer.launch({ headless: true }); 
 
	//launch new page 
	const page = await browser.newPage(); 
 
	//visit target website 
	await page.goto( 
		'https://www.cointracker.io/', { 
			//wait for website to load 
			waitUntil: 'load', 
		}) 
 
	//take page screenshot 
	await page.screenshot({'path': 'cointracker.png'}) 
	await browser.close() 
})();

But here's what we got:

CoinTracker Blocked

CoinTracker's Cloudflare anti-bot protection detected our script as a bot and locked us in Cloudflare's waiting room. Puppeteer, by itself, can't bypass Cloudflare.

So how can we bypass Cloudflare with Puppeteer and free our scraper from the waiting room? Well, we are just getting there.

How to Bypass Cloudflare with Puppeteer

Ideally, you can bypass Cloudflare with Puppeteer by masking the properties that reveal it as an automated browser, so that it appears to be an actual browser. One of the most popular ways to do this is with Puppeteer-extra-plugin-stealth.

Puppeteer-extra-plugin-stealth exposes the same API as base Puppeteer, so there's no learning curve for developers already using Puppeteer. This plugin reduces the subtle browser fingerprint traces that differentiate Puppeteer from an actual browser. For example, the stealth plugin overrides the navigator.webdriver property, which is true in an automated browser, to mask automation and appear human.
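
To give a feel for what overriding such a property means, here is a simplified illustration. It is not the stealth plugin's source code: maskWebdriver is a hypothetical function, and a plain object stands in for the browser's navigator so the example runs in Node.js.

```javascript
// Simplified illustration only: this is not the stealth plugin's source code.
// `nav` stands in for the browser's `navigator` object so the example runs in Node.js.
function maskWebdriver(nav) {
  // Shadow the inherited property with an own getter that reports "not automated".
  Object.defineProperty(nav, 'webdriver', {
    get: () => false,
    configurable: true,
    enumerable: true,
  });
  return nav;
}

// Mimic an automated browser: `webdriver` lives on the prototype and is true.
const fakeNavigator = Object.create({ webdriver: true });
console.log(fakeNavigator.webdriver); // true
maskWebdriver(fakeNavigator);
console.log(fakeNavigator.webdriver); // false
```

In a real page, an override like this has to run before the site's own scripts do, which in Puppeteer is the job of page.evaluateOnNewDocument(). The plugin applies many such evasions at once, so using it is usually preferable to hand-rolling them.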

Let's see how to bypass Cloudflare with Puppeteer using the Puppeteer-extra-plugin-stealth plugin.

Prerequisites

For this tutorial, we'll be using Node.js, so you'll need Node.js (installed directly or through nvm) and npm. Some systems have them pre-installed. After that, initialize the project by running npm init -y, which creates a package.json file to track your dependencies. Then install Puppeteer.

npm i puppeteer

Puppeteer Cloudflare Bypass with Puppeteer-extra-plugin-stealth

To bypass Cloudflare bot detection using Puppeteer-extra-plugin-stealth, start by installing puppeteer-extra and the stealth plugin.

npm install puppeteer-extra puppeteer-extra-plugin-stealth

Using the script below, we can bypass CoinTracker's Cloudflare anti-bot detection and take a screenshot of our target website homepage.

// puppeteer-extra is a drop-in replacement for puppeteer, 
// it augments the installed puppeteer with plugin functionality 
const puppeteer = require('puppeteer-extra') 
 
// add stealth plugin and use defaults (all evasion techniques) 
const StealthPlugin = require('puppeteer-extra-plugin-stealth') 
puppeteer.use(StealthPlugin()) 
  
// puppeteer usage as normal 
puppeteer.launch({ headless: true}).then(async browser => { 
	const page = await browser.newPage() 
	await page.goto('https://www.cointracker.io/') 
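	// pause 2 s so any challenge can resolve (page.waitForTimeout is not available in recent Puppeteer versions)
	await new Promise(resolve => setTimeout(resolve, 2000))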
	await page.screenshot({ path: 'cointracker_home.png', fullPage: true }) 
	await browser.close() 
});

Here's what our result looks like:

CoinTracker Homepage

Congrats! The plugin worked, and you have successfully avoided Cloudflare's detection of Puppeteer. You can celebrate if this is your case, but if it isn't, your target is protected by more advanced Cloudflare security.

Puppeteer-extra-plugin-stealth Limitations

Some websites use more advanced Cloudflare security than the one we just scraped, and masking Puppeteer's automation properties with the stealth plugin is not enough to get through. For example, we tried accessing G2.com using the code below:

// ... same as above 
puppeteer.launch({ headless: true }).then(async browser => { 
	const page = await browser.newPage() 
	await page.goto('https://www.g2.com/products/asana/reviews') 
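	// pause 10 s so any challenge can resolve (page.waitForTimeout is not available in recent Puppeteer versions)
	await new Promise(resolve => setTimeout(resolve, 10000))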
	await page.screenshot({ path: 'g2_block.png', fullPage: true }) 
	await browser.close() 
})

And here's what we got:

G2 Blocked

Familiar, right? The result shows that a more complex Cloudflare anti-bot system still detected the stealth-enabled browser as a bot. So how can you bypass this and extract data from complicated websites? With ZenRows.

Puppeteer Cloudflare Bypass with ZenRows

ZenRows makes it easy to extract data from any website, regardless of its anti-bot detection complexity. It's a unique tool that scales up web scraping by doing a single API call.

To use ZenRows to get past Cloudflare's detection, create a free account and install Axios.

npm install axios

In the Request Builder, enter the URL you want to scrape, select Node.js and then API. Since we're dealing with complex anti-bot detection, we need to activate the premium proxies and anti-bot features:

ZenRows Request Builder Page

ZenRows will automatically generate a web scraping script. Let's use the code block from the project dashboard to retrieve data from G2's product page. To do this, copy the script to a local directory and replace the placeholder with your API key.

const axios = require('axios');
const fs = require('fs').promises; 

const url = 'https://www.g2.com/products/asana/reviews';
const apikey = '<YOUR_ZENROWS_API_KEY>';
axios({
	url: 'https://api.zenrows.com/v1/',
	method: 'GET',
	params: {
        'url': url,
        'apikey': apikey,
        'js_render': 'true',
        'antibot': 'true',
        'premium_proxy': 'true',
	},
})
    .then(response => fs.writeFile('data.html', response.data))
    .catch(error => console.log(error));

Here's what the output looks like:

ZenRows Result

Bingo! We've successfully bypassed the complex Cloudflare detection system with ZenRows.
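
To make the request above less opaque, the sketch below builds the same URL that the axios call ends up sending. The endpoint and the parameter names (url, apikey, js_render, antibot, premium_proxy) are taken from the generated script, while buildZenRowsUrl is a hypothetical helper introduced here for illustration.

```javascript
// Sketch: builds the request URL that the axios call in the generated script sends.
// The endpoint and parameter names come from that script;
// buildZenRowsUrl itself is a hypothetical helper.
function buildZenRowsUrl(targetUrl, apiKey, options = {}) {
  const params = new URLSearchParams({
    url: targetUrl,
    apikey: apiKey,
    js_render: 'true',
    antibot: 'true',
    premium_proxy: 'true',
    ...options, // allow callers to override the defaults
  });
  return `https://api.zenrows.com/v1/?${params.toString()}`;
}

const requestUrl = buildZenRowsUrl(
  'https://www.g2.com/products/asana/reviews',
  '<YOUR_ZENROWS_API_KEY>'
);
console.log(requestUrl);
```

URLSearchParams percent-encodes the target URL, so it travels safely as a single query parameter. Keeping the URL construction in one function also makes it easier to reuse the same settings across several target pages.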

Conclusion

Web scraping with Puppeteer is fun when done right, although the presence of anti-bots, like Cloudflare, makes the process a bit stressful since they're capable of blocking our scraper. One way to bypass Cloudflare with Puppeteer is using Puppeteer-extra-plugin-stealth, which masks the browser's properties to appear human.

However, this method fails when it comes to avoiding advanced Cloudflare security. To solve this problem, we've used ZenRows. It's a web scraping API that handles all anti-bot bypass for you, from rotating proxies and headless browsers to CAPTCHAs. Get started for free and watch your scraping process become smooth!


