Does your Node.js scraper keep getting blocked by Cloudflare? You're in the right place to find the way out!
This article explains the best methods and tools to bypass Cloudflare while scraping with Node.js. You'll learn how each one works, with ready-to-use code examples so you can start scraping right away.
Let's go!
How to Bypass Cloudflare in Node.js?
Bypassing Cloudflare in Node.js is possible. However, it's challenging with open-source tools because most of them can't handle Cloudflare's advanced detection methods. Many of these tools expose bot-like attributes, such as automated WebDrivers, which makes them prone to being blocked with a Cloudflare 403 error.
What's more, Cloudflare's security levels range from mild to complex, depending on how a website implements it. So, solutions and customizations that work for some sites may not work for others.
This article tests how well popular Node.js methods and tools bypass Cloudflare, using two targets: Cloudflare Challenge, a Cloudflare-protected webpage that's difficult to bypass, and Hapag-Lloyd, a Cloudflare-protected website that's easier to bypass.
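Before comparing the tools, it helps to recognize a Cloudflare block programmatically. The sketch below is a rough heuristic, not an official check. It assumes the response has the `{ status, headers }` shape that axios returns, with lower-cased header names, and the helper name `looksLikeCloudflareBlock` is hypothetical. It relies on the `server: cloudflare` header and on the `cf-mitigated: challenge` header that Cloudflare sets on challenge responses.

```javascript
// Hypothetical helper: guess whether a response was blocked or challenged by Cloudflare.
// Assumes `response` looks like { status, headers } with lower-cased header names.
function looksLikeCloudflareBlock(response) {
  const headers = response.headers || {};
  const server = String(headers['server'] || '').toLowerCase();
  const mitigated = String(headers['cf-mitigated'] || '').toLowerCase();

  // Cloudflare marks challenge pages with `cf-mitigated: challenge`.
  if (mitigated === 'challenge') return true;

  // Otherwise, treat a 403, 429, or 503 served by Cloudflare as a likely block.
  const blockedStatus = [403, 429, 503].includes(response.status);
  return blockedStatus && server === 'cloudflare';
}
```

With axios, you could call the helper on `error.response` inside a `catch` block, since axios rejects on non-2xx status codes by default.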
1. ZenRows
ZenRows is a web scraping API featuring all the tools required to bypass Cloudflare at scale. It auto-rotates premium proxies, fixes the request headers, auto-bypasses CAPTCHAs and other anti-bots, and more. ZenRows also offers a dedicated residential proxy service under the same price plan, extending your choices for bypassing specific blocks, such as rate-limited and geo-specific IP bans.
One advantage of ZenRows is that it's beginner-friendly and compatible with all programming languages. You only need to make a single API call, and your scraper bypasses anti-bots in a split second.
👍 Pros
- The most reliable solution for bypassing any anti-bot system.
- Easy to use.
- Highly scalable.
- Integrates with other libraries.
- Can scrape JavaScript-rendered websites.
- Fast response time.
- Compatible with all programming languages.
- Dedicated proxy service.
- IP auto-rotation and flexible geo-targeting.
- Transparent pricing.
- High success rate.
- Headless browser features.
👎 Cons
- Paid solution.
How to Bypass Cloudflare in Node.js Using ZenRows
Let's see how ZenRows performs against Cloudflare Challenge.
Sign up to open the Request Builder and get your free API key. Paste the target URL in the link box, activate Premium Proxies and JS Rendering, select Node.js as your programming language and choose the API connection mode. Copy and paste the generated code into your JavaScript scraper file.
Here's what the generated code looks like:
// npm install axios
const axios = require('axios');
const url = 'https://www.scrapingcourse.com/cloudflare-challenge';
const apikey = '<YOUR_ZENROWS_API_KEY>';
axios({
  url: 'https://api.zenrows.com/v1/',
  method: 'GET',
  params: {
    url: url,
    apikey: apikey,
    js_render: 'true',
    premium_proxy: 'true',
  },
})
  .then((response) => console.log(response.data))
  .catch((error) => console.log(error));
And here's the full-page HTML of the protected website:
<html lang="en">
  <head>
    <!-- ... -->
    <title>Cloudflare Challenge - ScrapingCourse.com</title>
    <!-- ... -->
  </head>
  <body>
    <!-- ... -->
    <h2>
      You bypassed the Cloudflare challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
  </body>
</html>
Amazing! With ZenRows, you bypassed all levels of Cloudflare security in Node.js.
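If you'd rather not add axios as a dependency, the same request can be sketched with the `fetch` function built into Node.js 18 and later. This is an illustrative variant, not code generated by the Request Builder. It reuses only the endpoint and parameters from the example above, and the helper names `buildZenRowsUrl` and `fetchViaZenRows` are hypothetical.

```javascript
// Build the ZenRows API URL from the same parameters used in the axios example.
function buildZenRowsUrl(targetUrl, apiKey) {
  const params = new URLSearchParams({
    url: targetUrl,
    apikey: apiKey,
    js_render: 'true',
    premium_proxy: 'true',
  });
  return `https://api.zenrows.com/v1/?${params.toString()}`;
}

// Fetch the page HTML through ZenRows using the global fetch (Node.js 18+).
async function fetchViaZenRows(targetUrl, apiKey) {
  const response = await fetch(buildZenRowsUrl(targetUrl, apiKey));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}

// Usage (replace the placeholder with your own key):
// fetchViaZenRows('https://www.scrapingcourse.com/cloudflare-challenge', '<YOUR_ZENROWS_API_KEY>')
//   .then(console.log)
//   .catch(console.error);
```

`URLSearchParams` takes care of encoding the target URL, which matters when the target itself contains a query string.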
2. Puppeteer
Puppeteer is a popular Node.js library with high-level headless browser APIs based on Chromium. Being a headless browser, Puppeteer allows you to mimic user interactions, such as clicking, visiting pages dynamically, scrolling, hovering, typing, and more.
Puppeteer can also mimic normal user behavior during scraping, such as limiting the request rate and dynamically waiting for the page to load. One shortcoming of Puppeteer is that it exposes bot-like parameters, such as the HeadlessChrome token in its default headless User-Agent string, allowing anti-bots to block it easily.
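To make the HeadlessChrome giveaway concrete, here's a minimal sketch that detects the marker in a User-Agent string and masks it before navigation. The helper names are hypothetical, and masking the User-Agent addresses only this one signal, so don't expect it to defeat Cloudflare on its own. `browser.userAgent()` and `page.setUserAgent()` are standard Puppeteer methods.

```javascript
// Returns true when a User-Agent string carries the headless Chrome marker.
function isHeadlessUserAgent(userAgent) {
  return /HeadlessChrome/i.test(userAgent);
}

// Rewrites the marker so the string resembles a regular Chrome User-Agent.
function stripHeadlessMarker(userAgent) {
  return userAgent.replace(/HeadlessChrome/gi, 'Chrome');
}

// Applies the cleaned User-Agent to a Puppeteer page before navigation.
async function maskUserAgent(browser, page) {
  const original = await browser.userAgent();
  if (isHeadlessUserAgent(original)) {
    await page.setUserAgent(stripHeadlessMarker(original));
  }
}
```

Call `maskUserAgent(browser, page)` right after `browser.newPage()` and before `page.goto()`.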
👍 Pros
- Automates user interactions.
- Scrapes dynamically rendered content.
- You can set Puppeteer proxies and customize its request headers.
- Mimic normal user behavior.
👎 Cons
- Easily detected by anti-bots.
- Steep learning curve.
- Costly at scale.
- Unsuitable for large-scale web scraping.
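As an illustration of the proxy and header customization mentioned in the pros, the sketch below routes Chromium through a proxy with the `--proxy-server` launch argument and attaches extra headers with `page.setExtraHTTPHeaders()`. The proxy address and the header value are placeholders, and the helper names are hypothetical.

```javascript
// Hypothetical proxy address; replace it with a proxy you control.
const PROXY_URL = 'http://127.0.0.1:8080';

// Build launch options that route Chromium traffic through the proxy.
function buildLaunchOptions(proxyUrl) {
  return { args: [`--proxy-server=${proxyUrl}`] };
}

// Launch Puppeteer with the proxy and attach extra headers to every request.
async function openPageWithProxy(url) {
  // Loaded lazily so buildLaunchOptions() stays usable without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch(buildLaunchOptions(PROXY_URL));
  const page = await browser.newPage();
  await page.setExtraHTTPHeaders({ 'Accept-Language': 'en-US,en;q=0.9' });
  await page.goto(url);
  return { browser, page };
}
```

Remember to call `browser.close()` when you're done with the returned page. If your proxy requires credentials, Puppeteer's `page.authenticate()` can supply them.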
How to Bypass Cloudflare in Node.js Using Puppeteer
Let's target Hapag-Lloyd, a protected website, to see how to avoid detection with Puppeteer.
First, install the library using npm:
npm install puppeteer
Require and launch Puppeteer using an async function. Set the page viewport and visit the target website. Wait for the page to load, take a screenshot of the target page, and close the browser instance:
// npm install puppeteer
const puppeteer = require('puppeteer');
(async () => {
  // initiate the browser
  const browser = await puppeteer.launch();
  // create a new page with the default browser context
  const page = await browser.newPage();
  // set a viewport
  await page.setViewport({ width: 1280, height: 720 });
  // go to the target website
  await page.goto('https://www.hapag-lloyd.com/en/home.html');
  // wait for the page to load
  await new Promise((r) => setTimeout(r, 1000));
  // take a screenshot
  await page.screenshot({ path: 'screenshot.png' });
  // close the browser and all of its pages
  await browser.close();
})();
Puppeteer may fail to bypass even low-level bot protection mechanisms. Running the scraper in non-headless mode (GUI mode) can sometimes increase the success rate, as headless browsers are more likely to be flagged as bots. However, non-headless mode leads to additional memory overhead and is generally not recommended.
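If you want to experiment with GUI mode without editing the script each time, one option is to toggle it from the environment. This is a small sketch: `HEADFUL` is a hypothetical variable name chosen for this example, while `headless: false` is Puppeteer's standard launch option for showing the browser window.

```javascript
// Decide between headless and GUI mode from an environment-like object.
// HEADFUL is a hypothetical variable name chosen for this example.
function launchOptionsFromEnv(env) {
  return { headless: env.HEADFUL !== '1' };
}

// Launch Puppeteer with the mode selected by the current environment.
async function launchBrowser() {
  // Loaded lazily so launchOptionsFromEnv() works without Puppeteer installed.
  const puppeteer = require('puppeteer');
  return puppeteer.launch(launchOptionsFromEnv(process.env));
}
```

Running the scraper with `HEADFUL=1 node scraper.js` would then open a visible browser window, while a plain `node scraper.js` keeps the default headless mode.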
Additionally, Puppeteer is less suitable for large-scale scraping because managing multiple browser instances becomes resource-intensive and costly at scale.
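One common way to keep resource usage in check is to share a single browser and cap how many pages are open at once. The helper below is a generic concurrency limiter written for illustration. It's not part of Puppeteer, and the name `mapWithConcurrency` is hypothetical.

```javascript
// Run `worker` over `items` with at most `limit` tasks in flight at once.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  async function runner() {
    while (next < items.length) {
      const index = next++; // claim the next item before awaiting
      results[index] = await worker(items[index], index);
    }
  }

  // Start up to `limit` runners that pull items from the shared counter.
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}

// Usage sketch with Puppeteer: share one browser, open at most 3 pages at a time.
// const browser = await puppeteer.launch();
// const titles = await mapWithConcurrency(urls, 3, async (url) => {
//   const page = await browser.newPage();
//   try {
//     await page.goto(url);
//     return await page.title();
//   } finally {
//     await page.close();
//   }
// });
// await browser.close();
```

Capping concurrency reduces memory pressure, but each open page still carries the cost of a real browser tab, so this only softens the scaling problem.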
You can use plugins like Puppeteer Stealth to boost Puppeteer's ability to bypass Cloudflare. You'll see how to use it in the next section.
3. Puppeteer Stealth
Puppeteer Stealth is a plugin that enhances Puppeteer's ability to bypass anti-bot detection. It uses multiple evasion techniques to bypass Cloudflare, like overriding JavaScript objects in the browser, changing the User Agent header, and more.
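To show what "overriding JavaScript objects" means in practice, here's a simplified illustration that hides the `navigator.webdriver` flag, a property automated browsers typically expose. This is not the plugin's actual code. The helper name is hypothetical, and the real plugin applies many more evasions.

```javascript
// Simplified illustration of one evasion idea: hide the `navigator.webdriver` flag.
// This is NOT the plugin's actual code; it only demonstrates the override technique.
function hideWebdriverFlag(navigatorLike) {
  Object.defineProperty(navigatorLike, 'webdriver', {
    get: () => false,
    configurable: true,
  });
  return navigatorLike;
}

// In plain Puppeteer, a similar override could be injected before any page script runs:
// await page.evaluateOnNewDocument(() => {
//   Object.defineProperty(navigator, 'webdriver', { get: () => false });
// });
```

Because `page.evaluateOnNewDocument()` runs the override before the site's own scripts, detection code that reads the property later sees the patched value.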
👍 Pros
- Customizable with extra evasions.
- Hard to detect.
- Ability to automate user interactions.
- Full support for dynamic content scraping.
👎 Cons
- Can't evade advanced anti-bot measures.
- Prone to anti-bot detection when used at scale.
- Steep learning curve.
- Hard to debug.
How to Bypass Cloudflare in Node.js Using Puppeteer Stealth
To avoid Cloudflare bot detection using Puppeteer Stealth, first install the library:
npm install puppeteer-extra puppeteer-extra-plugin-stealth
Require puppeteer-extra in your scraper and register the Stealth plugin. Import executablePath from Puppeteer and pass it when launching the browser so that puppeteer-extra uses the browser bundled with Puppeteer. Then, request the same target website that blocked the base Puppeteer scraper previously (Hapag-Lloyd):
// npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
// import the helper that returns the path to Puppeteer's bundled browser
const { executablePath } = require('puppeteer');
// add the stealth plugin
puppeteer.use(StealthPlugin());
(async () => {
  // initiate the browser using Puppeteer's bundled browser binary
  const browser = await puppeteer.launch({
    executablePath: executablePath(),
  });
  // create a new page with the default browser context
  const page = await browser.newPage();
  // set a viewport
  await page.setViewport({ width: 1280, height: 720 });
  // go to the target website
  await page.goto('https://www.hapag-lloyd.com/en/home.html');
  // wait for the page to load
  await new Promise((r) => setTimeout(r, 5000));
  // take a screenshot
  await page.screenshot({ path: 'screenshot.png' });
  // close the browser and all of its pages
  await browser.close();
})();
Puppeteer Stealth will bypass simple Cloudflare protection. Unfortunately, it can't bypass websites with advanced Cloudflare protection like Cloudflare Challenge.
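Instead of inspecting the screenshot by hand, you can check whether the browser is still stuck on Cloudflare's interstitial page. The sketch below is a heuristic that assumes the interstitial uses the commonly seen "Just a moment..." title, which may differ between sites or change over time. The helper names are hypothetical, and `page.title()` is a standard Puppeteer method.

```javascript
// Cloudflare's interstitial page commonly uses the title "Just a moment...".
// Treat this as a heuristic; the title may differ by site or change over time.
function isChallengeTitle(title) {
  return /just a moment/i.test(title);
}

// Poll the page title until the challenge clears or the timeout elapses.
// `page` only needs a `title()` method that returns a promise, like Puppeteer's Page.
async function waitForChallengeToClear(page, { timeoutMs = 30000, pollMs = 500 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    // A navigation in progress can make title() reject; treat that as "not cleared yet".
    const title = await page.title().catch(() => null);
    if (title !== null && !isChallengeTitle(title)) return true;
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  return false;
}
```

Calling `await waitForChallengeToClear(page)` after `page.goto()` resolves to `true` once the title no longer matches, or `false` if the challenge is still showing when the timeout expires.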
4. Humanoid
Humanoid is a Node.js package that solves and bypasses Cloudflare anti-bot challenges. It solves JavaScript challenges using the Node.js runtime, making the anti-bot perceive the scraper as a regular web browser.
Some of its features to solve JavaScript challenges include User Agent randomization, auto-retry on failed challenges, custom cookies, and request header optimization. However, unlike headless browsers, Humanoid only works as an HTTP client and depends on libraries like Cheerio to parse HTML.
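To illustrate the general idea behind User Agent randomization and auto-retry, here's a generic sketch. It is not Humanoid's implementation. The User-Agent pool and the helper names are hypothetical, and `attempt` stands for any function that performs a request with the given User-Agent and throws when blocked.

```javascript
// Hypothetical pool of User-Agent strings, for illustration only.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
];

// Pick a User-Agent at random; `random` is injectable to keep the function testable.
function pickUserAgent(pool = USER_AGENTS, random = Math.random) {
  return pool[Math.floor(random() * pool.length)];
}

// Call `attempt(userAgent)` up to `maxAttempts` times, picking a random User-Agent per try.
async function retryWithRotation(attempt, maxAttempts = 3) {
  let lastError;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await attempt(pickUserAgent());
    } catch (error) {
      lastError = error; // remember the failure and try again
    }
  }
  throw lastError;
}
```

Rotating the User-Agent only changes one request header, so it helps against basic filters rather than fingerprint-based detection.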
👍 Pros
- Easy to use.
- Solves JavaScript challenges automatically.
- Supports asynchronous JavaScript.
- Faster than headless browsers.
👎 Cons
- Hasn't been updated for a long time.
- Doesn't support dynamic content scraping.
- No support for human interaction automation.
- Content extraction depends on parsers like Cheerio.
How to Bypass Cloudflare in Node.js Using Humanoid
To bypass Cloudflare with Node.js and Humanoid, start by installing the package:
npm install humanoid-js
Import the module, create a Humanoid instance, and request the target website using the GET method:
// npm install humanoid-js
const Humanoid = require('humanoid-js');
// create a new humanoid instance
const humanoid = new Humanoid();
// send a GET request to the target website
humanoid
  .get('https://www.hapag-lloyd.com/en/home.html')
  .then((res) => {
    // print the result
    console.log(res.body);
  })
  // catch errors if any
  .catch((err) => {
    console.log(err);
  });
Humanoid can't bypass advanced Cloudflare protection. Because it's no longer maintained, its bypass methods are obsolete and easily detected by anti-bot systems, so even the simplest security measures could block it.
5. cloudflare-scraper
cloudflare-scraper is a plugin that works on top of Puppeteer and is designed to bypass Cloudflare JavaScript challenges. It's customizable, allowing you to add extra parameters to requests, such as cookies, proxies, and a custom User Agent header.
👍 Pros
- Works on top of Puppeteer.
- Can solve Cloudflare JavaScript and CAPTCHA challenges.
- Capable of adding proxies to the request.
👎 Cons
- The lack of documentation makes it difficult to configure.
- Powerless against advanced bot detection.
- Not regularly updated.
- No built-in HTML parser.
- Not beginner-friendly.
How to Bypass Cloudflare in Node.js Using cloudflare-scraper
To bypass Cloudflare detection with cloudflare-scraper, install the library using npm:
npm install cloudflare-scraper
cloudflare-scraper is an ES module, so your project must support the standard ES import statement. To enable ES modules, add the "type" field to your package.json file, alongside its existing fields (JSON doesn't allow comments or trailing commas):
{
  "type": "module"
}
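If you'd rather keep your project as CommonJS, Node.js also lets you load an ES module with a dynamic `import()` call. The sketch below is one possible approach. It assumes cloudflare-scraper exposes a default export, as the import statement used in this article suggests, and the helper name `loadDefaultExport` is hypothetical.

```javascript
// Load an ES module's default export from CommonJS code via dynamic import().
async function loadDefaultExport(specifier) {
  const namespace = await import(specifier);
  return namespace.default;
}

// Usage sketch (assumes cloudflare-scraper is installed and has a default export):
// loadDefaultExport('cloudflare-scraper')
//   .then((cloudflareScraper) => cloudflareScraper.get('https://www.hapag-lloyd.com/en/home.html'))
//   .then(console.log)
//   .catch(console.error);
```

With this approach, the rest of your scraper can keep using `require()` for its other dependencies.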
Now, import cloudflare-scraper into your scraper script, visit the target website, and log its HTML content:
// npm install cloudflare-scraper
import cloudflareScraper from 'cloudflare-scraper';
(async () => {
  try {
    // send a GET request to the target website
    const response = await cloudflareScraper.get(
      'https://www.hapag-lloyd.com/en/home.html'
    );
    // print out the result
    console.log(response);
    // handle errors
  } catch (error) {
    console.log(error);
  }
})();
While cloudflare-scraper can help you bypass simple Cloudflare protection, it can't handle advanced security measures. It's also not regularly updated, making it vulnerable to constantly evolving Cloudflare security measures. The lack of detailed documentation also gives it a steep learning curve and makes it unsuitable for large-scale web scraping.
Conclusion
You've seen five tools to bypass Cloudflare while scraping with Node.js, including how they work. Among the free tools covered, Puppeteer Stealth is the most promising. However, while the free tools may bypass simple Cloudflare protection, they don't guarantee a 100% success rate, and they can't get past advanced Cloudflare security measures.
The best way to bypass Cloudflare and any other anti-bot is to use ZenRows, an all-in-one solution that provides all the tools required to scrape any website at scale without limitations.
Try ZenRows for free without a credit card!
Frequently Asked Questions
What Is Cloudflare?
Cloudflare is a content delivery network that offers web application firewalls to defend applications against security threats such as cross-site scripting (XSS), credential stuffing, and DDoS attacks. Its bot protection also tends to block scrapers.
How Does Cloudflare Detect Web Scrapers?
Cloudflare uses two main techniques to detect bots: passive and active detection methods.
Passive detection methods employ backend analyses to detect suspicious activities. They include TLS and HTTP/2 fingerprinting, IP reputation, and more. Active methods use client-side analyses, such as CAPTCHAs, JavaScript challenges, event tracking, and canvas fingerprinting.