Does your Puppeteer web scraper struggle to bypass CAPTCHA? You've come to the right place for the solution!
In this tutorial, you'll learn the four best ways to deal with CAPTCHA while using Puppeteer and scrape without obstacles.
- Method #1: Supercharge Puppeteer with stealth to bypass CAPTCHA.
- Method #2: Bypass CAPTCHA with ZenRows.
- Method #3: Implement a free solver plugin.
- Method #4: Use a paid CAPTCHA solver with Puppeteer.
Can Puppeteer Solve CAPTCHA?
The short answer is yes, but only if you give Puppeteer a boost. That's because Puppeteer alone can't automate CAPTCHA-clicking.
For instance, try scraping G2 Reviews, a website protected by Cloudflare CAPTCHA, with vanilla Puppeteer:
// import the required library
const puppeteer = require("puppeteer");
(async () => {
// start Puppeteer in headless mode and open the target website
const browser = await puppeteer.launch({ headless: "new" });
const page = await browser.newPage();
const response = await page.goto("https://www.g2.com/products/asana/reviews");
// wait for the content to load
await page.waitForSelector("body");
// get the content of the page
const content = await page.content();
console.log(response.status(), content);
// close the browser instance
await browser.close();
})();
The code outputs the following HTML, indicating that Cloudflare has blocked your Puppeteer scraper:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<!-- ... -->
<title>Attention Required! | Cloudflare</title>
</head>
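To catch this case in code rather than by eye, you can check the status code and the page title before parsing anything. The helper below is a minimal sketch: `looksBlocked` is a hypothetical function, not part of Puppeteer, and the status codes and title patterns it checks are assumptions you should adjust to your target site.

```javascript
// Hypothetical helper (not part of Puppeteer): guess whether a response is a
// block or challenge page instead of the content you asked for.
function looksBlocked(status, html) {
  // pull the <title> text out of the HTML, if there is one
  const match = /<title[^>]*>([^<]*)<\/title>/i.exec(html);
  const title = match ? match[1].trim() : "";

  // assumed heuristics: a 403/429/503 status, or a title seen on challenge pages
  const blockedStatus = [403, 429, 503].includes(status);
  const blockedTitle = /attention required|just a moment/i.test(title);

  return blockedStatus || blockedTitle;
}

// Usage inside the script above:
//   if (looksBlocked(response.status(), content)) console.warn("Blocked by a CAPTCHA page");
```

A check like this lets your scraper retry or switch strategy instead of silently saving the block page.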
The CAPTCHA typically looks like this in the GUI (non-headless mode):
Generally, there are two ways to handle CAPTCHAs:
- Solve the CAPTCHA once it triggers.
- Bypass the CAPTCHA completely.
Puppeteer can solve CAPTCHAs only if supported with external CAPTCHA-solving tools. The vanilla version of Puppeteer is an automation library, not designed to solve CAPTCHAs.
The most efficient way of handling CAPTCHA is bypassing it by preventing it from appearing. While Puppeteer's headless browser capability may help you bypass CAPTCHAs, it still requires backup from plugins like Puppeteer Stealth.
Another way to bypass CAPTCHAs is by optimizing Puppeteer's request headers to mimic a real user or setting up a proxy with Puppeteer to switch IPs.
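As a rough sketch of the header and proxy idea, the snippet below passes a proxy to the browser and sets browser-like headers on the page. `buildLaunchOptions`, `browserLikeHeaders`, and `openPage` are hypothetical names, and the proxy address, user agent, and header values are placeholders you need to supply yourself.

```javascript
// Hypothetical helper: build Puppeteer launch options that route traffic
// through a proxy. The proxy address is a placeholder you must supply.
function buildLaunchOptions(proxyServer) {
  const args = [];
  if (proxyServer) {
    // --proxy-server is a Chromium command-line switch
    args.push(`--proxy-server=${proxyServer}`);
  }
  return { headless: true, args };
}

// example header values (assumptions); match them to a real browser session
const browserLikeHeaders = {
  "Accept-Language": "en-US,en;q=0.9",
  Referer: "https://www.google.com/",
};

async function openPage(url, { proxyServer, userAgent, credentials } = {}) {
  // required lazily so the helpers above work without Puppeteer installed
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch(buildLaunchOptions(proxyServer));
  const page = await browser.newPage();

  // proxy credentials go through page.authenticate, not the proxy URL
  if (credentials) await page.authenticate(credentials);
  if (userAgent) await page.setUserAgent(userAgent);
  await page.setExtraHTTPHeaders(browserLikeHeaders);

  await page.goto(url);
  return { browser, page };
}
```

Headers and a single proxy only hide the most obvious signals, so treat this as a starting point rather than a complete bypass.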
Let's look at the four best techniques to handle CAPTCHAs with Puppeteer.
Method #1: Supercharge Puppeteer With Stealth to Bypass CAPTCHA
Puppeteer Stealth is a plugin featuring various evasion techniques for bypassing anti-bot detection during web scraping. It removes bot-like attributes from the browser that Puppeteer controls, such as the navigator.webdriver flag, making it appear as a legitimate browser.
The Stealth plugin requires some technical setup, but it's a free method of bypassing CAPTCHA with Puppeteer.
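To get a feel for what "bot-like attributes" means, here's a small sketch that reads a few browser properties and flags the suspicious ones. `botSignals` and `readFingerprint` are hypothetical helpers, and the three checks are only examples of signals that anti-bot systems commonly inspect, not a complete list.

```javascript
// Hypothetical helper: list the bot-like signals found in a fingerprint object.
function botSignals(fp) {
  const signals = [];
  if (fp.webdriver === true) signals.push("navigator.webdriver is true");
  if (/HeadlessChrome/.test(fp.userAgent)) signals.push("user agent contains HeadlessChrome");
  if (fp.pluginCount === 0) signals.push("no browser plugins reported");
  return signals;
}

// Read the same properties from a live Puppeteer page.
async function readFingerprint(page) {
  return page.evaluate(() => ({
    webdriver: navigator.webdriver,
    userAgent: navigator.userAgent,
    pluginCount: navigator.plugins.length,
  }));
}

// Usage with an open page:
//   console.log(botSignals(await readFingerprint(page)));
```

Running it once with vanilla Puppeteer and once with the Stealth plugin enabled should show which signals the plugin hides in your setup.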
Let's see how it works by scraping OpenSea, an anti-bot-protected website that presents CAPTCHAs when a request doesn't meet its security criteria.
To get started, you have to install puppeteer-extra and puppeteer-extra-plugin-stealth:
npm install puppeteer-extra puppeteer-extra-plugin-stealth
Once installed, import the modules and enable the stealth plugin:
const puppeteer = require("puppeteer-extra");
const pluginStealth = require("puppeteer-extra-plugin-stealth");
// get the path of the browser binary that ships with Puppeteer
const { executablePath } = require("puppeteer");
// use stealth
puppeteer.use(pluginStealth());
The next steps include setting the viewport, navigating to the page URL, waiting for it to load, and taking screenshots to track the process.
// ...
// launch puppeteer-stealth
puppeteer.launch({ executablePath: executablePath() }).then(async browser => {
// create a new page
const page = await browser.newPage();
// set page view
await page.setViewport({ width: 1280, height: 720 });
// navigate to the website
await page.goto("https://www.opensea.io/");
// give the page a moment to finish loading
await new Promise(resolve => setTimeout(resolve, 1000));
// take a screenshot
await page.screenshot({ path: "image.png" });
// close the browser
await browser.close();
});
Here's the complete code:
const puppeteer = require("puppeteer-extra");
// add stealth plugin and use defaults
const pluginStealth = require("puppeteer-extra-plugin-stealth");
const { executablePath } = require("puppeteer");
// use stealth
puppeteer.use(pluginStealth());
// launch puppeteer-stealth
puppeteer.launch({ executablePath: executablePath() }).then(async browser => {
// create a new page
const page = await browser.newPage();
// set page view
await page.setViewport({ width: 1280, height: 720 });
// navigate to the website
await page.goto("https://www.opensea.io/");
// give the page a moment to finish loading
await new Promise(resolve => setTimeout(resolve, 1000));
// take a screenshot
await page.screenshot({ path: "image.png" });
// close the browser
await browser.close();
});
The above code outputs the following screenshot:
Congrats! You've just made your scraper more undetectable.
However, more advanced website protections can detect Puppeteer Stealth. Confirm that by running the same script on G2's product page. Here's the output:
Even with the Stealth plugin, Puppeteer couldn't get past the CAPTCHA. Let's see a solution that works in the next section.
Method #2: Best CAPTCHA Bypass With ZenRows
As mentioned, the best way to handle CAPTCHA is to avoid it. That's where ZenRows, an all-in-one web scraping API, comes in. It modifies your request headers, auto-rotates premium proxies, and bypasses CAPTCHAs and other anti-bot measures at scale in a single API call.
ZenRows also features JavaScript instructions, allowing it to act as a headless browser for extracting content from dynamic websites like those using infinite scrolling. Thanks to this feature, you can replace Puppeteer with ZenRows and focus on scraping your target content without getting blocked.
Let's see ZenRows in action by scraping the G2 page where Puppeteer Stealth got blocked.
Sign up to open the ZenRows Request Builder. Paste the target URL in the link box, set the Boost mode to JS Rendering, and activate Premium Proxies. Select Node.js as your programming language and choose the API connection mode. Then, copy and paste the generated code into your JavaScript file.
Here's a slightly modified version of the generated code:
// npm install axios
const axios = require("axios");
// define your request parameters and make an axios request
axios({
url: "https://api.zenrows.com/v1/",
method: "GET",
params: {
"url": "https://www.g2.com/products/asana/reviews",
"apikey": "<YOUR_ZENROWS_API_KEY>",
"js_render": "true",
"premium_proxy": "true",
},
})
.then(response => console.log(response.data))
.catch(error => console.log(error));
The code extracts the full-page HTML of the Cloudflare-protected website. The result below shows the page title, with the rest of the content omitted:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
<title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
</head>
<body>
<!-- other content omitted for brevity -->
</body>
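If you'd rather not add axios, the same request can be sketched with the fetch function built into Node.js 18 and later. The query parameters are the ones from the axios example. `buildZenRowsUrl`, `extractTitle`, and `scrape` are hypothetical helper names, and checking the title is just one assumed way to confirm you received the real page.

```javascript
// Build the request URL with the same parameters as the axios example.
function buildZenRowsUrl(targetUrl, apiKey) {
  const params = new URLSearchParams({
    url: targetUrl,
    apikey: apiKey,
    js_render: "true",
    premium_proxy: "true",
  });
  return `https://api.zenrows.com/v1/?${params.toString()}`;
}

// Hypothetical helper: extract the <title> text to sanity-check the HTML.
function extractTitle(html) {
  const match = /<title[^>]*>([^<]*)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Fetch the page through the API and fail loudly on a non-2xx response.
async function scrape(targetUrl, apiKey) {
  const response = await fetch(buildZenRowsUrl(targetUrl, apiKey));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const html = await response.text();
  return { title: extractTitle(html), html };
}

// Usage:
//   scrape("https://www.g2.com/products/asana/reviews", "<YOUR_ZENROWS_API_KEY>")
//     .then(({ title }) => console.log(title))
//     .catch(console.error);
```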
You're all set! ZenRows makes bypassing CAPTCHAs and advanced anti-bot measures quick and easy.
Would you rather try solving the CAPTCHA manually instead? Let's go through two more methods.
Method #3: Implement a Free Solver Plugin
Puppeteer-extra-plugin-recaptcha is a free and open-source module that automates the solving of reCAPTCHA and hCaptcha, two of the most popular anti-bot technologies on the market. It also supports a 2Captcha integration that you can use when the free module proves insufficient.
Let's use 2Captcha's demo page to illustrate how to integrate the solver with Puppeteer.
To get started, install puppeteer-extra and puppeteer-extra-plugin-recaptcha.
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-recaptcha
Import the libraries and provide your 2Captcha API key as a token.
const puppeteer = require('puppeteer-extra')
const RecaptchaPlugin = require('puppeteer-extra-plugin-recaptcha')
// use the RecaptchaPlugin with the specified provider (2captcha) and token
puppeteer.use(
RecaptchaPlugin({
provider: {
id: '2captcha',
token: 'XXXXXXX'
},
visualFeedback: true // enable visual feedback (colorize reCAPTCHAs)
})
)
Next, navigate to your target webpage and initialize the solving with the page.solveRecaptchas() method.
// launch a headless browser instance
puppeteer.launch({ headless: true }).then(async browser => {
// create a new page
const page = await browser.newPage()
// navigate to a page containing a reCAPTCHA challenge
await page.goto('https://2captcha.com/demo/recaptcha-v2')
// automatically solve the reCAPTCHA challenge
await page.solveRecaptchas()
Now, wait for the solution and click on the submit button.
// wait for the navigation and click the submit button
await Promise.all([
page.waitForNavigation(),
page.click(`#recaptcha-demo-submit`)
])
The complete code should be:
const puppeteer = require('puppeteer-extra')
const RecaptchaPlugin = require('puppeteer-extra-plugin-recaptcha')
// use the RecaptchaPlugin with the specified provider (2captcha) and token
puppeteer.use(
RecaptchaPlugin({
provider: {
id: '2captcha',
token: 'XXXXXXX'
},
visualFeedback: true // enable visual feedback (colorize reCAPTCHAs)
})
)
// launch a headless browser instance
puppeteer.launch({ headless: true }).then(async browser => {
// create a new page
const page = await browser.newPage()
// navigate to a page containing a reCAPTCHA challenge
await page.goto('https://2captcha.com/demo/recaptcha-v2')
// automatically solve the reCAPTCHA challenge
await page.solveRecaptchas()
// wait for the navigation and click the submit button
await Promise.all([
page.waitForNavigation(),
page.click(`#recaptcha-demo-submit`)
])
// take a screenshot of the response page
await page.screenshot({ path: 'response.png', fullPage: true })
// close the browser
await browser.close()
})
Here's the outcome:
Great! You've just successfully solved the CAPTCHA.
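The script above ignores what page.solveRecaptchas() returns, so a failed solve goes unnoticed. Here's a hedged sketch of two small improvements: summarizing the result and reading the API key from the environment. The field names (captchas, solved, error) follow the plugin's README as far as I know, so verify them against your installed version. `describeSolveResult`, `recaptchaOptions`, and the `TWOCAPTCHA_API_KEY` variable name are all hypothetical.

```javascript
// Hypothetical helper: turn the object returned by page.solveRecaptchas()
// into a short status line. Field names are assumed from the plugin's README.
function describeSolveResult({ captchas = [], solved = [], error } = {}) {
  if (error) return `solver error: ${error.message || error}`;
  if (captchas.length === 0) return "no CAPTCHA found on the page";
  return `solved ${solved.length} of ${captchas.length} CAPTCHA(s)`;
}

// Read the API key from the environment instead of hard-coding it.
// TWOCAPTCHA_API_KEY is an assumed variable name.
function recaptchaOptions(env = process.env) {
  return {
    provider: { id: "2captcha", token: env.TWOCAPTCHA_API_KEY },
    visualFeedback: true,
  };
}

// Usage:
//   puppeteer.use(RecaptchaPlugin(recaptchaOptions()));
//   console.log(describeSolveResult(await page.solveRecaptchas()));
```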
However, free CAPTCHA solvers are unreliable because they're automated. If they fail, you should look into paid solvers. Since they employ humans, they can interact with any CAPTCHA type.
Method #4: Use a Paid CAPTCHA Solver With Puppeteer
Let's imagine you encounter a CAPTCHA-protected form while scraping and need to solve it. Here, we'll use 2Captcha, an API-based service that employs humans to solve the challenge.
Let's use 2Captcha's demo site again, this time its normal (image) CAPTCHA page.
First, sign up on 2Captcha to get an API key. Then, install Puppeteer and the request module.
npm install puppeteer request
Now, let's write a script that opens the website you want to scrape, takes a screenshot of the CAPTCHA, and sends it to the service.
const puppeteer = require('puppeteer');
const request = require('request');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
// navigate to the page with the CAPTCHA
await page.goto('https://2captcha.com/demo/normal');
// take a screenshot of the CAPTCHA
const screenshot = await page.screenshot();
// convert the screenshot to a base64 encoded string
const image = Buffer.from(screenshot).toString('base64');
// send the image to the 2Captcha API
request.post({
url: 'http://2captcha.com/in.php',
formData: {
key: 'your_2captcha_api_key',
method: 'base64',
body: image
}
}, async (error, response, body) => {
if (error) {
console.error(error);
}
Let's capture the API response using an ID, as shown below:
// get the CAPTCHA ID from the 2Captcha API response
const captchaId = body.split('|')[1];
// request the CAPTCHA solution from the 2Captcha API
request.get({
url: `http://2captcha.com/res.php?key=your_2captcha_api_key&action=get&id=${captchaId}`
}, (error, response, body) => {
if (error) {
console.error(error);
}
});
Once we get the solution, we can put it on the page to solve the test.
// get the CAPTCHA solution from the 2Captcha API response
const captchaSolution = body.split('|')[1];
// use the CAPTCHA solution in your Puppeteer script
await page.type('#captcha-input', captchaSolution);
await page.click('#submit-button');
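One thing the snippets above gloss over is that a human solver needs time, so the first res.php call often returns CAPCHA_NOT_READY instead of an answer. A minimal polling sketch could look like the following. `parseReply` and `pollForSolution` are hypothetical helpers, and the five-second interval and 24 attempts are assumed defaults, not values recommended by 2Captcha.

```javascript
// Parse a plain-text 2Captcha reply such as "OK|12345" or "CAPCHA_NOT_READY".
function parseReply(body) {
  const text = String(body).trim();
  if (text.startsWith("OK|")) return { status: "ok", value: text.slice(3) };
  if (text === "CAPCHA_NOT_READY") return { status: "pending" };
  return { status: "error", value: text };
}

// Poll until the solution is ready. fetchReply is any async function that
// returns the raw res.php body; intervalMs and maxAttempts are assumed defaults.
async function pollForSolution(fetchReply, { intervalMs = 5000, maxAttempts = 24 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const reply = parseReply(await fetchReply());
    if (reply.status === "ok") return reply.value;
    if (reply.status === "error") throw new Error(`2Captcha error: ${reply.value}`);
    // still pending: wait before asking again
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Timed out waiting for the CAPTCHA solution");
}

// Usage (Node.js 18+), with resUrl set to the res.php URL from the snippet above:
//   const solution = await pollForSolution(() => fetch(resUrl).then((r) => r.text()));
```

Passing the fetch step in as a function keeps the polling logic independent of the HTTP client, so it works with request, axios, or fetch.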
This is what the full script will look like:
const puppeteer = require('puppeteer');
const request = require('request');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
// navigate to the page with the CAPTCHA
await page.goto('https://2captcha.com/demo/normal');
// take a screenshot of the CAPTCHA
const screenshot = await page.screenshot();
// convert the screenshot to a base64 encoded string
const image = Buffer.from(screenshot).toString('base64');
// send the image to the 2Captcha API
request.post({
url: 'http://2captcha.com/in.php',
formData: {
key: 'your_2captcha_api_key',
method: 'base64',
body: image
}
}, async (error, response, body) => {
if (error) {
console.error(error);
} else {
// get the CAPTCHA ID from the 2Captcha API response
const captchaId = body.split('|')[1];
// request the CAPTCHA solution from the 2Captcha API
request.get({
url: `http://2captcha.com/res.php?key=your_2captcha_api_key&action=get&id=${captchaId}`
}, async (error, response, body) => {
if (error) {
console.error(error);
} else {
// get the CAPTCHA solution from the 2Captcha API response
const captchaSolution = body.split('|')[1];
// use the CAPTCHA solution in your Puppeteer script
await page.type('#captcha-input', captchaSolution);
await page.click('#submit-button');
}
await browser.close();
});
}
});
})();
Here are the results:
Keep in mind that using CAPTCHA solvers with Puppeteer works mostly for testing purposes rather than large-scale scraping, as they can quickly become too expensive and slow. Additionally, the screenshot-upload flow shown above only fits text-based image CAPTCHAs. Token-based challenges, e.g., reCAPTCHA or GeeTest, can't be solved by submitting an image and need a different flow.
Conclusion
In this article, you've learned four ways to handle CAPTCHAs with Puppeteer. Methods like integrating a solver or masking the browser are sometimes effective, but they fail for more complex CAPTCHAs and don't scale well for big web scraping projects.
For successful data extraction, you need a scalable and efficient solution like ZenRows, which handles all anti-bot bypasses for you in a single API call. Get your API key now and enjoy 1,000 requests for free.