Have you encountered any CAPTCHAs blocking your web scraper? These challenges can be a headache when automating data collection. Luckily, you can use Playwright to bypass CAPTCHAs, and we'll walk you through three methods:
- Base Playwright with 2Captcha.
- Playwright with the Stealth plugin.
- ZenRows.
If you're tired of dealing with those annoying tests, read on.
Can Playwright Solve CAPTCHA?
The purpose of CAPTCHAs is to be challenging for bots but easy for humans. However, we'll see that you can use Playwright together with complementary tools to get rid of them.
The key takeaway is that you can either (A) solve the test when it appears or (B) prevent it from appearing and retry if it's shown anyway.
In the first case, you'll need to employ a Playwright CAPTCHA solver, which might get expensive at scale. In the second scenario, your scraper needs to simulate human behavior better to stay under the radar. We'll see both approaches, but the second one is the better foundation.
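As a rough illustration of approach B, the retry part can be sketched as a small loop. This is a minimal sketch rather than a production implementation: `loadWithRetry`, `loadPage`, and `looksBlocked` are hypothetical names introduced for this example, and the backoff values are arbitrary.

```javascript
// Sketch of approach B: load the page, and retry with a growing delay
// whenever the result looks like a CAPTCHA or block page.
// `loadPage` and `looksBlocked` are hypothetical callbacks supplied by the caller.
async function loadWithRetry(loadPage, looksBlocked, { maxAttempts = 3, baseDelayMs = 1000 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const html = await loadPage(attempt);
    if (!looksBlocked(html)) {
      return { html, attempts: attempt };
    }
    if (attempt < maxAttempts) {
      // Exponential backoff: baseDelayMs, then 2x, then 4x, ...
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(`Still blocked after ${maxAttempts} attempts`);
}
```

With Playwright, `loadPage` could open the URL and return `await page.content()`, while `looksBlocked` could search that HTML for the markers of a challenge page.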
Now, let's see how you can implement them!
Method #1: Bypass CAPTCHA with Base Playwright and 2Captcha
The first method we'll discuss is using Playwright with 2Captcha, a service that solves CAPTCHAs by employing humans on your behalf.
To get started, install Playwright.
npm install playwright
Then, sign up for a 2Captcha account to obtain your API key and install the package.
npm install 2captcha
Now, open your code editor, import both libraries, and create an async function that launches a headless Chromium browser (with headless: true, as you would in production).
// Import both Playwright and 2captcha
const { chromium } = require('playwright');
const Captcha = require("2captcha");

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
Pass your API key to the Captcha.Solver class to gain access to 2Captcha's services later in the code.
  // Insert your API key here
  const solver = new Captcha.Solver("<Your 2Captcha API key>");
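Pasting the key into the source is fine for a quick test. As a small sketch of an alternative, the helper below reads it from an environment variable instead. The helper name `readApiKey` and the variable name `TWOCAPTCHA_API_KEY` are assumptions made for this example, not something the 2captcha package defines.

```javascript
// Hypothetical helper: read an API key from an environment variable.
// The default variable name is an assumption, not a 2Captcha convention.
function readApiKey(env = process.env, name = "TWOCAPTCHA_API_KEY") {
  const key = (env[name] || "").trim();
  if (!key) {
    throw new Error(`Set the ${name} environment variable before running the scraper`);
  }
  return key;
}
```

You could then create the solver with `new Captcha.Solver(readApiKey())`.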
Navigate to a demo page containing a reCAPTCHA task, wait for the test iframe to load, and retrieve its content through captchaFrame.contentFrame(). That'll enable you to locate and manipulate the elements inside the challenge.
  // Open the reCAPTCHA demo page
  const websiteUrl = "https://patrickhlauke.github.io/recaptcha/";
  await page.goto(websiteUrl);
  // Wait for the CAPTCHA element to load
  const captchaFrame = await page.waitForSelector("iframe[src*='recaptcha/api2']");
  // Switch to the CAPTCHA iframe
  const captchaFrameContent = await captchaFrame.contentFrame();
  // Wait for the CAPTCHA checkbox to appear
  const captchaCheckbox = await captchaFrameContent.waitForSelector("#recaptcha-anchor");
  // Click the CAPTCHA checkbox
  await captchaCheckbox.click();
Great! You're just a few steps away from solving it.
To get the answer you need, invoke the solver.recaptcha() method, which sends a request to 2Captcha's API and returns a response containing the solution token. Here, it's crucial to pass the data-sitekey value from the CAPTCHA (i.e., 6Ld2sf4SAAAAAKSgzs0Q13IZhY02Pyo31S2jgOB5), the public key that identifies the website's reCAPTCHA registration, along with the page URL.
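Rather than copying the key by hand, you can often read it from the page's markup. The helper below is a minimal sketch that assumes the site exposes the key as a data-sitekey attribute in its HTML, which isn't true of every reCAPTCHA integration. The function name `extractSiteKey` is hypothetical.

```javascript
// Pull the first data-sitekey attribute value out of an HTML string.
// Returns null when no such attribute is present.
function extractSiteKey(html) {
  const match = /data-sitekey\s*=\s*(["'])(.*?)\1/i.exec(html);
  return match ? match[2] : null;
}
```

With Playwright, you could pass it the result of `await page.content()` and fall back to a hardcoded key when it returns null.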
Once you have the answer, click the "Submit" button.
  // Ask 2Captcha to solve the challenge; the 2captcha package resolves to an
  // object whose `data` field holds the response token
  const { data: captchaToken } = await solver.recaptcha("6Ld2sf4SAAAAAKSgzs0Q13IZhY02Pyo31S2jgOB5", websiteUrl);
  // The hidden #g-recaptcha-response textarea lives in the main page, not in
  // the CAPTCHA iframe, so set its value from the page context
  await page.evaluate((token) => {
    document.querySelector("#g-recaptcha-response").value = token;
  }, captchaToken);
  // Click the submit button and wait for the resulting navigation
  // (assumes the button sits in the main page's form)
  await Promise.all([
    page.waitForNavigation(),
    page.click("button[type='submit']"),
  ]);
  console.log("CAPTCHA solved successfully!");
  await browser.close();
})();
Amazing! You've solved your first CAPTCHA with Playwright.
However, while 2Captcha can be a useful solution for testing and small-scale data extraction, it isn't the most cost-effective option for large-scale web scraping or solving all CAPTCHA types. The best approach is to prevent the challenge from being prompted.
Method #2: Use Playwright with the Stealth Plugin
The previous Playwright setup won't work if you need to scrape data from a website that uses more complex CAPTCHA challenges, but the Stealth plugin is a handy solution. It's an open-source project that strengthens Playwright with additional features to mimic human web traffic:
- It masks your User-Agent.
- It hides automation giveaways, such as the navigator.webdriver flag that browsers set when automation tools drive them.
- It fills in properties that a headless browser normally lacks, such as browser plugins and the window.chrome object, to make your requests appear more natural.
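To make the first point concrete, here's a minimal sketch of what masking a User-Agent can mean. It only illustrates the idea and isn't the plugin's actual code. It assumes the browser reports the "HeadlessChrome" token that headless Chromium has traditionally placed in its default User-Agent, and `maskHeadlessUserAgent` is a hypothetical name.

```javascript
// Illustration only: swap the "HeadlessChrome" token found in a headless
// User-Agent string for the regular "Chrome" token.
function maskHeadlessUserAgent(userAgent) {
  return userAgent.replace(/HeadlessChrome\//g, "Chrome/");
}
```

In plain Playwright, you could pass the result as the `userAgent` option of `browser.newContext()`. The plugin applies this kind of change for you, along with its other evasions.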
Let's make our example more vivid and test with Astra, a website with basic Cloudflare protection.
Before getting started, install the required dependencies by running this command inside your project folder:
npm install playwright playwright-extra puppeteer-extra-plugin-stealth
Note: The Stealth plugin is published as the puppeteer-extra-plugin-stealth package, and playwright-extra lets you load it into Playwright.
Supercharge Playwright by launching a headless Chromium browser through playwright-extra and enabling puppeteer-extra-plugin-stealth with chromium.use(pluginStealth()). This combination of tools makes it more difficult for websites to detect your web scraper.
const { chromium } = require('playwright-extra')
// Load the stealth plugin and use defaults (all tricks to hide playwright usage)
const pluginStealth = require("puppeteer-extra-plugin-stealth");
// The module exports a factory, so call it to create the plugin instance
chromium.use(pluginStealth())
// That's it, the rest is playwright usage as normal 😊
chromium.launch({ headless: true }).then(async browser => {
  // Create a new page
  const page = await browser.newPage()
  // Go to the website
  await page.goto('https://www.getastra.com/')
  // Give the page a moment to finish loading
  await page.waitForTimeout(1000);
  // Take screenshot
  await page.screenshot({ path: 'screen.png'})
  console.log('All done, check the screenshot. ✨')
  // Close the browser
  await browser.close()
})
With a fresh page created by browser.newPage() and loaded by page.goto(), the website is ready to be scraped.
Your script is now fully functional and saves a screenshot of the loaded page to screen.png.
Playwright with the Stealth plugin makes bypassing CAPTCHAs easier and more reliable than the previous method. However, some CAPTCHA systems may still detect and block your bot.
For example, when attempting to scrape websites with tougher Cloudflare protection, like G2, you may encounter an Access denied message when using the Stealth plugin.
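A scraper can check for this kind of outcome instead of silently saving a block page. The function below is a heuristic sketch: the status codes and text markers are assumptions chosen for illustration, `classifyResponse` is a hypothetical name, and real sites vary enough that you'd want to tune it against the pages you scrape.

```javascript
// Heuristic classification of a response as "ok", "captcha", or "blocked".
// The status codes and text markers are illustrative assumptions.
function classifyResponse(status, html) {
  const body = html.toLowerCase();
  if (body.includes("captcha")) {
    return "captcha";
  }
  if (status === 403 || status === 429 || body.includes("access denied")) {
    return "blocked";
  }
  return "ok";
}
```

With Playwright, the inputs could come from `const response = await page.goto(url)`, followed by `response.status()` and `await page.content()`.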
The ultimate solution for such cases is ZenRows. Let's learn about it!
Method #3: Best CAPTCHA Bypass with ZenRows
Unlike Playwright and other web automation frameworks, ZenRows is specifically designed for web crawling. It can solve even the most complex challenges of top-tier security systems, like Cloudflare (used by 1/5 of internet sites) and DataDome. You'll scrape G2 with it next to see that it works.
To try ZenRows, sign up to get your free API key and install it by running the following command:
npm install zenrows
Then, use the following code, which performs an API request with js_render and premium_proxy enabled.
const { ZenRows } = require("zenrows");

(async () => {
  const client = new ZenRows("<Your api key>");
  const url = "https://www.g2.com/";
  try {
    const { data } = await client.get(url, {
      "js_render": "true",
      "premium_proxy": "true"
    });
    console.log(data);
  } catch (error) {
    console.error(error.message);
    if (error.response) {
      console.error(error.response.data);
    }
  }
})();
Note: Remember to add your API key.
Run it and wait for beautiful success. 😌
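If you'd rather not add the SDK, a similar request can probably be made over plain HTTP. The sketch below only builds the request URL. It assumes ZenRows accepts `apikey`, `url`, `js_render`, and `premium_proxy` as query parameters on `https://api.zenrows.com/v1/`, so verify the endpoint and parameter names against the official documentation before relying on it. `buildZenRowsUrl` is a hypothetical helper name.

```javascript
// Assumption: ZenRows exposes an HTTP endpoint at https://api.zenrows.com/v1/
// that takes apikey, url, js_render and premium_proxy as query parameters.
function buildZenRowsUrl(apiKey, targetUrl, { jsRender = true, premiumProxy = true } = {}) {
  const endpoint = new URL("https://api.zenrows.com/v1/");
  endpoint.searchParams.set("apikey", apiKey);
  // searchParams percent-encodes the target URL for us
  endpoint.searchParams.set("url", targetUrl);
  if (jsRender) endpoint.searchParams.set("js_render", "true");
  if (premiumProxy) endpoint.searchParams.set("premium_proxy", "true");
  return endpoint.toString();
}
```

On Node.js 18 or later, you could pass the result to the built-in `fetch()` and read the HTML with `await response.text()`.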
Conclusion
Bypassing CAPTCHA with Playwright can be a hard task, as this popular challenge is designed to prevent automated access to websites. However, by using the right tools and libraries, you'll be able to scrape the data you want.
In this article, we saw three different methods to deal with CAPTCHAs:
- Using base Playwright and 2Captcha.
- Using Playwright with the Stealth plugin.
- Masking requests with ZenRows.
The best solution depends on your specific needs, but ZenRows is a reliable option able to bypass even the toughest anti-bot challenges. Make the most of its free trial and make your first requests with it.