CAPTCHAs can stop a web scraper in its tracks, and running into one mid-scrape is frustrating. Luckily, you can use Playwright to bypass CAPTCHA, and we'll walk you through three methods: solving CAPTCHAs with 2Captcha, avoiding them with the Playwright Stealth plugin, and bypassing them with ZenRows.
Read on if you're tired of dealing with CAPTCHA interruptions while scraping.
Why Playwright Alone Isn't Enough to Bypass CAPTCHA
Playwright is a valuable web scraping tool, as it can handle dynamic websites and mimic human users. Unfortunately, it has bot-like attributes that most websites detect quickly. Consequently, it can't bypass CAPTCHA challenges on its own.
For instance, Playwright presents bot-like fingerprints like the presence of an automated WebDriver, a HeadlessChrome parameter in the headless User Agent string, missing plugins like the PDF Viewer, misconfigured renderers, etc.
All these factors indicate to the target site that you're trying to gain automated access to extract data. CAPTCHAs are designed to be challenging for automated bots but easy for humans to solve.
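To make these signals concrete, here's a minimal sketch that reads three of them from a headless Playwright page. The helper names `find_bot_signals` and `collect_fingerprint` are hypothetical, and real anti-bot systems inspect far more than these three properties, so treat it as an illustration rather than a detection test.

```python
def find_bot_signals(fingerprint):
    """Return the bot-like signals found in a browser fingerprint dict."""
    signals = []
    if fingerprint.get("webdriver"):
        signals.append("navigator.webdriver is true")
    if "HeadlessChrome" in fingerprint.get("user_agent", ""):
        signals.append("HeadlessChrome in User Agent")
    if fingerprint.get("plugin_count", 0) == 0:
        signals.append("no browser plugins")
    return signals


def collect_fingerprint(url="about:blank"):
    """Read a few fingerprint properties from a headless Playwright page."""
    # imported here so find_bot_signals stays usable without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # evaluate a small function in the page and return a plain object
        fingerprint = page.evaluate(
            """() => ({
                webdriver: navigator.webdriver,
                user_agent: navigator.userAgent,
                plugin_count: navigator.plugins.length,
            })"""
        )
        browser.close()
    return fingerprint
```

Calling `find_bot_signals(collect_fingerprint())` lists whichever of the three signals your local Playwright build exposes.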
However, while these limitations make Playwright detectable by anti-bot systems, you can make it more effective for web scraping tasks by fortifying it with the correct Playwright CAPTCHA bypass techniques. This approach typically involves pairing Playwright with complementary tools to bypass CAPTCHAs.
Although you can attempt to solve the CAPTCHA test when it appears, it's better to prevent it from appearing at all.
Solving the CAPTCHA requires a Playwright CAPTCHA solver, which can be slow and expensive, making it unsuitable for large-scale scraping. Bypassing CAPTCHAs instead requires your scraper to simulate human behavior well enough to stay under the radar.
Let's see how to implement these solutions.
Method #1: Use 2Captcha for Playwright CAPTCHA Solving
The first method you'll learn is using Playwright with 2Captcha, a service that solves CAPTCHAs by employing humans on your behalf.
Let's see how it works using a reCAPTCHA demo page as the target.
To get started with Playwright CAPTCHA solving, install the library.
pip3 install 2captcha-python
Add the 2captcha-python library to your imports and specify the target site. Start a browser in headless mode and instantiate the CAPTCHA solver with your 2Captcha API key (create a 2Captcha account to obtain one):
# pip3 install playwright 2captcha-python
from playwright.sync_api import sync_playwright
from twocaptcha import TwoCaptcha
# target URL with reCAPTCHA
url = "https://patrickhlauke.github.io/recaptcha/"
# run Playwright
with sync_playwright() as p:
    # launch the browser in headless mode
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    solver = TwoCaptcha("<YOUR_API_KEY>")
Open the target URL and obtain the iFrame containing the CAPTCHA box. Switch to the iFrame and extract the site key from its src attribute. Click the CAPTCHA checkbox to spin the image puzzle:
# ...
# run Playwright
with sync_playwright() as p:
    # ...
    # open the target URL
    page.goto(url)
    # obtain the iFrame containing the CAPTCHA box
    captcha_frame = page.wait_for_selector("iframe[src*='recaptcha']")
    # switch to the content of the CAPTCHA iframe
    captcha_frame_content = captcha_frame.content_frame()
    # extract site key for the CAPTCHA
    site_key = captcha_frame.get_attribute("src").split("k=")[-1].split("&")[0]
    # get the CAPTCHA checkbox element
    captcha_checkbox = captcha_frame_content.wait_for_selector("#recaptcha-anchor")
    # click the CAPTCHA checkbox
    captcha_checkbox.click()
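Splitting the src string on "k=" works for this demo page, but it can pick up the wrong value if another query parameter happens to end in "k". As a hedged alternative, the sketch below parses the query string with the standard library instead. The helper name `extract_site_key` is hypothetical, and it assumes the site key sits in the `k` parameter of the iFrame's src, as the snippet above does.

```python
from urllib.parse import urlparse, parse_qs


def extract_site_key(iframe_src):
    """Return the value of the `k` query parameter, or None if it's missing."""
    query = parse_qs(urlparse(iframe_src).query)
    values = query.get("k")
    return values[0] if values else None
```

You could then replace the split-based line with `site_key = extract_site_key(captcha_frame.get_attribute("src"))` and stop early if it returns None.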
Call the solver object with the site key and retrieve the token from the result. Enter the token into the response field to solve the on-page CAPTCHA. Log the input value to confirm token generation:
# ...
# run Playwright
with sync_playwright() as p:
    # ...
    # solve CAPTCHA
    captcha_response = solver.recaptcha(sitekey=site_key, url=url)
    # extract the reCAPTCHA token from the response
    captcha_token = captcha_response.get("code") if captcha_response else None
    if captcha_token:
        # fill in the CAPTCHA token in the hidden input
        entered_token = page.evaluate(
            f'document.querySelector("#g-recaptcha-response").value="{captcha_token}"'
        )
        # check if the token has been entered correctly
        print(entered_token)
    # ... further actions (e.g., trigger form submission or specific action)
    # wait to observe the result
    page.wait_for_timeout(5000)
    browser.close()
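The snippet above builds the JavaScript by interpolating the token into a string. A slightly more defensive variant, sketched below, passes the token as an argument to `page.evaluate` and returns None when the response field is missing. The names `INJECT_TOKEN_JS` and `inject_recaptcha_token` are hypothetical, and the `#g-recaptcha-response` selector is taken from the tutorial's demo page.

```python
# JavaScript function evaluated in the page; Playwright calls it with the token
INJECT_TOKEN_JS = """token => {
    const field = document.querySelector('#g-recaptcha-response');
    if (!field) return null;
    field.value = token;
    return field.value;
}"""


def inject_recaptcha_token(page, token):
    """Write the token into the response field and return the stored value.

    Returns None when the page has no #g-recaptcha-response element.
    """
    return page.evaluate(INJECT_TOKEN_JS, token)
```

Inside the `with` block, `print(inject_recaptcha_token(page, captcha_token))` would then log the stored token, or None if the field wasn't found.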
Merge the snippets. Here's the complete code:
# pip3 install playwright 2captcha-python
from playwright.sync_api import sync_playwright
from twocaptcha import TwoCaptcha
# target URL with reCAPTCHA
url = "https://patrickhlauke.github.io/recaptcha/"
# run Playwright
with sync_playwright() as p:
    # launch the browser in headless mode
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    solver = TwoCaptcha("<YOUR_API_KEY>")
    # open the target URL
    page.goto(url)
    # obtain the iFrame containing the CAPTCHA box
    captcha_frame = page.wait_for_selector("iframe[src*='recaptcha']")
    # switch to the content of the CAPTCHA iframe
    captcha_frame_content = captcha_frame.content_frame()
    # extract site key for the CAPTCHA
    site_key = captcha_frame.get_attribute("src").split("k=")[-1].split("&")[0]
    # get the CAPTCHA checkbox element
    captcha_checkbox = captcha_frame_content.wait_for_selector("#recaptcha-anchor")
    # click the CAPTCHA checkbox
    captcha_checkbox.click()
    # solve CAPTCHA
    captcha_response = solver.recaptcha(sitekey=site_key, url=url)
    # extract the reCAPTCHA token from the response
    captcha_token = captcha_response.get("code") if captcha_response else None
    if captcha_token:
        # fill in the CAPTCHA token in the hidden input
        entered_token = page.evaluate(
            f'document.querySelector("#g-recaptcha-response").value="{captcha_token}"'
        )
        # check if the token has been entered correctly
        print(entered_token)
    page.screenshot(path="screengrab.png")
    # ... further actions (e.g., trigger form submission or specific action)
    # wait to observe the result
    page.wait_for_timeout(5000)
    browser.close()
Amazing! You've built your first Playwright CAPTCHA solver.
However, while 2Captcha can be a useful solution for small-scale data extraction, it doesn't work at scale and isn't suitable for solving all CAPTCHA types. As mentioned earlier, the best approach is to prevent the challenge from being triggered in the first place.
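To see why solving doesn't scale, a quick back-of-the-envelope estimate helps. The sketch below is only arithmetic: the price per 1,000 solves, the average solve time, and the concurrency are hypothetical placeholders, not 2Captcha's actual rates, so substitute the figures from your own plan.

```python
def estimate_solving_cost(num_captchas, price_per_1000_usd, avg_solve_seconds, concurrency=1):
    """Return (total cost in USD, total wall-clock hours) for solving CAPTCHAs."""
    cost_usd = num_captchas / 1000 * price_per_1000_usd
    total_hours = num_captchas * avg_solve_seconds / concurrency / 3600
    return cost_usd, total_hours


# hypothetical inputs: 100,000 pages, $3 per 1,000 solves, 20 s per solve, 10 in parallel
cost, hours = estimate_solving_cost(100_000, 3.0, 20, concurrency=10)
print(f"${cost:.2f}, {hours:.1f} hours")  # $300.00, 55.6 hours
```

Under these assumed numbers, every scraping run pays both the fee and more than two days of waiting, which is why avoiding the challenge is usually the better option.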
Method #2: Bypass CAPTCHAs With Playwright Stealth Plugin
The Playwright Stealth plugin is a handy solution for bypassing CAPTCHAs. It's an open-source Playwright Extra plugin that strengthens Playwright with various evasion techniques to mimic human behavior during web scraping.
For example, the Stealth plugin patches the Playwright User Agent, spoofs a real browser's runtime to mimic an actual browser, turns off WebRTC to prevent IP address identification, changes the WebDriver navigator field from true to false, etc.
Let's make our example more concrete and test it with the Anti-bot Challenge page.
Before getting started, install the required dependencies by running this command inside your project folder:
pip3 install playwright-stealth
Import the Stealth package, launch a new headless browser instance, and add the plugin to Playwright by calling stealth_sync:
# pip3 install playwright playwright-stealth
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
# launch the Playwright instance
with sync_playwright() as playwright:
    # launch the browser
    browser = playwright.chromium.launch(headless=True)
    # create a new page
    page = browser.new_page()
    # apply stealth settings to the page
    stealth_sync(page)
Open the protected target site and take its screenshot:
# ...
# launch the Playwright instance
with sync_playwright() as playwright:
    # ...
    # navigate to the desired URL
    page.goto("https://www.scrapingcourse.com/antibot-challenge")
    # wait for any dynamic content to load
    page.wait_for_load_state("networkidle")
    # take a screenshot of the page
    page.screenshot(path="screenshot.png")
    # close the browser
    browser.close()
Here's the complete code after combining both snippets:
# pip3 install playwright playwright-stealth
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
# launch the Playwright instance
with sync_playwright() as playwright:
    # launch the browser
    browser = playwright.chromium.launch(headless=True)
    # create a new page
    page = browser.new_page()
    # apply stealth settings to the page
    stealth_sync(page)
    # navigate to the desired URL
    page.goto("https://www.scrapingcourse.com/antibot-challenge")
    # wait for any dynamic content to load
    page.wait_for_load_state("networkidle")
    # take a screenshot of the page
    page.screenshot(path="screenshot.png")
    # close the browser
    browser.close()
However, running the above code generates a screenshot showing that the Stealth plugin couldn't bypass the Turnstile CAPTCHA.
The Playwright Stealth Plugin failed because it doesn't work against advanced CAPTCHA technologies like the Cloudflare Turnstile.
Although we expected the plugin to help us avoid triggering the Turnstile CAPTCHA checkbox, it didn't work because it still leaks some bot-like properties that Cloudflare doesn't overlook. That said, the current scraper can still work with simpler CAPTCHA protections, but not at scale.
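A screenshot has to be inspected by eye, so a scraper usually needs a programmatic check as well. The sketch below is one simple way to do it, assuming the demo page shows the heading "You bypassed the Antibot challenge!" once the challenge is passed. The helper name `challenge_passed` is hypothetical, and other sites would need their own marker text.

```python
SUCCESS_MARKER = "You bypassed the Antibot challenge!"


def challenge_passed(html, marker=SUCCESS_MARKER):
    """Return True if the page HTML contains the post-challenge success text."""
    return marker in html
```

In the script above, calling `challenge_passed(page.content())` just before `browser.close()` would tell you whether the stealth settings got through.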
There aren't many effective CAPTCHA-bypass options for Playwright. The ultimate solution for such cases is ZenRows. Let's learn more about it!
Method #3: Best CAPTCHA Bypass With ZenRows
ZenRows is the best solution for bypassing CAPTCHAs automatically. It features all the toolkits for successful web crawling and scraping, including premium proxy rotation, request header management, JavaScript rendering support, CAPTCHA auto-bypass, and more.
It bypasses even the most complex challenges posed by top-tier security systems, like Cloudflare (used by 1/5 of internet sites), DataDome, etc. ZenRows even helps you handle advanced fingerprinting with headless browsing, allowing you to simulate human interactions while scraping. As a result, it can serve as a substitute for browser automation tools like Playwright.
Let's try scraping the previous Anti-bot Challenge page with ZenRows to see how it works.
Sign up to open the ZenRows Request Builder. Paste the target URL in the link box and activate Premium Proxies and JS Rendering.
Next, select your programming language (Python, in this case) and choose the API connection mode. Copy and paste the generated code into your Python script.
The generated code should look like this:
# pip install requests
import requests
url = 'https://www.scrapingcourse.com/antibot-challenge'
apikey = '<YOUR_ZENROWS_API_KEY>'
params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'premium_proxy': 'true',
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
The above scraper accesses the protected website and scrapes its full-page HTML, as shown:
<html lang="en">
  <head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
  </head>
  <body>
    <!-- ... -->
    <h2>
      You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
  </body>
</html>
Bravo! 💪 You just bypassed a CAPTCHA challenge with ZenRows.
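The generated code prints whatever comes back, even when the request fails. For a real scraper you'd likely want a timeout, a status check, and a few retries. Here's a minimal sketch under those assumptions: the helper name `fetch_with_zenrows`, the retry count, the 60-second timeout, and the backoff values are all hypothetical choices, and retrying on every non-200 status is a simplification.

```python
import time


def fetch_with_zenrows(url, apikey, http_get=None, retries=3, backoff_seconds=2.0):
    """Fetch `url` through the ZenRows API, retrying on non-200 responses."""
    if http_get is None:
        # imported lazily so the helper can be exercised with a stand-in function
        import requests
        http_get = requests.get

    params = {
        "url": url,
        "apikey": apikey,
        "js_render": "true",
        "premium_proxy": "true",
    }
    last_status = None
    for attempt in range(1, retries + 1):
        response = http_get("https://api.zenrows.com/v1/", params=params, timeout=60)
        if response.status_code == 200:
            return response.text
        last_status = response.status_code
        if attempt < retries:
            # wait a little longer after each failed attempt
            time.sleep(backoff_seconds * attempt)
    raise RuntimeError(
        f"ZenRows request failed after {retries} attempts (last status: {last_status})"
    )
```

Calling `fetch_with_zenrows('https://www.scrapingcourse.com/antibot-challenge', '<YOUR_ZENROWS_API_KEY>')` returns the page HTML on success and raises an error otherwise, so a failed request never gets mistaken for scraped data.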
Conclusion
Bypassing CAPTCHAs with Playwright can be tricky, as this popular challenge is designed to prevent automated website access. There are only a few solutions to bypass CAPTCHA with Playwright. Solving CAPTCHAs isn't sustainable, and the Playwright Stealth plugin also falls short against complex CAPTCHA challenges.
Fortunately, ZenRows is a reliable option to bypass even the toughest CAPTCHA and anti-bot challenges. All it takes is a single API request.
Try ZenRows for free now, no credit card required!