How to Bypass CAPTCHA With Pyppeteer (Python)

May 16, 2024 · 8 min read

Tired of facing CAPTCHA challenges while web scraping? You're not alone. Automation tools like Pyppeteer often trigger anti-bot systems that display CAPTCHAs.

But today, you'll learn how to bypass CAPTCHAs with Pyppeteer using two methods: a paid CAPTCHA solver and a web scraping API.

Can Pyppeteer Solve CAPTCHA?

Yes, Pyppeteer can solve CAPTCHA challenges. As a Python port of the powerful Puppeteer, it allows you to automate browser interactions, including solving CAPTCHAs.

There are generally two approaches to solving CAPTCHAs. The first involves using a third-party solver to get the CAPTCHA solution and continuing the script execution once the CAPTCHA is solved.

The second approach is to avoid CAPTCHAs altogether. Websites often employ CAPTCHAs to prevent automated access. So, if your requests appear human to the target server, its anti-bot system won't present a CAPTCHA challenge.

But how can you emulate human browser behavior?

Let's find out.

Method #1: Use a Paid CAPTCHA Solver With Pyppeteer

Third-party services typically work by outsourcing the tasks to a human workforce or using advanced algorithms to solve the challenges automatically.

2captcha, a popular CAPTCHA-solving service, provides an API that makes it easy to submit CAPTCHA challenges and receive solutions. It uses a two-step process: first, you send a request containing the CAPTCHA data you want solved; then, you query for the solution using the returned request ID.

The data you send differs depending on the CAPTCHA type. For example, if you're dealing with an image-based CAPTCHA, the data would be a base64 encoded string of the image.
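
As a rough illustration of the image-based case, the sketch below base64-encodes raw image bytes into a form payload. The field names `method='base64'` and `body` are assumptions about how 2captcha accepts image uploads, and `build_image_captcha_payload` is a hypothetical helper, so check both against the service's documentation before relying on them.

```python
import base64


def build_image_captcha_payload(image_bytes: bytes, api_key: str) -> dict:
    """Return form fields for submitting an image CAPTCHA as a base64 string."""
    return {
        'key': api_key,
        'method': 'base64',  # assumed parameter value for base64 image uploads
        'body': base64.b64encode(image_bytes).decode('ascii'),  # assumed field name
        'json': 1,
    }


# placeholder bytes standing in for a real CAPTCHA image file
payload = build_image_captcha_payload(b'\x89PNG fake image bytes', 'YOUR_2CAPTCHA_API_KEY')
print(payload['body'])
```

In a real script, you would read the bytes from the downloaded CAPTCHA image and send the dictionary as form data.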

However, in this tutorial, you'll solve Google's reCAPTCHA, the most common type these days.

Here, the data is the reCAPTCHA sitekey, which you can find by opening your browser's developer tools and looking for an element with the data-sitekey attribute.

Example
<body>
    <!-- ... -->
    <div id="recaptcha-demo" class="g-recaptcha" data-sitekey="6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-" data-callback="onSuccess" data-action="action"></div>
</body>
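
If you'd rather not look up the sitekey by hand, a small parser can pull it out of the page source. This is a minimal sketch using only the standard library; `SitekeyParser` and `find_sitekey` are hypothetical names, and it assumes the attribute is present in the static HTML rather than injected later by JavaScript.

```python
from html.parser import HTMLParser


class SitekeyParser(HTMLParser):
    """Collects the first data-sitekey attribute found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.sitekey = None

    def handle_starttag(self, tag, attrs):
        # keep only the first sitekey encountered
        if self.sitekey is None:
            self.sitekey = dict(attrs).get('data-sitekey')


def find_sitekey(html: str):
    """Return the data-sitekey value, or None if no element carries one."""
    parser = SitekeyParser()
    parser.feed(html)
    return parser.sitekey


sample_html = '''
<body>
  <div id="recaptcha-demo" class="g-recaptcha"
       data-sitekey="6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-"></div>
</body>
'''
print(find_sitekey(sample_html))
```

With Pyppeteer, you could pass the string returned by `await page.content()` to `find_sitekey` once the page has loaded.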

But before going any further, sign up to get your 2captcha API key. Also, install Pyppeteer and Requests using the following command in your terminal.

Terminal
pip install pyppeteer requests

Next, import the required libraries: asyncio (for handling asynchronous operations), launch (a function from the Pyppeteer library), time, and the Requests library (for making HTTP requests).

program.py
import asyncio
from pyppeteer import launch
import requests
import time

After that, define the main function as asynchronous, launch a headless browser, and open a new page.

program.py
#..
 
async def main():
    # launch headless browser
    browser = await launch()
    # open new page
    page = await browser.newPage()

Next, define the target URL (URL containing the CAPTCHA you want to solve), the sitekey, and your 2captcha API key.

program.py
#..
    # URL of the page containing the CAPTCHA challenge
    page_url = 'https://www.google.com/recaptcha/api2/demo'
    
    # Google sitekey of the CAPTCHA challenge
    sitekey = '6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-'
 
    # your 2captcha API key
    api_key = 'YOUR_2CAPTCHA_API_KEY'

Now, navigate to the target URL, make a POST request to 2captcha, and retrieve the resulting captcha_id.

program.py
#..
    # navigate to the target URL
    await page.goto(page_url)
 
    # send a request to 2captcha to solve the CAPTCHA challenge
    response = requests.post(f'https://2captcha.com/in.php?key={api_key}&method=userrecaptcha&googlekey={sitekey}&pageurl={page_url}&json=1')
    # retrieve the response
    captcha_id = response.json()['request']

The code snippet above dynamically constructs the endpoint's URL using an f-string that incorporates the sitekey, api_key, and page_url. It also specifies that the response should be in JSON format (json=1).
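
One caveat with building the URL through an f-string is that the page URL is inserted without encoding, which can break the query string if the target URL has its own parameters. The sketch below shows one possible alternative using `urllib.parse.urlencode`; `build_submit_url` is a hypothetical helper, and the parameter names are the ones used in the snippet above.

```python
from urllib.parse import urlencode


def build_submit_url(api_key: str, sitekey: str, page_url: str) -> str:
    """Build the submission URL with every parameter percent-encoded."""
    params = {
        'key': api_key,
        'method': 'userrecaptcha',
        'googlekey': sitekey,
        'pageurl': page_url,
        'json': 1,
    }
    return 'https://2captcha.com/in.php?' + urlencode(params)


submit_url = build_submit_url(
    'YOUR_2CAPTCHA_API_KEY',
    '6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-',
    'https://www.google.com/recaptcha/api2/demo',
)
print(submit_url)
```

Requests performs the same encoding if you pass the dictionary through its `params` argument instead of formatting the URL yourself.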

Once you've received the captcha_id, wait a few seconds (10 - 20). Then, use it to query 2captcha for the solution.

To do this, initialize a solution variable as an empty string and start a querying loop. Within the loop, make a GET request to the 2captcha result endpoint (res.php, not the in.php endpoint used for submission) using your API key and captcha_id. Then, retrieve the response.

program.py
#..
    # wait a few seconds 
    time.sleep(10)
    
    # Query 2captcha for the solution with automatic retries
    max_retries = 10  # Maximum number of retries
    solution = ''
    print('Querying for solution...')
    for retry_count in range(max_retries):
        time.sleep(5)  # Wait for 5 seconds before querying again
        print(f'Retry {retry_count + 1}...')
        # make GET request
        response = requests.get(f'https://2captcha.com/res.php?key={api_key}&action=get&id={captcha_id}&json=1')
        # retrieve response
        response_data = response.json()
        if response_data.get('status') == 1:  # solved: 'request' holds the token
            solution = response_data['request']
            break
        # 'CAPCHA_NOT_READY' is 2captcha's own spelling; any other value is an API error
        if response_data.get('request') != 'CAPCHA_NOT_READY':
            print('2captcha error:', response_data.get('request'))
            break
 
    if not solution:
        # stop here instead of injecting an empty or invalid token
        print('CAPTCHA solution not obtained.')
        await browser.close()
        return
    print('Solution:', solution)
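
The polling logic can also be isolated into a helper, which makes it easier to test without spending solver credits. The sketch below is one possible shape, assuming the JSON response carries a `status` field that equals 1 once the token is ready; `poll_for_solution` and its `fetch` argument are hypothetical names, and the responses are simulated.

```python
import time


def poll_for_solution(fetch, max_retries=10, delay=5.0, sleep=time.sleep):
    """Call fetch() until it reports a solved CAPTCHA or the retries run out.

    fetch is a hypothetical callable returning the decoded JSON dict,
    for example {'status': 1, 'request': '<token>'}.
    """
    for _ in range(max_retries):
        sleep(delay)
        data = fetch()
        if data.get('status') == 1:
            return data['request']  # solved: 'request' holds the token
        if data.get('request') != 'CAPCHA_NOT_READY':
            # anything other than "not ready" is treated as an API error
            raise RuntimeError(f"2captcha error: {data.get('request')}")
    return None  # retries exhausted without a solution


# simulated responses standing in for real API calls
responses = iter([
    {'status': 0, 'request': 'CAPCHA_NOT_READY'},
    {'status': 0, 'request': 'CAPCHA_NOT_READY'},
    {'status': 1, 'request': 'demo-token'},
])
token = poll_for_solution(lambda: next(responses), sleep=lambda seconds: None)
print(token)
```

In the tutorial script, `fetch` would wrap the `requests.get` call to the result endpoint and return `response.json()`.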

Here's what the solution token looks like.

Output
03AHJ_Vuve5Asa4koK3KSMyUkCq0vUFCR5Im4CwB7PzO3dCxIo11i53epEraq-uBO5mVm2XRikL8iKOWr0aG50sCuej9bXx5qcviUGSm4iK4NC_Q88flavWhaTXSh0VxoihBwBjXxwXuJZ-WGN5Sy4dtUl2wbpMqAj8Zwup1vyCaQJWFvRjYGWJ_TQBKTXNB5CCOgncqLetmJ6B6Cos7qoQyaB8ZzBOTGf5KSP6e-K9niYs772f53Oof6aJeSUDNjiKG9gN3FTrdwKwdnAwEYX-F37sI_vLB1Zs8NQo0PObHYy0b0sf7WSLkzzcIgW9GR0FwcCCm1P8lB-50GQHPEBJUHNnhJyDzwRoRAkVzrf7UkV8wKCdTwrrWqiYDgbrzURfHc2ESsp020MicJTasSiXmNRgryt-gf50q5BMkiRH7osm4DoUgsjc_XyQiEmQmxl5sqZP7aKsaE-EM00x59XsPzD3m3YI6SRCFRUevSyumBd7KmXE8VuzIO9lgnnbka4-eZynZa6vbB9cO3QjLH0xSG3-egcplD1uLGh79wC34RF49Ui3eHwua4S9XHpH6YBe7gXzz6_mv-o-fxrOuphwfrtwvvi2FGfpTexWvxhqWICMFTTjFBCEGEgj7_IFWEKirXW2RTZCVF0Gid7EtIsoEeZkPbrcUISGmgtiJkJ_KojuKwImF0G0CsTlxYTOU2sPsd5o1JDt65wGniQR2IZufnPbbK76Yh_KI2DY4cUxMfcb2fAXcFMc9dcpHg6f9wBXhUtFYTu6pi5LhhGuhpkiGcv6vWYNxMrpWJW_pV7q8mPilwkAP-zw5MJxkgijl2wDMpM-UUQ_k37FVtf-ndbQAIPG7S469doZMmb5IZYgvcB4ojqCW3Vz6Q

Now that you have the solution, inject it into the page. For this, inspect the CAPTCHA element and find the textarea element with id="g-recaptcha-response".

Input the solution there and click the Submit button to solve the challenge.

program.py
#..
    # inject the solved CAPTCHA solution into the page and click Submit button
    await page.evaluate('''(solution) => {
        document.getElementById('g-recaptcha-response').innerHTML = solution;        
    }''', solution)
 
    # Click the submit button
    await page.click('#recaptcha-demo-submit')
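
If the reCAPTCHA widget loads slowly, the textarea may not exist yet when the script tries to fill it. A possible safeguard, sketched below, is to wait for the element first with Pyppeteer's `waitForSelector`. The names `inject_token`, `INJECT_JS`, and `FakePage` are hypothetical, and the fake page only exists so the sketch runs without launching a browser.

```python
import asyncio

INJECT_JS = '''(token) => {
    document.getElementById('g-recaptcha-response').innerHTML = token;
}'''


async def inject_token(page, token, timeout_ms=10000):
    """Wait for the reCAPTCHA response field, then write the token into it."""
    await page.waitForSelector('#g-recaptcha-response', {'timeout': timeout_ms})
    await page.evaluate(INJECT_JS, token)


class FakePage:
    """Stand-in for a Pyppeteer page that records the calls it receives."""

    def __init__(self):
        self.calls = []

    async def waitForSelector(self, selector, options=None):
        self.calls.append(('waitForSelector', selector))

    async def evaluate(self, script, *args):
        self.calls.append(('evaluate', args))


fake_page = FakePage()
asyncio.run(inject_token(fake_page, 'demo-token'))
print(fake_page.calls)
```

In the real script, you would pass the `page` object created by `browser.newPage()` and the token returned by 2captcha.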

After that, take a screenshot to see your result, close the browser, and run the main function.

program.py
#.. 
    # take screenshot
    await page.screenshot({'path': 'solved.png'})
 
    # Close the browser
    await browser.close()
 
asyncio.get_event_loop().run_until_complete(main())

Putting everything together, here's the complete code.

program.py
import asyncio
from pyppeteer import launch
import requests
import time
 
async def main():
    # launch headless browser
    browser = await launch()
    # open new page
    page = await browser.newPage()
 
    # URL of the page containing the CAPTCHA challenge
    page_url = 'https://www.google.com/recaptcha/api2/demo'
    
    # Google sitekey of the CAPTCHA challenge
    sitekey = '6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-'
 
    # your 2captcha API key
    api_key = '<YOUR_2CAPTCHA_API_KEY>'
 
    # navigate to the target URL
    await page.goto(page_url)
 
 
    print('Making POST request to retrieve CAPTCHA ID')
    # send a request to 2captcha to solve the CAPTCHA challenge
    response = requests.post(f'https://2captcha.com/in.php?key={api_key}&method=userrecaptcha&googlekey={sitekey}&pageurl={page_url}&json=1')
    # retrieve the response
    captcha_id = response.json()['request']
 
    print('Waiting for 10 seconds after retrieving CAPTCHA ID')
    # wait a few seconds
    time.sleep(10)
 
    # query 2captcha for the solution with automatic retries
    max_retries = 10  # Maximum number of retries
    solution = ''
    print('Querying for solution...')
    for retry_count in range(max_retries):
        time.sleep(5)  # Wait for 5 seconds before querying again
        print(f'Retry {retry_count + 1}...')
        # make GET request
        response = requests.get(f'https://2captcha.com/res.php?key={api_key}&action=get&id={captcha_id}&json=1')
        # retrieve response
        response_data = response.json()
        if response_data.get('status') == 1:  # solved: 'request' holds the token
            solution = response_data['request']
            break
        # 'CAPCHA_NOT_READY' is 2captcha's own spelling; any other value is an API error
        if response_data.get('request') != 'CAPCHA_NOT_READY':
            print('2captcha error:', response_data.get('request'))
            break
 
    if not solution:
        # stop here instead of injecting an empty or invalid token
        print('CAPTCHA solution not obtained.')
        await browser.close()
        return
    print('Solution:', solution)
 
    # inject the solved CAPTCHA solution into the page and click Submit button
    await page.evaluate('''(solution) => {
        document.getElementById('g-recaptcha-response').innerHTML = solution;        
    }''', solution)
    print('Solution injected successfully')
 
    # Click the submit button
    await page.click('#recaptcha-demo-submit')
 
    # take a screenshot
    await page.screenshot({'path': 'solved.png'})
 
    # close the browser
    await browser.close()
 
asyncio.get_event_loop().run_until_complete(main())

Run it, and you should have the following result.

Result Page

Congratulations! You've solved your first Pyppeteer CAPTCHA.

Not to spoil the party, but while 2captcha is suitable for testing, it slows down your automation. Also, it only works for some CAPTCHA types and can become expensive, particularly in large-scale web scraping.

Method #2: Bypass CAPTCHA With a Web Scraping API

As mentioned earlier, you can also deal with CAPTCHAs by avoiding them: if your scraper flies under the radar by mimicking natural user behavior, the challenge never appears.

While Pyppeteer's headless browser capabilities enable you to emulate browser interactions, it has some limitations that make mimicking human behavior challenging. For example, its automation properties are easily detected by websites. Also, it can get slow and resource-intensive during large-scale scraping.
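
To make one such automation property concrete: headless Chrome typically announces itself with a "HeadlessChrome" token in its default User-Agent header. The sketch below shows the kind of trivial check a server could run. Both User-Agent strings are illustrative examples rather than values captured from Pyppeteer, and `looks_headless` is a hypothetical helper.

```python
def looks_headless(user_agent: str) -> bool:
    """Return True if the User-Agent string advertises headless Chrome."""
    return 'headlesschrome' in user_agent.lower()


# illustrative strings, not captured from a real session
default_headless_ua = (
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) HeadlessChrome/71.0.3542.0 Safari/537.36'
)
regular_ua = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
)

print(looks_headless(default_headless_ua))
print(looks_headless(regular_ua))
```

In Pyppeteer, `await page.setUserAgent(regular_ua)` replaces the default string, but anti-bot systems usually check many other signals, so this alone is unlikely to be enough.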

These limitations prompt the need for alternatives. Luckily, ZenRows, a web scraping API, provides the same headless browser functionality but without the additional overhead. This tool enables you to scrape any web page, including those protected by CAPTCHAs.

Let's see ZenRows in action against G2, a website that presents a CAPTCHA challenge when you trigger its anti-bot systems.

But before diving in, here's Pyppeteer against the same target web page.

program.py
import asyncio
from pyppeteer import launch
 
async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://www.g2.com/products/asana/reviews')
    await page.screenshot({'path': 'blocked.png'})
    await browser.close()
 
asyncio.get_event_loop().run_until_complete(main())

Below is the result.

G2 Page Blocked

Here, Pyppeteer triggers the page's anti-bot system and is presented with the Turnstile CAPTCHA challenge.

"Why not solve it with 2captcha?" you might ask.

Unfortunately, CAPTCHA-solving services such as 2captcha are not a dependable fix for this type of challenge. Your best bet is to avoid the CAPTCHA altogether.

Here's how to achieve that using ZenRows.

To get started, sign up, and you'll get redirected to the Request Builder page.

ZenRows Request Builder

Paste your target URL, check the box for Premium Proxies, and activate the JavaScript Rendering boost mode.

Then, select a language (Python), and you’ll get your script ready to try. Your code should look like this.

program.py
import requests
 
url = 'https://www.g2.com/products/asana/reviews'
apikey = '<YOUR_ZENROWS_API_KEY>'
params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'premium_proxy': 'true',
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)

Run it, and you'll get the page's HTML.

Output
<!DOCTYPE html>
 
#...
 
<title>Asana Reviews 2023: Details, Pricing, &amp; Features | G2</title>
 
#...
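
A quick way to confirm that you received the real page rather than a challenge page is to check the title in the returned HTML. The sketch below uses a regular expression for brevity; `extract_title` is a hypothetical helper, and the sample string stands in for `response.text`.

```python
import html
import re


def extract_title(page_html: str):
    """Return the unescaped <title> text, or None if the page has no title."""
    match = re.search(r'<title[^>]*>(.*?)</title>', page_html, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    return html.unescape(match.group(1)).strip()


# sample markup standing in for response.text
sample = (
    '<!DOCTYPE html><html><head>'
    '<title>Asana Reviews 2023: Details, Pricing, &amp; Features | G2</title>'
    '</head><body></body></html>'
)
print(extract_title(sample))
```

For anything beyond a sanity check like this, a proper HTML parser is a safer choice than a regular expression.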

Awesome, right? That's how easy it is to bypass CAPTCHAs using ZenRows.

Conclusion

CAPTCHAs can frustrate your web scraping efforts. But with the help of third-party services like 2captcha, you can bypass CAPTCHAs using Pyppeteer. However, Pyppeteer does not work against advanced anti-bot systems, so consider ZenRows for the best results.

Additionally, for a deeper dive into CAPTCHA solving, check out the 7 ways to bypass CAPTCHAs while web scraping.