
How to Bypass CAPTCHA With Python Requests

Favour Kelvin
October 25, 2024 · 3 min read

CAPTCHAs are one of the biggest challenges you'll face when scraping websites with Python Requests. These frustrating tests can easily halt your scraping progress.

Luckily, there are several proven ways to bypass CAPTCHA, but for this tutorial, we'll focus on the following three methods:

  • Solving CAPTCHAs with a third-party service like 2Captcha.
  • Avoiding CAPTCHAs entirely with a web scraping API.
  • Rotating User Agents to reduce the chance of triggering CAPTCHAs.

Let's dive in and explore each method in detail.

Method #1: Bypass CAPTCHA with Python Requests and 2Captcha

Many websites use CAPTCHA to protect their content from bots and unauthorized access. These tests are designed to ensure that only human visitors can proceed, making them a big obstacle for web scrapers.

One common way to solve CAPTCHAs is by using third-party services like 2Captcha. These services often rely on human solvers or advanced algorithms to decode CAPTCHA challenges and return a solution. However, this process can take a while, which may slow down your scraping efforts.

Let's put it to the test with the CAPTCHA challenge on 2Captcha's demo page.

Start by installing the necessary dependencies:

Terminal
pip3 install requests beautifulsoup4 2captcha-python

Requests makes the HTTP requests, 2Captcha solves the CAPTCHA challenge, and BeautifulSoup parses the HTML to extract useful information. urljoin, which handles relative URLs, ships with Python's standard library, so there's nothing extra to install for it.

Next, initialize the 2Captcha solver using your API key and set up the URL of the CAPTCHA challenge page.

Example
import requests
from twocaptcha import TwoCaptcha
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# your 2Captcha api key
api_key = 'YOUR_2CAPTCHA_API_KEY'

# initialize the 2Captcha solver
solver = TwoCaptcha(api_key)

# url of the captcha challenge page
url = "https://2captcha.com/demo/normal"

# start a session to maintain cookies
session = requests.Session()

# send a request to the captcha page to download the image
response = session.get(url)
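
CAPTCHA solving isn't instant: as noted above, 2Captcha relies on human solvers or algorithms behind the scenes, so a result can take tens of seconds. If the defaults don't suit you, the 2captcha-python client accepts timeout and polling options at construction time (as documented in its README); here's a minimal sketch:

Example
from twocaptcha import TwoCaptcha

# optional: control how long the solver waits for an answer and how
# often it polls 2Captcha for the result
solver = TwoCaptcha(
    'YOUR_2CAPTCHA_API_KEY',
    defaultTimeout=120,   # wait up to two minutes for a solution
    pollingInterval=10,   # check for a result every 10 seconds
)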

The next step is to download the CAPTCHA image from the target page, which we'll send to 2Captcha for solving.

Example
# ... 

# parse the html to extract the captcha image url
soup = BeautifulSoup(response.content, 'html.parser')

# locate the captcha image using the 'alt' attribute
captcha_img_tag = soup.find("img", {"alt": "normal captcha example"})
captcha_img_url = captcha_img_tag['src']

# handle relative urls by joining with the base url
captcha_img_url = urljoin(url, captcha_img_url)

# download the captcha image
captcha_img_response = session.get(captcha_img_url)

# save the captcha image locally (necessary for 2Captcha api)
captcha_image_path = "captcha_image.jpg"
with open(captcha_image_path, "wb") as f:
    f.write(captcha_img_response.content)

After downloading the image, send it to 2Captcha for solving and print the result.

Example
# ... 

# send the captcha image to 2Captcha for solving
try:
    result = solver.normal(captcha_image_path)
    print(f"CAPTCHA solved: {result['code']}")
except Exception as e:
    print(f"error solving CAPTCHA: {e}")
    exit()

Here's the complete code:

Example
import requests
from twocaptcha import TwoCaptcha
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# your 2Captcha api key
api_key = 'YOUR_2CAPTCHA_API_KEY'

# initialize the 2Captcha solver
solver = TwoCaptcha(api_key)

# url of the captcha challenge page
url = "https://2captcha.com/demo/normal"

# start a session to maintain cookies
session = requests.Session()

# send a request to the captcha page to download the image
response = session.get(url)

# parse the html to extract the captcha image url
soup = BeautifulSoup(response.content, 'html.parser')

# locate the captcha image using the 'alt' attribute
captcha_img_tag = soup.find("img", {"alt": "normal captcha example"})
captcha_img_url = captcha_img_tag['src']

# handle relative urls by joining with the base url
captcha_img_url = urljoin(url, captcha_img_url)

# download the captcha image
captcha_img_response = session.get(captcha_img_url)

# save the captcha image locally (necessary for 2Captcha api)
captcha_image_path = "captcha_image.jpg"
with open(captcha_image_path, "wb") as f:
    f.write(captcha_img_response.content)

# send the captcha image to 2Captcha for solving
try:
    result = solver.normal(captcha_image_path)
    print(f"CAPTCHA solved: {result['code']}")
except Exception as e:
    print(f"error solving CAPTCHA: {e}")
    exit()

If the CAPTCHA is successfully solved, you will see an output similar to:

Output
CAPTCHA solved: W9H5K

Congratulations! You've successfully bypassed CAPTCHA with Python Requests and 2Captcha. While 2Captcha is a practical tool for small-scale data extraction and testing purposes, it may not be the most economical option for large-scale scraping projects. Additionally, it does not solve all CAPTCHA types.
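
Note that the same client covers more than plain image CAPTCHAs. For example, it exposes a recaptcha() method for reCAPTCHA v2; here's a hedged sketch, assuming you've pulled the target page's sitekey from the data-sitekey attribute of its reCAPTCHA widget (both values below are placeholders):

Example
from twocaptcha import TwoCaptcha

solver = TwoCaptcha('YOUR_2CAPTCHA_API_KEY')

# solve a reCAPTCHA v2 challenge; sitekey and url are placeholders
try:
    result = solver.recaptcha(
        sitekey='TARGET_PAGE_SITEKEY',
        url='https://example.com/page-with-recaptcha',
    )
    print(f"token: {result['code']}")
except Exception as e:
    print(f"error solving reCAPTCHA: {e}")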

Let's explore another alternative.


Method #2: Bypass CAPTCHA With a Web Scraping API

The best approach to bypassing CAPTCHAs is to avoid triggering them in the first place. By mimicking natural user behavior, you can often move through sites without setting off the anti-bot systems that serve CAPTCHA challenges.

Using web scraping APIs like ZenRows is a reliable way to bypass any CAPTCHA, no matter how complex the anti-bot measures are.

With features like auto-rotating premium proxies, user agent rotation, geolocation, and more, ZenRows provides everything you need to scrape without getting blocked.

Let's see ZenRows in action against the Antibot Challenge page at https://www.scrapingcourse.com/antibot-challenge.

Sign up for free, and you'll be redirected to the Request Builder page.

Enter the target URL, activate Premium Proxies and enable the JS Rendering boost mode. Choose Python and click on the API tab to get the generated code.

[Screenshot: building a scraper with the ZenRows Request Builder]

Copy the request code generated on the right. The code uses Requests, so install the library using the following command:

Terminal
pip3 install requests

The Request Builder will generate Python code similar to this:

Example
# pip3 install requests
import requests

url = 'https://www.scrapingcourse.com/antibot-challenge'
apikey = '<YOUR_ZENROWS_API_KEY>'

params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'premium_proxy': 'true',
}

response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)

Run it, and you'll get the HTML content of your target web page:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>

Congratulations, you've successfully scraped a CAPTCHA-protected page with ZenRows! 
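
One practical tip: in a production scraper, guard the API call so a failed request doesn't slip through unnoticed. Here's a minimal sketch using Requests' built-in error handling:

Example
# ...

response = requests.get('https://api.zenrows.com/v1/', params=params)

# raise on 4xx/5xx responses instead of silently parsing an error page
try:
    response.raise_for_status()
except requests.HTTPError as e:
    print(f"request failed: {e}")
else:
    print(response.text)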

Method #3: Rotate User Agents

CAPTCHAs can often be triggered when websites detect repetitive behavior from the same User Agent. A User Agent is a string sent with each HTTP request that identifies the browser or client and operating system being used. Bots are frequently flagged when they repeatedly use the same User Agent, which makes it clear they're not genuine users.
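
To see why this matters, check what Requests sends when you don't set the header yourself. By default, it identifies itself as python-requests plus its version number, which anti-bot systems flag easily (the exact version string depends on your installation):

Example
import requests

# no custom User-Agent header: Requests announces itself
response = requests.get("https://httpbin.io/user-agent")
print(response.text)

Output
{
    "user-agent": "python-requests/2.32.3"
}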

By switching the User Agent for each request, you make your traffic look like it's coming from different real users, reducing the chance of being blocked.

Let's see how to rotate User Agents using Python Requests. Start by installing the Requests library from your terminal:

Terminal
pip3 install requests

Then, import the required libraries:

Example
import requests
import itertools

Next, create a list of common User Agents. Here are a few reliable ones you can use while scraping:

Example
# ...

# create a User Agent list
user_agent_list = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0",
    # ...
]

Define a function to rotate User Agents using the itertools.cycle function:

Example
# ... 

# define a User Agent rotator
def rotate_ua(user_agent_list):
    return itertools.cycle(user_agent_list)

Then, create a generator instance to keep rotating through the user agents:

Example
# ... 

# create a generator instance
user_agent_generator = rotate_ua(user_agent_list)

Finally, let's use this generator to send HTTP requests while rotating the user agent for each one:

Example
# ...

# rotate the User Agent for 4 requests
for request in range(4):
    # send a request to httpbin.io
    response = requests.get(
        "https://httpbin.io/user-agent",
        headers={"User-Agent": next(user_agent_generator)},
    )
    # print the response text to see the current User Agent
    print(response.text)

Here's the full code:

Example
# import the required libraries
import requests
import itertools


# create a User Agent list
user_agent_list = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
   "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0",
    # ...
]

# define a User Agent rotator
def rotate_ua(user_agent_list):
    return itertools.cycle(user_agent_list)

# create a generator instance
user_agent_generator = rotate_ua(user_agent_list)

# rotate the User Agent for 4 requests
for request in range(4):
    # send a request to httpbin.io
    response = requests.get(
        "https://httpbin.io/user-agent",
        headers={"User-Agent": next(user_agent_generator)},
    )
    # print the response text to see the current User Agent
    print(response.text)

The output will show different user agents for each request:

Output
# request 1
{
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
}

# request 2
{
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
}

# request 3
{
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0"
}

# request 4 cycles back to the first User Agent in the list

By rotating User Agents, you make your scraper's behavior resemble that of real users, significantly lowering the chances of triggering a CAPTCHA. However, relying on this technique alone may not always be sufficient. Here are a few of its limitations:

  • Detection Algorithms: Websites employ sophisticated algorithms that analyze patterns in user behavior. If requests are made too frequently or exhibit unnatural patterns, even varied user agents can trigger CAPTCHAs.
  • IP Address Tracking: Many websites monitor IP addresses for suspicious activity. If numerous requests come from the same IP, it may still be flagged, regardless of the user agent being used.

To make rotating User Agents more effective, you can combine it with additional measures like IP rotation, introducing random delays between requests, maintaining session consistency, and regularly updating your list of user agents. 

These combined strategies help mimic genuine user behavior and significantly reduce the risk of triggering CAPTCHAs.
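
Here's a brief sketch of what combining these measures can look like: a session for cookie consistency, a rotating User Agent, a random delay between requests, and a proxy slot (the proxy URL below is a hypothetical placeholder; substitute your provider's credentials before running):

Example
import itertools
import random
import time

import requests

user_agent_list = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
]
user_agent_generator = itertools.cycle(user_agent_list)

# a session keeps cookies consistent across requests
session = requests.Session()

# hypothetical proxy placeholder: replace with real credentials
proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}

for _ in range(4):
    response = session.get(
        "https://httpbin.io/user-agent",
        headers={"User-Agent": next(user_agent_generator)},
        proxies=proxies,
    )
    print(response.text)

    # random delay to break a machine-like request rhythm
    time.sleep(random.uniform(1, 5))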

Conclusion

We explored three methods for bypassing CAPTCHA using Python Requests. While 2Captcha works well for small-scale scraping, it can become costly and impractical at larger scales. Rotating User Agents helps your bot appear more human, but it may not be enough against more sophisticated anti-bot systems.

To ensure successful data extraction, you need a web scraping API like ZenRows to bypass all CAPTCHAs and scrape any website without getting blocked. Sign up now to try ZenRows for free.
