Are you trying to scrape a website but getting blocked by CAPTCHA? CAPTCHAs can hinder any web scraping project and are becoming increasingly challenging.
Fortunately, there are ways to bypass CAPTCHA while web scraping, and we'll cover seven proven techniques:
- Rotate IPs.
- Rotate User Agents.
- Use a CAPTCHA resolver.
- Avoid hidden traps.
- Simulate human behavior.
- Save cookies.
- Hide automation indicators.
Let's go!
What Is CAPTCHA?
CAPTCHA is a short form of "Completely Automated Public Turing test to tell Computers and Humans Apart." It aims to prevent automated programs from accessing websites, protecting them from potential harm and bot-like activities like scraping. Generally, a CAPTCHA is a challenge the user must solve before accessing a protected website.
CAPTCHAs are easy for humans to solve but difficult for machines to understand, making them hard for web scrapers to bypass. For example, in the image below, the user must check the box to prove they're human. A bot can't complete this step the way a human intuitively can.
How Does CAPTCHA Block You While Web Scraping?
CAPTCHAs take various forms depending on the website's implementation. While some appear every time you visit a web page, most are triggered by bot-like activities such as web scraping.
During scraping, a CAPTCHA may appear due to any of the following reasons:
- Sending multiple requests from the same IP within a few seconds.
- Repeated automation patterns, such as clicking the same link frequently or accessing the same pages repeatedly.
- Suspicious automation interactions, such as visiting many pages rapidly without interaction, clicking at an unusual speed, or rapidly filling out a form.
- Disregarding the robots.txt file by accessing restricted web pages.
Can CAPTCHA Be Bypassed?
Yes, you can bypass CAPTCHAs, but it's not an easy task. The recommended approach is to prevent CAPTCHAs from appearing in the first place and, if blocked, to retry the request.
Alternatively, you can solve the CAPTCHA, but the success rate is much lower, and the cost is significantly higher. Most CAPTCHA-solving services send requests to human solvers and return the solution. This approach slows down your scraper and substantially reduces its efficiency.
Avoiding CAPTCHAs is more reliable, as it employs all the required measures to prevent automated actions that trigger them. Below, we'll cover the best approaches for bypassing CAPTCHAs during web scraping so you can get the data you want.
How to Bypass CAPTCHA While Web Scraping
This section will explain seven techniques to bypass the frustrating CAPTCHA obstacles while scraping in Python.
1. Rotate IPs
If many requests come from the same IP address, websites detect it as bot activity and block it. To prevent that, rotate your IPs to scrape without interruptions.
You have to create a pool of proxies and programmatically rotate them to change your IP address per request. Let's quickly see how to implement that in Python by requesting https://httpbin.io/ip, a test website that returns your IP address.
We'll grab a few proxies from the Free Proxy List for this.
We've only used free proxies to show how IP rotation works. They're unsuitable for real-life projects, and the ones used in this tutorial may not work at the time of reading. Feel free to grab new ones from the Free Proxy List website.
Import the Requests library and create a proxy list:
# import the required libraries
import requests
# create a proxy list
proxy_list = [
    {
        "http": "http://27.64.18.8:10004",
        "https": "http://27.64.18.8:10004",
    },
    {
        "http": "http://161.35.70.249:3128",
        "https": "http://161.35.70.249:3129",
    },
    # ...
]
Add itertools to your imports. You'll use this module to cycle through the proxies. Define a proxy_rotator function that returns a generator over the proxy list:
# ...
import itertools
# define a proxy rotator using a generator
def proxy_rotator(proxy_list):
    return itertools.cycle(proxy_list)
Create a new generator instance from the rotator function. Loop through a range of requests and pass the next proxy from the generator as the proxies parameter. The next() function ensures that each request uses the next proxy address on the list and starts from the beginning once the list is exhausted:
# create a generator from the proxy rotator function
proxy_gen = proxy_rotator(proxy_list)
# rotate the proxies for three requests
for request in range(3):
    # send a request to httpbin.io
    response = requests.get("https://httpbin.io/ip", proxies=next(proxy_gen))
    # print the response text to see your current IP
    print(response.text)
Combine the snippets. Here's the complete code:
# import the required libraries
import requests
import itertools
# create a proxy list (create your own proxy list)
proxy_list = [
    {
        "http": "http://27.64.18.8:10004",
        "https": "http://27.64.18.8:10004",
    },
    {
        "http": "http://161.35.70.249:3128",
        "https": "http://161.35.70.249:3129",
    },
    # ...
]
# define a proxy rotator using a generator
def proxy_rotator(proxy_list):
    return itertools.cycle(proxy_list)
# create a generator from the proxy rotator function
proxy_gen = proxy_rotator(proxy_list)
# rotate the proxies for three requests
for request in range(3):
    # send a request to httpbin.io
    response = requests.get("https://httpbin.io/ip", proxies=next(proxy_gen))
    # print the response text to see your current IP
    print(response.text)
The above code returns each proxy's IP for three requests, restarting from the beginning of the list once it's exhausted. This confirms that the scraper rotates through the proxies in the proxy list:
{
  "origin": "27.64.18.8:33662"
}
{
  "origin": "161.35.70.249:56647"
}
{
  "origin": "27.64.18.8:76563"
}
As mentioned, free proxies usually fail due to their short lifespan. Your best option is to use a premium CAPTCHA proxy server that automatically masks your IP and changes the assigned address.
If you're interested in learning more, check out our guide on rotating proxies in Python.
2. Rotate User Agents
Rotating the User Agent header is another way to prevent CAPTCHAs from appearing while scraping. The User Agent is a string sent with every request. It identifies the browser/HTTP client and operating system of the request source.
The information a User Agent provides helps websites optimize their pages for different devices and browsers, but anti-bot measures also use it to identify and block bots.
To avoid blocks, your User Agent should look natural, have consistent information, and be up-to-date. Then, you should rotate it to avoid using the same User Agent for every request.
Using a concept similar to the previous IP rotation code, we'll create an example Python scraper that rotates the User Agent from a list. We'll request https://httpbin.io/user-agent, a test website that returns the current User Agent.
First, create a User Agent list. You can compile a list of real ones from WhatIsMyBrowser:
# ...
# create a User Agent list
user_agent_list = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0",
# ...
]
Define a function that returns a User Agent generator:
# ...
# define a User Agent rotator
def rotate_ua(user_agent_list):
    return itertools.cycle(user_agent_list)
Create a generator instance and use it in multiple requests to rotate the User Agents:
# ...
# create a generator instance
user_agent_generator = rotate_ua(user_agent_list)
# rotate the User Agent for 4 requests
for request in range(4):
    # send a request to httpbin.io
    response = requests.get(
        "https://httpbin.io/user-agent",
        headers={"User-Agent": next(user_agent_generator)},
    )
    # print the response text to see the current User Agent
    print(response.text)
Merge the snippets, and you'll get the following final code:
# import the required libraries
import requests
import itertools
# create a User Agent list
user_agent_list = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0",
# ...
]
# define a User Agent rotator
def rotate_ua(user_agent_list):
    return itertools.cycle(user_agent_list)
# create a generator instance
user_agent_generator = rotate_ua(user_agent_list)
# rotate the User Agent for 4 requests
for request in range(4):
    # send a request to httpbin.io
    response = requests.get(
        "https://httpbin.io/user-agent",
        headers={"User-Agent": next(user_agent_generator)},
    )
    # print the response text to see the current User Agent
    print(response.text)
The above code rotates the User Agent per request as shown:
{
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
}
{
  "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
}
{
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0"
}
{
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
}
You now know how to rotate the User Agent during scraping. Check out our list of the best User Agents for web scraping to get started.
3. Use a CAPTCHA Resolver
CAPTCHA resolvers are services that automatically solve CAPTCHAs, allowing you to scrape websites without interruptions. A popular example is 2Captcha, which employs human workers to solve CAPTCHA challenges.
When a CAPTCHA-solving service receives a request to solve a CAPTCHA, it forwards the challenge to a human worker, who solves the puzzle and sends the solution back to the service. The service then returns the answer to your scraper, which uses it to pass the CAPTCHA challenge on the target web page.
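To illustrate the flow, here's a minimal sketch that assumes the official 2captcha-python package (pip install 2captcha-python) and a reCAPTCHA v2 challenge. The API key, site key, and page URL are placeholders you'd replace with your own values:
# pip install 2captcha-python
from twocaptcha import TwoCaptcha

# placeholder values: replace with your 2Captcha API key and the target
# page's reCAPTCHA site key and URL
API_KEY = "YOUR_2CAPTCHA_API_KEY"
SITE_KEY = "TARGET_SITE_RECAPTCHA_SITE_KEY"
PAGE_URL = "https://www.example.com/login"

# create a solver instance and submit the challenge for a human to solve
solver = TwoCaptcha(API_KEY)
result = solver.recaptcha(sitekey=SITE_KEY, url=PAGE_URL)

# the returned token is typically submitted back to the target site,
# e.g., in the g-recaptcha-response form field
print(result["code"])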
While that appears to be an easy fix, it has disadvantages: it's expensive at scale and only works with some CAPTCHA types.
4. Avoid Hidden Traps
Did you know that websites use sneaky traps to detect bots? For example, the honeypot trap tricks bots into interacting with hidden elements like invisible form fields or links.
These traps are only visible to bots, not to human users. Interaction with such traps allows the website to spot bot behavior and flag the bot's IP as suspicious.
But you can learn how these traps work and how to spot them. One way is to inspect the website's HTML for hidden elements and avoid elements with unusual names or values.
You can learn more about honeypot traps and how to bypass them.
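As a rough illustration, here's a minimal sketch that uses Requests and BeautifulSoup to flag form fields a human visitor would never see. The URL is a placeholder, and the checks (hidden inputs, display:none, visibility:hidden) cover only common patterns rather than every possible trap:
# pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

# placeholder URL: replace with the page whose form you're inspecting
response = requests.get("https://www.example.com/contact")
soup = BeautifulSoup(response.text, "html.parser")

# flag inputs that are invisible to human visitors: likely honeypots
suspicious_fields = []
for field in soup.find_all("input"):
    style = (field.get("style") or "").replace(" ", "").lower()
    if (
        field.get("type") == "hidden"
        or "display:none" in style
        or "visibility:hidden" in style
    ):
        suspicious_fields.append(field.get("name"))

# leave these fields untouched when filling out the form programmatically
print("Fields to avoid:", suspicious_fields)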
5. Simulate Human Behavior
Accurately simulating human behavior is essential to bypass CAPTCHAs while scraping a website. For instance, sending multiple requests within a few milliseconds can get your IP rate-limited or banned.
One way to mimic human behavior is to add delays between requests to reduce your request frequency. You can randomize the delays so the pattern looks more natural. Another approach is to implement exponential backoff, increasing the wait time after each failed request (sketched after the example below).
The code below shows how to wait between requests and randomize the wait intervals using Python's time and random modules.
import time
import random
import requests
# make a GET request to the given URL and print the status code
def make_request(url):
    response = requests.get(url)
    print(f"Request to {url} returned status code: {response.status_code}")
# list of URLs to request
urls = [
"https://www.scrapingcourse.com/ecommerce/page/1/",
"https://www.scrapingcourse.com/ecommerce/page/2/",
"https://www.scrapingcourse.com/ecommerce/page/3/",
]
# range for random wait time (in seconds)
min_wait = 1
max_wait = 5
# iterate through each URL in the list
for url in urls:
    make_request(url)
    wait_time = random.uniform(min_wait, max_wait)
    print(f"Waiting for {wait_time:.2f} seconds before the next request...")
    time.sleep(wait_time)
print("All requests completed.")
In addition to delays, adding human interactions (clicking, scrolling, hovering, etc.) to your scraper reduces the chances of anti-bot detection. You can achieve this with a browser automation tool like Selenium, which lets you control browsers like Chrome programmatically and run them in headless mode.
Check out our in-depth guide on headless browsers in Python and Selenium to learn how to implement it.
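As a quick illustration, here's a minimal Selenium sketch that scrolls gradually and hovers before clicking, assuming Selenium 4 with Chrome installed. The target URL and the .product a selector are assumptions for this example, so adjust them to your page:
# pip install selenium
import time
import random
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

# placeholder URL: replace with your target page
driver.get("https://www.scrapingcourse.com/ecommerce/")

# scroll down in small steps with random pauses instead of jumping to the bottom
for _ in range(3):
    driver.execute_script("window.scrollBy(0, 600);")
    time.sleep(random.uniform(0.5, 1.5))

# hover over the first product link (assumed selector) before clicking it
first_product = driver.find_element(By.CSS_SELECTOR, ".product a")
ActionChains(driver).move_to_element(first_product).pause(
    random.uniform(0.3, 1.0)
).click().perform()

print(driver.title)
driver.quit()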
6. Save Cookies
Cookies can be your secret weapon when it comes to web scraping. These small files contain data about your interactions with a website, including your login status, preferences, and more.
If you're scraping behind a login, cookies can be beneficial since they save you the hassle of logging in again, reducing the risk of getting caught. You can also use cookies to persist or pause a web scraping session and resume later.
With HTTP clients like Requests and headless browsers like Selenium, you can programmatically save and load cookies and extract data under the radar.
Let's see how to store session cookies using the Requests library in Python. We'll set demo cookies (samplecookies) on a test website like https://httpbin.io, retrieve the cookies, and then store them in a JSON file.
Here's the URL to set the specified cookies:
https://httpbin.io/cookies/set?cookiename=samplecookies
Add Python's json module to your imports. Create a new requests session and request the above URL with it:
# import the required library
# ...
import json
# create a session
session = requests.Session()
# set the cookie
response = session.get("https://httpbin.io/cookies/set?cookiename=samplecookies")
Retrieve the cookies into a dictionary and add it to a cookie_info dictionary along with other information:
# ...
# convert the cookies to a dictionary
cookies = requests.utils.dict_from_cookiejar(session.cookies)
# create a dictionary with more information
cookie_info = {
"cookies": cookies,
"url": "https://httpbin.io",
"timestamp": response.headers.get("Date"),
}
Finally, store the cookie_info dictionary in a JSON file:
# ...
# write the cookie information to a JSON file
with open("cookie_info.json", "w") as f:
    json.dump(cookie_info, f, indent=4)
Combine the snippets, and here's the complete code:
# import the required library
import requests
import json
# create a session
session = requests.Session()
# set the cookie
response = session.get("https://httpbin.io/cookies/set?cookiename=samplecookies")
# convert the cookies to a dictionary
cookies = requests.utils.dict_from_cookiejar(session.cookies)
# create a dictionary with more information
cookie_info = {
"cookies": cookies,
"url": "https://httpbin.io",
"timestamp": response.headers.get("Date"),
}
# write the cookie information to a JSON file
with open("cookie_info.json", "w") as f:
    json.dump(cookie_info, f, indent=4)
The above code stores the cookie information into a JSON file. You should see the following when you open cookie_info.json from your project root:
{
    "cookies": {
        "cookiename": "samplecookies"
    },
    "url": "https://httpbin.io",
    "timestamp": "Tue, 23 Jul 2024 22:56:02 GMT"
}
That's it! You just stored session cookies with the Requests library in Python.
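To load the saved cookies later, for example, to resume a paused scraping session, here's a minimal sketch that reads cookie_info.json back into a fresh session. It assumes the file produced by the code above:
# import the required libraries
import json
import requests

# read the previously saved cookie file
with open("cookie_info.json", "r") as f:
    cookie_info = json.load(f)

# create a fresh session and load the saved cookies into it
session = requests.Session()
session.cookies.update(requests.utils.cookiejar_from_dict(cookie_info["cookies"]))

# verify that the cookies are sent with subsequent requests
response = session.get("https://httpbin.io/cookies")
print(response.text)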
7. Hide Automation Indicators
You should still be careful when using a headless browser because websites can identify bots by looking for automation indicators, such as the navigator.webdriver flag and headless browser fingerprints.
However, plugins like Selenium Stealth hide these bot-like parameters, letting you combine your automated sessions with human-like mouse movements and keystrokes without being noticed.
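For example, here's a minimal sketch that assumes the selenium-stealth package (pip install selenium-stealth) with Chrome. The stealth() arguments mirror the package's typical example values, and the target URL is a placeholder:
# pip install selenium selenium-stealth
from selenium import webdriver
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

# patch common automation giveaways (navigator.webdriver, vendor strings, etc.)
stealth(
    driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

# placeholder URL: replace with your target page
driver.get("https://www.scrapingcourse.com/ecommerce/")
print(driver.title)
driver.quit()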
Check out our tutorial on how to avoid bot detection with Selenium to keep your scraping activities running.
Conclusion
Preventing CAPTCHAs from hindering web scraping is no easy feat, but you're now better equipped with the techniques to tackle this challenge. However, implementing the above methods can take time and effort for large-scale projects.
CAPTCHA? No Problem, Scrape Without Getting Blocked With ZenRows
ZenRows provides all the tools required to bypass CAPTCHAs and any other anti-bot measure at scale.
Use ZenRows' auto-rotating premium proxies and full-fledged web scraping API to ensure you successfully scrape any website without getting blocked.
Try ZenRows for free and see for yourself.