If you're a web scraping developer, you know the frustration of running into CAPTCHAs. With failure rates below 10% and challenges that evolve every year, they remain one of the most reliable anti-bot measures.
In this article, you'll learn seven proven methods to avoid CAPTCHA and reCAPTCHA while web scraping.
- Avoid hidden traps.
- Use real headers.
- Rotate headers.
- Use rotating proxies.
- Implement headless browsers.
- Disable automation indicators.
- Make your scraper look like a real user.
Let's go!
What Is a CAPTCHA?
A CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is an interruption you must solve before accessing a protected web page. CAPTCHAs appear as different forms of challenges.
Websites use them to determine whether you're an actual user or a bot by testing your solving accuracy. CAPTCHAs are often time-sensitive, ensuring you solve them within a given timeframe.
Some websites only display the CAPTCHA after a background behavior analysis shows bot-like signals. They use this technique to avoid compromising user experience.
Types of CAPTCHAs
Below are some common CAPTCHA roadblocks you'll run into during scraping.
reCAPTCHA
reCAPTCHA is Google's security provision against bot-like activities. It uses advanced risk analysis techniques to differentiate between human and bot users. reCAPTCHA prioritizes user experience by allowing legitimate users to pass through with minimal interaction. Once it analyzes your traffic and determines you're human, it grants access to the protected page. Otherwise, it flags you as a bot and blocks your access.
You can tell that a website is protected by Google reCAPTCHA when you see the reCAPTCHA badge, typically anchored to a corner of the page.
Audio CAPTCHA
Audio CAPTCHAs are a security measure that uses audio to differentiate between humans and bots. Unlike the standard visual puzzles, they present an audio challenge, such as asking you to distinguish between different voices, input spoken letters or numbers, or follow instructions based on the voice note.
While generally more difficult for bots to bypass, audio CAPTCHAs can be challenging for users with accessibility limitations or those unfamiliar with the spoken language. Most CAPTCHA providers offer alternative options for users who struggle with audio.
Puzzle CAPTCHA
As the name implies, Puzzle CAPTCHAs require users to solve an interactive puzzle to prove they're humans.
Bots often fail puzzle challenges because they come in different forms. For example, a puzzle CAPTCHA may prompt you to arrange broken image pieces, slide an object to its correct position, match images, or click specific patterns.
Text CAPTCHAs
These challenges use text characters to prompt users to type strings presented in an image.
3D CAPTCHAs
3D CAPTCHAs are an improvement over the text challenge and use 3D characters, which are more difficult for bots to recognize.
Math Challenges
This method triggers a mathematical equation for the user to solve.
Image CAPTCHAs
In this case, the user has to identify particular objects in a grid image.
Invisible and Passive CAPTCHAs
These CAPTCHAs are more difficult to recognize since they're embedded within the code.
Invisible tests don't interfere with user experience. For example, clicking a submit button can run a JavaScript challenge to verify if your browser behaves like a regular one humans use.
Passive CAPTCHAs are time-based checks. For example, if filling a form typically takes a human over two seconds and your scraper completes it in 0.1 seconds, the request gets flagged as suspicious.
Websites often combine these two techniques to strengthen security.
How Does CAPTCHA Work?
To avoid CAPTCHA and reCAPTCHA, you must understand when the challenge might appear. Generally, different scenarios may trigger a CAPTCHA:
- Unusual spikes in traffic from the same user within a short period.
- Suspicious interactions, such as visiting many pages in one go, opening a page without scrolling, rapidly filling a form, clicking at an unusual speed, and more.
- Random analysis by firewalls with high-security measures that detect bot activities.
Check out our guides on web scraping best practices and scraping without getting blocked for detailed advice on tackling these challenges.
Is There a Way to Bypass CAPTCHA and reCAPTCHA?
It's technically impossible to skip a CAPTCHA once it appears or stop it from appearing when you trigger it. You can't eliminate CAPTCHAs from protected websites because you have no control over their implementation.
However, there are two ways to ensure CAPTCHAs don't block you while crawling or scraping. These include solving or bypassing them.
Solving a CAPTCHA involves using third-party solvers like 2Captcha to tackle the challenge when it appears. Most CAPTCHA-solving services employ advanced machine vision algorithms or human solvers to boost success. However, they're often expensive and unreliable at scale.
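For context, here's a minimal sketch of what the solving approach can look like, assuming the 2captcha-python package and placeholder values for the API key, site key, and target URL:

```python
# A hedged sketch of solving reCAPTCHA with the 2captcha-python package.
# The API key, site key, and URL below are placeholders.
from twocaptcha import TwoCaptcha

solver = TwoCaptcha("YOUR_2CAPTCHA_API_KEY")
result = solver.recaptcha(
    sitekey="SITE_KEY_FROM_THE_TARGET_PAGE",
    url="https://www.example.com/page-with-recaptcha",
)
# The returned token is then submitted along with the form or request.
print(result)
```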
Bypassing CAPTCHAs is more effective. This method employs all required measures to avoid triggering the CAPTCHA, including mimicking human behavior, rotating proxies, modifying the request headers, and more. We'll share details on how to achieve this in the next section.
How to Avoid CAPTCHA and reCAPTCHA When Scraping
Web scrapers use various methods to avoid CAPTCHAs. Here are seven of the most effective, proven ones:
1. Avoid Hidden Traps
Honeypots are traps that are invisible to humans and visible to bots. They can be entire web pages, forms, or data fields that bots often interact with during activities like scraping.
Most websites hide honeypot traps using CSS rules such as display: none, often applied via JavaScript. Since bots typically scan a page's elements rather than its rendered view, they're more likely to find and interact with these hidden components.
There are three main categories of honeypots:
- Low-interaction honeypots: These only mimic a limited number of services and provide little insight into the bot's tactics. However, they may expose the bot's type and origin.
- High-interaction honeypots: Fully functional systems or networks that closely mimic actual production environments. These are more deceptive and gather more information about incoming attacks and bots.
- Production honeypots: Real systems with comprehensive monitoring techniques deployed alongside production software. They are genuine and help the security team gain deeper insights into the operation mechanisms of incoming attacks and bot activities.
Once a bot interacts with a honeypot trap, such as clicking a hidden link or filling out a disguised form, the honeypot mechanism triggers an alert to reveal the bot's activity. The website's owner then implements measures to block the incoming bot request, denying you access to the data you want to scrape.
Here are some actionable ways to avoid honeypot traps:
- Avoid interacting with hidden elements: As mentioned above, honeypots are often hidden, and anti-bot measures don't expect humans to interact with them. While crawling links, avoid following hidden anchor tags, as they may lead to honeypots. Inspect the page's elements adequately and apply programmatic checks to skip unnecessary hidden website components (see the sketch after this list).
- Respect the terms of service: Check a website's terms before scraping it, and review bot engagement rules like the robots.txt file to see which pages you can crawl. Then, scrape at off-peak hours and extend your request intervals to avoid interrupting other users' activities.
- Avoid public networks: A server might set up a honeypot on a shared public network. Public Wi-Fi connections are often less secure and lack the encryption of private networks. This weakness allows anti-bots to monitor all traffic on the network, making it easier to identify automated scraping behavior by comparing a bot's browsing patterns to those of human users.
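Here's a minimal sketch of the first tip, assuming Selenium with Chrome and a placeholder URL. It collects only the links a human could actually see before following any of them:

```python
# A minimal sketch of skipping hidden links with Selenium (Python).
# The target URL is a placeholder; adapt the logic to your own crawler.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.example.com")

visible_links = []
for anchor in driver.find_elements(By.TAG_NAME, "a"):
    # is_displayed() returns False for elements hidden with
    # display: none, visibility: hidden, zero size, and similar tricks.
    if anchor.is_displayed():
        href = anchor.get_attribute("href")
        if href:
            visible_links.append(href)

# Only crawl links a human could actually see and click.
print(visible_links)
driver.quit()
```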
Check out our detailed article on understanding and avoiding honeypots to learn more.
2. Use Real Headers
The request headers tell the server about the browser or HTTP client sending a request and determine how the server responds to an incoming request.
One of the first security measures most anti-bots employ is to scan the headers of an incoming request for bot-like parameters. In advanced cases, they compare the incoming request headers with those of known bots to determine whether they are genuine. Any deviation from those of an actual browser will trigger a CAPTCHA to block your request.
HTTP clients like Python's Requests and JavaScript's Axios lack the essential request headers for web scraping, making them vulnerable to anti-bot detection. For example, Python's Requests library sends a bot-like value such as python-requests/2.31.0 as its User Agent header.
Headless browsers such as Selenium and Puppeteer often send bot-like User Agent parameters in headless mode, such as HeadlessChrome. Such header values make it obvious that your request is automated, resulting in a potential block.
See an example Selenium User Agent header below:
"User-Agent": [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/126.0.0.0 Safari/537.36"
]
Compare the above with the actual Chrome User Agent below. You'll see that it features Chrome instead of the bot-like HeadlessChrome flag:
"User-Agent": [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
]
To appear legitimate and reduce the chances of getting blocked, replace library-based headers with those of a real browser. You can even copy a browser's complete request headers and use them in your scraper.
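As an illustration, here's a minimal sketch using Python's Requests with headers borrowed from a real Chrome browser. The header values and target URL are placeholders; copy your own from the browser's DevTools Network tab:

```python
# A minimal sketch of sending real-browser headers with Python's Requests.
# The values are illustrative; grab yours from DevTools and keep them consistent.
import requests

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.google.com/",
}

response = requests.get("https://www.example.com", headers=headers)
print(response.status_code)
```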
3. Rotate Headers
Too many requests with the same HTTP headers are suspicious because a human user wouldn't visit 1,000 pages in five minutes.
You should rotate your request headers to avoid getting flagged as a bot. This technique makes it look like you're requesting from a different browser or machine per request.
While you may not rotate all your request header fields, important ones, such as the User Agent, Sec-Ch-Ua-Platform, Sec-Ch-Ua, and Referer, often require rotation in large-scale scraping.
However, be careful to keep the headers consistent during rotation, as most anti-bots flag header mismatches as bot-like. For instance, swapping a Windows Chrome User Agent for a Linux one requires updating the platform header (Sec-Ch-Ua-Platform) to Linux. Similarly, when you change the Chrome version in the User Agent string, you should update the version in the Sec-Ch-Ua header.
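Below is a minimal sketch of this idea: each profile keeps the User Agent, Sec-Ch-Ua, and platform headers consistent with one another, and the scraper picks a whole profile per request. The values and URL are illustrative:

```python
# A minimal sketch of rotating complete, consistent header profiles.
# Rotating whole profiles avoids mismatched fingerprints.
import random
import requests

HEADER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
        "Sec-Ch-Ua": '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',
        "Sec-Ch-Ua-Platform": '"Windows"',
        "Referer": "https://www.google.com/",
    },
    {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "Sec-Ch-Ua": '"Not/A)Brand";v="8", "Chromium";v="125", "Google Chrome";v="125"',
        "Sec-Ch-Ua-Platform": '"Linux"',
        "Referer": "https://www.bing.com/",
    },
]

# Pick a full profile per request instead of mixing individual fields.
headers = random.choice(HEADER_PROFILES)
response = requests.get("https://www.example.com", headers=headers)
print(response.status_code)
```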
4. Use Rotating Proxies
If you rotate your header set without changing your IP address, your request will still look suspicious. Proxies are intermediaries between you and the server, allowing you to mimic a user from a different location.
Sending multiple requests from one IP address can get you blocked due to rate limiting. Proxy rotation ensures that your IP address changes every time. This technique is helpful during large-scale scraping, making you appear as a different user during each request.
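Here's a minimal sketch of per-request proxy rotation with Python's Requests. The proxy addresses are placeholders; plug in the endpoints from your own provider:

```python
# A minimal sketch of rotating proxies with Python's Requests.
import itertools
import requests

PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
proxy_pool = itertools.cycle(PROXIES)

for url in ["https://httpbin.org/ip"] * 3:
    proxy = next(proxy_pool)
    response = requests.get(url, proxies={"http": proxy, "https": proxy})
    # Each request should report a different origin IP.
    print(response.json())
```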
There are two proxy categories: free and premium. Free proxies are unsuitable for real-life projects due to their short lifespan.
It's best to rotate quality paid residential proxies, which use IPs assigned to daily internet users by network providers. Most residential proxy providers maintain a pool containing millions of these IP addresses from several geolocations. So you don't have to maintain the IP pool.
Some paid proxy services even provide advanced solutions to help you bypass difficult CAPTCHAs like reCAPTCHA.
ZenRows is one of the best residential proxy providers, offering up to 55 million IPs covering 185+ countries. It also auto-rotates proxies to mimic human users and features flexible geo-targeting to access geo-restricted content. ZenRows also gives you access to advanced anti-bot auto-bypass tools under the same price cap.
5. Implement Headless Browsers
Headless browsers are browsers that run without a graphical user interface. They allow you to execute JavaScript and simulate human actions like clicking, scrolling, hovering, dragging and dropping, form filling, and more.
Popular headless browsers include Selenium, Playwright, and Puppeteer. Since they allow you to simulate human interactions, you can leverage them to reduce the chances of getting blocked.
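For example, here's a minimal sketch using Playwright's Python API in headless mode, with a placeholder URL and a few human-like interactions:

```python
# A minimal sketch of a headless Playwright session simulating basic
# human-like behavior. The URL is a placeholder.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.example.com")

    # Scroll and move the mouse the way a human reader might.
    page.mouse.wheel(0, 600)
    page.mouse.move(200, 300)
    page.wait_for_timeout(1500)

    print(page.title())
    browser.close()
```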
If you want to learn more about scraping with each of the mentioned headless browsers, check out the following detailed tutorials:
- Web scraping with Selenium in Python.
- Content extraction with Puppeteer.
- Web scraping with Playwright in Python.
However, these headless browsers are usually insufficient against sophisticated CAPTCHAs like reCAPTCHA and hCaptcha. That's because they leak a lot of bot-like information, making them vulnerable to detection.
You can fortify these headless browsers to boost their anti-bot bypass capabilities. For example, you can change the default User Agent and scrape behind proxies. There are also stealth plugins for each of the most popular headless browsers:
- Selenium: Selenium Stealth and Undetected ChromeDriver to scrape a web page undetected.
- Puppeteer: The Stealth plugin patches Puppeteer's browser instance with various anti-bot evasion strategies.
- Playwright: The Playwright Stealth plugin is also available to boost Playwright's anti-bot bypass capabilities.
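As an illustration, here's a minimal sketch assuming the selenium-stealth package and a placeholder URL; it patches a regular Chrome session with the plugin's evasions:

```python
# A minimal sketch of applying selenium-stealth to a Chrome session.
# The URL is a placeholder; the keyword values mirror the package's docs.
from selenium import webdriver
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

stealth(
    driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

driver.get("https://www.example.com")
print(driver.title)
driver.quit()
```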
6. Disable Automation Indicators
Most browser-based tools have specific indicators and WebDriver flags that disclose you're a bot. For example, Selenium and Puppeteer have the navigator.webdriver property set to true by default.
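Here's a minimal sketch of clearing some of these indicators manually in Selenium with Chrome. The flags and injected script below are common evasions rather than a guaranteed bypass, and the URL is a placeholder:

```python
# A minimal sketch of hiding common automation indicators in Selenium.
from selenium import webdriver

options = webdriver.ChromeOptions()
# Drop the "enable-automation" switch and the automation extension.
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# Stop Chrome from exposing the automation-controlled Blink feature.
options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(options=options)
# Overwrite navigator.webdriver before any page script runs.
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
)

driver.get("https://www.example.com")
print(driver.execute_script("return navigator.webdriver"))
driver.quit()
```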
Plugins for headless browsers, such as Puppeteer-stealth, implement many techniques to erase these traces. To learn how to implement them, read our tutorial on avoiding detection with Puppeteer. You can even patch Puppeteer stealth to boost its anti-bot evasion strategies further.
7. Make Your Scraper Look Like a Real User
Mimicking human behavior and avoiding bot-like patterns are crucial to bypassing detection. Anti-bot measures track user behavior, such as navigation patterns, mouse movement, hovering style, scrolling directions, and clicking coordinates to differentiate between humans and bots.
As mentioned in a previous bypass method, headless browsers let you simulate human behavior. Since they can execute JavaScript, headless browsers can also improve your scraper's chances of passing background JavaScript challenges.
You can implement the following strategies to imitate real user behavior:
- Randomize actions such as scrolling back and forth.
- Click on visible components.
- Type into form fields.
- Use random time intervals between interactions.
- Implement exponential backoffs to delay requests after failed attempts.
By following these behavioral patterns, you can avoid triggering CAPTCHAs and other website security measures.
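Here's a minimal sketch combining several of these strategies in Selenium, with a placeholder URL: random scroll pauses, interaction with visible elements only, and exponential backoff after failed page loads:

```python
# A minimal sketch of human-like pacing and interaction with Selenium.
# The URL is a placeholder; adapt the selectors to your target site.
import random
import time

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By


def fetch_with_backoff(driver, url, max_retries=3):
    """Retry a page load, roughly doubling the wait after every failure."""
    for attempt in range(max_retries):
        try:
            driver.get(url)
            return True
        except Exception:
            time.sleep(2 ** attempt + random.random())
    return False


driver = webdriver.Chrome()
if fetch_with_backoff(driver, "https://www.example.com"):
    # Scroll down and back up with random pauses, like a skimming reader.
    for offset in (600, 300, -400):
        driver.execute_script("window.scrollBy(0, arguments[0]);", offset)
        time.sleep(random.uniform(0.8, 2.5))

    # Hover over and click a visible element instead of firing raw JS clicks.
    links = [a for a in driver.find_elements(By.TAG_NAME, "a") if a.is_displayed()]
    if links:
        ActionChains(driver).move_to_element(links[0]).pause(0.5).click().perform()

driver.quit()
```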
To learn more, look at our guide on anti-scraping techniques.
Conclusion
We've learned how to avoid CAPTCHA and reCAPTCHA when web scraping. For example, you should bypass honeypot traps by skipping hidden links, rotate real browser headers together with your IP address, and implement headless browsers that mimic human behavior with randomized actions.
To simplify your scraping tasks, we recommend using ZenRows, a full-fledged web scraping solution that automatically implements these bypass techniques, among many others. All it takes is a single API call. Sign up now and try it for free.
Frequent Questions
How Do You Get Rid of a CAPTCHA?
CAPTCHAs are designed to stop bots from accessing websites. However, if you have legitimate reasons for scraping a website, there are some ways to bypass such challenges, including:
- Use a solving service that can automatically solve CAPTCHAs for you.
- Use a headless browser to simulate human behavior.
- Apply human behavioral patterns to your scraper.
- Use rotating proxies and actual browser headers.