How to Bypass CAPTCHA with Scrapy

February 21, 2024 · 3 min read

You'll often encounter CAPTCHAs while scraping with Scrapy, and they'll block your spider unless you find a way to bypass them.

In this article, you'll learn the different ways of bypassing CAPTCHAs in Scrapy.

Can You Solve CAPTCHAs with Scrapy?

You can get past CAPTCHAs in Scrapy using three methods: a web scraping API, a CAPTCHA resolver, or rotating premium proxies.

A web scraping API helps you avoid CAPTCHAs and other anti-bot protections altogether. Rotating premium proxies makes it less likely that a CAPTCHA service flags your IP address. CAPTCHA resolvers work by passing the challenge to a dedicated CAPTCHA-solving service or a human.

All these methods help you focus on data extraction without worrying about getting blocked. The next sections walk through each one.

Method #1: Bypass any CAPTCHA with a Web Scraping API

As powerful as Scrapy is for web scraping, it can be blocked by CAPTCHAs or other anti-bot protection. The best solution to bypass CAPTCHAs is to use a web scraping API like ZenRows. It provides everything you need to avoid CAPTCHA challenges, including premium proxy rotation, JavaScript rendering capabilities, automatic header management, browser fingerprint randomization, and more. Let's see how ZenRows performs against a protected page like the Antibot Challenge page.

Start by signing up for a new account, and you'll get to the Request Builder.

building a scraper with zenrows

Paste the target URL, enable JS Rendering, and activate Premium Proxies.

Next, select Python and click on the API connection mode. Then, copy the generated code and paste it into your script.

scraper.py
# pip3 install requests
import requests

url = "https://www.scrapingcourse.com/antibot-challenge"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)

The generated code uses Python's Requests library as the HTTP client. You can install this library using pip:

Terminal
pip3 install requests

Run the code, and you'll successfully access the page:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>

Congratulations! 🎉 You've successfully bypassed the anti-bot challenge page using ZenRows. The same approach applies to other protected websites.
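The generated script uses Requests, but you can send the same call from a Scrapy spider. Here's a minimal sketch, assuming you simply wrap each target URL in the API endpoint and parameters from the script above. The helper name zenrows_url is hypothetical and not part of any library.

```python
from urllib.parse import urlencode

# endpoint and parameter names are taken from the generated script above
ZENROWS_ENDPOINT = "https://api.zenrows.com/v1/"


def zenrows_url(target_url, apikey, js_render=True, premium_proxy=True):
    """Wrap target_url in a ZenRows API URL (hypothetical helper)."""
    params = {
        "url": target_url,
        "apikey": apikey,
        # the API expects lowercase "true"/"false" strings
        "js_render": str(js_render).lower(),
        "premium_proxy": str(premium_proxy).lower(),
    }
    # urlencode percent-encodes the target URL so it survives as a query value
    return f"{ZENROWS_ENDPOINT}?{urlencode(params)}"


api_url = zenrows_url(
    "https://www.scrapingcourse.com/antibot-challenge",
    "<YOUR_ZENROWS_API_KEY>",
)
print(api_url)
```

In a spider, you could yield scrapy.Request(zenrows_url(url, apikey), callback=self.parse) from start_requests. Keep in mind that response.url will then be the API URL rather than the target page.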

Method #2: Use a CAPTCHA Resolver

You can solve CAPTCHAs in Scrapy with CAPTCHA-resolving services. Most solving services, like 2CAPTCHA, employ human solvers, so each request might take some time.

You'll solve the reCAPTCHA demo on the 2CAPTCHA website to see how that works. Here's the unsolved CAPTCHA. 

CAPTCHA Demo

To solve that with 2CAPTCHA, install the solver package using pip:

Terminal
pip install 2captcha-python

You need two things to solve the reCAPTCHA: your 2CAPTCHA API key and the target website's site key.

Sign up on the 2CAPTCHA website and grab your API key from your dashboard.

2CAPTCHA Dashboard

You'll find the site key in the target website's HTML. Open the demo website in a browser like Chrome, right-click the CAPTCHA box, and click "Inspect". Expand the outer element and look for the data-sitekey attribute, as shown below:

Data Site Key Element Demo
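If you'd rather not copy the site key by hand, you can read it from the page source. The sketch below uses Python's built-in HTML parser and assumes the data-sitekey attribute is present in the HTML you download rather than injected later by JavaScript. The sample markup and the EXAMPLE_SITE_KEY value are made up for illustration.

```python
from html.parser import HTMLParser


class SiteKeyParser(HTMLParser):
    """Collect the first data-sitekey attribute found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.sitekey = None

    def handle_starttag(self, tag, attrs):
        if self.sitekey is None:
            # attrs is a list of (name, value) pairs with lowercased names
            self.sitekey = dict(attrs).get("data-sitekey")


def find_sitekey(html):
    """Return the first data-sitekey value in html, or None if absent."""
    parser = SiteKeyParser()
    parser.feed(html)
    return parser.sitekey


# made-up markup for illustration only
sample_html = '<form><div class="g-recaptcha" data-sitekey="EXAMPLE_SITE_KEY"></div></form>'
print(find_sitekey(sample_html))  # prints EXAMPLE_SITE_KEY
```

Inside a Scrapy callback, the equivalent lookup would be response.css("[data-sitekey]::attr(data-sitekey)").get(), which returns None when the attribute is missing.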

It's time to write your web scraping code. 

Start your spider class by defining a method that solves the reCAPTCHA. Pass your API key to the TwoCaptcha instance and use its recaptcha method to solve the CAPTCHA based on the site key and the target URL.

scraper.py
# import the required libraries
import scrapy
from twocaptcha import TwoCaptcha

class TutorialSpider(scrapy.Spider):
    name = "scraper"
    start_urls = ["https://2captcha.com/demo/recaptcha-v2-callback"]

    def solve_with_2captcha(self, sitekey, url):

        # start the 2CAPTCHA instance
        captcha2_api_key = "YOUR_2CAPTCHA_API_KEY"
        solver = TwoCaptcha(captcha2_api_key)

        try:

            # resolve the CAPTCHA
            result = solver.recaptcha(sitekey=sitekey, url=url)

            if result:
                print(f"Solved: {result}")
                return result["code"]
            else:
                print("CAPTCHA solving failed")
                return None

        except Exception as e:
            print(e)
            return None

Next, write another method that calls the solver. It passes the site key and the response URL to the solver method, then runs the scraping logic if solving succeeds.

scraper.py
class TutorialSpider(scrapy.Spider):
    
    # ...

    def solve_captcha(self, response):

        # specify reCAPTCHA sitekey, replace with the target site key
        captcha_sitekey = "6LfD3PIbAAAAAJs_eEHvoOl75_83eXSqpPSRFJ_u"

        # call the CAPTCHA solving function
        captcha_solved = self.solve_with_2captcha(captcha_sitekey, response.url)

        # check if CAPTCHA is solved and proceed with scraping
        if captcha_solved:
            print("CAPTCHA solved successfully")

            # extract elements after solving CAPTCHA successfully
            element = response.css("title::text").get()
            print("Scraped element:", element)

Finally, define the parse method and request the page again with solve_captcha as the callback.

scraper.py
class TutorialSpider(scrapy.Spider):
    
    #...

    def parse(self, response):

        # send a request to solve the CAPTCHA using the solver function as a callback
        yield scrapy.Request(url=response.url, callback = self.solve_captcha)

Here's the final code:

scraper.py
# import the required libraries
import scrapy
from twocaptcha import TwoCaptcha

class TutorialSpider(scrapy.Spider):
    name = "scraper"
    start_urls = ["https://2captcha.com/demo/recaptcha-v2-callback"]

    def solve_with_2captcha(self, sitekey, url):

        # start the 2CAPTCHA instance
        captcha2_api_key = "YOUR_2CAPTCHA_API_KEY"
        solver = TwoCaptcha(captcha2_api_key)

        try:

            # resolve the CAPTCHA
            result = solver.recaptcha(sitekey=sitekey, url=url)

            if result:
                print(f"Solved: {result}")
                return result["code"]
            else:
                print("CAPTCHA solving failed")
                return None

        except Exception as e:
            print(e)
            return None

    def solve_captcha(self, response):

        # specify reCAPTCHA sitekey
        captcha_sitekey = "6LfD3PIbAAAAAJs_eEHvoOl75_83eXSqpPSRFJ_u"

        # call the CAPTCHA solving function
        captcha_solved = self.solve_with_2captcha(captcha_sitekey, response.url)

        # check if CAPTCHA is solved and proceed with scraping
        if captcha_solved:
            print("CAPTCHA solved successfully")

            # extract elements after solving CAPTCHA successfully
            element = response.css("title::text").get()
            print("Scraped element:", element)

    def parse(self, response):

        # send a request to solve the CAPTCHA using the solver function as a callback
        yield scrapy.Request(url=response.url, callback = self.solve_captcha)

The code solves the reCAPTCHA and returns the solved token, as shown:

Output
Solved: {
'captchaId': '75653786097', 
'code': '03AFcWeA7Ap7jFxiBmNjBbwiHSGjMCD_oP3Ae8cUxzdtqJnNkj4XnuUJOUFRfUkkjU_GPCXwqHYYFCynXdrQhAQce-F...'
}
CAPTCHA solved successfully
Scraped element: How to solve reCAPTCHA V2 Callback on PHP, Java, Python, Go, Csharp, CPP

That's it! You just solved a CAPTCHA with 2CAPTCHA. However, remember that 2CAPTCHA doesn't solve all CAPTCHAs and can be expensive for large-scale projects.
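The spider above prints the solved token but doesn't send it back to the website. On pages that submit reCAPTCHA v2 through a regular HTML form, the token usually travels in a form field named g-recaptcha-response. Here's a minimal sketch of building that payload, assuming a form-based target. The helper name and the extra field are hypothetical, and a callback-based page like the demo may expect the token elsewhere.

```python
def recaptcha_form_data(token, extra_fields=None):
    """Build form data carrying a solved reCAPTCHA v2 token (hypothetical helper)."""
    # copy so the caller's dictionary isn't modified
    fields = dict(extra_fields or {})
    # reCAPTCHA v2 forms carry the token in this field
    fields["g-recaptcha-response"] = token
    return fields


# the token would be the "code" value returned by the solver
form_data = recaptcha_form_data("SOLVED_TOKEN", {"email": "user@example.com"})
print(form_data)
```

In Scrapy, you could pass the result to scrapy.FormRequest(url=form_action_url, formdata=form_data, callback=self.parse_result), where form_action_url and parse_result are placeholders for the target form's submission URL and your own callback.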

Method #3: Rotate Premium Proxies

Proxy rotation can help bypass CAPTCHAs, but it's less effective than the two previous methods. Some websites limit the number of requests from each IP address and often serve a CAPTCHA to those that exceed the limit.

Rotating proxies masks your IP address and makes it harder for the server to tie your requests to a single source. That way, you're less likely to trigger rate limits or run into interruptions from IP bans.

However, ensure you use premium proxies when dealing with CAPTCHAs because free ones usually don't work. Some providers also offer proxies marketed as CAPTCHA-compatible.

You can use proxies with Scrapy and also rotate them. Check our full tutorial on using proxies with Scrapy to learn more.
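As a rough illustration of rotation, here's a minimal sketch of a Scrapy downloader middleware that assigns proxies round-robin. It relies on Scrapy's built-in HttpProxyMiddleware, which reads the proxy from request.meta["proxy"]. The class name and proxy URLs are hypothetical placeholders, so replace them with your provider's endpoints.

```python
from itertools import cycle

# hypothetical placeholder proxies; replace with your provider's endpoints
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]


class RotatingProxyMiddleware:
    """Downloader middleware that assigns proxies to requests round-robin."""

    def __init__(self, proxies=PROXIES):
        # cycle() repeats the list endlessly, one proxy per request
        self._pool = cycle(proxies)

    def process_request(self, request, spider=None):
        # Scrapy's HttpProxyMiddleware picks the proxy up from request.meta
        request.meta["proxy"] = next(self._pool)
        # returning None lets Scrapy continue processing the request
        return None
```

To enable it, you'd register the class in DOWNLOADER_MIDDLEWARES in settings.py with a priority below 750 so it runs before HttpProxyMiddleware, for example {"myproject.middlewares.RotatingProxyMiddleware": 350}. The module path here is an assumption about your project layout.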

Conclusion

This article has highlighted the various techniques of bypassing CAPTCHAs in Scrapy. You've learned to achieve this with a web scraping API, a CAPTCHA-solving service, and premium proxy rotation.

As mentioned, the best of the three is to use a web scraping API, and ZenRows comes out on top. ZenRows is an all-in-one web scraping solution for bypassing CAPTCHAs and other anti-bot systems. Try ZenRows for free!
