How to Use Botright for Web Scraping

August 1, 2024 · 8 min read

Does your Python web scraper keep getting blocked by CAPTCHAs? Botright, an improved version of Playwright, has a few tricks to help you solve them and make scraping easier.

In this article, you'll learn how Botright works and how to use it for web scraping and solving CAPTCHAs.

What Can You Do With Botright?

Botright is an improved version of Playwright designed to bypass CAPTCHAs and anti-bots while scraping with Python. It runs a browser instance of Chromium or Firefox using Playwright under the hood, allowing you to execute JavaScript and scrape dynamic content.

Unlike base Playwright, Botright doesn't expose bot-like attributes such as the navigator.webdriver flag. It also uses scraped Chrome fingerprint data to become less detectable during browser fingerprinting tests.

Let's see the difference between Playwright and Botright in practice. The fingerprinting test on CreepJS shows that base Playwright exposes the WebDriver flag by default. See a sample of the result below, with webDriverIsOn returning true:

Example
webDriverIsOn: true
hasHeadlessUA: false
hasHeadlessWorkerUA: false

Botright, on the other hand, returns false for the same fingerprinting test, showing that it appears less bot-like than base Playwright:

Example
webDriverIsOn: false
hasHeadlessUA: false
hasHeadlessWorkerUA: false
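
You can reproduce this check yourself. Here's a minimal sketch, using the same Botright API covered later in this tutorial, that launches a browser and reads the navigator.webdriver flag directly:

scraper.py
# minimal sketch: read the navigator.webdriver flag with Botright
import asyncio
import botright

async def check_webdriver():
    # start a botright browser instance in headless mode
    botright_client = await botright.Botright(headless=True)
    browser = await botright_client.new_browser()
    page = await browser.new_page()

    # read the WebDriver flag from the page context
    print(await page.evaluate("navigator.webdriver"))

    # close the botright browser instance
    await botright_client.close()

asyncio.run(check_webdriver())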

Botright also features dedicated solvers for popular CAPTCHAs, such as reCAPTCHA and hCaptcha. Its ability to run a real browser and mimic a legitimate user allows it to bypass basic anti-bot checks like JavaScript challenges. However, it may still fail against advanced anti-bots like Cloudflare and DataDome.

Botright is asynchronous by default and supports concurrent scraping to extract data from multiple pages simultaneously. However, as the library isn't thread-safe, you must run a separate browser instance per URL to prevent threading conflicts.
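
For example, here's a minimal concurrency sketch based on the API used in this tutorial. The URLs are placeholders, and each one gets its own browser instance to respect that rule:

scraper.py
# minimal concurrency sketch: one browser instance per URL
import asyncio
import botright

async def scrape_url(botright_client, url):
    # give each URL a dedicated browser instance
    browser = await botright_client.new_browser()
    page = await browser.new_page()
    await page.goto(url)

    # print the page title as a placeholder scraping task
    print(f"{url}: {await page.title()}")

    await browser.close()

async def main():
    botright_client = await botright.Botright(headless=True)

    # placeholder URLs for illustration
    urls = [
        "https://example.com/page-1",
        "https://example.com/page-2",
    ]

    # run one scraping task per URL concurrently
    await asyncio.gather(*(scrape_url(botright_client, url) for url in urls))

    await botright_client.close()

asyncio.run(main())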

Tutorial: How to Scrape With Botright

In this section, you'll learn how to perform basic scraping tasks with Botright by extracting product information from the ScrapingCourse infinite scrolling challenge page.

You'll start with an initial HTTP request, then scroll the page to get specific product details, and finally export the scraped data to a CSV file.

The target website uses infinite scrolling to load content as you scroll down:

Infinite Scrolling Page

Ready to start? Let's move on to the initial setup.

Prerequisites

Botright doesn't support Python 3.11 or later due to dependency conflicts, so install Python 3.10 or lower for the best experience.
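
You can confirm your interpreter version from the terminal:

Terminal
python --version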

Now, install Botright with pip:

Terminal
pip install botright

This command also installs Playwright as a dependency. Once the installation is complete, download the required browser binaries:

Terminal
playwright install

You'll also need to install the Ungoogled Chromium binary, a Chromium variant with improved privacy, to boost anti-bot evasion capability. Download and install it from the official download page, and Botright will use it automatically.

Create a project root folder with a new scraper.py file. You can work with any code editor, but this tutorial uses VS Code on a Windows operating system.

You're now ready to scrape with Botright! Let's get started.

Scrape With Botright

Let's see how Botright works by building a scraper that extracts product names, prices, and image URLs from the target website.

Before you begin, inspect the target website's HTML. Open the website in a browser, e.g., Chrome, right-click the first product, and select Inspect.

Each product's information is inside a div element with the class name "product-item".

Inspect Element

Since the page uses infinite scrolling, you'll need to automate downward scrolling to load and extract all the product information.

To begin, import asyncio and Botright into your scraper file. Asyncio is a built-in Python library that lets you execute asynchronous functions. Define an asynchronous scraper function that accepts a "page" instance as an argument. Then, extract all the product containers with Botright's query selector:

scraper.py
# import the required library
import asyncio
import botright

async def scraper(page):
   
    # extract all the product containers
    products = await page.query_selector_all(".product-item")

Create an empty product_data list to collect the scraped data. Loop through each product container to extract its product information using CSS selectors. Add the scraped items to a dictionary and append it to the product_data list:

scraper.py
async def scraper(page):
   
    # ...
   
    # create an empty array to write the scraped data
    product_data = []

    # loop through each product container to extract its data
    for product in products:
       
        name = await product.query_selector(".product-name")
        price = await product.query_selector(".product-price")
        image = await product.query_selector("img")

        # extract each product's actual value into a dictionary
        extracted_data = {
            "name": await name.inner_text(),
            "price": await price.inner_text(),
            "image": await image.get_attribute("src")
        }

        # append the extracted data to the product_data array
        product_data.append(extracted_data)

The next step is to expand the scraper function to export the scraped data to a CSV file. First, add Python's built-in csv module to your imports. Then, open a CSV file, define field names that match the dictionary keys, and write the data row by row:

scraper.py
# import the required library
# ...
import csv

async def scraper(page):
   
    # ...
   
    # write the extracted product data to a CSV file
    with open("products.csv", "w", newline="", encoding="utf-8") as csv_file:

        # specify the CSV's field names to match the extracted_data dictionary
        field_names = ["name", "price", "image"]

        # set each field name as the corresponding column name
        writer = csv.DictWriter(csv_file, fieldnames=field_names)
       
        # insert each data into a row
        writer.writeheader()
        for data in product_data:
            writer.writerow(data)

    print("Data successfully exported to CSV!"

Scraping logic done! Now, let's define your scrolling logic in an asynchronous scroller function.

Start Botright in headless mode, create a new page instance, and request the target website:

scraper.py
async def scroller():

    # start the botright instance in headless mode
    botright_client = await botright.Botright(headless=True)
    browser = await botright_client.new_browser()

    # create a new page instance
    page = await browser.new_page()

    # open the target web page
    await page.goto("https://scrapingcourse.com/infinite-scrolling")

Now, modify that scroller function to scroll the page continuously.

Set an initial page height to track the scroll height. Open a while loop and scroll to the bottom of the page with JavaScript's window.scrollTo method. Pause for more content to load as you scroll, then obtain the new height value:

scraper.py
async def scroller():

    # ...

    # set the last height to zero
    last_height = 0

    # implement continuous scrolling in a while loop
    while True:

        # scroll down to bottom
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
   
        # wait for the page to load
        await page.wait_for_timeout(5000)
       
        # get the new height
        new_height = await page.evaluate("document.body.scrollHeight")

Break the loop once the new height is the same as the last height. Before exiting, execute the scraper function on the page instance to extract all the loaded product content. Otherwise, update the last height to the new one. Finally, close the Botright instance:

scraper.py
    # ...

    # implement continuous scrolling in a while loop
    while True:

        # ...

        # break the loop if there are no more heights to scroll
        if new_height == last_height:

            # extract data once all content has loaded
            await scraper(page)

            break

        # update the initial height to the new height
        last_height = new_height

    # close the botright browser instance
    await botright_client.close()

Finally, execute the scroller function using asyncio:

scraper.py
# execute the scroller function
if __name__ == "__main__":
    asyncio.run(scroller())

Combine all the snippets. Here's the full code:

scraper.py
# import the required library
import asyncio
import botright
import csv

async def scraper(page):
   
    # extract all the product containers
    products = await page.query_selector_all(".product-item")

    # create an empty array to write the scraped data
    product_data = []

    # loop through each product container to extract its data
    for product in products:
       
        name = await product.query_selector(".product-name")
        price = await product.query_selector(".product-price")
        image = await product.query_selector("img")

        # extract each product's actual value into a dictionary
        extracted_data = {
            "name": await name.inner_text(),
            "price": await price.inner_text(),
            "image": await image.get_attribute("src")
        }

        # append the extracted data to the product_data array
        product_data.append(extracted_data)
   
    # write the extracted product data to a CSV file
    with open("products.csv", "w", newline="", encoding="utf-8") as csv_file:

        # specify the CSV's field names to match the extracted_data dictionary
        field_names = ["name", "price", "image"]

        # set each field name as the corresponding column name
        writer = csv.DictWriter(csv_file, fieldnames=field_names)
       
        # insert each data into a row
        writer.writeheader()
        for data in product_data:
            writer.writerow(data)

    print("Data successfully exported to CSV!")

async def scroller():

    # start the botright instance in headless mode
    botright_client = await botright.Botright(headless=True)
    browser = await botright_client.new_browser()

    # create a new page instance
    page = await browser.new_page()

    # open the target web page
    await page.goto("https://scrapingcourse.com/infinite-scrolling")

    # set the last height to zero
    last_height = 0

    # implement continuous scrolling in a while loop
    while True:

        # scroll down to bottom
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
   
        # wait for the page to load
        await page.wait_for_timeout(5000)
       
        # get the new height
        new_height = await page.evaluate("document.body.scrollHeight")

        # break the loop if there are no more heights to scroll
        if new_height == last_height:

            # extract data once all content has loaded
            await scraper(page)

            break

        # update the initial height to the new height
        last_height = new_height

    # close the botright browser instance
    await botright_client.close()

# execute the scroller function
if __name__ == "__main__":
    asyncio.run(scroller())

Run the code above. You'll see a new products.csv file in your project root directory with the extracted product information:

CSV File

Great job! You've extracted all the products from a dynamic web page with Botright by implementing infinite scrolling.

Apart from handling JavaScript rendering, Botright also claims to solve CAPTCHAs. Let's see how it works in the next section.

Solve CAPTCHAs With Botright

Botright uses image recognition libraries such as OpenCV, scikit-image, Torchvision, and Ultralytics to identify and solve CAPTCHA images.

Let's test how the CAPTCHA-solving functionality works using the Google reCAPTCHA demo page.

Recaptcha

Like before, run the Botright browser instance in headless mode to increase the chances of success, as CAPTCHA bypass hardly ever works in non-headless mode. Request the reCAPTCHA demo page and call Botright's solve_recaptcha method to solve the on-page CAPTCHA:

scraper.py
# import the required libraries
import asyncio
import botright

async def scraper():
   
    # start the botright instance in headless mode
    botright_client = await botright.Botright(headless=True)
    browser = await botright_client.new_browser()

    # create a new page instance
    page = await browser.new_page()

    # open the target web page
    await page.goto("https://www.google.com/recaptcha/api2/demo")

    # solve the CAPTCHA
    await page.solve_recaptcha()

Screenshot the web page to see the result. Close the Botright instance and run the scraper function using asyncio:

scraper.py
async def scraper():
    # ...
    
    # screenshot the page to capture the solved CAPTCHA
    await page.screenshot(path="reCAPTCHA-demo-screenshot.png")

    # close the botright browser instance
    await botright_client.close()

# execute the scraper function
if __name__ == "__main__":
    asyncio.run(scraper())

Combine both snippets to get the following complete code:

scraper.py
# import the required libraries
import asyncio
import botright

async def scraper():
   
    # start the botright instance in headless mode
    botright_client = await botright.Botright(headless=True)
    browser = await botright_client.new_browser()

    # create a new page instance
    page = await browser.new_page()

    # open the target web page
    await page.goto("https://www.google.com/recaptcha/api2/demo")

    # solve the CAPTCHA
    await page.solve_recaptcha()

    # screenshot the page to capture the solved CAPTCHA
    await page.screenshot(path="reCAPTCHA-demo-screenshot.png")

    # close the botright browser instance
    await botright_client.close()

# execute the scraper function
if __name__ == "__main__":
    asyncio.run(scraper())

The code above solves the reCAPTCHA puzzle. Here's the generated screenshot:

Recaptcha

It works! You've just successfully dealt with Google's reCAPTCHA using Botright. However, despite this powerful capability, Botright still has a few significant limitations that may hinder your web scraping efforts.

Limitations of Botright and Best Alternative

Botright is a useful web scraping tool for bypassing specific types of CAPTCHAs, including reCAPTCHA and hCaptcha. Still, it only solves these CAPTCHAs 50-80% of the time and is powerless against others, such as the Geetest V3 and V4 CAPTCHAs.

Additionally, Botright's API isn't thread-safe. If not properly managed, this limitation can cause thread conflicts and execution deadlocks during concurrent scraping. Botright recommends running a separate browser instance per thread to manage your system's resources effectively.

According to an issue on GitHub, Botright has dependency conflicts with Python 3.11+. To use it, you may need to downgrade to Python 3.10 or lower. Otherwise, your Botright scraper will fail due to version incompatibility.

Botright also struggles with advanced anti-bot systems like Cloudflare and Akamai, despite its claim that it can bypass them. We ran a 100-iteration benchmark to test Botright's ability to bypass Cloudflare, and none of the requests passed.

Fortunately, a web scraping API, such as ZenRows, can overcome all these challenges. ZenRows provides a full scraping toolkit along with a foolproof CAPTCHA and anti-bot bypass system. With automatic retries, concurrent requests, and a pricing plan where you only pay for successful requests, you can avoid bottlenecks and performance issues and easily scale up.

Let's see how it works by scraping the full-page HTML of a Cloudflare-protected website like the G2 Reviews page.

Sign up to open the ZenRows Request Builder. Paste the target URL in the link box, activate Premium Proxies, and select JS Rendering. Choose Python as your preferred language and select the API connection mode. Copy and paste the generated code into your Python script.

building a scraper with zenrows

The generated code should look like this:

scraper.py
# pip install requests
import requests

url = "https://www.g2.com/products/salesforce-salesforce-sales-cloud/reviews"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)

The above code extracts the protected website's HTML. Here's the result:

Output
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Salesforce Sales Cloud Reviews from June 2024</title>
</head>
<body>
    <!-- other content omitted for brevity -->
</body>
</html>

Congratulations! You've just bypassed a Cloudflare-protected website with ZenRows.

Conclusion

You've seen how Botright works and how to use it for Python web scraping. You now know how to:

  • Implement infinite scrolling with Botright to scrape dynamic content.
  • Export the extracted data to a CSV file.
  • Solve Google's reCAPTCHA with Botright.

Despite faring better than Playwright against anti-bot solutions, Botright still can't bypass advanced anti-bots and struggles with complex CAPTCHAs like Geetest. It also requires a rigorous setup, considering it doesn't support newer Python versions.

The best way to overcome these challenges and scrape any website without limitations is to use ZenRows, an all-in-one web scraping solution that works with any programming language. Try ZenRows for free now without a credit card!
