Does your Python web scraper keep getting blocked by CAPTCHAs? Botright, an improved version of Playwright, has a few tricks to help you solve them and make scraping easier.
In this article, you'll learn how Botright works and how to use it for web scraping and solving CAPTCHAs.
What Can You Do With Botright?
Botright is an improved version of Playwright designed to bypass CAPTCHAs and anti-bots while scraping with Python. It runs a browser instance of Chromium or Firefox using Playwright under the hood, allowing you to execute JavaScript and scrape dynamic content.
Unlike base Playwright, Botright doesn't expose bot-like attributes such as the WebDriver flag. It also uses scraped Chrome fingerprint data to become less detectable during browser fingerprinting tests.
Let's see the difference between Playwright and Botright in practice. The fingerprinting test on CreepJS shows that the base Playwright version uses the WebDriver instance by default. See a sample of the result below, with webDriverIsOn returning true:
webDriverIsOn: true
hasHeadlessUA: false
hasHeadlessWorkerUA: false
Botright, on the other hand, returns false for the same fingerprinting test, showing that it appears less bot-like than Playwright:
webDriverIsOn: false
hasHeadlessUA: false
hasHeadlessWorkerUA: false
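You can also run a quick version of this check yourself. The minimal sketch below is an illustrative addition (it uses the same Botright setup covered later in this tutorial) that reads navigator.webdriver from a Botright page; it should print None or False, whereas stock Playwright Chromium typically reports True:
# minimal sketch: inspect the webdriver flag from a botright page
import asyncio
import botright

async def check_webdriver():
    botright_client = await botright.Botright(headless=True)
    browser = await botright_client.new_browser()
    page = await browser.new_page()

    # evaluate navigator.webdriver in the page's JavaScript context
    print("navigator.webdriver:", await page.evaluate("navigator.webdriver"))

    await botright_client.close()

if __name__ == "__main__":
    asyncio.run(check_webdriver())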
Botright also features dedicated solvers for popular CAPTCHAs, such as reCAPTCHA and hCaptcha. Its ability to use a real browser and mimic a legitimate user allows it to bypass basic anti-bot checks like JavaScript challenges. However, it may still fail against advanced anti-bots like Cloudflare and DataDome.
Botright is asynchronous by default and supports concurrent scraping to extract data from multiple pages simultaneously. However, as the library isn't thread-safe, you must run a separate browser instance per URL to prevent threading conflicts.
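To see what that looks like in practice, here's a minimal concurrency sketch. It's an illustrative addition, not part of the tutorial below: the example.com URLs are placeholders, and each asyncio task opens its own browser instance before the shared client shuts everything down:
# minimal sketch: concurrent scraping with one browser instance per URL
import asyncio
import botright

async def scrape_title(botright_client, url):
    # give each URL its own browser instance to avoid threading conflicts
    browser = await botright_client.new_browser()
    page = await browser.new_page()
    await page.goto(url)
    print(url, await page.title())

async def main():
    botright_client = await botright.Botright(headless=True)

    # placeholder URLs for illustration
    urls = ["https://example.com/page-1", "https://example.com/page-2"]

    # run one task per URL concurrently
    await asyncio.gather(*(scrape_title(botright_client, url) for url in urls))
    await botright_client.close()

if __name__ == "__main__":
    asyncio.run(main())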
Tutorial: How to Scrape With Botright
In this section, you'll learn how to perform basic scraping tasks with Botright by extracting product information from the ScrapingCourse infinite scrolling challenge page.
You'll start with an initial HTTP request, then scroll the page to get specific product details, and finally export the scraped data to a CSV file.
The target website uses infinite scrolling to load content as you scroll down:
Ready to start? Let's move on to the initial setup.
Prerequisites
Botright doesn't support Python versions later than 3.10 due to dependency conflicts. Install Python 3.10 or lower (this tutorial uses 3.9) for the best experience.
You can install multiple Python versions on the same machine. If you've already installed a higher Python version like 3.12 and want to use the newly installed Python 3.9, create a dedicated Python 3.9 virtual environment and run your project inside it, as shown below.
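For example, on Windows (which this tutorial uses), you can create and activate a Python 3.9 environment with the py launcher. The environment name below is arbitrary:
# create a virtual environment that uses the Python 3.9 interpreter
py -3.9 -m venv botright-env

# activate the environment on Windows
botright-env\Scripts\activate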
Now, install Botright with pip:
pip install botright
This command also installs Playwright as a dependency. Once the installation is complete, download the required browser binaries:
playwright install
You'll also need to install the Ungoogled Chromium binary, a Chromium variant with improved privacy, to boost anti-bot evasion capability. Download and install it from the official download page, and Botright will use it automatically.
Create a project root folder with a new scraper.py file. You can work with any code editor, but this tutorial uses VS Code on a Windows operating system.
You're now ready to scrape with Botright! Let's get started.
Scrape With Botright
Let's see how Botright works by building a scraper that extracts product names, prices, and image URLs from the target website.
Before you begin, inspect the target website's HTML. Open the website in a browser such as Chrome, right-click the first product, and select Inspect.
Each product's information is inside a div element with the class name "product-item".
Since the page uses infinite scrolling, you'll automate downward scrolling to load and extract all the product information.
To begin, import asyncio and Botright into your scraper file. Asyncio is a built-in Python library that lets you execute asynchronous functions. Define an asynchronous scraper function that accepts a "page" instance as an argument. Then, extract all the product containers with Botright's query selector:
# import the required libraries
import asyncio
import botright

async def scraper(page):
    # extract all the product containers
    products = await page.query_selector_all(".product-item")
Create an empty product_data list to collect the scraped data. Loop through each product container to extract its product information using CSS selectors. Add the scraped items to a dictionary and append it to the list:
async def scraper(page):
    # ...

    # create an empty list to collect the scraped data
    product_data = []

    # loop through each product container to extract its data
    for product in products:
        name = await product.query_selector(".product-name")
        price = await product.query_selector(".product-price")
        image = await product.query_selector("img")

        # extract each product's actual value into a dictionary
        extracted_data = {
            "name": await name.inner_text(),
            "price": await price.inner_text(),
            "image": await image.get_attribute("src"),
        }

        # append the extracted data to the product_data list
        product_data.append(extracted_data)
The next step is to expand the scraper function to export the scraped data to a CSV file. First, add the csv package to the imported libraries. Open a CSV file, create matching field names, and insert the data into rows:
# import the required libraries
# ...
import csv

async def scraper(page):
    # ...

    # write the extracted product data to a CSV file
    with open("products.csv", "w", newline="", encoding="utf-8") as csv_file:
        # specify the CSV's field names to match the extracted_data dictionary
        field_names = ["name", "price", "image"]

        # set each field name as the corresponding column name
        writer = csv.DictWriter(csv_file, fieldnames=field_names)

        # write the header and insert each product's data into a row
        writer.writeheader()
        for data in product_data:
            writer.writerow(data)

    print("Data successfully exported to CSV!")
Scraping logic done! Now, let's define your scrolling logic in an asynchronous scroller function.
Start Botright in headless mode, create a new page instance, and request the target website:
async def scroller():
    # start the botright instance in headless mode
    botright_client = await botright.Botright(headless=True)
    browser = await botright_client.new_browser()

    # create a new page instance
    page = await browser.new_page()

    # open the target web page
    await page.goto("https://scrapingcourse.com/infinite-scrolling")
Now, modify that scroller function to scroll the page continuously.
Set an initial page height to track the scroll height. Open a while loop and scroll the page with JavaScript's window.scrollTo method. Pause for more content to load as you scroll, then obtain the new height value:
async def scroller():
    # ...

    # set the last height to zero
    last_height = 0

    # implement continuous scrolling in a while loop
    while True:
        # scroll down to bottom
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")

        # wait for the page to load
        await page.wait_for_timeout(5000)

        # get the new height
        new_height = await page.evaluate("document.body.scrollHeight")
Break the loop once the new height equals the last one: at that point, all content has loaded, so execute the scraper function on the page instance to extract the target product data before exiting. Otherwise, update the previous height value to the new one. Finally, close the Botright instance:
# ...

# implement continuous scrolling in a while loop
while True:
    # ...

    # break the loop if there are no more heights to scroll
    if new_height == last_height:
        # extract data once all content has loaded
        await scraper(page)
        break

    # update the initial height to the new height
    last_height = new_height

# close the botright browser instance
await botright_client.close()
Finally, execute the scroller function using asyncio:
# execute the scroller function
if __name__ == "__main__":
    asyncio.run(scroller())
Combine all the snippets. Here's the full code:
# import the required libraries
import asyncio
import botright
import csv

async def scraper(page):
    # extract all the product containers
    products = await page.query_selector_all(".product-item")

    # create an empty list to collect the scraped data
    product_data = []

    # loop through each product container to extract its data
    for product in products:
        name = await product.query_selector(".product-name")
        price = await product.query_selector(".product-price")
        image = await product.query_selector("img")

        # extract each product's actual value into a dictionary
        extracted_data = {
            "name": await name.inner_text(),
            "price": await price.inner_text(),
            "image": await image.get_attribute("src"),
        }

        # append the extracted data to the product_data list
        product_data.append(extracted_data)

    # write the extracted product data to a CSV file
    with open("products.csv", "w", newline="", encoding="utf-8") as csv_file:
        # specify the CSV's field names to match the extracted_data dictionary
        field_names = ["name", "price", "image"]

        # set each field name as the corresponding column name
        writer = csv.DictWriter(csv_file, fieldnames=field_names)

        # write the header and insert each product's data into a row
        writer.writeheader()
        for data in product_data:
            writer.writerow(data)

    print("Data successfully exported to CSV!")

async def scroller():
    # start the botright instance in headless mode
    botright_client = await botright.Botright(headless=True)
    browser = await botright_client.new_browser()

    # create a new page instance
    page = await browser.new_page()

    # open the target web page
    await page.goto("https://scrapingcourse.com/infinite-scrolling")

    # set the last height to zero
    last_height = 0

    # implement continuous scrolling in a while loop
    while True:
        # scroll down to bottom
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")

        # wait for the page to load
        await page.wait_for_timeout(5000)

        # get the new height
        new_height = await page.evaluate("document.body.scrollHeight")

        # break the loop if there are no more heights to scroll
        if new_height == last_height:
            # extract data once all content has loaded
            await scraper(page)
            break

        # update the initial height to the new height
        last_height = new_height

    # close the botright browser instance
    await botright_client.close()

# execute the scroller function
if __name__ == "__main__":
    asyncio.run(scroller())
Run the code above. You'll see a new products.csv file in your project root directory with the extracted product information.
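To quickly confirm what got exported, you can read the file back with Python's built-in csv module. This verification snippet is an optional addition, separate from the scraper:
# optional check: print the first few rows of the exported CSV
import csv

with open("products.csv", newline="", encoding="utf-8") as csv_file:
    reader = csv.DictReader(csv_file)
    for index, row in enumerate(reader):
        print(row)
        # stop after the first three products
        if index == 2:
            break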
Great job! You've extracted all the products from a dynamic web page with Botright by implementing infinite scrolling.
Apart from handling JavaScript rendering, Botright also claims to solve CAPTCHAs. Let's see how it works in the next section.
Solve CAPTCHAs With Botright
Botright uses image recognition libraries such as OpenCV, scikit-image, Torchvision, and Ultralytics to identify and solve CAPTCHA images.
Let's test how the CAPTCHA-solving functionality works using the Google reCAPTCHA demo page.
Like before, run the Botright browser instance in headless mode to increase the chances of success, as bypassing CAPTCHAs hardly ever works in non-headless mode. Request the reCAPTCHA demo page and call Botright's solve_recaptcha function to solve the on-page CAPTCHA:
# import the required libraries
import asyncio
import botright

async def scraper():
    # start the botright instance in headless mode
    botright_client = await botright.Botright(headless=True)
    browser = await botright_client.new_browser()

    # create a new page instance
    page = await browser.new_page()

    # open the target web page
    await page.goto("https://www.google.com/recaptcha/api2/demo")

    # solve the CAPTCHA
    await page.solve_recaptcha()
Screenshot the web page to see the result. Close the Botright instance and run the scraper function using asyncio:
async def scraper():
    # ...

    # screenshot the page to capture the solved CAPTCHA
    await page.screenshot(path="reCAPTCHA-demo-screenshot.png")

    # close the botright browser instance
    await botright_client.close()

# execute the scraper function
if __name__ == "__main__":
    asyncio.run(scraper())
Combine both snippets to get the following complete code:
# import the required libraries
import asyncio
import botright

async def scraper():
    # start the botright instance in headless mode
    botright_client = await botright.Botright(headless=True)
    browser = await botright_client.new_browser()

    # create a new page instance
    page = await browser.new_page()

    # open the target web page
    await page.goto("https://www.google.com/recaptcha/api2/demo")

    # solve the CAPTCHA
    await page.solve_recaptcha()

    # screenshot the page to capture the solved CAPTCHA
    await page.screenshot(path="reCAPTCHA-demo-screenshot.png")

    # close the botright browser instance
    await botright_client.close()

# execute the scraper function
if __name__ == "__main__":
    asyncio.run(scraper())
The code above solves the reCAPTCHA puzzle. Here's the generated screenshot:
It works! You've just successfully dealt with Google's reCAPTCHA using Botright.
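Botright's other solvers follow the same pattern. As a hedged sketch (the solve_hcaptcha call mirrors Botright's documented solver API, but verify the exact method name and demo URL against the current docs), here's how the same flow could look against hCaptcha's public demo page:
# hedged sketch: solving hCaptcha on its demo page
import asyncio
import botright

async def solve_hcaptcha_demo():
    botright_client = await botright.Botright(headless=True)
    browser = await botright_client.new_browser()
    page = await browser.new_page()

    # hCaptcha's public demo page (for illustration)
    await page.goto("https://accounts.hcaptcha.com/demo")

    # assumption: botright exposes solve_hcaptcha alongside solve_recaptcha
    await page.solve_hcaptcha()

    # screenshot the page to capture the result
    await page.screenshot(path="hcaptcha-demo-screenshot.png")
    await botright_client.close()

if __name__ == "__main__":
    asyncio.run(solve_hcaptcha_demo())
However, despite these solving capabilities, Botright still has a few significant limitations that may hinder your web scraping efforts.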
Limitations of Botright and Best Alternative
Botright is a useful web scraping tool for bypassing specific types of CAPTCHAs, including reCAPTCHA and hCaptcha. Still, it only solves these CAPTCHAs 50-80% of the time and is powerless against Geetest V3 and V4 CAPTCHAs.
Additionally, Botright's API isn't thread-safe. If not properly managed, this limitation can result in thread conflicts and execution deadlocks during concurrent scraping. Botright recommends running a separate browser instance per thread to manage your system's resources effectively.
According to an issue on GitHub, Botright has dependency issues with Python 3.11+. To use it, you may need to downgrade to Python 3.10 or lower; otherwise, your Botright scraper will fail due to version incompatibility.
Botright also struggles with advanced anti-bot systems like Cloudflare and Akamai despite its claim that it can bypass them. We ran a 100-iteration benchmark to test Botright's ability to bypass Cloudflare. None of the requests passed.
Fortunately, a web scraping API, such as ZenRows, can overcome all these challenges. ZenRows provides a full scraping toolkit along with a foolproof CAPTCHA and anti-bot bypass system. With automatic retries, concurrent requests, and a pricing plan where you only pay for successful requests, you can avoid bottlenecks and performance issues and easily scale up.
Let's see how it works by scraping the full-page HTML of a Cloudflare-protected website like the G2 Reviews page.
Sign up to open the ZenRows Request Builder. Paste the target URL in the link box, activate Premium Proxies, and select JS Rendering. Choose Python as your preferred language and select the API connection mode. Copy and paste the generated code into your Python script.
The generated code should look like this:
# pip install requests
import requests

url = "https://www.g2.com/products/salesforce-salesforce-sales-cloud/reviews"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)
The above code extracts the protected website's HTML. Here's the result:
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Salesforce Sales Cloud Reviews from June 2024</title>
</head>
<body>
    <!-- other content omitted for brevity -->
</body>
</html>
Congratulations! You've just bypassed a Cloudflare-protected website with ZenRows.
Conclusion
You've seen how Botright works and how to use it for Python web scraping. You now know how to:
- Implement infinite scrolling with Botright to scrape dynamic content.
- Export the extracted data to a CSV file.
- Solve Google's reCAPTCHA with Botright.
Despite faring better than Playwright against anti-bot solutions, Botright still can't bypass advanced anti-bots and struggles with complex CAPTCHAs like Geetest. It also requires extra setup, considering it doesn't support newer Python versions.
The best way to overcome these challenges and scrape any website without limitations is to use ZenRows, an all-in-one web scraping solution that works with any programming language. Try ZenRows for free now without a credit card!