Web Scraping with SeleniumBase and Python in 2025

Idowu Omisola
October 16, 2024 · 5 min read

Are you searching for a faster and more user-friendly alternative to Selenium for web scraping in Python?

SeleniumBase might be the answer. It's an easier alternative to Selenium for automating, testing, and scraping websites. It has built-in tools for data collection, works with modern browsers, and even helps bypass anti-bot defenses.

Here's what you're about to learn in this guide:

  • What SeleniumBase is and why it's useful for scraping.
  • How to build a basic scraper that extracts full-page HTML.
  • How to extract specific data, such as product names, images, and links.
  • How to automate browser interactions, such as logging in.
  • How to avoid anti-bot blocks with UC Mode.

Let's get started!

What Is SeleniumBase?

SeleniumBase is an open-source Python framework designed to simplify automation, testing, and web scraping with less code compared to standard Selenium. It is built on Selenium and uses its WebDriver API to interact with and extract data from web pages. 

SeleniumBase supports the pytest plugin, which allows you to customize test runs directly from the command line. With command-line options, you can set the browser type, enable headless mode, configure proxies, change user agents, and more.
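
For instance, assuming a test file named scraper.py (you'll create one later in this guide), several of these options can be combined in a single run. The proxy address and user agent below are placeholders to replace with your own values:

Terminal
pytest scraper.py --browser=firefox --headless --proxy="user:pass@host:port" --agent="custom-user-agent-string"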

One standout feature of SeleniumBase is its Undetected ChromeDriver (UC) Mode. This mode helps bots act more like real users, making bypassing anti-bot detection systems easier. As a result, it provides smoother interactions on websites with security features like CAPTCHAs, making web scraping and testing more effective.

SeleniumBase also offers seamless proxy configuration for added anonymity and to avoid IP bans during scraping. Check out our detailed guide on how to set up SeleniumBase with proxies for more details.

Benefits of SeleniumBase

  • SeleniumBase automatically captures screenshots and logs detailed error information when issues arise.
  • It reduces the need for repetitive code, making scripts faster and easier to write compared to standard Selenium.
  • It supports parallel testing or scraping across multiple browsers or sessions (see the example after this list).
  • SeleniumBase is used for both automated testing and web scraping, making it a flexible choice for various web tasks.
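
Parallel runs, for example, use the pytest-xdist plugin installed alongside SeleniumBase. Here's a minimal sketch that runs the tests in scraper.py (built later in this guide) across four processes; the process count is just an example:

Terminal
pytest scraper.py -n 4 --headless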

Building a Basic Scraper with SeleniumBase

Using this e-commerce demo website as the target, let's create a basic scraper with SeleniumBase to extract the full HTML of the page. Here's a preview of what the target page looks like:

Scrapingcourse Ecommerce Store

This tutorial assumes you already have Python installed. If not, install the latest version from the Python download page.

Next, install SeleniumBase by running the following command in your terminal:

Terminal
pip3 install seleniumbase

Create a scraper.py file in your project directory and write the following Python code. The code imports BaseCase from SeleniumBase and uses the get_page_source() method to fetch the raw HTML:

scraper.py
from seleniumbase import BaseCase

class Scraper(BaseCase):
    def test_product_details(self):
        # open the e-commerce demo webpage
        self.open("https://www.scrapingcourse.com/ecommerce/")

        # extract the full-page HTML
        page_html = self.get_page_source()

        # print the HTML
        print(page_html)

Verify that your scraper.py SeleniumBase script works by running the following command:

Terminal
pytest scraper.py -s

Once the browser instance closes, you'll see the website's full-page HTML in your terminal:

Output
<!DOCTYPE html>
<html lang="en-US">
<head>
    <!-- ... -->

    <title>Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com</title>

    <!-- ... -->
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count" id="result-count">Showing 1-16 of 188 results</p>
    <ul class="products columns-4" id="product-list">

        <!-- ... -->

    </ul>
</body>
</html>

Awesome! You've just built a simple SeleniumBase scraper. Now, let's build on this to extract specific product details.

Extracting Data with SeleniumBase

SeleniumBase supports CSS selectors, XPath, and IDs to target specific web elements. It also has built-in wait mechanisms to ensure that elements are fully loaded before interacting with them, making it reliable for dynamic content or JavaScript-heavy websites.
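
Before applying these to the product data, here's a minimal sketch showing a CSS selector, an XPath expression, and an ID lookup on the same demo page, together with an explicit wait. The element names are taken from the page's HTML shown earlier:

Example
from seleniumbase import BaseCase

class SelectorDemo(BaseCase):
    def test_selectors_and_waits(self):
        self.open("https://www.scrapingcourse.com/ecommerce/")

        # wait for the product grid to load before interacting with it
        self.wait_for_element("#product-list")

        # CSS selector: the first product title
        first_name = self.get_text("h2.product-name")

        # XPath: selectors starting with "/" are treated as XPath automatically
        same_name = self.get_text('//h2[@class="product-name"]')

        # ID lookup: the result-count paragraph
        result_count = self.get_text("#result-count")

        print(first_name, same_name, result_count)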

From the target page, you'll scrape the product names, image sources, and product URLs.

Scraping Product Names

Let's start by scraping product names. First, inspect the webpage to identify the HTML structure of the product name.

Right-click on a product name on the target page and select Inspect. The product names are inside h2 tags with a class name product-name.

scrapingcourse ecommerce homepage inspect first product li

Next, create a product_data dictionary to store the product names. Use the find_elements method to locate all the product names on the page and store them in the dictionary:

scraper.py
from seleniumbase import BaseCase

class Scraper(BaseCase):
    def test_product_details(self):
        # open the e-commerce demo webpage
        self.open("https://www.scrapingcourse.com/ecommerce/")
        
        # create a dictionary to store the product details
        product_data = {
            "Product Names": []
        }
        
        # find all product names using the css selector
        product_names = self.find_elements("h2.product-name")
        for product in product_names:
            product_data["Product Names"].append(product.text)
        
        # print the product data dictionary
        print(product_data)

This code will output the names of all the products listed on the page:

Output
{'Product Names': ['Abominable Hoodie', 'Adrienne Trek Jacket', 'Aeon Capri', 'Aero Daily Fitness Tee', 'Aether Gym Pant', 'Affirm Water Bottle', 'Aim Analog Watch', 'Ajax Full-Zip Sweatshirt', 'Ana Running Short', 'Angel Light Running Short', 'Antonia Racer Tank', 'Apollo Running Short', 'Arcadio Gym Short', 'Argus All-Weather Tank', 'Ariel Roll Sleeve Sweatshirt', 'Artemis Running Short']}

Awesome progress! Let's continue.

Scraping Product Images

Next, you'll scrape the product images. Right-click an image and select Inspect to locate the img tag with the class name product-image.

Use the find_elements method to locate all the image elements on the page. Then, loop through each image and use the get_attribute method to extract the image URLs. The URLs are stored in the product_data dictionary:

scraper.py
# ...

class Scraper(BaseCase):
    def test_product_details(self):
 
        # ...
 
        # create a dictionary to store the product details
        product_data = {
            # ...
            "Image URLs": [],
        }
 
        # find all product images and extract their URLs
        product_images = self.find_elements("img.product-image")
        for image in product_images:
            product_data["Image URLs"].append(image.get_attribute("src"))
 
        # print the product data dictionary
        print(product_data)

The above code will output the product image URLs:

Output
{# ...
'Image URLs': ['https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main.jpg', 'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_main.jpg', 'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wp07-black_main.jpg', 


# ... omitted for brevity
                'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wsh04-black_main.jpg']}

Great! Let's move on.

Scraping Product URLs

Finally, let's extract the links to each product. Right-click on a product link and select Inspect. The product links are inside anchor (a) tags with the class name woocommerce-LoopProduct-link.

Use the find_elements method to locate the product links. Then, loop through each element and use the get_attribute method to extract the href attribute:

scraper.py
# ...

class Scraper(BaseCase):
    def test_product_details(self):

        # ...

        # create a dictionary to store the product details
        product_data = {
            # ...
            "Product URLs": [],
        }

        # find all product links and extract their URLs
        product_links = self.find_elements("a.woocommerce-LoopProduct-link")
        for link in product_links:
            product_data["Product URLs"].append(link.get_attribute("href"))

        # print the product data dictionary
        print(product_data)

This code will output the product URLs of all the products listed on the page:

Output
{# ...
'Product URLs': ['https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/', 'https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/', 'https://www.scrapingcourse.com/ecommerce/product/aeon-capri/', 


# ... omitted for brevity
                  'https://www.scrapingcourse.com/ecommerce/product/artemis-running-short/']}

Nice work so far! 

Now, let's put everything together into a complete scraper:

scraper.py
from seleniumbase import BaseCase

class Scraper(BaseCase):
    def test_product_details(self):
        # open the e-commerce demo webpage
        self.open("https://www.scrapingcourse.com/ecommerce/")

        # create a dictionary to store the product details
        product_data = {
            "Product Names": [],
            "Image URLs": [],
            "Product URLs": []
        }

        # find all product names using the css selector
        product_names = self.find_elements("h2.product-name")
        for product in product_names:
            product_data["Product Names"].append(product.text)

        # find all product images and extract their URLs
        product_images = self.find_elements("img.product-image")
        for image in product_images:
            product_data["Image URLs"].append(image.get_attribute("src"))

        # find all product links and extract their URLs
        product_links = self.find_elements("a.woocommerce-LoopProduct-link")
        for link in product_links:
            product_data["Product URLs"].append(link.get_attribute("href"))

        # print the product data dictionary
        print(product_data)

Here's the output of the combined code:

Output
{'Product Names': ['Abominable Hoodie', 'Adrienne Trek Jacket', 
                   # ... omitted for brevity
                   'Artemis Running Short'], 


 'Image URLs': ['https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main.jpg', 'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_main.jpg', 
                # ... omitted for brevity
                'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wsh04-black_main.jpg'], 

 'Product URLs': ['https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/', 'https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/', 
                  # ... omitted for brevity
                  'https://www.scrapingcourse.com/ecommerce/product/artemis-running-short/']}

Congrats! You've now successfully built a full scraper with SeleniumBase to extract product details.

Automating Browser Interactions

So far, we've focused on extracting static content like product names, images, and links. But what if you need to simulate real browser interactions? 

With SeleniumBase, you can easily mimic browser activities such as scrolling, clicking, filling out forms, and logging in. This is especially useful when dealing with dynamic pages that require user input or interaction.
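
Here's a minimal sketch of a few of these interactions on the demo store used earlier: it scrolls the listing page, then clicks the first product title to open its detail page. The final check assumes the detail page has an h1 heading:

Example
from seleniumbase import BaseCase

class InteractionDemo(BaseCase):
    def test_basic_interactions(self):
        # open the e-commerce demo webpage
        self.open("https://www.scrapingcourse.com/ecommerce/")

        # scroll to the bottom of the listing page
        self.scroll_to_bottom()

        # scroll back to the first product title and click it
        self.scroll_to("h2.product-name")
        self.click("h2.product-name")

        # the product detail page should show a main heading (assumed selector)
        self.assert_element("h1")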

In this section, you'll learn how to simulate browser interactions and capture a screenshot of a product page after logging in. We'll use a simple login challenge page, which requires authentication before accessing product data:

Scrapingcourse login challenge page

Create a login.py file in your project directory and add the following SeleniumBase script. This script will open the login page and input the credentials. It will then click the login button and capture a screenshot of the product page:

login.py
from seleniumbase import BaseCase

class Scraper(BaseCase):
    def test_login_and_interact(self):
        # open the login page
        self.open("https://www.scrapingcourse.com/login")

        # fill out the email and password fields with the demo credentials
        self.type("#email", "admin@example.com")
        self.type("#password", "password")

        # click the login button
        self.click('button[type="submit"]')

        # take a screenshot after clicking login
        self.save_screenshot("login_screenshot.png")

Run the above code with the following command:

Terminal
pytest login.py -s

This script will automate the login process and save a screenshot of the page that loads after login to your project directory.

And that's it! You've successfully used SeleniumBase to automate browser interactions.

Avoiding Blocks with SeleniumBase

You now know how to perform basic web scraping operations using SeleniumBase. However, many websites use advanced anti-bot measures, such as CAPTCHA challenges, IP bans, and behavioral monitoring, to block scrapers. These protections can make it difficult to retrieve data, as your scraper might be flagged and blocked as non-human traffic. That's why it's important to know how to scrape without getting blocked.

SeleniumBase UC Mode provides enhanced capabilities for bypassing anti-bot detection systems by allowing your scrapers to mimic human behavior. UC Mode is based on an improved version of Undetected ChromeDriver and implements several evasion measures, including automatically changing User Agents, automatically setting Chromium arguments, and special uc_*() methods for bypassing CAPTCHAs.

Let's try to scrape the anti-bot challenge page using SeleniumBase UC Mode.

Create a new bypass_antibot.py file and initialize the browser using UC Mode. Navigate to the target URL and use uc_open_with_reconnect and uc_gui_click_captcha to evade detection and handle CAPTCHAs. After bypassing the measures, retrieve the page source, print it, and close the browser.

Here's how your code would look:

bypass_antibot.py
# import the Driver class from seleniumbase
from seleniumbase import Driver

# initialize driver with UC mode enabled
driver = Driver(uc=True)

# set target URL
url = "https://www.scrapingcourse.com/antibot-challenge"

# open URL using UC mode with 4 second reconnect time to bypass initial detection
driver.uc_open_with_reconnect(url, reconnect_time=4)

# attempt to bypass CAPTCHA using UC mode's built-in method
driver.uc_gui_click_captcha()

# retrieve and print the page source after bypassing anti-bot measures
page_html = driver.get_page_source()
print(page_html)

# close the browser and end the session
driver.quit()

Run the above code using the following command:

Terminal
python bypass_antibot.py

You'll get the following output:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>

Congratulations! You bypassed the anti-bot challenge to retrieve the HTML of the target page. 

Although SeleniumBase can help you bypass anti-bots in some cases, it's not always reliable or effective at scale. It can still get detected in headless mode. Also, the tool is open source, so anti-bot developers can study its bypass mechanisms and block it. So, how can you overcome these limitations? The next section presents a better alternative to SeleniumBase.

Best Alternative to SeleniumBase

ZenRows Scraper API is designed to overcome the challenges of modern web scraping. It excels at bypassing CAPTCHAs and anti-bot mechanisms, boasting a 98.7% success rate in accessing protected web pages.

To demonstrate its capabilities, let's use it to scrape the same site we attempted with SeleniumBase.

To get started, sign up for ZenRows. After logging in, you'll be redirected to the Request Builder page. Enter the target URL in the link box, activate Premium Proxies, and enable JS Rendering boost mode. Select your programming language (e.g., Python) and choose the API connection mode.

building a scraper with zenrows

The Request Builder will generate Python code similar to this:

Example
# pip install requests
import requests

url = 'https://www.scrapingcourse.com/antibot-challenge'
apikey = '<YOUR_ZENROWS_API_KEY>'

params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'premium_proxy': 'true',
}

response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)

After running this code, you'll receive the full HTML content of the target page:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>

ZenRows' web scraping API offers a robust solution for your web scraping needs, with features like JavaScript rendering, CAPTCHA bypass, anti-bot detection avoidance, automated proxy management, and more. It can handle even the most secure websites at scale, making it the perfect choice for web scraping. 

Conclusion

This tutorial covered the most essential concepts for web scraping in Python using SeleniumBase. You now know how to:

  • Set up SeleniumBase for web scraping.
  • Extract text, images, and links from a web page.
  • Automate browser interactions, such as logging into websites.
  • Bypass anti-bot measures.

Web scraping presents numerous challenges, especially with the advanced anti-bot technologies websites use. Consider using a solution like the ZenRows web scraping API to bypass even the most sophisticated anti-bot systems and scrape without interruptions.

Give ZenRows a try today!
