Web Scraping With a Headless Browser in Python [Selenium Tutorial]

Rubén del Campo
October 1, 2024 · 9 min read

Are you considering using a headless browser for web scraping in Python? Python headless browsers have automation features, making them suitable for scraping dynamic websites that require complex user interactions.

With many headless browser options available, choosing the right one can be challenging. In this guide, we'll cover what Python headless browsers are, help you choose the best one, and show you how to use Selenium headless mode in Python.

Let's get started!

What Is a Headless Browser in Python?

A Python headless browser is a browser without a graphical user interface (GUI) that retains the capabilities of a conventional browser. Unlike in a regular browser, activities in a headless browser are invisible and are usually controlled via an automation script.

One of the critical advantages of headless browsers is that they don't render memory-demanding graphics, making them considerably faster than GUI-based browsers.

Popular Python headless browser tools include Selenium and Playwright. These tools typically let you switch between headless and GUI modes and are commonly used for test automation and web scraping.

Headless browsers allow you to execute JavaScript and automate user interactions, such as clicking, scrolling, typing, hovering, and more. Typical headless browsers, like Selenium, let you add a waiting mechanism to your script, allowing web elements to load before further actions. 

These abilities make headless browsers suitable for extracting data from websites that render content dynamically using JavaScript.
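
For instance, here's a minimal sketch of such a waiting mechanism in Selenium, using the demo ecommerce site scraped later in this guide. The 10-second timeout and the .product selector are illustrative choices:

Example
# pip3 install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.scrapingcourse.com/ecommerce/")

# wait up to 10 seconds for the product cards to appear before interacting
products = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".product"))
)
print(f"{len(products)} products loaded")

driver.quit()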

Benefits of a Python Headless Browser

Here are the pros of using a headless browser.

  • Scraping JavaScript-Rendered Pages: JavaScript-rendered content can take some time to load due to dynamic requests. Headless browsers can wait for elements to appear, ensuring they're available before interacting with them during test automation or scraping.
  • Improved Performance: Since headless browsers don't rely on the GUI, they eliminate the memory overhead associated with loading heavy graphics and resources, such as videos, animations, and images. This ability improves overall performance during automation and scraping.
  • Saving Resources: Unlike GUI-based browsers, headless browsers are lightweight, requiring fewer system resources than regular browsers. This feature allows for greater scalability without complex computation requirements. It also improves multitasking since you can use the machine while the browser runs in the background.

Disadvantages of a Python Headless Browser

Despite the upsides, headless browsers have some limitations we'll discuss briefly.

  • Easy to Detect by Anti-bot Systems: Headless browsers present bot-like signals, such as the HeadlessChrome User Agent flag, missing plugins, and the absence of WebGL or Canvas rendering. These signals make them prone to anti-bot detection since they often won't pass fingerprinting checks (see the sketch after this list).
  • Lack of Visual Rendering: While the absence of visual rendering improves scraping performance, it limits efficient debugging since you can't see what's happening behind the scenes.
  • High Setup Requirements: Headless browsers are typically more technical to set up than standard HTTP clients like Requests. In addition to library installation, you'll still need to deal with steps like WebDriver installation and management.
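
To see one of these signals yourself, here's a minimal sketch that launches headless Chrome and prints the User Agent it reports. Depending on your Chrome version and headless mode, the string may reveal HeadlessChrome:

Example
# pip3 install selenium
from selenium import webdriver

# launch Chrome in headless mode
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

# print the User Agent the browser reports; it may contain "HeadlessChrome"
print(driver.execute_script("return navigator.userAgent"))

driver.quit()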

The Best Headless Browsers for Python

Below is a non-exhaustive list of the most popular Python headless browsers.

Selenium

Selenium is one of the most popular automation libraries for software testing and web scraping. It supports plugins like the Undetected ChromeDriver to bypass anti-bot detection during scraping.

The library uses the WebDriver protocol and supports major browsers, including Chrome, Firefox, Edge, Safari, and Internet Explorer. Selenium runs in GUI (non-headless) mode by default but allows you to switch to headless mode.

One drawback of web scraping with Selenium is its steep learning curve, particularly for advanced tasks. It also offers limited direct control over browser properties like the navigator field. Additionally, executing JavaScript can be clunky: you must wrap the script in a string, making the code messier than alternatives like Playwright.
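
For example, here's what that string-wrapped JavaScript execution looks like in practice (a minimal sketch against a generic test page):

Example
# pip3 install selenium
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://httpbin.io")

# JavaScript must be passed to execute_script as a string
title = driver.execute_script("return document.title;")
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
print(title)

driver.quit()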

Playwright

Playwright is an automation library that supports most major browsers, including Chrome, Firefox, Safari, and Edge. Scraping with Playwright is a top choice due to its straightforward API methods and availability of evasion plugins like the Playwright Stealth plugin.

Playwright relies on the Chrome DevTools Protocol, offering better control over the browser properties. This feature makes it possible to manipulate browser properties directly using JavaScript.

Playwright also lets you download browser binaries via a terminal command. However, the process increases the overall size of the installation file and consumes extra disk space. Another limitation is that, unlike Selenium, Playwright doesn't support legacy browsers like Internet Explorer.
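
As a point of comparison, here's a minimal Playwright sketch that fetches a page headlessly. It assumes you've installed the Python package and its Chromium binary:

Example
# pip3 install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # headless=True is the default; shown here for clarity
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://httpbin.io")
    print(page.title())
    browser.close()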

Pyppeteer

Pyppeteer is an unofficial Python port for Puppeteer, a Node.js browser automation library. It mainly runs the Chromium browser and lets you use Puppeteer's API methods. 

Like Puppeteer, Pyppeteer executes the browser in headless mode by default, but you can switch to the GUI mode for extra tasks like debugging.
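
Here's a minimal Pyppeteer sketch; note that it downloads a bundled Chromium the first time it runs:

Example
# pip3 install pyppeteer
import asyncio
from pyppeteer import launch

async def main():
    # launch() is headless by default; pass headless=False for GUI mode
    browser = await launch()
    page = await browser.newPage()
    await page.goto("https://httpbin.io")
    print(await page.title())
    await browser.close()

asyncio.run(main())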

Scraping with Pyppeteer has some critical setbacks worth knowing. It is unmaintained and isn't current with Puppeteer's latest releases. For instance, Puppeteer added support for Firefox, but Pyppeteer remains stuck with Chromium, limiting it to a single browser environment.

Splash

Splash is a JavaScript rendering tool with an HTTP API. It features a dedicated server that runs a lightweight browser to interact with web pages and scrape dynamic content.

Splash works well with Scrapy via the Scrapy-Splash middleware and uses Lua scripting to automate user actions. When you write a Lua automation script in Splash, you can run it via the Splash execution endpoint. This ability to communicate through an HTTP API makes Splash usable across various programming languages.
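
As an illustration, here's a minimal sketch that renders a page through Splash's render.html endpoint. It assumes a Splash instance is already running locally on the default port 8050:

Example
# pip3 install requests
import requests

# assumes Splash is running locally, e.g. via:
# docker run -p 8050:8050 scrapinghub/splash
response = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://httpbin.io", "wait": 2},
)
print(response.text)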

The major downside of scraping with Splash is that it offers only partial browser functionality. Its execution relies heavily on Lua scripting, which presents challenges such as poor backward compatibility with older versions and limited error handling.

MechanicalSoup

MechanicalSoup is another Python library that automates user actions on a website. Under the hood, it uses the Requests library as an HTTP client and BeautifulSoup as an HTML parser, making it an excellent scraping tool.

The only attribute that qualifies MechanicalSoup as a headless browser is its low-level browser emulation, which is strictly headless. Unlike full-featured tools such as Selenium and Playwright, it doesn't offer a full-fledged browser environment.

However, the library persists cookies across pages, making it suitable for automating simple user actions, such as page navigation and form submission. That said, MechanicalSoup isn't the most appropriate tool for scraping dynamic websites, as it can't handle JavaScript rendering.
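
For simple, non-JavaScript pages, usage looks like this minimal sketch:

Example
# pip3 install MechanicalSoup
import mechanicalsoup

# StatefulBrowser persists cookies across page navigations
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://httpbin.io")

# the current page is exposed as a BeautifulSoup object
page = browser.get_current_page()
print(page.title)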

How to Run a Headless Browser With Selenium and Python?

We'll now show you how to automate the browser using Python Selenium in headless mode. But first, let's see what types of headless browsers Selenium supports.

What Headless Browser Is Included in Selenium?

Chrome, Edge, and Firefox are the three browsers Selenium supports in headless mode in Python. Here's how to launch each of them headlessly from the command line.

1. Headless Chrome Selenium Python

Chrome has shipped with headless capability since version 59. You can call the Chrome binary from the command line to run it headlessly.

Terminal
chrome --headless --disable-gpu --remote-debugging-port=9222 https://httpbin.io

2. Headless Edge

This browser was initially built with EdgeHTML, a proprietary browser engine from Microsoft. However, Microsoft announced a move to Chromium in late 2018 and later relaunched Edge as a Chromium-based browser with the Blink and V8 engines, inheriting Chrome's headless capability.

Terminal
msedge --headless --disable-gpu --remote-debugging-port=9222 https://httpbin.io

3. Headless Firefox

Firefox is another widely used browser, built on the Gecko rendering engine. To open headless Firefox, run the following command in your terminal:

Terminal
firefox -headless https://httpbin.io

Now, let's go through the steps of running Selenium headless with Python using the Chrome browser. 

For this example, we'll scrape product details like names and prices from Ecommerce Challenge, a demo site dedicated to web scrapers with real ecommerce features.

Prerequisites

Before scraping a web page with a Python headless browser, let's install the following tools:

  • Python: Download the latest version from the official website and follow the installation wizard to set it up.
  • Selenium: Run the command pip install selenium.
  • Chrome: Download it as a regular browser if you don't have it installed yet.

Installing and importing a WebDriver used to be necessary, but it no longer is: since version 4.6, Selenium ships with Selenium Manager, which handles WebDriver management automatically.

If you have an earlier Selenium version installed, update it to get the latest features and functionality. To check your current version, run this command in the terminal, inside the virtual environment where you installed Selenium:

Terminal
pip3 show selenium
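
Alternatively, you can check the installed version from Python itself:

Example
import selenium

print(selenium.__version__)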

To force-install the newest version, run the following command:

Terminal
pip3 install --upgrade selenium

You're ready to scrape some product data using a Python headless browser. Let's go!

Step 1: Import and Set Up the Selenium WebDriver

To begin, open your Python script, import the WebDriver module from Selenium, and create a new Chrome instance:

Example
# pip3 install selenium
from selenium import webdriver

# set up Chrome
driver = webdriver.Chrome()

Step 2: Open the Page

Open the target website with the ChromeDriver instance. This step confirms that your environment works correctly. The code below will open Chrome in GUI mode to visit your target page:

Example
# ...
# open the target website
driver.get("https://www.scrapingcourse.com/ecommerce/")

The page should look like this:

[Image: scrapingcourse ecommerce page controlled by automated software]

Step 3: Switch to Python Selenium Headless Mode

The current scraper opens the browser in non-headless (GUI) mode, but we don't want the browser window to appear on the monitor. So, we'll need to run Chrome headlessly.

Switching to headless Chrome in Python is straightforward: we only need to add two lines of code.

Set up ChromeOptions, add the headless flag, and pass the options parameter when creating the WebDriver instance.

Here's the modified code:

Example
# ...

# set up ChromeOptions
options = webdriver.ChromeOptions()

# add headless Chrome option
options.add_argument("--headless=new")

# set up Chrome in headless mode
driver = webdriver.Chrome(options=options)

The complete code looks like this:

Example
# pip3 install selenium
from selenium import webdriver

# set up ChromeOptions
options = webdriver.ChromeOptions()

# add headless Chrome option
options.add_argument("--headless=new")

# set up Chrome in headless mode
driver = webdriver.Chrome(options=options)

# open the target website
driver.get("https://www.scrapingcourse.com/ecommerce/")

# close the driver instance and release its resources
driver.quit()

After running the code, you'll see nothing but some minor log output in the terminal.

Output
DevTools listening on ws://127.0.0.1:55269/devtools/browser/af651a52-6e59-457f-8312-9287b351c2fa

The Chrome GUI doesn't appear this time. Your Selenium scraper now runs in headless mode.

Awesome!

But how do you verify that the code successfully reached the target page without runtime errors? You'll need to log output to confirm.

To do this, let's log the scraping result in the terminal. Add the two lines of code below to print the page URL and title. This step checks whether Selenium gets the required data:

Example
# ...
# print the current URL and page title
print(f"Page URL: {driver.current_url}")
print(f"Page Title: {driver.title}")

Here's the output:

Output
Page URL: https://www.scrapingcourse.com/ecommerce/
Page Title: Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com

Our scraper gets the intended results! So, let's scrape some product data.

Step 4: Scrape the Data

Before scraping specific product data, we'll need to find the HTML element that holds the data by inspecting the web page with a real browser.

To do this, open the target site in a regular browser like Chrome. Right-click the first product and select Inspect to open Chrome DevTools on the Elements tab.

Let's go ahead and find the elements that hold the product names and prices. The browser will highlight the area covered by the selected element, making it easier to identify the right element.

Each product element is inside a list tag (li) with the class name product. This list tag is the parent element containing the product information (name, price, etc.), each piece bearing a descriptive class name:

[Image: scrapingcourse ecommerce homepage with the first product's li element inspected]

To scrape product information inside each list tag, you'll loop through them and extract the information in each container.

Selenium provides several ways to select and extract the desired element, such as by element ID, tag name, class name, CSS selector, or XPath. Let's use the CSS selector method.
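
For illustration, here's a minimal sketch showing how the same product-name elements could be located with a few of those strategies. It uses the By class imported in the next step, and the class-name and XPath variants assume the markup inspected above:

Example
# ...
from selenium.webdriver.common.by import By

# three equivalent ways of locating the product name elements (illustrative)
names_css = driver.find_elements(By.CSS_SELECTOR, ".product-name")
names_class = driver.find_elements(By.CLASS_NAME, "product-name")
names_xpath = driver.find_elements(By.XPATH, "//*[contains(@class, 'product-name')]")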

First, add the By class to your imports. Then, create an empty list to collect the scraped data and obtain all the parent elements using Selenium's find_elements method, which returns a list of all the li tags containing the products:

Example
# pip3 install selenium
# ...
from selenium.webdriver.common.by import By


# ...
# define an empty list to collect scraped data
scraped_data = []

# get all parent elements
products = driver.find_elements(By.CSS_SELECTOR, ".product")

Using a for loop, iterate through the extracted parent elements to collect the product name and price from each product card into a dictionary. Then, append the collected data to the scraped_data list. Finally, print the scraped data and close the browser:

Example
# ...

# loop through the parents to extract product information
for product in products:
    data = {
        "Name": product.find_element(By.CSS_SELECTOR, ".product-name").text,
        "Price": product.find_element(By.CSS_SELECTOR, ".price").text,
    }

    # append the scraped data to the empty list
    scraped_data.append(data)

# print the scraped data
print(scraped_data)

# close the driver instance and release its resources
driver.quit()

Combining all the snippets gives the following complete code:

Example
# pip3 install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By

# set up ChromeOptions
options = webdriver.ChromeOptions()

# add headless Chrome option
options.add_argument("--headless=new")

# set up Chrome in headless mode
driver = webdriver.Chrome(options=options)

# open the target website
driver.get("https://www.scrapingcourse.com/ecommerce/")

# define an empty list to collect scraped data
scraped_data = []

# get all parent elements
products = driver.find_elements(By.CSS_SELECTOR, ".product")

# loop through the parents to extract product information
for product in products:
    data = {
        "Name": product.find_element(By.CSS_SELECTOR, ".product-name").text,
        "Price": product.find_element(By.CSS_SELECTOR, ".price").text,
    }

    # append the scraped data to the empty list
    scraped_data.append(data)

# print the scraped data
print(scraped_data)

# close the driver instance and release its resources
driver.quit()

The above code outputs the following:

Output
[
    {'Name': 'Abominable Hoodie', 'Price': '$69.00'},
    {'Name': 'Adrienne Trek Jacket', 'Price': '$57.00'},
    # ... other products omitted for brevity
    {'Name': 'Ariel Roll Sleeve Sweatshirt', 'Price': '$39.00'},
    {'Name': 'Artemis Running Short', 'Price': '$45.00'},
]

Nice job! You just scraped product data from a website using Selenium in headless mode in Python.

Best Alternative for When You Get Blocked Web Scraping With Selenium and Python

Although headless Selenium works seamlessly with dynamic websites, it often fails on websites protected by anti-bot measures.

Let's test its anti-bot bypass ability by scraping the full-page HTML of this Cloudflare Challenge page:

Example
# pip3 install selenium
from selenium import webdriver

# set up ChromeOptions
options = webdriver.ChromeOptions()

# add headless Chrome option
options.add_argument("--headless=new")

# set up Chrome in headless mode
driver = webdriver.Chrome(options=options)

# open the target website
driver.get("https://www.scrapingcourse.com/cloudflare-challenge")

# print the page source
print(driver.page_source)

# close the driver instance and release its resources
driver.quit()

Selenium got blocked by Cloudflare's interstitial page. Here's the output:

Output
<!DOCTYPE html>
<html lang="en-US">
<head>
    <title>Just a moment...</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <!-- ... -->
</head>
<body>
    <!-- ... -->
</body>
</html>

The above result is expected because Selenium and other headless browser tools don't have built-in mechanisms to evade anti-bot detection. As mentioned earlier, headless browsers don't pass fingerprinting checks reliably, lack the plugins of a real browser, and present bot-like attributes that make them easily detectable.

A web scraping solution like ZenRows is the best alternative to headless browsers. ZenRows is a complete toolkit with all the essential benefits of a headless browser, including JavaScript rendering and user interactions. At the same time, it mitigates flaws like getting blocked and scaling issues.

Let's see how ZenRows works by scraping the Cloudflare Challenge page that blocked us previously.

Sign up to open the ZenRows Request Builder. Paste the target URL in the link box and activate Premium Proxies and JS Rendering. Select Python as your programming language and choose the API connection mode.

Copy and paste the generated code into your Python file.

[Image: building a scraper with the ZenRows Request Builder]

The generated ZenRows Python code uses the Requests library. Ensure you install it with pip:

Terminal
pip3 install requests

The generated code should look like this:

Example
# pip3 install requests
import requests

url = "https://www.scrapingcourse.com/cloudflare-challenge"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)

Run the above code. You'll see that it outputs the protected website's full-page HTML:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Cloudflare Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Cloudflare challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>

Congratulations 🎉! You bypassed Cloudflare's anti-bot with ZenRows in Python.

However, if you prefer sticking to an open-source headless browser due to specific project requirements, use the ZenRows Scraping Browser.

The scraping browser handles advanced fingerprint management and fortifies your headless browser with cutting-edge stealth evasions to scrape without detection.

Conclusion

In this article, you've learned what a headless browser is and surveyed the most popular options. We also walked through step-by-step instructions for scraping data from a web page using headless Chrome with Selenium in Python. Python headless browsers offer advantages such as JavaScript rendering, user interaction, and a reduced memory footprint.

Despite the benefits, headless browsers can't bypass anti-bot mechanisms independently. We recommend using ZenRows to fix all these limitations while retaining all the benefits of headless browsers.

Try ZenRows for free now without a credit card!
