Headless Browser in Python and Selenium

November 23, 2022 · 10 min read

A Python headless browser is a tool that scrapes dynamic content smoothly without the need for a visible browser window, reducing scraping costs and helping you scale your crawling process.

Web scraping with a browser-based solution helps you deal with sites that require JavaScript. At the same time, web scraping can be a slow process, especially when dealing with complex sites or massive lists of data.

In this guide, we'll deep-dive into Python headless browsers: the available types and their pros and cons.

Let's dive right in!

What is a headless browser in Python?

A headless browser is a web browser without a graphical user interface (GUI) but with all the capabilities of a real one. It carries standard functionality such as handling JavaScript, clicking links, and so on. Python is one of the programming languages that lets you enjoy those capabilities to the fullest.

Using a headless browser from Python lets you automate the browser in a language you may already know, saving time in both the development and scraping phases. On top of that, one of the benefits of a Python headless browser is that it uses less memory.

Benefits of the Python headless browser

A headless browser process, whether driven from Python or elsewhere, uses less memory than a full browser because the computer doesn't need to draw graphical elements for the browser or the website.

Python headless browsers are fast and can speed up scraping. For example, you can program your scraper to grab data as soon as it appears in the headless browser, so you don't have to wait for the page to fully load.
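
For instance, with Selenium (which we'll set up below), an explicit wait grabs an element the moment it exists instead of waiting for the entire page. Here's a minimal sketch, assuming a driver instance already exists; the CSS selector is a hypothetical example:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block only until the element we care about exists in the DOM,
# not until every resource on the page has finished loading.
# "driver" is assumed to be an existing Selenium WebDriver instance,
# and the selector below is a hypothetical example.
title = WebDriverWait(driver, timeout=10).until(
	EC.presence_of_element_located((By.CSS_SELECTOR, "h2.product-title"))
)
print(title.text)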

You can also multitask, since the headless browser runs in the background while you keep using the computer.

Disadvantages of Python headless browser

The main disadvantages of Python headless browsers are that they can't perform actions requiring visual interaction and that they're hard to debug.
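
A common workaround for the debugging pain is to capture screenshots from the headless session. Below is a quick sketch; it assumes a driver instance like the one we'll create later in this guide, and the file name is hypothetical:

# Headless Chrome still renders pages internally, so you can dump
# what the browser "sees" to an image and inspect it after the run
# (the file name is a hypothetical example)
driver.save_screenshot("debug_view.png")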

Python Selenium Headless

The most popular Python headless browser setup is Selenium, and its main use is automating web applications, including for web scraping. Selenium drives a real browser and shares its capabilities. Consequently, if the browser can operate headless, so can Selenium in Python.

Which headless browsers are included in Selenium?

Chrome, Edge, and Firefox are the three browsers you can run headless with Selenium in Python.

1. Headless Chrome Selenium Python

Starting with version 59, Chrome ships with headless capability. You can call the Chrome binary from the command line to run headless Chrome:

chrome --headless --disable-gpu --remote-debugging-port=9222 https://zenrows.com

2. Edge

This browser was initially built on EdgeHTML, a proprietary browser engine from Microsoft. In late 2018, however, Microsoft started rebuilding it as a Chromium browser with the Blink and V8 engines, making it one of the best headless browsers.
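
Since modern Edge is Chromium-based, it accepts the same headless flags as Chrome. A sketch of the launch command follows; the binary is typically named msedge, but its exact name and path vary by platform and installation:

msedge --headless --disable-gpu https://zenrows.com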

3. Firefox

Like the other browsers, Firefox is also widely used. To open headless Firefox, type the command below in your command line or terminal:

firefox -headless https://zenrows.com

How do you go headless in Selenium Python?

You'll now learn how to do browser automation headlessly with Selenium in Python. As an example, let's scrape some Pokemon details (names, links, and prices) from ScrapeMe.

Prerequisites

Before scraping a web page with a Python headless browser, let's install the tools we'll be using for the tutorial:

1. Installing Selenium

After installing Python, let's go ahead and install Selenium for Python using the command below:

pip install selenium

This'll install the Selenium package on your computer. The good news is that pip resolves the package's dependencies automatically, so you don't have to worry about anything else.

2. Installing Webdriver Manager

Selenium needs a WebDriver binary that matches the version of the installed browser. You could download and set up ChromeDriver manually to work with your installed Chrome, but that has drawbacks, like forgetting the specific path to the WebDriver binary or ending up with a version mismatch between the WebDriver and the browser.
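
To make the drawback concrete, here's what the manual setup looks like as a sketch; the path is hypothetical, and the driver version must match your installed Chrome:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService

# Manual setup: you download ChromeDriver yourself, must remember where
# the binary lives (the path below is hypothetical), and have to keep
# its version in sync with your installed Chrome
service = ChromeService(executable_path="/path/to/chromedriver")
driver = webdriver.Chrome(service=service)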

To avoid these issues, we'll use Webdriver Manager for Python. The library manages the WebDriver for your environment: it downloads the correct binary and provides the link to it, so you don't need to write the path into the script explicitly.

You can install webdriver-manager with pip as well. Simply open a command line or terminal and type the command below.

pip install webdriver-manager

Now we are ready to scrape some Pokemon data using a Python headless browser. Let's dive right in!

Step 1: Open the page

Let's write code that opens the page. This step confirms that our environment is set up properly and ready for scraping.

Running the code below will open Chrome automatically and go to the target page.

from selenium import webdriver 
from selenium.webdriver.chrome.service import Service as ChromeService 
from webdriver_manager.chrome import ChromeDriverManager 
 
url = "https://scrapeme.live/shop/" 
 
# webdriver_manager downloads the ChromeDriver binary that matches the 
# installed Chrome and hands its path to Selenium 
with webdriver.Chrome(service=ChromeService(ChromeDriverManager().install())) as driver: 
	driver.get(url)  # navigate to the target page

Congratulations! You've successfully opened the page.

Step 2: Switch to Python Selenium headless mode

Once the page opens, the rest of the process gets easier. Of course, we don't want the browser to appear on the monitor; we want Chrome to run headlessly. Switching to headless Chrome in Python is pretty straightforward.

We only need to add two lines of code and use the options parameter when calling the Webdriver.

# ... 
options = webdriver.ChromeOptions() 
options.add_argument("--headless")  # options.headless = True is deprecated in recent Selenium versions 
with webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) as driver: 
	# ...

The complete code will look like this:

from selenium import webdriver 
from selenium.webdriver.chrome.service import Service as ChromeService 
from webdriver_manager.chrome import ChromeDriverManager 
 
url = "https://scrapeme.live/shop/" 
options = webdriver.ChromeOptions() #newly added 
options.add_argument("--headless") #newly added 
with webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) as driver: #modified 
	driver.get(url)

If you run the code, you'll see nothing except maybe some brief log output in the command line/terminal.

No Chrome window appears to invade our screen.

Awesome!

But is the code successfully reaching the page? How can you verify that the scraper opened the right page? Were there any errors? These are some of the questions that come up when you run Selenium headless in Python.

Your code should log as much information as needed to answer the questions above.

The logs can vary depending on your needs: a log file in text format, a dedicated log database, or simple output on the terminal.
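
As one option, here's a minimal sketch using Python's built-in logging module to write to a text file; the file name is hypothetical:

import logging

# Route log records to a text file instead of the terminal
# (the file name here is a hypothetical example)
logging.basicConfig(filename="scraper.log", level=logging.INFO)
logging.info("Scraper started")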

For simplicity, we'll create a log by printing the scraping results to the terminal. Let's add the two lines of code below to print the page URL and title. That'll confirm that headless Selenium runs as intended.

# ... 
with webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) as driver: 
	# ... 
	print("Page URL:", driver.current_url) 
	print("Page Title:", driver.title)

Now we know our crawler isn't getting unintended results, which is amazing. Since that's done, let's go scrape some Pokemon data.

Step 3: Scrape the data

Before going deeper, we need to find the HTML elements that hold the data, which we can do by inspecting the web page in a regular browser. Simply right-click on any image and select Inspect, and Chrome DevTools will open on the Elements tab.

From the elements shown, let's go ahead and find the ones that hold the names, prices, and other information. The browser highlights the area covered by the selected element, making it easier for us to identify the right one.

<a href="https://scrapeme.live/shop/Bulbasaur/" class="woocommerce-LoopProduct-link woocommerce-loop-product__link">

Quite easy, right?

To get the name element, simply hover over the name in DevTools until the browser highlights it.

<h2 class="woocommerce-loop-product__title">Bulbasaur</h2>

The h2 element sits inside the parent anchor element. Let's put that element into the code so headless Chrome in Selenium can identify which elements to extract from the page. Selenium provides several ways to select and extract elements: by element ID, tag name, class, CSS selectors, and XPath. Let's use the XPath method.
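
For reference, the same elements could also be selected with a CSS selector, a small equivalent sketch using the class name from the snippet above:

# Same targets as the XPath query used below, expressed as a CSS selector
names = driver.find_elements(By.CSS_SELECTOR, "h2.woocommerce-loop-product__title")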

Tweaking the initial code a little bit will make the scraper show more information: the URL, the title, and the Pokemon names. Here's what our new code looks like:

#... 
from selenium.webdriver.common.by import By 
 
#... 
 
with webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) as driver: 
	#... 
 
	pokemons_data = []  # will hold one dictionary per Pokemon 
 
	# Each product card on the page is an anchor element; grab them all 
	parent_elements = driver.find_elements(By.XPATH, "//a[@class='woocommerce-LoopProduct-link woocommerce-loop-product__link']") 
	for parent_element in parent_elements: 
		# The h2 inside each card holds the Pokemon's name 
		pokemon_name = parent_element.find_element(By.XPATH, ".//h2") 
		print(pokemon_name.text)

Next up, let's pull the Pokemon links from the href attribute of the parent element. This HTML tag will help the scraper spot the links:

<a href="https://scrapeme.live/shop/Bulbasaur/" class="woocommerce-LoopProduct-link woocommerce-loop-product__link">

Since we got that out of the way, let's add some code to help our scraper extract and store the names and links. We can use a Python dictionary to manage the data so it doesn't get mixed up; each dictionary will contain a single Pokemon's data.

#... 
parent_elements = driver.find_elements(By.XPATH, "//a[@class='woocommerce-LoopProduct-link woocommerce-loop-product__link']") 
for parent_element in parent_elements: 
	pokemon_name = parent_element.find_element(By.XPATH, ".//h2") 
	pokemon_link = parent_element.get_attribute("href") 
 
	temporary_pokemons_data = { 
		"name": pokemon_name.text, 
		"link": pokemon_link 
	} 
 
	pokemons_data.append(temporary_pokemons_data) 
	print(temporary_pokemons_data)

Superb!

Now let's go for the last piece of Pokemon data we need, the price.

Let's hop over to the real browser and do a quick inspection of the element that holds the price. We can then reference the span tag in the code so the scraper knows which element contains the prices.

#... 
for parent_element in parent_elements: 
	pokemon_name = parent_element.find_element(By.XPATH, ".//h2") 
	pokemon_link = parent_element.get_attribute("href") 
	pokemon_price = parent_element.find_element(By.XPATH, ".//span")  # the first span in each card holds the price 
 
	temporary_pokemons_data = { 
		"name": pokemon_name.text, 
		"link": pokemon_link, 
		"price": pokemon_price.text 
	} 
 
	pokemons_data.append(temporary_pokemons_data) 
	print(temporary_pokemons_data)

And that's all!

The only thing left is to run the script, and voila, we've got all the Pokemon data.

In case you got lost somewhere, here's what the complete code should look like:

from selenium import webdriver 
from selenium.webdriver.common.by import By 
 
from selenium.webdriver.chrome.service import Service as ChromeService 
from webdriver_manager.chrome import ChromeDriverManager 
 
url = "https://scrapeme.live/shop/" 
 
options = webdriver.ChromeOptions() 
options.add_argument("--headless")  # options.headless = True is deprecated in recent Selenium versions 
with webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) as driver: 
	driver.get(url) 
 
	print("Page URL:", driver.current_url) 
	print("Page Title:", driver.title) 
 
	pokemons_data = [] 
 
	parent_elements = driver.find_elements(By.XPATH, "//a[@class='woocommerce-LoopProduct-link woocommerce-loop-product__link']") 
	for parent_element in parent_elements: 
		pokemon_name = parent_element.find_element(By.XPATH, ".//h2") 
		pokemon_link = parent_element.get_attribute("href") 
		pokemon_price = parent_element.find_element(By.XPATH, ".//span") 
 
		temporary_pokemons_data = { 
			"name": pokemon_name.text, 
			"link": pokemon_link, 
			"price": pokemon_price.text 
		} 
 
		pokemons_data.append(temporary_pokemons_data) 
		print(temporary_pokemons_data)
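
Run the script (for example, with python scraper.py, assuming that's your file name), and each Pokemon prints as its own dictionary, roughly like this (the price shown is an illustrative value):

{'name': 'Bulbasaur', 'link': 'https://scrapeme.live/shop/Bulbasaur/', 'price': '£63.00'}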

What is the best headless browser?

Headless Selenium for Python isn't the only headless browser option out there. There are alternatives: some serve only one programming language, while others provide bindings for many.

Excluding Selenium, here are some of the best headless browsers to use for your scraping project.

1. ZenRows

ZenRows is an all-in-one web scraping tool that uses a single API call to handle all anti-bot bypass, from rotating proxies and headless browsers to CAPTCHAs. The residential proxies provided help you crawl a web page and browse like a real user without getting blocked.

ZenRows works great with almost all popular programming languages, and you can take advantage of the free trial, no credit card required.

2. Puppeteer

Puppeteer is a Node.js library providing an API to operate Chrome/Chromium in headless mode. Google developed it in 2017, and it keeps gathering momentum. It has complete access to Chrome through the DevTools Protocol, making Puppeteer outperform other tools when dealing with Chrome.

Puppeteer is also easier to set up and faster than Selenium. The downside is that it works only with JavaScript, so if you aren't familiar with the language, you'll find Puppeteer challenging to use.

3. HtmlUnit

This headless browser is a GUI-less browser for Java programs, and it can simulate specific browsers (e.g., Chrome, Firefox, or Internet Explorer) when properly configured. Its JavaScript support is fairly good and keeps improving. Unfortunately, you can use HtmlUnit only with Java.

4. Zombie.JS

Zombie.JS is a lightweight framework for testing client-side JavaScript code, distributed as a Node.js library. Because its main purpose is testing, it runs flawlessly with testing frameworks. As with Puppeteer, you can use Zombie.JS only from JavaScript.

5. Playwright

Playwright is essentially a Node.js library made for browser automation, but it provides APIs for other languages, like Python, .NET, and Java. It's relatively fast compared to Selenium in Python.
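
For a taste of the API, here's a minimal headless sketch with Playwright for Python; it assumes you've run pip install playwright followed by playwright install:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
	# Playwright launches Chromium headless by default; the flag is
	# shown here for clarity
	browser = p.chromium.launch(headless=True)
	page = browser.new_page()
	page.goto("https://scrapeme.live/shop/")
	print(page.title())
	browser.close()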

Conclusion

A Python headless browser offers clear benefits for web scraping: it minimizes the memory footprint, deals with JavaScript perfectly, and can run in a no-GUI environment. On top of that, the implementation takes only a few lines of code.

In this guide, we went step by step through scraping data from a web page using headless Chrome with Selenium in Python.

Some of the disadvantages of Python headless browsers discussed in this guide are:
  1. It can't evaluate graphical elements, so you won't be able to perform any actions that need visual interaction in headless mode.
  2. It's hard to debug.

ZenRows is a tool that provides various services to help you do web scraping in any scenario, including using a Python headless browser with just a simple API call. Take advantage of the free trial and scrape data without stress and headaches.
