Selenium vs BeautifulSoup in 2023: Which Is Better?

January 20, 2023 · 7 min read

Choosing between Selenium and BeautifulSoup for web scraping is not rocket science. While both are excellent libraries, there are some key differences to consider when making this decision, like programming language compatibility, browser support and performance.

The table below highlights the main differences between Selenium and BeautifulSoup:

| Criterion | BeautifulSoup | Selenium |
| --- | --- | --- |
| Ease of use | User-friendly | Complex to set up and use |
| Programming languages | Only Python | Python, C#, JavaScript, PHP, Java and Perl |
| Browser support | Doesn't require a browser instance | Chrome, Brave, Edge, IE, Safari, Firefox and Opera |
| Performance | Faster, since it only needs the page source | Slower, since it drives a browser through a web driver instance |
| Functionality | Mainly parsing and extracting information from HTML and XML documents | Automating browser actions, like clicking buttons, filling out forms and navigating between pages |
| Operating system support | Windows, Linux and macOS | Windows, Linux and macOS |
| Architecture | Simple HTML and XML parser to navigate through structured data | W3C WebDriver protocol (formerly JSON Wire) to manage web drivers |
| Prerequisites | BeautifulSoup package and a module to send requests (such as httpx) | Selenium bindings and browser drivers |
| Dynamic content | Only works with static web pages | Can scrape dynamically generated content |

Now that you have an idea of the differences between Selenium and BeautifulSoup, let's discuss each library in detail, including a scraping example to show how effective they are.

BeautifulSoup

BeautifulSoup is a Python library for parsing HTML and XML documents, giving us plenty of options to navigate through a structured data tree. It allows you to extract information from a web page's HTML or XML code through a simple, easy-to-use API.

The library can parse and navigate through the page, making it easy to find and extract the content you need. BeautifulSoup can extract data such as text, links, images and other elements from a web page.
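
For instance, here's a minimal sketch of that kind of extraction, using a made-up HTML snippet for illustration:

from bs4 import BeautifulSoup

# a hypothetical HTML snippet for illustration
html = """
<html><body>
  <h1>Hello, scraper!</h1>
  <a href="/about">About us</a>
  <img src="/logo.png" alt="Logo">
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text)     # text content: "Hello, scraper!"
print(soup.a["href"])   # link target: "/about"
print(soup.img["src"])  # image source: "/logo.png"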

What Are the Advantages of BeautifulSoup?

The main advantages of BeautifulSoup over Selenium are:
  • It's faster and requires less time to run.
  • It's beginner-friendly and easier to set up.
  • It works independently of a browser.
  • It can parse both HTML and XML documents.
  • It's easier to debug.

What Are the Disadvantages of BeautifulSoup?

The disadvantages of BeautifulSoup are:
  • It can't interact with web pages like a human user.
  • It can only parse documents, so you'll need another module, like requests or httpx, to download the pages.
  • It supports only Python.
  • You'll need a different tool to scrape JavaScript-rendered web pages since BeautifulSoup only lets you navigate through static HTML or XML.

When to Use BeautifulSoup

BeautifulSoup is best used for web scraping tasks that involve parsing and extracting information from static HTML pages and XML documents. For example, if you need to scrape data from a website that has a simple structure, such as a blog or an online store, BeautifulSoup can easily extract the information you need by parsing the HTML code.
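
As a quick illustration, here's a minimal sketch that pulls names and prices out of a hypothetical store listing; the markup and class names are made up for the example:

from bs4 import BeautifulSoup

# hypothetical product listing markup for illustration
html = """
<ul class="products">
  <li class="product"><span class="name">Mug</span> <span class="price">$8</span></li>
  <li class="product"><span class="name">Shirt</span> <span class="price">$20</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# loop over every product entry and extract its fields
for product in soup.find_all("li", {"class": "product"}):
    name = product.find("span", {"class": "name"}).text
    price = product.find("span", {"class": "price"}).text
    print(name, price)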

If you are looking to scrape dynamic content, Selenium is a better alternative.


Web Scraping Sample with BeautifulSoup

Let's go through a quick scraping tutorial to get more insight into the performance comparison between BeautifulSoup and Selenium.

Since BeautifulSoup just provides a way to navigate through the data, we'll use another module to download the page. Let's use requests and scrape a paragraph from a Wikipedia article:

Random Wikipedia page

To get started, inspect the page elements to find the introduction element. You'll find it in the second p tag under the div with the mw-parser-output class:

Inspect Wikipedia page

What's left is to send a GET request to the website and define a BeautifulSoup object to get the elements. Start by importing the necessary tools:

# load the required packages
from bs4 import BeautifulSoup
# we need a module to connect to websites; you can also use the built-in urllib module
import requests

url = "https://en.wikipedia.org/wiki/CSS_Baltic"
# get the website data
response = requests.get(url)

After that, define the object and parse the HTML. Then, extract the data using the find and find_all methods.
  • find returns the first occurrence of the element.
  • find_all returns all found elements.

# parse response text using html.parser
soup = BeautifulSoup(response.text, "html.parser")

# get the main div element
main_div = soup.find("div", {"class": "mw-body-content mw-content-ltr"})
# extract the content div
content_div = main_div.find("div", {"class": "mw-parser-output"})
# the second p tag holds the introduction paragraph
second_p = content_div.find_all("p")[1]

# print out the extracted data
print(second_p.text)

Take a look at the full code for BeautifulSoup:

# load the required packages
from bs4 import BeautifulSoup

# we need a module to connect to websites; you can also use the built-in urllib module
import requests

url = "https://en.wikipedia.org/wiki/CSS_Baltic"
# get the website data
response = requests.get(url)
# parse response text using html.parser
soup = BeautifulSoup(response.text, "html.parser")

# get the main div element
main_div = soup.find("div", {"class": "mw-body-content mw-content-ltr"})
# extract the content div
content_div = main_div.find("div", {"class": "mw-parser-output"})
# the second p tag holds the introduction paragraph
second_p = content_div.find_all("p")[1]

# print out the extracted data
print(second_p.text)

After running the script, here's what your output should look like:

CSS[a] Baltic was an ironclad warship that served in the Confederate States Navy during the American Civil War. A towboat before the war, she was purchased by the state of Alabama in December 1861 for conversion into an ironclad. After being transferred to the Confederate Navy in May 1862 as an ironclad, she served on Mobile Bay off the Gulf of Mexico. Baltic's condition in Confederate service was such that naval historian William N. Still Jr. has described her as "a nondescript vessel in many ways".[3] Over the next two years, parts of the ship's wooden structure were affected by wood rot. Her armor was removed to be put onto the ironclad CSS Nashville in 1864. By that August, Baltic had been decommissioned. Near the end of the war, she was taken up the Tombigbee River, where she was captured by Union forces on May 10, 1865. An inspection of Baltic the next month found that her upper hull and deck were rotten and that her boilers were unsafe. She was sold on December 31, and was likely broken up in 1866.

That's all!

Although BeautifulSoup can only scrape static web pages, it's also possible to extract dynamic data by combining it with a different library. Learn how to do that by using ZenRows API with Python Requests and BeautifulSoup.
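
The general pattern looks like this. Note that the endpoint and parameter names below are placeholders for whatever rendering service you pick, so check the provider's docs for the real API:

import requests
from bs4 import BeautifulSoup

# illustrative only: a scraping API that returns JavaScript-rendered HTML;
# the endpoint and parameters are placeholders, not a real specification
API_ENDPOINT = "https://api.example-scraper.com/v1/"
params = {
    "apikey": "YOUR_API_KEY",          # placeholder credential
    "url": "https://example.com/app",  # the dynamic page to render
    "js_render": "true",               # ask the service to execute JavaScript
}

response = requests.get(API_ENDPOINT, params=params)
# the service returns rendered HTML, which BeautifulSoup parses as usual
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.text)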

Selenium

Selenium is an open-source browser automation tool often used for web scraping. It's been around for over a decade, and its main components are Selenium IDE (used to record actions before automating them), Selenium WebDriver (to execute commands in the browser) and Selenium Grid (for parallel execution).

Selenium can also handle dynamic web pages, which are difficult to scrape using BeautifulSoup.

What Are the Advantages of Selenium?

The advantages of Selenium are:
  • It's straightforward to use once set up.
  • It supports multiple programming languages, like JavaScript, Ruby, Python and C#.
  • It can automate Firefox, Edge, Safari and even a custom QtWebKit browser.
  • It can interact with the page's JavaScript code, execute XHR requests and wait for elements to load before scraping the data. In other words, Selenium scrapes dynamic web pages handily, whereas content that changes after page load is much harder to detect and interact with using BeautifulSoup; see the sketch after this list.
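
Here's a minimal sketch of those two capabilities, running JavaScript in the page and explicitly waiting for late-loading content; the URL and selector are placeholders:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

# run JavaScript directly in the page, e.g. scroll to trigger lazy loading
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# wait up to 10 seconds for an element rendered after page load
# (the CSS selector is hypothetical)
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".lazy-content"))
)
print(element.text)
driver.quit()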

What Are the Disadvantages of Selenium?

The disadvantages of Selenium are:
  • It's complex to set up.
  • It uses more resources compared to BeautifulSoup.
  • It can become slow when you start scaling up your application.

When to Use Selenium

A key difference between Selenium vs BeautifulSoup is the type of data they can scrape. Selenium is ideal for scraping websites that require interaction with the page, like filling out forms, clicking buttons or navigating between pages. For example, if you need to scrape data from a website that requires login, Selenium can automate the login process and navigate through the pages to scrape the data.
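
As a rough sketch of that login flow (the URL, field names and selectors below are placeholders you'd adapt to the real form):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# placeholder login page
driver.get("https://example.com/login")

# fill in the credentials and submit the form
driver.find_element(By.NAME, "username").send_keys("my_user")
driver.find_element(By.NAME, "password").send_keys("my_password")
driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()

# once logged in, navigate and scrape as usual
driver.get("https://example.com/dashboard")
print(driver.page_source[:200])
driver.quit()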

Also, Selenium is an excellent library for scraping JS-rendered web pages.

Web Scraping Sample with Selenium

Let's run through a tutorial on web scraping with Selenium using the same web page. We'll also navigate to the article with Selenium to demonstrate dynamic content scraping.

Start by importing the required packages:

from selenium import webdriver 
from selenium.webdriver.chrome.service import Service 
from selenium.webdriver.common.by import By 
# web driver manager: https://github.com/SergeyPirogov/webdriver_manager 
# will help us automatically download the web driver binaries 
# then we can use `Service` to manage the web driver's state. 
from webdriver_manager.chrome import ChromeDriverManager 
 
# we will need these to wait for dynamic content to load 
from selenium.webdriver.support.wait import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC

Then, configure the Selenium instance using the WebDriver's options object:

# we can configure selenium using the webdriver's options object
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome in headless mode
# this returns the path the web driver was downloaded to
chrome_path = ChromeDriverManager().install()
# define the chrome service and pass it to the driver instance
chrome_service = Service(chrome_path)
driver = webdriver.Chrome(service=chrome_service, options=options)

In this case, we'll go to Wikipedia's homepage and use the search bar. That will show Selenium's ability to interact with the page and scrape dynamic content. When we inspect the elements, we can see the search bar is an input element with the vector-search-box-input class.

Search box on Wikipedia page

You'll need to click on the input and then type the query. For that, Selenium provides the send_keys method, which fills in the form for us.

# navigate to Wikipedia's homepage
driver.get("https://en.wikipedia.org/wiki/Main_Page")

# find the search box
search_box = driver.find_element(By.CSS_SELECTOR, "input.vector-search-box-input")
# click on the search box
search_box.click()
# search for the article
search_box.send_keys("CSS Baltic")

The search suggestions load dynamically as you type. You'll find them stored as a tags with the mw-searchSuggest-link class:

Title selector on Wikipedia page

Wait for the suggestions to appear with WebDriverWait, then click on the first result and extract the data:

try:
	# wait up to 10 seconds for the suggestions to load
	search_suggestions = WebDriverWait(driver, 10).until(
		EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a.mw-searchSuggest-link"))
	)
	# click on the first suggestion
	search_suggestions[0].click()

	# extract the data using the same selectors as in BeautifulSoup
	main_div = driver.find_element(By.CSS_SELECTOR, "div.mw-body-content")
	content_div = main_div.find_element(By.CSS_SELECTOR, "div.mw-parser-output")
	paragraphs = content_div.find_elements(By.TAG_NAME, "p")

	# we need the second paragraph
	intro = paragraphs[1].text

	print(intro)
except Exception as error:
	print(error)

Here's what your code should look like after combining all the chunks:

from selenium import webdriver 
from selenium.webdriver.chrome.service import Service 
from selenium.webdriver.common.by import By 
# web driver manager: https://github.com/SergeyPirogov/webdriver_manager 
# will help us automatically download the web driver binaries 
# then we can use `Service` to manage the web driver's state. 
from webdriver_manager.chrome import ChromeDriverManager 
 
# we will need these to wait for dynamic content to load 
from selenium.webdriver.support.wait import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
 
# we can configure selenium using the webdriver's options object
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome in headless mode
# this returns the path the web driver was downloaded to
chrome_path = ChromeDriverManager().install() 
# define the chrome service and pass it to the driver instance 
chrome_service = Service(chrome_path) 
driver = webdriver.Chrome(service=chrome_service, options=options) 
 
url = "https://en.wikipedia.org/wiki/Main_Page" 
 
driver.get(url) 
 
# find the search box 
search_box = driver.find_element(By.CSS_SELECTOR, "input.vector-search-box-input") 
# click on the search box
search_box.click() 
# search for the article 
search_box.send_keys("CSS Baltic") 
 
try:
	# wait up to 10 seconds for the suggestions to load
	search_suggestions = WebDriverWait(driver, 10).until(
		EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a.mw-searchSuggest-link"))
	)
	# click on the first suggestion
	search_suggestions[0].click()

	# extract the data using the same selectors as in BeautifulSoup
	main_div = driver.find_element(By.CSS_SELECTOR, "div.mw-body-content") 
	content_div = main_div.find_element(By.CSS_SELECTOR, "div.mw-parser-output") 
	paragraphs = content_div.find_elements(By.TAG_NAME, "p") 
 
	# we need the second paragraph 
	intro = paragraphs[1].text 
 
	print(intro) 
except Exception as error: 
	print(error) 
 
driver.quit()

And voilà! Take a look at the output:

CSS[a] Baltic was an ironclad warship that served in the Confederate States Navy during the American Civil War. A towboat before the war, she was purchased by the state of Alabama in December 1861 for conversion into an ironclad. After being transferred to the Confederate Navy in May 1862 as an ironclad, she served on Mobile Bay off the Gulf of Mexico. Baltic's condition in Confederate service was such that naval historian William N. Still Jr. has described her as "a nondescript vessel in many ways".[3] Over the next two years, parts of the ship's wooden structure were affected by wood rot. Her armor was removed to be put onto the ironclad CSS Nashville in 1864. By that August, Baltic had been decommissioned. Near the end of the war, she was taken up the Tombigbee River, where she was captured by Union forces on May 10, 1865. An inspection of Baltic the next month found that her upper hull and deck were rotten and that her boilers were unsafe. She was sold on December 31, and was likely broken up in 1866.

Key Differences: Selenium vs BeautifulSoup

Let's analyze critical considerations to decide between the two libraries:
  • Functionality.
  • Speed.
  • Ease of use.

Functionality

Selenium is a web browser automation tool that can interact with web pages like a human user, whereas BeautifulSoup is a library for parsing HTML and XML documents. This means Selenium has more functionality since it can automate browser actions such as clicking buttons, filling out forms and navigating between pages. BeautifulSoup is more limited and is mainly used for parsing and extracting data.

Speed

Which library is faster: BeautifulSoup or Selenium? You're not the first to ask, so let's highlight it: BeautifulSoup is faster than Selenium since it doesn't require an actual browser instance.

To compare Selenium and BeautifulSoup speed, we used ScrapeThisSite, ran the scripts presented above 1,000 times and plotted the results as a bar chart.

We used the os module in Python to run the scripts and the time module to measure the elapsed time. We recorded t0 = time.time(), t1 = time.time() and t2 = time.time() between the os commands and stored the differences. The results were saved in a pandas DataFrame.

import os
import time

import matplotlib.pyplot as plt
import pandas as pd

d = {
	"selenium": [],
	"bs4": []
}

N = 1000

for i in range(N):
	print("-"*20, f"Experiment {i+1}", "-"*20)
	t0 = time.time()
	os.system("python3 'beautifulsoup_script.py'")
	t1 = time.time()
	os.system("python3 'selenium_script.py'")
	t2 = time.time()
	d["selenium"].append(t2 - t1)
	d["bs4"].append(t1 - t0)

df = pd.DataFrame(d)
df.to_csv("data.csv", index=False)

# plot the mean runtimes as a bar chart
df.mean().plot(kind="bar")
plt.ylabel("mean runtime (seconds)")
plt.savefig("selenium_vs_bs4.png")

Here's the test result of Selenium vs BeautifulSoup after running the code:

Selenium vs bs4

The result shows that BeautifulSoup is about 70% faster than Selenium. Therefore, the optimal choice regarding this specific criterion is BeautifulSoup.
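
To make that figure concrete, "70% faster" here reads as a roughly 70% lower mean runtime. Assuming the data.csv produced by the benchmark script above, you can compute it like this:

import pandas as pd

df = pd.read_csv("data.csv")
bs4_mean = df["bs4"].mean()
selenium_mean = df["selenium"].mean()

# relative runtime reduction of BeautifulSoup vs Selenium
reduction = (1 - bs4_mean / selenium_mean) * 100
print(f"BeautifulSoup mean: {bs4_mean:.2f}s, Selenium mean: {selenium_mean:.2f}s")
print(f"BeautifulSoup is ~{reduction:.0f}% faster (lower mean runtime)")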

Ease of Use

BeautifulSoup is more user-friendly than Selenium. It has a simple API and is easy to understand for beginners. Selenium, on the other hand, can be more complex to set up and use as it requires knowledge of programming concepts such as web drivers and browser automation.

Which Is Better: Selenium or BeautifulSoup?

There isn't a direct answer to which one is better between Selenium and BeautifulSoup since it depends on factors like your web scraping needs, long-term library support and cross-browser support. BeautifulSoup is fast, but compared to Selenium, it supports fewer programming languages and can only scrape static web pages.

BeautifulSoup and Selenium are undoubtedly great libraries for scraping, but the headaches start during large-scale web scraping or when scraping popular sites since protection measures might detect your bots. The best way to avoid this is by using a web scraping API like ZenRows.

ZenRows is a web scraping tool that handles anti-bot bypassing for you in a single API call, and it's equipped with essential features like rotating proxies, headless browsers, automatic retries and more. You can try it for free now.


