
Puppeteer vs. Selenium: Which Is Better in 2024

October 14, 2023 · 6 min read

Are you stuck deciding between Puppeteer and Selenium for web scraping? We get it. Both are fantastic browser automation frameworks, so it's essential to weigh your scraping needs and available resources when deciding.

See the key differences between Puppeteer and Selenium in the table below, and then let's dig into the details.

Criteria | Puppeteer | Selenium
Compatible Languages | Only JavaScript is officially supported, but there are unofficial PHP and Python ports | Java, Python, C#, Ruby, PHP, JavaScript and Kotlin
Browser Support | Chromium, plus experimental Firefox support | Chrome, Safari, Firefox, Opera, Edge and IE
Performance | About 60% faster than Selenium (see the benchmark below) | Fast, but slower than Puppeteer in our benchmark
Operating System Support | Windows, Linux and macOS | Windows, Linux, macOS and Solaris
Architecture | Event-driven architecture with headless browser instances | WebDriver protocol (formerly JSONWire) to control the browser instance
Prerequisites | The JavaScript package is enough | Selenium bindings (for the chosen programming language) and browser web drivers
Community | Small compared to Selenium's | Extensive documentation and a huge community

Let's discuss these libraries in detail and run a scraping example with each to show how effective they are at extracting data from a web page.

Puppeteer

Puppeteer is a Node.js automation library that controls headless Chrome or Chromium over the DevTools Protocol. It provides tools for automating tasks in Chrome or Chromium, like taking screenshots, generating PDFs and navigating pages.

Puppeteer can also be used to test web pages by simulating user interactions like clicking buttons, filling out forms and verifying the results are displayed as expected.
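
To illustrate those capabilities, here's a minimal sketch that loads a page, captures a full-page screenshot and saves the page as a PDF. The URL and output file names are arbitrary placeholders; install the library with npm install puppeteer, then run:

program.js
const puppeteer = require('puppeteer');

async function capture() {
	// launch a headless browser instance and open a new page
	const browser = await puppeteer.launch({ headless: 'new' });
	const page = await browser.newPage();
	await page.goto('https://example.com', { waitUntil: 'networkidle2' });

	// capture a full-page screenshot
	await page.screenshot({ path: 'example.png', fullPage: true });

	// render the same page as an A4 PDF (PDF generation requires headless mode)
	await page.pdf({ path: 'example.pdf', format: 'A4' });

	await browser.close();
}

capture();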

What Are the Advantages of Puppeteer?

The advantages of Puppeteer are:

  • It's easy to use.
  • No set-up is required since it comes bundled with Chromium.
  • It runs headless by default, but headless mode can also be disabled.
  • Puppeteer has event-driven architecture, removing the need for manual sleep calls in your code.
  • It can take screenshots, generate PDFs and automate every action available in the browser.
  • It provides management capabilities, such as recording runtime and load performance, that help you optimize and debug your scraper (see the sketch after this list).
  • Puppeteer can crawl SPA (Single Page Applications) and generate pre-rendered content (i.e., server-side rendering).
  • You can create Puppeteer scripts by recording your actions in the browser with the Chrome DevTools Recorder.
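
As an example of the performance-recording capability mentioned above, here's a minimal sketch that captures a DevTools trace while the target page loads and prints Puppeteer's built-in runtime metrics (the trace file name is an arbitrary placeholder):

program.js
const puppeteer = require('puppeteer');

async function profile() {
	const browser = await puppeteer.launch({ headless: 'new' });
	const page = await browser.newPage();

	// record a DevTools trace while the page loads
	await page.tracing.start({ path: 'trace.json', screenshots: true });
	await page.goto('https://danube-webshop.herokuapp.com/');
	await page.tracing.stop();

	// print runtime metrics such as JS heap size and document count
	console.log(await page.metrics());

	await browser.close();
}

profile();

The resulting trace.json file can then be opened in the Performance panel of Chrome DevTools to inspect load performance.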

What Are the Disadvantages of Puppeteer?

The disadvantages of Puppeteer are:

  • Compared to Selenium, Puppeteer supports fewer browsers.
  • Puppeteer focuses on JavaScript only, although there are unofficial ports for Python and PHP.

Web Scraping Sample with Puppeteer

Let's go through a quick Puppeteer web scraping tutorial to get more insight into the Puppeteer vs. Selenium performance comparison. We'll extract the book previews from the Crime and Thriller category of the Danube website:

Danube Store: Crime and Thrillers

To get started, import the Puppeteer module and create an asynchronous function to run the Puppeteer code:

program.js
const puppeteer = require('puppeteer'); 
async function main() { 
	// write code here 
}

Once done, launch the browser instance and create a page using the puppeteer.launch() and newPage() methods:

program.js
const browser = await puppeteer.launch({ headless: 'new' }) 
const page = await browser.newPage();

Use the goto method to navigate to the target page, passing waitUntil: 'networkidle2' to wait until network activity dies down (no more than two connections open for at least 500 ms). Then wait for the ul element with the sidebar-list class to load the list items:

program.js
await page.goto('https://danube-webshop.herokuapp.com/', { waitUntil: 'networkidle2' }) 
await page.waitForSelector('ul.sidebar-list');

Click on the first list element to navigate to the Crime & Thrillers category page, and use the waitForNavigation() method to wait for the web page to load. Wrap both calls in Promise.all() so the click and the resulting navigation are handled together:

program.js
await Promise.all([ 
	page.waitForNavigation(), 
	page.click("ul[class='sidebar-list'] > li > a"), 
]);

Wait for the book previews to load using the waitForSelector() method. Then collect them with querySelectorAll() inside evaluateHandle() and store the handle in the books variable. Finally, scrape the title, author, rating and price from each preview and print the results:

program.js
await page.waitForSelector("li[class='preview']"); 
 
const books = await page.evaluateHandle( 
	() => [...document.querySelectorAll("li[class='preview']")] 
) 
 
const processed_data = await page.evaluate(elements => { 
	let data = [] 
	elements.forEach( element => 
		{ 
			let title = element.querySelector("div.preview-title").innerHTML; 
			let author = element.querySelector("div.preview-author").innerHTML; 
			let rating = element.querySelector("div.preview-details > p.preview-rating").innerHTML; 
			let price = element.querySelector("div.preview-details > p.preview-price").innerHTML; 
 
			let result = {title: title, author: author, rating: rating, price: price} 
			data.push(result); 
		}) 
	return data 
}, books) 
 
console.log(processed_data) 
 
await page.close(); 
await browser.close();

Now, let's wrap everything in the main function and call main() to run it:

program.js
// import the puppeteer library 
const puppeteer = require('puppeteer'); 
 
// create the asynchronous main function 
async function main() { 
	// launch a headless browser instance 
	const browser = await puppeteer.launch({ headless: 'new' })
 
	// create a new page object 
	const page = await browser.newPage(); 
 
	// navigate to the target URL, wait until the loading finishes 
	await page.goto('https://danube-webshop.herokuapp.com/', { waitUntil: 'networkidle2' }) 
 
	// wait for left-side bar to load 
	await page.waitForSelector('ul.sidebar-list'); 
 
	// click to the first element and wait for the navigation to finish 
	await Promise.all([ 
		page.waitForNavigation(), 
		page.click("ul[class='sidebar-list'] > li > a"), 
	]); 
 
	// wait for previews to load 
	await page.waitForSelector("li[class='preview']"); 
 
	// extract the book previews 
	const books = await page.evaluateHandle( 
		() => [...document.querySelectorAll("li[class='preview']")] 
	) 
 
	// extract the relevant data using page.evaluate 
	// just pass the elements as the second argument and processing function as the first argument 
	const processed_data = await page.evaluate(elements => { 
		// define an array to store the extracted data 
		let data = [] 
		// use a forEach loop to loop through every preview 
		elements.forEach( element => 
			{ 
				// get the HTML text of the title, author, rating, and price data, respectively.
				let title = element.querySelector("div.preview-title").innerHTML; 
				let author = element.querySelector("div.preview-author").innerHTML; 
				let rating = element.querySelector("div.preview-details > p.preview-rating").innerHTML; 
				let price = element.querySelector("div.preview-details > p.preview-price").innerHTML; 
 
				// build a dictionary and store the data as key:value pairs 
				let result = {title: title, author: author, rating: rating, price: price} 
				// append the data to the `data` array 
				data.push(result); 
			}) 
		// return the result (it will be stored in `processed_data` variable) 
		return data 
	}, books) 
 
	// print out the extracted data 
	console.log(processed_data) 
	// close the page and browser respectively 
	await page.close(); 
	await browser.close(); 
} 
 
// run the main function to scrape the data 
main();

Go ahead and run the code. Your output should look like this:

Output
[ 
	{ 
		title: 'Does the Sun Also Rise?', 
		author: 'Ernst Doubtingway', 
		rating: '★★★★☆',
		price: '$9.95' 
	}, 
	{ 
		title: 'The Insiders', 
		author: 'E. S. Hilton', 
		rating: '★★★★☆',
		price: '$9.95' 
	}, 
	{ 
		title: 'A Citrussy Clock', 
		author: 'Bethany Urges', 
		rating: '★★★★★',
		price: '$9.95' 
	} 
]

Selenium

Selenium is an open-source end-to-end testing and web automation tool that's often used for web scraping. Its main components are Selenium IDE, Selenium WebDriver and Selenium Grid.

Selenium IDE records actions before you automate them, Selenium WebDriver executes commands in the browser, and Selenium Grid runs sessions in parallel.

What Are the Advantages of Selenium?

The advantages of Selenium are:

  • It's easy to use.
  • Selenium supports various programming languages, like Python, Java, JavaScript, Ruby and C# (see the JavaScript sketch after this list).
  • It can automate Firefox, Edge, Safari and even a custom QtWebKit browser.
  • It's possible to scale Selenium to hundreds of instances with different techniques, like setting up the cloud servers with different browser settings.
  • It can operate on Windows, macOS and Linux.
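
As a quick taste of that language flexibility, here's a minimal sketch using Selenium's official JavaScript binding, the selenium-webdriver npm package (the 10-second timeout is an arbitrary choice); it opens the same Danube store we scrape below and clicks the first category link:

program.js
const { Builder, By, until } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

async function run() {
	// start a headless Chrome session
	const options = new chrome.Options().addArguments('--headless=new');
	const driver = await new Builder()
		.forBrowser('chrome')
		.setChromeOptions(options)
		.build();

	try {
		await driver.get('https://danube-webshop.herokuapp.com/');

		// wait explicitly for the sidebar instead of sleeping
		await driver.wait(until.elementLocated(By.css('ul.sidebar-list')), 10000);
		await driver.findElement(By.css("ul[class='sidebar-list'] > li > a")).click();
		console.log(await driver.getTitle());
	} finally {
		await driver.quit();
	}
}

run();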

What Are the Disadvantages of Selenium?

The disadvantages of Selenium are:

  • Selenium's setup is more complex: you need the language bindings plus a web driver for each browser you want to automate.

Web Scraping Sample with Selenium

As we did with Puppeteer, let's run through a tutorial on web scraping with Selenium using the same target site. Start by importing the necessary modules and then configure the Selenium instance:

program.py
# for timing calculations and sleeping the scraper 
import time 
 
# import Selenium's webdriver and the By locator
from selenium import webdriver 
from selenium.webdriver.common.by import By 

options = webdriver.ChromeOptions() 
options.add_argument("--headless") 

Initialize the Chrome webdriver:

program.py
driver = webdriver.Chrome(options=options)

Point to the target website using the driver.get() method:

program.py
url = "https://danube-webshop.herokuapp.com/" 
driver.get(url)

Pause the scraper for a second with time.sleep(1), click the Crime & Thrillers category element, and pause again while the page loads. Then use the find_elements method to extract the book previews:

program.py
time.sleep(1) 
crime_n_thrillers = driver.find_element(By.CSS_SELECTOR, "ul[class='sidebar-list'] > li") 
crime_n_thrillers.click() 
time.sleep(1) 
books = driver.find_elements(By.CSS_SELECTOR, "div.shop-content li.preview")

Extract the title, author, rating and price from each element using Selenium's find_element method, and wrap this up in an extract function:

program.py
def extract(element): 
	title = element.find_element(By.CSS_SELECTOR, "div.preview-title").text 
	author = element.find_element(By.CSS_SELECTOR, "div.preview-author").text 
	rating = element.find_element(By.CSS_SELECTOR, "div.preview-details p.preview-rating").text 
	price = element.find_element(By.CSS_SELECTOR, "div.preview-details p.preview-price").text 
	# return the extracted data as a dictionary 
	return {"title": title, "author": author, "rating": rating, "price": price} 

Loop through the previews and extract the data, then quit the driver:

program.py
extracted_data = [] 
for element in books: 
	data = extract(element) 
	extracted_data.append(data) 

print(extracted_data)	
driver.quit()

Here's what the output looks like:

Output
[ 
	{'title': 'Does the Sun Also Rise?', 'author': 'Ernst Doubtingway', 'rating': '★★★★☆', 'price': '$9.95'},
	{'title': 'The Insiders', 'author': 'E. S. Hilton', 'rating': '★★★★☆', 'price': '$9.95'},
	{'title': 'A Citrussy Clock', 'author': 'Bethany Urges', 'rating': '★★★★★', 'price': '$9.95'}
]

Puppeteer vs. Selenium: Speed Comparison

Is Puppeteer faster than Selenium? That question comes up often, and the direct answer is yes: Puppeteer is faster than Selenium.

To compare Puppeteer vs. Selenium speed, we used the Danube-store sandbox, ran the scripts we presented above 20 times, and averaged the execution times.

We used the time module to measure how long the Selenium script takes to run: we put start = time.time() at the beginning and end = time.time() at the end of the script, then calculated the elapsed time with end - start.

That is the full script used for Selenium:

program.py
import time 
 
from selenium import webdriver 
from selenium.webdriver.common.by import By 

def extract(element): 
	title = element.find_element(By.CSS_SELECTOR, "div.preview-title").text 
	author = element.find_element(By.CSS_SELECTOR, "div.preview-author").text 
	rating = element.find_element(By.CSS_SELECTOR, "div.preview-details p.preview-rating").text 
	price = element.find_element(By.CSS_SELECTOR, "div.preview-details p.preview-price").text 
 
	return {"title": title, "author": author, "rating": rating, "price": price} 
 
# start the timer 
start = time.time() 
 
options = webdriver.ChromeOptions() 
options.add_argument("--headless") 

# create a new instance of the Chrome driver
driver = webdriver.Chrome(options=options) 
 
url = "https://danube-webshop.herokuapp.com/" 
 
driver.get(url)

# find the first category link and click it
# first element will be the Crime & Thrillers category 
time.sleep(1) 
crime_n_thrillers = driver.find_element(By.CSS_SELECTOR, "ul[class='sidebar-list'] > li") 
crime_n_thrillers.click() 
time.sleep(1) 

# grab the book previews and extract the data with the extract() helper
books = driver.find_elements(By.CSS_SELECTOR, "div.shop-content li.preview") 
 
extracted_data = [] 
for element in books: 
	data = extract(element) 
	extracted_data.append(data) 

print(extracted_data) 
 
end = time.time() 
 
print(f"The whole script took: {end-start:.4f}") 
 
driver.quit()

For the Puppeteer script, we used the Date object. Add const start = Date.now(); at the start and const end = Date.now(); at the end of the script. Then the difference can be calculated with end-start.

And here's the script used for Puppeteer:

program.js
const puppeteer = require('puppeteer'); 
 
async function main() { 
	const start = Date.now(); 
 
	const browser = await puppeteer.launch({ headless: 'new' }) 
	const page = await browser.newPage();
	await page.goto('https://danube-webshop.herokuapp.com/', { waitUntil: 'networkidle2' }) 
 
	await page.waitForSelector('ul.sidebar-list'); 
 
	await Promise.all([ 
		page.waitForNavigation(), 
		page.click("ul[class='sidebar-list'] > li > a"), 
	]); 
 
	await page.waitForSelector("li[class='preview']"); 
	const books = await page.evaluateHandle( 
		() => [...document.querySelectorAll("li[class='preview']")] 
	) 
 
	const processed_data = await page.evaluate(elements => { 
		let data = [] 
		elements.forEach( element => 
			{ 
				let title = element.querySelector("div.preview-title").innerHTML; 
				let author = element.querySelector("div.preview-author").innerHTML; 
				let rating = element.querySelector("div.preview-details > p.preview-rating").innerHTML; 
				let price = element.querySelector("div.preview-details > p.preview-price").innerHTML; 
 
				let result = {title: title, author: author, rating: rating, price: price} 
				data.push(result); 
			}) 
		return data 
	}, books) 
 
	console.log(processed_data) 
	await page.close(); 
	await browser.close(); 
 
	const end = Date.now(); 
	console.log(`Execution time: ${(end - start) / 1000} s`); 
} 
 
main();

You can see the performance test result of Selenium vs. Puppeteer below:

Puppeteer vs. Selenium - Speed Results

The chart shows that Puppeteer is about 60% faster than Selenium, which makes it the better choice for scaling up Chromium-based web scraping projects where speed matters.

Puppeteer vs. Selenium: Which Is Better?

So which is better for scraping, Puppeteer or Selenium? There isn't a direct answer to that question since it depends on multiple factors, like long-term library support, cross-browser support and your web scraping needs.

Puppeteer is fast, but it supports fewer browsers than Selenium. Selenium also supports more programming languages.

Conclusion

Although both Puppeteer and Selenium are good options for building a web scraper, scaling up and optimizing your project might prove difficult because advanced anti-bot systems can detect and block both libraries. The best way to avoid this is to use a web scraping API, like ZenRows.

ZenRows is a web scraping tool that handles all anti-bot bypass for you in a single API call, and it's equipped with amazing features like rotating proxies, headless browsers, automatic retries and more. You can try it for free now.
