Are you stuck deciding between Puppeteer vs. Selenium for web scraping? We get you. Both are fantastic browser automation frameworks, and it's essential to consider your scraping needs and the resources available when deciding.
See the key differences between Puppeteer and Selenium in the table below, and then let's dig into the details.
Criteria | Puppeteer | Selenium |
---|---|---|
Compatible Languages | Only JavaScript is officially supported, but there are unofficial PHP and Python ports | Java, Python, C#, Ruby, PHP, JavaScript and Kotlin |
Browser Support | Chromium and experimental Firefox support | Chrome, Safari, Firefox, Opera, Edge and IE |
Performance | Faster (about 60% faster than Selenium in our test below) | Slower than Puppeteer |
Operating System Support | Windows, Linux and macOS | Windows, Linux, macOS and Solaris |
Architecture | Event-driven architecture with headless browser instances | JSONWire protocol on the web driver to control the browser instance |
Prerequisites | JavaScript package is enough | Selenium Bindings (for the picked programming language) and browser web drivers |
Community | Smaller community compared to Selenium | Huge, established community with extensive documentation |
Let's go ahead and discuss these libraries in detail and walk through a scraping example with each to show how effective they are at extracting data from a web page.
Puppeteer
Puppeteer is a Node.js automation library that controls headless Chrome or Chromium over the DevTools Protocol. It provides tools for automating tasks in Chrome or Chromium, like taking screenshots, generating PDFs and navigating pages.
Puppeteer can also be used to test web pages by simulating user interactions like clicking buttons, filling out forms and verifying the results are displayed as expected.
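For example, a minimal sketch of those screenshot and PDF capabilities might look like the following (`https://example.com` is a stand-in URL, and the `puppeteer` package must be installed):

```javascript
// a minimal sketch: capture a screenshot and a PDF of a page
async function captureExamples(url) {
  const puppeteer = require('puppeteer');
  // launch a headless browser and open a new page
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });
  // save a full-page screenshot, then render the same page to PDF
  await page.screenshot({ path: 'page.png', fullPage: true });
  await page.pdf({ path: 'page.pdf', format: 'A4' });
  await browser.close();
}

// usage: captureExamples('https://example.com');
```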
What Are the Advantages of Puppeteer?
The advantages of Puppeteer are:
- It's easy to use.
- No set-up is required since it comes bundled with Chromium.
- It runs headless by default, but headless mode can also be disabled.
- Puppeteer has an event-driven architecture, removing the need for manual sleep calls in your code.
- It can take screenshots, generate PDFs and automate every action available in the browser.
- It provides management capabilities such as recording runtime and load performance, which help optimize and debug your scraper.
- Puppeteer can crawl SPA (Single Page Applications) and generate pre-rendered content (i.e., server-side rendering).
- You can create Puppeteer scripts by recording your actions in the browser with the Chrome DevTools Recorder.
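To illustrate the event-driven point above: instead of sleeping for a fixed, hoped-for duration, Puppeteer-style code awaits a condition. Here's a simplified, framework-free sketch of that waiting pattern (the `waitFor` helper and its polling interval are illustrative, not part of Puppeteer's API):

```javascript
// poll a predicate until it returns true or the timeout elapses,
// instead of sleeping for a fixed amount of time
function waitFor(predicate, { timeout = 2000, interval = 50 } = {}) {
  return new Promise((resolve, reject) => {
    const deadline = Date.now() + timeout;
    const timer = setInterval(() => {
      if (predicate()) {
        clearInterval(timer);
        resolve(true);
      } else if (Date.now() > deadline) {
        clearInterval(timer);
        reject(new Error('timed out waiting for condition'));
      }
    }, interval);
  });
}

// usage sketch: resolve as soon as the "data" arrives
let data = null;
setTimeout(() => { data = 'loaded'; }, 100);
waitFor(() => data !== null).then(() => console.log(data)); // prints "loaded"
```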
What Are the Disadvantages of Puppeteer?
The disadvantages of Puppeteer are:
- Compared to Selenium, Puppeteer supports fewer browsers.
- Puppeteer focuses on JavaScript only, although there are unofficial ports for Python and PHP.
Web Scraping Sample with Puppeteer
Let's go through a quick Puppeteer web scraping tutorial to get more insights about the performance comparison between Puppeteer vs. Selenium. We'll extract the table items from the Crime and Thriller category of the Danube website:
To get started, import the `puppeteer` module and create an asynchronous function to run the Puppeteer code:
const puppeteer = require('puppeteer');
async function main() {
// write code here
}
Once done, launch the browser instance and create a new page using the `puppeteer.launch()` and `newPage()` methods:
const browser = await puppeteer.launch({ headless: 'new' })
const page = await browser.newPage();
Use the `goto()` method to navigate to the created page, passing `waitUntil: 'networkidle2'` to wait until the network traffic settles. Then wait for the `ul` element with the `sidebar-list` class to load the list items:
await page.goto('https://danube-webshop.herokuapp.com/', { waitUntil: 'networkidle2' })
await page.waitForSelector('ul.sidebar-list');
Click on the first list element to navigate to the Crime & Thrillers category page, then add the `waitForNavigation()` method to wait for the web page to load. Next, handle the asynchronous operations by wrapping them in `Promise.all()`:
await Promise.all([
page.waitForNavigation(),
page.click("ul[class='sidebar-list'] > li > a"),
]);
Wait for the book previews to load using the `waitForSelector()` method. Then extract the previews with `querySelectorAll()` and store them in the `books` variable. Finally, scrape the title, author, price and rating from each preview and print them out:
await page.waitForSelector("li[class='preview']");
const books = await page.evaluateHandle(
() => [...document.querySelectorAll("li[class='preview']")]
)
const processed_data = await page.evaluate(elements => {
let data = []
elements.forEach( element =>
{
let title = element.querySelector("div.preview-title").innerHTML;
let author = element.querySelector("div.preview-author").innerHTML;
let rating = element.querySelector("div.preview-details > p.preview-rating").innerHTML;
let price = element.querySelector("div.preview-details > p.preview-price").innerHTML;
let result = {title: title, author: author, rating: rating, price: price}
data.push(result);
})
return data
}, books)
console.log(processed_data)
await page.close();
await browser.close();
Now, let's wrap it all up in the `main` function, then call `main()` to run it:
// import the puppeteer library
const puppeteer = require('puppeteer');
// create the asynchronous main function
async function main() {
// launch a headless browser instance
const browser = await puppeteer.launch({ headless: 'new' })
// create a new page object
const page = await browser.newPage();
// navigate to the target URL, wait until the loading finishes
await page.goto('https://danube-webshop.herokuapp.com/', { waitUntil: 'networkidle2' })
// wait for left-side bar to load
await page.waitForSelector('ul.sidebar-list');
// click to the first element and wait for the navigation to finish
await Promise.all([
page.waitForNavigation(),
page.click("ul[class='sidebar-list'] > li > a"),
]);
// wait for previews to load
await page.waitForSelector("li[class='preview']");
// extract the book previews
const books = await page.evaluateHandle(
() => [...document.querySelectorAll("li[class='preview']")]
)
// extract the relevant data using page.evaluate
// just pass the elements as the second argument and processing function as the first argument
const processed_data = await page.evaluate(elements => {
// define an array to store the extracted data
let data = []
// use a forEach loop to loop through every preview
elements.forEach( element =>
{
// get the HTML text of the title, author, rating, and price, respectively
let title = element.querySelector("div.preview-title").innerHTML;
let author = element.querySelector("div.preview-author").innerHTML;
let rating = element.querySelector("div.preview-details > p.preview-rating").innerHTML;
let price = element.querySelector("div.preview-details > p.preview-price").innerHTML;
// build a dictionary and store the data as key:value pairs
let result = {title: title, author: author, rating: rating, price: price}
// append the data to the `data` array
data.push(result);
})
// return the result (it will be stored in `processed_data` variable)
return data
}, books)
// print out the extracted data
console.log(processed_data)
// close the page and browser respectively
await page.close();
await browser.close();
}
// run the main function to scrape the data
main();
Go ahead and run the code. Your output should look like this:
[
{
title: 'Does the Sun Also Rise?',
author: 'Ernst Doubtingway',
rating: '★★★★☆',
price: '$9.95'
},
{
title: 'The Insiders',
author: 'E. S. Hilton',
rating: '★★★★☆',
price: '$9.95'
},
{
title: 'A Citrussy Clock',
author: 'Bethany Urges',
rating: '★★★★★',
price: '$9.95'
}
]
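As a side note, the scraped fields come back as display strings. A small, purely illustrative post-processing step (not part of the tutorial itself) can turn them into numbers for easier sorting and filtering:

```javascript
// convert a record like { rating: '★★★★☆', price: '$9.95', ... }
// into numeric fields
function normalize(book) {
  return {
    ...book,
    rating: [...book.rating].filter(ch => ch === '★').length, // count filled stars
    price: Number(book.price.replace('$', '')),               // strip currency sign
  };
}

const sample = {
  title: 'Does the Sun Also Rise?',
  author: 'Ernst Doubtingway',
  rating: '★★★★☆',
  price: '$9.95',
};

console.log(normalize(sample)); // → { ..., rating: 4, price: 9.95 }
```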
Selenium
Selenium is an open-source end-to-end testing and web automation tool often used for web scraping. Its main components are Selenium IDE, Selenium WebDriver and Selenium Grid.
Selenium IDE is used to record actions before automating them, Selenium Grid is used for parallel execution, and Selenium WebDriver executes commands in the browser.
What Are the Advantages of Selenium?
The advantages of Selenium are:
- It's easy to use.
- Selenium supports various programming languages, like Python, Java, JavaScript, Ruby, and C#.
- It can automate Firefox, Edge, Safari and even a custom QtWebKit browser.
- It's possible to scale Selenium to hundreds of instances with different techniques, like setting up the cloud servers with different browser settings.
- It can operate on Windows, macOS and Linux.
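As a sketch of that cross-browser point, Selenium's official JavaScript bindings (the `selenium-webdriver` npm package) let you switch browsers by changing a single string, assuming the corresponding browser driver is installed:

```javascript
// fetch a page title with whichever browser is named;
// 'chrome', 'firefox', 'MicrosoftEdge' and 'safari' all work
// as long as the matching driver is available
async function titleIn(browserName, url) {
  const { Builder } = require('selenium-webdriver');
  const driver = await new Builder().forBrowser(browserName).build();
  try {
    await driver.get(url);
    return await driver.getTitle();
  } finally {
    await driver.quit();
  }
}

// usage: titleIn('firefox', 'https://example.com').then(console.log);
```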
What Are the Disadvantages of Selenium?
The disadvantages of Selenium are:
- Setting up Selenium is relatively complex: you need the language bindings plus a web driver for each target browser.
Web Scraping Sample with Selenium
As we did with Puppeteer, let's run through a tutorial on web scraping with Selenium using the same target site. Start by importing the necessary modules and then configure the Selenium instance:
# for timing calculations and sleeping the scraper
import time
# import the Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
options = webdriver.ChromeOptions()
options.add_argument("--headless")
Initialize the Chrome webdriver:
driver = webdriver.Chrome(options=options)
Point to the target website using the `driver.get()` method:
url = "https://danube-webshop.herokuapp.com/"
driver.get(url)
Click on the Crime & Thrillers category element, adding `time.sleep(1)` calls to pause the scraper for one second. Then use the `find_elements` method to extract the book previews:
time.sleep(1)
crime_n_thrillers = driver.find_element(By.CSS_SELECTOR, "ul[class='sidebar-list'] > li")
crime_n_thrillers.click()
time.sleep(1)
books = driver.find_elements(By.CSS_SELECTOR, "div.shop-content li.preview")
Extract the title, author, rating and price, respectively, from each element using the `find_element` method, and wrap this up in an `extract` function:
def extract(element):
title = element.find_element(By.CSS_SELECTOR, "div.preview-title").text
author = element.find_element(By.CSS_SELECTOR, "div.preview-author").text
rating = element.find_element(By.CSS_SELECTOR, "div.preview-details p.preview-rating").text
price = element.find_element(By.CSS_SELECTOR, "div.preview-details p.preview-price").text
# return the extracted data as a dictionary
return {"title": title, "author": author, "rating": rating, "price": price}
Loop through the previews and extract the data, then quit the driver:
extracted_data = []
for element in books:
data = extract(element)
extracted_data.append(data)
print(extracted_data)
driver.quit()
Here's what the output looks like:
[
{'title': 'Does the Sun Also Rise?', 'author': 'Ernst Doubtingway', 'rating': '★★★★☆', 'price': '$9.95'},
{'title': 'The Insiders', 'author': 'E. S. Hilton', 'rating': '★★★★☆', 'price': '$9.95'},
{'title': 'A Citrussy Clock', 'author': 'Bethany Urges', 'rating': '★★★★★', 'price': '$9.95'}
]
Puppeteer vs. Selenium: Speed Comparison
Is Puppeteer faster than Selenium? That question comes up often, and the short answer is yes: Puppeteer is faster than Selenium.
To compare Puppeteer vs. Selenium speed, we used the Danube-store sandbox, ran the scripts we presented above 20 times, and averaged the execution times.
We used the `time` module to measure how long the Selenium script takes: `start_time = time.time()` at the beginning of the script, `end_time = time.time()` at the end, and `end_time - start_time` for the elapsed time.
That is the full script used for Selenium:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
def extract(element):
title = element.find_element(By.CSS_SELECTOR, "div.preview-title").text
author = element.find_element(By.CSS_SELECTOR, "div.preview-author").text
rating = element.find_element(By.CSS_SELECTOR, "div.preview-details p.preview-rating").text
price = element.find_element(By.CSS_SELECTOR, "div.preview-details p.preview-price").text
return {"title": title, "author": author, "rating": rating, "price": price}
# start the timer
start = time.time()
options = webdriver.ChromeOptions()
options.add_argument("--headless")
# create a new instance of the Chrome driver
driver = webdriver.Chrome(options=options)
url = "https://danube-webshop.herokuapp.com/"
driver.get(url)
# get the first category and click its link
# first element will be the Crime & Thrillers category
time.sleep(1)
crime_n_thrillers = driver.find_element(By.CSS_SELECTOR, "ul[class='sidebar-list'] > li")
crime_n_thrillers.click()
time.sleep(1)
# get the data div and extract the data using beautifulsoup
books = driver.find_elements(By.CSS_SELECTOR, "div.shop-content li.preview")
extracted_data = []
for element in books:
data = extract(element)
extracted_data.append(data)
print(extracted_data)
end = time.time()
print(f"The whole script took: {end-start:.4f}")
driver.quit()
For the Puppeteer script, we used the `Date` object: `const start = Date.now();` at the start, `const end = Date.now();` at the end, and `end - start` for the elapsed time.
And here's the script used for Puppeteer:
const puppeteer = require('puppeteer');
async function main() {
const start = Date.now();
const browser = await puppeteer.launch({ headless: 'new' })
const page = await browser.newPage();
await page.goto('https://danube-webshop.herokuapp.com/', { waitUntil: 'networkidle2' })
await page.waitForSelector('ul.sidebar-list');
await Promise.all([
page.waitForNavigation(),
page.click("ul[class='sidebar-list'] > li > a"),
]);
await page.waitForSelector("li[class='preview']");
const books = await page.evaluateHandle(
() => [...document.querySelectorAll("li[class='preview']")]
)
const processed_data = await page.evaluate(elements => {
let data = []
elements.forEach( element =>
{
let title = element.querySelector("div.preview-title").innerHTML;
let author = element.querySelector("div.preview-author").innerHTML;
let rating = element.querySelector("div.preview-details > p.preview-rating").innerHTML;
let price = element.querySelector("div.preview-details > p.preview-price").innerHTML;
let result = {title: title, author: author, rating: rating, price: price}
data.push(result);
})
return data
}, books)
console.log(processed_data)
await page.close();
await browser.close();
const end = Date.now();
console.log(`Execution time: ${(end - start) / 1000} s`);
}
main();
You can see the performance test result of Selenium vs. Puppeteer below:
The chart shows that Puppeteer is about 60% faster than Selenium. For projects that only need Chromium, scaling up Puppeteer applications is therefore the better web scraping choice in this regard.
Puppeteer vs. Selenium: Which Is Better?
So which one is better for scraping, Selenium or Puppeteer? There isn't a direct answer to that question since it depends on multiple factors, like long-term library support, cross-browser support and your web scraping needs.
Puppeteer is fast, but compared to Selenium, it supports fewer browsers. Selenium also supports more languages compared to Puppeteer.
Conclusion
Although Puppeteer and Selenium are both good options for a web scraper, scaling up and optimizing your web scraping project might be difficult because advanced anti-bot systems can detect and block these libraries. The best way to avoid this is by making use of a web scraping API, like ZenRows.
ZenRows is a web scraping tool that handles all anti-bot bypass for you in a single API call, and it's equipped with amazing features like rotating proxies, headless browsers, automatic retries and more. You can try it for free now.