Interested in using Puppeteer in Python? Luckily, there's an unofficial Python wrapper over the original Node.js library: Pyppeteer!
In this article, you'll learn how to use Pyppeteer for web scraping, including how to:
- Browse pages and extract data.
- Interact with dynamic content.
- Take screenshots.
- Integrate proxies.
- Automate logins.
- Solve common errors.
Let's get started! 👍
What Is Pyppeteer in Python?
Pyppeteer is a tool to automate a Chromium browser with code, allowing Python developers to gain JavaScript-rendering capabilities to interact with modern websites and simulate human behavior better.
It comes with a headless browser mode, which gives you the full functionality of a browser but without a graphical user interface, increasing speed and saving memory. This makes it a powerful tool for web scraping without getting blocked, as it can emulate real browser behavior.
Now, let's begin with the Pyppeteer tutorial.
How to Use Pyppeteer
Let's go over the fundamentals of using Puppeteer in Python, starting with the installation procedure.
1. How to Install Pyppeteer in Python
You must have Python 3.6+ installed on your system as a prerequisite. If you are new to it, check out an installation guide.
Note: Feel free to refresh your Python web scraping foundation with our tutorial if you need to.
Then, use the command below to install Pyppeteer:
pip install pyppeteer
When you launch Pyppeteer for the first time, it downloads a recent version of Chromium (about 150 MB) if it isn't already installed, so the first run takes longer to execute.
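If you'd rather not pay that cost on the first run, you can trigger the download ahead of time, either with the bundled pyppeteer-install command (covered again in the errors section below) or from Python. Here's a minimal sketch, assuming the pyppeteer.chromium_downloader helper module that ships with the package (its internals may vary between versions):

# pre-download Chromium so the first launch() doesn't stall
# (pyppeteer.chromium_downloader is an internal helper; assumed stable here)
from pyppeteer import chromium_downloader

if not chromium_downloader.check_chromium():
    chromium_downloader.download_chromium()
# print the path of the downloaded Chromium binary
print(chromium_downloader.chromium_executable())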
2. How to Use Pyppeteer
To use Pyppeteer, start by importing the required packages.
import asyncio  # part of the Python standard library, no installation needed
from pyppeteer import launch
Now, create a function to get an object going to ScrapingCourse.com, a demo website with e-commerce features.
async def scraper():
    browser = await launch({"headless": False})
    page = await browser.newPage()
    await page.goto("https://www.scrapingcourse.com/ecommerce/")
    await browser.close()
Note: Setting the headless option to False launches a Chrome instance with a GUI. We didn't use True because we're testing.
Then, an asynchronous call to the scraper() function puts the script into action.
asyncio.run(scraper())
Here's what the complete code looks like:
import asyncio
from pyppeteer import launch
async def scraper():
    browser = await launch({"headless": False})
    page = await browser.newPage()
    await page.goto("https://www.scrapingcourse.com/ecommerce/")
    # get HTML
    await browser.close()
asyncio.run(scraper())
And it works:
Notice the prompt "Chrome is being controlled by automated test software".
3. Scrape Pages with Pyppeteer
Add a few lines of code to grab the page's HTML, close the browser instance, and return the content.
htmlContent = await page.content()
await browser.close()
return htmlContent
Add them to your script and print the HTML.
import asyncio
from pyppeteer import launch
async def scraper():
    browser = await launch({"headless": False})
    page = await browser.newPage()
    await page.goto("https://www.scrapingcourse.com/ecommerce/")
    # get HTML
    htmlContent = await page.content()
    await browser.close()
    return htmlContent

response = asyncio.run(scraper())
print(response)
Here's the output:
<!DOCTYPE html>
<html lang="en-US">
<head>
    <!-- ... -->
    <title>Ecommerce Test Site to Learn Web Scraping – ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count">Showing 1–16 of 188 results</p>
    <ul class="products columns-4">
        <!-- ... -->
        <li>
            <h2 class="woocommerce-loop-product__title">Abominable Hoodie</h2>
            <span class="price">
                <span class="woocommerce-Price-amount amount">
                    <bdi>
                        <span class="woocommerce-Price-currencySymbol">$</span>69.00
                    </bdi>
                </span>
            </span>
            <a aria-describedby="This product has multiple variants. The options may ...">Select options</a>
        </li>
        <!-- ... other products omitted for brevity -->
    </ul>
</body>
</html>
Congratulations! 🎉 You scraped your first web page using Pyppeteer.
4. Parse HTML Contents for Data Extraction
Pyppeteer is quite a powerful tool that also allows parsing the raw HTML of a page to extract the desired information: in our case, the product titles and prices from the ScrapingCourse.com store.
Let's take a look at the source code to identify the elements we're interested in. For that, go to the website, right-click anywhere and select "Inspect". The HTML will be shown in the Developer Tools window.
Look closely at the screenshot above. Each product is in a dedicated list tag with the product class. The product titles are in <h2> tags. Similarly, the prices are inside <span> tags with the price class.
Back in your code, use querySelectorAll() to extract all the parent li elements. Then, loop through each parent element to extract the product titles and prices.
import asyncio
from pyppeteer import launch
async def scraper():
    # launch the browser and a new page instance
    browser = await launch({"headless": False})
    page = await browser.newPage()
    # visit the target website
    await page.goto("https://www.scrapingcourse.com/ecommerce/")
    # select the product parent elements (list tags)
    products = await page.querySelectorAll("li.product")
    # loop through the product parent elements to extract titles and prices
    for product in products:
        title_element = await product.querySelector("h2")
        title = await title_element.getProperty("textContent")
        price_element = await product.querySelector("span.price")
        price = await price_element.getProperty("textContent")
        # output the result
        print(f"Title: {await title.jsonValue()} || Price: {await price.jsonValue()}")
    # close the browser
    await browser.close()

# run your scraper function asynchronously
asyncio.run(scraper())
Run the script, and see the result:
Title: Abominable Hoodie || Price: $69.00
#... other products omitted for brevity
Title: Artemis Running Short || Price: $45.00
Nice job! 👏
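As a side note, when you only need text, you can often shorten this kind of extraction with page.querySelectorAllEval(), which runs a JavaScript function against every matching element in a single call. Here's a minimal sketch of the same title and price extraction, reusing the selectors above:

import asyncio
from pyppeteer import launch

async def scraper():
    browser = await launch()
    page = await browser.newPage()
    await page.goto("https://www.scrapingcourse.com/ecommerce/")
    # evaluate one JS function against all matching elements at once
    rows = await page.querySelectorAllEval(
        "li.product",
        """(items) => items.map((item) => ({
            title: item.querySelector("h2").textContent,
            price: item.querySelector("span.price").textContent,
        }))""",
    )
    await browser.close()
    for row in rows:
        print(f"Title: {row['title']} || Price: {row['price']}")

asyncio.run(scraper())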
5. Interact with Dynamic Pages
Many websites nowadays, like the ScrapingCourse infinite scrolling page, are dynamic, meaning that JavaScript renders or updates their content. For example, social media websites usually use infinite scrolling for their post timelines.
Scraping such websites is a challenging task with the Requests and BeautifulSoup libraries. However, Pyppeteer comes in handy for the job, and we'll use it to wait for events, click buttons, and scroll down.
Wait for the Page to Load
You must wait for the contents of the current page to load before proceeding to the next activity when using programmatically controlled browsers, and the two most popular approaches to achieve this are waitFor() and waitForSelector().
waitFor()
The waitFor() function waits for a specified time in milliseconds.
The following example opens the page in Chromium and waits for 4000 milliseconds before closing it.
import asyncio
from pyppeteer import launch
async def scraper():
    # launch the browser and a new page instance
    browser = await launch({"headless": False})
    page = await browser.newPage()
    # visit the target website
    await page.goto("https://www.scrapingcourse.com/infinite-scrolling")
    # wait for the page to load
    await page.waitFor(4000)
    # close the browser
    await browser.close()
asyncio.run(scraper())
waitForSelector()
waitForSelector() waits for a particular element to appear on the page before continuing.
For example, the following script waits for a <div> to appear before moving on to the next step.
import asyncio
from pyppeteer import launch
async def scraper():
    # launch the browser and a new page instance
    browser = await launch({"headless": False})
    page = await browser.newPage()
    # visit the target website
    await page.goto("https://www.scrapingcourse.com/infinite-scrolling")
    # wait for a particular element
    await page.waitForSelector("div.product-grid", {"visible": True})
    # close the browser
    await browser.close()
asyncio.run(scraper())
The waitForSelector() method accepts two arguments: a CSS selector pointing to the desired element and an optional options dictionary. In the case above, options is {"visible": True} to wait until the <div> element becomes visible.
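The options dictionary also takes a timeout in milliseconds (30000 by default). If the element never appears, waitForSelector() raises pyppeteer.errors.TimeoutError, which you can catch to fail gracefully. Here's a minimal sketch (the 10-second timeout is an arbitrary choice):

import asyncio
from pyppeteer import launch
from pyppeteer.errors import TimeoutError as PyppeteerTimeoutError

async def scraper():
    browser = await launch({"headless": False})
    page = await browser.newPage()
    await page.goto("https://www.scrapingcourse.com/infinite-scrolling")
    try:
        # give the product grid up to 10 seconds to become visible
        await page.waitForSelector("div.product-grid", {"visible": True, "timeout": 10000})
    except PyppeteerTimeoutError:
        print("The product grid never appeared")
    await browser.close()

asyncio.run(scraper())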
Click on a Button
You can use Pyppeteer Python to click buttons or other elements on a web page. All you need to do is find that particular element using selectors and call the click() method.
The next example clicks the first product on the target page, selecting it by index from a list of similar elements.
import asyncio
from pyppeteer import launch
async def scraper():
    # launch the browser and a new page instance
    browser = await launch({"headless": False})
    page = await browser.newPage()
    # visit the target website
    await page.goto("https://www.scrapingcourse.com/infinite-scrolling")
    # get the product to click (first product)
    first_product = await page.querySelectorAll("img.product-image")
    # click the first product
    await first_product[0].click()
    # close the browser
    await browser.close()
asyncio.run(scraper())
As mentioned earlier, web scraping developers wait for the page to load before interacting further. For example, you can wait for an element to be visible before clicking.
In the next example, you wait for the product grid to load on the initial target before clicking the first item. Once the next page loads, wait for the product container to load and then print the product page title.
Let's see how to do this with Pyppeteer:
import asyncio
from pyppeteer import launch
async def scraper():
    # launch the browser and a new page instance
    browser = await launch({"headless": False})
    page = await browser.newPage()
    # visit the target website
    await page.goto("https://www.scrapingcourse.com/infinite-scrolling")
    # wait for the element grid to be visible
    await page.waitForSelector("div.product-grid", {"visible": True})
    # get the product to click (first product)
    first_product = await page.querySelectorAll("img.product-image")
    # click the first product
    await first_product[0].click()
    # wait for the product on the next page to be visible
    await page.waitForSelector("div.woocommerce-product-gallery", {"visible": True})
    # get the page title
    page_title = await page.title()
    # output the title
    print(page_title)
    # close the browser
    await browser.close()
# run the scraper function
asyncio.run(scraper())
The output is the following:
Chaz Kangeroo Hoodie – Ecommerce Test Site to Learn Web Scraping
Notice we incorporated the waitForSelector() method to add robustness to the code.
Scroll the Page
Pyppeteer is useful for modern websites that use infinite scroll to load content, and the evaluate() function helps in such cases. Look at the code below to see how.
import asyncio
from pyppeteer import launch
async def scraper():
    # launch the browser and a new page instance
    browser = await launch({"headless": False})
    page = await browser.newPage()
    # set the viewport of the page
    await page.setViewport({"width": 1280, "height": 720})
    # visit the target website
    await page.goto("https://www.scrapingcourse.com/infinite-scrolling")
    # scroll the page vertically
    await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
    # wait for the page to load
    await page.waitFor(5000)
    # close the browser
    await browser.close()
# run the scraper function
asyncio.run(scraper())
We created a browser object (with a page tab) and set the viewport size of the browser window (the visible area of the web page, which affects how it renders on the screen). The script then scrolls the window down by the full height of the document body.
You can employ this scrolling to load all the data and scrape it. For example, assume you want to get all the product names from the infinite scroll page:
import asyncio
from pyppeteer import launch
async def scraper():
    # launch the browser and a new page instance
    browser = await launch()
    page = await browser.newPage()
    # set the viewport of the page
    await page.setViewport({"width": 1280, "height": 720})
    # visit the target website
    await page.goto("https://www.scrapingcourse.com/infinite-scrolling")
    # get the height of the current page
    current_height = await page.evaluate("document.body.scrollHeight")
    while True:
        # scroll to the bottom of the page
        await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
        # wait for the page to load new content
        await page.waitFor(4000)
        # update the height of the current page
        new_height = await page.evaluate("document.body.scrollHeight")
        # break the loop if we have reached the end of the page
        if new_height == current_height:
            break
        current_height = new_height
    # select the product parent elements
    products = await page.querySelectorAll("div.product-item")
    # loop through the product parent elements to extract titles and prices
    for product in products:
        title_element = await product.querySelector("span.product-name")
        title = await title_element.getProperty("textContent")
        price_element = await product.querySelector("span.product-price")
        price = await price_element.getProperty("textContent")
        # output the result
        print(f"Title: {await title.jsonValue()} || Price: {await price.jsonValue()}")
    # close the browser
    await browser.close()
# run the scraper function
asyncio.run(scraper())
The Pyppeteer script above navigates to the page and gets the current scroll height, then iteratively scrolls the page vertically until no more scrolling happens. The waitFor() call pauses for four seconds after each scroll to give the page time to load new content. A more event-driven alternative is sketched below.
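As a hedged alternative to the fixed delay, page.waitForFunction() can poll a JavaScript condition and resume as soon as the document actually grows, which saves time when content loads faster than four seconds:

from pyppeteer.errors import TimeoutError as PyppeteerTimeoutError

# drop-in replacement for the waitFor(4000) step inside the scrolling loop
await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
try:
    # resume as soon as the document grows taller, waiting at most 5 seconds
    await page.waitForFunction(
        f"document.body.scrollHeight > {current_height}",
        {"timeout": 5000},
    )
except PyppeteerTimeoutError:
    # no new content within 5 seconds: we likely reached the end of the page
    pass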
Either way, the script ends by printing the names and prices of all the loaded products, as shown in the following output snippet.
Title: Chaz Kangeroo Hoodie || Price: $52
Title: Teton Pullover Hoodie || Price: $70
#... other products omitted for brevity
Title: Antonia Racer Tank || Price: $34
Title: Breathe-Easy Tank || Price: $34
Great job so far! 👏
6. Take a Screenshot with Pyppeteer
It would be convenient to observe what the scraper is doing, right? But in production, there's no GUI to watch in real time.
Fortunately, Pyppeteer's screenshot feature can help with debugging. The following code opens a web page, takes a screenshot of the full page, and saves it in the current directory as "web_screenshot.png".
import asyncio
from pyppeteer import launch
async def main():
    browser = await launch({"headless": True})
    page = await browser.newPage()
    await page.goto("https://www.scrapingcourse.com/infinite-scrolling")
    # fullPage captures the whole page, not just the viewport
    await page.screenshot({"path": "web_screenshot.png", "fullPage": True})
    await browser.close()
asyncio.run(main())
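You can also screenshot a single element instead of the whole page: grab its ElementHandle and call screenshot() on it. Here's a minimal sketch, reusing the product grid selector from earlier:

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch({"headless": True})
    page = await browser.newPage()
    await page.goto("https://www.scrapingcourse.com/infinite-scrolling")
    # capture only the product grid element, not the whole viewport
    grid = await page.querySelector("div.product-grid")
    await grid.screenshot({"path": "product_grid.png"})
    await browser.close()

asyncio.run(main())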
7. Use a Proxy with Pyppeteer
While doing web scraping, you need to use proxies to avoid being blocked by the target website. But why is that?
If you access a website with hundreds or thousands of daily requests, the site can blacklist your IP, and you won't be able to scrape the content anymore.
Proxies act as intermediaries between you and the target website, giving you new IPs. Some web scraping proxy providers, like ZenRows, include default IP rotation mechanisms to prevent your addresses from getting banned, saving you money.
Take a look at the following code snippet to learn how to integrate a proxy with Pyppeteer via the launch method.
import asyncio
from pyppeteer import launch
async def main():
    browser = await launch({"args": ["--proxy-server=<PROXY_IP_ADDRESS>:<PROXY_PORT>"], "headless": False})
    page = await browser.newPage()
    await page.authenticate({"username": "<YOUR_USERNAME>", "password": "<YOUR_PASSWORD>"})
    await page.goto("https://www.scrapingcourse.com/ecommerce/")
    await browser.close()
asyncio.run(main())
Note: If the proxy requires a username and password, you can set the credentials using the authenticate() method.
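Because the proxy is just a launch argument, a common pattern is to rotate through a pool of addresses, picking a different one for each browser launch. Here's a minimal sketch with a hypothetical proxy list:

import asyncio
import random
from pyppeteer import launch

# hypothetical proxy pool; replace with your provider's addresses
PROXIES = ["12.34.56.78:8080", "98.76.54.32:3128"]

async def main():
    # pick a different proxy on each launch
    proxy = random.choice(PROXIES)
    browser = await launch({"args": [f"--proxy-server={proxy}"]})
    page = await browser.newPage()
    await page.goto("https://www.scrapingcourse.com/ecommerce/")
    print(await page.title())
    await browser.close()

asyncio.run(main())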
8. Login with Pyppeteer
You might want to scrape content behind a login sometimes, and Pyppeteer can help in this regard.
Go to the ScrapingCourse login challenge page, where you can play around with login automation.
Note: You'll find the default login credentials at the top of the login box, as shown below. Use "[email protected]" as the username and "password" as your password.
Let's look at the HTML of those elements.
The script below enters the user credentials and then clicks on the login button with Pyppeteer. After that, it waits five seconds to let the next page load completely. Finally, it takes a screenshot of the page to test whether the login was successful.
import asyncio
from pyppeteer import launch
async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto("https://www.scrapingcourse.com/login")
    await page.type("#email", "[email protected]")
    await page.type("#password", "password")
    await page.click("button.btn.submit-btn")
    await page.waitFor(5000)
    await page.screenshot({"path": "scrapingcourse-logged-in.png"})
    await browser.close()
asyncio.run(main())
See the output screenshot below:
Congratulations! 😄 You successfully logged in.
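By the way, the fixed five-second wait works for this demo, but a more robust pattern is to wait for the navigation that the click triggers. Here's a minimal sketch using page.waitForNavigation() together with the click:

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto("https://www.scrapingcourse.com/login")
    await page.type("#email", "[email protected]")
    await page.type("#password", "password")
    # fire the click and wait for the resulting navigation together
    await asyncio.gather(
        page.waitForNavigation(),
        page.click("button.btn.submit-btn"),
    )
    await page.screenshot({"path": "scrapingcourse-logged-in.png"})
    await browser.close()

asyncio.run(main())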
Note: This website was simple and required only a username and password, but some websites implement more advanced security measures. Read our guide on how to scrape behind a login with Python to learn more.
Solve Common Errors
You may face some errors when setting up Pyppeteer, so here's how to solve the most common ones.
Error: Pyppeteer Is Not Installed
While installing Pyppeteer, you may encounter the "Unable to install Pyppeteer" error.
The Python version on your system is the root cause, as Pyppeteer supports only Python 3.6+ versions. So, if you have an older version, you may encounter such installation errors. The solution is upgrading Python and reinstalling Pyppeteer.
Error: Pyppeteer Browser Closed Unexpectedly
Let's assume you execute your Pyppeteer Python script for the first time after installation but encounter this error: pyppeteer.errors.BrowserError: Browser closed unexpectedly.
That means not all Chromium dependencies were completely installed. The solution is to manually install Chromium using the following command:
pyppeteer-install
Conclusion
Pyppeteer is an unofficial Python port for the classic Node.js Puppeteer library. It's a setup-friendly, lightweight, and fast package suitable for web automation and dynamic website scraping.
This tutorial has taught you how to perform basic headless web scraping with Python's Puppeteer and deal with web logins and advanced dynamic interactions. Additionally, you know now how to integrate proxies with Pyppeteer.
If you need more features, check out the official manual, for example, to set a custom user agent in Pyppeteer.
Need to scrape at a large scale without worrying about infrastructure? Let ZenRows help you with its massively scalable web scraping API.
Frequent Questions
What Is the Difference Between Pyppeteer and Puppeteer?
The difference is that Puppeteer is an official Node.js NPM package, while Pyppeteer is an unofficial Python wrapper over the original Puppeteer.
The primary distinction between them is the baseline programming language and the developer APIs they offer. For the rest, they have almost the same capabilities for automating web browsers.
What Is the Python Wrapper for Puppeteer?
Pyppeteer is Puppeteer's Python wrapper. Using the Chrome DevTools Protocol, Pyppeteer offers an API for controlling the headless version of Google Chrome or Chromium, which enables you to carry out web automation activities like website scraping, web application testing, and automating repetitive processes.
What Is the Equivalent of Puppeteer in Python?
The equivalent of Puppeteer in Python is Pyppeteer, a library that lets you control headless Chromium, render JavaScript, and automate user interactions with web pages.
Can I Use Puppeteer with Python?
Yes, you can use Puppeteer with Python. However, you must first create a bridge to connect Python and JavaScript. Pyppeteer is exactly that.
What Is the Python Version of Puppeteer?
The Python version of Puppeteer is Pyppeteer. Similar to Puppeteer in functionality, Pyppeteer offers a high-level API for managing the browser.