Pyppeteer: Use Puppeteer in Python [2024]

May 16, 2024 · 10 min read

Interested in using Puppeteer in Python? Luckily, there's an unofficial Python wrapper over the original Node.js library: Pyppeteer!

In this article, you'll learn how to use Pyppeteer for web scraping, from installation to parsing HTML, interacting with dynamic pages, taking screenshots, using proxies, and logging in.

Let's get started! 👍

What Is Pyppeteer in Python?

Pyppeteer is a tool to automate a Chromium browser with code, giving Python developers JavaScript-rendering capabilities to interact with modern websites and better simulate human behavior.

It comes with a headless browser mode, which gives you the full functionality of a browser but without a graphical user interface, increasing speed and saving memory.
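To make this concrete, headless mode in Pyppeteer is just a key in the options dictionary you pass to launch(). A minimal sketch of the two modes (the dictionaries below are plain Python and mirror what the later examples pass):

```python
# The "headless" key in launch()'s options dict toggles the GUI.
headless_opts = {"headless": True}   # no window: faster, lower memory
headful_opts = {"headless": False}   # visible window, handy while debugging

print(headless_opts["headless"], headful_opts["headless"])  # True False
```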

Now, let's begin with the Pyppeteer tutorial.

How to Use Pyppeteer

Let's go over the fundamentals of using Puppeteer in Python, starting with the installation.

1. How to Install Pyppeteer in Python

You must have Python 3.6+ installed on your system as a prerequisite. If you are new to it, check out an installation guide.
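You can confirm your interpreter meets that requirement directly from Python (a quick, browser-free check):

```python
import sys

# Pyppeteer supports Python 3.6+; version_info compares tuples element-wise.
meets_requirement = sys.version_info >= (3, 6)
print(meets_requirement)
```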

Note: Feel free to refresh your Python web scraping foundation with our tutorial if you need to.

Then, use the command below to install Pyppeteer:

Terminal
pip install pyppeteer

When you launch Pyppeteer for the first time, it'll download the most recent version of Chromium (150MB) if it isn't already installed, taking longer to execute as a result.

2. How to Use Pyppeteer

To use Pyppeteer, start by importing the required packages.

program.py
import asyncio  # ships with Python's standard library; no separate install needed
from pyppeteer import launch

Now, create an asynchronous function that launches a browser and navigates to ScrapingCourse.com, a demo website with e-commerce features.

program.py
async def scraper():
    browser = await launch({"headless": False})
    page = await browser.newPage()
    await page.goto("https://www.scrapingcourse.com/ecommerce/")

    await browser.close()

Note: Setting the headless option to False launches a Chrome instance with a GUI. We set it to False here so we can watch the browser while testing.

Then, an asynchronous call to the scraper() function puts the script into action.

program.py
asyncio.run(scraper())

Here's what the complete code looks like:

program.py
import asyncio
from pyppeteer import launch

async def scraper():
    browser = await launch({"headless": False})
    page = await browser.newPage()
    await page.goto("https://www.scrapingcourse.com/ecommerce/")

    ## get HTML
    await browser.close()

asyncio.run(scraper())

And it works:

scrapingcourse ecommerce controlled by automated software

Notice the prompt "Chrome is being controlled by automated test software".

3. Scrape Pages with Pyppeteer

Add a few lines of code to grab the page's HTML and close the browser instance before returning.

htmlContent = await page.content()
await browser.close()
return htmlContent

Add them to your script and print the HTML.

program.py
import asyncio
from pyppeteer import launch

async def scraper():
    browser = await launch({"headless": False})
    page = await browser.newPage()
    await page.goto("https://www.scrapingcourse.com/ecommerce/")

    ## get HTML
    htmlContent = await page.content()
    await browser.close()
    return htmlContent

response = asyncio.run(scraper())
print(response)

Here's the output:

Output
<!DOCTYPE html>
<html lang="en-US">
<head>
    <!-- ... -->

    <title>Ecommerce Test Site to Learn Web Scraping – ScrapingCourse.com</title>

  <!-- ... -->
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count">Showing 1–16 of 188 results</p>
    <ul class="products columns-4">
        <!-- ... -->

        <li>
            <h2 class="woocommerce-loop-product__title">Abominable Hoodie</h2>
            <span class="price">
                <span class="woocommerce-Price-amount amount">
                    <bdi>
                        <span class="woocommerce-Price-currencySymbol">$</span>69.00
                    </bdi>
                </span>
            </span>
            <a aria-describedby="This product has multiple variants. The options may ...">Select options</a>
        </li>

        <!-- ... other products omitted for brevity -->
    </ul>
</body>
</html>

Congratulations! 🎉 You scraped your first web page using Pyppeteer.

4. Parse HTML Contents for Data Extraction

Pyppeteer is quite a powerful tool that also allows parsing the raw HTML of a page to extract the desired information. In our case, we'll grab the products' titles and prices from the ScrapingCourse.com store.

Let's take a look at the source code to identify the elements we're interested in. For that, go to the website, right-click anywhere and select "Inspect". The HTML will be shown in the Developer Tools window.

Product Container Inspection

Look closely at the screenshot above. Each product is in a dedicated <li> tag with the class product. The product titles are in <h2> tags. Similarly, the prices are inside <span> tags with the price class.
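If you want to sanity-check that structure without launching a browser, here's a browser-free sketch that extracts the same fields from a simplified, hypothetical slice of the markup using the standard library's html.parser (the real page nests prices more deeply, as the earlier output shows):

```python
from html.parser import HTMLParser

# A simplified, hypothetical slice of the product markup described above.
SNIPPET = """
<ul class="products columns-4">
  <li class="product">
    <h2 class="woocommerce-loop-product__title">Abominable Hoodie</h2>
    <span class="price">$69.00</span>
  </li>
</ul>
"""

class ProductParser(HTMLParser):
    """Collects <h2> titles and <span class="price"> texts."""
    def __init__(self):
        super().__init__()
        self.capture = None  # "title" or "price" while inside a target tag
        self.titles, self.prices = [], []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if tag == "h2":
            self.capture = "title"
        elif tag == "span" and "price" in classes.split():
            self.capture = "price"

    def handle_endtag(self, tag):
        if tag in ("h2", "span"):
            self.capture = None

    def handle_data(self, data):
        text = data.strip()
        if text and self.capture == "title":
            self.titles.append(text)
        elif text and self.capture == "price":
            self.prices.append(text)

parser = ProductParser()
parser.feed(SNIPPET)
print(parser.titles, parser.prices)  # ['Abominable Hoodie'] ['$69.00']
```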

Back to your code, use querySelectorAll() to extract all the parent li elements. Then, loop through each parent element to extract product titles and prices.

program.py
import asyncio
from pyppeteer import launch

async def scraper():

    # launch the browser and a new page instance
    browser = await launch({"headless": False})
    page = await browser.newPage()

    # visit the target website
    await page.goto("https://www.scrapingcourse.com/ecommerce/")

    # select the product parent element (list tags)
    products = await page.querySelectorAll("li.product")

    # loop through the product parent element to extract titles and prices
    for product in products:
        title_element = await product.querySelector("h2")
        title = await title_element.getProperty("textContent")
 
        price_element = await product.querySelector("span.price")
        price = await price_element.getProperty("textContent")

        # output the result
        print(f"Title: {await title.jsonValue()} || Price: {await price.jsonValue()}")

    # close the browser
    await browser.close()

# run your scraper function asynchronously
asyncio.run(scraper())



Run the script, and see the result:

Output
Title: Abominable Hoodie || Price: $69.00

#... other products omitted for brevity

Title: Artemis Running Short || Price: $45.00

Nice job! 👍

5. Interact with Dynamic Pages

Many websites nowadays, like the ScrapingCourse infinite scrolling page, are dynamic, meaning that JavaScript loads or updates their content. For example, social media websites usually use infinite scrolling for their post timelines.

Scraping such websites is challenging with the Requests and BeautifulSoup libraries. However, Pyppeteer comes in handy for the job, and we'll use it to wait for events, click buttons, and scroll down.

Wait for the Page to Load

When using programmatically controlled browsers, you must wait for the current page's contents to load before proceeding to the next action. The two most popular approaches to achieve this are waitFor() and waitForSelector().

waitFor()

The waitFor() function waits for a time specified in milliseconds.

The following example opens the page in Chromium and waits for 4000 milliseconds before closing it.

scraper.py
import asyncio
from pyppeteer import launch

async def scraper():
    # launch the browser and a new page instance
    browser = await launch({"headless": False})
    page = await browser.newPage()

    # visit the target website
    await page.goto("https://www.scrapingcourse.com/infinite-scrolling")

    # wait for page to load
    await page.waitFor(4000)

    # close the browser
    await browser.close()

asyncio.run(scraper())

waitForSelector()

waitForSelector() waits for a particular element to appear on the page before continuing.

For example, the following script waits for some <div> to appear before moving on to the next step.

scraper.py
import asyncio
from pyppeteer import launch

async def scraper():
    # launch the browser and a new page instance
    browser = await launch({"headless": False})
    page = await browser.newPage()

    # visit the target website
    await page.goto("https://www.scrapingcourse.com/infinite-scrolling")

    # wait for a particular element
    await page.waitForSelector("div.product-grid", {"visible": True})

    # close the browser
    await browser.close()

asyncio.run(scraper())

The waitForSelector() method accepts two arguments: a CSS selector pointing to the desired element and an optional options dictionary. In our case above, options is {"visible": True} to wait until the <div> element becomes visible.

Click on a Button

You can use Pyppeteer to click buttons or other elements on a web page. All you need to do is find that particular element using selectors and call the click() method.

The next example clicks the first product on the target page by indexing into a list of matching elements.

scraper.py
import asyncio
from pyppeteer import launch

async def scraper():
    # launch the browser and a new page instance
    browser = await launch({"headless": False})
    page = await browser.newPage()

    # visit the target website
    await page.goto("https://www.scrapingcourse.com/infinite-scrolling")

    # get the product to click (first product)
    first_product = await page.querySelectorAll("img.product-image")

    # click the first product
    await first_product[0].click()

    # close the browser
    await browser.close()

asyncio.run(scraper())

As mentioned earlier, web scraping developers wait for the page to load before interacting further. For example, you can wait for an element to be visible before clicking.

In the next example, you wait for the product grid to load on the initial target before clicking the first item. Once the next page loads, wait for the product container to load and then print the product page title.

Let's see how to do this with Pyppeteer:

scraper.py
import asyncio
from pyppeteer import launch

async def scraper():
    # launch the browser and a new page instance
    browser = await launch({"headless": False})
    page = await browser.newPage()

    # visit the target website
    await page.goto("https://www.scrapingcourse.com/infinite-scrolling")

    # wait for the element grid to be visible
    await page.waitForSelector("div.product-grid", {"visible": True})

    # get the product to click (first product)
    first_product = await page.querySelectorAll("img.product-image")

    # click the first product
    await first_product[0].click()

    # wait for the product on the next page to be visible
    await page.waitForSelector("div.woocommerce-product-gallery", {"visible": True})

    # get the page title
    page_title = await page.title()

    # output the title
    print(page_title)

    # close the browser
    await browser.close()

# run the scraper function
asyncio.run(scraper())

The output is the following:

Output
Chaz Kangeroo Hoodie – Ecommerce Test Site to Learn Web Scraping

Notice we incorporated the waitForSelector() method to add robustness to the code.

Scroll the Page

Pyppeteer is useful for modern websites that use infinite scroll to load content, and the evaluate() function helps in such cases. Look at the code below to see how.

scraper.py
import asyncio
from pyppeteer import launch

async def scraper():
    # launch the browser and a new page instance
    browser = await launch({"headless": False})
    page = await browser.newPage()

    # set the viewport of the page
    await page.setViewport({"width": 1280, "height": 720})

    # visit the target website
    await page.goto("https://www.scrapingcourse.com/infinite-scrolling")

    # scroll the page vertically
    await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")

    # wait for page to load
    await page.waitFor(5000)

    # close the browser
    await browser.close()

# run the scraper function
asyncio.run(scraper())

We created a browser object (with a page tab) and set the viewport size of the browser window (the visible area of the web page, which affects how it renders on screen). The script then scrolls the window down by the full page height, i.e., to the bottom.

You can employ this scrolling to load all the data and scrape it. For example, assume you want to get all the product names from the infinite scroll page:

scraper.py
import asyncio
from pyppeteer import launch

async def scraper():
    # launch the browser and a new page instance
    browser = await launch()
    page = await browser.newPage()

    # set the viewport of the page
    await page.setViewport({"width": 1280, "height": 720})

    # visit the target website
    await page.goto("https://www.scrapingcourse.com/infinite-scrolling")

    # get the height of the current page
    current_height = await page.evaluate("document.body.scrollHeight")

    while True:
        # scroll to the bottom of the page
        await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")

        # wait for the page to load new content
        await page.waitFor(4000)

        # update the height of the current page
        new_height = await page.evaluate("document.body.scrollHeight")

        # break the loop if we have reached the end of the page
        if new_height == current_height:
            break

        current_height = new_height

    # select the product parent element (list tags)
    products = await page.querySelectorAll("div.product-item")
    
    # loop through the product parent element to extract titles and prices
    for product in products:
        title_element = await product.querySelector("span.product-name")
        title = await title_element.getProperty("textContent")

        price_element = await product.querySelector("span.product-price")
        price = await price_element.getProperty("textContent")

        # output the result
        print(f"Title: {await title.jsonValue()} || Price: {await price.jsonValue()}")

    # close the browser
    await browser.close()

# run the scraper function
asyncio.run(scraper())

The Pyppeteer script above navigates to the page and gets the current scroll height, then iteratively scrolls the page vertically until no more scrolling happens. The waitFor() method pauses for four seconds after each scroll to ensure new content loads properly.
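The termination logic of that loop can be sketched without a browser: feed it a hypothetical sequence of page heights and stop once the height no longer grows, meaning no new content was loaded:

```python
# A browser-free sketch of the scroll-until-stable loop: heights is a
# hypothetical sequence of document.body.scrollHeight readings.
def scroll_until_stable(heights):
    """Return how many scrolls happen before the height stops changing."""
    current = heights[0]
    scrolls = 0
    for new in heights[1:]:
        scrolls += 1
        if new == current:  # nothing new loaded: we reached the bottom
            break
        current = new
    return scrolls

print(scroll_until_stable([1000, 2000, 2500, 2500]))  # 3 scrolls, then stop
```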

In the end, the names and prices of all loaded products are printed, as shown in the following output snippet.

Output
Title: Chaz Kangeroo Hoodie || Price: $52
Title: Teton Pullover Hoodie || Price: $70

#... other products omitted for brevity

Title: Antonia Racer Tank || Price: $34
Title: Breathe-Easy Tank || Price: $34

Great job so far! 👍

6. Take a Screenshot with Pyppeteer

It would be convenient to observe what the scraper is doing, right? But in production, you don't get a real-time GUI.

Fortunately, Pyppeteer's screenshot feature can help with debugging. The following code opens a webpage, takes a screenshot, and saves it in the current directory as "web_screenshot.png".

scraper.py
import asyncio
from pyppeteer import launch
async def main():
    browser = await launch({"headless": True})
    page = await browser.newPage()
    await page.goto("https://www.scrapingcourse.com/infinite-scrolling")
    await page.screenshot({"path": "web_screenshot.png"})
    await browser.close()

asyncio.run(main())

7. Use a Proxy with Pyppeteer

When web scraping, you often need to use proxies to avoid being blocked by the target website. But why is that?

If you access a website with hundreds or thousands of daily requests, the site can blacklist your IP, and you won't be able to scrape the content anymore.

Proxies act as intermediaries between you and the target website, giving you new IPs. Some web scraping proxy providers, like ZenRows, include default IP rotation mechanisms that prevent your address from getting banned, saving you money.

Take a look at the following code snippet to learn how to integrate a proxy with Pyppeteer via the launch method.

scraper.py
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch({"args": ["--proxy-server=address:port"], "headless": False})
    page = await browser.newPage()
    await page.authenticate({"username": "user", "password": "passw"})
    await page.goto("https://www.scrapingcourse.com/ecommerce/")
    await browser.close()

asyncio.run(main())

Note: If the proxy requires a username and password, you can set the credentials using the authenticate() method.

8. Login with Pyppeteer

You might want to scrape content behind a login sometimes, and Pyppeteer can help in this regard.

Go to the ScrapingCourse login challenge page, where you can play around with login automation.

Note: You'll find the default login credentials at the top of the login box, as shown below. Use "[email protected]" as the username and "password" as your password.

Scrapingcourse login challenge page

Let's look at the HTML of those elements.

Scrapingcourse login challenge inspection

The script below enters the user credentials and then clicks on the login button with Pyppeteer. After that, it waits five seconds to let the next page load completely. Finally, it takes a screenshot of the page to test whether the login was successful.

program.py
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto("https://www.scrapingcourse.com/login")
    await page.type("#email", "[email protected]")
    await page.type("#password", "password")
    await page.click("button.btn.submit-btn")
    await page.waitFor(5000)
    await page.screenshot({"path": "scrapingcourse-logged-in.png"})
    await browser.close()

asyncio.run(main())

See the output screenshot below:

Scrapingcourse successful login screenshot

Congratulations! 😄 You successfully logged in.

Note: This website was simple and required only a username and password, but some websites implement more advanced security measures. Read our guide on how to scrape behind a login with Python to learn more.

Solve Common Errors

You may face some errors when setting up Pyppeteer; here's how to solve the most common ones.

Error: Pyppeteer Is Not Installed

While installing Pyppeteer, you may encounter the "Unable to install Pyppeteer" error.

The Python version on your system is the usual root cause, as Pyppeteer supports only Python 3.6+. If you have an older version, you may encounter such installation errors. The solution is upgrading Python and reinstalling Pyppeteer.

Error: Pyppeteer Browser Closed Unexpectedly

Let's assume you execute your Pyppeteer Python script for the first time after installation but encounter this error: pyppeteer.errors.BrowserError: Browser closed unexpectedly.

That means not all Chromium dependencies were completely installed. The solution is to install Chromium manually using the following command:

Terminal
pyppeteer-install

Conclusion

Pyppeteer is an unofficial Python port of the classic Node.js Puppeteer library. It's a setup-friendly, lightweight, and fast package suitable for web automation and dynamic website scraping.

This tutorial has taught you how to perform basic headless web scraping with Pyppeteer and deal with web logins and advanced dynamic interactions. Additionally, you now know how to integrate proxies with Pyppeteer.

If you need more features, check out the official manual; for example, you can set a custom user agent in Pyppeteer.

Need to scrape at a large scale without worrying about infrastructure? Let ZenRows help you with its massively scalable web scraping API.

Frequent Questions

What Is the Difference Between Pyppeteer and Puppeteer?

The difference is that Puppeteer is an official Node.js NPM package, while Pyppeteer is an unofficial Python wrapper over the original Puppeteer.

The primary distinction between them is the underlying programming language and the developer APIs they offer. Otherwise, they have almost the same browser-automation capabilities.

What Is the Python Wrapper for Puppeteer?

Pyppeteer is Puppeteer's Python wrapper. Using the Chrome DevTools Protocol, Pyppeteer offers an API for controlling headless Chrome or Chromium, enabling web automation activities like website scraping, web application testing, and automating repetitive processes.

What Is the Equivalent of Puppeteer in Python?

The equivalent of Puppeteer in Python is Pyppeteer, a library that lets you control headless Chromium, render JavaScript, and automate user interactions with web pages.

Can I Use Puppeteer with Python?

Yes, you can use Puppeteer with Python. However, you need a bridge connecting Python to the JavaScript library, and Pyppeteer is exactly that.

What Is the Python Version of Puppeteer?

The Python version of Puppeteer is Pyppeteer. Similar to Puppeteer in functionality, Pyppeteer offers a high-level API for managing the browser.
