
Pyppeteer: Use Puppeteer in Python [2024]

March 10, 2023 · 10 min read

Interested in using Puppeteer in Python? Luckily, there's an unofficial Python wrapper over the original Node.js library: Pyppeteer!

In this article, you'll learn how to use Pyppeteer for web scraping, including how to install it, scrape and parse pages, interact with dynamic content, take screenshots, use proxies, and log in to websites.

Let's get started! 👍

What Is Pyppeteer in Python?

Pyppeteer is a tool to automate a Chromium browser with code. It gives Python developers JavaScript-rendering capabilities to interact with modern websites and better simulate human behavior.

It comes with a headless browser mode, which gives you the full functionality of a browser but without a graphical user interface, increasing speed and saving memory.

Now, let's begin with the Pyppeteer tutorial.

How to Use Pyppeteer

Let's go over the fundamentals of using Puppeteer in Python, starting with the installation.

1. How to Install Pyppeteer in Python

You must have Python 3.6+ installed on your system as a prerequisite. If you are new to it, check out an installation guide.
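
You can check which version you have from the terminal:

Terminal
python --version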

Note: Feel free to refresh your Python web scraping foundation with our tutorial if you need to.

Then, use the command below to install Pyppeteer:

Terminal
pip install pyppeteer

When you launch Pyppeteer for the first time, it downloads a recent version of Chromium (~150 MB) if it isn't already present, so the first run takes longer to execute.
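
If you'd rather trigger that Chromium download ahead of time, for example while setting up your environment, Pyppeteer ships a helper command for it:

Terminal
pyppeteer-install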

2. How to Use Pyppeteer

To use Pyppeteer, start by importing the required packages.

program.py
# asyncio is part of the Python standard library; no separate install needed
import asyncio
from pyppeteer import launch

Now, create a function that launches a browser, opens a new page, and navigates to the ScrapeMe website.

program.py
async def main():
    browserObj = await launch({"headless": False})
    page = await browserObj.newPage()
    await page.goto('https://scrapeme.live/shop/')

    await browserObj.close()

Note: Setting the headless option to False launches a Chrome instance with a GUI. We kept it False here so we can watch the browser while testing; switch to True in production for speed.

Then, an asynchronous call to the main() function puts the script into action.

program.py
asyncio.get_event_loop().run_until_complete(main())
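
Note: pyppeteer's docs use get_event_loop(), which works on Python 3.6. On Python 3.7+, you can also use asyncio.run(), which creates and cleans up the event loop for you; a minimal alternative, assuming main() is defined as above:

program.py
import asyncio

# Python 3.7+ equivalent of get_event_loop().run_until_complete(main())
asyncio.run(main())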

Here's what the complete code looks like:

program.py
import asyncio
from pyppeteer import launch


async def main():
    browserObj = await launch({"headless": False})
    page = await browserObj.newPage()
    await page.goto('https://scrapeme.live/shop/')

    ## Get HTML

    await browserObj.close()


asyncio.get_event_loop().run_until_complete(main())

And it works:

Output ScrapeMe

Notice the prompt "Chrome is being controlled by automated test software".

3. Scrape Pages with Pyppeteer

Add a few lines of code to grab the loaded page's HTML, close the browser instance, and return the content.

htmlContent = await page.content()
await browserObj.close()
return htmlContent

Add them to your script and print the HTML.

program.py
import asyncio
from pyppeteer import launch


async def main():
    browserObj = await launch({"headless": False})
    page = await browserObj.newPage()
    await page.goto('https://scrapeme.live/shop/')

    ## Get HTML
    htmlContent = await page.content()
    await browserObj.close()
    return htmlContent


response = asyncio.get_event_loop().run_until_complete(main())
print(response)

Here's the output:

Scraped Output

Congratulations! 🎉 You scraped your first web page using Pyppeteer.

4. Parse HTML Contents for Data Extraction

Pyppeteer is quite a powerful tool that also allows parsing the raw HTML of a page to extract the desired information: in our case, the product titles and prices from the ScrapeMe store.

Let's take a look at the source code to identify the elements we're interested in. For that, go to the website, right-click anywhere and select "Inspect". The HTML will be shown in the Developer Tools window.

HTML Elements

Look closely at the screenshot above. The product titles are in the <h2> tags. Similarly, the prices are inside the <span> tags that have the amount class.

Back to your code, use querySelectorAll() to extract all the <h2> elements and the <span> elements with the amount class, thanks to CSS selectors. Then, add a loop to print the extracted text (we'll extend this to save the data to a JSON file right after the output below).

program.py
import asyncio
from pyppeteer import launch


async def main():
    browserObj = await launch({"headless": False})
    page = await browserObj.newPage()
    await page.goto('https://scrapeme.live/shop/')
    titles = await page.querySelectorAll("h2")
    prices = await page.querySelectorAll("span .amount")
    for t, p in zip(titles, prices):
        title = await t.getProperty("textContent")
        price = await p.getProperty("textContent")
        # print each product title and price
        print(await title.jsonValue())
        print(await price.jsonValue())
    await browserObj.close()


asyncio.get_event_loop().run_until_complete(main())

Run the script, and see the result:

Scraped HTML Output
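
As promised, you can persist the scraped data instead of printing it. Here's a minimal sketch that collects the title-price pairs into a list of dictionaries and writes them to a JSON file; the products.json filename and the dictionary keys are our own choices, not part of the original tutorial:

program.py
import asyncio
import json

from pyppeteer import launch


async def main():
    browserObj = await launch({"headless": True})  # no GUI needed for a batch job
    page = await browserObj.newPage()
    await page.goto('https://scrapeme.live/shop/')
    titles = await page.querySelectorAll("h2")
    prices = await page.querySelectorAll("span .amount")

    products = []
    for t, p in zip(titles, prices):
        title = await (await t.getProperty("textContent")).jsonValue()
        price = await (await p.getProperty("textContent")).jsonValue()
        products.append({"title": title, "price": price})

    await browserObj.close()

    # write the scraped data to a JSON file (hypothetical filename)
    with open("products.json", "w", encoding="utf-8") as f:
        json.dump(products, f, indent=2, ensure_ascii=False)


asyncio.get_event_loop().run_until_complete(main())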

Nice job! 👍

5. Interact with Dynamic Pages

Many websites nowadays, like ScrapingClub, are dynamic, meaning JavaScript renders or updates their content. For example, social media websites usually use infinite scrolling for their post timelines.

Scraping such websites is challenging with the Requests and BeautifulSoup libraries. However, Pyppeteer comes in handy for the job, and we'll use it to wait for events, click buttons, and scroll down.

Wait for the Page to Load

When using programmatically controlled browsers, you must wait for the current page's content to load before proceeding to the next action. The two most popular approaches to achieve this are waitFor() and waitForSelector().

waitFor()

The waitFor() function waits for a time specified in milliseconds.

The following example opens the page in Chromium and waits for 4000 milliseconds before closing it.

program.py
import asyncio
from pyppeteer import launch


async def main():
    browserObj = await launch({"headless": False})
    page = await browserObj.newPage()
    await page.goto('https://scrapingclub.com/exercise/list_infinite_scroll/')
    await page.waitFor(4000)
    await browserObj.close()


asyncio.get_event_loop().run_until_complete(main())

waitForSelector()

waitForSelector() waits for a particular element to appear on the page before continuing.

For example, the following script waits for some <div> to appear before moving on to the next step.

program.py
import asyncio
from pyppeteer import launch


async def main():
    browserObj = await launch({"headless": False})
    page = await browserObj.newPage()
    await page.goto('https://scrapingclub.com/exercise/list_infinite_scroll/')
    await page.waitForSelector('div .alert', {'visible': True})
    await browserObj.close()


asyncio.get_event_loop().run_until_complete(main())

The waitForSelector() method accepts two arguments: a CSS selector pointing to the desired element and an optional options dictionary. In our case above, options is {'visible': True} to wait until the <div> element becomes visible.
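
The options dictionary also takes a timeout in milliseconds (30,000 by default); if the element never shows up, Pyppeteer raises a TimeoutError you can catch. A short sketch of that pattern, meant to live inside main() from the example above (the 10-second timeout is an arbitrary choice):

program.py
from pyppeteer.errors import TimeoutError

# inside main(): wait at most 10 seconds for the element
try:
    await page.waitForSelector('div .alert', {'visible': True, 'timeout': 10000})
except TimeoutError:
    print("The element didn't appear in time")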

Click on a Button

You can use Pyppeteer Python to click buttons or other elements on a web page. All you need to do is find that particular element using the selectors and call the click() method.

The example you see next clicks on a link at the page's footer by following the body > div > div > div > a path.

program.py
import asyncio
from pyppeteer import launch


async def main():
    browserObj = await launch({"headless": False})
    page = await browserObj.newPage()
    await page.goto('https://scrapingclub.com/exercise/list_infinite_scroll/')
    await page.click('body > div > div > div > a')
    await browserObj.close()


asyncio.get_event_loop().run_until_complete(main())

As mentioned earlier, web scraping developers wait for the page to load before interacting further, for example with the click() method. In the image below, we clicked a link at the bottom of the initial page, then waited for the heading on the next page to load so we could scrape its title.

scrape the heading title result screenshot

Let's see how to do this with Pyppeteer:

program.py
import asyncio
from pyppeteer import launch


async def main():
    browserObj = await launch({"headless": True})
    page = await browserObj.newPage()
    await page.goto('https://scrapingclub.com/exercise/list_infinite_scroll/')

    # wait for the footer link to appear, then click it
    await page.waitForSelector('body > div > div > div > a')
    await page.click('body > div > div > div > a')

    # wait for the author link on the next page to load
    await page.waitForSelector('a[href="https://github.com/michael-yin/"]')

    success_message = await page.querySelector('a[href="https://github.com/michael-yin/"]')
    text = await success_message.getProperty("textContent")
    print(await text.jsonValue())
    await browserObj.close()


asyncio.get_event_loop().run_until_complete(main())

The output is the following:

Output
MichaelYin

Notice we incorporated the waitForSelector() method to add robustness to the code.

Scroll the Page

Pyppeteer is useful for modern websites that use infinite scroll to load content, and the evaluate() function helps in such cases. Look at the code below to see how.

program.py
import asyncio
from pyppeteer import launch


async def main():
    browserObj = await launch({"headless": False})
    page = await browserObj.newPage()
    await page.setViewport({'width': 1280, 'height': 720})
    await page.goto('https://scrapingclub.com/exercise/list_infinite_scroll/')
    await page.evaluate('window.scrollBy(0, document.body.scrollHeight)')
    await page.waitFor(5000)
    await browserObj.close()


asyncio.get_event_loop().run_until_complete(main())

We created a browser object, opened a new page, and set the viewport size of the browser window (the visible area of the web page, which affects how content is rendered on screen). The script then scrolls the window down by the full height of the page body.

You can employ this scrolling to load all the data and scrape it. For example, assume you want to get all the product names from the infinite scroll page:

program.py
import asyncio
from pyppeteer import launch


async def main():
    browserObj = await launch({"headless": False})
    page = await browserObj.newPage()
    await page.setViewport({'width': 1280, 'height': 720})
    await page.goto('https://scrapingclub.com/exercise/list_infinite_scroll/')

    # Get the height of the current page
    current_height = await page.evaluate('document.body.scrollHeight')

    while True:
        # Scroll to the bottom of the page
        await page.evaluate('window.scrollBy(0, document.body.scrollHeight)')

        # Wait for the page to load new content
        await page.waitFor(2000)

        # Update the height of the current page
        new_height = await page.evaluate('document.body.scrollHeight')

        # Break the loop if we have reached the end of the page
        if new_height == current_height:
            break

        current_height = new_height

    # Scraping in action
    all_product_tags = await page.querySelectorAll("div > h4 > a")
    for a in all_product_tags:
        product_name = await a.getProperty("textContent")
        print(await product_name.jsonValue())

    await browserObj.close()


asyncio.get_event_loop().run_until_complete(main())

The Pyppeteer script above navigates to the page and gets the current scroll height, then iteratively scrolls down until the page height stops growing. The waitFor() call pauses for two seconds after each scroll to give new content time to load.

In the end, names for all the loaded products are printed as shown in this partial output snippet.

Output of Scroll Scrape with Wait

Great job so far! 👍

6. Take a Screenshot with Pyppeteer

It would be convenient to observe what the scraper is doing, right? But in production you run headless, so there's no GUI to watch in real time.

Fortunately, Pyppeteer's screenshot feature can help with debugging. The following code opens a webpage, takes a screenshot of the full page, and saves it in the current directory as "web_screenshot.png".

program.py
import asyncio
from pyppeteer import launch


async def main():
    browserObj = await launch({"headless": True})
    page = await browserObj.newPage()
    await page.goto('https://scrapingclub.com/exercise/list_infinite_scroll/')
    # fullPage captures the whole page, not just the visible viewport
    await page.screenshot({'path': 'web_screenshot.png', 'fullPage': True})
    await browserObj.close()


asyncio.get_event_loop().run_until_complete(main())
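
Besides full pages, you can screenshot a single element: grab an ElementHandle with querySelector() and call screenshot() on it. A minimal sketch reusing the product-link selector from the scrolling example (the output filename is our own choice):

program.py
import asyncio
from pyppeteer import launch


async def main():
    browserObj = await launch({"headless": True})
    page = await browserObj.newPage()
    await page.goto('https://scrapingclub.com/exercise/list_infinite_scroll/')
    # capture only the first product link instead of the whole page
    element = await page.querySelector('div > h4 > a')
    if element:
        await element.screenshot({'path': 'element_screenshot.png'})
    await browserObj.close()


asyncio.get_event_loop().run_until_complete(main())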

7. Use a Proxy with Pyppeteer

While doing web scraping, you need to use proxies to avoid being blocked by the target website. But why is that?

If you access a website with hundreds or thousands of daily requests, the site can blacklist your IP, and you won't be able to scrape the content anymore.

Proxies act as an intermediary between you and the target website, giving you new IPs. And some web scraping proxy providers, like ZenRows, offer built-in IP rotation to prevent bans, saving you money.

Take a look at the following code snippet to learn how to pass a proxy to Pyppeteer's launch method.

program.py
import asyncio
from pyppeteer import launch


async def main():
    browserObj = await launch({'args': ['--proxy-server=address:port'], "headless": False})
    page = await browserObj.newPage()
    await page.authenticate({'username': 'user', 'password': 'passw'})
    await page.goto('https://scrapeme.live/shop/')
    await browserObj.close()


asyncio.get_event_loop().run_until_complete(main())

Note: If the proxy requires a username and password, you can set the credentials using the authenticate() method.

8. Login with Pyppeteer

You might want to scrape content behind a login sometimes, and Pyppeteer can help in this regard.

Go to the Quotes website, where you'll find a Login link at the top right of the screen.

Quotes Web View

Clicking on the login link will redirect you to the login page, which contains input fields for the username and password, as well as a submit button.

Note: Since this website is intended for testing, you can use "admin" as a username and "12345" as a password.

Let's look at the HTML of those elements.

Quotes Login Form

The script below enters the user credentials and then clicks on the login button with Pyppeteer. After that, it waits five seconds to let the next page load completely. Finally, it takes a screenshot of the page to test whether the login was successful.

program.py
import asyncio
from pyppeteer import launch


async def main():
    browserObj = await launch()
    page = await browserObj.newPage()
    await page.goto('http://quotes.toscrape.com/login')
    # fill in the credentials and submit the form
    await page.type('#username', 'admin')
    await page.type('#password', '12345')
    await page.click('body > div > form > input.btn.btn-primary')
    # give the next page time to load, then capture proof of login
    await page.waitFor(5000)
    await page.screenshot({'path': 'quotes.png'})
    await browserObj.close()


asyncio.get_event_loop().run_until_complete(main())

Congratulations! 😄 You successfully logged in.

Note: This website was simple and required only a username and password, but some websites implement more advanced security measures. Read our guide on how to scrape behind a login with Python to learn more.

Solve Common Errors

You may face some errors when setting up Pyppeteer. Here's how to solve the most common ones.

Error: Pyppeteer Is Not Installed

While installing Pyppeteer, you may encounter the "Unable to install Pyppeteer" error.

The Python version on your system is the usual root cause, as Pyppeteer supports only Python 3.6+. So, if you have an older version, you may encounter such installation errors. The solution is upgrading Python and reinstalling Pyppeteer.

Error: Pyppeteer Browser Closed Unexpectedly

Let's assume you execute your Pyppeteer Python script for the first time after installation but encounter this error: pyppeteer.errors.BrowserError: Browser closed unexpectedly.

That means not all Chromium dependencies were completely installed. The solution is manually installing Chromium using the following command:

Terminal
pyppeteer-install

Conclusion

Pyppeteer is an unofficial Python port of the classic Node.js Puppeteer library. It's a setup-friendly, lightweight, and fast package suitable for web automation and dynamic website scraping.

This tutorial has taught you how to perform basic headless web scraping with Python's version of Puppeteer, deal with web logins, and handle advanced dynamic interactions. Additionally, you now know how to integrate proxies with Pyppeteer.

If you need more features, check out the official manual, for example to learn how to set a custom user agent in Pyppeteer.

Need to scrape at a large scale without worrying about infrastructure? Let ZenRows help you with its massively scalable web scraping API.

Frequently Asked Questions

What Is the Difference Between Pyppeteer and Puppeteer?

The difference is that Puppeteer is an official Node.js NPM package, while Pyppeteer is an unofficial Python wrapper over the original Puppeteer.

The primary distinction between them is the baseline programming language and the developer APIs they offer. For the rest, they have almost the same capabilities for automating web browsers.

What Is the Python Wrapper for Puppeteer?

Pyppeteer is Puppeteer's Python wrapper. Using the Chrome DevTools Protocol, Pyppeteer offers an API for controlling headless Google Chrome or Chromium, which enables you to carry out web automation activities like website scraping, web application testing, and automating repetitive processes.

What Is the Equivalent of Puppeteer in Python?

The equivalent of Puppeteer in Python is Pyppeteer, a library that lets you control headless Chromium, render JavaScript, and automate user interactions with web pages.

Can I Use Puppeteer with Python?

Yes, you can use Puppeteer with Python. However, you must first create a bridge to connect Python and JavaScript. Pyppeteer is exactly that.

What Is the Python Version of Puppeteer?

The Python version of Puppeteer is Pyppeteer. Similar to Puppeteer in functionality, Pyppeteer offers a high-level API for managing the browser.

Did you find the content helpful? Spread the word and share it on Twitter or LinkedIn.
