How to Use a Proxy with Pyppeteer in 2024

July 21, 2023 · 1 min read

Routing your HTTP requests through different IP addresses is essential to avoid getting blocked while web scraping. In this tutorial, you'll learn how to set up a proxy in Pyppeteer!

Prerequisites

Ensure you have Python 3.6 or later installed on your local machine.

Then, install Pyppeteer from PyPI by running the command below. The first time you launch it, Pyppeteer will automatically download a compatible Chromium build.

Terminal
pip install pyppeteer

How to Use a Proxy with Pyppeteer

To get started, create a script named scraper.py that requests https://ident.me to reveal your current IP.

scraper.py
import asyncio
from pyppeteer import launch
 
async def main():
    # Create a new headless browser instance
    browser = await launch()
    # Create a new page
    page = await browser.newPage()
    # Navigate to target website
    await page.goto('https://ident.me')
    # Select the body element
    body = await page.querySelector('body')
    # Get the text content of the selected element
    content = await page.evaluate('(element) => element.textContent', body)
    # Dump the result
    print(content)
    await browser.close()
 
asyncio.run(main())

Run the script to get the content of the target page's body: your machine's current public IP.

Terminal
python scraper.py

Now, it's time to implement a Pyppeteer proxy in your script. For that, grab a free proxy from FreeProxyList (the ones used in this tutorial may no longer work by the time you read this).

The launch() method used in scraper.py creates a new browser instance and lets you specify options. One of them is args, a list of extra command-line arguments to pass to the browser process. Set the --proxy-server argument there to instruct the browser to route Pyppeteer's requests through a proxy.

scraper.py
# ...
async def main():
    # Create a new headless browser instance
    browser = await launch(args=['--proxy-server=http://20.219.108.109:8080'])
    # Create a new page
    page = await browser.newPage()
# ...

Here's the full code:

scraper.py
import asyncio
from pyppeteer import launch
 
async def main():
    # Create a new headless browser instance
    browser = await launch(args=['--proxy-server=http://20.219.108.109:8080'])
    # Create a new page
    page = await browser.newPage()
    # Navigate to target website
    await page.goto('https://ident.me')
    # Select the body element
    body = await page.querySelector('body')
    # Get the text content of the selected element
    content = await page.evaluate('(element) => element.textContent', body)
    # Dump the result
    print(content)
    await browser.close()
 
asyncio.run(main())

Run the script again with python scraper.py, and this time you should see the proxy's IP printed on the screen.

Output
20.219.108.109

Well done, you just used a proxy with Pyppeteer!
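
Keep in mind that free proxies are unreliable and often time out or refuse connections. Here's a minimal defensive sketch, assuming a 15-second navigation timeout and treating any error as a dead proxy; the fetch_ip() helper and the timeout value are illustrative choices, not part of the original script:

Example
import asyncio
from pyppeteer import launch
 
async def fetch_ip(proxy):
    # Route a single request through the given proxy
    browser = await launch(args=[f'--proxy-server={proxy}'])
    try:
        page = await browser.newPage()
        # Fail fast when the proxy is dead: 15-second navigation timeout (assumed value)
        await page.goto('https://ident.me', timeout=15000)
        return await page.evaluate('() => document.body.textContent')
    finally:
        # Always release the browser, even when navigation fails
        await browser.close()
 
async def main():
    try:
        print(await fetch_ip('http://20.219.108.109:8080'))
    except Exception as error:
        # A dead or blocking proxy ends up here; try another one
        print(f'Proxy failed: {error}')
 
asyncio.run(main())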

Proxy Authentication with Pyppeteer

If you use a premium proxy, you'll need to authenticate with a username and password. Chromium offers no command-line switch for proxy credentials and ignores any embedded in the --proxy-server value, so pass them through Pyppeteer's page API with page.authenticate():

scraper.py
# ...
    # Create a new page
    page = await browser.newPage()
    # Send the proxy credentials over the DevTools protocol
    await page.authenticate({'username': 'USERNAME', 'password': 'PASSWORD'})
# ...
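
Premium providers usually hand out a single connection URL in the user:password@host:port form. Here's a minimal sketch of splitting such a URL (a placeholder below, not a live endpoint) into the --proxy-server value and the page.authenticate() credentials:

Example
import asyncio
from urllib.parse import urlparse
from pyppeteer import launch
 
# Placeholder proxy URL; substitute your provider's host and credentials
PROXY_URL = 'http://USERNAME:PASSWORD@20.219.108.109:8080'
 
async def main():
    parsed = urlparse(PROXY_URL)
    # Chromium only gets the scheme, host, and port
    browser = await launch(
        args=[f'--proxy-server={parsed.scheme}://{parsed.hostname}:{parsed.port}']
    )
    page = await browser.newPage()
    # The credentials travel separately, over the DevTools protocol
    await page.authenticate({'username': parsed.username, 'password': parsed.password})
    await page.goto('https://ident.me')
    print(await page.evaluate('() => document.body.textContent'))
    await browser.close()
 
asyncio.run(main())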

Set a Dynamic Proxy with Pyppeteer

Rather than the static proxy used before, web scraping at scale requires rotating proxies to avoid bans. You can do that in Pyppeteer by launching a separate browser instance per request, each with its own proxy configuration.

Start by grabbing a few more free proxies and creating a list of them:

scraper.py
# ...
import random
 
proxies = [
    'http://20.219.108.109:8080',
    'http://210.22.77.94:9002',
    'http://103.150.18.218:80',
]
# ...

Then, create an asynchronous function that takes a proxy as an argument and makes a Pyppeteer request to ident.me through it:

scraper.py
# ...
async def init_pyppeteer_proxy_request(proxy_url):
    # Create a new headless browser instance with the given proxy
    browser = await launch(args=[f'--proxy-server={proxy_url}'])
    # Create a new page
    page = await browser.newPage()
    # Navigate to target website
    await page.goto('https://ident.me')
    # Select the body element
    body = await page.querySelector('body')
    # Get the text content of the selected element
    content = await page.evaluate('(element) => element.textContent', body)
    # Dump the result
    print(content)
    await browser.close()
# ...

Now, update the main() function to call the new function three times, picking a random proxy each time:

scraper.py
# ...
async def main():
    for _ in range(3):
        await init_pyppeteer_proxy_request(random.choice(proxies))
# ...

Your code should look like this right now:

scraper.py
import asyncio
from pyppeteer import launch
import random
 
proxies = [
    'http://20.219.108.109:8080',
    'http://210.22.77.94:9002',
    'http://103.150.18.218:80',
]
 
async def init_pyppeteer_proxy_request(proxy_url):
    # Create a new headless browser instance with the given proxy
    browser = await launch(args=[f'--proxy-server={proxy_url}'])
    # Create a new page
    page = await browser.newPage()
    # Navigate to target website
    await page.goto('https://ident.me')
    # Select the body element
    body = await page.querySelector('body')
    # Get the text content of the selected element
    content = await page.evaluate('(element) => element.textContent', body)
    # Dump the result
    print(content)
    await browser.close()
 
async def main():
    for _ in range(3):
        await init_pyppeteer_proxy_request(random.choice(proxies))
 
asyncio.run(main())

Run the script, and you should get a random result for each request, like the one below.

Output
20.219.108.109
103.150.18.218
103.150.18.218
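
Since init_pyppeteer_proxy_request() launches its own browser instance each time, the three requests don't have to run one after another. Here's a minimal sketch of firing them concurrently with asyncio.gather(); the behavior is otherwise the same:

scraper.py
# ...
async def main():
    # Launch all three proxied browsers at once and wait for them to finish
    await asyncio.gather(
        *(init_pyppeteer_proxy_request(random.choice(proxies)) for _ in range(3))
    )
# ...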
As an alternative to managing proxies yourself, a web scraping API such as ZenRows handles them for you. Generate a scraper with its Request Builder:

[Image: ZenRows Request Builder]

Put the Python scraper code that the Request Builder generated into a new file and install Python Requests (or any other HTTP request library):

Terminal
pip install requests
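
For reference, the generated snippet is typically a single Python Requests call along these lines. The apikey placeholder and the js_render and premium_proxy parameters here are assumptions based on the Request Builder's defaults, so trust your generated code over this sketch:

Example
import requests
 
# Placeholder API key; use the one shown in your ZenRows dashboard
params = {
    'url': 'https://opensea.io/',
    'apikey': 'YOUR_ZENROWS_API_KEY',
    'js_render': 'true',      # render JavaScript with a headless browser
    'premium_proxy': 'true',  # route the request through a residential proxy
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)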

Now, run your scraper, and you'll get OpenSea's HTML page scraped and printed on the console:

[Image: OpenSea HTML dump]

Conclusion

Using a proxy with Pyppeteer can significantly improve your web scraping success, and you've learned how to make requests with both static and dynamic proxies.

You also learned how an alternative tool can do the job faster and more reliably. If you need to scrape at a large scale without worrying about infrastructure, and with stronger guarantees of getting the data you need, ZenRows' web scraping API can be your ally.

Did you find the content helpful? Spread the word and share it on Twitter or LinkedIn.
