Routing HTTP requests through different IP addresses is an essential method to avoid getting blocked while web scraping. For that reason, let's learn how to implement a Pyppeteer proxy in this tutorial!
Prerequisites
Ensure you have Python 3.6 or later installed on your local machine.
Then, install Pyppeteer from PyPI using pip by running the command below.
pip install pyppeteer
How to Use a Proxy with Pyppeteer
To get started, create a scraper.py script that makes a request to ident.me so you can see your current IP.
import asyncio
from pyppeteer import launch
async def main():
    # Create a new headless browser instance
    browser = await launch()
    # Create a new page
    page = await browser.newPage()
    # Navigate to target website
    await page.goto('https://ident.me')
    # Select the body element
    body = await page.querySelector('body')
    # Get the text content of the selected element
    content = await page.evaluate('(element) => element.textContent', body)
    # Dump the result
    print(content)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
Run the script to get the content of the target page's body.
python scraper.py
Note: The first time you launch Pyppeteer, it'll automatically download the latest version of Chromium.
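If you'd rather not wait for that download on the first run (for example, in a CI environment), pyppeteer also ships a small helper command that fetches Chromium ahead of time. Its exact name may vary between releases, so treat the line below as a hint to check against your installed version rather than a guarantee:
pyppeteer-install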
Now, it's time to implement a Pyppeteer proxy in your script. For that, go grab a free proxy from FreeProxyList (the one we used might not work for you).
The launch() method used in the scraper.py script creates a new browser instance and lets you specify some options. One of them is args, a list of additional arguments to pass to the browser process. Set the --proxy-server argument to instruct the browser to route Pyppeteer's requests through a proxy.
# ...
async def main():
    # Create a new headless browser instance
    browser = await launch(args=['--proxy-server=http://20.219.108.109:8080'])
    # Create a new page
    page = await browser.newPage()
# ...
Here's the full code:
import asyncio
from pyppeteer import launch
async def main():
    # Create a new headless browser instance
    browser = await launch(args=['--proxy-server=http://20.219.108.109:8080'])
    # Create a new page
    page = await browser.newPage()
    # Navigate to target website
    await page.goto('https://ident.me')
    # Select the body element
    body = await page.querySelector('body')
    # Get the text content of the selected element
    content = await page.evaluate('(element) => element.textContent', body)
    # Dump the result
    print(content)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
Run the script again with the python scraper.py command, and this time you should get your proxy's IP printed on the screen.
20.219.108.109
Well done, you just used a proxy with Pyppeteer!
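Keep in mind that free proxies tend to be slow or already offline, so page.goto() can hang or fail. The sketch below is one way to guard against that, assuming you're fine with a broad exception handler and an arbitrary 15-second navigation timeout (both are choices made for this example, not something Pyppeteer requires):
import asyncio
from pyppeteer import launch

async def main():
    # Create a new headless browser instance routed through the proxy
    browser = await launch(args=['--proxy-server=http://20.219.108.109:8080'])
    page = await browser.newPage()
    try:
        # Cap navigation at 15 seconds so a dead proxy fails fast
        await page.goto('https://ident.me', {'timeout': 15000})
        body = await page.querySelector('body')
        content = await page.evaluate('(element) => element.textContent', body)
        print(content)
    except Exception as error:
        # A dead or misbehaving proxy ends up here instead of crashing the script
        print(f'Request through the proxy failed: {error}')
    finally:
        await browser.close()

asyncio.get_event_loop().run_until_complete(main())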
Proxy Authentication with Pyppeteer
If you use a premium proxy, you'll need to authenticate with a username and password. For that, use the --proxy-auth argument:
# ...
# Create a new headless browser instance
browser = await launch(args=[
    '--proxy-server=http://20.219.108.109:8080',
    '--proxy-auth=<YOUR_USERNAME>:<YOUR_PASSWORD>'
])
# ...
Alternatively, you can authenticate through the page API, as shown below:
# ...
# Create a new page
page = await browser.newPage()
await page.authenticate({ 'username': '<YOUR_USERNAME>', 'password': '<YOUR_PASSWORD>' })
# ...
Note: Remember to update <YOUR_USERNAME> and <YOUR_PASSWORD> with your credentials.
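Hard-coding credentials in scraper.py is risky if the file ever ends up in version control. As an illustration only, here's a variant that reads them from two hypothetical environment variables, PROXY_USERNAME and PROXY_PASSWORD (the names are made up for this example):
import asyncio
import os
from pyppeteer import launch

async def main():
    # Hypothetical environment variables; name them however you prefer
    proxy_username = os.environ['PROXY_USERNAME']
    proxy_password = os.environ['PROXY_PASSWORD']
    # Create a new headless browser instance routed through the proxy
    browser = await launch(args=['--proxy-server=http://20.219.108.109:8080'])
    page = await browser.newPage()
    # Authenticate against the proxy with the credentials from the environment
    await page.authenticate({'username': proxy_username, 'password': proxy_password})
    await page.goto('https://ident.me')
    body = await page.querySelector('body')
    content = await page.evaluate('(element) => element.textContent', body)
    print(content)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())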
Set a Dynamic Proxy with Pyppeteer
Rather than the static proxy used before, you'll want to rotate through several proxies when web scraping to avoid getting banned. You can do that in Pyppeteer by creating multiple browser instances, each with its own proxy configuration.
Start by grabbing a few more free proxies and creating a list of them:
# ...
import random
proxies = [
    'http://20.219.108.109:8080',
    'http://210.22.77.94:9002',
    'http://103.150.18.218:80',
]
# ...
Then, create an asynchronous function that takes a proxy as an argument and makes a Pyppeteer request to ident.me through it:
# ...
async def init_pyppeteer_proxy_request(url):
    # Create a new headless browser instance
    browser = await launch(args=[
        f'--proxy-server={url}',
    ])
    # Create a new page
    page = await browser.newPage()
    # Navigate to target website
    await page.goto('https://ident.me')
    # Select the body element
    body = await page.querySelector('body')
    # Get the text content of the selected element
    content = await page.evaluate('(element) => element.textContent', body)
    # Dump the result
    print(content)
    await browser.close()
# ...
Now, update the main() function to call the new function with a randomly selected proxy:
# ...
async def main():
    for i in range(3):
        await init_pyppeteer_proxy_request(random.choice(proxies))
# ...
Your code should look like this right now:
import asyncio
from pyppeteer import launch
import random
proxies = [
    'http://20.219.108.109:8080',
    'http://210.22.77.94:9002',
    'http://103.150.18.218:80',
]

async def init_pyppeteer_proxy_request(url):
    # Create a new headless browser instance
    browser = await launch(args=[
        f'--proxy-server={url}',
    ])
    # Create a new page
    page = await browser.newPage()
    # Navigate to target website
    await page.goto('https://ident.me')
    # Select the body element
    body = await page.querySelector('body')
    # Get the text content of the selected element
    content = await page.evaluate('(element) => element.textContent', body)
    # Dump the result
    print(content)
    await browser.close()

async def main():
    for i in range(3):
        await init_pyppeteer_proxy_request(random.choice(proxies))

asyncio.get_event_loop().run_until_complete(main())
Run the script, and you should get a random result for each request, like the ones below.
20.219.108.109
103.150.18.218
103.150.18.218
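Since each call launches its own browser, the loop above runs the three requests one after another. If you'd rather fire them concurrently, one option is asyncio.gather(), sketched below; keep in mind that running several Chromium instances at once uses noticeably more memory, and this isn't something the original loop requires:
# ...
async def main():
    # Launch the three proxied requests concurrently instead of sequentially
    await asyncio.gather(
        *(init_pyppeteer_proxy_request(random.choice(proxies)) for _ in range(3))
    )
# ...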
Conclusion
Using a proxy with Pyppeteer can significantly improve your web scraping success, and you've learned how to make requests with both static and dynamic proxies.
If you need to scrape on a large scale without worrying about infrastructure and want better guarantees of getting the data you need, ZenRows's web scraping API can be your ally.