Have you been detected as a bot while web scraping with Selenium?
No wonder. Selenium is an excellent tool for scraping dynamic websites, but it can’t bypass complex anti-bot systems on its own. To prevent IP blocks, bypass geolocation restrictions, and manage rate limits, you can add a proxy to your Selenium scraper.
In this article, you’ll learn how to do it. Here’s what we’ll cover:
- How to set up a proxy in Selenium?
- How to rotate proxies in Selenium?
- How to use premium proxies?
Let's dive in!
What Is a Selenium Proxy?
A proxy acts as an intermediary between a client and a server. Through it, the client makes requests to other servers anonymously and securely and avoids geographical restrictions.
Like HTTP clients, headless browsers can be configured to use proxy servers. A proxy helps protect your IP address and avoid blocks when scraping protected websites, like Amazon, with Selenium.
Proxy-powered Selenium is particularly useful for browser automation activities such as testing and web scraping. Keep reading to learn how to set up a proxy in Selenium for web scraping!
How to Set Up a Proxy in Selenium
In this section, you'll learn how to set up a Selenium proxy using Python. We'll use Chrome, as it's the most popular browser for automation.
If you prefer using another programming language, check out the following tutorials:
- How to Set a Proxy in Selenium With NodeJS
- How to Set a Proxy in Selenium Java
- How to Set a Proxy in Selenium PHP
- How to Set a Proxy in Selenium C#
Let's start by setting up a basic Python script to control Chrome with Selenium.
The snippet below initializes a headless Chrome driver and visits httpbin, a webpage that returns the IP address of the client making the request. Finally, the script prints the response HTML.
# pip install selenium webdriver-manager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
# set Chrome options to run in headless mode
options = Options()
options.add_argument("--headless=new")
# initialize Chrome driver
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options
)
# navigate to the target webpage
driver.get("https://httpbin.io/ip")
# print the HTML of the target webpage
print(driver.page_source)
# release the resources and close the browser
driver.quit()
The code will print the following HTML:
<html><head><meta name="color-scheme" content="light dark"><meta charset="utf-8"></head><body><pre>{
"origin": "50.217.226.40:80"
}
</pre><div class="json-formatter-container"></div></body></html>
Note: If you haven't upgraded to Selenium 4 yet, do it, since recent versions (4.6 and later) ship with Selenium Manager, which downloads a matching browser driver automatically. You can verify your current version using pip show selenium and upgrade to the newest version with pip install --upgrade selenium.
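The version check can also be done from Python. The snippet below is a minimal sketch that reads the installed package version through the standard library; the helper names `major_version` and `selenium_is_v4_or_newer` are made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    """Return the leading major number of a version string like '4.21.0'."""
    return int(version_string.split(".")[0])


def selenium_is_v4_or_newer() -> bool:
    """Report whether the installed selenium package is version 4 or later."""
    try:
        return major_version(version("selenium")) >= 4
    except PackageNotFoundError:
        # selenium isn't installed in this environment
        return False


print("Selenium 4+ installed:", selenium_is_v4_or_newer())
```

If it prints False, run the upgrade command from the note before continuing.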
Awesome! You're now ready to set up your Selenium proxy in Python using the Chrome driver.
To set a proxy in Selenium, you need to:
- Retrieve a valid proxy server.
- Specify it in the --proxy-server Chrome option.
- Visit your target page.
Let's go over the whole process step-by-step.
First, get a free proxy address from the Free Proxy List website. Configure Selenium with Options to launch Chrome using a proxy. Then, print the body content of the target webpage.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
# define the proxy address and port
proxy = "20.235.159.154:80"
# set Chrome options to run in headless mode using a proxy
options = Options()
options.add_argument("--headless=new")
options.add_argument(f"--proxy-server={proxy}")
# initialize Chrome driver
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options
)
# navigate to the target webpage
driver.get("https://httpbin.io/ip")
# print the body content of the target webpage
print(driver.find_element(By.TAG_NAME, "body").text)
# release the resources and close the browser
driver.quit()
The controlled instance of Chrome will now perform all requests through the specified proxy.
Here's what it'll return:
{
"origin": "20.235.159.154:80"
}
The site response matches the proxy server address. That means Selenium is visiting pages through the proxy server.
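Rather than comparing the two addresses by eye, you can automate the check. Here's a minimal sketch that parses the httpbin body and compares hosts only, since the reported port doesn't always match the proxy's port. The helper names are hypothetical, and the parsing assumes IPv4 addresses in `host:port` form.

```python
import json


def host_of(address: str) -> str:
    """Strip an optional scheme and port from 'scheme://host:port' (IPv4 only)."""
    without_scheme = address.split("://")[-1]
    return without_scheme.split(":")[0]


def origin_matches_proxy(body_text: str, proxy: str) -> bool:
    """Check whether the httpbin /ip response reports the proxy's IP."""
    origin = json.loads(body_text)["origin"]
    return host_of(origin) == host_of(proxy)


body = '{"origin": "20.235.159.154:80"}'
print(origin_matches_proxy(body, "20.235.159.154:80"))
```

In the script above, you would pass `driver.find_element(By.TAG_NAME, "body").text` as `body_text`.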
Free proxies are short-lived and unreliable, so the one used in the snippet above probably won't work by the time you read this. We'll see a better alternative later in the tutorial.
Great! You now know the basics of using a Python Selenium proxy.
However, using a single proxy isn't enough. For instance, some websites implement rate limiting, which restricts the number of requests you can make from a single IP within a given time frame. They can also block that IP outright if it exceeds the limit.
To avoid these limitations and reduce the risk of being blocked, you need to implement advanced strategies like proxy rotation and premium proxies. We'll cover these methods later in the tutorial.
Proxy Authentication in Selenium
Some proxy servers rely on authentication to restrict access to users without valid credentials. That's usually the case with commercial solutions or premium proxies.
The syntax to specify a username and password in an authenticated proxy URL looks like this:
<PROXY_PROTOCOL>://<YOUR_USERNAME>:<YOUR_PASSWORD>@<PROXY_IP_ADDRESS>:<PROXY_PORT>
However, using such a URL in --proxy-server won't work because the Chrome driver ignores the username and password by default. That's where a third-party plugin, such as Selenium Wire, comes to the rescue.
Selenium Wire extends Selenium to give you access to the requests made by the browser and change them as desired. Run the command below to install it:
pip install blinker==1.7.0 selenium-wire
Selenium Wire is no longer maintained and depends on blinker==1.7.0. To run it smoothly, install it together with that pinned blinker version, as in the command above.
Use Selenium Wire to deal with proxy authentication, as shown below:
from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
# configure the proxy
proxy_username = "<YOUR_USERNAME>"
proxy_password = "<YOUR_PASSWORD>"
proxy_address = "20.235.159.154"
proxy_port = "80"
# formulate the proxy url with authentication
proxy_url = f"http://{proxy_username}:{proxy_password}@{proxy_address}:{proxy_port}"
# set selenium-wire options to use the proxy
seleniumwire_options = {
    "proxy": {
        "http": proxy_url,
        "https": proxy_url
    },
}
# set Chrome options to run in headless mode
options = Options()
options.add_argument("--headless=new")
# initialize the Chrome driver with service, selenium-wire options, and chrome options
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    seleniumwire_options=seleniumwire_options,
    options=options
)
# navigate to the target webpage
driver.get("https://httpbin.io/ip")
# print the body content of the target webpage
print(driver.find_element(By.TAG_NAME, "body").text)
# release the resources and close the browser
driver.quit()
This code may result in a 407: Proxy Authentication Required (https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/407) error. A proxy server responds with that HTTP status when the credentials are missing or incorrect, so make sure the proxy URL uses a valid username and password.
Learn more in our guide to Selenium Wire.
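Before blaming the proxy for a 407, it can help to confirm that the URL you built actually contains every part. The sketch below uses only the standard library; `check_proxy_url` is a hypothetical helper, and the credentials in the example are placeholders.

```python
from urllib.parse import urlsplit


def check_proxy_url(proxy_url: str) -> dict:
    """Split an authenticated proxy URL and fail early if a part is missing."""
    parts = urlsplit(proxy_url)
    fields = {
        "scheme": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
    }
    # collect every component that is empty or absent
    missing = [name for name, value in fields.items() if not value]
    if missing:
        raise ValueError(f"proxy URL is missing: {', '.join(missing)}")
    return fields


print(check_proxy_url("http://user:secret@20.235.159.154:80"))
```

This only validates the shape of the URL. Whether the credentials are accepted is still up to the proxy server.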
Best Protocols for a Proxy in Selenium
When it comes to choosing a protocol for a Selenium proxy, the most common options are HTTP, HTTPS, and SOCKS5.
HTTP proxies forward traffic without encrypting the connection to the proxy, while HTTPS proxies encrypt it to provide an extra security layer. That's why the latter is generally the more secure choice.
Another useful protocol for Selenium proxies is SOCKS5, the latest version of the SOCKS protocol. It supports a wider range of traffic, including email and FTP, which makes it a more versatile protocol.
Overall, HTTP and HTTPS proxies are good for web scraping and crawling, and SOCKS finds applications in tasks that involve non-HTTP traffic.
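In Chrome, the protocol is selected through the scheme prefix of the --proxy-server value. Here's a minimal sketch of a helper that builds that argument for the three protocols discussed above. The function name, the restricted scheme set, and port 1080 in the example are assumptions for illustration.

```python
# schemes covered in this section; Chrome may accept others
SUPPORTED_SCHEMES = {"http", "https", "socks5"}


def proxy_server_argument(host: str, port: int, scheme: str = "http") -> str:
    """Build the command-line argument for Chrome's --proxy-server option."""
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported proxy scheme: {scheme}")
    return f"--proxy-server={scheme}://{host}:{port}"


print(proxy_server_argument("20.235.159.154", 80))
print(proxy_server_argument("20.235.159.154", 1080, scheme="socks5"))
```

You would pass the result to `options.add_argument(...)`, just like the plain `--proxy-server` argument used earlier.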
Use a Rotating Proxy in Selenium With Python
If your script makes several requests in a short interval, the server may consider it suspicious and block your IP. Websites can detect and block requests from specific IP addresses, making it difficult for you to scrape data effectively.
However, using a rotating proxy approach can solve this problem. By switching proxies after a particular period or number of requests, your end IP will keep changing. This makes you appear as a different user each time, preventing the server from banning you.
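The "switch after a number of requests" policy can be sketched independently of Selenium. The class below is an illustrative assumption, not part of any library: it hands out the same proxy a fixed number of times and then moves to the next one in the pool.

```python
from itertools import cycle


class ProxyRotator:
    """Hand out the same proxy for `requests_per_proxy` calls, then move on."""

    def __init__(self, proxies, requests_per_proxy=1):
        if not proxies:
            raise ValueError("the proxy pool is empty")
        self._pool = cycle(proxies)
        self._limit = requests_per_proxy
        self._used = 0
        self._current = next(self._pool)

    def next_proxy(self):
        # switch to the next proxy once the current one hits its request limit
        if self._used >= self._limit:
            self._current = next(self._pool)
            self._used = 0
        self._used += 1
        return self._current


rotator = ProxyRotator(["http://proxy-a:80", "http://proxy-b:80"], requests_per_proxy=2)
print([rotator.next_proxy() for _ in range(5)])
```

Cycling through the pool in order spreads requests evenly. Picking a proxy at random is simpler but can reuse the same IP several times in a row.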
Let's learn how to build a proxy rotator in Selenium with selenium-wire.
First, you need to create a pool of proxies. In this example, we'll use some free proxies.
Store them in a list as follows:
PROXIES = [
    "http://20.235.159.154:80",
    "http://149.169.197.151:80",
    # ...
    "http://212.76.118.242:97"
]
Then, extract a random proxy with random.choice() and use it to initialize a new driver instance. Here's what your final code should look like:
from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import random
# the list of proxies to rotate through
PROXIES = [
    "http://20.235.159.154:80",
    "http://149.169.197.151:80",
    # ...
    "http://212.76.118.242:97"
]
# randomly select a proxy
proxy = random.choice(PROXIES)
# set selenium-wire options to use the proxy
seleniumwire_options = {
    "proxy": {
        "http": proxy,
        "https": proxy
    },
}
# set Chrome options to run in headless mode
options = Options()
options.add_argument("--headless=new")
# initialize the Chrome driver with service, selenium-wire options, and chrome options
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    seleniumwire_options=seleniumwire_options,
    options=options
)
# navigate to the target webpage
driver.get("https://httpbin.io/ip")
# print the body content of the target webpage
print(driver.find_element(By.TAG_NAME, "body").text)
# release the resources and close the browser
driver.quit()
The following is the output for manually running this code three times:
# request 1
{
"origin": "149.169.197.151:1286"
}
# request 2
{
"origin": "20.235.159.154:3224"
}
# request 3
{
"origin": "212.76.118.242:97"
}
Well done! You’ve just built a working Selenium proxy rotator. You can learn more tips and tricks in our definitive guide on how to rotate proxies in Python.
However, most requests will fail since free proxies are error-prone. That's why you should add retry logic with random timeouts.
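Here's one way such retry logic could look. It's a sketch under assumptions: `fetch` is a hypothetical callable that you write yourself (for example, a function that starts a driver with the given proxy, loads the page, and returns the body text), and the 1 to 3 second wait range is arbitrary.

```python
import random
import time


def fetch_with_retries(fetch, proxies, max_attempts=3, sleep=time.sleep):
    """Call fetch(proxy) with a random proxy until it succeeds or attempts run out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        proxy = random.choice(proxies)
        try:
            return fetch(proxy)
        except Exception as error:  # free proxies fail in many different ways
            last_error = error
            if attempt < max_attempts:
                # wait a random 1-3 seconds so retries aren't evenly spaced
                sleep(random.uniform(1, 3))
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error
```

The `sleep` parameter exists only so the delay can be swapped out in tests. In a real script, you'd leave it at its default.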
But that's not the only issue. Try to test the IP rotator logic against G2 Reviews, a website protected by anti-bot technologies:
driver.get("https://www.g2.com/products/asana/reviews")
You'll get the following output:
<!DOCTYPE html>
<html class="no-js" lang="en-US">
<head>
<title>Attention Required! | Cloudflare</title>
</head>
<body>
<!-- ... -->
<div class="cf-wrapper cf-header cf-error-overview">
<h1 data-translate="block_headline">Sorry, you have been blocked</h1>
</div>
<!-- ... -->
</body>
</html>
The target server detected the rotating proxy Selenium request as a bot and responded with a 403 Forbidden error.
In fact, free proxies will usually get you blocked. We used them to demonstrate the basics, but you should never rely on them in a real-world project.
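If you want your script to notice this kind of block automatically, you can look for the telltale strings from the page above. This is a rough sketch: the marker list is taken from the sample output only, and other anti-bot systems will use different wording.

```python
# strings copied from the block page shown above; not an exhaustive list
BLOCK_MARKERS = (
    "Attention Required! | Cloudflare",
    "Sorry, you have been blocked",
)


def looks_blocked(html: str) -> bool:
    """Return True if the page contains one of the known block-page markers."""
    return any(marker in html for marker in BLOCK_MARKERS)


sample = "<title>Attention Required! | Cloudflare</title>"
print(looks_blocked(sample))
```

In a Selenium script, you'd call `looks_blocked(driver.page_source)` after `driver.get(...)` and switch proxies when it returns True.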
The solution? A premium proxy!
Add Premium Proxies to Selenium
As seen above, free proxies are unreliable, and you should prefer premium proxies for web scraping. If you need ideas on where to get them, check our list of the best proxy providers for scraping.
Premium proxies make anti-bot bypassing far more reliable, with automated residential IP rotation and geolocation capabilities. This allows you to scrape data efficiently with a much lower risk of being rate-limited or blocked, all while maintaining anonymity.
Let's see how to add auto-rotating premium proxies using ZenRows’ proxy service and access the G2 Reviews page that blocked us in the previous section.
Sign up to get started with ZenRows. Once you register, you'll get redirected to the Request Builder page. Paste your target URL, click on the Premium Proxies checkbox, and select the JS Rendering boost mode. Select Python as the language, and click on the Proxy tab. Finally, copy the generated code.
Now, install the requests library:
pip install requests
Then, paste the generated Python code into your script:
# pip install requests
import requests
url = "https://www.g2.com/products/asana/reviews"
proxy = "http://<YOUR_ZENROWS_API_KEY>:js_render=true&[email protected]:8001"
proxies = {"http": proxy, "https": proxy}
response = requests.get(url, proxies=proxies, verify=False)
print(response.text)
Run it, and you'll get the target page's HTML content:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
<title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
</head>
<body>
<!-- other content omitted for brevity -->
</body>
Fantastic! You successfully accessed a protected website using ZenRows premium proxies. Now, you have a proxy scraping solution with Selenium's capabilities.
However, premium proxies aren’t a foolproof solution. If you're looking for a complete anti-bot bypass toolkit, you should use a web-scraping API, such as ZenRows. It includes premium proxies and other essential features like a built-in headless browser, request header management, TLS fingerprints, and more.
Error 403: Forbidden for Proxy in Selenium Grid
Selenium Grid allows you to control remote browsers and run cross-platform scripts in parallel. However, using it may lead to an Error 403: Forbidden for Proxy, one of the most common errors you can encounter during web scraping. That happens for two reasons:
- Another process is already running on port 4444.
- You aren't sending RemoteWebDriver requests to the correct URL.
By default, the Selenium server hub listens on http://localhost:4444. If another process is running on port 4444, end it or start Selenium Grid using another port.
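To find out whether something is already listening on that port before you start Selenium Grid, you can probe it with a short script. This is a minimal sketch using the standard socket module; `port_in_use` is a hypothetical helper name.

```python
import socket


def port_in_use(port: int, host: str = "localhost") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1)
        # connect_ex returns 0 on success instead of raising an exception
        return probe.connect_ex((host, port)) == 0


print("Port 4444 busy:", port_in_use(4444))
```

Run it while Selenium Grid is stopped. A busy port at that point means another process has claimed it.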
If that doesn't solve the issue, make sure you're connecting the remote driver to the right hub URL, as shown below:
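from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# ...
# recent Selenium 4 releases take browser options instead of a capabilities dict
driver = webdriver.Remote(command_executor="http://localhost:4444/wd/hub", options=Options())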
Perfect! The error should be gone now!
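You can also confirm that the hub is reachable before creating a remote driver. The sketch below queries the WebDriver status endpoint and reads its `ready` flag. It assumes the hub runs at the default http://localhost:4444 address and exposes /status at the root; the function names are made up for this example.

```python
import json
from urllib.request import urlopen


def grid_is_ready(status_payload: dict) -> bool:
    """Read the `ready` flag from a WebDriver /status response body."""
    return bool(status_payload.get("value", {}).get("ready", False))


def check_grid(hub_url: str = "http://localhost:4444") -> bool:
    """Fetch <hub_url>/status and report whether the Grid accepts sessions."""
    with urlopen(f"{hub_url}/status", timeout=5) as response:
        return grid_is_ready(json.load(response))
```

Calling `check_grid()` raises a connection error if nothing is listening at the hub URL, which already tells you the address is wrong or the Grid isn't running.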
Conclusion
This step-by-step tutorial showed how to set up a proxy in Selenium with Python. You started with the basics of adding a proxy to Selenium and then moved on to more advanced topics, such as rotating proxies and using premium proxies.
Now you know:
- What a Selenium proxy is.
- The basics of setting a proxy with Selenium in Python.
- How to deal with authenticated proxies in Selenium.
- How to implement a rotating proxy and why this approach doesn't work with free proxies.
- What a premium proxy is and how to use it.
While proxies are one of the ways to avoid anti-bot detection systems, they don’t work 100% of the time and require a lot of manual maintenance. To avoid the hassle of finding and configuring proxies and confidently bypass any anti-bot measures, use a web scraping API, such as ZenRows. Try ZenRows for free!