Set a Urllib3 Proxy: Tutorial 2024 

November 6, 2023 · 4 min read

If you've ever encountered an Access denied: 403 Forbidden or IP Blocked error while web scraping or making HTTP requests, you're not alone. Websites often employ security measures that can detect and block your web scraper.

However, the good news is you can overcome those challenges by using a proxy with urllib3 when web scraping in Python. You'll learn how in this tutorial.

What Is a Urllib3 Proxy?

A Urllib3 proxy is an intermediary server that you route Urllib3's HTTP requests through. It acts as a bridge between you and your target web page: your requests first go to the proxy server, which communicates with the page and returns the response to you.

How to Set a Proxy with Urllib3

Here are the steps to set up a proxy with Urllib3, including additional tweaks that'll increase your chances of avoiding detection.

Step 1: Get Started with Urllib3

Let's begin by creating a basic Urllib3 scraper that makes a normal HTTP request to a target URL. 

Here's a basic Urllib3 script that makes a GET request to Httpbin, an API that returns the client's IP address. It uses a Urllib3 PoolManager instance, which serves as a centralized manager for connections to web servers and handles connection pooling for you.

Terminal
import urllib3
 
# Create a PoolManager instance for sending requests.
http = urllib3.PoolManager()
 
# Send a GET request
resp = http.request("GET", "http://httpbin.io/ip")
 
# Print the returned data.
print(resp.data)

The result of the request above should be your IP address.

Output
b'{\n  "origin": "107.010.55.0"\n}\n'
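The response body is raw bytes, which is why the output above starts with b'. As a minimal sketch, you could decode it into a Python dictionary with the standard json module. The extract_origin helper name and the sample IP address are illustrative assumptions, not part of Urllib3.

```python
import json


def extract_origin(body: bytes) -> str:
    """Return the "origin" field from an httpbin-style /ip response body."""
    payload = json.loads(body.decode("utf-8"))
    return payload["origin"]


# Sample bytes shaped like the output above (the IP is a documentation placeholder).
sample_body = b'{\n  "origin": "203.0.113.7"\n}\n'
print(extract_origin(sample_body))
```

With the script from this step, you'd call extract_origin(resp.data) instead of passing the sample bytes.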

Step 2: Set a Urllib3 Proxy

For this step, you need a proxy. You can grab a free one from FreeProxyList. We recommend using HTTPS proxies because they work for both HTTPS and HTTP requests.

Urllib3 provides a ProxyManager object for tunneling requests through a proxy. So, to configure that, create a ProxyManager instance and pass your proxy URL as an argument. Then, make your request through it and log your response.

Like the PoolManager object, ProxyManager handles all the details of connections, so you don't need a separate PoolManager.

Terminal
import urllib3
 
# Create a Proxy Manager for managing proxy servers
proxy = urllib3.ProxyManager("http://75.89.101.60:80")
 
# Make GET request through the proxy
response = proxy.request("GET", "http://httpbin.io/ip")
 
# Print the returned data
print(response.data)

Run the script, and your response should be your proxy's IP address.

Output
b'{\n  "origin": "75.89.101.61:38706"\n}\n'

Congrats! You've set up your first Urllib3 proxy.
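Free proxies often go offline, and a request through a dead proxy can hang or raise an exception. The sketch below shows one way to set a timeout, limit retries, and catch the error. The helper names (build_proxy_manager, fetch_ip) and the timeout and retry values are assumptions chosen for illustration, not recommended settings.

```python
import urllib3


def build_proxy_manager(proxy_url: str) -> urllib3.ProxyManager:
    """Create a ProxyManager that gives up quickly on a dead proxy."""
    return urllib3.ProxyManager(
        proxy_url,
        # Assumed limits: 5 s to connect, 10 s to read. Tune for your proxies.
        timeout=urllib3.Timeout(connect=5.0, read=10.0),
        # Retry once instead of urllib3's default of three retries.
        retries=urllib3.Retry(total=1),
    )


def fetch_ip(manager: urllib3.ProxyManager, url: str = "http://httpbin.io/ip"):
    """Return the response body, or None if the proxy could not be used."""
    try:
        response = manager.request("GET", url)
    except urllib3.exceptions.HTTPError as error:
        # Base class for urllib3 errors such as ProxyError and MaxRetryError.
        print(f"Request through proxy failed: {error}")
        return None
    return response.data


# Building the manager opens no connection; requests are only sent by fetch_ip().
manager = build_proxy_manager("http://75.89.101.60:80")
print(manager.proxy.host, manager.proxy.port)
```

Calling fetch_ip(manager) then sends the actual request and returns None instead of crashing when the proxy is unreachable.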

However, the proxy in this step is a free one, which is an unreliable option. In a real-world scenario, you'll need premium proxies for web scraping, and those often require additional configuration. Let's see how to use them in Urllib3.

Step 3: Proxy Authentication with Urllib3: Username & Password

Most premium proxies require authentication to verify the user's legitimacy. Proxy providers use this as a crucial security measure to control who can access their servers.

To authenticate a proxy with Urllib3, you must send your credentials (username and password) to the proxy server in a Proxy-Authorization header with every request.

For the proxy to interpret them correctly, the credentials must be Base64-encoded. Urllib3 provides the urllib3.util.make_headers function for that: pass your credentials as a single "username:password" string to its proxy_basic_auth argument, and it returns a dictionary containing the encoded header.

So, if the proxy in step 2 were premium, you'd authenticate by building that header and passing it to ProxyManager through the proxy_headers argument, like in the code below.

Terminal
import urllib3
 
# Build the Proxy-Authorization header from the "username:password" string
auth_creds = urllib3.util.make_headers(proxy_basic_auth="username:password")
 
# Create a Proxy Manager for managing proxy servers
proxy = urllib3.ProxyManager("http://75.89.101.60:80", proxy_headers=auth_creds)
 
# Make GET request through the proxy
response = proxy.request("GET", "http://httpbin.io/ip")
 
# Print the returned data
print(response.data)
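To see what make_headers actually produces, you can inspect the returned dictionary. This is a small illustrative sketch using the same placeholder credentials as above. It also decodes the header again to show that Basic authentication is an encoding, not encryption.

```python
import base64

import urllib3

# Same call as in the script above, with placeholder credentials.
auth_headers = urllib3.util.make_headers(proxy_basic_auth="username:password")

# The helper returns a one-entry dictionary holding the Proxy-Authorization header.
header_value = auth_headers["proxy-authorization"]
print(header_value)

# The part after "Basic " is just Base64 of "username:password", not encryption.
encoded = header_value.split(" ", 1)[1]
print(base64.b64decode(encoded).decode("utf-8"))
```

Because anyone who sees the header can decode it, treat the credentials as sensitive and keep them out of your source code where possible.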

Step 4: Rotate Proxies (You Need to!)

Websites flag too many requests as suspicious activity and can block your proxy. Fortunately, you can avoid that by rotating through multiple proxies. This way, you distribute traffic across multiple IP addresses, making your requests appear to come from different users.

To rotate proxies with Urllib3, create a proxy list and randomly select one for each request. 

Here's a step-by-step example.

Import random and define your proxy list. You can grab a few proxies from FreeProxyList to create your list (we'll see how to do the same with premium ones later on).

scraper.py
import urllib3 
import random
 
# Define a list of proxy URLs
proxy_list = [
    "http://8.219.97.248:80",
    "http://50.168.49.109:80",
    # Add more proxy URLs as needed
]
 
#..

Randomly select a proxy from the list using the random.choice() method. Then, create a ProxyManager instance using the random proxy, make your request, and log the response like in step 2.

Terminal
# Randomly select a proxy from the list
proxy_url = random.choice(proxy_list)
 
# Create a Proxy Manager instance using the random proxy
proxy = urllib3.ProxyManager(proxy_url)
 
# Make GET request through the proxy
response = proxy.request("GET", "http://httpbin.io/ip")
 
# Print the returned data
print(response.data)

Putting everything together, your complete code should look like this:

scraper.py
import urllib3 
import random
 
# Define a list of proxy URLs
proxy_list = [
    "http://8.219.97.248:80",
    "http://50.168.49.109:80",
    # Add more proxy URLs as needed
]
 
# Randomly select a proxy from the list
proxy_url = random.choice(proxy_list)
 
# Create a Proxy Manager instance using the random proxy
proxy = urllib3.ProxyManager(proxy_url)
 
# Make GET request through the proxy
response = proxy.request("GET", "http://httpbin.io/ip")
 
# Print the returned data
print(response.data)

To verify it works, run the code multiple times. Because the proxy is picked at random, consecutive runs can occasionally return the same IP address, but over several runs you should see different ones.

Here are our results for two requests:

Output
b'{\n  "origin": "8.219.97.248"\n}\n'
b'{\n  "origin": "50.168.49.109"\n}\n'

Bingo! 
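random.choice() can pick the same proxy twice in a row. If you'd rather go through the list in a fixed order, one alternative (not what the script above does) is round-robin rotation with itertools.cycle. The sketch below only prints which proxy each request would use, so it runs without any network access.

```python
from itertools import cycle

# Placeholder proxy URLs copied from the list above.
proxy_list = [
    "http://8.219.97.248:80",
    "http://50.168.49.109:80",
]

# cycle() yields the proxies in order and starts over after the last one.
proxy_pool = cycle(proxy_list)

for request_number in range(1, 5):
    proxy_url = next(proxy_pool)
    # In a real scraper, create urllib3.ProxyManager(proxy_url) here and send the request.
    print(request_number, proxy_url)
```

With two proxies in the list, requests 1 and 3 use the first proxy and requests 2 and 4 use the second.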

It's important to note that we only used free proxies with Urllib3 in this example to explain the concept. As mentioned before, you'll need premium proxies when making requests to real websites because free ones are prone to failure and easily detected.
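Since individual proxies fail often, a rotator usually needs a fallback. The following is a rough sketch, under assumed timeout values and with hypothetical helper names (pick_attempt_order, fetch_with_failover), of trying several proxies in random order until one returns a 200 response.

```python
import random

import urllib3


def pick_attempt_order(proxy_list, max_attempts=3):
    """Return up to max_attempts distinct proxies in random order."""
    return random.sample(proxy_list, k=min(max_attempts, len(proxy_list)))


def fetch_with_failover(proxy_list, url, max_attempts=3):
    """Try proxies one by one; return (proxy_url, body) or None if all fail."""
    for proxy_url in pick_attempt_order(proxy_list, max_attempts):
        proxy = urllib3.ProxyManager(
            proxy_url,
            timeout=urllib3.Timeout(connect=5.0, read=10.0),  # assumed limits
            retries=False,  # fail fast so the loop can move to the next proxy
        )
        try:
            response = proxy.request("GET", url)
        except urllib3.exceptions.HTTPError:
            continue  # dead or unreachable proxy: try the next one
        if response.status == 200:
            return proxy_url, response.data
    return None
```

You'd call it as fetch_with_failover(proxy_list, "http://httpbin.io/ip") and treat a None result as "every proxy tried has failed".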

Step 5: Understand What Proxies You Need for Real-world Scraping

In the previous step, we stated that free proxies work only for testing purposes, so let's see how they perform in a real-world example. For that, replace your target URL with the actual website G2, which is protected by Cloudflare. You'll get an error message similar to the one below. 

Output
b'
<!DOCTYPE html>\n
<!--[if lt IE 7]>
 
   
  </head>\n <body>\n <div.......">\n 
    <h1 data-translate="block_headline">Sorry, you have been blocked</h1>
    <h2 class="cf-subheadline">
            <span data-translate="unable_to_access">You are unable to access</span> g2.com
    </h2>
 
#....
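A scraper can check for this kind of block page programmatically instead of reading the HTML by eye. The sketch below is a simple heuristic, not a reliable detector: the status codes and the marker phrase (taken from the block page above) are assumptions, and other sites will use different ones.

```python
# Assumed heuristics: common blocking status codes and a phrase from the page above.
BLOCK_STATUS_CODES = {403, 429}
BLOCK_MARKERS = (b"Sorry, you have been blocked",)


def looks_blocked(status: int, body: bytes) -> bool:
    """Guess whether a response is an anti-bot block page rather than real content."""
    if status in BLOCK_STATUS_CODES:
        return True
    return any(marker in body for marker in BLOCK_MARKERS)


blocked_page = b'<h1 data-translate="block_headline">Sorry, you have been blocked</h1>'
print(looks_blocked(403, blocked_page))
print(looks_blocked(200, b"<h1>Where you go for software.</h1>"))
```

With a Urllib3 response, you'd pass response.status and response.data to looks_blocked().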

That proves that you need premium proxies. There are two main types for data extraction: residential and datacenter. Residential proxies are generally the more reliable choice because they use IP addresses associated with real residential devices, making it harder for websites to flag your requests as bot traffic.

Check out our comparison of the best web scraping proxy providers.

To make it easier (and cheaper), you can use ZenRows, a web scraping API that offers a residential proxy rotator by default, as well as other features you need to avoid getting blocked, including User Agent rotation, anti-CAPTCHA, JavaScript rendering, and more.

Let's try to scrape G2 using ZenRows. To get started, sign up for free, and you'll get to the Request Builder page.

ZenRows Request Builder

Paste your target URL (https://www.g2.com/), check the box for Premium Proxies, and activate the AI anti-bot boost mode. Then, select Python as the language to get your request code generated on the right.

You'll see the Requests library is suggested, but you can absolutely use Urllib3. You only need to create a PoolManager instance and encode the parameters using the request_encode_url() method. 

scraper.py
import urllib3
 
url = "https://www.g2.com/"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "antibot": "true",
    "premium_proxy": "true",
}
 
# Create a urllib3 PoolManager
http = urllib3.PoolManager()
 
# Encode the parameters and make a GET request through the ZenRows API
request_url = "https://api.zenrows.com/v1/"
response = http.request_encode_url("GET", request_url, fields=params)
 
 
# Print the response content
print(response.data)

Run the code, and you'll get the HTML of the page.

Output
b'<!DOCTYPE html>
#..
 
<title id="icon-label-55be01c8a779375d16cd458302375f4b">G2 - Business Software Reviews</title>
 
#..
 
<h1 ...id="main">Where you go for software.</h1>

Easy peasy! You can now scrape any website without getting blocked at any scale.
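If you're curious what encoding the parameters means here, the sketch below builds a comparable URL by hand with the standard library's urlencode. It's an approximation for illustration of how the fields end up in the query string, not Urllib3's actual internals, and it sends no request.

```python
from urllib.parse import urlencode

params = {
    "url": "https://www.g2.com/",
    "apikey": "<YOUR_ZENROWS_API_KEY>",
    "js_render": "true",
    "antibot": "true",
    "premium_proxy": "true",
}

# Percent-encode the parameters and append them as a query string,
# which is roughly what the fields argument does for a GET request.
full_url = "https://api.zenrows.com/v1/" + "?" + urlencode(params)
print(full_url)
```

Note how the target URL itself gets percent-encoded so that its own slashes and colon don't clash with the API URL.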

Best Practice: Environment Variables with Urllib3

Environment variables are system-level variables that store configuration information, accessible by the operating system and the applications running on it. They offer a secure and efficient way to manage sensitive data because they're set outside of the application code, keeping confidential information separate from the codebase.

To set an environment variable in Windows, open Command Prompt and run a command with the structure below. The new value applies only to terminal windows opened afterward, so start a new one before running your script.

Terminal
setx VARIABLE_NAME VARIABLE_VALUE

You can do the same on Linux or macOS using the command below, which sets the variable for the current shell session.

Terminal
export VARIABLE_NAME=variable_value

For example, to set your API key, you can use ZENROWS_API_KEY as the variable name and replace <YOUR_ZENROWS_API_KEY> with your actual ZenRows API key.

Terminal
setx ZENROWS_API_KEY <YOUR_ZENROWS_API_KEY>

You can now read the API key from the environment using the os module. 

scraper.py
import urllib3
import os
 
url = "https://www.g2.com/"
# Look up the key by the variable's name, not by its value
apikey = os.environ.get("ZENROWS_API_KEY")
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "antibot": "true",
    "premium_proxy": "true",
}
 
# Create a urllib3 PoolManager
http = urllib3.PoolManager()
 
# Encode the parameters and make a GET request through the ZenRows API
request_url = "https://api.zenrows.com/v1/"
response = http.request_encode_url("GET", request_url, fields=params)

# Print the response content
print(response.data)
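os.environ.get() returns None when the variable isn't set, which would silently send an empty API key. A small guard like the one sketched below fails early with a readable message instead. The load_api_key helper name is hypothetical, and the placeholder value is set in code only so the example runs on its own.

```python
import os


def load_api_key(variable_name: str = "ZENROWS_API_KEY") -> str:
    """Read the API key from the environment, or raise a clear error if it is unset."""
    api_key = os.environ.get(variable_name)
    if not api_key:
        raise RuntimeError(f"Set the {variable_name} environment variable before running.")
    return api_key


# Demonstration only: set a placeholder value so the example runs on its own.
os.environ["ZENROWS_API_KEY"] = "placeholder-key"
print(load_api_key())
```

In your scraper, you'd drop the demonstration line and write apikey = load_api_key().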
 

Conclusion

A Urllib3 proxy enables you to route your requests through different IP addresses to reduce your chances of getting blocked while web scraping. We saw that free proxies are unreliable, and you should use premium residential proxies for better results.

However, residential proxies are not always enough. In any case, consider ZenRows, the all-in-one solution for bypassing all anti-bot measures. Sign up now to try it for free.
