If you've ever encountered an "Access denied: 403 Forbidden" or "IP Blocked" error while web scraping or making HTTP requests, you're not alone. Websites often employ security measures that can detect and block your web scraper.
However, the good news is you can overcome those challenges by using a proxy with urllib3 when web scraping in Python. You'll learn how in this tutorial.
What Is a Urllib3 Proxy?
A Urllib3 proxy is an intermediary server through which your HTTP requests are routed, acting as a bridge between you and your target web page. Your requests are first forwarded to the proxy server, which then communicates with the page and returns the response to you.
How to Set a Proxy with Urllib3
Here are the steps to set up a proxy with Urllib3, including additional tweaks that'll increase your chances of avoiding detection.
Step 1: Get Started with Urllib3
Let's begin by creating a basic Urllib3 scraper that makes a normal HTTP request to a target URL.
Here's a basic Urllib3 script that makes a GET request to Httpbin, an API that returns the client's IP address. For that, it uses Urllib3's PoolManager instance, which serves as a centralized manager for handling connections to web servers. It also handles the complexities of connection pooling, so you don't have to.
import urllib3
# Create a PoolManager instance for sending requests.
http = urllib3.PoolManager()
# Send a GET request
resp = http.request("GET", "http://httpbin.io/ip")
# Print the returned data.
print(resp.data)
The result of the request above should be your IP address.
b'{\n "origin": "107.010.55.0"\n}\n'
Step 2: Set a Urllib3 Proxy
For this step, you need a proxy. You can grab a free one from FreeProxyList. We recommend HTTPS proxies because they work for both HTTP and HTTPS requests.
Urllib3 provides a ProxyManager object for tunneling requests through a proxy. To configure it, create a ProxyManager instance and pass your proxy URL as an argument. Then, make your request through it and log the response. Like the PoolManager object, ProxyManager handles all the details of the connections, so you don't need both.
import urllib3
# Create a Proxy Manager for managing proxy servers
proxy = urllib3.ProxyManager("http://75.89.101.60:80")
# Make GET request through the proxy
response = proxy.request("GET", "http://httpbin.io/ip")
# Print the returned data
print(response.data)
Run the script, and your response should be your proxy's IP address.
b'{\n "origin": "75.89.101.61:38706"\n}\n'
Congrats! You've set up your first Urllib3 proxy.
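Depending on the proxy, the request can also fail outright, so it's worth setting a timeout and catching Urllib3's exceptions rather than letting the script crash. Here's a minimal sketch of that, reusing the same proxy URL as above:
import urllib3
from urllib3.exceptions import MaxRetryError, ProxyError
# Reuse the free proxy from above (it may already be offline)
proxy = urllib3.ProxyManager(
    "http://75.89.101.60:80",
    timeout=urllib3.Timeout(connect=5.0, read=10.0),
)
try:
    response = proxy.request("GET", "http://httpbin.io/ip")
    print(response.data)
except (MaxRetryError, ProxyError) as error:
    # The proxy refused the connection or timed out after the default retries
    print(f"Proxy request failed: {error}")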
However, we used a free proxy in the above example, which is an unreliable option. In a real-world scenario, you'll need premium proxies for web scraping, which often require additional configuration. Let's see how to use them in Urllib3.
Step 3: Proxy Authentication with Urllib3: Username & Password
Most premium proxies require authentication to verify the user's legitimacy. Proxy providers use this as a crucial security measure to control who can access their servers.
To authenticate a proxy with Urllib3, you must provide credentials (username and password) as part of the request headers when sending a request through a proxy server.
However, to ensure your credentials are properly interpreted as headers, they must be encoded. For that, Urllib3 provides the urllib3.util.make_headers function, which takes your username and password as arguments and returns a dictionary containing the right headers for HTTP requests.
So, if the proxy in step 2 were premium, you'd authenticate it by encoding your credentials into request headers and using the new headers in your HTTP request, like in the code below.
import urllib3
# Encode the proxy credentials into request headers
auth_creds = urllib3.util.make_headers(proxy_basic_auth="<YOUR_USERNAME>:<YOUR_PASSWORD>")
# Create a Proxy Manager for managing proxy servers
proxy = urllib3.ProxyManager("http://75.89.101.60:80", proxy_headers=auth_creds)
# Make GET request through the proxy
response = proxy.request("GET", "http://httpbin.io/ip")
# Print the returned data
print(response.data)
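If you're curious what make_headers produces, it's simply a dictionary containing a Base64-encoded proxy-authorization header, which ProxyManager then sends to the proxy with each request. You can verify that with sample credentials:
import urllib3
# Encode sample credentials (user:pass) the same way as above
print(urllib3.util.make_headers(proxy_basic_auth="user:pass"))
# {'proxy-authorization': 'Basic dXNlcjpwYXNz'}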
Step 4: Rotate Proxies (You Need to!)
Websites flag too many requests as suspicious activity and can block your proxy. Fortunately, you can avoid that by rotating through multiple proxies. This way, you distribute traffic across multiple IP addresses, making your requests appear to come from different users.
To rotate proxies with Urllib3, create a proxy list and randomly select one for each request.
Here's a step-by-step example.
Import random and define your proxy list. You can grab a few proxies from FreeProxyList to create your list (we'll see how to do the same with premium ones later on).
import urllib3
import random
# Define a list of proxy URLs
proxy_list = [
    "http://8.219.97.248:80",
    "http://50.168.49.109:80",
    # Add more proxy URLs as needed
]
#..
Randomly select a proxy from the list using random.choice(). Then, create a ProxyManager instance with the random proxy, make your request, and log the response as in step 2.
# Randomly select a proxy from the list
proxy_url = random.choice(proxy_list)
# Create a Proxy Manager instance using the random proxy
proxy = urllib3.ProxyManager(proxy_url)
# Make GET request through the proxy
response = proxy.request("GET", "http://httpbin.io/ip")
# Print the returned data
print(response.data)
Putting everything together, your complete code should look like this:
import urllib3
import random
# Define a list of proxy URLs
proxy_list = [
    "http://8.219.97.248:80",
    "http://50.168.49.109:80",
    # Add more proxy URLs as needed
]
# Randomly select a proxy from the list
proxy_url = random.choice(proxy_list)
# Create a Proxy Manager instance using the random proxy
proxy = urllib3.ProxyManager(proxy_url)
# Make GET request through the proxy
response = proxy.request("GET", "http://httpbin.io/ip")
# Print the returned data
print(response.data)
To verify it works, run the code multiple times. You should get a different IP address for each request.
Here are our results for two requests:
b'{\n "origin": "8.219.97.248"\n}\n'
b'{\n "origin": "50.168.49.109:"\n}\n'
Bingo!
It's important to note that we only used free proxies with Urllib3 in this example to explain the concept. As mentioned before, you'll need premium proxies when making requests to real websites because free ones are prone to failure and easily detected.
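Since any single free proxy can go offline at any moment, you can also combine rotation with the error handling shown earlier and simply retry with a different proxy when one fails. Here's a minimal sketch using the same placeholder proxy list:
import random
import urllib3
from urllib3.exceptions import HTTPError
# Same placeholder free proxies as above
proxy_list = [
    "http://8.219.97.248:80",
    "http://50.168.49.109:80",
]
def fetch_with_rotation(url, attempts=3):
    # Try up to `attempts` randomly chosen proxies before giving up
    for _ in range(attempts):
        proxy_url = random.choice(proxy_list)
        proxy = urllib3.ProxyManager(proxy_url, timeout=urllib3.Timeout(connect=5.0, read=10.0))
        try:
            return proxy.request("GET", url)
        except HTTPError as error:
            print(f"{proxy_url} failed: {error}")
    raise RuntimeError("All proxy attempts failed")
print(fetch_with_rotation("http://httpbin.io/ip").data)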
Step 5: Understand What Proxies You Need for Real-world Scraping
In the previous step, we stated that free proxies work only for testing purposes, so let's see how they perform in a real-world example. For that, replace your target URL with that of G2 (https://www.g2.com/), a real website protected by Cloudflare. You'll get an error message similar to the one below.
b'
<!DOCTYPE html>\n
<!--[if lt IE 7]>
</head>\n <body>\n <div.......">\n
<h1 data-translate="block_headline">Sorry, you have been blocked</h1>
<h2 class="cf-subheadline">
<span data-translate="unable_to_access">You are unable to access</span> g2.com
</h2>
#....
That proves you need premium proxies. There are two main types for data extraction: residential and datacenter. Residential proxies are the recommended and more reliable option because they use IP addresses associated with real residential devices, making them difficult for websites to flag as bots.
Check out our comparison of the best web scraping proxy providers.
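With most residential providers, you get a single gateway endpoint plus credentials, and the provider rotates the exit IP for you, so the Urllib3 side looks just like the authenticated setup from step 3. Here's a sketch with placeholder host, port, and credentials:
import urllib3
# Placeholder gateway and credentials (replace with your provider's values)
auth_creds = urllib3.util.make_headers(proxy_basic_auth="<YOUR_USERNAME>:<YOUR_PASSWORD>")
proxy = urllib3.ProxyManager("http://<RESIDENTIAL_GATEWAY_HOST>:<PORT>", proxy_headers=auth_creds)
response = proxy.request("GET", "http://httpbin.io/ip")
print(response.data)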
To make it easier (and cheaper), you can use ZenRows, a web scraping API that offers a residential proxy rotator by default, as well as other features you need to avoid getting blocked, including User Agent rotation, anti-CAPTCHA, JavaScript rendering, and more.
Let's try to scrape G2 using ZenRows. To get started, sign up for free, and you'll get to the Request Builder page.
Paste your target URL (https://www.g2.com/), check the box for Premium Proxies, and activate the JS Rendering boost mode. Then, select Python as the language to generate your request code on the right.
You'll see the Requests library is suggested, but you can absolutely use Urllib3. You only need to create a PoolManager instance and encode the parameters using the request_encode_url() method.
import urllib3
url = "https://www.g2.com/"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "premium_proxy": "true",
}
# Create a urllib3 PoolManager
http = urllib3.PoolManager()
# Encode the parameters and make a GET request through the ZenRows API
request_url = "https://api.zenrows.com/v1/"
response = http.request_encode_url("GET", request_url, fields=params)
# Print the response content
print(response.data)
Run the code, and you'll get the HTML of the page.
b'<!DOCTYPE html>
#..
<title id="icon-label-55be01c8a779375d16cd458302375f4b">G2 - Business Software Reviews</title>
#..
<h1 ...id="main">Where you go for software.</h1>
Easy peasy! You can now scrape any website at any scale without getting blocked.
Best Practice: Environment Variables with Urllib3
Environment variables are system-level variables that store configuration information, accessible by the operating system and the applications running on it. They provide a secure and efficient way to manage sensitive data because they're set outside the application code, keeping confidential information separate from the codebase.
To set an environment variable on Windows, open the command prompt and run a command following this structure:
setx VARIABLE_NAME VARIABLE_VALUE
You can do the same on Linux or macOS using the command below.
export VARIABLE_NAME=variable_value
For example, to set your API key, you can use ZENROWS_API_KEY as the variable name and replace <YOUR_ZENROWS_API_KEY> with your actual ZenRows API key.
setx ZENROWS_API_KEY <YOUR_ZENROWS_API_KEY>
You can now access the API key from the environment using the os module.
import urllib3
import os
url = "https://www.g2.com/"
apikey = os.environ.get("ZENROWS_API_KEY")
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "premium_proxy": "true",
}
# Create a urllib3 PoolManager
http = urllib3.PoolManager()
# Encode the parameters and make a GET request through the ZenRows API
request_url = "https://api.zenrows.com/v1/"
response = http.request_encode_url("GET", request_url, fields=params)
# Print the response content
print(response.data)
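The same pattern works for proxy credentials. The sketch below assumes you've exported hypothetical PROXY_URL, PROXY_USERNAME, and PROXY_PASSWORD variables the same way:
import os
import urllib3
# Hypothetical environment variables (set them with setx/export first)
proxy_url = os.environ.get("PROXY_URL")  # e.g., http://host:port
username = os.environ.get("PROXY_USERNAME")
password = os.environ.get("PROXY_PASSWORD")
# Encode the credentials and route the request through the proxy
auth_creds = urllib3.util.make_headers(proxy_basic_auth=f"{username}:{password}")
proxy = urllib3.ProxyManager(proxy_url, proxy_headers=auth_creds)
response = proxy.request("GET", "http://httpbin.io/ip")
print(response.data)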
Conclusion
A Urllib3 proxy enables you to route your requests through different IP addresses to reduce your chances of getting blocked while web scraping. We saw that free proxies are unreliable, and you should use premium residential proxies for better results.
However, residential proxies alone aren't always enough. In that case, consider ZenRows, the all-in-one solution for bypassing anti-bot measures. Sign up now to try it for free.