Are you facing blocks and bans while scraping websites with the httpx library? Websites are increasing their measures to block bot traffic, so you'll need an HTTP proxy to access your desired data.
In this article, you’ll learn how to configure a proxy in httpx to avoid getting blocked, from basic configuration to handling authentication.
How to Set Your Proxy With Httpx in Python
An httpx proxy acts as a bridge between your scraper and the target server. It routes your requests through its own server, increasing anonymity and allowing you to distribute traffic across multiple servers.
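To set a proxy in httpx, use the `proxies` parameter when making a request. This parameter takes a dictionary that maps URL schemes (`http://` and `https://`) to the proxy URL that requests of that type should go through. Note that recent httpx releases replace `proxies` with the single-URL `proxy` argument (deprecated in 0.26 and removed in 0.28, per the httpx changelog), so check your installed version if the examples below raise an error about an unexpected argument.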
Let's put it into practice.
First, here's a basic httpx script to which you can add proxy configuration.
import httpx
r = httpx.get("https://httpbin.io/ip")
print(r.text)
The code snippet above sends a GET request to HTTPBin, a test website that returns the client's IP address, and prints the response body. Since no proxy settings were added, it will return your machine's IP address.
Step 1: Add a Proxy in Httpx
This tutorial uses a free proxy from the Free Proxy List. It may no longer work at the time of reading, so feel free to switch to a new one.
Start by defining your proxy settings using the following format: `<PROXY_PROTOCOL>://<PROXY_IP_ADDRESS>:<PROXY_PORT>`.
import httpx
# define your proxy settings
proxies = {
"http://": "http://216.137.184.253:80",
"https://": "http://216.137.184.253:80"
}
The code snippet above maps the `http://` and `https://` URL schemes to a proxy URL, so httpx knows which proxy to use for each type of request. Here, both schemes route through the same proxy. The proxy URL itself starts with `http://` in both entries because that's the protocol used to connect to the proxy server, not to the target site.
Next, include the proxy configuration in your request using the `proxies` parameter. Then, print the text content to verify your code works.
Putting everything together, you'll have the following complete code.
import httpx
# define your proxy settings
proxies = {
"http://": "http://216.137.184.253:80",
"https://": "http://216.137.184.253:80"
}
# make a request with the specified proxy
r = httpx.get("https://httpbin.io/ip", proxies=proxies)
print(r.text)
Run it, and you'll get your proxy's IP address.
{
"origin": "216.137.184.253:40335"
}
Awesome!
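If you'd rather check the result in code than by eye, the response body can be compared against the proxy address. The snippet below is a minimal sketch that assumes the response has the `{"origin": "<ip>:<port>"}` shape shown above and an IPv4 address; `origin_ip` and `uses_proxy_ip` are hypothetical helper names, not part of httpx.

```python
import json
from urllib.parse import urlsplit


def origin_ip(response_text: str) -> str:
    """Extract the IP from a body such as {"origin": "1.2.3.4:40335"}."""
    origin = json.loads(response_text)["origin"]
    # drop the trailing ":<port>" if one is present (IPv4 only, for simplicity)
    return origin.split(":")[0]


def uses_proxy_ip(response_text: str, proxy_url: str) -> bool:
    """Return True when the reported IP equals the host part of the proxy URL."""
    return origin_ip(response_text) == urlsplit(proxy_url).hostname


# sample body copied from the output above
sample_body = '{"origin": "216.137.184.253:40335"}'
print(uses_proxy_ip(sample_body, "http://216.137.184.253:80"))  # True
```

Keep in mind that some proxies send traffic out through a different IP than the one you connect to, so a mismatch doesn't always mean the proxy was skipped. In that case, compare the result against your own IP address instead.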
Httpx also provides the `httpx.Client` constructor, which allows you to specify a proxy URL directly as an argument. You only need to create a `Client` and pass the proxy to it, like in the example below.
import httpx
# create a client with the specified proxy
with httpx.Client(proxy="http://216.137.184.253:80") as client:
    # make requests using the client
    r = client.get("https://httpbin.io/ip")
    print(r.text)
This will yield the same result as the one above.
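Which of the two arguments you can use depends on the installed httpx version. As far as the httpx changelog indicates, `proxy` was added in 0.26 (when `proxies` was deprecated), and `proxies` was removed in 0.28. The sketch below picks the matching keyword argument from a version string. `proxy_kwargs` is a hypothetical helper, and the version cutoff is an assumption worth verifying against the changelog for your release.

```python
def proxy_kwargs(proxy_url: str, httpx_version: str) -> dict:
    """Pick the proxy keyword argument that matches an httpx version string."""
    # compare only the major and minor components, e.g. "0.27.2" -> (0, 27)
    major, minor = (int(part) for part in httpx_version.split(".")[:2])
    if (major, minor) >= (0, 26):
        # newer releases: a single proxy URL for all requests
        return {"proxy": proxy_url}
    # older releases: map each URL scheme to the proxy
    return {"proxies": {"http://": proxy_url, "https://": proxy_url}}


print(proxy_kwargs("http://216.137.184.253:80", "0.28.1"))
print(proxy_kwargs("http://216.137.184.253:80", "0.25.0"))
```

You could then create the client with `httpx.Client(**proxy_kwargs(proxy_url, httpx.__version__))`, which keeps the rest of your script unchanged across versions.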
Keep in mind that free proxies are only suitable for testing since they're unreliable and easily detected by websites. In real-world use cases, you'll need premium web scraping proxies. These proxies often require additional configuration because you must include the necessary credentials in your request.
Let's see how to authenticate an httpx proxy.
Step 2: Proxy Authentication With Httpx: Username and Password
Proxy authentication is necessary when the proxy server requires additional information, such as username and password, to allow access. This is common in corporate environments or when using premium proxy services.
To authenticate your httpx proxy, define your proxy settings using the following format: `<PROXY_PROTOCOL>://<YOUR_USERNAME>:<YOUR_PASSWORD>@<PROXY_IP_ADDRESS>:<PROXY_PORT>`.
Here's how to modify the previous code to authenticate your proxy.
import httpx
# define your proxy settings
proxy_url = "http://<YOUR_USERNAME>:<YOUR_PASSWORD>@216.137.184.253:80"
# create a client with the specified proxy and credentials
with httpx.Client(proxy=proxy_url) as client:
    # make requests using the client
    r = client.get("https://httpbin.io/ip")
    print(r.text)
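One detail to watch with this format: characters such as `@`, `:`, or `/` inside a username or password would be read as URL delimiters, so they need to be percent-encoded first. Here's a minimal sketch using Python's standard `urllib.parse.quote`. The `build_proxy_url` helper and the credentials in the example are made up for illustration.

```python
from urllib.parse import quote


def build_proxy_url(username: str, password: str, host: str, port: int, scheme: str = "http") -> str:
    """Assemble a proxy URL, percent-encoding the credentials."""
    # safe="" also encodes "/", "@" and ":", which would otherwise break the URL
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{host}:{port}"


# hypothetical credentials containing reserved characters
print(build_proxy_url("user@corp", "p@ss:word", "216.137.184.253", 80))
# http://user%40corp:p%40ss%3Aword@216.137.184.253:80
```

The resulting string can be passed as the proxy URL in the client example above.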
Step 3: Rotate Proxies With Httpx
Scraping at scale often requires rotating between multiple proxies to avoid rate limiting, throttling, or IP bans. Websites often implement restrictions on the number of requests allowed per time frame, and exceeding this limit can result in getting blocked.
By rotating proxies, you distribute your requests across different IP addresses, making it appear as if they originate from various locations or devices.
To rotate proxies with httpx, maintain a pool of proxy URLs and dynamically select a different proxy for each request.
Let's put this into practice.
To start, import the necessary module (`random`) and define your proxy pool or list. For this exercise, you can grab a few proxies from the Free Proxy List.
# import the necessary libraries
import httpx
import random
# define your proxy list
proxy_urls = [
"http://20.210.113.32:8123",
"http://47.56.110.204:8989",
"http://50.174.214.216:80",
# add more proxy URLs as needed
]
Next, select a proxy at random from the list using `random.choice()`, a function provided by Python's `random` module. Then, make your request using the selected proxy.
# select a random proxy URL
random_proxy = random.choice(proxy_urls)
# make a request using the selected proxy
with httpx.Client(proxy=random_proxy) as client:
    r = client.get("https://httpbin.io/ip")
    print(r.text)
Putting everything together, your complete code should look like this:
# import the necessary libraries
import httpx
import random
# define your proxy list
proxy_urls = [
"http://20.210.113.32:8123",
"http://47.56.110.204:8989",
"http://50.174.214.216:80",
# add more proxy URLs as needed
]
# select a random proxy URL
random_proxy = random.choice(proxy_urls)
# make a request using the selected proxy
with httpx.Client(proxy=random_proxy) as client:
    r = client.get("https://httpbin.io/ip")
    print(r.text)
To verify it works, run the script multiple times. Because the proxy is picked at random, you should see different IP addresses across runs, although the same one can occasionally repeat. Here are the results for two requests:
{
"origin": "20.210.113.32:8888"
}
{
"origin": "47.56.110.204:3128"
}
Nice job!
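Since free proxies fail often, it can help to fall back to another proxy when a request errors out instead of giving up. The sketch below is one possible approach, not a built-in httpx feature: `fetch_with_rotation` and `httpx_fetch` are hypothetical helpers, and the fetching step is passed in as a function so the rotation logic stays independent of the HTTP library.

```python
import random
from typing import Callable, Sequence


def fetch_with_rotation(
    url: str,
    proxy_urls: Sequence[str],
    fetch: Callable[[str, str], str],
    max_attempts: int = 3,
) -> str:
    """Try up to max_attempts distinct proxies, returning the first successful body."""
    # random.sample picks distinct proxies, so a failed one is not retried
    candidates = random.sample(list(proxy_urls), k=min(max_attempts, len(proxy_urls)))
    last_error = None
    for proxy in candidates:
        try:
            return fetch(url, proxy)
        except Exception as error:  # narrow this to your HTTP library's errors in real code
            last_error = error
    raise RuntimeError(f"all {len(candidates)} proxies failed") from last_error


def httpx_fetch(url: str, proxy: str) -> str:
    """Fetch url through proxy with httpx (assumes a version that accepts proxy=)."""
    import httpx  # imported here so the rotation helper has no hard dependency

    with httpx.Client(proxy=proxy, timeout=10) as client:
        response = client.get(url)
        response.raise_for_status()
        return response.text
```

With the proxy list defined earlier, a call would look like `fetch_with_rotation("https://httpbin.io/ip", proxy_urls, httpx_fetch)`.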
Choose the Best Premium Proxies to Scrape
While proxies can help avoid some instances of IP-based blocking, they aren't a foolproof anti-ban solution.
See for yourself. Try to scrape the Amazon product page below using the previous httpx proxy script.
# import the necessary libraries
import httpx
import random
# define your proxy list
proxy_urls = [
"http://20.210.113.32:8123",
"http://47.56.110.204:8989",
"http://50.174.214.216:80",
# add more proxy URLs as needed
]
# select a random proxy URL
random_proxy = random.choice(proxy_urls)
# make a request using the selected proxy
with httpx.Client(proxy=random_proxy) as client:
    r = client.get("https://www.amazon.com/Lumineux-Teeth-Whitening-Strips-Treatments-Enamel-Safe/dp/B082TPDTM2/?th=1")
    print(r.text)
You'll get the following result:
<!DOCTYPE html>
<body>
<h4>Enter the characters you see below</h4>
<p class="a-last">
Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.
</p>
<!--
-->
</body>
The script above encountered an anti-bot challenge asking you to prove you're not a robot. One quick solution to this issue is using premium proxies, but you might still get blocked, especially by advanced anti-bot systems.
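When scraping in bulk, you'll want your script to notice such a page rather than treat it as real content. Below is a rough heuristic sketch: the marker phrases are copied from the challenge page shown above, and `looks_like_challenge` is a hypothetical helper. Other sites, or other versions of this page, will likely need different markers.

```python
# phrases copied from the challenge page shown above; extend as needed
CHALLENGE_MARKERS = (
    "enter the characters you see below",
    "make sure you're not a robot",
)


def looks_like_challenge(html: str) -> bool:
    """Heuristically flag a response body that looks like an anti-bot challenge."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


blocked_page = "<h4>Enter the characters you see below</h4>"
product_page = "<title>Amazon.com: Lumineux Teeth Whitening Strips</title>"
print(looks_like_challenge(blocked_page))   # True
print(looks_like_challenge(product_page))   # False
```

A check like this can decide whether to retry the request through a different proxy or log the URL for later.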
Your best option is using a web scraping API like ZenRows. This tool provides everything you need to scrape without getting blocked, including auto-rotating premium proxies, optimized headers, anti-CAPTCHAs, and more.
In addition, ZenRows handles the technicalities associated with bypassing any anti-bot system under the hood. To access your desired data, you only need to make a single request to the ZenRows API.
Let's see how ZenRows performs with the same webpage we tried to scrape earlier.
To get started, sign up for ZenRows for free, and you'll be directed to the Request Builder page.
Paste your target URL, select the JavaScript Rendering mode, and check the box for Premium Proxies to rotate proxies automatically. Select Python as the language, and it'll generate your request code on the right.
The generated code uses the Requests library, but you can achieve the same result with httpx. You only need to make a request to the ZenRows API endpoint (`https://api.zenrows.com/v1/`) with the necessary parameters.
Your new script should look like this:
import httpx
# define URL and API key
url = 'https://www.amazon.com/Lumineux-Teeth-Whitening-Strips-Treatments-Enamel-Safe/dp/B082TPDTM2/?th=1'
apikey = '<YOUR_ZENROWS_API_KEY>'
# parameters for the ZenRows API request
params = {
'apikey': apikey,
'url': url,
'js_render': 'true',
'premium_proxy': 'true',
}
# make a GET request using httpx with the specified parameters
with httpx.Client(params=params, timeout=40) as client:
    response = client.get("https://api.zenrows.com/v1/")
    print(response.text)
Run it, and you'll get the page's HTML content.
<!DOCTYPE html>
<title>
Amazon.com: Lumineux Teeth Whitening Strips 21 Treatments...
</title>
//...
That’s how easy it is to scrape with ZenRows.
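If you're wondering why the target URL is passed as a parameter instead of being appended to the endpoint by hand, it's because the target URL has its own query string (`?th=1`) that must be escaped. The sketch below uses the standard library's `urlencode` to show roughly what the final request URL looks like. This is an illustration only, and the exact escaping httpx applies may differ slightly.

```python
from urllib.parse import urlencode

target = "https://www.amazon.com/Lumineux-Teeth-Whitening-Strips-Treatments-Enamel-Safe/dp/B082TPDTM2/?th=1"
params = {
    "apikey": "<YOUR_ZENROWS_API_KEY>",
    "url": target,
    "js_render": "true",
    "premium_proxy": "true",
}

# urlencode escapes "?", "=", "/" and ":" inside the target URL so its own
# query string (?th=1) is not mistaken for a parameter of the API request
request_url = "https://api.zenrows.com/v1/?" + urlencode(params)
print(request_url)
```

The printed URL contains a single `?`, with the Amazon address fully escaped inside the `url` parameter.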
Conclusion
Setting an httpx proxy in Python can help you route your requests through a different IP address and avoid IP bans. However, it's important to remember that proxies aren't foolproof. Even premium proxies can be blocked by advanced anti-bot systems.
For guaranteed web scraping results, give ZenRows a try today.