Are your requests failing due to connection issues, HTTP errors, or other unexpected problems? While Python's Requests library simplifies making HTTP requests, it can sometimes fail to retrieve the desired data because of various errors.
This tutorial will explain the causes of failed requests and teach various ways to create Python's Requests retry mechanisms to reattempt your requests.
The two main methods we'll cover are:
- Using an existing retry wrapper: Python Sessions with HTTPAdapter.
- Coding your own retry wrapper.
Let's go!
Python Requests Retry Logic Explained
The request retry mechanism is a technique where you instruct your code to resend a request automatically if an HTTP error occurs. The retry logic is usually decision-based, depending on why and when the failure occurred.
Are request retries needed in all cases? How often should you retry when a request fails, and how many attempts should you make? We'll answer these questions in the upcoming section.
Important Concepts for Retry Logic
Not all request errors should trigger a retry. It's important to evaluate the cause of failure and decide whether retrying is appropriate.
It's best to only apply retries in specific scenarios, such as transient server problems (e.g., 5xx errors or network timeouts).
However, retrying after client-side or permanent issues, such as 404 (page not found), 422 or 400 (invalid data) or 401 authentication credential errors, is generally not appropriate.
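To make that decision concrete, here's a minimal sketch of how you might encode it (the helper name and the exact status list are illustrative choices, not a standard):

# a hypothetical helper to decide whether a failed request is worth retrying
TRANSIENT_STATUSES = {429, 500, 502, 503, 504}

def should_retry(status_code):
    # retry only transient, server-side problems; skip permanent client errors
    return status_code in TRANSIENT_STATUSES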
You can retry a request immediately after failure, but that strategy can overwhelm the server and get you blocked. Instead, implement a delay between retries. However, while web scraping, a constant delay (say, a fixed 5 seconds) creates a predictable, bot-like pattern, making it easier for websites to block you.
The recommended approach to avoid that is to use an exponential backoff strategy. This technique gradually increases the wait time between retries after a failed request, relieving the stress on the server and allowing you to mimic a regular user.
Types of Failed Requests
Understanding the reasons behind a failed request will allow you to develop mitigation strategies for each case. We can categorize failed requests into:
- Requests that timed out due to server problems (there's no HTTP response from the server).
- Requests that returned an HTTP error.
Let's see each one.
Requests That Timed Out
No HTTP response is returned when a request times out because the client didn't receive a reply within the specified time frame. It could happen for several reasons, such as server overload, server response issues, or slow network connections.
When faced with timeout scenarios, consider checking your internet connection. A stable connection may suggest the problem is server-related.
In Python, you can catch exceptions related to timeouts, such as requests.exceptions.Timeout
, and implement a Python retry mechanism conditionally or with strategies like exponential backoff. We'll look at these later on in this guide.
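For example, a bare-bones version of that idea could look like the following sketch (the retry count, timeout value, and target URL are placeholders):

import requests

url = "https://www.scrapingcourse.com/ecommerce"

for attempt in range(3):
    try:
        # give the server up to 5 seconds to respond
        response = requests.get(url, timeout=5)
        break
    except requests.exceptions.Timeout:
        print(f"attempt {attempt + 1} timed out, retrying...")
else:
    print("all attempts timed out")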
Requests That Returned an HTTP Error
In this case, an HTTP client establishes a connection with the server but receives an HTTP error status. This scenario indicates that the server is active, but the request cannot be processed successfully.
This type of request failure typically comes with a specific status code and an error message telling you what went wrong, including additional information that can provide insights into the problem.
For instance, this could be a 403 forbidden error:
403 Client Error: Forbidden for url: https://www.scrapingcourse.com/cloudflare-challenge
Your first approach to addressing this issue is to review the HTTP status code and error message while ensuring the request is formed correctly. If you suspect the error results from a temporary problem or server issues, you may retry the request cautiously.
Many factors can contribute to HTTP errors. Common causes in scraping include:
- Anti-bot Detection Measures: Often result in a 403 Forbidden error when the server rejects client requests identified as bots.
- SSL Handshake Failures: These can result in a 525 SSL Handshake Failure, indicating that the server failed to establish a secure connection with the client.
- Redirect Issues: A "Too Many Redirects" error can occur when a website creates a redirect loop, often due to bot-like request headers or suspicious behavior that triggers excessive redirections.
HTTP Status Error Codes for Failed Python Requests
Errors in client-server communication fall into the 4xx (client error) and 5xx (server error) code ranges. They include:
- 400 Bad Request.
- 401 Unauthorized.
- 403 Forbidden.
- 404 Not Found.
- 405 Method Not Allowed.
- 408 Request Timeout.
- 429 Too Many Requests.
- 500 Internal Server Error.
- 501 Not Implemented.
- 502 Bad Gateway.
- 503 Service Unavailable.
- 504 Gateway Timeout.
- 505 HTTP Version Not Supported.
The most common ones you'll see while web scraping are:
| Error Code | Explanation |
|---|---|
| 403 Forbidden | The server understands the request but won't fulfill it because the client lacks the required permissions or access. |
| 429 Too Many Requests | The server has received too many requests from the same IP within a given time frame; this is the typical rate-limiting response in web scraping. |
| 500 Internal Server Error | A generic server error occurred, indicating that something went wrong on the server while processing the request. |
| 502 Bad Gateway | The server, acting as a gateway or proxy, received an invalid response from an upstream server. |
| 503 Service Unavailable | The server is too busy or undergoing maintenance and can't handle the request at the moment. |
| 504 Gateway Timeout | An upstream server didn't respond quickly enough to the gateway or proxy. |
You can check out the MDN docs for more information on HTTP response status codes.
While we've provided an overview of some retry concepts, we'll dive into more detail in the following section.
Number of Retries
How often should you retry failed Python requests?
There's no specific standard for the number of retry attempts. You should tailor it to your scraping needs and the site's behavior. However, ensure your retry frequency remains reasonable.
For example, when scraping multiple pages from the same website, it's good practice to set fewer retries to avoid overwhelming the server with indefinite requests. This can lead to potential performance issues and eventual blocking.
Temporary errors like "429 Too Many Requests" warrant more retries than permanent ones. A common approach is to set a maximum number of retries (e.g., 3 to 5) based on the specific site's response patterns and your scraping objectives.
Request Delay
Setting delays between requests prevents website and API overload while maintaining compliance with rate limits. You can set static or random delays or use exponential backoffs.
Fixed or Random Delay
A fixed delay means waiting the same time before each retry after a failed request. You can implement it in Python using the sleep()
function from the time module. However, fixed delays can appear unnatural and bot-like because human response times vary.
To make retries look more natural and reduce suspicion, you can randomize the delay within a specific range. For instance, you can sleep for a random 1 to 5 seconds before retrying requests.
Like the number of retries, there isn't a standard rule for how long the delay should be. But, you can experiment with different reasonable delay values (e.g., between 300ms and 500ms) to find an optimal balance.
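As a rough sketch, a randomized delay between retries could look like this (the retry count and the 1-5 second range are example values):

import random
import time

import requests

url = "https://www.scrapingcourse.com/ecommerce"

for attempt in range(3):
    response = requests.get(url)
    if response.status_code == 200:
        break
    # wait a random 1-5 seconds before the next attempt
    delay = random.uniform(1, 5)
    print(f"retrying in {delay:.1f} seconds")
    time.sleep(delay)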
Backoff Strategy for the Delays
The backoff strategy is a technique for progressively increasing the delay between retries instead of using a fixed or random one. After each failed attempt, the delay grows exponentially according to a backoff factor, usually greater than one. This approach helps you handle temporary issues while avoiding server overload.
The backoff algorithm is as simple as the following:
backoff_factor * (2 ** (current_number_of_retries - 1))
The backoff factor is multiplied by 2 raised to the power of the retry count minus 1 (since the retry count starts at 0 in our examples, the first delay is half the backoff factor). For example, here are the delay sequences for backoff factors 2, 3, and 10:
# 2
1, 2, 4, 8, 16, 32, 64, 128
# 3
1.5, 3, 6, 12, 24, 48, 96, 192
# 10
5, 10, 20, 40, 80, 160, 320, 640
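You can verify these sequences with a quick sketch of the formula (matching the custom retry code later in this tutorial, where the attempt counter starts at 0):

def backoff_delay(backoff_factor, attempts):
    # backoff algorithm: factor * 2^(attempts - 1)
    return backoff_factor * (2 ** (attempts - 1))

# print the first 8 delays for a backoff factor of 2
print([backoff_delay(2, attempt) for attempt in range(8)])
# [1.0, 2, 4, 8, 16, 32, 64, 128]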
Now, let's use this knowledge to create our request retry strategies.
Best Methods for Python Requests Retry
This section will teach you the best methods to retry failed requests in Python so you can build a robust web crawler. They include:
- Using an existing retry wrapper: Python Sessions with HTTPAdapter.
- Coding your own retry wrapper.
We recommend the former, but the latter might suit some scenarios where you want more customization.
Method 1: Use an Existing Retry Wrapper: Python Sessions with HTTPAdapter
Python Requests uses the urllib3 HTTP client under the hood. You can set up retries in Python with Requests' HTTP adapter class and the Retry utility class from the urllib3 package. The HTTPAdapter class lets you specify a retry strategy and change the request behavior.
To implement simple Python Requests retry logic with the HTTPAdapter class, import the required libraries and define your options. In this example, we set the maximum number of retries to 4 and only reattempt if the response has a status code of 403, 429, 500, 502, 503, or 504:
# pip3 install requests
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry
# define the retry strategy
retry_strategy = Retry(
total=4, # maximum number of retries
status_forcelist=[403, 429, 500, 502, 503, 504], # the HTTP status codes to retry on
)
Pass the retry strategy to the HTTPAdapter
in a new adapter object. Then, mount the adapter to a session object and use it for all requests:
# ...
# create an HTTP adapter with the retry strategy and mount it to the session
adapter = HTTPAdapter(max_retries=retry_strategy)
# create a new session object
session = requests.Session()
session.mount("http://", adapter)
session.mount("https://", adapter)
# make a request using the session object
response = session.get("https://www.scrapingcourse.com/ecommerce")
if response.status_code == 200:
print(f"SUCCESS: {response.text}")
else:
print(f"FAILED with status {response.status_code}")
Here's the complete code:
# pip3 install requests
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry
# define the retry strategy
retry_strategy = Retry(
total=4, # maximum number of retries
    status_forcelist=[403, 429, 500, 502, 503, 504],  # the HTTP status codes to retry on
)
# create an HTTP adapter with the retry strategy and mount it to the session
adapter = HTTPAdapter(max_retries=retry_strategy)
# create a new session object
session = requests.Session()
session.mount("http://", adapter)
session.mount("https://", adapter)
# make a request using the session object
response = session.get("https://www.scrapingcourse.com/ecommerce")
if response.status_code == 200:
print(f"SUCCESS: {response.text}")
else:
print(f"FAILED with status {response.status_code}")
The above code now implements simple retry logic using the HTTPAdapter.
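One caveat: with status_forcelist, urllib3's Retry raises a MaxRetryError once the retries are exhausted (its raise_on_status option defaults to True), which Requests surfaces as requests.exceptions.RetryError instead of returning the last response. If you'd rather inspect the final response yourself, you can catch that exception or pass raise_on_status=False to Retry. A minimal sketch of the first option:

# ...
try:
    response = session.get("https://www.scrapingcourse.com/ecommerce")
    print(f"final status: {response.status_code}")
except requests.exceptions.RetryError as error:
    # all retries on a forcelisted status code were used up
    print(f"FAILED after retries: {error}")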
Sessions and HTTPAdapter With a Backoff Strategy
To set increasing delays between retries with the backoff strategy, add the backoff_factor
parameter to the retry wrapper:
# ...
# define the retry strategy
retry_strategy = Retry(
total=4, # maximum number of retries
backoff_factor=2, # exponential backoff factor
    status_forcelist=[403, 429, 500, 502, 503, 504],  # the HTTP status codes to retry on
)
# ... mount the adapter and define your scraping logic
You just added exponential backoff to your scraper using the HTTPAdapter.
Interested in building a custom retry strategy instead? Keep reading in the next section.
Method 2: Code Your Own Retry Wrapper
Unlike the previous option, we'll create a custom wrapper for the retry logic. Doing it yourself lets you implement custom error handlers, logs, and more.
To keep it straightforward, let's create a Python function (retry_request
) to replicate the retry logic of method 1.
The function accepts the target URL as its first argument, then the max retries (total
), and status_forcelist
to specify the type of errors to retry the request:
# pip3 install requests
import requests
def retry_request(
url,
total=4,
status_forcelist=[
403,
429,
500,
502,
503,
504,
],
**kwargs,
):
# store the last response in an empty variable
last_response = None
# implement retry
for _ in range(total):
try:
response = requests.get(url, **kwargs)
if response.status_code in status_forcelist:
# track the last response
last_response = response
# retry request
continue
else:
return response
except requests.exceptions.ConnectionError:
pass
    # return the last response after exhausting all retries
return last_response
response = retry_request("https://www.scrapingcourse.com/ecommerce")
if response.status_code == 200:
print(f"SUCCESS: {response.text}")
else:
print(f"FAILED with status {response.status_code}")
Now, let's improve the above code with an exponential backoff.
Retry Python Requests With a Custom Backoff Strategy
To retry Python Requests with a custom backoff, take the previous code as a base. Then, create a separate function named backoff_delay
to calculate the delay:
def backoff_delay(backoff_factor, attempts):
# backoff algorithm
delay = backoff_factor * (2 ** (attempts - 1))
return delay
Add the backoff factor and implement the exponential delay using the sleep() function from the time module (add from time import sleep to your imports):
def retry_request(
url,
backoff_factor=2,
total=4,
status_forcelist=[
403,
429,
500,
502,
503,
504,
],
**kwargs,
):
# store the last response in an empty variable
last_response = None
# implement retry
for attempt in range(total):
try:
response = requests.get(url, **kwargs)
if response.status_code in status_forcelist:
# implement backoff
delay = backoff_delay(backoff_factor, attempt)
sleep(delay)
print(f"retrying in {delay} seconds")
# track the last response
last_response = response
# retry request
continue
else:
return response
except requests.exceptions.ConnectionError:
pass
    # return the last response after exhausting all retries
return last_response
response = retry_request("https://www.scrapingcourse.com/ecommerce")
if response.status_code == 200:
print(f"SUCCESS: {response.text}")
else:
print(f"FAILED with status {response.status_code}")
Are you still getting blocked despite your backoffs? Let's solve that quickly in the next section!
Avoid Getting Blocked by Error 403 With Python Requests
Getting blocked is the biggest problem in web scraping. Some websites may block your IP address or use anti-bot measures like Cloudflare to prevent you from accessing the site if you are detected as a bot.
Unfortunately, delaying requests, retrying, or using exponential backoffs is usually insufficient to get past these anti-bot mechanisms.
To prove this, let's attempt to scrape the Cloudflare challenge page using the previous scraper:
# pip3 install requests
import requests
from time import sleep
def backoff_delay(backoff_factor, attempts):
# backoff algorithm
delay = backoff_factor * (2 ** (attempts - 1))
return delay
def retry_request(
url,
backoff_factor=2,
total=4,
status_forcelist=[
403,
429,
500,
502,
503,
504,
],
**kwargs,
):
# store the last response in an empty variable
last_response = None
# implement retry
for attempt in range(total):
try:
response = requests.get(url, **kwargs)
if response.status_code in status_forcelist:
delay = backoff_delay(backoff_factor, attempt)
sleep(delay)
print(f"retrying in {delay} seconds")
# track the last response
last_response = response
# retry request
continue
else:
return response
except requests.exceptions.ConnectionError:
pass
    # return the last response after exhausting all retries
return last_response
response = retry_request("https://www.scrapingcourse.com/cloudflare-challenge")
if response.status_code == 200:
print(f"SUCCESS: {response.text}")
else:
print(f"FAILED with status {response.status_code}")
The above code retried the request four times but still failed with a 403 Forbidden error, despite implementing exponential backoff:
retrying in 1.0 seconds
retrying in 2 seconds
retrying in 4 seconds
retrying in 8 seconds
FAILED with status 403
Web scraping proxies may seem like a solution in this scenario. However, proxies alone often aren't enough, and individual libraries and manual customizations are unlikely to keep up with evolving anti-bot systems.
Many developers use web scraping APIs as dedicated toolkits to bypass the 403 Forbidden error and avoid getting blocked.
ZenRows is a popular web scraping solution with an all-in-one scraper API featuring premium proxy rotation, anti-bot auto-bypass, JavaScript rendering support, and more. With a 98.7% success rate, the ZenRows scraper API also provides free automatic request retries following best practices. This ensures you get all the data you want at scale without limitations.
You can integrate the ZenRows scraper API seamlessly with Python Requests. Let's see it in action!
Sign up on ZenRows to open the Request Builder. Paste your target URL in the link box and activate Premium Proxies and JS Rendering. Then, select Python as your programming language and choose the API connection mode.
Copy the generated Python code and paste it into your scraper.
The generated Python code should look like this:
# pip install requests
import requests
url = "https://www.scrapingcourse.com/cloudflare-challenge"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
"url": url,
"apikey": apikey,
"js_render": "true",
"premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)
The above code bypasses Cloudflare and outputs the protected site's full-page HTML:
<html lang="en">
<head>
<!-- ... -->
<title>Cloudflare Challenge - ScrapingCourse.com</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<h2>
You bypassed the Cloudflare challenge! :D
</h2>
<!-- other content omitted for brevity -->
</body>
</html>
Nice job 🎉! You just used Python's Requests and ZenRows' scraper API to bypass anti-bot protection.
Best Practice: Retry Python Requests With a Decorator
Using a decorator to implement retries offers a cleaner approach. You can easily apply the Python Requests retry logic in the decorator to multiple methods or functions.
Instead of coding the decorator yourself, you can use Tenacity, a community-maintained package that simplifies adding retry behavior to requests.
Start by installing Tenacity:
pip3 install tenacity
The retry
decorator from Tenacity takes in arguments like stop
for the maximum number of retries and wait
for the delay strategy between retries, among others. Feel free to learn more from the official Tenacity documentation.
Configure the retry
decorator as shown:
# pip3 install requests tenacity
# ...
from tenacity import retry, stop_after_attempt, wait_exponential
# define the retry decorator
@retry(
stop=stop_after_attempt(4), # maximum number of retries
wait=wait_exponential(multiplier=5, min=4, max=5), # exponential backoff
)
Now, place your scraper function directly under the decorator. Update the above snippet, and you'll get this complete code:
# pip3 install requests tenacity
import requests
from tenacity import retry, stop_after_attempt, wait_exponential
# define the retry decorator
@retry(
stop=stop_after_attempt(4), # maximum number of retries
wait=wait_exponential(multiplier=5, min=4, max=5), # exponential backoff
)
def scraper(url):
response = requests.get(url)
return response
# target url
url = "https://www.scrapingcourse.com/ecommerce"
try:
response = scraper(url)
if response.status_code == 200:
print(f"SUCCESS: {response.text}")
else:
print(f"FAILED with status code: {response.status_code}")
except requests.RequestException as e:
print(f"FAILED: {e}")
The above code uses Tenacity's retry
decorator to control your Python request retry logic.
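Keep in mind that, by default, Tenacity only retries when the decorated function raises an exception. Since requests.get() doesn't raise on HTTP error status codes, the decorator above won't retry a 403 or 429 response on its own. A common workaround (sketched below, with reraise=True added as an optional extra so the original exception propagates after the last attempt) is to call response.raise_for_status() inside the function:

# pip3 install requests tenacity
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(4),  # maximum number of retries
    wait=wait_exponential(multiplier=5, min=4, max=5),  # exponential backoff
    reraise=True,  # re-raise the last exception instead of wrapping it in a RetryError
)
def scraper(url):
    response = requests.get(url)
    # raise requests.HTTPError on 4xx/5xx responses so Tenacity retries the call
    response.raise_for_status()
    return response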
POST Retry With Python Requests
In addition to the GET method, you can retry other HTTP methods, such as POST for creating new resources on the server (for example, to submit a form) and PUT for updating existing resources.
You can use Tenacity to implement a retry on a POST request by replacing requests.get
with requests.post
. Try it with this login challenge page using the default login credentials (email: [email protected], password: password):
# pip3 install requests tenacity
import requests
from tenacity import retry, stop_after_attempt, wait_exponential
# define the retry decorator
@retry(
stop=stop_after_attempt(4), # Maximum number of retries
wait=wait_exponential(multiplier=1, min=1, max=60), # Exponential backoff
)
def scraper(url, payload):
response = requests.post(url, payload)
return response
# target url
url = "https://www.scrapingcourse.com/login"
# the payload with your login credentials
payload = {
"email": "[email protected]",
"password": "password",
}
try:
response = scraper(url, payload)
if response.status_code == 200:
print(f"SUCCESS: {response.text}")
else:
print(f"FAILED with status code: {response.status_code}")
except requests.RequestException as e:
print(f"FAILED: {e}")
Like the previous scraper, the above code retries failed POST requests using Tenacity's decorator.
Conclusion
Handling failed requests is critical to building a robust and reliable web scraper. In this tutorial, we showed you the importance of retrying failed requests and how to apply retry logic to your Python scraper using third-party libraries and custom methods with the Requests library.
Now you know:
- The most essential Python Requests retry logic considerations.
- The two best options for retries.
- How to retry requests for different HTTP methods.
Remember that most websites use anti-bot measures to prevent you from scraping their data. To overcome this barrier and scrape any website at scale without getting blocked, we recommend using a web scraper API like ZenRows.
Try ZenRows for free now without a credit card!