Do you want to customize your urllib request headers while scraping with Python? You’ve come to the right place!
In this article, you'll learn how to configure custom request headers in urllib.
Why Are Headers Important for Urllib?
Headers carry metadata about the request source (the client) and tell the recipient (the server) how to handle the response during an HTTP interaction. The client is usually a browser or an HTTP client library like Python's Requests or urllib.
There are two types of HTTP headers: request headers and response headers. Request headers matter most during web scraping because they convey information about the HTTP client.
Customizing the request headers while scraping with urllib lets you mimic a legitimate browser and avoid anti-bot detection. Another use case is manipulating session cookies to authenticate your scraper and extract data behind a login, as sketched below.
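For example, here's a minimal sketch of the cookie use case, assuming a placeholder sessionid value that you'd obtain after logging in:
# import the required library
from urllib import request

# attach a placeholder session cookie to authenticate the scraper
authenticated_request = request.Request(
    url="https://httpbin.io/headers",
    headers={"Cookie": "sessionid=YOUR_SESSION_COOKIE"}
)

# send the request and print the echoed headers
with request.urlopen(authenticated_request) as response:
    print(response.read().decode("utf-8"))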
Let's check the default urllib headers by requesting https://httpbin.io/headers, a web page that returns your current HTTP request headers:
# import the required libraries
from urllib import request, error

# catch HTTP errors
try:
    # send the request and obtain a response object
    response = request.urlopen("https://httpbin.io/headers").read()
    # print a decoded format to view your request headers
    print(response.decode("utf-8"))
except error.HTTPError as e:
    print(e)
The urllib library sends the following incomplete request headers, making your scraper vulnerable to anti-bot detection:
{
  "headers": {
    "Accept-Encoding": [
      "identity"
    ],
    "Connection": [
      "close"
    ],
    "Host": [
      "httpbin.io"
    ],
    "User-Agent": [
      "Python-urllib/3.12"
    ]
  }
}
Open the same URL (https://httpbin.io/headers) in a legitimate browser like Chrome, and you'll see detailed request headers like these:
{
  "headers": {
    "Accept": [
      "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"
    ],
    "Accept-Encoding": [
      "gzip, deflate, br, zstd"
    ],
    "Accept-Language": [
      "en-US,en;q=0.9"
    ],
    "Connection": [
      "keep-alive"
    ],
    "Host": [
      "httpbin.io"
    ],
    "Referer": [
      "https://www.google.com/"
    ],
    "Sec-Ch-Ua": [
      "\"Google Chrome\";v=\"123\", \"Not:A-Brand\";v=\"8\", \"Chromium\";v=\"123\""
    ],
    "Sec-Ch-Ua-Mobile": [
      "?0"
    ],
    "Sec-Ch-Ua-Platform": [
      "\"Windows\""
    ],
    "Sec-Fetch-Dest": [
      "document"
    ],
    "Sec-Fetch-Mode": [
      "navigate"
    ],
    "Sec-Fetch-Site": [
      "cross-site"
    ],
    "Upgrade-Insecure-Requests": [
      "1"
    ],
    "User-Agent": [
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    ]
  }
}
Compare this result with urllib's request headers, and you'll see that urllib is missing several important ones.
First, urllib's User-Agent header identifies your scraper as a Python bot, while its Accept-Encoding value (identity) doesn't match what a real browser sends. The Referer, Accept, Sec-Ch-Ua (client hint user agent), Sec-Ch-Ua-Platform, and Accept-Language headers are missing entirely.
These gaps make your urllib web scraper vulnerable to anti-bot detection. You'll learn how to fix them in the next section.
How to Set Up Custom Headers With Urllib
As you've seen, some urllib request headers are missing while others are inaccurate. In this section, you'll learn three strategies for setting custom request headers while scraping with urllib. In each case, you'll request https://httpbin.io/headers to view your current request headers.
Add Headers With Urllib
Setting custom request headers in urllib requires passing a dictionary of header strings to a Request object. Let's add the missing headers and correct the misconfigured ones to see how it works.
First, define the new headers in a dictionary. Note that the Sec-Fetch-Mode and Sec-Fetch-Site headers added below complement the Referer: they inform the server that the client is navigating from another website (Google):
# import the required libraries
from urllib import request, error

# define new request headers
missing_headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Ch-Ua": "\"Google Chrome\";v=\"123\", \"Not:A-Brand\";v=\"8\", \"Chromium\";v=\"123\"",
    "Referer": "https://www.google.com/",
    "Sec-Ch-Ua-Platform": "\"Windows\"",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "cross-site",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
Next, pass the new headers to a Request object and open the target website to view your HTTP request headers:
# ...

# catch HTTP errors
try:
    # create a Request object with the missing request headers
    request_params = request.Request(
        url="https://httpbin.io/headers",
        headers=missing_headers
    )
    # send the request with the parameters and obtain a response object
    response = request.urlopen(request_params).read()
    # print a decoded format to view your request headers
    print(response.decode("utf-8"))
except error.HTTPError as e:
    print(e)
Combine both snippets, and you'll get the following complete code:
# import the required libraries
from urllib import request, error

# define new request headers
missing_headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Ch-Ua": "\"Google Chrome\";v=\"123\", \"Not:A-Brand\";v=\"8\", \"Chromium\";v=\"123\"",
    "Referer": "https://www.google.com/",
    "Sec-Ch-Ua-Platform": "\"Windows\"",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "cross-site",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}

# catch HTTP errors
try:
    # create a Request object with the request headers
    request_params = request.Request(
        url="https://httpbin.io/headers",
        headers=missing_headers
    )
    # send the request with the parameters and obtain a response object
    response = request.urlopen(request_params).read()
    # print a decoded format to view your request headers
    print(response.decode("utf-8"))
except error.HTTPError as e:
    print(e)
The code overrides urllib's default request headers and adds the missing ones, as shown:
{
  "headers": {
    "Accept": [
      "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"
    ],
    "Accept-Encoding": [
      "gzip, deflate, br, zstd"
    ],
    "Accept-Language": [
      "en-US,en;q=0.9"
    ],
    "Connection": [
      "close"
    ],
    "Host": [
      "httpbin.io"
    ],
    "Referer": [
      "https://www.google.com/"
    ],
    "Sec-Ch-Ua": [
      "\"Google Chrome\";v=\"123\", \"Not:A-Brand\";v=\"8\", \"Chromium\";v=\"123\""
    ],
    "Sec-Ch-Ua-Platform": [
      "\"Windows\""
    ],
    "Sec-Fetch-Mode": [
      "navigate"
    ],
    "Sec-Fetch-Site": [
      "cross-site"
    ],
    "User-Agent": [
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    ]
  }
}
Your scraper now sends the new headers when requesting your target web page. In the next section, you'll see how to edit some of these headers.
Edit a Header's Values
Editing a header in urllib means updating the values on an existing Request object before resending it. It's handy when you want to use different header values for different web pages. The Request class also provides a few helper methods for working with headers, shown in the sketch below.
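Here's a quick sketch of those helpers with a placeholder user agent value; note that urllib normalizes header names, keeping only the first letter uppercase:
# import the required library
from urllib import request

# create a Request object with an initial header
req = request.Request(
    url="https://httpbin.io/headers",
    headers={"User-Agent": "example-agent"}
)

# urllib stores the name as "User-agent" (only the first letter capitalized)
print(req.has_header("User-agent"))  # True
print(req.get_header("User-agent"))  # example-agent

# add_header() overwrites any existing value for the same header
req.add_header("User-Agent", "another-agent")
print(req.get_header("User-agent"))  # another-agent

# remove the header entirely
req.remove_header("User-agent")
print(req.header_items())  # []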
Assume you're scraping two web pages and want to use a Windows user agent for the first and a Linux one for the second. Let's request the target website twice and print the header values of each request to see how to achieve that.
For the first request, you'll retain the previous headers dictionary containing a Windows user agent:
# import the required libraries
from urllib import request, error

# define new request headers
missing_headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Ch-Ua": "\"Google Chrome\";v=\"123\", \"Not:A-Brand\";v=\"8\", \"Chromium\";v=\"123\"",
    "Referer": "https://www.google.com/",
    "Sec-Ch-Ua-Platform": "\"Windows\"",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "cross-site",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}

# catch HTTP errors
try:
    # create a Request object with the request headers
    request_params = request.Request(
        url="https://httpbin.io/headers",
        headers=missing_headers
    )
    # send the first request with the existing Windows user agent
    response_1 = request.urlopen(request_params).read()
    # print a decoded format to view the first request headers
    print(f"First header used: {response_1.decode('utf-8')}")
except error.HTTPError as e:
    print(e)
Next, edit the Request object's headers to switch the user agent string to Linux. You'll also need to change the client hint platform header (Sec-Ch-Ua-Platform) to Linux to stay consistent with the new user agent:
# ...

# catch HTTP errors
try:
    # ...

    # use a Linux user agent and platform for the second request
    request_params.add_header("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36")
    request_params.add_header("Sec-Ch-Ua-Platform", "\"Linux\"")
    # send the second request with the edited headers
    response_2 = request.urlopen(request_params).read()
    # print a decoded format to view your request headers
    print(f"Second header used: {response_2.decode('utf-8')}")
except error.HTTPError as e:
    print(e)
Combine the two snippets, and you'll get the following complete code:
# import the required libraries
from urllib import request, error

# define new request headers
missing_headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Ch-Ua": "\"Google Chrome\";v=\"123\", \"Not:A-Brand\";v=\"8\", \"Chromium\";v=\"123\"",
    "Referer": "https://www.google.com/",
    "Sec-Ch-Ua-Platform": "\"Windows\"",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "cross-site",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}

# catch HTTP errors
try:
    # create a Request object with the request headers
    request_params = request.Request(
        url="https://httpbin.io/headers",
        headers=missing_headers
    )
    # send the first request with the existing Windows user agent
    response_1 = request.urlopen(request_params).read()
    # print a decoded format to view the first request headers
    print(f"First header used: {response_1.decode('utf-8')}")

    # use a Linux user agent and platform for the second request
    request_params.add_header("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36")
    request_params.add_header("Sec-Ch-Ua-Platform", "\"Linux\"")
    # send the second request with the edited headers
    response_2 = request.urlopen(request_params).read()
    # print a decoded format to view your request headers
    print(f"Second header used: {response_2.decode('utf-8')}")
except error.HTTPError as e:
    print(e)
The code outputs the headers for both requests. Note the difference in the User-Agent and Sec-Ch-Ua-Platform values between the two outputs:
First header used: {
  "headers": {
    "Accept": [
      "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"
    ],
    "Accept-Encoding": [
      "gzip, deflate, br, zstd"
    ],
    "Accept-Language": [
      "en-US,en;q=0.9"
    ],
    "Connection": [
      "close"
    ],
    "Host": [
      "httpbin.io"
    ],
    "Referer": [
      "https://www.google.com/"
    ],
    "Sec-Ch-Ua": [
      "\"Google Chrome\";v=\"123\", \"Not:A-Brand\";v=\"8\", \"Chromium\";v=\"123\""
    ],
    "Sec-Ch-Ua-Platform": [
      "\"Windows\""
    ],
    "Sec-Fetch-Mode": [
      "navigate"
    ],
    "Sec-Fetch-Site": [
      "cross-site"
    ],
    "User-Agent": [
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    ]
  }
}
Second header used: {
  "headers": {
    "Accept": [
      "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"
    ],
    "Accept-Encoding": [
      "gzip, deflate, br, zstd"
    ],
    "Accept-Language": [
      "en-US,en;q=0.9"
    ],
    "Connection": [
      "close"
    ],
    "Host": [
      "httpbin.io"
    ],
    "Referer": [
      "https://www.google.com/"
    ],
    "Sec-Ch-Ua": [
      "\"Google Chrome\";v=\"123\", \"Not:A-Brand\";v=\"8\", \"Chromium\";v=\"123\""
    ],
    "Sec-Ch-Ua-Platform": [
      "\"Linux\""
    ],
    "Sec-Fetch-Mode": [
      "navigate"
    ],
    "Sec-Fetch-Site": [
      "cross-site"
    ],
    "User-Agent": [
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    ]
  }
}
You just edited the request headers to use a different user agent and platform for two requests. Congratulations!
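If you need more than two variations, the same technique scales with a loop. Here's a minimal sketch, assuming the request_params object and imports from the previous snippet and a list of example user agent strings:
# example user agents to rotate through
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
]

# catch HTTP errors
try:
    for user_agent in user_agents:
        # add_header() overwrites the previous user agent value
        request_params.add_header("User-Agent", user_agent)
        response = request.urlopen(request_params).read()
        print(response.decode("utf-8"))
except error.HTTPError as e:
    print(e)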
Set the Order of the Headers
The order of the header set as a whole doesn't significantly affect your scraper's chances of detection. However, a strict ordering rule applies to request header fields with comma-separated values.
Browsers like Chrome arrange comma-separated request header values in a specific way. Using the same arrangement strengthens your scraper's ability to mimic a legitimate browser.
Headers with comma-separated values include the encoding type (Accept-Encoding), language type (Accept-Language), and client hint user agent (Sec-Ch-Ua) fields. Here's how Chrome arranges the values of each of these fields:
{
  "Accept-Encoding": [
    "gzip, deflate, br, zstd"
  ],
  "Accept-Language": [
    "en-US,en;q=0.9"
  ],
  "Sec-Ch-Ua": [
    "\"Google Chrome\";v=\"123\", \"Not:A-Brand\";v=\"8\", \"Chromium\";v=\"123\""
  ]
}
Changing the value order within each field, as shown below, can get you blocked:
{
  "Accept-Encoding": [
    "br, deflate, gzip, zstd"
  ],
  "Accept-Language": [
    "en;q=0.9,en-US"
  ],
  "Sec-Ch-Ua": [
    "\"Not:A-Brand\";v=\"8\", \"Google Chrome\";v=\"123\", \"Chromium\";v=\"123\""
  ]
}
A nifty way to mimic a typical browser's arrangement is to inspect the request headers in your browser's Network tab.
To do that, open your target website in a browser like Chrome and go to the Network tab. Reload the web page and click a request name in the request table. Scroll to the Request Headers section, then copy the comma-separated fields into your headers dictionary.
Once done, the comma-separated fields in your request headers should look like this:
# import the required libraries
from urllib import request, error

# define new request headers
missing_headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Ch-Ua": "\"Google Chrome\";v=\"123\", \"Not:A-Brand\";v=\"8\", \"Chromium\";v=\"123\"",
    "Referer": "https://www.google.com/",
    "Sec-Ch-Ua-Platform": "\"Windows\"",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "cross-site",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}

# catch HTTP errors
try:
    # create a Request object with the request headers
    request_params = request.Request(
        url="https://httpbin.io/headers",
        headers=missing_headers
    )
    # send the request with the parameters and obtain a response object
    response = request.urlopen(request_params).read()
    # print a decoded format to view your request headers
    print(response.decode("utf-8"))
except error.HTTPError as e:
    print(e)
You now know how to set the right header order for your urllib scraper to mimic a real browser. That’s great!
Conclusion
In this tutorial, you've seen how to customize the request headers while scraping with urllib. Here's a recap of what you've learned:
- Adding missing request headers to your urllib web scraper.
- Intercepting the urllib request to edit existing HTTP request headers.
- The strategy for setting the appropriate request headers order.
However, bypassing advanced anti-bot systems requires more than setting custom request headers. Plus, header management gets complicated at scale, increasing your chances of getting blocked. We recommend using ZenRows with your web scraper to manage your headers, auto-rotate proxies, and solve all anti-bot-related problems. Try ZenRows for free!