Your Playwright HTTP headers can determine the success of your web scraper. You need to configure them properly to avoid getting blocked.
In this tutorial, you'll learn how to set the request headers while web scraping with Playwright, including header tactics to avoid anti-bot detection.
Why Are Playwright Headers Important?
Headers contain information about a client's request and the server's response. There are two types of HTTP headers: the request and the response headers.
Request headers carry the metadata a client sends with each request, while response headers describe the data the server returns to that client.
This tutorial focuses on request headers because they're the most relevant to web scraping. Playwright sends request headers to the target website's server, and their values strongly influence whether an anti-bot system treats your scraper as a real browser.
Customizing Playwright's request headers adds some benefits to your scraper. For instance, changing the user agent and the Referer headers can increase your chances of bypassing anti-bots. The cookies header can also be essential for managing sessions while scraping behind a login.
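As a minimal sketch of the session point above, the helper below turns a raw Cookie header string into the list-of-dicts shape that Playwright's context.add_cookies() accepts. The helper name, the cookie values, and the URL are hypothetical and only illustrate the idea:

```python
def cookie_header_to_playwright(cookie_header, url):
    """Split a raw Cookie header into cookie dicts scoped to `url`."""
    cookies = []
    for pair in cookie_header.split(';'):
        pair = pair.strip()
        if not pair or '=' not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split('=', 1)
        cookies.append({'name': name.strip(), 'value': value.strip(), 'url': url})
    return cookies


# hypothetical session cookies copied from a logged-in browser
cookies = cookie_header_to_playwright('sessionid=abc123; theme=dark', 'https://httpbin.org')
print(cookies)
```

You could then pass the result to context.add_cookies(cookies) before opening a page, so every page in that context starts with the same session.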
Playwright's default request headers look like this:
{
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"Accept-Encoding": "gzip, deflate, br",
"Host": "httpbin.org",
"Sec-Ch-Ua": "\"Not A(Brand\";v=\"99\", \"HeadlessChrome\";v=\"121\", \"Chromium\";v=\"121\"",
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": "\"Windows\"",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/121.0.6167.57 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-65cba2e0-1afd0413750fd77832e87811"
}
}
Feel free to check yours by sending a request to https://httpbin.org/headers with the following code:
# import the required library
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    # launch a Chromium browser and start a new page
    browser = p.chromium.launch()
    context = browser.new_context()
    page = context.new_page()
    # launch the website
    response = page.goto('https://httpbin.org/headers')
    print(response.text())
    # close the browser
    browser.close()
Now, open the same website in an actual browser like Chrome and pay attention to the differences between the two sets of headers:
{
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9",
"Host": "httpbin.org",
"Referer": "https://www.google.com/",
"Sec-Ch-Ua": "\"Not A(Brand\";v=\"99\", \"Google Chrome\";v=\"121\", \"Chromium\";v=\"121\"",
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": "\"Windows\"",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "cross-site",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-65cbb9e0-70df80a20e9bcb9277dc9a93"
}
}
The User-Agent and Sec-Ch-Ua values in the default headers identify the browser as HeadlessChrome, and the Accept-Language and Referer headers are missing.
These are all typical request headers you shouldn't leave out because their absence can flag your scraper as a bot and get you blocked. Next, you'll learn to add custom HTTP request headers and edit existing ones in Playwright.
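To make the comparison above repeatable, here is a small sketch that reports which headers a scraper is missing and which ones differ from a real browser. The function name is hypothetical, and the two dictionaries are trimmed copies of the outputs shown earlier:

```python
def compare_headers(scraper_headers, browser_headers):
    """Return headers missing from the scraper and headers whose values differ."""
    # header names are case-insensitive, so compare them in lowercase
    scraper = {k.lower(): v for k, v in scraper_headers.items()}
    browser = {k.lower(): v for k, v in browser_headers.items()}
    missing = sorted(name for name in browser if name not in scraper)
    different = sorted(
        name for name in browser
        if name in scraper and scraper[name] != browser[name]
    )
    return missing, different


# trimmed copies of the Playwright and Chrome outputs shown above
playwright_headers = {
    'Sec-Fetch-Site': 'none',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/121.0.6167.57 Safari/537.36',
}
chrome_headers = {
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://www.google.com/',
    'Sec-Fetch-Site': 'cross-site',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
}

missing, different = compare_headers(playwright_headers, chrome_headers)
print('Missing:', missing)
print('Different:', different)
```

Running the same check after each header change gives a quick view of what still separates the scraper from the browser it imitates.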
How to Set Up Custom Headers in Playwright
Setting custom headers in Playwright is easy. You'll see how in this section.
Add Headers with Playwright
You can add custom headers in Playwright right inside the browser context. This method is handy when maintaining the same headers across all page instances.
Adding a non-existing header extends the default header set. But when you modify the value of an existing header, the new value will replace the previous one.
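The replace-or-extend rule can be pictured with a plain dictionary merge. This is only a model of the behavior described above, not Playwright's internal code, and the shortened header values are illustrative:

```python
def merge_headers(default_headers, extra_headers):
    """Model the merge: extras replace same-named defaults and add new ones."""
    # header names are case-insensitive, so normalize them before merging
    merged = {name.lower(): value for name, value in default_headers.items()}
    for name, value in extra_headers.items():
        merged[name.lower()] = value
    return merged


# shortened, illustrative values
defaults = {'User-Agent': 'HeadlessChrome/121.0.6167.57', 'Accept-Encoding': 'gzip, deflate, br'}
extras = {'user-agent': 'Chrome/121.0.0.0', 'accept-language': 'en-US,en;q=0.9'}

merged = merge_headers(defaults, extras)
print(merged)
```

The user agent is replaced, the accepted encoding is left alone, and the accepted language is added as a new entry.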
The code below replaces the User-Agent and Sec-Ch-Ua headers and adds the Accept-Language header. It sets the Referer through the referer argument of page.goto instead, because adding it inside the browser context might cause redirection issues:
# import the required library
from playwright.sync_api import sync_playwright
# specify the request headers
extra_headers = {
    'sec-ch-ua': '\'Not A(Brand\';v=\'99\', \'Google Chrome\';v=\'121\', \'Chromium\';v=\'121\'',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
    'accept-language': 'en-US,en;q=0.9'
}
with sync_playwright() as p:
    # launch a browser instance
    browser = p.chromium.launch()
    # create a browser context that sends the extra headers with every request
    context = browser.new_context(extra_http_headers=extra_headers)
    page = context.new_page()
    # open the target web page and add the Referer header
    response = page.goto('https://httpbin.org/headers', referer='https://www.google.com/')
    # print the headers output
    print(response.text())
    browser.close()
The code overwrites the existing headers and extends them with the new ones, as shown below. It will affect all page instances:
{
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9",
"Host": "httpbin.org",
"Referer": "https://www.google.com/",
"Sec-Ch-Ua": "'Not A(Brand';v='99', 'Google Chrome';v='121', 'Chromium';v='121'",
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": "\"Windows\"",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-65ccd31b-3a1995d3148e089d0a6e2b4f"
}
}
Awesome! You just added relevant request headers to all pages in Playwright. What if you want to apply the changes to specific page instances?
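One detail worth checking in the output above: the real Chrome output earlier wraps each brand in double quotes, while the custom Sec-Ch-Ua value uses single quotes. A small sketch of a helper that builds the value with double quotes follows; the function name is hypothetical, and the brand list is copied from the Chrome output:

```python
def build_sec_ch_ua(brands):
    """Format (brand, major_version) pairs with double quotes, as Chrome does."""
    return ', '.join(f'"{brand}";v="{version}"' for brand, version in brands)


# brand list copied from the Chrome output shown earlier
sec_ch_ua = build_sec_ch_ua([
    ('Not A(Brand', '99'),
    ('Google Chrome', '121'),
    ('Chromium', '121'),
])
print(sec_ch_ua)
```

The resulting string could be used as the 'sec-ch-ua' value in extra_headers so that the header matches the browser's own quoting.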
Edit Header's Values for Specific Pages
It's also possible to edit the request headers for specific Playwright pages. This is useful when managing multiple page instances, but you want each to use different request headers. Unlike the browser context method, it supports adding a Referer header.
You can achieve that by modifying the headers in the page instance.
The code below spins up two Playwright page instances. The first page uses the modified headers, and the second sticks to the default headers:
# import the required library
from playwright.sync_api import sync_playwright
# specify the request headers
extra_headers = {
    'sec-ch-ua': '\'Not A(Brand\';v=\'99\', \'Google Chrome\';v=\'121\', \'Chromium\';v=\'121\'',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
    'accept-language': 'en-US,en;q=0.9',
    'referer': 'https://www.google.com/'
}
with sync_playwright() as p:
    # launch a browser instance
    browser = p.chromium.launch()
    # create a browser context and open two pages in it
    context = browser.new_context()
    page_1 = context.new_page()
    page_2 = context.new_page()
    # modify the header set for the first page instance only
    page_1.set_extra_http_headers(extra_headers)
    # open the target web page on both pages
    response_1 = page_1.goto('https://httpbin.org/headers')
    response_2 = page_2.goto('https://httpbin.org/headers')
    # print the headers output
    print(f'Page1-modified: {response_1.text()}')
    print(f'Page2-default: {response_2.text()}')
    browser.close()
The code modifies the headers for the first page without touching the default ones for the second page:
Page1-modified: {
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9",
"Host": "httpbin.org",
"Referer": "https://www.google.com/",
"Sec-Ch-Ua": "'Not A(Brand';v='99', 'Google Chrome';v='121', 'Chromium';v='121'",
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": "\"Windows\"",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-65ccd465-3ea3701e1ac6e7751a29175c"
}
}
Page2-default: {
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"Accept-Encoding": "gzip, deflate, br",
"Host": "httpbin.org",
"Sec-Ch-Ua": "\"Not A(Brand\";v=\"99\", \"HeadlessChrome\";v=\"121\", \"Chromium\";v=\"121\"",
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": "\"Windows\"",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/121.0.6167.57 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-65ccd466-232d04c02ddb736953ebcc04"
}
}
You just edited the header values for specific page instances in Playwright. Nice! Next, you'll see the relevance of the order of the headers.
Set the Header Order
The order of headers might not matter for regular requests, but it can be significant in web scraping. Some websites check the order of incoming request headers against a typical browser to detect bots.
You can set the order of headers by imitating how a browser like Chrome arranges them. For example, if you are using a Chrome user agent, you should order your headers as they appear on a real Chrome browser. That affects existing headers and the new ones you create.
Launch https://httpbin.org/headers on Chrome. Right-click anywhere on the website and select "Inspect". Then go to the "Network" tab and click "Doc" ("Fetch/XHR" for many other websites). Choose a request and scroll down in the Headers section to view the request headers.
In those headers, the accepted encoding comes before the accepted language and the Referer, and the user agent header comes last. If you add only these headers and the Sec-Ch-Ua header to your request, you should maintain their relative positions like so:
# import the required library
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    # launch a browser instance
    browser = p.chromium.launch()
    # create a browser context and open a new page
    context = browser.new_context()
    page = context.new_page()
    # set the headers in the page instance, in Chrome's relative order
    page.set_extra_http_headers({
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'Referer': 'https://www.google.com/',
        'Sec-Ch-UA': '\'Not A(Brand\';v=\'99\', \'Google Chrome\';v=\'121\', \'Chromium\';v=\'121\'',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36'
    })
    response = page.goto('https://httpbin.org/headers')
    print(response.text())
    browser.close()
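This code sends the selected headers with the values shown below. Note that httpbin appears to list headers alphabetically (every output in this tutorial is sorted by name), so the output confirms the values rather than the order in which they were sent: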
{
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9",
"Host": "httpbin.org",
"Referer": "https://www.google.com/",
"Sec-Ch-Ua": "'Not A(Brand';v='99', 'Google Chrome';v='121', 'Chromium';v='121'",
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": "\"Windows\"",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-65ccd784-2a32a1900f771f313ce41f19"
}
}
You now know the importance of the order of headers in Playwright. Good job! However, following that arrangement doesn't guarantee your request will maintain the specified order.
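The ordering step above can be sketched as a small helper that rearranges a header dictionary to follow a reference order before you pass it to set_extra_http_headers. The function name and the X-Custom header are hypothetical, and the reference list only reflects the relative order used in the example above, not a complete Chrome header order:

```python
def order_headers(headers, reference_order):
    """Return a new dict whose keys follow `reference_order`.

    Headers missing from the reference keep their relative order and go last.
    """
    position = {name.lower(): index for index, name in enumerate(reference_order)}
    return dict(sorted(
        headers.items(),
        key=lambda item: position.get(item[0].lower(), len(position)),
    ))


# relative order taken from the code example above
CHROME_ORDER = ['Accept-Encoding', 'Accept-Language', 'Referer', 'Sec-Ch-Ua', 'User-Agent']

unordered = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
    'Referer': 'https://www.google.com/',
    'Accept-Language': 'en-US,en;q=0.9',
    'X-Custom': 'demo',  # hypothetical header that is not in the reference list
    'Accept-Encoding': 'gzip, deflate, br',
}

ordered = order_headers(unordered, CHROME_ORDER)
print(list(ordered))
```

Python dictionaries preserve insertion order, so the returned dictionary keeps the arrangement. As noted above, the browser may still reorder headers when it builds the final request.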
Make Playwright Web Scraping Simple with Web Scraping API
Playwright has powerful web scraping capabilities. However, maintaining a complete header set can get complicated and doesn't guarantee that you won't get blocked.
A reliable way to customize the request headers without manual labor is to use a web scraping API like ZenRows. It optimizes the headers, rotates the user agents, and has anti-bot features like premium proxy rotation and anti-CAPTCHA that let you scrape any website without getting blocked.
For example, Playwright might get blocked while trying to scrape a heavily protected website like G2. Try it yourself with the following code:
# import the required library
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    # launch a browser instance
    browser = p.chromium.launch()
    # create a browser context and open a new page
    context = browser.new_context()
    page = context.new_page()
    # request the protected page and print the returned HTML
    response = page.goto('https://www.g2.com/products/asana/reviews')
    print(response.text())
    browser.close()
The website blocks the request with the Cloudflare Turnstile:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<!-- ... -->
<title>Attention Required! | Cloudflare</title>
</head>
<body>
<!-- ... -->
</body>
</html>
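If you want your scraper to notice this kind of block automatically, one rough approach is to inspect the page title. The sketch below assumes the block page carries the "Attention Required!" title shown above; the function names and the marker list are hypothetical, and other block pages may use different titles:

```python
import re


def page_title(html):
    """Extract the text of the first <title> element, or None if absent."""
    match = re.search(r'<title[^>]*>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def looks_blocked(html, markers=('Attention Required',)):
    """Return True when the title contains one of the block-page markers."""
    title = (page_title(html) or '').lower()
    return any(marker.lower() in title for marker in markers)


# shortened sample pages for illustration
blocked_html = '<html><head><title>Attention Required! | Cloudflare</title></head><body></body></html>'
normal_html = '<html><head><title>Asana Reviews 2024</title></head><body></body></html>'

print(looks_blocked(blocked_html))
print(looks_blocked(normal_html))
```

In a scraper, you could call looks_blocked(response.text()) after page.goto and retry or switch strategy when it returns True.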
Now, try accessing the same website with ZenRows and Playwright. Sign up, and you'll get to the Request Builder. Activate Premium Proxies, set the Boost mode to JavaScript rendering, and choose the cURL option.
Copy the API URL from the generated cURL command into your scraper like so:
# import the required libraries
from playwright.sync_api import sync_playwright
from urllib.parse import urlencode
with sync_playwright() as p:
    # launch a browser instance
    browser = p.chromium.launch()
    # create a browser context and open a new page
    context = browser.new_context()
    page = context.new_page()
    # build the API URL; the f-string uses double quotes outside and single
    # quotes inside so it also runs on Python versions before 3.12
    formatted_url = (
        'https://api.zenrows.com/v1/?'
        'apikey=<YOUR_ZENROWS_API_KEY>&'
        f"{urlencode({'url': 'https://www.g2.com/products/asana/reviews'})}&"
        'js_render=true&'
        'premium_proxy=true'
    )
    response = page.goto(formatted_url)
    print(response.text())
    browser.close()
The code accesses the protected website, showing its title:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
<title>Asana Reviews 2024</title>
</head>
<body>
<!-- other content omitted for brevity -->
</body>
</html>
You just accessed and scraped a protected website by integrating ZenRows with Playwright. Congratulations!
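If you call the API for several target pages, it may help to wrap the URL construction in a function. The sketch below reuses the endpoint and the four query parameters from the code above; the function name and its defaults are assumptions, not part of any official client:

```python
from urllib.parse import urlencode


def build_zenrows_url(api_key, target_url, js_render=True, premium_proxy=True):
    """Assemble the API URL with the same query parameters used above."""
    params = {
        'apikey': api_key,
        'url': target_url,  # urlencode percent-encodes the target URL
        'js_render': str(js_render).lower(),
        'premium_proxy': str(premium_proxy).lower(),
    }
    return 'https://api.zenrows.com/v1/?' + urlencode(params)


# replace the placeholder with your own API key
api_url = build_zenrows_url('YOUR_ZENROWS_API_KEY', 'https://www.g2.com/products/asana/reviews')
print(api_url)
```

The returned string can be passed straight to page.goto in place of formatted_url.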
Let's replace that URL with https://httpbin.org/headers to see the configured headers:
# import the required libraries
from playwright.sync_api import sync_playwright
from urllib.parse import urlencode
with sync_playwright() as p:
    # launch a browser instance
    browser = p.chromium.launch()
    # create a browser context and open a new page
    context = browser.new_context()
    page = context.new_page()
    # build the API URL; the f-string uses double quotes outside and single
    # quotes inside so it also runs on Python versions before 3.12
    formatted_url = (
        'https://api.zenrows.com/v1/?'
        'apikey=<YOUR_ZENROWS_API_KEY>&'
        f"{urlencode({'url': 'https://httpbin.org/headers'})}&"
        'js_render=true&'
        'premium_proxy=true'
    )
    response = page.goto(formatted_url)
    print(response.text())
    browser.close()
ZenRows automatically optimizes the request headers, as shown below:
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9",
"Dnt": "1",
"Host": "httpbin.org",
"Sec-Ch-Ua": "\"Google Chrome\";v=\"119\", \"Chromium\";v=\"119\", \"Not?A_Brand\";v=\"24\"",
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": "\"macOS\"",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-65cd0197-49f4ecd54b722d0f7235587e"
}
That's it! Your scraper now uses ZenRows to customize the request headers automatically.
Conclusion
In this tutorial, you've learned how to customize the request headers in Playwright. Here's a recap of what you now know:
- How to add new headers and edit the existing ones in Playwright.
- How to customize header values for specific page instances.
- The importance of the order of request headers in Playwright.
- How to simulate a real browser’s headers.
Remember that advanced anti-bots will still block you at the slightest opportunity. The most straightforward way to optimize your request headers and bypass any anti-bot is to integrate ZenRows and scrape any website without limitations. Try ZenRows for free!