Are you getting blocked by Cloudflare's security measures? Modifying your User Agent is an effective strategy for bypassing Cloudflare.
Switching the User Agent to mimic a regular browser helps disguise your scraper as legitimate traffic, increasing the likelihood of successfully accessing and scraping data from Cloudflare-protected pages.
Let's get started!
What Is the User Agent (and Why Cloudflare May Block Yours)?
HTTP request headers are a set of key-value pairs sent by the client to the server that provide essential information about the request. One of the key elements of these headers is the User Agent (UA).
The User Agent string helps identify the client making the request. The User Agent of most web scraper bots significantly differs from the User Agents of regular browsers, which is why Cloudflare detects and blocks them.
For example, this is what a Chrome User Agent looks like:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.6478.127 Safari/537.36
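Plain HTTP clients are even easier to flag. As a quick illustration (a minimal sketch using the requests library), the default User Agent sent by Python's requests identifies the client as a script rather than a browser:
# pip3 install requests
import requests
# HTTPBin echoes back the User Agent it receives
response = requests.get("https://httpbin.io/user-agent")
# prints something like {"user-agent": "python-requests/2.32.3"}
print(response.text)
A string like python-requests/<version> is an instant giveaway, which is why raw HTTP clients get blocked even faster than headless browsers.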
In contrast, let's see Selenium's User Agent. The following code targets HTTPBin, a page that returns the User Agent of the client making the request:
# import the required libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
# run Chrome in headless mode
options = Options()
options.add_argument("--headless=new")
# start a driver instance
driver = webdriver.Chrome(options=options)
# open the target website
driver.get("https://httpbin.io/user-agent")
# print the HTML
print(driver.find_element(By.TAG_NAME, "body").text)
# release the allocated resources
driver.quit()
You'll get the following output on running this code:
{
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/126.0.0.0 Safari/537.36"
}
In this output, the User Agent string indicates that the browser is running in headless mode, as shown by HeadlessChrome/126.0.0.0. This is how Cloudflare detects and blocks such requests.
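To picture what the server sees, here's a simplified illustration of this kind of User Agent check (Cloudflare's actual detection is far more sophisticated and combines many signals):
# simplified illustration of a server-side User Agent check
BOT_MARKERS = ["HeadlessChrome", "python-requests", "curl"]
def looks_like_bot(user_agent: str) -> bool:
    # flag any User Agent containing a known bot marker
    return any(marker in user_agent for marker in BOT_MARKERS)
headless_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/126.0.0.0 Safari/537.36"
print(looks_like_bot(headless_ua))  # True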
Let's learn how to set up the User Agent to lower the chances of Cloudflare detection.
Change Your User Agent to Avoid Cloudflare Detection
Changing your User Agent is a simple yet effective way to avoid Cloudflare detection. In this section, we'll walk you through the process using Python and Selenium as an example.
Set a Custom User Agent
First, grab the latest User Agent from our list of top User Agents for web scraping.
Next, import the necessary libraries in your code.
# pip3 install selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
Initialize a Chrome Options object and configure Selenium to use a custom User Agent.
# ...
# create a Chrome Options instance
options = Options()
options.add_argument("--headless=new")
# set a custom User Agent
custom_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
options.add_argument(f"user-agent={custom_user_agent}")
# Start the WebDriver instance with the options
driver = webdriver.Chrome(options=options)
Open the HTTPBin target website to verify the User Agent has been set correctly. Finally, print the response HTML and close the driver.
# ...
# open the target website
driver.get("https://httpbin.io/user-agent")
# print the User Agent to verify
print(driver.find_element(By.TAG_NAME, "body").text)
# release the allocated resources
driver.quit()
You'll get the following output:
{
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
}
Congrats! You successfully modified the User Agent.
While a single custom User Agent might work for small-scale scraping, it isn't enough for extracting data from Cloudflare-protected pages at scale: repeated requests with the same User Agent can still be detected and blocked by Cloudflare.
Let's learn how to build a manual User Agent rotator to reduce the chances of detection further.
Rotate Your User Agent
Rotating User Agents is critical to avoid getting blocked, as too many requests from the same User Agent can easily trigger detection mechanisms. Here's how you can rotate your User Agents randomly.
Grab a few User Agents from our list of top User Agents for web scraping.
# list of User Agent strings
user_agents = [
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14.4; rv:124.0) Gecko/20100101 Firefox/124.0",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
]
Use Python's random.choice() method to pick a User Agent from this list at random.
Let's modify the previous code to implement the User Agent rotation:
import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
# list of User Agent strings
user_agents = [
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14.4; rv:124.0) Gecko/20100101 Firefox/124.0",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
]
# create a Chrome Options instance
options = Options()
options.add_argument("--headless=new")
# set a random User Agent
random_user_agent = random.choice(user_agents)
options.add_argument(f"user-agent={random_user_agent}")
# Start the WebDriver instance with the options
driver = webdriver.Chrome(options=options)
# open the target website
driver.get("https://httpbin.io/user-agent")
# print the User Agent to verify
print(driver.find_element(By.TAG_NAME, "body").text)
# release the allocated resources
driver.quit()
Each run of this code prints the randomly selected User Agent, for example:
# request 1
{
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
# request 2
{
  "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
# ...
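Note that the script above picks a single User Agent when the driver starts and keeps it for the whole session. To rotate across multiple requests, one straightforward approach (a minimal sketch, reusing the imports and user_agents list from above) is to start a fresh driver with a new random User Agent for each target URL:
# ... (imports and user_agents list as before)
urls = ["https://httpbin.io/user-agent"] * 3  # example targets
for url in urls:
    # pick a new User Agent for every request
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument(f"user-agent={random.choice(user_agents)}")
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    print(driver.find_element(By.TAG_NAME, "body").text)
    driver.quit()
Restarting the browser per request is slow, so in practice you'd reserve this for small batches or pool drivers instead.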
Constructing a proper User Agent string is crucial to avoid detection. Many anti-bot systems, such as Cloudflare, look for inconsistencies in these details. For instance, using WebKit versions associated with Safari in a modern Chrome User Agent string is a clear mismatch.
Inconsistent User Agent Example
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/605.1.15 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/605.1.15
In this incorrect example, AppleWebKit/605.1.15 and Safari/605.1.15 indicate a Safari browser, yet the string claims to be Chrome (Chrome/126.0.0.0). This inconsistency signals that the User Agent string is fabricated.
Correct User Agent Example
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36
In this correct example, AppleWebKit/537.36 is the frozen WebKit token shared by both Chrome and Safari, correctly paired with Chrome/126.0.0.0 and Safari/537.36.
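As a quick sanity check before adding a string to your rotation list, you can verify this pairing programmatically. The following is a simplified heuristic, not an exhaustive validator: it only confirms that Chrome User Agents carry the frozen AppleWebKit/537.36 token:
import re
def is_consistent_chrome_ua(user_agent: str) -> bool:
    # Chrome-based User Agents have shipped the frozen AppleWebKit/537.36
    # token for years; any other WebKit version next to a Chrome token is a mismatch
    if "Chrome/" not in user_agent:
        return True  # not a Chrome UA; this heuristic doesn't apply
    match = re.search(r"AppleWebKit/([\d.]+)", user_agent)
    return bool(match) and match.group(1) == "537.36"
# the inconsistent example from above fails the check
print(is_consistent_chrome_ua("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/605.1.15 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/605.1.15"))  # False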
Keeping your User Agents up-to-date is crucial to avoid detection. Outdated User Agents can be easily flagged by security systems. Regularly refresh your list by sourcing the latest User Agents from reliable repositories or websites, and use automation tools to fetch them programmatically.
Ensure your list includes various browsers, versions, platforms, and devices to mimic real user traffic accurately.
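If you host your User Agent list somewhere you control, a small helper can refresh it at startup. Here's a minimal sketch; the URL below is a placeholder for whatever source you actually use, assumed to serve one User Agent per line:
import requests
def fetch_user_agents(source_url: str) -> list[str]:
    # download a plain-text list with one User Agent per line
    response = requests.get(source_url, timeout=10)
    response.raise_for_status()
    return [line.strip() for line in response.text.splitlines() if line.strip()]
# placeholder URL: point this at your own hosted list
user_agents = fetch_user_agents("https://example.com/user-agents.txt")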
However, maintaining this can be challenging and time-consuming. In the next section, we'll explore a better alternative to bypass Cloudflare efficiently.
How to Bypass Cloudflare Every Time
A manual User Agent rotator is most often not enough to avoid getting blocked by Cloudflare. Not only is it hard to maintain, but Cloudflare also has many more tricks up its sleeve. To bypass Cloudflare effectively, your scraper will likely need to arm itself with additional techniques, such as premium proxies.
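For instance, Selenium can route traffic through a proxy using Chrome's --proxy-server flag. Here's a minimal sketch combining it with a rotated User Agent (the proxy address is a placeholder for your provider's endpoint):
# ... (imports and user_agents list as before)
options = Options()
options.add_argument("--headless=new")
options.add_argument(f"user-agent={random.choice(user_agents)}")
# placeholder address: replace with your premium proxy endpoint
options.add_argument("--proxy-server=http://<PROXY_HOST>:<PROXY_PORT>")
driver = webdriver.Chrome(options=options)
Even then, Cloudflare also fingerprints TLS handshakes, JavaScript execution, and browsing behavior, so proxies and User Agents alone rarely hold up at scale.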
The best solution to avoid Cloudflare blocks is using a web scraping API, like ZenRows.
Let's try to scrape Cloudflare Challenge, a page protected by Cloudflare, using our previous User Agent rotator script.
import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
# list of User Agent strings
user_agents = [
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14.4; rv:124.0) Gecko/20100101 Firefox/124.0",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
]
# create a ChromeOptions instance
options = Options()
options.add_argument("--headless=new")
# set a random User Agent
random_user_agent = random.choice(user_agents)
options.add_argument(f"user-agent={random_user_agent}")
# Start the WebDriver instance with the options
driver = webdriver.Chrome(options=options)
# open the target website
driver.get("https://www.scrapingcourse.com/cloudflare-challenge")
# take a screenshot
driver.save_screenshot("cloudflare_blocked_screenshot.png")
# release the allocated resources
driver.quit()
On running this code, the saved screenshot (cloudflare_blocked_screenshot.png) shows that the script failed due to Cloudflare's protection.
Now, let's try to access the same website using ZenRows.
Visit the ZenRows homepage and sign up for an account. You'll get redirected to the Request Builder page.
Input the same target URL in the URL to Scrape box. Activate Premium Proxies and the JS Rendering Boost mode. Select Python as your language and click on the API tab.
Finally, copy the generated code into your script.
# pip install requests
import requests
# target page and your ZenRows API key
url = "https://www.scrapingcourse.com/cloudflare-challenge"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
    "url": url,
    "apikey": apikey,
    # enable JavaScript rendering and premium proxies
    "js_render": "true",
    "premium_proxy": "true",
}
# send the request through the ZenRows API and print the HTML
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)
You'll get the following output on running this code:
<html lang="en">
<head>
<!-- ... -->
<title>Cloudflare Challenge - ScrapingCourse.com</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<h2>
You bypassed the Cloudflare challenge! :D
</h2>
<!-- other content omitted for brevity -->
</body>
</html>
Congratulations! You successfully bypassed the Cloudflare protection using ZenRows.
Conclusion
In this guide, you learned about the importance of using proper User Agents to avoid detection while web scraping. You've seen how bot-like User Agents are easily detected by anti-bot solutions and understood the necessity of regularly updating and rotating them.
However, maintaining and manually rotating User Agents is unscalable and often insufficient to bypass sophisticated defenses like Cloudflare. We recommend using ZenRows to bypass all anti-bot measures and scrape any website without getting blocked.