Does your Python scraper get blocked by Cloudflare? That's because your requests keep tripping its anti-bot system. But no worries. You can bypass it and scrape the data you need.
We'll show you 5 tested and trusted Python tools to help you scrape without getting blocked by Cloudflare.
Let's go!
Can Cloudflare Detect Python Scrapers?
Cloudflare is one of the most common anti-bot measures you'll encounter while scraping. It uses various detection techniques with different protection levels to block scrapers. While some protection levels are simple and easier to break with a few customizations, others are advanced, and bypassing them requires combining several solutions.
Like scrapers written in any other programming language, Python scrapers are automated scripts, so they're prone to Cloudflare's detection.
For example, Python's Requests library, a popular HTTP client, will likely not pass Cloudflare's defense because it sends bot-like parameters such as the python-requests/2.32.3 User Agent. Additionally, it doesn't support JavaScript rendering and lacks the browser-like features to automate human interaction with a website.
To fact-check that claim, let's see how the Requests library performs against a Cloudflare-protected website, G2 Reviews. Try it out with the code below:
# import the required library
import requests
# send the request
response = requests.get("https://www.g2.com/products/jira/reviews")
# validate and print the response
if response.status_code != 200:
    print(f"The request failed with an error {response.status_code}")
else:
    print(response.text)
The code fails with the Cloudflare 403 forbidden error, as shown below. This error is a result of Cloudflare blocking the Requests library due to its bot-like attributes:
The request failed with an error 403
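A 403 on its own doesn't prove that Cloudflare issued the block, so it can help to inspect the response headers as well. The sketch below is a heuristic, not an official check. It assumes that Cloudflare-served responses carry a `Server: cloudflare` or `CF-RAY` header and that challenge pages carry `cf-mitigated: challenge`. The helper name and the set of status codes are our own choices for illustration:

```python
from typing import Mapping

# Status codes we treat as "possibly blocked" (an assumption for this sketch).
BLOCK_STATUS_CODES = {403, 429, 503}


def looks_like_cloudflare_block(status_code: int, headers: Mapping[str, str]) -> bool:
    """Heuristic: True if a response looks like a Cloudflare block or challenge."""
    # Lower-case names and values so the check also works with plain dicts.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    served_by_cloudflare = lowered.get("server") == "cloudflare" or "cf-ray" in lowered
    challenged = lowered.get("cf-mitigated") == "challenge"
    return challenged or (status_code in BLOCK_STATUS_CODES and served_by_cloudflare)
```

With the Requests example above, you could call `looks_like_cloudflare_block(response.status_code, response.headers)` before deciding whether to retry with one of the tools covered in this article.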
Cloudflare keeps updating its defense system, making it more difficult to bypass. To stay ahead of Cloudflare's detection mechanisms, you'll need to implement specific bypass techniques in your web scraper.
For a detailed guide on different techniques useful for tricking Cloudflare, read our article on bypassing Cloudflare.
Now, let's show you the five solutions to bypass Cloudflare and scrape without getting blocked in Python.
How to Bypass Cloudflare in Python
Different libraries and tools may help bypass Cloudflare while web scraping in Python. Let's look at the five best examples and learn how each works.
Cloudscraper
Cloudscraper was built as an easy-to-use browser emulator for bypassing Cloudflare in Python. It's similar to the Requests library in functionality and parameter acceptance. Cloudscraper's JavaScript engine makes it possible to easily decode and parse JavaScript, allowing your request to imitate a regular web browser's behavior.
Cloudscraper supports different browsers, including Chrome and Firefox, and emulates fingerprints, such as the cipher suites, for a secure client-server connection.
However, the downside of using Cloudscraper is that it doesn't pass advanced fingerprinting tests, making it unfit to get past the more sophisticated Cloudflare Turnstile CAPTCHA. That said, pairing it with paid CAPTCHA solvers like 2Captcha and fortifying it with proxies can enhance its evasion capability.
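Here's a minimal sketch of what a Cloudscraper request can look like, assuming the library is installed with `pip install cloudscraper`. The function name, the browser profile values, and the timeout are illustrative choices, and the call may still be blocked on sites with advanced protection:

```python
# Browser profile passed to Cloudscraper (illustrative values).
BROWSER_PROFILE = {"browser": "chrome", "platform": "windows", "mobile": False}


def fetch_with_cloudscraper(url: str, timeout: float = 30.0) -> str:
    """Fetch a page through Cloudscraper and return its HTML."""
    # Imported lazily so this file still loads where cloudscraper isn't installed.
    import cloudscraper

    # create_scraper() returns a Requests-style session object.
    scraper = cloudscraper.create_scraper(browser=BROWSER_PROFILE)
    response = scraper.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx, including a Cloudflare 403
    return response.text
```

Because the scraper object behaves like a Requests session, calling `fetch_with_cloudscraper("https://www.g2.com/products/jira/reviews")` mirrors the plain Requests example from earlier.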
👍 Pros:
- Easy to use.
- Features User Agent auto-rotation for Cloudflare bypass.
- Suitable for bypassing basic blocks.
👎 Cons:
- It fails on websites using advanced Cloudflare protection.
- Not regularly updated.
- It only emulates a small fraction of the browser.
Check out our detailed tutorial on using Cloudscraper with Python to learn more.
ZenRows
The best way to bypass Cloudflare with Python is to use ZenRows. It's a web scraping solution that bypasses Cloudflare with a single API request. As a full-fledged bypass toolkit, ZenRows allows you to focus on your scraping logic while it handles anti-bot auto-bypass under the hood.
In addition to a scraping API, ZenRows also features residential proxy services with geo-targeting and auto-rotation features. If scraping a JavaScript-rendered website, it acts as a headless browser to automate human actions such as scrolling, clicking, and more. ZenRows also works with any programming language and is compatible with other libraries, making it easy to integrate into your existing workflows.
👍 Pros:
- Easy to use.
- Compatible with any programming language.
- Bypasses Cloudflare and other anti-bot measures, regardless of the difficulty level.
- It features smart rotating premium proxies.
- Scrapes JavaScript-rendered pages.
- Easy to integrate into your existing workflows.
- 24/7 customer support with a complete knowledge base.
- Frequently updated.
👎 Cons:
- It's a paid service (but offers a free trial).
How to Bypass Cloudflare in Python Using ZenRows
Let's use ZenRows to scrape G2 Reviews, a website heavily protected by Cloudflare, to see how it works. To bypass Cloudflare with ZenRows, you only need your free API key and the target URL.
Sign up to open the ZenRows Request Builder. Paste the target URL in the link box, activate Premium Proxies, and click JS Rendering. Choose Python as your programming language and select the API connection mode. Now, copy and paste the generated code into your scraper file.
The generated code should look like this:
# pip install requests
import requests
url = 'https://www.g2.com/products/jira/reviews'
apikey = '<YOUR_ZENROWS_API_KEY>'
params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'premium_proxy': 'true',
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
The above code accesses the protected web page and extracts its full-page HTML, as shown:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<link href="https://www.g2.com/assets/favicon-fdac..." rel="shortcut icon" type="image/x-icon" />
<title>Jira Reviews from July 2024</title>
</head>
<body>
<!-- other content omitted for brevity -->
</body>
Great! Your scraper now uses ZenRows to bypass Cloudflare.
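If you plan to reuse the request in a larger scraper, you may prefer to keep the API key out of the source file and fail loudly on errors. The sketch below wraps the same endpoint and parameters in two functions. The `ZENROWS_API_KEY` environment variable name, the function names, and the timeout are our own assumptions, not part of the generated code:

```python
import os

ZENROWS_ENDPOINT = "https://api.zenrows.com/v1/"


def build_zenrows_params(target_url: str, api_key: str) -> dict:
    """Build the same query parameters used in the generated code."""
    return {
        "url": target_url,
        "apikey": api_key,
        "js_render": "true",
        "premium_proxy": "true",
    }


def scrape_with_zenrows(target_url: str, timeout: float = 60.0) -> str:
    """Send the request and return the page HTML."""
    import requests  # pip install requests

    # Hypothetical variable name; set it in your shell before running.
    api_key = os.environ["ZENROWS_API_KEY"]
    response = requests.get(
        ZENROWS_ENDPOINT,
        params=build_zenrows_params(target_url, api_key),
        timeout=timeout,
    )
    response.raise_for_status()  # surface HTTP errors instead of printing them
    return response.text
```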
Undetected-chromedriver
Undetected-chromedriver is a modified version of the Selenium ChromeDriver that prevents anti-bot detection. The module automatically loads a ChromeDriver binary into your machine and patches it to emulate a legitimate browser's fingerprint.
One of its main advantages is that it supports JavaScript execution to interact dynamically with websites. This feature adds a human touch to your request and boosts your chances of evading Cloudflare.
However, like other headless browsers, undetected-chromedriver introduces memory overhead since it launches a browser instance. Although more sophisticated Cloudflare protection may still block it, you can add proxies to undetected-chromedriver to give it more stealth.
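As a rough sketch, here's how a page fetch with an optional proxy could look, assuming `pip install undetected-chromedriver` and a local Chrome installation. The helper names, the window size, and the proxy address in the usage note are placeholders we chose for illustration:

```python
from typing import List, Optional


def build_chrome_args(proxy: Optional[str] = None) -> List[str]:
    """Collect the Chrome command-line flags for this session."""
    args = ["--window-size=1280,800"]
    if proxy:
        # Chrome's standard flag for routing traffic through a proxy.
        args.append(f"--proxy-server={proxy}")
    return args


def fetch_with_undetected_chrome(url: str, proxy: Optional[str] = None) -> str:
    """Open the page in a patched Chrome instance and return the rendered HTML."""
    # Imported lazily so this file still loads where the package isn't installed.
    import undetected_chromedriver as uc

    options = uc.ChromeOptions()
    for arg in build_chrome_args(proxy):
        options.add_argument(arg)

    driver = uc.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser to free the memory it holds.
        driver.quit()
```

For example, `fetch_with_undetected_chrome("https://www.g2.com/products/jira/reviews", proxy="http://127.0.0.1:8080")` would route the session through a proxy listening on that placeholder address.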
👍 Pros:
- Increases the chances of bypassing browser fingerprinting tests.
- Ability to mimic human behavior.
- Suitable for handling JavaScript challenges.
👎 Cons:
- Browser instance reduces overall performance.
- Unable to bypass advanced Cloudflare protection.
Read our complete tutorial on how to use the undetected-chromedriver in Python to learn more.
Curl_cffi
Curl_cffi is a Python binding for curl-impersonate, a patched build of the standard cURL library. The patches give it actual browser fingerprints, allowing it to emulate popular browsers like Chrome, Safari, and Edge.
It bypasses anti-bots like Cloudflare by replacing detectable bot-like signals, such as cURL's OpenSSL library, with Chrome's BoringSSL. However, curl_cffi isn't a headless browser and can't execute JavaScript. So, it has fingerprinting limitations and can still be flagged by advanced anti-bot measures.
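The sketch below shows the general shape of a curl_cffi request, assuming the package is installed with `pip install curl_cffi`. The function name and timeout are illustrative. The `"chrome"` impersonation target is an assumption about your installed version, since older releases may only accept versioned names such as `"chrome110"`:

```python
def fetch_with_curl_cffi(url: str, impersonate: str = "chrome", timeout: float = 30.0) -> str:
    """Fetch a page with a browser-like TLS fingerprint and return its HTML."""
    # Imported lazily so this file still loads where curl_cffi isn't installed.
    from curl_cffi import requests as curl_requests

    # The impersonate argument selects which browser's fingerprint to present.
    response = curl_requests.get(url, impersonate=impersonate, timeout=timeout)
    response.raise_for_status()
    return response.text
```

Since the interface follows Requests, swapping it into an existing scraper usually means changing the import and adding the `impersonate` argument. Pages that require JavaScript execution will still fail.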
👍 Pros:
- Emulates different browsers without introducing memory overhead.
- Good for bypassing basic fingerprinting tests.
- Simple learning curve.
👎 Cons:
- It can't bypass sophisticated anti-bot measures.
- Its inability to execute JavaScript reduces the chances of evading anti-bots.
Want to learn more about curl_cffi? Check out our detailed tutorial on using curl_cffi for Python web scraping.
Cfscrape
The cfscrape library is another popular browser emulator for bypassing Cloudflare in Python. This tool follows the same operational pattern as Cloudscraper. The only difference is that cfscrape doesn't allow you to specify a browser, as it only emulates Chrome by default.
That said, cfscrape isn't perfect: it can only handle webpages with simple anti-bot measures, meaning it's ineffective against advanced Cloudflare protection. It also doesn't support urllib3 versions 2+, so you might need to downgrade by running the following command to install a lower version of urllib3:
pip install "urllib3<2"
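With the dependency pinned, a cfscrape request looks much like a Cloudscraper one. This is a minimal sketch, assuming `pip install cfscrape` has succeeded in your environment. The function name and timeout are our own, and the request may still fail on anything beyond basic protection:

```python
def fetch_with_cfscrape(url: str, timeout: float = 30.0) -> str:
    """Fetch a page through cfscrape and return its HTML."""
    # Imported lazily so this file still loads where cfscrape isn't installed.
    import cfscrape

    # create_scraper() returns a Requests-style session object.
    scraper = cfscrape.create_scraper()
    response = scraper.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text
```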
👍 Pros:
- Easy to use and implement.
👎 Cons:
- Ineffective against advanced Cloudflare challenges.
- It lacks maintenance and updates.
- It's less feature-rich than other options like Cloudscraper.
- Incompatible with recent urllib3 versions (versions 2+).
- No option to switch browsers.
Want to learn more about bypassing blocks with cfscrape? Check out our detailed tutorial on using cfscrape with Python.
Conclusion
Knowing how to bypass anti-bots is as vital as the scraping process itself, especially when you're looking to scrape a web page protected by Cloudflare. In this article, we covered five tools for bypassing Cloudflare using Python: Cloudscraper, cfscrape, ZenRows, undetected-chromedriver, and curl_cffi.
While the other tools often fail with large-scale scraping or advanced Cloudflare security measures, ZenRows is the only solution capable of bypassing Cloudflare and any other anti-bot at scale. What's more, ZenRows only requires a single API call and doesn't involve technical setup like the other tools.
Try ZenRows for free now without a credit card!
Frequent Questions
What Is Cloudflare Bot Manager?
Cloudflare Bot Manager is one of the most widely deployed web security systems for mitigating attacks from malicious bots. Unfortunately for us, it can also flag legitimate web scrapers.
Cloudflare's bot detection techniques include TLS fingerprinting, event tracking, and canvas fingerprinting. If you've tried to scrape a Cloudflare-protected site before, some of the errors you may have seen include:
- Error 1020: access denied.
- Error 1010: the owner of this website has banned your access based on your browser's signature.
- Error 1015: you are being rate-limited.
- Error 1012: access denied.
These are usually accompanied by a Cloudflare 403 Forbidden HTTP response status code.
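To make these errors easier to spot in your logs, you could map the codes above to short descriptions. The sketch below assumes the response body mentions the code as `error code: 1020` or `Error 1020`, which may not hold for every block page. The helper name and the regular expression are our own:

```python
import re
from typing import Optional

# Codes and meanings taken from the list above.
CLOUDFLARE_ERRORS = {
    1010: "access banned based on your browser's signature",
    1012: "access denied",
    1015: "you are being rate-limited",
    1020: "access denied",
}

# Matches "error code: 1020" as well as "Error 1020" (assumed body formats).
_ERROR_CODE = re.compile(r"error(?: code)?:?\s*(1\d{3})", re.IGNORECASE)


def describe_cloudflare_error(body: str) -> Optional[str]:
    """Return a readable description of the first Cloudflare error code found."""
    match = _ERROR_CODE.search(body)
    if match is None:
        return None
    code = int(match.group(1))
    meaning = CLOUDFLARE_ERRORS.get(code, "unlisted Cloudflare error")
    return f"Error {code}: {meaning}"
```

For instance, passing `response.text` from a blocked request to `describe_cloudflare_error()` gives you a one-line summary, or `None` when no code is present.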