Getting detected by Cloudflare Bot Manager is a common problem when scraping: it can slow your scraper down or stop it altogether. The best way to avoid this is to use libraries built to get around this anti-bot protection.
In this article, we'll cover some proven tools to bypass Cloudflare in Python and share advice on how to use them to scrape any web page whose data you're interested in.
Let's get started!
What Is Cloudflare Bot Manager?
Cloudflare Bot Manager is one of the most widely deployed web security systems for mitigating attacks from malicious bots. Unfortunately for us, legitimate web scrapers can get flagged too.
Cloudflare's bot detection techniques include TLS fingerprinting, event tracking and canvas fingerprinting. If you've tried to scrape a Cloudflare-protected site before, you may have seen errors such as:
- Error 1020: access denied.
- Error 1010: the owner of this website has banned your access based on your browser's signature.
- Error 1015: you are being rate limited.
- Error 1012: access denied.
They're usually accompanied by a 403 Forbidden HTTP response status code.
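To tell these blocks apart in your own scraper, you can check the status code and look for the error number in the response body. Here's a minimal sketch: the function name and the regular expression are our own assumptions about how the error text shows up in the page, not an official Cloudflare format, so adjust them to what you actually receive.

```python
import re

# Error codes and meanings taken from the list above
CLOUDFLARE_ERRORS = {
    "1010": "access banned based on your browser's signature",
    "1012": "access denied",
    "1015": "you are being rate limited",
    "1020": "access denied",
}

def detect_cloudflare_error(status_code, body):
    """Return (code, meaning) if the response looks like a Cloudflare block, else None."""
    if status_code < 400:
        return None  # successful responses are not treated as blocks
    # Assumption: the page mentions the code as "error code: 1020" or "Error 1020"
    match = re.search(r"error(?:\s+code)?\s*:?\s*(\d{4})", body, flags=re.IGNORECASE)
    if match and match.group(1) in CLOUDFLARE_ERRORS:
        code = match.group(1)
        return code, CLOUDFLARE_ERRORS[code]
    return None

print(detect_cloudflare_error(403, "<html>error code: 1020</html>"))
```

With requests, you'd call it as `detect_cloudflare_error(response.status_code, response.text)`.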
Can Cloudflare Detect Python Scrapers?
Yes. Cloudflare can detect Python scrapers: they aren't whitelisted, so it treats them as malicious by default. As a result, your web scraper can be denied access to a web page.
To see this in action, let's try scraping a Cloudflare-protected page with the plain requests library. We'll start by installing it:
pip install requests
And then we'll send a request to the target website:
# Let's do a canonic scraping with requests
import requests

scraper = requests.get('https://opensea.io/rankings/trending').text
print(scraper)
It didn't work. 😢
The requests-based scraper returns raw HTML with the error code at the top.
requests isn't a reliable way to get past Cloudflare's security measures, as it often returns an access denied error. So how do you avoid Cloudflare detection in Python while scraping? Let's get into that.
How to Bypass Cloudflare in Python
There are several tools for bypassing Cloudflare while web scraping in Python:
- ZenRows
- cloudscraper
- cfscrape
- undetected-chromedriver
Let's take a look at these tools and how they can be used successfully.
The best way to bypass Cloudflare with Python is using ZenRows. It's a web scraping API capable of bypassing Cloudflare in Python with a single request. It simplifies the process of integrating scraping tasks into your workflow with its advanced anti-bot features and proxy modes.
Pros:
- Easy to use.
- Capable of bypassing anti-bots, like Cloudflare and CAPTCHAs.
- It can bypass Cloudflare v2 challenge CAPTCHA.
- Smart rotating and premium proxies are included.
- It's compatible with other libraries, making it easy to integrate into your existing workflows.
- Chat support provided by developers.
- Constantly updated.
Cons:
- It's a paid service, but offers a free trial.
How to Bypass Cloudflare in Python Using ZenRows
To scrape data from unprotected sources, you'll need only two pieces of information: a free API key and the URL of your target website.
So, going back to our example of scraping the Opensea website, you just 1) import the requests library and 2) send a get() request to the ZenRows API with the URL you want to scrape.
import requests

response = requests.get("https://api.zenrows.com/v1/?apikey=YOUR_API_KEY&url=https%3A%2F%2Fopensea.io%2Frankings%2Ftrending")
print(response.text)
To bypass Cloudflare using Python, simply add &antibot=true, along with the premium_proxy and proxy_country parameters, to your request:
response_antibot = requests.get("https://api.zenrows.com/v1/?apikey=YOUR_API_KEY&url=https%3A%2F%2Fopensea.io%2Frankings%2Ftrending&antibot=true&premium_proxy=true&proxy_country=us")
print(response_antibot.text)
To scrape a specific piece of information, complement your request with the Wait For Selector feature by adding &wait_for= followed by a CSS selector, such as &wait_for=.content, together with &js_render=true. This will make ZenRows wait for the desired content to load before proceeding with the data extraction.
response_specific = requests.get("https://api.zenrows.com/v1/?apikey=YOUR_API_KEY&url=https%3A%2F%2Fopensea.io%2Frankings%2Ftrending&js_render=true&wait_for=.content")
print(response_specific.text)
In just a few seconds, the ZenRows API will return the web page content. Here's what we got from the Opensea web page:
<!DOCTYPE html><html lang="en-US"><head><meta charSet="utf-8"/><meta content="width=device-width,initial-scale=1" name="viewport"/><link href="https://opensea.io/rankings/trending" hrefLang="en" rel="alternate"/><link href="https://opensea.io/zh-CN/rankings/trending" hrefLang="zh-CN" rel="alternate"/><link href="https://opensea.io/zh-TW/rankings/trending" hrefLang="zh-TW" rel="alternate"/><link href="https://opensea.io/de-DE/rankings/trending" hrefLang="de-DE" rel="alternate"/><link href="https://opensea.io/es/rankings/trending" hrefLang="es" rel="alternate"/><link href="https://opensea.io/fr/rankings/trending" hrefLang="fr" rel="alternate"/><link href="https://opensea.io/kr/rankings/trending" hrefLang="kr" rel="alternate"/><link href="https://opensea.io/ja/rankings/trending" hrefLang="ja" rel="alternate"/><link rel="preload"......
That's all! You can now use Python to bypass Cloudflare on any website.
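Hand-encoding the target URL into the query string is easy to get wrong. As a sketch, a small helper can build the same request URL from keyword options using only the standard library. The helper name is hypothetical; the endpoint and parameter names are the ones used in the requests above.

```python
from urllib.parse import urlencode

ZENROWS_ENDPOINT = "https://api.zenrows.com/v1/"

def build_zenrows_url(api_key, target_url, **options):
    """Build a ZenRows request URL, percent-encoding the target URL and options."""
    params = {"apikey": api_key, "url": target_url}
    for name, value in options.items():
        # The API examples above use lowercase true/false, so convert Python booleans
        params[name] = str(value).lower() if isinstance(value, bool) else value
    return f"{ZENROWS_ENDPOINT}?{urlencode(params)}"

url = build_zenrows_url(
    "YOUR_API_KEY",
    "https://opensea.io/rankings/trending",
    antibot=True,
    premium_proxy=True,
    proxy_country="us",
)
print(url)
```

The printed URL can be passed straight to requests.get(). Alternatively, requests can do the encoding for you if you pass the same dictionary through its params argument.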
cloudscraper was built as an easy-to-use library for bypassing Cloudflare in Python. Its interface closely follows that of requests.
Pros:
- Easy to use.
Cons:
- It fails on websites using Cloudflare v2 challenge CAPTCHA.
- Difficult for beginners.
- Not updated frequently.
- It doesn't work well in large-scale scraping projects.
How to Bypass Cloudflare in Python Using cloudscraper
To use cloudscraper in Python to bypass Cloudflare, start by installing it:
pip install cloudscraper
The fastest way to employ cloudscraper is to call create_scraper(). From there, cloudscraper operates the same way as a requests session object; you just replace calls to requests.get() or requests.post() with scraper.get() or scraper.post().
import cloudscraper

scraper = cloudscraper.create_scraper(delay=10, browser="chrome")
content = scraper.get("https://opensea.io/rankings/trending").text
print(content)
The cloudscraper Python package should be complemented with an additional library like BeautifulSoup4 to parse the scraped data:
from bs4 import BeautifulSoup as bs

# To further process extracted data
soup = bs(content, "html.parser")
# These classes are not reliable, added here for demo purposes
# select() takes a CSS selector; find_all() would treat the string as a tag name
elements = soup.select(".eqFKWH .hmMxZB .mGAUR")
scraped_data = [element.get_text() for element in elements]
print(scraped_data)
Boom! Running the script should scrape the target website and your result should look like this:
[ 'PATCHWORKS', 'Moonrunners Official', 'Frog Affirmation Project (FAP)', 'Checks - VV Edition', … ]
However, the downside of using the cloudscraper library is that it can't bypass the Cloudflare v2 challenge. This means that if you encounter a website that uses this type of protection, your scraper becomes ineffective. For example, if you try to scrape forever21.com, cloudscraper will return the following error message:
cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 Captcha challenge, This feature is not available in the opensource (free) version.
A possible solution is to use a third-party CAPTCHA solver, or a web scraping API that provides anti-bot bypass such as ZenRows.
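One way to wire up such a fallback is to try the lightweight option first and switch to the next one only when it fails. The sketch below is a generic pattern, not part of any of these libraries: fetch_with_fallback is a hypothetical helper, and the two fetchers are stand-ins that simulate a failed challenge and a successful backup so the example runs without network access.

```python
def fetch_with_fallback(url, fetchers):
    """Try each fetcher in order and return the first successful result."""
    errors = []
    for fetch in fetchers:
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose: any failure triggers the next fetcher
            errors.append(f"{fetch.__name__}: {exc}")
    raise RuntimeError("All fetchers failed: " + "; ".join(errors))

# Stand-in for a scraper that hits a challenge it can't solve
def flaky_fetcher(url):
    raise RuntimeError("Detected a Cloudflare version 2 Captcha challenge")

# Stand-in for a backup, such as a scraping API or a CAPTCHA-solving service
def backup_fetcher(url):
    return f"<html>content of {url}</html>"

html = fetch_with_fallback("https://example.com", [flaky_fetcher, backup_fetcher])
print(html)
```

In a real scraper, the first fetcher could wrap scraper.get(url).text from cloudscraper and the second could call the ZenRows API.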
The cfscrape package is another popular choice for bypassing Cloudflare in Python thanks to its low technical complexity. All you need is the requests module installed in order to interact with the cfscrape scraper. Its simplicity makes it a great choice for those looking to get started with web scraping without advanced technical skills.
cfscrape isn't perfect: it can only handle web pages with the classic Cloudflare anti-bot protection, so it's ineffective against reCAPTCHA challenges.
Pros:
- Easy to use and implement.
Cons:
- Ineffective with reCAPTCHA challenges.
- It lacks maintenance and updates.
- Not as feature-rich as other scraping libraries.
- It can't handle large-scale scraping.
How to Bypass Cloudflare in Python Using cfscrape
To use cfscrape to bypass Cloudflare in Python, install it via pip:
pip install cfscrape
The next step is to import the module and call the create_scraper() method. The rest works the same way as the requests library: requests made through the scraper are handled to get past Cloudflare's classic anti-bot protection and fetch the necessary information from the web page.
import cfscrape

scraper = cfscrape.create_scraper()
scraped_data = scraper.get('https://opensea.io/rankings/trending')
print(scraped_data.text)
The library returns the same HTML we saw in the previous example.
undetected-chromedriver, developed as an extension to Selenium, stands out from similar tools for its ability to bypass bot protection software. The module automatically downloads a driver binary to your system and patches it.
Pros:
- It can bypass bot protection.
- It automatically loads and patches a driver binary.
Cons:
- It's slow compared to other web scraping tools.
- Inefficient for large-scale web scraping tasks.
How to Bypass Cloudflare in Python Using undetected_chromedriver
To use undetected-chromedriver for Python Cloudflare bypass, start by installing it:
pip install undetected-chromedriver
Now, import undetected-chromedriver and use the uc.Chrome() method to create a Chrome web browser object, and then use the driver.get() method to navigate to the URL you want to scrape.
import undetected_chromedriver as uc

driver = uc.Chrome()
driver.get('https://opensea.io/rankings/trending')
It's important to note that the undetected_chromedriver library is only designed to bypass Cloudflare's security measures and can't be used as a primary solution for complex scraping. Therefore, you'll have to combine this module with other libraries to scrape data from the website.
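As an illustration of that second step, the sketch below pulls text out of an HTML string by class name. It uses Python's built-in html.parser instead of a third-party parser so it runs without extra installs; the ClassTextExtractor class, the sample markup, and the class name are all made up for the demo. With a live browser, you'd feed it driver.page_source, the Selenium attribute holding the rendered HTML.

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect the text of every element that carries a given CSS class."""

    # Void elements never get a closing tag, so they must not affect nesting depth
    VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # greater than 0 while inside a matching element
        self.buffer = []    # text pieces of the element being read
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.class_name in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:  # the matching element just closed
            self.results.append("".join(self.buffer).strip())
            self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)

# Sample markup standing in for driver.page_source
sample_html = (
    '<div><span class="name">PATCHWORKS</span>'
    '<span class="name other">Moonrunners <b>Official</b></span>'
    '<p>skip me</p></div>'
)
extractor = ClassTextExtractor("name")
extractor.feed(sample_html)
print(extractor.results)
```

For anything beyond simple cases, a dedicated parser such as BeautifulSoup4 is more robust against malformed markup.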
Running the script opens the target web page in the patched Chrome browser.
Knowing how to bypass anti-bots is as important as the scraping process itself, especially when you're looking to scrape a web page protected by Cloudflare. In this article, we covered four tools that can be used to bypass Cloudflare using Python: ZenRows, cloudscraper, cfscrape and undetected-chromedriver.
While most of these tools are effective for bypassing Cloudflare detection in Python, they fall short when it comes to large-scale scraping or advanced Cloudflare security measures, like the Cloudflare v2 challenge CAPTCHA. Of the tools covered here, ZenRows is the only one that handles these cases, and you can get your free API key now.