Cloudflare identifies bot traffic and blocks it, which unfortunately includes web scrapers. The good news is that you can use the Python library Cloudscraper to bypass it.
In this web scraping tutorial, you'll learn how to bypass Cloudflare using Cloudscraper. We'll also discuss the common errors you may encounter and how to fix them.
Here's what you need to do in a nutshell:
- Step 1: Import Cloudscraper and other dependencies.
- Step 2: Create a Cloudscraper instance and define your target website.
- Step 3: Access the website to retrieve its data.
What Is Cloudscraper?
Cloudscraper is a web scraping library built exclusively for retrieving data from Cloudflare-protected websites. This ability and its compatibility with popular libraries like Python Requests and BeautifulSoup make it a valuable tool for data extraction.
With Cloudscraper, you can bypass Cloudflare's "I'm Under Attack Mode" (IUAM) pages.
How Does Cloudscraper Work?
Cloudscraper relies on executing JavaScript like a regular browser to bypass Cloudflare. To achieve that, it uses browser-like headers and a JavaScript Engine/interpreter to solve JavaScript challenges. This lets you emulate natural user behavior easily without manually deobfuscating Cloudflare's JavaScript.
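To make the "browser-like headers" part concrete, here is a minimal sketch that lists the browser-style headers configured on a session. It assumes only that the object exposes a `headers` mapping, as a Python Requests session does; `browser_like_headers` and `BROWSER_HEADERS` are hypothetical names for illustration, not part of Cloudscraper.

```python
from typing import Mapping

# Headers a real browser typically sends; used here only as a checklist.
BROWSER_HEADERS = ("User-Agent", "Accept", "Accept-Language", "Accept-Encoding")


def browser_like_headers(session) -> dict:
    """Return the browser-style headers configured on a session-like object."""
    headers: Mapping[str, str] = session.headers
    # Compare case-insensitively, since HTTP header names are not case-sensitive.
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        name: lowered[name.lower()]
        for name in BROWSER_HEADERS
        if name.lower() in lowered
    }
```

With Cloudscraper installed, calling `browser_like_headers(cloudscraper.create_scraper())` should show the User-Agent and related headers it sends in place of the Requests defaults.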
Your Starting Point: Getting Blocked with Python Requests
Before we get to the solution, let's look at the problem: Cloudflare blocking a Python Requests scraper. To demonstrate, we tried to scrape OpenSea's NFT Collection Stats, a Cloudflare-protected web page.
We sent an HTTP request to access our target website.
import requests
# Plain Requests call, with no anti-bot handling
res = requests.get("https://opensea.io/rankings")
print("The status code is", res.status_code)
print(res.text)
But we got this result:
The status code is 403
<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Access denied</title>
The result above shows a 403 (Forbidden) status code, which means the server refused the request. Cloudflare served an "Access denied" page instead of OpenSea's Collection stats page. This issue is common in web scraping, as explored in the 403 web scraping error article.
To get a clearer understanding, we saved the response locally to view it in a browser.
That happened because Cloudflare identified our request as coming from a bot and blocked it. To get past this Cloudflare error page, your scraper must appear as human as possible.
Fortunately, Cloudscraper can get you there to an extent. But more on that later.
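If you want to check for this situation in your own scripts, the sketch below flags a response that resembles the block page above and saves the body so you can open it in a browser. The detection rule is only a heuristic based on the 403 status and the "Access denied" title seen in our test; `looks_blocked` and `save_for_inspection` are hypothetical helper names.

```python
from pathlib import Path


def looks_blocked(status_code: int, body: str) -> bool:
    """Heuristic: True when a response resembles the block page shown above."""
    # 403 is the status from the failed request; the title check is an
    # assumption based on the "Access denied" page in that response.
    return status_code == 403 or "<title>access denied</title>" in body.lower()


def save_for_inspection(body: str, path: str = "blocked.html") -> Path:
    """Write the response body to disk so it can be opened in a browser."""
    target = Path(path)
    target.write_text(body, encoding="utf-8")
    return target
```

For example, after `res = requests.get(url)`, you could call `looks_blocked(res.status_code, res.text)` and, if it returns True, `save_for_inspection(res.text)`.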
How to Bypass Cloudflare Using Cloudscraper in Python?
1. Install Cloudscraper
First, you'll need Python 3. Keep in mind that some systems have it pre-installed.
After that, install Cloudscraper and BeautifulSoup (to parse the HTML in the responses).
pip install cloudscraper beautifulsoup4
2. Use Cloudscraper in Python
A lot of work goes into bypassing Cloudflare protection with Python.
However, with Cloudscraper, you don't need to worry about what goes on behind the scenes. Rather, you can call the scraper function and wait a few seconds to gain access.
Here's how to do it.
- Import Cloudscraper and other dependencies (BeautifulSoup).
from bs4 import BeautifulSoup
import cloudscraper
- Create a Cloudscraper instance and define your target website.
scraper = cloudscraper.create_scraper()
url = "https://opensea.io/rankings"
- Access the website to retrieve its data.
info = scraper.get(url)
print(info.status_code)
soup = BeautifulSoup(info.text, "html.parser")
print(soup.find(class_="gCpBEX").get_text())
The code block above prints the status code of the request and the text of the page title element, "Collection stats". Combining the three code blocks above gives you the complete script.
from bs4 import BeautifulSoup
import cloudscraper
url = "https://opensea.io/rankings"
scraper = cloudscraper.create_scraper()
info = scraper.get(url)
print(info.status_code)
soup = BeautifulSoup(info.text, "html.parser")
print(soup.find(class_="gCpBEX").get_text())
It brings the following result:
200
Collection stats
Now, you've successfully bypassed your first Cloudflare protection.
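The script above assumes the request succeeds and the element exists. A slightly more defensive variant could look like the sketch below, which stops on a non-200 status and tolerates a missing element. `fetch_html` and `text_or_default` are hypothetical helpers; they only assume the scraper has a Requests-style `get()` method and that the element comes from BeautifulSoup's `find()`.

```python
def fetch_html(scraper, url: str, timeout: float = 30.0) -> str:
    """Fetch url with a Cloudscraper/Requests-style object and return the body.

    Raises RuntimeError when the status code is not 200, so a block page is
    never parsed as if it were the real content.
    """
    response = scraper.get(url, timeout=timeout)
    if response.status_code != 200:
        raise RuntimeError(
            f"Request to {url} failed with status {response.status_code}"
        )
    return response.text


def text_or_default(element, default: str = "") -> str:
    """Return element.get_text(), or default when nothing was found."""
    # soup.find(...) returns None when no element matches, for example
    # after the site changes its generated class names.
    return element.get_text() if element is not None else default
```

In the complete script, that would mean building the soup from `fetch_html(scraper, url)` and printing `text_or_default(soup.find(class_="gCpBEX"), "title not found")`.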
Cloudscraper Non-Default Features
Cloudscraper has many optional features that you can enable by passing arguments to built-in functions such as create_scraper(), get_tokens(), and get_cookie_string().
Some examples include:
- Browser/user agent filtering.
- Cookies.
- CAPTCHA.
- Delays.
- JavaScript engines and interpreters.
Let's say you want to bypass a Cloudflare JavaScript challenge while appearing as a mobile user agent.
For that, you'll need a JavaScript engine and some of the following parameters:
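scraper = cloudscraper.create_scraper(
    interpreter='nodejs',  # JavaScript engine used to solve the challenge
    delay=10,  # seconds to wait before submitting the challenge answer
    browser={
        'browser': 'chrome',
        'platform': 'android',
        'desktop': False,  # mobile user agents only
    },
    captcha={
        'provider': '2captcha',
        'api_key': 'your_2captcha_api_key',
    },
)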
The mobile and desktop parameters are "True" by default, so you must turn one off if you want only the other.
Also, Cloudscraper has a list of supported JavaScript engines and third-party CAPTCHA solvers. You can check the PyPI documentation for more details.
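Because both device flags default to "True", it can help to set them explicitly. The sketch below builds a `browser` dictionary that selects exactly one device class. `browser_profile` is a hypothetical helper, and it doesn't validate the combination, so the platform you pass should match the device class (for example, android for mobile).

```python
def browser_profile(browser: str = "chrome", platform: str = "windows",
                    mobile: bool = False) -> dict:
    """Build a `browser` dictionary that selects exactly one device class."""
    return {
        "browser": browser,
        "platform": platform,
        # Both flags default to True in Cloudscraper, so set them explicitly.
        "mobile": mobile,
        "desktop": not mobile,
    }
```

You would then pass the result as `cloudscraper.create_scraper(browser=browser_profile(platform="android", mobile=True))`.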
Can Cloudscraper Bypass Newer Cloudflare Versions?
Cloudflare frequently updates its bot protection techniques, so let's see how Cloudscraper fights against its newer versions.
For this example, we'll try to scrape Author.Today, a website that uses a newer Cloudflare version.
When we visit this website in a browser, it automatically redirects us to the Cloudflare waiting room, which checks whether our connection is secure.
Since we're sending this request from an actual browser, Cloudflare accepts our connection and redirects us to the original home page.
Now, let's try accessing this website's content with Cloudscraper.
import cloudscraper
url = "https://author.today/"
scraper = cloudscraper.create_scraper()
info = scraper.get(url)
print("the status code is ", info.status_code)
print(info.text)
And it brings the following result:
cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 challenge, This feature is not available in the opensource (free) version.
As you can see, Cloudscraper isn't working against newer Cloudflare versions.
The displayed error message suggests that Cloudscraper has a paid version that would work. Unfortunately, that's not the case.
So, how can you solve this problem? Cloudflare's bot detection quickly adapts to open-source bypass tools and learns to detect them. That said, the only way to get past it is to imitate natural user behavior. You can achieve that with headless browsers driven by tools like Selenium or Puppeteer, together with valid HTTP headers.
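One way to combine the two approaches is to try Cloudscraper first and switch to another strategy only when it raises. The sketch below shows that pattern in a generic form. `fetch_with_fallback` is a hypothetical helper, and both fetchers are plain callables you supply, so nothing here depends on a specific headless browser.

```python
from typing import Callable, Tuple, Type


def fetch_with_fallback(
    url: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> str:
    """Return primary(url); if it raises one of `errors`, return fallback(url)."""
    try:
        return primary(url)
    except errors as exc:
        # Report why the first attempt failed before switching strategy.
        print(f"Primary fetch failed ({type(exc).__name__}); using fallback")
        return fallback(url)
```

For the error shown earlier, you could pass `primary=lambda u: scraper.get(u).text` and `errors=(cloudscraper.exceptions.CloudflareChallengeError,)`, with a fallback function of your own that loads the page in a headless browser.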
However, these approaches also have their limitations and don't always work. Fortunately, the next section discusses an all-in-one alternative.
What Is a Good Cloudscraper Alternative?
If you've encountered trouble with newer Cloudflare versions, then it's time to switch the tool!
ZenRows is a powerful web scraping library that helps with bypassing Cloudflare, regardless of its frequent updates.
Let's try scraping our target website with it!
Start by creating a free account to get your free API key.
Once logged in, you'll see ZenRows' Request Builder. Now, do the following.
- Enter your target website URL.
- Set the power of your anti-bot bypass request using the Antibot boost mode. Also, check the box for Premium Proxies.
- Select Python as language.
- Click on the Try It button to verify it works.
Yay! While we saw Cloudscraper fail against newer Cloudflare versions, ZenRows succeeded.
With its intuitive API, you can easily bypass the anti-bot protection and extract the information you need from any website.
Furthermore, ZenRows can scale your web scraping efforts, so don't hesitate to try it for free.
Extended Cloudscraper Features
Cloudscraper allows you to add more functionality to your web scraper to emulate natural user behavior better. For instance, you can customize the User Agent, cookies, and CAPTCHA solving.
Let's look at a few in a bit more detail, but please note they don't work against newer Cloudflare versions.
Cloudscraper Proxy
Proxies let you visit a website from multiple IP addresses and increase your anonymity, which helps you avoid the common Cloudscraper 403 Cloudflare error.
To set up a Cloudscraper proxy, you only need to make your requests with the proxies parameter, just like you would using Python Requests. Here's an example:
import cloudscraper
# Create new cloudscraper instance
scraper = cloudscraper.create_scraper()
# Specify proxies
proxy = {
    'http': 'http://your-proxy-ip:port',
    'https': 'https://your-proxy-ip:port'
}
# Make request using specified proxies
response = scraper.get('https://example.com', proxies=proxy)
Check out our guide on the best web scraping proxies to learn the best types and providers for web scraping.
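To actually use multiple IP addresses, you can rotate through a list of proxies. The sketch below yields one Requests-style `proxies` dictionary per call, cycling through the URLs you give it. `proxy_rotation` is a hypothetical helper, and using the same endpoint for both the http and https keys is an assumption that may not fit every provider.

```python
from itertools import cycle
from typing import Dict, Iterable, Iterator


def proxy_rotation(proxy_urls: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield Requests-style `proxies` dictionaries, cycling through the URLs."""
    for proxy_url in cycle(list(proxy_urls)):
        # Route both plain and TLS traffic through the same proxy endpoint.
        yield {"http": proxy_url, "https": proxy_url}
```

A typical use would be `proxies = proxy_rotation([...])` once, and then `scraper.get(url, proxies=next(proxies))` for each request.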
Get around CAPTCHA with Cloudscraper
Cloudscraper also supports integrations with third-party CAPTCHA solvers, including 2Captcha, anticaptcha, CapSolver, CapMonster Cloud, deathbycaptcha, and 9kw.
To set up a Cloudscraper CAPTCHA solver, pass the captcha dictionary as an argument with two keys: provider and your authorisation credentials. Here's an example using 2Captcha:
import cloudscraper
# Create new cloudscraper instance with the captcha dictionary as an argument
scraper = cloudscraper.create_scraper(
    captcha={
        'provider': '2captcha',
        'api_key': 'your_2captcha_api_key'
    }
)
# Make your request.
response = scraper.get('https://example.com')
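Rather than hardcoding the API key in your script, you may prefer to read it from an environment variable. The sketch below builds the same `captcha` dictionary that way. `captcha_settings` and the `CAPTCHA_API_KEY` variable name are hypothetical, and other providers may expect different credential keys, so check the documentation before reusing it for them.

```python
import os


def captcha_settings(provider: str = "2captcha",
                     env_var: str = "CAPTCHA_API_KEY") -> dict:
    """Build the `captcha` dictionary, reading the API key from the environment."""
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail early instead of sending requests with an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"provider": provider, "api_key": api_key}
```

You would then create the scraper with `cloudscraper.create_scraper(captcha=captcha_settings())`.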
Cloudscraper Headers: User Agent
Cloudscraper lets you control which browser and device type you want to emulate. To do that, pass the browser dictionary as an argument to the create_scraper() function. The following example emulates an Android mobile device using a Chrome browser.
import cloudscraper
# create cloudscraper instance with a mobile chrome user agent on Android
scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'android',
        'desktop': False
    }
)
# Make Request
response = scraper.get('https://example.com')
Cloudscraper Sessions
When you visit a Cloudflare-protected website with Cloudscraper using Python, it sleeps for 5 seconds to solve JavaScript challenges under the hood. After that, you can use the existing Cloudflare sessions to continue scraping said website.
To do that, pass the session as the sess argument to the create_scraper() function. For example:
import cloudscraper
import requests
session = requests.session()
scraper = cloudscraper.create_scraper(sess=session)
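In practice, continuing with an existing session mostly means reusing the same scraper object for every request. Here is a minimal sketch of that idea; `fetch_all` is a hypothetical helper that only assumes the scraper has a Requests-style `get()` method, and the pause between requests is our own addition.

```python
import time
from typing import Dict, Iterable


def fetch_all(scraper, urls: Iterable[str], pause: float = 1.0) -> Dict[str, int]:
    """Request each URL with the same scraper and return its status code."""
    statuses: Dict[str, int] = {}
    for url in urls:
        # Reusing one scraper keeps its cookies, so a challenge solved on the
        # first request should not have to be solved again for each page.
        statuses[url] = scraper.get(url).status_code
        time.sleep(pause)  # small courtesy delay between requests
    return statuses
```

Called as `fetch_all(scraper, [url_1, url_2])`, it returns a dictionary that maps each URL to the status code it received.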
Other Features
Other Cloudscraper features include delay, debug, and cryptography. Check out the documentation for more details.
Common Errors
Below are some common errors you may encounter when using Cloudscraper.
Paid Version of Cloudscraper
This error indicates that the open-source (free) version of Cloudscraper lacks the functionality needed to solve the challenge and suggests that you need a paid version. However, no paid version of Cloudscraper exists, and the library doesn't work against newer versions of Cloudflare.
No Module Named Cloudscraper
Cloudscraper not working? A common error is No module named 'cloudscraper'. It indicates that the Python interpreter can't find the Cloudscraper module, even if you have installed it.
A common reason for the No module named 'cloudscraper' error is that you have multiple Python versions on your machine and Cloudscraper isn't installed for the one you're running.
Cloudscraper Module Can't Be Loaded in Python
The log message _cloudflare plugin: cloudscraper module can't be loaded in Python indicates that there might be an issue with the Python environment. So, verify that the module is installed in the right Python environment.
Use the following command to check where it's installed:
pip show cloudscraper
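Since pip and your script may belong to different Python installations, it can also help to ask the interpreter itself. The sketch below reports which interpreter is running and whether it can find the module; `module_status` is a hypothetical helper built on the standard library.

```python
import importlib.util
import sys


def module_status(name: str = "cloudscraper") -> dict:
    """Report which interpreter is running and whether it can import `name`."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "installed": spec is not None,
        # Location the module would be loaded from, if it was found.
        "location": getattr(spec, "origin", None),
    }
```

If "installed" comes back False, running `python -m pip install cloudscraper` with that same interpreter installs the module where your script can see it.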
Conclusion
We saw that Cloudscraper in Python helps with older Cloudflare versions, yet you need a different tool, such as ZenRows, to bypass newer ones.
Also, you can save time and reduce costs by using a web scraping API designed to win over all sorts of anti-scraping protections and system updates.
Frequent Questions
How Do You Use Cloudscraper in Python?
To use Cloudscraper in Python, import the Cloudscraper module, create a new instance using the create_scraper() function, and make your request using the scraper.get() method.