Cloudflare is a widely used web security and performance service. Its advanced anti-bot system uses sophisticated techniques to identify and block automated traffic, which is why scrapers often run into "ACCESS DENIED" error messages.
In this article, you'll learn how to bypass Cloudflare with Python and the Scrapy Cloudflare middleware. You'll also explore three additional, more effective methods for bypassing Cloudflare with Scrapy:
Use a web scraping API.
Add premium proxies.
Optimize your request headers.
Let's get started!
What Is Scrapy-Cloudflare Middleware?
Scrapy Cloudflare middleware is a package that integrates with the Scrapy web scraping framework to handle Cloudflare challenges for you. It acts as an intermediary between your Scrapy spider and target servers, intercepting and manipulating requests and responses at various stages of the scraping process.
By leveraging the middleware in your Scrapy project, you have an increased chance of avoiding detection and blocks.
How Does Scrapy-Cloudflare Work?
When a Scrapy spider starts crawling, it generates requests for predefined URLs. These requests pass through the middleware pipeline, where Scrapy Cloudflare can modify them to simulate human behavior.
The middleware's main job is to get past Cloudflare's "I'm Under Attack Mode" page. When the Cloudflare challenge server responds to a request, the middleware intercepts the response and solves the JavaScript challenge before passing the result back to your spider.
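To picture where this happens, here's a minimal sketch of a custom Scrapy downloader middleware. The process_request and process_response hooks are Scrapy's real middleware API, but the challenge-detection logic below is simplified for illustration and isn't the package's actual implementation:
# simplified sketch of a Cloudflare-handling downloader middleware;
# the hook names are Scrapy's real API, the detection logic is illustrative
class SimplifiedCloudflareMiddleware:
    def process_request(self, request, spider):
        # returning None lets the request continue through the pipeline
        return None

    def process_response(self, request, response, spider):
        # Cloudflare's "I'm Under Attack Mode" typically answers with a 503
        # page containing a JavaScript challenge
        if response.status == 503 and b"challenge" in response.body:
            spider.logger.info("Cloudflare challenge detected at %s", response.url)
            # a real middleware would solve the challenge here and retry
            # the request with the resulting clearance cookies
            return request.replace(dont_filter=True)
        return response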
How to Bypass Cloudflare With Scrapy-Cloudflare Middleware?
This tutorial will walk you through bypassing Cloudflare using Python and Scrapy. Before making your requests, you need to add the middleware to your DOWNLOADER_MIDDLEWARES settings.
1. Set Up Scrapy
Scrapy is an open-source framework that requires Python 3.6 or higher, so ensure it's installed. Then, install Scrapy with the following command in your terminal:
pip install scrapy
After that, run the following command to create a new Scrapy project, replacing test_project with your project name.
scrapy startproject test_project
You'll get a response containing information about your new project and how to start a spider.
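The output should look roughly like this (the template directory path will vary with your installation):
New Scrapy project 'test_project', using template directory '...', created in:
    /path/to/test_project

You can start your first spider with:
    cd test_project
    scrapy genspider example example.com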
Navigate to your new project directory and start your first spider.
cd test_project
scrapy genspider <spider_name> <target_url>
That generates a basic spider template in your project's spiders folder.
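For instance, a spider generated with the name test and the target example.com would look roughly like this (the exact template can vary slightly between Scrapy versions):
import scrapy


class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        pass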
2. Install and Integrate Scrapy Cloudflare Middleware
Navigate to the root directory and run the following command to install Scrapy Cloudflare middleware:
pip install scrapy_cloudflare_middleware
Next, open the settings.py file and add the Scrapy Cloudflare middleware. This is how your settings.py file should look:
BOT_NAME = "test_project"

SPIDER_MODULES = ["test_project.spiders"]
NEWSPIDER_MODULE = "test_project.spiders"

DOWNLOADER_MIDDLEWARES = {
    "test_project.middlewares.TestProjectDownloaderMiddleware": 543,
    "scrapy_cloudflare_middleware.middlewares.CloudFlareMiddleware": 560,
}
The Scrapy Cloudflare middleware is assigned a priority value of 560 so that it processes responses just before Scrapy's built-in RetryMiddleware, which sits at 550 by default.
Now, let's test our Scrapy spider against a Cloudflare-protected website, G2.
To do that, open your test.py spider and replace the example URL with https://www.g2.com/. Then, print the response to the console.
Your complete code should look like this:
import scrapy

class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["www.g2.com"]
    start_urls = ["https://www.g2.com/"]

    def parse(self, response):
        print(response.text)
You'll get the following result when attempting to scrape G2:
// ..
INFO: Spider opened
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
INFO: Spider opened: test
INFO: Telnet console listening on 127.0.0.1:6023
DEBUG: Crawled (403) <GET https://www.g2.com/> (referer: None)
INFO: Ignoring response <403 https://www.g2.com/>: HTTP status code is not handled or not allowed
INFO: Closing spider (finished)
The request was unsuccessful: we got a Cloudflare 403 Forbidden error, meaning access to the target website's resources was denied.
That's because Cloudflare's challenges have evolved since the Scrapy Cloudflare middleware was last updated, so it's no longer a viable tool for bypassing Cloudflare. But don't worry: the next section covers working solutions.
Best Alternatives to Scrapy Cloudflare Middleware
As you saw, Scrapy Cloudflare middleware isn't a viable option anymore and gets easily blocked by Cloudflare. Consider these effective alternatives to ensure your scraper's success:
Use a Web Scraping API
Web scraping APIs like ZenRows are a more effective and scalable solution for bypassing Cloudflare. You can completely replace the Scrapy Cloudflare middleware with ZenRows, as it provides a complete anti-bot bypass toolkit capable of handling Cloudflare's challenges.
Let's see it in action against G2, the same website that got you blocked in the last section! You'll retrieve your desired data using the ZenRows API.
First of all, sign up to ZenRows, and you'll get redirected to the Request Builder page. Paste your target URL into the URL to Scrape input field, check the Premium Proxies and JS Rendering boxes, select Python as your programming language, and click on the API tab.
We'll use the generated code to create a function for making requests through the ZenRows API. This function takes two arguments, your ZenRows API key and the target URL, and adds the required js_render and premium_proxy parameters.
import scrapy
from urllib.parse import urlencode

# function to generate the ZenRows API URL with the required parameters
def get_zenrows_api_url(url, api_key):
    params = {
        "url": url,
        "js_render": "true",
        "premium_proxy": "true"
    }
    api_url = f"https://api.zenrows.com/v1/?apikey={api_key}&{urlencode(params)}"
    return api_url
Now, use this function in your spider to send a request using the ZenRows API and extract your desired data. Your complete code will look like this:
import scrapy
from urllib.parse import urlencode

# function to generate the ZenRows API URL with the required parameters
def get_zenrows_api_url(url, api_key):
    params = {
        "url": url,
        "js_render": "true",
        "premium_proxy": "true"
    }
    api_url = f"https://api.zenrows.com/v1/?apikey={api_key}&{urlencode(params)}"
    return api_url

class ScraperSpider(scrapy.Spider):
    name = "scraper"

    def start_requests(self):
        urls = ["https://www.g2.com"]
        api_key = "<YOUR_ZENROWS_API_KEY>"
        for url in urls:
            # make a request using the ZenRows API
            api_url = get_zenrows_api_url(url, api_key)
            yield scrapy.Request(api_url, callback=self.parse)

    def parse(self, response):
        # extract and print the HTML of the page
        print(response.text)
Here's the result, which prints the complete HTML of the target page:
<html>
<!-- ... -->
<head>
    <meta charset="utf-8">
    <!-- ... -->
    <title>Business Software and Services Reviews | G2</title>
    <!-- ... -->
    <meta content="en-us" http-equiv="content-language">
    <meta content="website" property="og:type">
    <meta content="G2" property="og:site_name">
    <meta content="Business Software and Services Reviews | G2" property="og:title">
    <meta content="https://www.g2.com/" property="og:url">
    <meta
        content="Compare the best business software and services based on user ratings and social data. Reviews for CRM, ERP, HR, CAD, PDM and Marketing software."
        property="og:description">
    <!-- ... -->
</head>
<body>
    <!-- ... -->
</body>
</html>
Awesome, right? Integrating ZenRows API with Scrapy makes bypassing Cloudflare really easy.
Add Premium Proxies
Using proxies is an essential strategy for web scraping, especially when dealing with websites protected by services like Cloudflare. A proxy acts as an intermediary between your scraper and the target website. It masks your IP address and prevents your scraper from being easily detected and blocked.
There are multiple types of proxies, but for heavily protected websites, rotating and residential premium proxies are the most effective. Rotating proxies switch IP addresses with each request, which reduces the chances of being blocked. Residential proxies use IP addresses from real devices and locations, making your requests appear more legitimate and harder to detect.
You can integrate proxies with Scrapy using services like ZenRows' Proxy Rotator. ZenRows provides rotating residential premium proxies that can help you bypass any anti-bot protections.
Here's how your Scrapy proxy code using ZenRows would look:
import scrapy

class ScraperSpider(scrapy.Spider):
    name = "scraper"

    def start_requests(self):
        urls = ["https://www.g2.com/"]
        for url in urls:
            # make a request using the ZenRows premium proxies
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={"proxy": "http://<YOUR_ZENROWS_API_KEY>:js_render=true&premium_proxy=true@proxy.zenrows.com:8001"},
            )

    def parse(self, response):
        # extract and print the HTML of the page
        print(response.text)
You'll get the complete HTML of the target page by running the above code:
<html>
<!-- ... -->
<head>
    <meta charset="utf-8">
    <!-- ... -->
    <title>Business Software and Services Reviews | G2</title>
    <!-- ... -->
    <meta content="en-us" http-equiv="content-language">
    <meta content="website" property="og:type">
    <meta content="G2" property="og:site_name">
    <meta content="Business Software and Services Reviews | G2" property="og:title">
    <meta content="https://www.g2.com/" property="og:url">
    <meta
        content="Compare the best business software and services based on user ratings and social data. Reviews for CRM, ERP, HR, CAD, PDM and Marketing software."
        property="og:description">
    <!-- ... -->
</head>
<body>
    <!-- ... -->
</body>
</html>
Congrats! You successfully bypassed the Cloudflare protection using premium proxies.
Optimize Your Request Headers
Request headers are key-value pairs sent by the client to the server with every HTTP request. They provide important information such as the browser type, operating system, and referring page.
One of the most crucial headers for web scraping is the User Agent, which identifies the browser and device making the request. Websites often use this header to determine whether a request is coming from a legitimate user or a bot.
By configuring your request headers, you can avoid looking like a bot. Here's a brief example of how to modify the User-Agent in a Scrapy spider:
import scrapy

class ScraperSpider(scrapy.Spider):
    name = "scraper"

    def start_requests(self):
        urls = ["https://httpbin.io/headers"]
        # define custom headers
        custom_headers = {
            "Sec-Ch-Ua-Platform": "\"Linux\"",
            "User-Agent": "Mozilla/5.0 (Linux; x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
        }
        for url in urls:
            yield scrapy.Request(url, headers=custom_headers, callback=self.parse)

    def parse(self, response):
        # print the response body (httpbin echoes back the request headers)
        print(response.text)
In this example, the User-Agent and Sec-Ch-Ua-Platform headers are customized to make the request appear as if it's coming from a Chrome browser on Linux. Adjusting these headers can help you blend in with normal web traffic and avoid detection.
Check out our comprehensive tutorial on how to set headers in Scrapy to learn more.
Keep in mind that optimizing your request headers works even better when combined with proxies, as it further reduces the chances of detection by distributing your requests across multiple IP addresses.
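As a sketch of how the two techniques combine, the spider below sends the custom headers from the previous example through a proxy set per request via Request.meta. The proxy URL is a placeholder; substitute the credentials and endpoint of whatever rotating or residential proxy service you use:
import scrapy


class CombinedSpider(scrapy.Spider):
    name = "combined"

    def start_requests(self):
        urls = ["https://httpbin.io/headers"]
        # custom headers from the previous example
        custom_headers = {
            "Sec-Ch-Ua-Platform": "\"Linux\"",
            "User-Agent": "Mozilla/5.0 (Linux; x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
        }
        for url in urls:
            yield scrapy.Request(
                url,
                headers=custom_headers,
                callback=self.parse,
                # placeholder proxy credentials and endpoint
                meta={"proxy": "http://<PROXY_USER>:<PROXY_PASSWORD>@<PROXY_HOST>:<PROXY_PORT>"},
            )

    def parse(self, response):
        # httpbin.io echoes back the headers it received
        print(response.text)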
Conclusion
The Python Scrapy Cloudflare middleware relied on solving basic Cloudflare JavaScript challenges. However, Cloudflare constantly updates its security measures, and the middleware no longer works as it once did.
Fortunately, ZenRows is an alternative to the Scrapy Cloudflare middleware that offers a tried-and-tested path to avoiding blocks. Sign up now and take it for a spin with a free trial!