WhatWaf for Scraping Websites (2024)

June 22, 2023 · 3 min read

Many websites use anti-bot systems to detect and block automated traffic, and that's where WhatWaf comes in. By identifying the firewall behind a site and suggesting possible bypasses, it helps developers build more robust scraping strategies. In this article, you'll learn how to use WhatWaf to bypass Cloudflare and other protections.

Ready? Let's get started.

What Is WhatWaf

WhatWaf is a Python-based tool designed to identify over 70 web application firewalls (WAFs) behind a target domain. It also comes with additional features that can help bypass anti-bot detection for web scraping, such as a built-in encoder that encodes your payloads using the discovered bypasses.

How WhatWaf Works

While WhatWaf doesn't document its inner workings in detail, we can infer the basics from its reports: it starts by sending GET requests to gather HTTP responses. Those responses are then analyzed and compared against a database of WAF signatures.

That's possible because firewalls can return different HTTP responses depending on their configuration and the specific rules set. By matching the target's responses to a known WAF signature, WhatWaf can determine which security system protects the server.
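To make the idea of signature matching concrete, here is a minimal sketch. It is not WhatWaf's actual code: the `SIGNATURES` table, the `detect_wafs` function, and the specific patterns are assumptions chosen for illustration, and WhatWaf's real signature database is likely to look different.

```python
import re

# Hypothetical signatures for illustration only; not WhatWaf's real database.
# Each check is (where to look, pattern), where "header:<name>" means a
# response header and "body" means the response body.
SIGNATURES = {
    "Cloudflare": [
        ("header:server", re.compile(r"cloudflare", re.I)),
        ("header:cf-ray", re.compile(r".+")),
        ("body", re.compile(r"attention required!", re.I)),
    ],
    "ASP.NET": [
        ("header:x-powered-by", re.compile(r"asp\.net", re.I)),
        ("header:x-aspnet-version", re.compile(r".+")),
    ],
}


def detect_wafs(headers, body=""):
    """Return the names of all signatures with at least one matching check."""
    # Header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    found = []
    for waf, checks in SIGNATURES.items():
        for target, pattern in checks:
            if target == "body":
                text = body
            else:
                text = lowered.get(target.split(":", 1)[1], "")
            if pattern.search(text):
                found.append(waf)
                break  # one matching check is enough for this signature
    return found


sample_headers = {"Server": "cloudflare", "X-Powered-By": "ASP.NET"}
print(detect_wafs(sample_headers))  # ['Cloudflare', 'ASP.NET']
```

A real tool would send several requests, including ones designed to trigger a block, before comparing the responses. This sketch only covers the comparison step.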

In addition, WhatWaf performs a bypass analysis to recommend techniques to circumvent the firewalls, such as payload modification, tampering, obfuscation, etc.
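As a rough illustration of what payload tampering means, the sketch below applies three common transformations to a harmless query fragment. The function names and the sample string are assumptions made for this example, and WhatWaf's own tamper scripts may work differently.

```python
import random
from urllib.parse import quote


def url_encode(payload):
    """Percent-encode every character that isn't an unreserved URL character."""
    return quote(payload, safe="")


def space_to_comment(payload):
    """Replace each space with an inline SQL-style comment."""
    return payload.replace(" ", "/**/")


def random_case(payload, seed=None):
    """Randomly switch each letter between lowercase and uppercase."""
    rng = random.Random(seed)
    return "".join(rng.choice((ch.lower(), ch.upper())) for ch in payload)


payload = "order by supercarid"  # sample string, for illustration only
print(url_encode(payload))        # order%20by%20supercarid
print(space_to_comment(payload))  # order/**/by/**/supercarid
print(random_case(payload))       # output varies between runs
```

Each function changes how the payload looks without changing what the server ultimately interprets, which is the core idea behind the tampering techniques a bypass analysis suggests.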

Prerequisites

To get started, ensure you have Python installed. Most Linux distributions come with it pre-installed, but you can confirm this by running the following command:

Terminal
python --version

# if Python 3 is installed as python3
python3 --version

After that, clone WhatWaf's GitHub repository:

Terminal
git clone https://github.com/Ekultek/WhatWaf.git

That automatically saves the tool's source code, which includes configuration files, in the WhatWaf directory.

Lastly, navigate to the WhatWaf directory and install the necessary dependencies using the following commands:

Terminal
cd WhatWaf

sudo pip install -r requirements.txt

How to Use WhatWaf

Run the following command to view the arguments we can use with the tool:

Terminal
sudo ./whatwaf --help

You'll get this:

WhatWaf Help

From the result above, we see that we can pass a single URL with the -u flag to detect its WAF. Let's try it with Hack Yourself First, a security training resource.

Terminal
sudo ./whatwaf -u "https://hack-yourself-first.com/Make/5?orderby=supercarid"

Here's the first half of our result:

WhatWaf Result

WhatWaf detected two firewalls: Microsoft's ASP.NET and Cloudflare.
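If you'd rather launch the scan from a Python script, for example to loop over several URLs, a thin wrapper along these lines could work. It's a sketch under assumptions: `build_whatwaf_command` and `scan` are hypothetical helper names, the script is assumed to run from inside the WhatWaf directory, and only the -u flag shown above is used. Prepend `sudo` to the command list if your setup requires it.

```python
import subprocess


def build_whatwaf_command(url, executable="./whatwaf"):
    """Build the argument list for a single-URL scan using the -u flag."""
    return [executable, "-u", url]


def scan(url, timeout=300):
    """Run WhatWaf on one URL and return its console output as text.

    Assumes the current working directory is the cloned WhatWaf directory.
    """
    result = subprocess.run(
        build_whatwaf_command(url),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout


# Show the command that would be executed, without running it.
print(build_whatwaf_command("https://hack-yourself-first.com/Make/5?orderby=supercarid"))
```

Passing the arguments as a list means the URL reaches WhatWaf unchanged, with no shell quoting to worry about.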

Here's the second half of the result, containing potential bypasses:

WhatWaf Bypass Result

The descriptions outline different tampering techniques that can be applied to modify the payload in a specific way to facilitate a WAF bypass. We can incorporate them into our scraping strategy to get around the detected firewalls.

However, these bypass techniques aren't foolproof and only work in some situations. Furthermore, WhatWaf's bypass analysis doesn't work for websites using advanced protection.

Let's try investigating G2, a product review website.

Terminal
sudo ./whatwaf -u https://g2.com/

Here's the first half of our result. We can see that G2 is behind Cloudflare:

G2 WAF Result

Let's see the result of the bypass analysis. Unfortunately, WhatWaf doesn't succeed in detecting possible bypasses.

G2 Bypass Result

That raises the question: What works against advanced WAFs? Read on to find out.

Best WhatWaf Alternative

WhatWaf's bypass analysis fails against advanced protection and isn't efficient or scalable. ZenRows offers an alternative: a single API call handles detection avoidance and retrieves the data you need. Let's use ZenRows to scrape G2, where WhatWaf failed.

To follow along, sign up to get your free API key. You'll land on the Request Builder, where you can generate the scraping code. Enter the target URL and check the boxes for Premium Proxies and JS Rendering to set these parameters to true. In this example, we chose Python as the language.

building a scraper with zenrows

Install Python Requests using the following command (any other HTTP library also works).

Terminal
pip install requests

Then, copy the code ZenRows provided and run it in your favorite editor.

Terminal
# pip install requests
import requests

url = 'https://g2.com/'
apikey = '<YOUR_ZENROWS_API_KEY>'
params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',      # enable JS Rendering
    'premium_proxy': 'true',  # enable Premium Proxies
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)

The page's HTML will print:

Terminal
//..
<title>Business Software and Services Reviews | G2</title>
//..
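To confirm you received the real page rather than a block page, you can pull the title out of the returned HTML. Below is a minimal sketch using only Python's standard library. `TitleParser` and `extract_title` are names made up for this example, and the sample string stands in for `response.text` from the script above.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element in a document."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Only capture the first title; ignore later ones (e.g. inside SVG).
        if tag == "title" and not self.title:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# Stand-in for response.text from the ZenRows request.
sample = "<html><head><title>Business Software and Services Reviews | G2</title></head></html>"
print(extract_title(sample))  # Business Software and Services Reviews | G2
```

In your scraper, you would call `extract_title(response.text)` and compare the result with the title you expect.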

Congrats, you've finally bypassed Cloudflare.

Conclusion

Knowing how to get past WAFs is critical for any data extraction project.

While WhatWaf can detect what firewall a target website uses, its bypass analysis fails against advanced measures. Fortunately, ZenRows offers an effective and scalable solution. Try it for free.
