
Wafw00f to Scrape Websites in 2024

June 14, 2023 · 4 min read

Many websites use advanced systems to block bot traffic, which includes scrapers. The good news is you'll learn to use Wafw00f to bypass Cloudflare and other firewalls in this article.

Let's dive in!

What Is Wafw00f Used for

Wafw00f is an open-source tool designed to identify and fingerprint Web Application Firewalls (WAFs). It helps you determine if a website is behind a WAF and provides insights into its characteristics to help you bypass the anti-bot system.

How Wafw00f Works

Wafw00f analyzes the HTTP responses it receives from a target website and compares their patterns against a database of known WAF signatures. If that doesn't produce a match, Wafw00f sends several specially crafted requests. If that also fails, it falls back on a heuristic that uses the previous responses to guess the WAF in question.

Here's a structural overview of how Wafw00f works:

  1. Detecting the presence of a WAF: Wafw00f analyzes response headers, response body, and server response characteristics, amongst other things, to establish the presence of a firewall.
  2. Signature database: Wafw00f matches the observed server response patterns against a database of signatures and patterns associated with different WAFs to identify what it is up against. These signatures include known error messages, headers, and blocking patterns exhibited by different WAFs.
  3. Fingerprinting: Once a WAF is detected, Wafw00f gathers additional information that can help you bypass the firewall.
  4. Reporting: Lastly, Wafw00f provides a report containing the name of the WAF, its version, and behavioral patterns.
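The detection and signature-matching steps above can be illustrated with a minimal sketch. It is not Wafw00f's actual code: the `SIGNATURES` table, the `ExampleWAF` entry, and the `identify_waf` helper are hypothetical names made up for this example, and only response headers are checked.

```python
import re

# Hypothetical signature table: these patterns are illustrative only and are
# not copied from Wafw00f's real signature database.
SIGNATURES = {
    "Cloudflare": {"header": "server", "pattern": re.compile(r"cloudflare", re.I)},
    "ExampleWAF": {"header": "x-example-waf", "pattern": re.compile(r".+")},
}


def identify_waf(headers):
    """Return the names of all signatures whose header pattern matches."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    matches = []
    for waf_name, sig in SIGNATURES.items():
        value = normalized.get(sig["header"])
        if value is not None and sig["pattern"].search(value):
            matches.append(waf_name)
    return matches


sample_headers = {"Server": "cloudflare", "Content-Type": "text/html"}
print(identify_waf(sample_headers))
```

A real fingerprinting tool would also inspect cookies, status codes, and response bodies, but the core idea is the same: compare what the server sends back with a table of known patterns.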

Let's see how to leverage this WAF fingerprinting tool to bypass WAFs and retrieve the desired data.

Prerequisites

Depending on your operating system, there are different approaches to installing Wafw00f.

Install on Linux

Clone the Wafw00f GitHub repository using the following command:

Terminal
git clone https://github.com/EnableSecurity/wafw00f

Then, navigate to the Wafw00f directory:

Terminal
cd wafw00f

Grant execute permission to the setup script:

Terminal
chmod +x setup.py

Lastly, run the setup script to install Wafw00f:

Terminal
python setup.py install

Now, you can run Wafw00f:

Terminal
wafw00f https://example.com/

Install on Windows

For Windows, download Wafw00f's latest release and extract the files to a directory of your choice. Within that directory, navigate to the wafw00f folder and run the Python setup script:

Terminal
cd wafw00f

python setup.py install

Here's an example of how to run Wafw00f:

Terminal
python main.py https://example.com

After cloning the GitHub repository, you can also build a Docker image if you have Docker installed:

Terminal
docker build . -t wafw00f


Then run it as shown in the example below:

Example
docker run --rm -it wafw00f https://example.com

How to Use Wafw00f

Now that you're all set up, let's try scraping a website protected by a firewall: G2.

First, run the target website through Wafw00f to determine the WAF and its behavior:

Terminal
wafw00f https://www.g2.com/

Here's our result:

G2 Result

Congrats, you've identified your first WAF!

From the example above, we see that G2 is behind Cloudflare. Also, the "No WAF detected by generic detection" message indicates that Cloudflare intercepted the requests before redirecting to the G2 URL, which isn't behind any firewall. So, if you can get past Cloudflare, you can retrieve the data you're after.
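If you want to act on that result in a script, you can parse the detection line from the tool's console output. The sketch below is an illustration under assumptions: the `SAMPLE_OUTPUT` text approximates what Wafw00f prints and may differ between versions, and `parse_detected_waf` is a hypothetical helper, not part of Wafw00f.

```python
import re

# Assumed sample of Wafw00f's console output; the exact wording may differ
# between versions, so treat this format as an assumption.
SAMPLE_OUTPUT = """\
[*] Checking https://www.g2.com/
[+] The site https://www.g2.com/ is behind Cloudflare (Cloudflare Inc.) WAF.
[~] Number of requests: 2
"""

# Matches the "is behind ... WAF." line and captures the URL and WAF name.
BEHIND_RE = re.compile(r"The site (?P<url>\S+) is behind (?P<waf>.+?) WAF\.")


def parse_detected_waf(output):
    """Return (url, waf) from the first 'is behind' line, or None if absent."""
    match = BEHIND_RE.search(output)
    if match is None:
        return None
    return match.group("url"), match.group("waf")


print(parse_detected_waf(SAMPLE_OUTPUT))
```

Returning `None` when no line matches lets the calling code distinguish "no WAF reported" from a detected one.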

Getting past Cloudflare requires emulating human behavior using fortified headless browsers and real browser User-Agent strings in your requests. Alternatively, you can sidestep Cloudflare by making requests directly to the target website's origin IP address.

However, both approaches are unreliable and will require tedious manual work. For example, headless browsers still get blocked, and finding the origin server's IP address can be challenging.
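As a small illustration of the User-Agent part, here is a sketch that builds a request carrying a browser-like User-Agent using only Python's standard library. The `BROWSER_UA` string and the `build_request` helper are assumptions chosen for this example, and, as noted above, a header on its own is unlikely to be enough against Cloudflare.

```python
import urllib.request

# Example browser User-Agent string; an assumption for illustration. Swap in
# the one your own browser sends.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
)


def build_request(url):
    """Build (but do not send) a GET request with a browser User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})


request = build_request("https://www.g2.com/")
# urllib stores header names capitalized, hence "User-agent" here.
print(request.get_header("User-agent"))
# To actually send it: urllib.request.urlopen(request, timeout=10)
```

The request is only constructed here, not sent, so you can inspect the headers before pointing it at a live site.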

Fortunately, there's a more straightforward method to bypass WAFs. Let's find out how.

Best Wafw00f Alternative

ZenRows is a complete anti-bot bypass toolkit that enables users to get around Cloudflare and other WAFs. It works with any programming language, including Python, Java, Node.js, Ruby, and Go. Let's try to scrape G2, a site protected by Cloudflare, using ZenRows.

Start by signing up to get your free API key. You'll get to the Request Builder page. There, paste the target URL and check the boxes for Anti-bot, Premium Proxy, and JavaScript Rendering to set the necessary parameters to true. In this case, we chose Python and the scraper's code is auto-generated.

ZenRows Dashboard

Now, install Python Requests using the following command (although any HTTP library works).

Terminal
pip install requests

Lastly, copy the code ZenRows provides and run it in your favorite editor.

Example
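# pip install requests
import requests

url = 'https://g2.com/'
apikey = 'Your API Key'  # replace with your own ZenRows API key

# Query parameters for the ZenRows API call
params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'antibot': 'true',
    'premium_proxy': 'true',
}

# Send the request through ZenRows and print the returned HTML
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)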

Here's our result:

Output
//..
<title>Business Software and Services Reviews | G2</title>
//..

Bingo, you've bypassed your first WAF!
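Once you have the HTML, you'll usually want a specific field rather than the whole page. As a minimal sketch, the code below pulls the page title out of an HTML string using Python's standard library. The `TitleParser` class, the `extract_title` helper, and the `sample_html` string (modeled on the output above) are illustrative assumptions; in practice you would pass `response.text` instead.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only start collecting for the first <title> encountered.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def extract_title(html):
    """Return the page title, or None if the document has no <title>."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title


sample_html = (
    "<html><head><title>Business Software and Services Reviews | G2</title>"
    "</head><body></body></html>"
)
print(extract_title(sample_html))
```

For anything beyond a single tag, a dedicated HTML parsing library is usually more convenient, but the standard library keeps this example dependency-free.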

Conclusion

Bypassing WAFs with Wafw00f presents some unique challenges that may require tedious work and expertise. Fortunately, ZenRows offers an easier and more scalable solution that empowers you to bypass any anti-bot measures.

In other words, it's unnecessary to investigate a website to identify its WAF. Just plug in ZenRows, and you can retrieve the necessary data. Sign up now to get 1,000 free API credits.
