How to Bypass Bot Detection

October 28, 2022 · 11 min read

Many websites use anti-bot technologies. These make extracting data from them through web scraping more difficult. In this article, you'll learn the most commonly adopted bot protection techniques and how you can bypass bot detection.

Bots generate almost half of the world's Internet traffic, and many of them are malicious. This is why so many sites implement bot detection systems. Such technologies block requests that they don't recognize as executed by humans. As a result, bot detection is a problem for your scraping process.

Let's learn everything you need to know about bot mitigation and the most popular bot protection approaches. Of course, you'll also see how to defeat them.

What Is Bot Detection?

Bot detection or "bot mitigation" is the use of technology to figure out whether a user is a real human being or a bot. Specifically, these technologies collect data and/or apply statistical models to identify patterns, actions, and behaviors that mark traffic as coming from an automated bot.

A bot is an automated software application programmed to perform specific tasks. Bots generally operate over a network: they imitate human behavior and interact with web pages and real users. Note that not all bots are bad; even Google uses bots to crawl the Internet.

According to the 2022 Imperva Bad Bot Report, bot traffic made up 42.3% of all Internet activity in 2021. This makes bot detection a serious problem and a critical aspect when it comes to security. That's especially true considering that Imperva found that 27.7% of all online traffic comes from bad bots.

As you can see, malicious bots are widespread. Plus, they indiscriminately target businesses small and large. So, bot mitigation has become vitally important, and more and more sites are adopting bot protection systems.

Note that bot detection is one of the anti-scraping technologies because it can block your scrapers. After all, a web scraper is a software application that automatically crawls several pages, and that makes web scrapers bots.

If you want your web scraper to be effective, you need to know how to bypass bot detection. Generally speaking, you have to work around the anti-scraping measures a site puts in place. Only this way can you equip your web scraper with what it needs to get past bot detection.

That's the reason why we wrote an article to dig into the 7 anti-scraping techniques you need to know. Similarly, you might be interested in our guide on web scraping without getting blocked.

Frustrated that your web scrapers are blocked once and again? ZenRows API handles rotating proxies and headless browsers for you.
Try for FREE

How Do You Get Past Bot Detection?

There are general tips that are useful to know if you want to bypass anti-bot protection. These tips work in several other situations, and you should always apply them. That's because they allow your scraper to overcome most of the obstacles.

Considering that bot detection is about collecting data, you should protect your scraper under a web proxy. A proxy server acts as an intermediary between your scraper and your target website server. While doing this, it prevents your IP address and some HTTP headers from being exposed.

This allows you to protect your identity and makes fingerprinting more difficult. A website creates a digital fingerprint when it manages to profile you. This process works by looking at your computer specs, browser version, browser extensions, and preferences.

In other words, the idea is to uniquely identify you based on your settings and hardware. Then, a bot detection system can step in and verify whether your identity is real or not. But don't worry, you'll see the top 5 bot detection solutions and you'll learn how to bypass them soon.

As a general solution to bot detection, you should introduce randomness into your scraper. For example, you could introduce random pauses into the crawling process. After all, no human being works 24/7 nonstop. Also, you need to change your IP and HTTP headers as much as possible. This makes the requests made by the scraper more difficult to track.
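
For example, here's a minimal sketch of how you could introduce random pauses between requests with Python Requests (the target URLs are just placeholders):

import random
import time

import requests

# placeholder list of pages to scrape
urls = [
	"https://targetwebsite.com/page1",
	"https://targetwebsite.com/page2",
	"https://targetwebsite.com/page3",
]

for url in urls:
	response = requests.get(url)
	# process response.text here...

	# pause for a random time between 1 and 5 seconds to avoid
	# the machine-like, fixed-interval pattern that gets bots flagged
	time.sleep(random.uniform(1, 5))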

As you can see, all these solutions are pretty general. If you want to avoid bot detection, you may need more effective approaches. Bypassing bot detection is generally harder than this, but learning about the top bot detection techniques next will serve as a first approach.

Top 5 Bot Detection Solutions and How To Bypass Them

If you want your scraping process to never stop, you need to overcome several obstacles. Bot detection is one of them. So, let's dig into the 5 most adopted and effective anti-bot detection solutions.

Let's learn how to bypass bot detection.

1. IP Address Reputation

One of the most widely adopted anti-bot strategies is IP tracking. The bot detection system tracks all the requests a website receives. If too many requests come from the same IP in a limited amount of time, the system blocks the IP. This happens because only a bot could make so many requests in such a short time.

Also, the anti-bot protection system could block an IP because all its requests come at regular intervals. Again, this is something that only a bot can do. No human being can act so programmatically.

What is important to notice here is that these anti-bot systems can undermine your IP address reputation forever. IP reputation measures the behavioral quality of an IP address. In other terms, it quantifies the number of unwanted requests sent from an IP.

If your IP reputation deteriorates, this could represent a serious problem for your scraper, especially if you aren't using any IP protection system. Verify with Project Honey Pot whether your IP has been compromised.

The only way to protect your IP is to use a rotation system. Keep in mind that premium proxy servers offer IP rotation. You can use a proxy with Python Requests to bypass bot detection as follows:

import requests 
 
# defining the proxies server 
proxies = {
	"http": "http://yourhttpproxyserver.com:8080",
	"https": "http://yourhttpsproxyserver.com:8090",
}
 
# your web scraping target URL 
url = "https://targetwebsite.com/example" 
 
# performing an HTTP request with a proxy 
response = requests.get(url, proxies=proxies)

All you have to do is define a proxies dictionary that specifies the HTTP and HTTPS connections. This variable maps a protocol to the proxy URLs the premium service provides you with. Then, pass it to requests.get() via the proxies parameter. Learn more about proxies in requests.
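
If your provider gives you a pool of proxies instead, a simple rotation approach is to pick one at random on each request. Here's a minimal sketch, assuming placeholder proxy URLs:

import random

import requests

# placeholder pool of rotating proxy servers
proxy_pool = [
	"http://proxy1.example.com:8080",
	"http://proxy2.example.com:8080",
	"http://proxy3.example.com:8080",
]

# pick a random proxy for this request
proxy = random.choice(proxy_pool)
proxies = {"http": proxy, "https": proxy}

response = requests.get("https://targetwebsite.com/example", proxies=proxies)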

Also, it's useful to know ZenRows offers an excellent premium proxy service.

2. HTTP Headers and User-Agent Tracking

Bot detection technologies typically analyze HTTP headers to identify malicious requests. In detail, they keep track of the headers of the last requests received. If a request doesn't contain an expected set of values in some key HTTP headers, the system blocks it.

The most important header these protection systems look at is the User-Agent header. This contains information that identifies the browser, OS, and/or vendor version from which the HTTP request came. If the request doesn't appear to come from a browser, the bot detection system is likely to identify it as coming from a script.

In other words, your web crawlers should always set a valid User-Agent header. Also, the anti-bot system may look at the Referer header. This string contains an absolute or partial address of the web page the request comes from. If this is missing, the system may mark the request as malicious.

You can set custom headers in your requests with Python Requests to bypass bot detection as below:

import requests 
 
# defining the custom headers 
headers = { 
	"User-Agent": "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36", 
	"Referer": "https://targetwebsite.com/page1" 
} 
 
# your web scraping target URL 
url = "https://targetwebsite.com/example" 
 
# performing an HTTP request with custom headers
response = requests.get(url, headers=headers)

Define a headers dictionary that stores your custom HTTP headers. Then, pass it to requests.get() through the headers parameter. Learn more about custom headers in requests.
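
To make your requests even harder to track, you can also rotate the User-Agent on each request. Here's a minimal sketch that picks one from a small pool of real-world User-Agent strings:

import random

import requests

# a small pool of real-world User-Agent strings
user_agents = [
	"Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
	"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
	"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
]

# pick a different User-Agent for each request
headers = {"User-Agent": random.choice(user_agents)}

response = requests.get("https://targetwebsite.com/example", headers=headers)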

3. JavaScript Challenges

A JavaScript challenge is a technique used by bot protection systems to prevent bots from visiting a given web page. A single page can contain hundreds of JS challenges. All users, even legitimate ones, will have to pass them to access the web page.

You can think of a JavaScript challenge as any kind of challenge executed by the browser via JS. A browser that can execute JavaScript will face the challenge automatically. This means that these challenges run transparently; the user might not even be aware of them.

But some JavaScript challenges may take time to run. This results in a delay of several seconds in page loading. In this case, the bot detection system may show a screen like the one below:

Example of a JavaScript challenge screen (Cloudflare Waiting Room)

If you see such a screen on your target website, you now know that it uses a bot detection system. This means that if your scraper doesn't have a JavaScript stack, it won't be able to execute and pass the challenge.

Since web crawlers usually execute server-to-server requests, no browsers are involved. This means no JavaScript. Thus, they can't bypass bot detection. In other words, if you want to pass a JavaScript challenge, you have to use a browser.

So, your scraper app should adopt headless browser technology, such as Selenium or Puppeteer. For example, Selenium launches a real browser with no UI to execute requests. So, when using Selenium, the scraper opens the target web page in a browser. This helps Selenium bypass bot detection.
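
As a minimal sketch based on the Selenium 4 API, this is how you could load a page in headless Chrome (the target URL is a placeholder):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# configure Chrome to run without a UI
options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)

# the browser loads the page and runs its JavaScript,
# so JS challenges execute as they would for a real user
driver.get("https://targetwebsite.com/example")
print(driver.page_source)

driver.quit()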

Now, approaching a JS challenge and solving it isn't easy. Yet, it's possible, even when it comes to Cloudflare and Akamai, which provide the most difficult JavaScript challenges. Learn more on Cloudflare bot protection bypass and how to bypass Akamai. Also, you might be interested in learning how to bypass PerimeterX's bot detection.

4. Activity Analysis

Activity analysis is about collecting and analyzing data to understand whether the current user is a human or a bot. In detail, an activity analysis system continuously tracks and processes user data.

A bot protection system based on activity analysis looks for well-known patterns of human behavior. If it doesn't find enough of them, the system recognizes the user as a bot. Then, it can block it or challenge it with a JS challenge or CAPTCHA.

You can try to defeat these systems by blocking the data collection. First, verify whether your target website collects user data. To do this, examine the XHR section in the Network tab of Chrome DevTools.

A user data collection request

Look for suspicious POST or PATCH requests that trigger when you perform an action on the web page. As in the example above, these requests generally send encoded data. Keep in mind that activity analysis collects user data via JavaScript, so check which JavaScript file performs these requests. You can see it in the "Initiator" column.

Now, block the execution of this file. Note that this approach might not work or even make the situation worse. Anyway, here's how you can do it with Pyppeteer (the Python port of Puppeteer):

import asyncio
from pyppeteer import launch

# defining the request event handler function
async def interceptRequest(request):
	# if the request comes from the user data collection JS file, block it
	if request.url.endswith("79y983fxwwcc.js"):
		await request.abort()
	else:
		await request.continue_()

async def main():
	browser = await launch()
	page = await browser.newPage()

	# activating the request interception on Pyppeteer to block specific requests on this page
	await page.setRequestInterception(True)

	# registering the request event handler
	page.on("request", lambda request: asyncio.ensure_future(interceptRequest(request)))

	# visiting the target page
	await page.goto("https://yourtargetwebsite.com")

	await browser.close()

asyncio.get_event_loop().run_until_complete(main())

This uses Puppeteer's request interception feature to block unwanted data collection requests. It's a good example of what Python has to offer when it comes to web scraping. Now, consider also taking a look at our complete guide on web scraping in Python.

This is just an example. Keep in mind that finding ways to bypass bot detection in this case is very difficult, because these systems use artificial intelligence and machine learning to learn and evolve. Thus, a workaround that skips them might not work for long. At the same time, advanced anti-scraping services such as ZenRows offer solutions to bypass them.

5. CAPTCHAs

A CAPTCHA is a challenge-response test adopted to figure out whether a user is human or not. CAPTCHAs present visitors with tests that are hard for computers to perform but easy for human beings to solve.

Google provides one of the most advanced bot detection systems on the market based on CAPTCHA. This technology is called reCAPTCHA and represents one of the most effective strategies for bot mitigation.

As stated on the official page of the project, over five million sites use it. This makes CAPTCHAs one of the most popular anti-bot protection systems. Also, users have gotten used to them and generally don't mind dealing with them.

Example of a reCAPTCHA CAPTCHA

One of the best ways to pass CAPTCHAs is to rely on a CAPTCHA farm. These companies offer automated services that scrapers can query to get a pool of human workers to solve CAPTCHAs for you. But the fastest and cheapest option is definitely to use a web scraping API that is smart enough to avoid the blocking screens. Find out more on how to automate CAPTCHA solving.
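
Most CAPTCHA farms expose an HTTP API: you submit the CAPTCHA parameters, then poll until a human worker returns the solution. Here's a hedged sketch against a hypothetical solver endpoint (solver.example.com is a placeholder, not a real service, so check your provider's docs for the actual API):

import time

import requests

# submit the CAPTCHA to a hypothetical solving service (placeholder endpoint)
job = requests.post(
	"https://solver.example.com/api/solve",
	json={
		"site_key": "target-site-recaptcha-key",  # placeholder site key
		"page_url": "https://targetwebsite.com/example",
	},
).json()

# poll until a human worker produces the solution token
while True:
	result = requests.get(
		f"https://solver.example.com/api/result/{job['id']}"  # placeholder
	).json()
	if result["status"] == "ready":
		token = result["token"]
		break
	time.sleep(5)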

Conclusion

You've got an overview of what you need to know about bot mitigation, from standard to advanced ways to bypass bot detection. As shown here, there are many ways your scraper can be detected as a bot and blocked. At the same time, there are also several methods and tools to bypass anti-bot protection systems.

What matters is to know these bot detection technologies, so you know what to expect.

Specifically, in this article you've learned:
  • What bot detection is and how it relates to anti-scraping.
  • How bot detection works.
  • The most popular and widely adopted anti-bot detection techniques, with first ideas on how to bypass them in Python.

Thanks for reading! We hope that you found this guide helpful.

Since bypassing all these anti-bot detection systems is very challenging, you can sign up and try ZenRows API for free. ZenRows API provides advanced scraping capabilities that allow you to forget about bot detection problems. Save yourself headaches and many coding hours now.
