14 Ways for Web Scraping Without Getting Blocked

July 12, 2024 · 14 min read

Does your scraper keep getting blocked? It's no wonder. Many anti-bot systems detect web scrapers and block them.

But you're about to forget about this problem forever. Below, you'll find 14 techniques to help your scraper appear human and let you scrape without getting blocked.

Without further ado, let's go.

1. Use Premium Proxies for Web Scraping

A proxy is an intermediary between you and the target website that makes your request seem to come from another location.

If your scraper makes too many requests at a time or tries to access content unavailable in your region, its IP can be blocked by anti-bots. In that case, you need a proxy server to mimic another machine's IP address.

Based on pricing, there are two proxy categories: free and premium proxies. Free proxies have a short lifespan, which makes them unsuitable for real-life web scraping projects. Even if you rotate these proxies, you risk detection because you share limited IPs with many other proxy users. That said, you can still use them to test how to integrate proxies into your web scraper.
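If you just want to test the integration, here's a minimal sketch using Python's Requests library; the proxy address below is a placeholder you'd swap for one from a free proxy list or your provider:

Example
# pip install requests
import requests

# Placeholder proxy address: replace it with one from a free proxy list or your provider
proxy = 'http://203.0.113.10:8080'
proxies = {
    'http': proxy,
    'https': proxy,
}

# Route the request through the proxy and print the IP the target website sees
response = requests.get('https://httpbin.io/ip', proxies=proxies, timeout=10)
print(response.text)

If the printed IP matches the proxy's address instead of yours, the integration works.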

For the best scraping experience, the recommended approach is to use premium web scraping proxies with residential IPs and an auto-rotating feature. Residential IPs offer more stealth and are suitable for production-ready web scrapers because they belong to everyday users on internet service provider (ISP) networks.

When picking a paid service, it's essential to check that it has all the features suitable for web scraping, such as IP auto-rotation and geo-targeting.

Providers like ZenRows offer auto-rotating premium proxies tailored for web scraping and crawling. The same plan gives you access to advanced features, such as flexible geo-targeting, anti-bot and CAPTCHA auto-bypass, and many more. ZenRows proxy is also easy to integrate into any web scraping tool.

[Image: Generating residential proxies with ZenRows]

2. Use a Web Scraping API

While this article includes other helpful bypass methods, they don't guarantee 100% success, especially when dealing with the most difficult anti-bots like Akamai, Cloudflare, and more.

The only way to scrape any website without interruption, regardless of its anti-bot complexity, is to use a web scraping API, such as ZenRows. It automatically bypasses all CAPTCHAs and anti-bot measures under the hood so you can focus on your scraping logic without worrying about getting blocked.

ZenRows also works with any programming language and acts as a headless browser for scraping dynamic websites. You only need a single API call to use it.

Let's show you how it works by scraping the G2 Reviews, a website heavily protected by Cloudflare.

Sign up to open the Request Builder. Paste the target URL in the link box, activate Premium Proxies, and click JS Rendering. Select your favorite programming language (we'll use Python in this case) and select the API connection mode. Then, copy the generated code and paste it into your scraper file.

[Image: Building a scraper with ZenRows' Request Builder]

The generated code should look like this:

Example
# pip install requests
import requests

url = 'https://www.g2.com/products/asana/reviews'
apikey = '<YOUR_ZENROWS_API_KEY>'
params = {
    'url': url,
    'apikey': apikey,
    'js_render': 'true',
    'premium_proxy': 'true',
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)

Integrating the above code into your web scraper allows it to bypass any complex anti-bot at scale.


3. Use Headless Browsers

To avoid being blocked when web scraping, you should interact with the target website like a regular user. One of the best ways to achieve that is to use a headless web browser, which is a web browser that works without a graphical user interface.

Popular headless browsers, including Selenium, Playwright, and Puppeteer, let you emulate user actions, such as:

  • Clicking a button or a link.
  • Scrolling horizontally or vertically to load and scrape content from websites with infinite scrolling.
  • Hovering over an element.
  • Dragging and dropping content across a web page.
  • Resolving alerts.
  • Filling out forms, which is often helpful for running searches or automating logins during scraping.

These features make headless browsers suitable for scraping JavaScript-rendered content. Their ability to execute JavaScript can also help bypass anti-bot checks, such as browser fingerprinting.

Although a headless browser alone is usually insufficient against anti-bots, you can boost its stealth by adding proxies or replacing its User Agent.

Some headless browsers also have dedicated plugins to avoid anti-bot detection. For example, you can fortify Selenium with Selenium Stealth to bypass detection. You can also patch Puppeteer with the Puppeteer Stealth plugin to remove bot-like signals like the automated WebDriver flag. Similarly, a Stealth plugin is available for Playwright.

All the stealth plugins make you appear more like a human, increasing your ability to bypass anti-bot detection during web scraping.
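As an illustration, below is a minimal Selenium Stealth sketch in Python. It assumes you've installed the selenium and selenium-stealth packages and have Chrome available; the option values mirror the plugin's typical usage and can be adjusted.

Example
# pip install selenium selenium-stealth
from selenium import webdriver
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)

# Patch bot-like signals such as navigator.webdriver and missing browser properties
stealth(
    driver,
    languages=['en-US', 'en'],
    vendor='Google Inc.',
    platform='Win32',
    webgl_vendor='Intel Inc.',
    renderer='Intel Iris OpenGL Engine',
    fix_hairline=True,
)

driver.get('https://www.example.com')
print(driver.title)
driver.quit()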

Read our dedicated guides on Selenium Stealth, Puppeteer Stealth, and the Playwright Stealth plugin to learn more about using headless browsers to avoid detection.

4. Set Real Request Headers

Request headers reveal metadata about your request. They're one of the criteria anti-bots check to detect bots: anti-bots expect legitimate request headers, like those sent by a real browser such as Chrome.

However, the default request headers of most web scraping tools don't resemble those of a legitimate browser. They often contain bot-like parameters.

For example, the default headers of Python's Requests library look like the following, with many missing fields and bot-like signals such as the python-requests/2.32.3 User Agent:

Example
{
  "headers": {
    "Accept": [
      "*/*"
    ],
    "Accept-Encoding": [
      "gzip, deflate, br, zstd"
    ],
    "Connection": [
      "keep-alive"
    ],
    "Host": [
      "httpbin.io"
    ],
    "User-Agent": [
      "python-requests/2.32.3"
    ]
  }
}

The above header set is prone to anti-bot detection because it looks bot-like.

Compare it with Chrome's default request headers below. You can check yours by opening https://httpbin.io/headers via your Chrome browser. You'll see that it contains all essential headers, including a valid browser User Agent.

The website you're trying to scrape expects such a legitimate request header set:

Example
{
  "headers": {
    "Accept": [
      "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"
    ],
    "Accept-Encoding": [
      "gzip, deflate, br, zstd"
    ],
    "Accept-Language": [
      "en-US,en;q=0.9"
    ],
    "Connection": [
      "keep-alive"
    ],
    "Host": [
      "httpbin.io"
    ],
    "Sec-Ch-Ua": [
      "\"Not/A)Brand\";v=\"8\", \"Chromium\";v=\"126\", \"Google Chrome\";v=\"126\""
    ],
    "Sec-Ch-Ua-Mobile": [
      "?0"
    ],
    "Sec-Ch-Ua-Platform": [
      "\"Windows\""
    ],
    "Sec-Fetch-Dest": [
      "document"
    ],
    "Sec-Fetch-Mode": [
      "navigate"
    ],
    "Sec-Fetch-Site": [
      "none"
    ],
    "Sec-Fetch-User": [
      "?1"
    ],
    "Upgrade-Insecure-Requests": [
      "1"
    ],
    "User-Agent": [
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
    ]
  }
}

Customizing your scraper with an actual browser header set like the one above is one way to avoid getting blocked while scraping. Most web scraping tools allow you to customize the request headers, so you can set them in your scraping library and have the target website treat your scraper like a regular browser.
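For instance, here's a minimal sketch using Python's Requests library that sends a Chrome-like header set. The values mirror a subset of the Chrome headers above; keep them consistent with the browser version you're imitating (Requests handles Accept-Encoding and Host automatically).

Example
# pip install requests
import requests

# A Chrome-like header set; keep the values consistent with the chosen User Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Language': 'en-US,en;q=0.9',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
}

# httpbin.io/headers echoes the headers it receives so you can verify what you're sending
response = requests.get('https://httpbin.io/headers', headers=headers)
print(response.text)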

Check out our article on the most critical request headers for web scraping to learn how to handle your request headers appropriately. You can also check our tutorial on how to set the User Agent during web scraping to learn more about customizing specific header fields like the User Agent.

5. Outsmart Honeypot Traps

Some websites set up honeypot traps, a mechanism designed to attract bots while being unnoticed by real users. These traps can mislead scrapers into targeting fake data.

Let's learn how to track the honey and avoid falling into its trap!

Most basic honeypot traps are hidden links in the website's HTML. One way to detect them is to watch out for links with CSS properties that make elements invisible.

Below is a basic JavaScript snippet that logs the number of total and visible links on the target website. Open the target website in your browser, right-click anywhere on the page, and select Inspect. Then, go to the Console tab and run this code:

Example
const linkFilter = () => {
    // Collect every anchor element that has an href attribute
    const allLinks = Array.from(document.querySelectorAll('a[href]'));
    console.log(`There are ${allLinks.length} total links`);

    // Keep only the links that are actually rendered and visible
    const filteredLinks = allLinks.filter(link => {
        let linkCss = window.getComputedStyle(link);
        let isDisplayed = linkCss.getPropertyValue('display') !== 'none';
        let isVisible = linkCss.getPropertyValue('visibility') !== 'hidden';
        return isDisplayed && isVisible;
    });

    console.log(`There are ${filteredLinks.length} visible links`);
}

linkFilter();

You might see a result like the following, showing that the number of visible links is lower than the total number of links. It means some links are hidden on that website, indicating a possible honeypot:

Output
There are 13 total links
There are 10 visible links

Honeypot traps usually come with tracking systems designed to fingerprint automated requests, allowing the website to identify similar requests in the future. Consequently, the target website can easily block your scraper from subsequent access to its content, even if it uses different IPs.

To avoid honeypots, your scraper shouldn't follow text links that are the same color as the website's background or are purposely hidden from users. Another fundamental way to avoid honeypot traps is to respect the robots.txt file.

6. Automate CAPTCHA Solving

CAPTCHAs are puzzles used to distinguish between humans and bots. You'll often encounter them when accessing sensitive website sections such as user dashboards, reviews, product pages, etc.

Whether a CAPTCHA appears depends on its type and the website's implementation. While some CAPTCHAs show up whenever a user tries to open the protected page, others only trigger when the challenge detects bot-like activities such as web scraping.

Several CAPTCHA-solving services, such as 2Captcha and AntiCaptcha, can help you remove CAPTCHAs after they appear. These solvers employ real humans and charge per puzzle solved. However, they're usually slow and expensive at scale.

The recommended approach is to bypass the CAPTCHA and prevent it from appearing. To do that, your web scraper needs to imitate human behavior with tools like headless browsers. That said, the most effective and reliable solution is to use paid services like web scraping APIs.

Opt for a scraping API that offers auto-retries without charging for unsuccessful requests. That feature is handy in large-scale web scraping where the CAPTCHA appears multiple times due to heavy traffic. A solid example of such tools is ZenRows.

7. Avoid Fingerprinting

Fingerprinting collects specific hardware and software information, such as the operating system version, browser version, navigator fields, plugins, and more, to create a unique identifier for a machine or a browser.

During fingerprinting, the communication between the client and server involves a transport layer security (TLS) handshake to exchange a packet of encrypted data.

This interaction starts with a "Client Hello" message, which includes supported TLS versions, an optional session ID, and cipher suites, among other settings. The server then responds with a "Server Hello" message detailing the selected settings for that session.

Most bots lack the mechanisms to perform the TLS handshake properly, which results in detection and subsequent blocking.

Fortunately, you can modify your scraper's TLS settings to mimic human behavior. You can also leverage tools like Curl Impersonate, which already replicates some browser TLS layers. Read our article on bypassing TLS fingerprinting during scraping to learn more.
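For example, the curl_cffi package provides Python bindings around curl-impersonate with a Requests-like API. Here's a minimal sketch; the available impersonation targets depend on the package version, so check its documentation.

Example
# pip install curl_cffi
from curl_cffi import requests

# Send the request with a Chrome-like TLS fingerprint instead of Python's default one
response = requests.get(
    'https://httpbin.io/headers',
    impersonate='chrome',  # impersonation target; check the curl_cffi docs for supported browser versions
)
print(response.status_code)
print(response.text)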

You can also follow the tips below to boost your chances of bypassing fingerprinting in general:

  • Don't make the requests at the same time every day. Instead, send them at random times.
  • Change IPs often.
  • Use different request headers, including other User-Agents.
  • Configure your headless browser to use different screen sizes, resolutions, and fonts.
  • Use different headless browsers.

8. Use APIs to Your Advantage

Much of the information that websites display comes from APIs. This data is difficult to scrape because it's usually requested dynamically with JavaScript after the user has performed some actions.

Let's say you're trying to collect data from posts that appear on a website with infinite scroll. In this case, static scraping won't work because the content only loads as the page scrolls, so you'd normally need a headless browser to automate the scrolling.

However, you can still use a static request tool if you reverse engineer the API that supplies the target data. This method also increases your chances of getting past possible anti-bot measures.

It involves finding the XHR (XMLHttpRequest) calls the page makes and replicating them with an HTTP client like Python's Requests or JavaScript's Axios. To find them, open your browser's network inspector and filter for XHR/Fetch requests in the Network tab.

After replicating the API request, you can work with the response directly: most internal APIs return JSON, and if they return HTML fragments instead, you can parse them with BeautifulSoup (for Python) or Cheerio (for JavaScript).
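As a rough sketch, suppose the Network tab reveals that the page loads its posts from a JSON endpoint. The URL, parameters, and response fields below are hypothetical placeholders; replicate the exact ones you see in your browser, including any headers the site requires.

Example
# pip install requests
import requests

# Hypothetical endpoint discovered in the browser's Network tab: replace it with the real one
api_url = 'https://www.example.com/api/posts'
params = {'page': 1, 'per_page': 20}  # pagination parameters copied from the intercepted request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    'Accept': 'application/json',
}

response = requests.get(api_url, params=params, headers=headers)
data = response.json()  # internal APIs usually return JSON, so no HTML parsing is needed

# Hypothetical response structure: adjust the keys to match the actual payload
for post in data.get('posts', []):
    print(post.get('title'))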

The shortcoming of this approach is that the API and target site might share an identical CDN. So, you can still get blocked since the API likely uses the same anti-bot protection as the target site.

To learn more, check out our detailed tutorial on scraping from infinite scrolling using the Requests library.

9. Stop Repeated Failed Attempts

One of the most suspicious signals for a webmaster is a large number of failed requests. They may not initially suspect that a bot is the cause, but they'll start investigating.

However, if they detect that these errors are due to bot activities like web scraping, they'll block your web scraper. This scenario is common in large-scale web scraping, where multiple requests tend to fail due to changes in the website structure or network issues.

There are a couple of ways to prevent it:

  • Use logs: Ensure you log failed scraping attempts and set up notifications to suspend scraping when requests keep failing (see the sketch after this list).
  • Monitor website changes: Check for possible changes in the website layout, such as changes in the class name or IDs. Then, adjust your scraper to accommodate the new website structure.
  • Watch out for server latency: If the server response time suddenly becomes slower than usual, you're probably overloading it. Try reducing your request frequency to avoid getting noticed.
  • Leverage page object model: Adopt automation testing techniques like the page object model to separate element selectors from your scraping logic. This technique lets you quickly locate and adjust the affected elements rather than searching your entire codebase.
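Here's a minimal sketch of the logging tip in Python. The scrape_page() helper and URLs are hypothetical placeholders; the point is to record every failure and suspend the run after too many consecutive ones.

Example
# pip install requests
import logging
import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('scraper')

MAX_CONSECUTIVE_FAILURES = 3  # threshold before suspending the run; tune it to your needs

def scrape_page(url):
    # Hypothetical scraping step: raise an exception on HTTP errors so failures get logged
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text

urls = ['https://www.example.com/page/1', 'https://www.example.com/page/2']
consecutive_failures = 0

for url in urls:
    try:
        scrape_page(url)
        consecutive_failures = 0  # reset the counter after any success
        logger.info('Scraped %s', url)
    except requests.RequestException as error:
        consecutive_failures += 1
        logger.warning('Failed to scrape %s: %s', url, error)
        if consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
            logger.error('Too many consecutive failures; suspending the scraper')
            break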

With these methods, you can avoid triggering bot alarms and lower your chances of getting blocked while scraping.

10. Scrape Google Cache

One strategy for scraping without detection is to scrape a cached version of your target website. While Google no longer supports access to its cached pages, you can still get older copies of web pages from web archives such as the Internet Archive's Wayback Machine.

However, the disadvantage of this method is that cached website versions contain outdated data, which means you may not get the desired results.

Getting cached data from the Internet Archive is easy. Let's use it to get the cached version of a Cloudflare-protected website like the G2 Reviews.

Paste the target URL into the link box and press Enter. You'll see a calendar with several snapshot dates. Select the most recent date and time to get the website's latest cached version.

[Image: Archived version of the G2 Reviews page]

Once that page appears, use your scraper to request the archive's full URL and extract your desired data.
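You can also automate the snapshot lookup with the Internet Archive's availability API and then request the archived copy directly. Here's a minimal sketch; the response fields are the ones the API documents, but verify them for your use case.

Example
# pip install requests
import requests

target_url = 'https://www.g2.com/products/asana/reviews'

# Ask the Wayback Machine for the closest archived snapshot of the target URL
lookup = requests.get(
    'https://archive.org/wayback/available',
    params={'url': target_url},
)
snapshot = lookup.json().get('archived_snapshots', {}).get('closest')

if snapshot:
    # Request the archived copy instead of the live, Cloudflare-protected page
    archived_page = requests.get(snapshot['url'])
    print(snapshot['timestamp'], archived_page.status_code)
else:
    print('No snapshot found for this URL')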

11. Randomize Request Rate

One of the most common consequences of sending multiple requests within a short interval is IP bans, which can be temporary or permanent, depending on the website's security measures. However, sending many requests is unavoidable in large-scale web scraping.

One way to stay safe is to respect the target website's request rules, such as rate limits. Even if you rotate IPs, the security measure may use your request fingerprint to identify and block you once it detects unusual traffic.

Randomizing your request intervals helps you mimic human user behavior, reducing your chances of getting blocked. It involves implementing a random delay using methods like Python's time.sleep or JavaScript's setTimeout.

Another request randomization technique that mimics human behavior is the exponential backoff. It involves pausing your scraping task for a specific period after a failed request; if the request fails again, the wait time increases exponentially with each subsequent failure.
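Here's a minimal Python sketch combining both ideas: a random pause between requests and an exponential backoff with jitter after failures. The URLs and timing values are placeholders to adjust for your target.

Example
# pip install requests
import random
import time
import requests

def fetch_with_backoff(url, max_retries=4, base_delay=2):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException:
            # Exponential backoff: 2s, 4s, 8s, ... plus a little random jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    return None

urls = ['https://www.example.com/page/1', 'https://www.example.com/page/2']

for url in urls:
    fetch_with_backoff(url)
    # Random pause between requests to avoid a machine-like, fixed request rate
    time.sleep(random.uniform(1, 5))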

12. Diversify Crawling Pattern

Most web scraping projects follow a specific pattern to extract data from the same website. This approach can result in anti-bot detection.

For example, clicking the same elements, using the same scroll height, and following a similar navigation pattern for every request put you at risk of getting blocked. The recommended approach is diversifying your crawling pattern to resemble a human interaction.

To do that, you can perform random mouse hovering, click elements randomly, and scroll the page back and forth at various heights before scraping. This behavior helps satisfy anti-bots and background challenges that monitor user activity, making them more likely to treat your scraper as human.
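As an illustration, here's a minimal Playwright sketch in Python that adds random mouse movements and back-and-forth scrolling before reading the page. The target URL, ranges, and delays are placeholders.

Example
# pip install playwright && playwright install chromium
import random
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://www.example.com')

    # Move the mouse to a few random positions to mimic human cursor activity
    for _ in range(3):
        page.mouse.move(random.randint(0, 800), random.randint(0, 600))
        page.wait_for_timeout(random.randint(300, 1200))

    # Scroll down and back up at varying heights instead of one fixed scroll pattern
    for _ in range(3):
        page.mouse.wheel(0, random.randint(400, 1200))   # scroll down
        page.wait_for_timeout(random.randint(500, 1500))
        page.mouse.wheel(0, -random.randint(100, 400))   # scroll back up a bit

    print(page.title())
    browser.close()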

13. Follow Robots.txt Rules

The robots.txt file contains a set of rules for crawling a website. It usually specifies the pages that bots shouldn't crawl or index. It may also include request delay rules to limit requests and prevent server overload.

Checking and following these rules makes your requests more ethical and can prevent the web server from flagging you as a bot. You can check the robots.txt file of any website by appending /robots.txt to the end of its root URL.

For example, open the following URL via your browser to view G2's robots.txt file:

Example
https://www.g2.com/robots.txt

Here's a sample result:

Output
Sitemap: https://www.g2.com/sitemaps/sitemap_content_test.xml
Sitemap: https://www.g2.com/sitemaps/sitemap_index.xml.gz
Sitemap: https://www.g2.com/sitemaps/sitemap_index_compare.xml.gz

User-Agent: *
Disallow: /*?focus_review*
Disallow: /*&focus_review*
Disallow: /*?format=pdf*
Disallow: /*&format=pdf*
Disallow: /*/*/vote*
Disallow: /products/*/take_survey
Disallow: /products/*/leads/*
Disallow: /ahoy/
Disallow: /auth
Disallow: /batch
Disallow: /no_contact_leads/*

# ... omitted for brevity

Ignoring robots.txt rules can result in instant IP bans and subsequent denial of access to the target website.
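To make compliance easier, Python's standard library includes a robots.txt parser, so your scraper can check the rules programmatically before requesting a page. Here's a minimal sketch:

Example
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url('https://www.g2.com/robots.txt')
robots.read()  # download and parse the robots.txt rules

url = 'https://www.g2.com/products/asana/reviews'
user_agent = '*'

# Only scrape the page if the rules allow it for your User Agent
if robots.can_fetch(user_agent, url):
    print('Allowed to scrape:', url)
else:
    print('Disallowed by robots.txt:', url)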

14. Reverse-engineer Anti-bot Systems

If your target website uses Cloudflare, Akamai, DataDome, PerimeterX, or a similar anti-bot service, you probably can't scrape it directly because the anti-bot will block you. However, you can research these systems' current detection methods and outsmart them through reverse engineering.

Cloudflare, for example, uses different bot-detection methods. One of its most prominent tools for blocking bots is the "waiting room". Even as a human, you're probably familiar with this type of screen:

[Image: Cloudflare waiting room screen]

While waiting, JavaScript code runs under the hood to ensure the visitor isn't a bot. The good news is that this code runs on the client side, and you can tamper with it. However, it's obfuscated, and the script keeps changing.

Read our guide on bypassing Cloudflare, where we show different anti-bot bypassing methods, including how to handle the waiting room. But be warned: it's a long and technically challenging process that requires intensive coding. The best way to automatically overcome any anti-bot protection and scrape without limitations is to use a web scraping API such as ZenRows.

Conclusion

You've learned 14 techniques to scrape without getting blocked. Keep in mind that some websites use multiple mechanisms to block you from scraping their content, so combining these methods increases your chances of success.

Let's recap the anti-block tips you've learned in this post:

| Anti-scraper block | Workaround | Supported by ZenRows |
| --- | --- | --- |
| Request limits set by anti-bots | Use premium proxies for web scraping, randomize request rate, stop repeated failed attempts | ✅ |
| Datacenter IPs blocked | Use premium proxies for web scraping | ✅ |
| Cloudflare and other anti-bot systems | Diversify crawling pattern, use APIs to your advantage, reverse-engineer anti-bot systems, scrape Google Cache | ✅ |
| Browser fingerprinting | Use headless browsers and set real request headers | ✅ |
| Honeypot traps | Outsmart honeypot traps by skipping invisible links and circular references | ✅ |
| CAPTCHAs on suspicious requests | Premium proxies, user-like requests, and diversified crawling patterns | ✅ |

Remember that you can still get blocked even after applying these tips. But you can replace all the techniques and tools mentioned in this article with ZenRows, a complete web scraping toolkit that automatically bypasses all blocks, including CAPTCHAs and even the most sophisticated anti-bots.

Try ZenRows for free now without a credit card!

Frequent Questions

How Do I Scrape a Website Without Being Blocked?

Websites employ various techniques to prevent bot traffic from accessing their pages. That's why you're likely to run into firewalls, waiting rooms, JavaScript challenges, and other obstacles while web scraping.

Fortunately, you can minimize the risk of getting blocked by trying the following:

  • Use premium proxies for web scraping.
  • Use a web scraping API.
  • Use headless browsers.
  • Set real request headers.
  • Outsmart honeypot traps.
  • Automate CAPTCHA solving.
  • Avoid fingerprinting.
  • Use APIs to your advantage.
  • Stop repeated failed attempts.
  • Scrape Google cache.
  • Randomize request rate.
  • Diversify crawling pattern.
  • Follow robots.txt rules.
  • Reverse engineer anti-bot systems.

Why Is Web Scraping Not Allowed?

Web scraping is legal but not always allowed because even publicly available data is often protected by copyright law and may require written authorization for commercial use. Luckily, you can scrape data legitimately by following fair use guidelines.

Also, a website may contain data protected by international regulations, like personal and confidential information, that requires explicit consent from the data subjects.

Can a Website Block You From Web Scraping?

Yes, if a website detects your tool is breaching the rules outlined in its robots.txt file or triggers an anti-bot measure, it'll block your scraper.

Some basic precautions to avoid bans are to use proxies with rotating IPs and to ensure your request headers appear natural. Moreover, your scraper should behave like a human as much as possible without sending out too many requests too fast.

Why Do Websites Block Scraping?

Websites have many reasons to prevent bot access to their pages. For example, many companies sell data, so they block scrapers to protect their income. Security measures against hackers and unauthorized data use also tend to ban all bots, including scrapers.

Another concern is that poorly designed scrapers can overload a site's servers with requests, driving up costs and disrupting the user experience.
