Web Scraping Without Getting Blocked: 10 Best Tips

September 21, 2022 · 9 min read

Modern websites make web scraping difficult: firewalls, JavaScript challenges, CAPTCHAs, and ban lists are just some of the obstacles you'll face. But don't worry: in this article, you'll learn the best tips for web scraping without getting blocked.

Why is web scraping not allowed?

There are many reasons for getting blocked while scraping a website. Crawlers and scrapers are bots, and websites generally don't like bots because many of them are malicious.

You must also be aware that even publicly available data is usually protected by copyright law. So if you want to use the scraped content commercially, you need written authorization from the copyright holder.

There are some cases where you can use data legally under what is known as fair use: for example, when you quote someone else's work to criticize, comment on, or parody it.

There are all sorts of companies whose main business is selling data, like news, sports results, images, statistics, and so on. They have different plans, prices, and licenses for their content. So, of course, they won't be happy if you try to scrape their copyrighted data, be it from their website or their paying customers.

Another reason websites don't allow web scraping is that it can overload their servers. If a web crawler or scraper is not correctly designed, it can fire thousands of unnecessary requests to the target website's server. This can cause monetary costs to the site owner and also impact the user experience for their human visitors.

A web scraper is a type of bot designed to automate collecting and processing information from the web. Many websites don't consider web scraping harmful in itself, but because there are so many malicious bots, they implement measures that block all of them. That is why even benign web scrapers can get blocked.

In summary, many companies will try to technically block bots to protect their websites from hackers or to prevent unauthorized data use. Luckily, if you want to scrape data legitimately, there are many ways to do it without getting blocked.

1. Set Real Request Headers

To avoid being blocked, your scraper activity should look as similar as possible to a regular user browsing the target website. Web browsers usually send a lot of information that HTTP clients or libraries don't.

Luckily, this is easy to solve. First, go to httpbin.org and check the request headers your current browser sends. In my case, I got this:

HTTP Headers

One of the most important headers for web scraping is User-Agent. This is a string that informs the server about the operating system, vendor, and version of the requesting user agent. Then, using the library of your preference, set these headers so the target website thinks your web scraper is a regular web browser.

For specific instructions, you can check our guides on how to set headers for JavaScript, PHP, and Python.
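
For example, with Axios in Node.js, a minimal sketch could look like this. The header values below are just examples copied from a Chrome session; use the ones httpbin showed for your own browser:

const axios = require('axios');

// Browser-like headers; replace the values with the ones your browser sends.
const browserLikeHeaders = {
	'User-Agent':
		'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
	'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
	'Accept-Language': 'en-US,en;q=0.9',
};

(async () => {
	// httpbin echoes back the headers it received, so you can verify them.
	const { data } = await axios.get('https://httpbin.org/headers', {
		headers: browserLikeHeaders,
	});
	console.log(data.headers);
})();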

Frustrated that your web scrapers get blocked again and again? ZenRows API handles rotating proxies and headless browsers for you.
Try for FREE

2. Use Proxies

If your scraper makes too many requests from an IP address, websites can block that IP. In that case, you can use a proxy server with a different IP. It will act as an intermediary between your web scraping script and the website host.

There are many types of proxies. You can start testing how to integrate proxies with your scraper or crawler by using a free proxy. You can find one in Free Proxy List. But keep in mind that free proxies are usually slow and unreliable. They can also keep track of your activities and connections, self-identify as a proxy, or use IPs on banned lists.

If you are serious about web scraping without getting blocked, there are better alternatives to free proxies. For example, ZenRows offers an excellent premium proxy service. Ideally, you want rotating IPs so your activity seems to come from different users and doesn't look suspicious. This also helps if one IP gets banned or blacklisted, since you can switch to another one.

There's another important distinction between proxies: those with a data center IP and others that use a residential IP. Data center IPs are reliable but are easy to identify and block. Residential IP proxies are harder to detect since they belong to an Internet Service Provider (ISP) that might assign them to an actual user.

How to configure your scraper to use a proxy

Once you get a proxy to use with your scraper, you need to connect the two of them. The exact process depends on the type of scraper you have.

If you are coding your web scraper with Python, we have a detailed guide on how to rotate proxies in Python.

If your web scraper runs on Node.js, you can configure Axios or another HTTP client to use a proxy in the following way:

const axios = require('axios'); 
 
const proxy = { 
	protocol: 'http', 
	host: '202.212.123.44', // Free proxy from the list 
	port: 80, 
}; 
 
(async () => { 
	const { data } = await axios.get('https://httpbin.org/ip', { proxy }); 
 
	console.log(data); 
	// { origin: '202.212.123.44' } 
})();

3. Use premium proxies for web scraping

High-speed, reliable proxies with residential IPs are sometimes referred to as premium proxies. For production crawlers and scrapers, it's common to use this type of proxy.

When selecting a proxy service, it's important to check that it works well for web scraping. If you pay for a high-speed, private proxy that gets its only IP blocked by your target website, you might as well have flushed your money down the toilet.

Companies like ZenRows provide premium proxies tailored for web scraping and web crawling. An additional advantage is that ZenRows works as an API service with proxies already integrated, so you don't have to wire the scraper and the proxy rotator together yourself.
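
If your provider hands you classic proxy credentials instead of an API, plugging them into the earlier Axios setup is straightforward. In this sketch, the host, port, and credentials are placeholders for the ones your proxy service gives you:

const axios = require('axios');

const premiumProxy = {
	protocol: 'http',
	host: 'proxy.example.com', // Placeholder: your provider's gateway
	port: 8080,
	auth: {
		username: 'YOUR_PROXY_USER',
		password: 'YOUR_PROXY_PASSWORD',
	},
};

(async () => {
	const { data } = await axios.get('https://httpbin.org/ip', { proxy: premiumProxy });
	// With a rotating premium proxy, the origin IP changes between runs.
	console.log(data);
})();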

4. Use headless browsers

To avoid being blocked when web scraping, you want your interactions with the target website to look like regular users visiting the URLs. One of the best ways to achieve that is to use a headless web browser. They are real web browsers that work without a graphical user interface.

Most popular web browsers like Google Chrome and Firefox support headless mode. However, even if you use an official browser in headless mode, you need to make its behavior look real. To achieve that, it's common to add some special request headers, like User-Agent.

Selenium and other browser automation suites allow you to combine headless browsers with proxies. This will enable you to hide your IP and decrease the risk of being blocked.
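
As an illustration, here's a minimal sketch using Puppeteer, one popular headless browser library for Node.js; the proxy address and User-Agent string are placeholders:

const puppeteer = require('puppeteer');

(async () => {
	// Launch headless Chrome and route its traffic through a proxy.
	const browser = await puppeteer.launch({
		headless: true,
		args: ['--proxy-server=http://202.212.123.44:80'],
	});
	const page = await browser.newPage();

	// Make the headless session report a regular desktop browser.
	await page.setUserAgent(
		'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
	);

	await page.goto('https://httpbin.org/headers');
	console.log(await page.content());

	await browser.close();
})();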

To learn more about using headless browsers to prevent having your web scraper blocked, check out our detailed guides for JavaScript, PHP, and Python.

5. Outsmart honeypot traps

Some websites will set up honeypot traps. These are mechanisms designed to attract bots while being unnoticed by real users. They can confuse crawlers and scrapers by making them work with fake data. Let's learn how to get the honey without falling into the trap!

Some of the most basic honeypot traps are links that are present in the website's HTML code but are invisible to humans. Make your crawler or scraper detect links whose CSS properties make them invisible, and skip them.

Ideally, your scraper shouldn't follow text links that have the same color as the background or are otherwise hidden from users on purpose. The following basic JavaScript snippet identifies some invisible links in the DOM:

// Runs in the browser (or a headless browser): keeps only links that are
// actually visible to a human visitor.
function filterLinks() { 
	const allLinks = Array.from(document.querySelectorAll('a[href]')); 
	console.log('There are ' + allLinks.length + ' total links'); 

	const visibleLinks = allLinks.filter(link => { 
		const linkCss = window.getComputedStyle(link); 
		const isDisplayed = linkCss.getPropertyValue('display') !== 'none'; 
		const isVisible = linkCss.getPropertyValue('visibility') !== 'hidden'; 
		return isDisplayed && isVisible; 
	}); 

	console.log('There are ' + visibleLinks.length + ' visible links'); 
	return visibleLinks; 
}

Another fundamental way to avoid honeypot traps is to respect the robots.txt file. It's written only for bots and contains instructions about what parts of a website can be crawled or scraped and which should be avoided. You can learn more about this file directly from Google.
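
For instance, here's a simplified sketch that reads the Disallow rules that apply to all user agents and skips matching URLs. It's only an illustration: a full robots.txt parser also handles Allow rules, wildcards, and per-agent groups:

const axios = require('axios');

// Collect the "Disallow" path prefixes listed under "User-agent: *".
async function getDisallowedPaths(origin) {
	const { data } = await axios.get(`${origin}/robots.txt`);
	const disallowed = [];
	let appliesToAllAgents = false;
	for (const rawLine of data.split('\n')) {
		const line = rawLine.trim();
		if (/^user-agent:/i.test(line)) {
			appliesToAllAgents = /user-agent:\s*\*/i.test(line);
		} else if (appliesToAllAgents && /^disallow:/i.test(line)) {
			const path = line.slice(line.indexOf(':') + 1).trim();
			if (path) disallowed.push(path);
		}
	}
	return disallowed;
}

(async () => {
	const disallowed = await getDisallowedPaths('https://www.example.com');
	const url = new URL('https://www.example.com/private/page');
	const blocked = disallowed.some(path => url.pathname.startsWith(path));
	console.log(blocked ? 'Skip this URL' : 'OK to crawl');
})();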

Honeypot traps usually come together with tracking systems designed to fingerprint automated requests. This way, the website can identify similar requests in the future, even if they don't come from the same IP.

6. Avoid fingerprinting

If you change a lot of parameters in your requests but your scraper still gets blocked, you might have been fingerprinted. This means the antibot system uses some mechanism to identify you and block your activity.

To overcome fingerprinting mechanisms, make it more difficult for websites to identify your scraper. Unpredictability is key. For example:
  • Don't make the requests at the same time every day. Instead, send them at random times (see the sketch after this list).
  • Change IPs.
  • Forge and rotate TLS fingerprints. You can learn more about this in our Bypass Cloudflare article.
  • Use different request headers (including different user agents).
  • Configure your headless browser to use different screen sizes, resolutions, and installed fonts.
  • Use different headless browsers.
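
Here's a small sketch combining random delays with rotating User-Agent strings; the user agents and delay range are just illustrative values:

const axios = require('axios');

const userAgents = [
	'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
	'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.6 Safari/605.1.15',
	'Mozilla/5.0 (X11; Linux x86_64; rv:104.0) Gecko/20100101 Firefox/104.0',
];

// Wait a random amount of time between min and max milliseconds.
const randomDelay = (min, max) =>
	new Promise(resolve => setTimeout(resolve, min + Math.random() * (max - min)));

(async () => {
	const urls = ['https://httpbin.org/headers', 'https://httpbin.org/ip'];
	for (const url of urls) {
		// Pick a random user agent for every request.
		const userAgent = userAgents[Math.floor(Math.random() * userAgents.length)];
		const { data } = await axios.get(url, { headers: { 'User-Agent': userAgent } });
		console.log(data);
		await randomDelay(2000, 10000); // Pause 2-10 seconds between requests.
	}
})();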

7. Bypass antibot systems

If your target website uses Cloudflare, Akamai, or a similar antibot service, you will probably find that you can't scrape its URLs because your requests get blocked. Bypassing these systems is hard, but not impossible.

Cloudflare, for example, uses different bot-detection methods. One of its essential tools for blocking bots is the "waiting room". Even if you are not a bot, you should be familiar with this type of screen:

Cloudflare waiting room

While you wait, some JavaScript code runs checks to ensure the visitor is not a bot. The good news is that this code runs on the client side, so we can tamper with it. The bad news is that it's obfuscated and not always the same script.

We have a comprehensive guide on How to Bypass Cloudflare, but be warned: it's a long and difficult process. If you are being blocked by Cloudflare, the easiest way to bypass this type of protection is to use a service like ZenRows, which is designed to overcome antibot systems.
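
If you'd rather detect the block and back off instead of fighting it, here's a rough sketch. The status codes and Server header check are heuristics for Cloudflare-style challenges and may differ for other providers:

const axios = require('axios');

async function fetchOrBackOff(url) {
	const response = await axios.get(url, {
		validateStatus: () => true, // Don't throw on 4xx/5xx; inspect them instead.
	});

	// Challenge pages from Cloudflare often come back as 403 or 503
	// with a "server: cloudflare" header.
	const server = (response.headers['server'] || '').toLowerCase();
	const challenged = [403, 503].includes(response.status) && server.includes('cloudflare');

	if (challenged) {
		console.log(`Challenge detected at ${url}, backing off instead of retrying blindly`);
		return null;
	}
	return response.data;
}

(async () => {
	const html = await fetchOrBackOff('https://www.example.com/');
	if (html) console.log('Got the page content');
})();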

8. Automate CAPTCHA solving

CAPTCHAs are one of the most difficult obstacles when trying to scrape a URL. They are challenges specifically designed to tell humans and bots apart, and they are usually placed in sections with sensitive information. Consider whether you can get the information you want while leaving out the sections protected by a CAPTCHA.

It's tough to code anti-CAPTCHA solutions, but some companies offer to solve CAPTCHAs for you. They employ real humans to solve the CAPTCHAs and charge a certain amount of money per solved CAPTCHA. Some examples are Anti-Captcha and 2Captcha.
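
Most of these services follow a similar submit-then-poll flow. The sketch below is only illustrative: the endpoints, parameter names, and response fields are hypothetical placeholders, so follow your provider's actual API documentation:

const axios = require('axios');

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function solveCaptcha(siteKey, pageUrl) {
	// 1. Submit the CAPTCHA task to the solving service (hypothetical endpoint).
	const { data: task } = await axios.post('https://captcha-solver.example.com/tasks', {
		apiKey: 'YOUR_API_KEY',
		siteKey,
		pageUrl,
	});

	// 2. Poll until the service returns the solution token (hypothetical response shape).
	for (let attempt = 0; attempt < 24; attempt++) {
		await sleep(5000);
		const { data: result } = await axios.get(
			`https://captcha-solver.example.com/tasks/${task.id}`
		);
		if (result.status === 'ready') return result.solution;
	}
	throw new Error('CAPTCHA not solved in time');
}

You would then inject the returned token into the form or request that the CAPTCHA protects.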

CAPTCHAs are slow and expensive to solve. Wouldn't it be better to avoid them altogether? ZenRows can help you if you are after content protected by an antibot system that might show you a CAPTCHA. It will get the content without any action on your side.

9. Use APIs to your advantage

Currently, much of the information that websites display comes from APIs. This data is difficult to scrape because it's usually requested dynamically with JavaScript after the user has performed some action.

Let's say you're trying to collect data from posts that appear on a website with "infinite scroll". In this case, static web scraping is not the best option because you'll only ever get the results from the first page.

For this kind of website, you can use headless browsers or a scraping service that lets you configure user actions. ZenRows provides a web scraping API to do just that without complicated headless browser configuration.

Alternatively, you can reverse engineer the APIs of the website. The first step is to use the network inspector of your preferred browser and check the XHR (XMLHttpRequest) requests that the page is making.

Network in DevTools

Then you should check the parameters sent, for example, page numbers, dates, or reference IDs. Sometimes these parameters use simple encodings to prevent the APIs from being used by third parties. In that case, you can find out how to send the appropriate parameters with trial and error.

Other times, you will have to obtain authentication parameters with real users and browsers, and send this information to the server in the form of headers or cookies. In any case, you will need to study carefully the requests the website makes to its API.

Detailed request in DevTools

Sometimes, figuring out how a private API works can be a complex task, but if you manage to do it, the parsing job becomes much simpler: you get the information already organized and structured, usually in JSON format.
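
As an illustration, suppose the network inspector revealed a paginated JSON endpoint; the URL and its page parameter below are hypothetical stand-ins for whatever you find on your target site:

const axios = require('axios');

async function fetchAllPosts() {
	const posts = [];
	for (let page = 1; page <= 5; page++) {
		const { data } = await axios.get('https://www.example.com/api/posts', {
			params: { page },
			// Reuse any headers or cookies you captured from a real browser session.
			headers: { Accept: 'application/json' },
		});
		if (!data.items || data.items.length === 0) break; // No more pages.
		posts.push(...data.items);
	}
	return posts; // Already structured JSON: no HTML parsing needed.
}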

10. Stop on repeated failed attempts

One of the most suspicious situations for a webmaster is to see a large number of failed requests. Initially, they may not even suspect a bot is the cause. They might think there's something wrong with their site and start investigating these failed requests.

If they find out that these errors happen because a bot is trying to scrape their data, they will try to block your web scraper. To minimize this risk, make your scraper detect failed attempts, log them, notify you, and suspend scraping.

These errors usually happen because there have been changes to the website. In this case, you will need to adjust your scraper to accommodate the new website structure before continuing with data scraping. This way, you will avoid triggering alarms that can lead to being blocked.
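
A minimal sketch of the detect-log-notify-suspend idea, with the failure threshold and the notification step left as placeholders for your own setup:

const axios = require('axios');

const MAX_CONSECUTIVE_FAILURES = 5;
let consecutiveFailures = 0;

async function scrapeUrl(url) {
	try {
		const { data } = await axios.get(url);
		consecutiveFailures = 0; // Reset the counter on every success.
		return data;
	} catch (error) {
		consecutiveFailures++;
		console.error(`Failed (${consecutiveFailures}x in a row): ${url} - ${error.message}`);

		if (consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
			// Notify yourself (email, Slack, etc.) here, then stop before raising alarms.
			throw new Error('Too many consecutive failures, suspending the scraper');
		}
		return null;
	}
}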

Conclusion

As you can see, some websites use multiple mechanisms to block you from scraping their content. Using only one technique to avoid being blocked might not be enough for successful scraping. Let's recap the anti-block tips we saw in this post:

| Anti-scraper block | Workaround | Supported by ZenRows |
| --- | --- | --- |
| Request limits by IP | Rotating proxies | ✅ |
| Data center IPs blocked | Premium proxies | ✅ |
| Cloudflare and other antibot systems | Avoid suspicious requests and reverse-engineer the JavaScript challenge | ✅ |
| Browser fingerprinting | Rotating headless browsers | ✅ |
| Honeypot traps | Skipping invisible links and circular references | ✅ |
| CAPTCHAs on suspicious requests | Premium proxies and user-like requests | ✅ |
| Always-on CAPTCHAs | CAPTCHA-solving tools and services | ❌ |

Remember that even after applying all these tips, you can still be blocked. Don't waste time fighting every block yourself! At ZenRows, we use all the anti-block techniques discussed here and more. That is why our web scraping API can handle thousands of requests per second without being blocked. On top of that, we can even create custom scrapers suited to your needs. You can try it for free today.



Want to keep learning?

We will be sharing all the insights we have learned through the years in upcoming blog posts. If you don't want to miss a piece and want to keep learning, we'd be thrilled to have you in our newsletter.

No spam guaranteed. You can unsubscribe at any time.