Bypass Rate Limit While Web Scraping Like a Pro

October 8, 2022 Β· 6 min read

Web scraping is an essential tool for data collection from the internet. But, sometimes, you get blocked while web scraping because of a rate limit. So how does a rate limit work, and how can your scraper avoid it?

In this article, we'll cover what a rate limit is, how it gets you blocked while web scraping, and what you can do to avoid it.

Let's get started!

What is a Rate Limit in Web Scraping?

Rate limit schema

Well, what does it mean to be rate limited?

The rate limit is a hard cap on the number of requests you can send in a particular time window. In the case of APIs, it's the maximum number of API calls you can make. When a resource is rate limited, you can't send more requests than the defined limit; if you exceed it, you'll get error responses such as:
  • Slow Down, Too Many Requests from This IP Address
  • IP Address Has Reached Rate Limit

WAF service providers such as Cloudflare, Akamai, and Datadome use rate limiting mainly for security. Meanwhile, API providers like Amazon use rate limiting to control the data flow and prevent overuse.

So, how does it work? Let's say you're rate limited on the web server. When your scraper exceeds the rate limit, the web server responds with 429: Too Many Requests.
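When your scraper does get a 429, the polite move is to wait before retrying, honoring the Retry-After header if the server sends one. A minimal sketch of such a helper (the function name and the 60-second default are our own choices, not from any particular library):

```python
def backoff_seconds(status_code, headers, default=60):
    """Seconds to wait before retrying, honoring a Retry-After header.

    Returns 0 when the response wasn't a 429, and falls back to `default`
    when Retry-After is missing or isn't a plain number of seconds
    (servers may also send an HTTP date, which this sketch ignores).
    """
    if status_code != 429:
        return 0
    try:
        return int(headers.get("Retry-After", default))
    except (TypeError, ValueError):
        return default
```

For example, `backoff_seconds(429, {"Retry-After": "120"})` returns 120, while any non-429 status returns 0.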

There are different rate-limiting methods, but we're interested in how rate limits are applied in practice. Here are the popular types:
  • IP Rate Limiting: The most popular approach. It simply ties the number of requests to the user's IP address.
  • API Rate Limits: API providers generally require you to use an API key. The provider can limit the number of API calls you can make in a specific time frame.
  • Geographic Rate Limit: It's also possible to set rate limiting for a specific region and country.
  • Rate Limiting based on User Session: WAF vendors such as Akamai set session cookies and then limit your session's request rate.
  • Rate Limiting based on HTTP requests: Cloudflare supports rate limiting for specific HTTP headers and cookies. It's also possible to implement a rate limit by TLS fingerprints.
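To make the IP-based flavor concrete, here's a toy sketch of what a server roughly does: a fixed-window counter per IP address. The class and parameter names are our own, and real WAFs use far more sophisticated algorithms (sliding windows, token buckets, fingerprinting):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Toy IP-based rate limiter: at most `limit` requests per `window` seconds."""

    def __init__(self, limit=5, window=60):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(list)  # ip -> timestamps of recent requests

    def allow(self, ip):
        now = time.time()
        # keep only the timestamps that fall inside the current window
        self.hits[ip] = [t for t in self.hits[ip] if now - t < self.window]
        if len(self.hits[ip]) >= self.limit:
            return False  # this is where a server would answer 429
        self.hits[ip].append(now)
        return True
```

Every request from the same IP lands in the same bucket, which is exactly why switching IPs (as we'll do below) resets the count.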

Why are APIs Rate Limited?

Many APIs are rate limited so that the web server doesn't get overloaded. API rate limiting also protects the API from malicious bots and DDoS attacks, which can block legitimate users from the service or shut it down completely.

Why do Websites Use Rate Limiting?

The primary reasons are to prevent overloading the server, control the data flow on the server side, and mitigate potential attacks. However, even if your intent isn't malicious, you can still get stuck on a rate limit while web scraping.

Frustrated that your web scrapers are blocked once and again? ZenRows API handles rotating proxies and headless browsers for you.
Try for FREE

Bypassing Rate Limits While Web Scraping

What can you do to avoid rate limits in web scraping? Here are the popular methods and tricks you can use!
  • Using Proxy Servers
  • Using Specific Request Headers
  • Changing HTTP Request Headers

Since IP-based rate limiting is the most popular approach and most platforms implement it, proxy servers are usually the preferred workaround.

Using Specific Headers in Request

There are several headers you can use to spoof your IP address on the backend. You can try these headers when a CDN delivers the content:
  • X-Forwarded-Host: This is a header for identifying the original host requested by the client in the Host HTTP request header. It's possible to get around rate limiting using an extensive list of hostnames. You can pass a URL in this header.
  • X-Forwarded-For: Identifies the originating IP address of a client connecting to a web server through a proxy server. It requires specifying the IP addresses of proxy servers used for the connection. Passing a single IP address or brute-force with a list of IP addresses is possible.
The headers below specify the IP address of the client. However, they might not be implemented in every service. You can change the IP address on these headers and try your luck!
  • X-Client-IP
  • X-Remote-IP
  • X-Remote-Addr
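As a sketch, you could generate a random IP address and set it on all of these headers at once. The helper name below is our own, and whether any of this works depends entirely on how the target's backend and CDN are configured:

```python
import random

def spoofed_headers():
    """Headers that claim a random client IP.

    Purely illustrative: many services ignore or overwrite these headers,
    so treat this as a trick to try, not a guaranteed bypass.
    """
    fake_ip = ".".join(str(random.randint(1, 254)) for _ in range(4))
    return {
        "X-Forwarded-For": fake_ip,
        "X-Client-IP": fake_ip,
        "X-Remote-IP": fake_ip,
        "X-Remote-Addr": fake_ip,
    }
```

You'd then pass the result as the `headers` argument of `requests.get()`.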

Changing HTTP Request Headers

Sending requests with randomized HTTP headers can be used to bypass rate limiting. Many websites and WAF vendors use HTTP headers to block malicious bots. You can randomize headers like User-Agent to avoid rate limiting. That's a best practice for web scraping. You can learn more from our article for the 10 best tips for avoiding getting blocked!
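For instance, a minimal sketch of User-Agent randomization. The strings below are just a small sample pool; in practice you'd use a larger, regularly updated list:

```python
import random

# A small pool of realistic User-Agent strings (swap in your own list).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:105.0) Gecko/20100101 Firefox/105.0",
]

def random_headers():
    """Headers with a random User-Agent, to vary the request fingerprint."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Each call picks a fresh User-Agent, so consecutive requests don't all share the same fingerprint.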

If you are using Python, you can check our guide to avoid blocking in Python like a pro!

Ultimate Solution: Proxy Servers

When you use a proxy server, the request you send is routed through the proxy server, which gets the response and forwards the data to you. You won't need to deal with a rate-limited proxy since you can always switch to another one. Therefore, using a proxy server is the ultimate solution to IP rate limiting.

While many websites list public, free-to-use proxy servers, those are generally blocked by websites and WAF vendors (e.g., Cloudflare), since the vendors can scrape the same lists and blocklist every address on them.

There are two types of proxy servers.
  • Residential Proxies: IP addresses are assigned by an ISP. They're much more reliable than data center proxies since they're attached to an actual address. The primary drawback is the price; high-quality proxy servers cost more.
  • Datacenter Proxies: Datacenter proxies are commercially assigned. They usually don't have a unique address and can be flagged by websites and WAF services. So, they're more affordable but less reliable than residential proxy servers.

Residential proxies are industry standard since they're more reliable. But you can also use a smart rotating proxy, which will automatically use a random residential proxy server every time you send a request.

Proxy Rotation in Python

There's a problem, though: What if that IP gets blocked?

You can build a list of proxy servers and rotate those while sending requests! Since this will let you use multiple IP addresses, you won't get blocked because of IP rate limiting algorithms.

However, if you have an authenticated session, don't change the proxy server on every request: the session can be tracked, and a new IP on each request would give your scraper away. In that case, keep the same proxy for the whole session.
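One way to handle that with requests is to pin a single random proxy to a Session, so every request in the logged-in session leaves from the same IP. The `make_session` helper below is a name of our own choosing:

```python
import random
import requests

def make_session(proxies):
    """Return a requests.Session pinned to one randomly chosen proxy.

    The proxy is picked once and reused for the Session's whole lifetime,
    so an authenticated session keeps a single, consistent IP address.
    """
    session = requests.Session()
    proxy = random.choice(proxies)
    session.proxies = {"http": f"http://{proxy}"}
    return session
```

Rotate by creating a new session (and logging in again) rather than by swapping proxies mid-session.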

Before starting, let's get a proxy list!

Most public proxy servers are short-lived and unreliable. So some of the servers below may not work when you try them.

138.68.60.8:3128 
54.66.104.168:80 
80.48.119.28:8080 
157.100.26.69:80 
198.59.191.234:8080 
198.49.68.80:80 
169.57.1.85:8123 
219.78.228.211:80 
88.215.9.208:80 
130.41.55.190:8080 
88.210.37.28:80 
128.199.202.122:8080 
2.179.154.157:808 
165.154.226.12:80 
200.103.102.18:80

As the first step, save the proxy servers in a text file (proxies.txt). After that, we will read the ip:port pairs from the file and check whether they are working. Then, we will code simple synchronous functions to send requests from random proxy servers.

If your application works at scale, you'll want an asynchronous proxy rotator. You can check our guide on building a full-fledged proxy rotator in Python.

In order to check the proxy servers, we will send requests to httpbin. If the proxy server is not working, we will receive an error in the response.

Now, let's dive in!

First, we need to read the proxies from proxies.txt:

with open("proxies.txt") as f: 
	proxies = f.read().strip().split("\n")

We will send requests to the URL using the requests module. Unfortunately, public proxies generally don't support SSL. So we will use HTTP rather than HTTPS.

The get() function uses the given proxy server to send a GET request to the target URL. When there's an error, the returned status code will be 400 or above: codes in the 400s indicate client-side errors, and 500 and above indicate server-side errors. In that case, the function returns None. Otherwise, it returns the actual response.

import requests 
 
def get(url, proxy): 
	""" 
	Sends a GET request to the given url using the given proxy server. 
	The proxy server is used without SSL, so the URL should be HTTP. 
 
	Args: 
		url - string: HTTP URL to send the GET request to 
		proxy - string: proxy server in the form of {ip}:{port} to use while sending the request 
	Returns: 
		Response of the server if the request was sent successfully. Returns `None` otherwise. 
 
	""" 
	try: 
		# timeout keeps dead proxies from hanging the scraper 
		r = requests.get(url, proxies={"http": f"http://{proxy}"}, timeout=10) 
		if r.status_code < 400: # 4xx means a client-side error, 5xx a server-side one 
			return r 
		else: 
			print(r.status_code) 
	except Exception as e: 
		print(e) 
 
	return None

Now, it's possible to check our proxies. The check_proxy() function returns True or False by sending a GET request to httpbin. We can filter out the available proxy servers from our proxies list.

def check_proxy(proxy): 
	""" 
	Checks the proxy server by sending a GET request to httpbin. 
	Returns False if there is an error from the `get` function 
	""" 
 
	return get("http://httpbin.org/ip", proxy) is not None 
 
available_proxies = list(filter(check_proxy, proxies))

Afterward, we can use the random module to select a random proxy server from the available ones and pass it to the get() function on each request.
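Putting it together, a minimal rotation loop might look like this. The function name is ours, and `send` stands in for the get() helper defined above (it's injected as a parameter so the loop stays easy to test):

```python
import random

def fetch_with_rotation(url, proxies, send, max_tries=3):
    """Try up to max_tries random proxies until one yields a response.

    `send(url, proxy)` is expected to behave like the get() helper above:
    return a response on success and None on failure.
    """
    pool = list(proxies)
    for _ in range(min(max_tries, len(pool))):
        proxy = random.choice(pool)
        response = send(url, proxy)
        if response is not None:
            return response
        pool.remove(proxy)  # drop the dead proxy so we don't pick it again
    return None
```

With the pieces above in place, a call would look like `fetch_with_rotation("http://httpbin.org/ip", available_proxies, get)`.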

Conclusion

Congratulations, you now have a snippet to send requests from random IP addresses!

As a summary of what we covered in the article:
  1. While there are different approaches, the most popular one is IP rate limiting.
  2. Hard limits can sometimes be bypassed by using different HTTP headers.
  3. The most convenient solution is usually proxy rotation.
  4. Public and free proxy servers aren't reliable; residential proxies are the better option.

However, this naive implementation wouldn't be enough for applications at scale; there's a lot of room for improvement. So again, check out our Python guide for scalable proxy rotation!

Implementing a full-fledged proxy rotator for web scraping is difficult. To save you some pain, you can use ZenRows's web scraping API. It includes smart rotating proxies, so you can use rotating proxies automatically by specifying a single URL, like a regular proxy. Just sign up and get your trial API key for free!



Want to keep learning?

We will be sharing all the insights we have learned through the years in the following blog posts. If you don't want to miss a piece and keep learning, we'd be thrilled to have you in our newsletter.

No spam guaranteed. You can unsubscribe at any time.