Web Scraping in Python: Avoid Detection Like a Ninja

Picture of Ander
By Ander Β· September 14, 2022 Β· 15 min read Β· Twitter
Ander is a web developer who has been working for several startups for more than 10 years, having worked with a wide variety of sectors and technologies. Engineer turned entrepreneur.

Scraping should be about extracting content from HTML. It sounds simple but has many obstacles. The first one is to obtain the said HTML. For that, we'll use Python to avoid detection.

If you've been there, you know it might require bypassing antibot systems. Web scraping without getting blocked using Python - or any other tool - is not a walk in the park.

Websites tend to protect their data and access. There are many possible actions a defensive system could take. Stay with us to learn how to mitigate their impact. Or directly bypass bot detection using Python Requests or Playwright.

Note: when testing at scale, never use your home IP directly. A small mistake or slip and you will get banned.

Prerequisites

For the code to work, you will need python3 installed. Some systems have it pre-installed. After that, install all the necessary libraries by running pip install.

pip install requests playwright 
npx playwright install

IP Rate Limit

The most basic security system is to ban or throttle requests from the same IP. It means that a regular user would not request a hundred pages in a few seconds, so they proceed to tag that connection as dangerous.

import requests 
 
response = requests.get('http://httpbin.org/ip') 
print(response.json()['origin']) 
# xyz.84.7.83

IP rate limits work similarly to API rate limits, but there is usually no public information about them. We cannot know for sure how many requests we can do safely.

Our Internet Service Provider assigns us our IP, which we cannot affect or mask. The solution is to change it. We cannot modify a machine's IP, but we can use different machines. Datacenters might have different IPs, although that is not a real solution.

Proxies are. They take an incoming request and relay it to the final destination. It does no processing there. But that is enough to mask our IP and bypass the block since the target website will see the proxy's IP.

Rotating Proxies

There are Free Proxies even though we do not recommend them. They might work for testing but are not reliable. We can use some of those for testing, as we'll see in some examples.

Now we have a different IP, and our home connection is safe and sound. Good. But what if they block the proxy's IP? We are back to the initial position.

Frustrated that your web scrapers are blocked once and again? ZenRows API handles rotating proxies and headless browsers for you.
Try for FREE

We won't go into detail about free proxies. Just use the next one on the list. Change them frequently since their lifespan is usually short.

Paid proxy services, on the other hand, offer IP Rotation. Our service would work the same, but the website would see a different IP. In some cases, they rotate for every request or every few minutes. In any case, they are much harder to ban. And when it happens, we'll get a new IP after a short time.

import requests 
 
proxies = {'http': 'http://190.64.18.177:80'} 
response = requests.get('http://httpbin.org/ip', proxies=proxies) 
print(response.json()['origin']) # 190.64.18.162

We know about these; it means bot detection services also know about them. Some big companies will block traffic from known proxy IPs or datacenters. For those cases, there is a higher proxy level: Residential.

More expensive and sometimes bandwidth-limited, residential proxies offer us IPs used by regular people. That implies that our mobile provider could assign us that IP tomorrow. Or a friend had it yesterday. They are indistinguishable from actual final users.

We can scrape whatever we want, right? The cheaper ones by default, the expensive ones when necessary. No, not there yet. We only passed the first hurdle, with some more to come. We must look like legitimate users to avoid being tagged as a bot or scraper.

User-Agent Header

The next step would be to check our request headers. The most known one is User-Agent (UA for short), but there are many more. UA follows a format we'll see later, and many software tools have their own, for example, GoogleBot. Here is what the target website will receive if we directly use Python Requests or cURL.

import requests 
 
response = requests.get('http://httpbin.org/headers') 
print(response.json()['headers']['User-Agent']) 
# python-requests/2.25.1
curl http://httpbin.org/headers # { ... "User-Agent": "curl/7.74.0" ... }

Many sites won't check UA, but this is a huge red flag for the ones that do this. We'll have to fake it. Luckily, most libraries allow custom headers. Following the example using Requests:

import requests 
 
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"} 
response = requests.get('http://httpbin.org/headers', headers=headers) 
print(response.json()['headers']['User-Agent']) # Mozilla/5.0 ...

To get your current user agent, visit httpbin - just as the code snippet is doing - and copy it. Requesting all the URLs with the same UA might also trigger some alerts, making the solution a bit more complicated.

Ideally, we would have all the current possible User-Agents and rotate them as we did with the IPs. Since that is nearly impossible, we can at least have a few. There are lists of User Agents available for us to choose from.

import requests 
import random 
 
user_agents = [ 
	'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36', 
	'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36', 
	'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36', 
	'Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148', 
	'Mozilla/5.0 (Linux; Android 11; SM-G960U) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.72 Mobile Safari/537.36' 
] 
user_agent = random.choice(user_agents) 
headers = {'User-Agent': user_agent} 
response = requests.get('https://httpbin.org/headers', headers=headers) 
print(response.json()['headers']['User-Agent']) 
# Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) ...

Keep in mind that browsers change versions quite often, and this list can be obsolete in a few months. If we are to use User-Agent rotation, a reliable source is essential. We can do it by hand or use a service provider.

We are a step closer, but there is still one flaw in the headers: antibot systems also know this trick and check other headers along with the User-Agent.

Full Set of Headers

Each browser, or even version, sends different headers. Check Chrome and Firefox in action:

{ 
	"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 
	"Accept-Encoding": "gzip, deflate, br", 
	"Accept-Language": "en-US,en;q=0.9", 
	"Host": "httpbin.org", 
	"Sec-Ch-Ua": "\"Chromium\";v=\"92\", \" Not A;Brand\";v=\"99\", \"Google Chrome\";v=\"92\"", 
	"Sec-Ch-Ua-Mobile": "?0", 
	"Sec-Fetch-Dest": "document", 
	"Sec-Fetch-Mode": "navigate", 
	"Sec-Fetch-Site": "none", 
	"Sec-Fetch-User": "?1", 
	"Upgrade-Insecure-Requests": "1", 
	"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36" 
}
{ 
	"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
	"Accept-Encoding": "gzip, deflate, br", 
	"Accept-Language": "en-US,en;q=0.5", 
	"Host": "httpbin.org", 
	"Sec-Fetch-Dest": "document", 
	"Sec-Fetch-Mode": "navigate", 
	"Sec-Fetch-Site": "none", 
	"Sec-Fetch-User": "?1", 
	"Upgrade-Insecure-Requests": "1", 
	"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0" 
}

It means what you think it means. The previous array with 5 User Agents is incomplete. We need an array with a complete set of headers per User-Agent. For brevity, we will show a list with one item. It is already long enough.

In this case, copying the result from httpbin is not enough. The ideal would be to copy it directly from the source. The easiest way to do it is from the Firefox or Chrome DevTools - or equivalent in your browser. Go to the Network tab, visit the target website, right-click on the request and copy as cURL. Then convert curl syntax to Python and paste the headers into the list.

import requests 
import random 
 
headers_list = [{ 
	'authority': 'httpbin.org', 
	'cache-control': 'max-age=0', 
	'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"', 
	'sec-ch-ua-mobile': '?0', 
	'upgrade-insecure-requests': '1', 
	'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36', 
	'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9', 
	'sec-fetch-site': 'none', 
	'sec-fetch-mode': 'navigate', 
	'sec-fetch-user': '?1', 
	'sec-fetch-dest': 'document', 
	'accept-language': 'en-US,en;q=0.9', 
} # , {...} 
] 
headers = random.choice(headers_list) 
response = requests.get('https://httpbin.org/headers', headers=headers) 
print(response.json()['headers'])

We could add a Referer header for extra security - such as Google or an internal page from the same website. It would mask the fact that we always request URLs directly without interaction. But be careful since adding a referrer would change more headers. You don't want your Python Request script blocked by mistakes like that.

Cookies

We ignored the cookies above since they deserve a separate section. Cookies can help you bypass some antibots or get your requests blocked. They are a powerful tool that we need to understand correctly.

Cookies can track a user session and remember that user after login, for example. Websites assign each new user a cookie session. There are many ways to do it, but we'll try to simplify. Then the user's browser will send that cookie in each request, tracking the user activity.

How is that a problem? We are using rotating proxies, so each request might have a different IP from different regions or countries. Antibots can see that pattern and block it since it's not a natural way for users to browse.

On the other hand, once bypassed the antibot solution, it will send valuable cookies. Defensive systems won't check twice if the session looks legit. Check out how to bypass Cloudflare for more info.

Will cookies help our Python Requests scripts to avoid bot detection? Or will they hurt us and get us blocked? The answer lies in our implementation.

For simple cases, not sending cookies might work best. There is no need to maintain a session.

For more advanced cases and antibot software, session cookies might be the only way to reach and scrape the final content. Always taking into account that the session requests and the IP must match.

The same happens if we want content generated in the browser after XHR calls. We will need to use a headless browser. After the initial load, the Javascript will try to get some content using an XHR call. We cannot do that call without cookies on a protected site.

How will we use headless browsers, specifically Playwright, to avoid detection? Keep on reading!

Headless Browsers

Some antibot systems will only show the content after the browser solves a Javascript challenge. And we can't use Python Requests to simulate browser behavior like that. We need a browser with Javascript execution to run and pass the challenge.

Selenium, Puppeteer, and Playwright are the most used and known libraries. Avoiding them - for performance reasons - would be preferable, and they will make scraping slower. But sometimes, there is no alternative.

We'll see how to run Playwright. The snippet below shows a simple script visiting a page that prints the sent headers. The output only shows the User-Agent, but since it is a real browser, the headers will include the entire set (Accept, Accept-Encoding, etcetera).

import json 
from playwright.sync_api import sync_playwright 
 
with sync_playwright() as p: 
	for browser_type in [p.chromium, p.firefox, p.webkit]: 
		browser = browser_type.launch() 
		page = browser.new_page() 
		page.goto('https://httpbin.org/headers') 
		jsonContent = json.loads(page.inner_text('pre')) 
		print(jsonContent['headers']['User-Agent']) 
		browser.close() 
 
# Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/93.0.4576.0 Safari/537.36 
# Mozilla/5.0 (X11; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0 
# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15

This approach comes with its own problem: take a look a the User-Agents. The Chromium one includes HeadlessChrome, which will tell the target website, well, that it is a headless browser. They might act upon that.

Back to the headers section: we can add custom headers that will overwrite the default ones. Replace the line in the previous snippet with this one and paste a valid User-Agent:

browser.new_page(extra_http_headers={'User-Agent': '...'})

That is just an entry-level with headless browsers. Headless detection is a field in itself, and many people are working on it. Some to detect it, some to avoid being blocked. As an example, you can visit pixelscan with an actual browser and a headless one. To be deemed "consistent," you'll need to work hard.

Look at the screenshot below, taken when visiting pixelscan with Playwright. See the UA? The one we fake is all right, but they can detect that we are lying by checking the navigator Javascript API. Pixelscan Inconsistent

We can pass user_agent, and playwright will set the user agent in javascript and the header for us. Nice!

page = browser.new_page(user_agent='...')

For more advanced cases, you can easily add Playwright stealth to your scripts and make detection harder. It handles inconsistencies between headers and browser Javascript APIs, among other things.

In summary, having 100% coverage is complex, but you won't need it most of the time. Sites can always do some more complex checks: WebGL, touch events, or battery status.

You won't need those extra features unless trying to scrape a website that requires bypassing an antibot solution, like Akamai. And for those cases, that extra effort will be mandatory. And demanding, to be honest.

Geographic Limits or Geoblocking

Have you ever tried to watch CNN from outside the US? CNN Geoblocked

That's called geoblocking. Only connections from inside the US can watch CNN live. To bypass that, we could use a Virtual Private Network (VPN). We can then browse as usual, but the website will see a local IP thanks to the VPN.

The same can happen when scraping websites with geoblocking. There is an equivalent for proxies: geolocated proxies. Some proxy providers allow us to choose from a list of countries. With that activated, we will only get local IPs from the US, for example.

Behavioral patterns

Blocking IPs and User-Agents is not enough these days. They become unmanageable and stale in hours, if not minutes. As long as we perform requests with clean IPs and real-world User-Agents, we are mainly safe. There are more factors involved, but most requests should be valid.

However, most modern antibot software use machine learning and behavioral patterns, not just static markers (IP, UA, geolocation). That means we would be detected if we always performed the same actions in the same order.
  1. Go to the homepage
  2. Click on the "Shop" button
  3. Scroll down
  4. Go to page 2
  5. ...
After a few days, launching the same script could result in every request being blocked. Many people can perform those same actions, but bots have something that makes them obvious: speed. With software, we would execute every step sequentially, while an actual user would take a second, then click, scroll down slowly using the mouse wheel, move the mouse to the link and click.

Maybe there is no need to fake all that, but be aware of the possible problems and know how to face them.

We have to think what is what we want. Maybe we don't need that first request since we only require the second page. We could use that as an entry point, not the homepage. And save one request. It can scale to hundreds of URLs per domain. No need to visit every page in order, scroll down, click on the next page and start again.

To scrape search results, once we recognize the URL pattern for pagination, we only need two data points: the number of items and items per page. And most of the time, that info is present on the first page or request.

import requests 
from bs4 import BeautifulSoup 
 
response = requests.get('https://scrapeme.live/shop/') 
soup = BeautifulSoup(response.content, 'html.parser') 
pages = soup.select('.woocommerce-pagination a.page-numbers:not(.next)') 
print(pages[0].get('href')) # https://scrapeme.live/shop/page/2/ 
print(pages[-1].get('href')) # https://scrapeme.live/shop/page/48/
One request shows us that there are 48 pages. We can now queue them. Mixing with the other techniques, we would scrape the content from this page and add the remaining 47. To scrape them bypassing antibot systems, we could:
  • Shuffle the page order to avoid pattern detection
  • Use different IPs and User-Agent, so each request looks like a new one
  • Add delays between some of the calls
  • Use Google as a referrer randomly

We could write some snippet mixing all these, but the best option in real life is to use a tool with it all like Scrapy, pyspider, node-crawler (Node.js), or Colly (Go). The idea being the snippets is to understand each problem on its own. But for large-scale, real-life projects, handling everything on our own would be too complicated.

Captcha

Even the best-prepared request can get caught and shown a captcha. Nowadays, solving captchas is achievable - Anti-Captcha and 2Captcha - but a waste of time and money. The best solution is to avoid them. The second best is to forget about that request and retry.

The exception is obvious: sites that always show a Captcha on the first visit. We have to solve it if there is no way to bypass it. And then, use the session cookies to avoid being challenged again.

It might sound counterintuitive, but waiting for a second and retrying the same request with a different IP and set of headers will be faster than solving a captcha. Try it yourself and tell us about the experience πŸ˜‰.

Login Wall or Paywall

Some websites prefer to show or redirect users to a login page instead of a captcha. After a few visits, Instagram will redirect anonymous users, and Medium will show a paywall.

As with the captchas, a solution can be to tag the IP as dirty, forget the request and try again. Libraries usually follow redirects by default but offer an option not to allow them. Ideally, we would only disallow redirects to log in, sign up, or specific pages, not all of them.

In this example, we will follow redirects until its location contains accounts/login. We would queue that page with a delay and try again in an actual project.

import sys 
import requests 
 
session = requests.session() 
response = session.get('http://instagram.com', allow_redirects=False) 
print(response.status_code, response.headers.get('location')) 
for redirect in session.resolve_redirects(response, response.request): 
	location = redirect.headers.get('location') 
	print(redirect.status_code, location) 
	if location and 'accounts/login' in location: 
		sys.exit() # no need to exit, return would be enough 
# 301 https://instagram.com/ 
# 301 https://www.instagram.com/ 
# 302 https://www.instagram.com/accounts/login/

Be a Good Internet Citizen

We can use several websites for testing, but be careful when doing the same at scale. Try to be a good internet citizen and don't cause -small- DDoS. Limit your interactions per domain. Amazon can handle thousands of requests per second. But not all target sites will.

We are always talking about "read-only" browsing mode. Access a page and read its contents. Never submit a form or perform active actions with malicious intent.

If we take a more active approach, several other factors would matter: writing speed, mouse movement, navigation without clicking, browsing many pages simultaneously, etcetera. Bot prevention software is specifically aggressive with active actions. As it should for security reasons.

We won't discuss this part, but these actions will give them new reasons to block requests. Again, good citizens don't try massive logins. We are talking about scraping, not malicious activities.

Sometimes websites make data collection harder, maybe not on purpose. But with modern frontend tools, CSS classes could change daily, ruining thoroughly prepared scripts. For more details, read our previous entry on how to scrape data in Python.

Conclusion

We'd like you to remember the low-hanging fruits:
  1. IP rotating proxies
  2. Residential proxies for challenging targets
  3. Full set headers, including User-Agent
  4. Bypass bot detection with Playwright when Javascript challenge is required - maybe adding the stealth module
  5. Avoid patterns that might tag you as a bot

There are many more, and probably more we didn't cover. But with these techniques, you should be able to crawl and scrape at scale. After all, web scraping without getting blocked with python is possible if you know how.

Contact us if you know more website scraping tricks or have doubts about applying them.

Remember, we covered scraping and avoiding being blocked, but there is much more: crawling, converting and storing the content, scaling the infrastructure, and more. Stay tuned!

Do not forget to take a look at the rest of the posts in this series.
+ From Zero to Hero (1/4)
+ Avoid Detection Like a Ninja (2/4)
+ Crawling from Scratch (3/4)
+ Scaling to Distributed Crawling (4/4)

Did you find the content helpful? Spread the word and share it on Twitter, LinkedIn, or Facebook.

Frustrated that your web scrapers are blocked once and again? ZenRows API handles rotating proxies and headless browsers for you.
Try for FREE

Want to keep learning?

We will be sharing all the insights we have learned through the years in the following blog posts. If you don't want to miss a piece and keep learning, we'd be thrilled to have us in our newsletter.

No spam guaranteed. You can unsubscribe at any time.