Web Scraping: Intercepting XHR Requests

By Ander Rodríguez · October 27, 2021 · 5 min read · Twitter
Ander is a web developer who has been working for several startups for more than 10 years, having worked with a wide variety of sectors and technologies. Engineer turned entrepreneur.

Have you ever tried scraping AJAX websites? Sites full of Javascript and XHR calls? Deciphering tons of nested CSS selectors? Or worse, selectors that change daily? Maybe you won't need to do that ever again. Keep on reading; XHR scraping might prove to be your ultimate solution!

Prerequisites

For the code to work, you will need Python 3 installed. Some systems have it pre-installed. After that, install Playwright and the browser binaries for Chromium, Firefox, and WebKit.

pip install playwright 
playwright install

Intercept Responses

As we saw in a previous blog post about blocking resources, headless browsers allow request and response inspection. We will use Playwright in Python for the demo, but it can also be done in Javascript or with Puppeteer.

We can quickly inspect all the responses on a page. As we can see below, the response parameter contains the status, URL, and content itself. And that's what we'll be using instead of directly scraping content in the HTML using CSS selectors.

page.on("response", lambda response: print( 
	"<<", response.status, response.url))

Use case: auction.com

Our first example will be auction.com. You might need proxies or a VPN since it blocks visitors from outside the countries it operates in. In any case, scraping from your own IP might be a problem since they will eventually ban it. Check out how to avoid blocking if you run into any issues.

Here is a basic example of loading the page using Playwright while logging all the responses.

from playwright.sync_api import sync_playwright 
 
url = "https://www.auction.com/residential/ca/" 
 
with sync_playwright() as p: 
	browser = p.chromium.launch() 
	page = browser.new_page() 
 
	page.on("response", lambda response: print( 
		"<<", response.status, response.url)) 
	page.goto(url, wait_until="networkidle", timeout=90000) 
 
	print(page.content()) 
 
	page.context.close() 
	browser.close()

auction.com will load an HTML skeleton without the content we are after (house prices or auction dates). They will then load several resources such as images, CSS, fonts, and Javascript. If we wanted to save some bandwidth, we could filter out some of those. For now, we're going to focus on the attractive parts.

Auction.com screenshot with DevTools open

As we can see in the network tab, almost all relevant content comes from an XHR call to an assets endpoint. Ignoring the rest, we can inspect that call by checking that the response URL contains this string: if ("v1/search/assets?" in response.url).

Frustrated that your scrapers get blocked again and again? ZenRows API handles rotating proxies and headless browsers for you.
Try for FREE

There is a size and time problem: the page loads tracking scripts and a map, which amounts to more than a minute of loading time (using proxies) and over 130 requests 😮. We could do better by blocking certain domains and resources. In our tests, we got it down to under 20 seconds with only 7 resources loaded. We will leave that as an exercise for you 😉

<< 407 https://www.auction.com/residential/ca/ 
<< 200 https://www.auction.com/residential/ca/ 
<< 200 https://cdn.auction.com/residential/page-assets/styles.d5079a39f6.prod.css 
<< 200 https://cdn.auction.com/residential/page-assets/framework.b3b944740c.prod.js 
<< 200 https://cdn.cookielaw.org/scripttemplates/otSDKStub.js 
<< 200 https://static.hotjar.com/c/hotjar-45084.js?sv=5 
<< 200 https://adc-tenbox-prod.imgix.net/resi/propertyImages/no_image_available.v1.jpg 
<< 200 https://cdn.mlhdocs.com/rcp_files/auctions/E-19200/photos/thumbnails/2985798-1-G_bigThumb.jpg 
# ...
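As a starting point for that exercise, here is one way to sketch it with Playwright's page.route, which lets a handler abort or continue each request before it is sent. The resource types and domain list below are assumptions drawn from the response log above; tune them to what you see in your own runs.

```python
# Resource types and domains to skip. The domains are taken from the
# response log above and are assumptions; adjust to your own findings.
BLOCK_RESOURCE_TYPES = {"image", "stylesheet", "font", "media"}
BLOCK_DOMAINS = ("cookielaw.org", "hotjar.com", "imgix.net", "mlhdocs.com")

def should_block(resource_type, url):
	# heavy static assets and tracking/image CDNs carry no data we need
	return resource_type in BLOCK_RESOURCE_TYPES or any(
		domain in url for domain in BLOCK_DOMAINS)

def handle_route(route):
	request = route.request
	if should_block(request.resource_type, request.url):
		route.abort()
	else:
		route.continue_()

# attach the handler before navigating, e.g.:
# page.route("**/*", handle_route)
# page.goto(url, timeout=120000)
```

Blocked requests never leave the browser, so this saves both bandwidth and time; just make sure the XHR calls you want to intercept are not caught by your own filters.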

For a more straightforward solution, we decided to switch to the wait_for_selector function. It is not ideal, but we noticed that the script sometimes stops altogether before the content loads. To avoid those cases, we changed the waiting method.

While inspecting the results, we saw that the wrapper was there from the skeleton, but each house's content was not. So we will wait for one of those: "h4[data-elm-id]".

with sync_playwright() as p: 
	def handle_response(response): 
		# the endpoint we are interested in
		if ("v1/search/assets?" in response.url): 
			print(response.json()["result"]["assets"]["asset"]) 
	# ... 
	page.on("response", handle_response) 
	# really long timeout since it gets stuck sometimes 
	page.goto(url, timeout=120000) 
	page.wait_for_selector("h4[data-elm-id]", timeout=120000)

Here we have the output, with even more info than the interface offers! Everything is clean and nicely formatted 😎

[{ 
	"item_id": "E192003", 
	"global_property_id": 2981226, 
	"property_id": 5444765, 
	"property_address": "13841 COBBLESTONE CT", 
	"property_city": "FONTANA", 
	"property_county": "San Bernardino", 
	"property_state": "CA", 
	"property_zip": "92335", 
	"property_type": "SFR", 
	"seller_code": "FSH", 
	"beds": 4, 
	"baths": 3, 
	"sqft": 1704, 
	"lot_size": 0.2, 
	"latitude": 34.10391, 
	"longitude": -117.50212, 
	...

We could go a step further and use the pagination to get the whole list, but we'll leave that to you.
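One way to sketch that pagination, accumulating the intercepted assets in a list that outlives each navigation (the next-page selector below is a hypothetical placeholder; check the real markup in DevTools before relying on it):

```python
all_assets = []

def handle_response(response):
	# the handler closes over all_assets, so data survives page changes
	if "v1/search/assets?" in response.url:
		all_assets.extend(response.json()["result"]["assets"]["asset"])

# driver sketch, to be combined with the setup code above:
# page.on("response", handle_response)
# page.goto(url, timeout=120000)
# for _ in range(3):  # first three result pages, as a demo
#     page.wait_for_selector("h4[data-elm-id]", timeout=120000)
#     page.click("[data-elm-id='pagination_next']")  # assumed selector
```

Collecting into a list instead of printing also makes it trivial to deduplicate by item_id or dump everything to a file afterward.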

Use case: twitter.com

Another typical case where there is no initial content is Twitter. To be able to scrape Twitter, you will undoubtedly need Javascript Rendering. As in the previous case, you could use CSS selectors once the entire content is loaded. But beware, since Twitter classes are dynamic and they will change frequently.

What will most probably remain the same is the API endpoint they use internally to get the main content: TweetDetail. In cases like this one, the easiest path is to check the XHR calls in the network tab in DevTools and look for some content in each request. It is an excellent example because Twitter can make 20 to 30 JSON or XHR requests per page view.

Twitter.com screenshot with DevTools open

Once we identify the calls and the responses we are interested in, the process will be similar.

import json 
from playwright.sync_api import sync_playwright 
 
url = "https://twitter.com/playwrightweb/status/1396888644019884033" 
 
with sync_playwright() as p: 
	def handle_response(response): 
		# the endpoint we are interested in
		if ("/TweetDetail?" in response.url): 
			print(json.dumps(response.json())) 
 
	browser = p.chromium.launch() 
	page = browser.new_page() 
	page.on("response", handle_response) 
	page.goto(url, wait_until="networkidle") 
	page.context.close() 
	browser.close()

The output will be a considerable JSON (80kb) with more content than we asked for. More than ten nested structures until we arrive at the tweet content. The good news is that we can now access favorite, retweet, or reply counts, images, dates, reply tweets with their content, and many more.
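Since the exact nesting tends to change between deployments, one defensive approach is to search the JSON recursively for the keys you care about (e.g. "full_text" or "favorite_count") instead of hardcoding the full path. A small helper, as a sketch:

```python
def find_key(data, key):
	"""Recursively yield every value stored under `key` in nested JSON."""
	if isinstance(data, dict):
		for k, v in data.items():
			if k == key:
				yield v
			yield from find_key(v, key)
	elif isinstance(data, list):
		for item in data:
			yield from find_key(item, key)

# e.g., on the parsed TweetDetail payload:
# texts = list(find_key(payload, "full_text"))
# favorites = list(find_key(payload, "favorite_count"))
```

A path-based lookup is faster and more precise, but a key search like this keeps working when Twitter shuffles the intermediate structures around.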

Use case: nseindia.com

Stock markets are an ever-changing source of essential data. Some sites offering this info, such as the National Stock Exchange of India, will start with an empty skeleton. After browsing for a few minutes on the site, we see that the market data loads via XHR.

Another common clue is to view the page source and check for content there. If it's not there, it usually means that it will load later, which probably requires XHR requests. And we can intercept those!

NSE screenshot with DevTools open

Since we are parsing a list, we will loop over it and print only part of the data in a structured way: the symbol and price of each entry.

from playwright.sync_api import sync_playwright 
 
url = "https://www.nseindia.com/market-data/live-equity-market" 
 
with sync_playwright() as p: 
	def handle_response(response):
		# the endpoint we are interested in
		if ("equity-stockIndices?" in response.url):
			for item in response.json()["data"]:
				print(item["symbol"], item["lastPrice"])
 
	browser = p.chromium.launch() 
	page = browser.new_page() 
 
	page.on("response", handle_response) 
	page.goto(url, wait_until="networkidle") 
 
	page.context.close() 
	browser.close() 
 
# Output: 
# NIFTY 50 18125.4 
# ICICIBANK 846.75 
# AXISBANK 845 
# ...

As in the previous examples, this is a simplified example. Printing is not the solution to a real-world problem. Instead, each page structure should have a content extractor and a method to store it. And the system should also handle the crawling part independently.
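As an illustration of that separation, here is a minimal sketch for the NSE case: one function that extracts rows from the payload shape shown above, and another that stores them (CSV is just one option):

```python
import csv

def extract_rows(payload):
	# per-page extractor: turn the raw JSON into plain (symbol, price) rows
	return [(item["symbol"], item["lastPrice"]) for item in payload["data"]]

def store_rows(rows, path):
	# storage lives apart from extraction, so either can change alone
	with open(path, "w", newline="") as f:
		writer = csv.writer(f)
		writer.writerow(["symbol", "lastPrice"])
		writer.writerows(rows)

# inside the response handler:
# store_rows(extract_rows(response.json()), "equity.csv")
```

With this split, adding a new target site means writing a new extractor while the storage (and, eventually, the crawling logic) stays untouched.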

Conclusion

We'd like you to take away three main points:
  1. Inspect the page looking for clean data
  2. API endpoints change less often than CSS selectors and HTML structure
  3. Playwright offers more than just Javascript rendering

Even if the extracted data is the same, fault tolerance and the effort of writing the scraper are fundamental factors. The less you have to change them manually, the better.

Apart from XHR requests, there are many other ways to scrape data beyond selectors. Not every one of them will work on a given website, but adding them to your toolbelt might help you often.

Did you find the content helpful? Spread the word and share it on Twitter, LinkedIn or Facebook.


Want to keep learning?

We will be sharing all the insights we have learned through the years in the following blog posts. If you don't want to miss a piece and keep learning, we'd be thrilled to have you in our newsletter.

No spam guaranteed. You can unsubscribe at any time.