React Crawling: How to crawl JavaScript-generated web pages

October 11, 2022 Β· 6 min read

React is a popular JavaScript library for building interactive UIs and single-page applications (SPAs). Websites built with React typically rely on JavaScript for some or all of their content, and regular HTTP libraries aren't enough for crawling these JavaScript-generated web pages. Let's see why they fall short and what the alternatives for React crawling are.

Before the modern internet of dynamic websites, web crawling was pretty straightforward. Virtually all websites served static, server-rendered HTML, so web crawlers could easily extract data from the HTML code. Today, however, most websites use React, Vue, or Angular to load content dynamically in the browser.

Follow this tutorial to learn, for educational purposes, how to crawl a React website.

Why are regular libraries not enough for React crawling?

When a bot visits a web page, it fetches the HTML file and parses its content to extract the necessary data. Since React websites render the final HTML with client-side JavaScript, the fetched HTML source doesn't yet contain the full page data. The browser initially renders a mostly empty static shell while it downloads and runs the JavaScript that populates the HTML with the actual page data.

That JavaScript execution often adds to or changes the entire page content. However, typical libraries fetch the HTML before the JavaScript runs, hence the incomplete data your crawler receives.
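You can see this for yourself by fetching a page with a plain HTTP request and inspecting what comes back. Here's a minimal sketch (assuming Node.js 18+ for the built-in fetch API); on a fully client-rendered React app, the response holds little more than an empty root element and some script tags rather than the content you see in the browser.

(async () => { 
	//fetch the raw HTML without executing any JavaScript (Node.js 18+) 
	const response = await fetch('https://reactstorefront.vercel.app/default-channel/en-US/'); 
	const html = await response.text(); 
 
	//depending on how much the site renders server-side, this pre-render HTML 
	//may be missing the content React injects later in the browser 
	console.log(html.slice(0, 500)); 
})();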

What is the alternative?

In React crawling, you need to run all the code on the page and render the content. If the website uses APIs to request data from the server, you can also get that data by mimicking the API call in your crawler. But regular libraries can't execute JavaScript, so the alternative is to crawl React websites with a headless browser or a DOM engine.

Headless browsers you could use include:
  • Selenium: an open-source browser automation framework available in Node.js, Python, Java, and other languages.
  • Puppeteer: a Node.js library that automates Chromium through a high-level API.

What is a React crawler?

A React web crawler is a tool that can extract the complete HTML data from a React website. A React crawler solution is able to render React components before fetching the HTML data and extracting the needed information.

Typically, a regular crawler takes in a list of URLs, also known as a seed list, from which it discovers other valuable URLs. However, when it encounters a React website, it doesn't return accurate data because it can't access the site's final HTML.

In web data extraction projects, web scrapers employ this crawling structure to find URLs from which they'd extract data. For example, you may have a single React website from which you want to extract the information. But you don't have the URLs of the pages you want to scrape. In this case, a React crawler takes your website's URL and selectively outputs the URLs you need.

Also, some web pages and URLs are more valuable than others. Thus, a React web crawler will generally organize pages and URLs according to their priority. This action is known as indexing. While some may use the terms crawling and indexing interchangeably, in reality, they are two different processes.

What is the difference between crawling and indexing?

Crawling is the discovery of URLs, while indexing is collecting, analyzing, and assigning URLs a priority.

Basic considerations for React crawling

Before we dive into how to crawl React websites, here are a few points to note. Crawling with headless browsers is slow and resource-intensive, so we want to avoid it whenever possible.

Not all React websites are the same. For example, some websites render only static HTML before plugging in React to display some content dynamically. So, the data you're after may be accessible by simply running HTTP requests, and you don't have to use headless browsers.

Thus, going through the following checklist before you begin is essential.

🤔 Is the JavaScript-rendered data part of the page source and accessible via a script tag somewhere in the DOM? If yes, we can easily parse the JSON string it contains. An example of this case can be seen on Redfin, a real estate website. If you inspect its HTML elements, you'll find the following script tag with JSON content.

React Storefront content
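When the data ships inside a script tag like that, you don't need a headless browser at all: one HTTP request plus a bit of parsing does the job. Here's a minimal sketch that pulls JSON out of a Next.js-style __NEXT_DATA__ script tag; the tag id and the JSON layout are assumptions for illustration, so inspect your target page to find the actual tag and structure.

(async () => { 
	//fetch the raw HTML (Node.js 18+) 
	const response = await fetch('https://reactstorefront.vercel.app/default-channel/en-US/'); 
	const html = await response.text(); 
 
	//the __NEXT_DATA__ id is an assumption; check your target page's source 
	const match = html.match(/<script id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/); 
 
	if (match) { 
		const data = JSON.parse(match[1]); 
		console.log(Object.keys(data)); 
	} else { 
		console.log('No embedded JSON found, a headless browser may be needed'); 
	} 
})();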

🤔 Is the website rendering data through a JavaScript API call? If true, we can mimic the API call by sending a request directly to the endpoint using an HTTP-based crawler. This often returns JSON, which we can easily parse to retrieve the necessary data.
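In that case, you can call the endpoint directly instead of loading the page at all. The sketch below uses a hypothetical /api/products endpoint as a placeholder; the real URL, query parameters, and headers are the ones you'll find in the Network tab of your browser's DevTools.

(async () => { 
	//the endpoint below is hypothetical: copy the real one from DevTools 
	const response = await fetch('https://example.com/api/products?page=1', { 
		headers: { 'Accept': 'application/json' }, 
	}); 
 
	//the API usually answers with JSON we can use directly 
	const data = await response.json(); 
	console.log(data); 
})();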

If both situations return a "No" answer, then the only feasible option is to use a headless browser.

Frustrated that your web scrapers are blocked again and again? ZenRows API handles rotating proxies and headless browsers for you.
Try for FREE

Prerequisites

For this React crawling tutorial, we'll be using Node.js and Puppeteer. Therefore, you'll need Node.js (or nvm) and npm installed; some systems come with them pre-installed. After that, install Puppeteer:

npm install puppeteer

Let's get started!

How to create a React web crawler with headless Chrome and Puppeteer?

First of all, initialize your project:

mkdir react-crawler 
cd react-crawler 
npm init -y

The commands above initialize your project. Next, create a new file, react-crawler.js, in the project's directory and open it in your preferred code editor. To run Puppeteer, we must import the Puppeteer library into our script:

const puppeteer = require('puppeteer');

Also, remember that the Puppeteer library is promise-based: it makes asynchronous API calls to Chromium behind the scenes. So, we must write our code inside an async function, which we'll wrap in an immediately invoked function expression (IIFE).

For this tutorial, we'll use React Storefront as our target website.

React Storefront

Using Node.js and Puppeteer, we can crawl the fully rendered content of our target website, text and all other elements included, with the following script.

The page's HTML content is too large to console.log(), so we store it in a file instead, as in the code below. That requires importing the Node.js fs module to use the fs.writeFile() function.

const puppeteer = require('puppeteer'); 
const fs = require('fs').promises; 
 
(async () => { 
	//initiate the browser 
	const browser = await puppeteer.launch(); 
 
	//create a new page in headless Chrome 
	const page = await browser.newPage(); 
 
	//go to target website 
	await page.goto('https://reactstorefront.vercel.app/default-channel/en-US/', { 
		//wait for content to load 
		waitUntil: 'networkidle0', 
	}); 
 
	//get full page html 
	const html = await page.content(); 
 
	//store html content in the reactstorefront file 
	await fs.writeFile('reactstorefront.html', html); 
 
	//close headless chrome 
	await browser.close(); 
})();
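Save the script and run it from your project's directory:

node react-crawler.js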

Your result should look like this:

React Storefront HTML result

Congratulations! You've scraped your first React website.

Selecting nodes with Puppeteer

Most data extraction projects don't aim to download an entire site's content. In that case, we must select the elements we want to crawl and access their URLs. The Puppeteer Page API explains the different methods for accessing a web page. You can also take a look at our article on web scraping with Puppeteer for details about how to extract data.
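For example, page.$$eval() runs a CSS selector against the page and hands the matched elements to a callback executed in the browser context, which is a compact way to collect attribute values. A minimal sketch, assuming the page object from the previous script:

//collect the href of every anchor on the page (goes inside the async function) 
const links = await page.$$eval('a', anchors => anchors.map(a => a.href)); 
console.log(links);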

You can use the corresponding selectors to get the required links, then add them to a list.

For this tutorial, let's crawl the products' URLs:

React Storefront records

So, if we inspect this page, we can locate the products' parent list and map over its anchor tags to get the href attributes we're after:

React Storefront element

Our complete code to get the hrefs will look like this:

const puppeteer = require('puppeteer'); 
 
(async () => { 
	//initiate the browser 
	const browser = await puppeteer.launch(); 
 
	//create a new page in headless Chrome 
	const page = await browser.newPage(); 
 
	//go to target website 
	await page.goto('https://reactstorefront.vercel.app/default-channel/en-US/', { 
		//wait for content to load 
		waitUntil: 'networkidle0', 
	}); 
 
	//get product urls 
	const productUrls = await page.evaluate(() => 
		Array.from(document.querySelectorAll('ul.grid > li a')).map(a => a.href) 
	); 
 
	console.log(productUrls); 
 
	await browser.close(); 
})();

And it should output the following result:

[ 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/abba-abba-1970-1982/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/abba-abba-abba-again/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/abba-abba-abba-mp3/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/abba-abba-aniversario-los-10-anos-de-abba/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/adele-adele-3-25/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/adele-adele-3-30/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/adele-adele-3-chasing-pavements/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/adele-adele-3-easy-on-me/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/acdc-acdc-a-giant-dose-of-rock-and-roll/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/acdc-acdc-detroit-rock-city/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/acdc-acdc-high-voltage-rock-n-roll/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/acdc-acdc-hot-seams-wet-dreams-rock-in-rio-1985/' 
]

The React crawling process can be endless, depending on your project goals. Follow the same steps above to crawl new links.
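To keep crawling, feed the URLs you just collected back into the same process. Here's a minimal sketch of that loop, assuming the browser instance and the productUrls array from the previous script, placed before browser.close(); we only log each page's title, but you'd extract whatever your project needs.

//visit each collected URL with the same browser instance 
for (const url of productUrls) { 
	const productPage = await browser.newPage(); 
	await productPage.goto(url, { waitUntil: 'networkidle0' }); 
 
	//extract whatever you need; here we just log the page title 
	console.log(await productPage.title()); 
 
	await productPage.close(); 
}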

Now, you might be wondering: is that it?

Yes, React crawling is not that complicated a process. With a few dozen lines of code, you can crawl JavaScript-generated pages. And you may not face any challenges until you hit a page that blocks you.

Anti-bot solutions are becoming increasingly common, and they pose a major challenge for web crawlers. We've already covered how to bypass anti-bot solutions like Akamai, so we won't go into further detail here.

In the meantime, we can optimize Puppeteer for performance.

How to optimize Puppeteer's performance

Have you ever wondered why crawling with headless browsers is slow compared to other approaches? It's no secret that React crawling significantly impacts your infrastructure, so the aim is to reduce the amount of work the headless browser does. Some of the extra work your headless browser might be doing includes:
  • Loading images.
  • Firing XHR requests.
  • Applying CSS rules.

However, extra work is relative to your use case. For example, if your goal is to screenshot a few web pages, you might not consider loading images as extra work.
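If those resources really are extra work for your use case, Puppeteer's request interception lets you skip them. Here's a minimal sketch, assuming the page object from the earlier scripts and placed before page.goto(); adjust the list of blocked resource types to fit your project.

//block resource types we don't need before navigating 
await page.setRequestInterception(true); 
page.on('request', request => { 
	const blocked = ['image', 'stylesheet', 'font']; 
	if (blocked.includes(request.resourceType())) { 
		request.abort(); 
	} else { 
		request.continue(); 
	} 
});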

That said, there are different approaches to optimizing a Puppeteer script, for example, the cache-first strategy.

Whenever we launch a new headless browser instance, Puppeteer creates a new but temporary directory that stores data like cookies and cache. It disappears once the browser is closed.

The following code forces Puppeteer to store those assets in a custom path. So it can reuse them every time we launch a new instance:

const browser = await puppeteer.launch({ 
	userDataDir: './data', 
});

This can result in a significant performance boost as Chrome won't need to download those assets continuously.

Conclusion

In this React crawling tutorial, you've learned why headless browsers are often necessary for crawling JavaScript-generated websites.

Let's do a quick recap of the steps:
  1. Install and run Puppeteer.
  2. Scrape your target URL.
  3. Extract links using selectors.
  4. Crawl the new links.
  5. Go back to step #2.

While writing all this code can be fun, doing this for thousands of pages can be overwhelming. With ZenRows, you can forget most of this and get the HTML of any website using a simple API call. Also, it handles the whole anti-bot bypass for you.



Want to keep learning?

We will be sharing all the insights we have learned through the years in the following blog posts. If you don't want to miss a piece and keep learning, we'd be thrilled to have you in our newsletter.

No spam guaranteed. You can unsubscribe at any time.