React Crawling: How to Crawl JavaScript-Generated Web Pages

October 10, 2022 · 6 min read

Before dynamic websites took over the web, crawling was relatively straightforward. Virtually all websites served static, server-rendered HTML, and spiders could easily extract data from the source code. Today is a different situation: most pages use React, Vue, or Angular to load content dynamically on the client.

React is a popular JavaScript library for building interactive UIs and Single-page applications (SPAs). Websites using it typically rely on JS for some or all of their content. You already know that regular libraries aren't enough for crawling JavaScript-generated web pages. So, let's see why that is and how to overcome it.

Your education on how to crawl a React website starts now!

Why Are Regular Libraries Not Enough for React Crawling?

When a bot visits a page, it fetches the HTML file and parses its content to extract the target data. Since React websites rely on client-side JavaScript to render the final HTML, the fetched source code doesn't contain the entire page data.

The browser initially renders static content while downloading and running the JavaScript needed to populate the HTML with the necessary information.

This background JS execution often changes or adds elements across the page. However, standard libraries fetch the HTML before the JavaScript fires, hence the incomplete data you receive in your crawler.
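
You can see this for yourself with a plain HTTP request. The sketch below is a minimal example, assuming Node 18+ for the built-in fetch and using a placeholder URL:

(async () => {
	// replace the placeholder with the React page you want to inspect
	const response = await fetch('https://example.com/');
	const html = await response.text();

	// this is the HTML before any JavaScript has run; compare it with
	// what your browser's DevTools show once the page finishes loading
	console.log(html);
})();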

What Is the Alternative?

In React crawling, you need to run all the code on the page and render the content. If the website uses API calls to request data from the server, you can also get that data by mimicking the calls in your crawler. But regular libraries can't render JavaScript, so the alternative is to build a web crawler for React sites using a headless browser or a DOM engine.

Headless browsers you could use include:

  • Selenium: An open-source browser automation framework available in Node.js, Python, Java, and many other languages.
  • Puppeteer: A Node.js library that provides a high-level API to automate Chromium-based browsers.

What Is a React Crawler?

A React crawler is a tool that extracts the complete HTML data from a React website. Such a solution can render React components before fetching the HTML content and retrieving the target info.

Typically, a regular crawler inspects a list of URLs (also known as a seed list) and discovers the ones it needs. However, it doesn't return accurate data upon encountering a React website because it can't access its final HTML.

Web scrapers employ this crawling structure to find URLs to extract data from. Let's clear it up with an example:

Say you have your target React website but no URLs of the pages you want to scrape. In this case, a React crawler would take the site's base URL, discover the links on its rendered pages, and output the ones you need.

Keep in mind that some web pages and their URLs are more valuable than others. Thus, the crawler will organize them according to their priority. This is known as indexing. While some may use the terms crawling and indexing interchangeably, in reality, they are two different processes.

What Is the Difference Between Crawling and Indexing?

Crawling is the discovery of URLs. On the other hand, indexing refers to collecting, analyzing, and organizing them.

Basic Considerations for React Crawling

Before we dive deeper into the world of React crawling, let's see an overview of the key things to keep in mind:

  • Crawling with headless browsers like Selenium or Puppeteer tends to be slow and performance-intensive, so you want to avoid it if possible.
  • Not all React websites are the same. For example, some render static HTML on the server before plugging in React to display dynamic content, meaning the data you're after may be accessible with plain HTTP requests, and there'd be no need to go headless.

Thus, going through the following checklist before you make any crawling efforts is essential:

🤔 Is the JavaScript-rendered data part of the page source and accessible via a script tag somewhere in the DOM? If yes, we can parse the JSON string from that tag directly. Here's an example: take the real estate website Redfin. If you inspect its HTML elements, you'll find the following script tag with JSON content.

(Screenshot: Redfin's page source showing a script tag with embedded JSON data.)
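
As a rough sketch of that first case, you could fetch the page and pull the JSON out of the script tag with an HTML parser such as cheerio (installed separately). The URL and the script tag's id below are made up; inspect your target page to find the real ones:

const cheerio = require('cheerio');

(async () => {
	// fetch the raw page source (Node 18+ ships fetch built in)
	const response = await fetch('https://www.example.com/some-listing');
	const html = await response.text();

	// locate the hypothetical script tag that embeds the page data as JSON
	const $ = cheerio.load(html);
	const raw = $('script#embedded-data').html();

	if (raw) {
		const data = JSON.parse(raw);
		console.log(Object.keys(data));
	}
})();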

🤔 Is the website rendering data through a JavaScript API call? If so, we can mimic the call by sending a direct request to the endpoint with an HTTP client. This often returns JSON, which we can easily parse later.
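
Here's a minimal sketch of that second case. The endpoint and query parameters are made up; you'd find the real ones in your browser's Network tab while the page loads:

(async () => {
	// call the endpoint the page itself uses to load its data
	const response = await fetch('https://www.example.com/api/products?page=1', {
		headers: { accept: 'application/json' },
	});

	// the endpoint typically answers with JSON we can use directly
	const data = await response.json();
	console.log(data);
})();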

If the answer to both questions is no, the only feasible option left is to use a headless browser.

Prerequisites

Here's what you need to follow this tutorial:

For this React crawling tutorial, we'll use Node.js and Puppeteer. Therefore, you must have Node (or nvm, if you prefer) and npm on your system; keep in mind that npm comes bundled with Node. Then install Puppeteer:

Terminal
npm install puppeteer

It's time to write some code. Let's get started!

How to Create a React Web Crawler With Headless Chrome and Puppeteer

First, initialize your project:

Terminal
mkdir react-crawler 
cd react-crawler 
npm init -y

That initializes the project with a default package.json. Next, create a new file, react-crawler.js, in your project's directory and open it in your favorite code editor. Then, import the Puppeteer library into your script:

react-crawler.js
const puppeteer = require('puppeteer');

Remember that the Puppeteer library is promise-based: it makes asynchronous API calls to Chromium behind the scenes. So, we must write our code inside an async function.
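
In practice, that means wrapping the crawling logic in an immediately invoked async function, which is the pattern the snippets below follow:

const puppeteer = require('puppeteer');

(async () => {
	// every Puppeteer call returns a promise, so we await them here
	const browser = await puppeteer.launch();

	// ... crawling logic goes here ...

	await browser.close();
})();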

Assume you want to retrieve data from the React Storefront demo site.

(Screenshot: the React Storefront demo site's homepage.)

With Node.js' help, we can execute a React crawl and scrape the rendered data with the following script. The page's HTML content is too large to print with console.log(), so we'll store it in a file instead, as shown in the code below.

Note that we import the promise-based API of Node.js' fs module to use the asynchronous fs.writeFile() function.

react-crawler.js
const puppeteer = require('puppeteer'); 
const fs = require('fs').promises; 
 
(async () => { 
	//initiate the browser 
	const browser = await puppeteer.launch(); 
 
	//create a new page in headless Chrome 
	const page = await browser.newPage(); 
 
	//go to target website 
	await page.goto('https://reactstorefront.vercel.app/default-channel/en-US/', { 
		//wait for content to load 
		waitUntil: 'networkidle0', 
	}); 
 
	//get full page html 
	const html = await page.content(); 
 
	//store html content in the reactstorefront file 
	await fs.writeFile('reactstorefront.html', html); 
 
	//close headless chrome 
	await browser.close(); 
})();

Here's what the result looks like:

(Screenshot: the HTML content saved to reactstorefront.html.)

Congratulations! You scraped your first React website.

Selecting Nodes With Puppeteer

Most data extraction projects don't aim to download the entire site's content. Thus, we must select only our target elements and access their URLs.

Take a look at the Puppeteer Page API for all the different methods you can use to access a web page. Or explore our in-depth guide on web scraping with Puppeteer for all the information you need on how to extract the desired data.

You can use the corresponding selectors to get the links and add them to a list.

Let's learn how to crawl the products' URLs:

(Screenshot: the product grid on the React Storefront homepage.)

If we inspect the page, we can locate the products' parent element and map over its anchor tags to obtain the href attributes we're after:

(Screenshot: browser DevTools showing the products' parent element and its anchor tags.)

Here's the code you need to do so:

react-crawler.js
const puppeteer = require('puppeteer'); 
 
(async () => { 
	//initiate the browser 
	const browser = await puppeteer.launch(); 
 
	//create a new page in headless Chrome 
	const page = await browser.newPage(); 
 
	//go to target website 
	await page.goto('https://reactstorefront.vercel.app/default-channel/en-US/', { 
		//wait for content to load 
		waitUntil: 'networkidle0', 
	}); 
 
	//get product urls 
	const productUrls = await page.evaluate(() => 
		Array.from(document.querySelectorAll('ul.grid > li a')).map(a => a.href) 
	); 
 
	console.log(productUrls); 
 
	await browser.close(); 
})();

And that's the result:

Output
[ 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/abba-abba-1970-1982/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/abba-abba-abba-again/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/abba-abba-abba-mp3/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/abba-abba-aniversario-los-10-anos-de-abba/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/adele-adele-3-25/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/adele-adele-3-30/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/adele-adele-3-chasing-pavements/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/adele-adele-3-easy-on-me/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/acdc-acdc-a-giant-dose-of-rock-and-roll/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/acdc-acdc-detroit-rock-city/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/acdc-acdc-high-voltage-rock-n-roll/', 
	'https://reactstorefront.vercel.app/default-channel/en-US/products/acdc-acdc-hot-seams-wet-dreams-rock-in-rio-1985/' 
]

Depending on your project's goals, the React crawling process can go on indefinitely: just repeat the steps above for every new link you discover.
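
For instance, here's a minimal sketch of such a loop that keeps visiting newly discovered links on the same site. The page limit and the link filter are assumptions to keep the example short:

const puppeteer = require('puppeteer');

(async () => {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();

	const startUrl = 'https://reactstorefront.vercel.app/default-channel/en-US/';
	const visited = new Set();
	const queue = [startUrl];

	// stop after 10 pages so the example stays short
	while (queue.length > 0 && visited.size < 10) {
		const url = queue.shift();
		if (visited.has(url)) continue;
		visited.add(url);

		await page.goto(url, { waitUntil: 'networkidle0' });

		// collect every link on the rendered page
		const links = await page.evaluate(() =>
			Array.from(document.querySelectorAll('a')).map(a => a.href)
		);

		// enqueue links that belong to the same site and haven't been visited yet
		for (const link of links) {
			if (link.startsWith(startUrl) && !visited.has(link)) {
				queue.push(link);
			}
		}

		console.log(`Crawled ${url} and found ${links.length} links`);
	}

	await browser.close();
})();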

Now, you might be wondering: Is that it?

Yes! React crawling isn't that complicated. With about ten lines of code, you can crawl JS-generated pages, and you may not face any challenges while doing so. That is, until you encounter a page that blocks you.

Unfortunately, anti-bot protection is practically everywhere these days, and it's another major challenge for web crawlers. We won't go into further detail here, as that goes beyond the scope of this article, but as always, we've got you covered.

Explore our step-by-step tutorial on how to bypass anti-bot solutions like Akamai to learn more.

In the meantime, we'll optimize Puppeteer's performance.

How to Optimize Puppeteer's Performance

You've probably wondered why crawling with headless browsers is much slower than other methods. It's no secret that React crawling significantly impacts your infrastructure, so the goal is to streamline the headless browser's work.

Here's some of the extra work a headless browser does compared to a plain HTTP request:

  • Loading images
  • Firing XHR requests
  • Applying CSS rules

However, what counts as extra work depends on your project's goal. For instance, if that goal is to screenshot a few pages, you probably wouldn't consider loading images extra.
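
One common way to skip that extra work is Puppeteer's request interception. Here's a minimal sketch, assuming images, stylesheets, and fonts aren't needed for your crawl:

const puppeteer = require('puppeteer');

(async () => {
	const browser = await puppeteer.launch();
	const page = await browser.newPage();

	// intercept every request and abort the resource types we don't need
	await page.setRequestInterception(true);
	page.on('request', (request) => {
		const blocked = ['image', 'stylesheet', 'font'];
		if (blocked.includes(request.resourceType())) {
			request.abort();
		} else {
			request.continue();
		}
	});

	await page.goto('https://reactstorefront.vercel.app/default-channel/en-US/', {
		waitUntil: 'networkidle0',
	});
	console.log((await page.content()).length);

	await browser.close();
})();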

That said, there are different approaches to optimizing a Puppeteer script.

Let's dig into this:

Take, for example, the cache-first strategy. When we launch a new headless browser instance, Puppeteer creates a temporary directory that stores data like cookies and cache. It all disappears once the browser is closed.

The following code tells the library to keep those assets in a custom path so that it can reuse them every time we launch a new instance:

react-crawler.js
const browser = await puppeteer.launch({ 
	userDataDir: './data', 
});

This results in a significant performance boost as Chrome won't need to download those assets continuously.

Conclusion

In this React crawling tutorial, you've learned why we must use headless browsers for crawling JavaScript-generated websites.

Here's a quick recap of the steps:

  1. Install and run Puppeteer.
  2. Scrape your target URL.
  3. Extract links using selectors.
  4. Crawl the new links.
  5. Go back to step #2.

While writing all this code can be fun, doing this continuously for thousands of pages can be overwhelming, to say the least.

Check out ZenRows and forget all about such troubles and challenges. It bypasses anti-bot protection for you and helps you get your target data with a simple API call.
