Before dynamic websites took over the web, crawling was relatively straightforward. Virtually all websites served static, server-rendered HTML, and spiders could easily extract data from it. Today, the situation is different: most pages use React, Vue, or Angular to load content dynamically in the browser.
React is a popular JavaScript library for building interactive UIs and Single-page applications (SPAs). Websites using it typically rely on JS for some or all of their content. You already know that regular libraries aren't enough for crawling JavaScript-generated web pages. So, let's see why that is and how to overcome it.
Your education on how to crawl a React website starts now!
Why Are Regular Libraries Not Enough for React Crawling?
When a bot visits a page, it fetches the HTML file and parses its content to extract the target data. Since React websites rely on client-side JavaScript to render the final HTML, the fetched source code doesn't contain the entire page data.
The browser initially renders static content while downloading and running the JavaScript needed to populate the HTML with the necessary information.
That background JS execution often changes or adds elements across the page. Standard HTTP libraries, however, fetch the HTML before any JavaScript runs, hence the incomplete data you receive in your crawler.
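You can see this for yourself by fetching a JavaScript-heavy page with a plain HTTP request and checking whether your target data is in the response. Here's a minimal sketch, assuming Node.js 18+ (which ships with a built-in fetch); the URL and the text you search for are placeholders to adapt to your own target:

// check-raw-html.js: fetch the raw HTML without executing any JavaScript
(async () => {
  const response = await fetch('https://reactstorefront.vercel.app/default-channel/en-US/');
  const html = await response.text();
  // Replace 'Add to cart' with any text you expect on the fully rendered page;
  // on a client-rendered site, it's often missing from the raw HTML
  console.log(html.includes('Add to cart') ? 'found in raw HTML' : 'missing: JS rendering needed');
})();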
What Is the Alternative?
In React crawling, you need to run all the code on the page and render the content. If the website uses APIs to request data from the server, you can view that data by mimicking the API call in your crawler. But regular libraries can't render JavaScript, so the alternative is to build a web crawler for React sites using a headless browser or a DOM engine.
Headless browser tools you could use include:
- Selenium: An open-source browser automation framework available in Node.js, Python, Java, and many other languages.
- Puppeteer: A Node.js library that provides a high-level API to control Chromium and automate browser actions.
What Is a React Crawler?
A React crawler is a tool that extracts the complete HTML data from a React website. Such a solution can render React components before fetching the HTML content and retrieving the target info.
Typically, a regular crawler inspects a list of URLs (also known as a seed list) and discovers the ones it needs. However, it doesn't return accurate data upon encountering a React website because it can't access its final HTML.
Web scrapers employ this crawling structure to find URLs to extract data from. Let's clear it up with an example:
Say you have your target React website but no URLs of the pages you want to scrape. In this case, a React crawler would take your website's URL and selectively output the ones you need.
Keep in mind that some web pages and their URLs are more valuable than others. Thus, the crawler will organize them according to their priority. This is known as indexing. While some may use the terms crawling and indexing interchangeably, in reality, they are two different processes.
What Is the Difference Between Crawling and Indexing?
Crawling is the discovery of URLs. On the other hand, indexing refers to collecting, analyzing, and organizing them.
Basic Considerations for React Crawling
Before we dive deeper into the world of React crawling, let's see an overview of the key things to keep in mind:
- Crawling with headless browsers like Selenium or Puppeteer tends to be slow and performance-intensive. Thus, you want to avoid it if possible.
- Not all React websites are the same. For example, some render static HTML on the server before plugging in React to display dynamic content. In those cases, the data you're after may be accessible with simple HTTP requests, and there'd be no need to go headless.
Thus, going through the following checklist before you make any crawling efforts is essential:
🤔 Is the JavaScript-rendered data part of the page source, i.e., available in a script tag somewhere in the DOM? If yes, we can easily parse the string into a JS object. For example, if you inspect the HTML elements of the real estate website Redfin, you'll find script tags with JSON content embedded in the page.
🤔 Is the website rendering data through a JavaScript API call? If so, we can mimic that call by sending a direct request to the endpoint with an HTTP React crawler. This often returns JSON, which is easy to parse (see the sketch after this checklist).
If the answer to both questions is negative, the only feasible option is to use a headless browser.
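To make the checklist concrete, here's a hedged sketch covering both cases, again assuming Node.js 18+ with built-in fetch. The script tag id, the API endpoint, and the JSON shape are all placeholders for illustration, not real endpoints of any particular site:

(async () => {
  // Case 1: data embedded in a script tag in the static HTML
  const pageResponse = await fetch('https://example.com/listing');
  const pageHtml = await pageResponse.text();
  // The id and structure of the script tag vary per site; inspect the page source to find it
  const match = pageHtml.match(/<script id="__DATA__" type="application\/json">(.*?)<\/script>/s);
  if (match) {
    const embeddedData = JSON.parse(match[1]);
    console.log('Embedded JSON:', embeddedData);
  }

  // Case 2: data served by a JSON API the page calls in the background
  // (find the real endpoint in your browser's Network tab)
  const apiResponse = await fetch('https://example.com/api/products?page=1');
  const apiData = await apiResponse.json();
  console.log('API JSON:', apiData);
})();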
Prerequisites
Here's what you need to follow this tutorial:
For this React crawling tutorial, we'll use Node.js and Puppeteer. Therefore, you must have Node.js (or nvm, if you prefer) and npm on your system; npm ships with Node.js, so installing Node gives you both. Then, install Puppeteer:
npm install puppeteer
It's time to write some code. Let's get started!
How to Create a React Web Crawler With Headless Chrome and Puppeteer?
First, initialize your project:
mkdir react-crawler
cd react-crawler
npm init -y
That creates a package.json file in your project's directory. Next, create a new file named react-crawler.js and open it in your favorite code editor. Then, import the Puppeteer library into your script:
const puppeteer = require('puppeteer');
Remember that Puppeteer is promise-based: it makes asynchronous calls to Chromium behind the scenes. So, we must write our code inside an async function; here, we'll use an immediately invoked async function expression.
Assume you want to retrieve data from the React Storefront demo site. With Node.js and Puppeteer, we can crawl the page and scrape the rendered data with the following script. The page's HTML content is too large to print with console.log(), so we'll store it in a file instead. Note that we import Node.js' fs module to use the fs.writeFile() function.
const puppeteer = require('puppeteer');
const fs = require('fs').promises;
(async () => {
//initiate the browser
const browser = await puppeteer.launch();
//create a new page in headless chrome
const page = await browser.newPage();
//go to target website
await page.goto('https://reactstorefront.vercel.app/default-channel/en-US/', {
//wait for content to load
waitUntil: 'networkidle0',
});
//get full page html
const html = await page.content();
//store html content in the reactstorefront file
await fs.writeFile('reactstorefront.html', html);
//close headless chrome
await browser.close();
})();
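Save the file and run it from your project directory:

node react-crawler.js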
Open reactstorefront.html, and you'll see the page's fully rendered HTML, including the JavaScript-generated content.
Congratulations! You scraped your first React website.
Selecting Nodes With Puppeteer
Most data extraction projects don't aim to download the entire site's content. Thus, we must select only our target elements and access their URLs.
Take a look at the Puppeteer Page API for all the different methods you can use to access a web page. Or explore our in-depth guide on web scraping with Puppeteer for all the information you need on how to extract the desired data.
You can use the corresponding selectors to get the links and add them to a list.
Let's learn how to crawl the products' URLs:
If we inspect the page, we can locate the product list (a ul element with the grid class) and map over its anchor tags to obtain the href attributes we're after.
Here's the code you need to do so:
const puppeteer = require('puppeteer');
(async () => {
//initiate the browser
const browser = await puppeteer.launch();
//create a new page in headless chrome
const page = await browser.newPage();
//go to target website
await page.goto('https://reactstorefront.vercel.app/default-channel/en-US/', {
//wait for content to load
waitUntil: 'networkidle0',
});
//get product urls
const productUrls = await page.evaluate(() =>
Array.from(document.querySelectorAll('ul.grid > li a')).map(a => a.href)
);
console.log(productUrls);
await browser.close();
})();
And that's the result:
[
'https://reactstorefront.vercel.app/default-channel/en-US/products/abba-abba-1970-1982/',
'https://reactstorefront.vercel.app/default-channel/en-US/products/abba-abba-abba-again/',
'https://reactstorefront.vercel.app/default-channel/en-US/products/abba-abba-abba-mp3/',
'https://reactstorefront.vercel.app/default-channel/en-US/products/abba-abba-aniversario-los-10-anos-de-abba/',
'https://reactstorefront.vercel.app/default-channel/en-US/products/adele-adele-3-25/',
'https://reactstorefront.vercel.app/default-channel/en-US/products/adele-adele-3-30/',
'https://reactstorefront.vercel.app/default-channel/en-US/products/adele-adele-3-chasing-pavements/',
'https://reactstorefront.vercel.app/default-channel/en-US/products/adele-adele-3-easy-on-me/',
'https://reactstorefront.vercel.app/default-channel/en-US/products/acdc-acdc-a-giant-dose-of-rock-and-roll/',
'https://reactstorefront.vercel.app/default-channel/en-US/products/acdc-acdc-detroit-rock-city/',
'https://reactstorefront.vercel.app/default-channel/en-US/products/acdc-acdc-high-voltage-rock-n-roll/',
'https://reactstorefront.vercel.app/default-channel/en-US/products/acdc-acdc-hot-seams-wet-dreams-rock-in-rio-1985/'
]
Depending on your project goals, the React crawling process can go on for as long as you need. Just repeat the steps above to crawl each newly discovered link.
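If you'd like to automate that loop, a queue plus a set of visited URLs is enough. Here's a hedged sketch built on the same Puppeteer calls as above; the page limit and the same-domain filter are assumptions you'd tune for your own project:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const queue = ['https://reactstorefront.vercel.app/default-channel/en-US/'];
  const visited = new Set();
  const maxPages = 20; // arbitrary safety limit for this sketch

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);

    await page.goto(url, { waitUntil: 'networkidle0' });

    // Collect every link on the page and enqueue the ones we haven't seen yet
    const links = await page.evaluate(() =>
      Array.from(document.querySelectorAll('a[href]')).map(a => a.href)
    );
    links
      .filter(link => link.startsWith('https://reactstorefront.vercel.app') && !visited.has(link))
      .forEach(link => queue.push(link));

    console.log(`Crawled ${url}, queue size: ${queue.length}`);
  }

  await browser.close();
})();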
Now, you might be wondering: Is that it?
Yes! React crawling isn't that complicated: with about 10 lines of code, you can crawl JS-generated pages. You may not face any challenges at first. That is, until you encounter a URL that blocks you.
Unfortunately, anti-bot protection is practically everywhere these days, and it's another major challenge for web crawlers. We won't go into further detail here, as that goes beyond the scope of this article, but as always, we've got you covered.
Explore our step-by-step tutorial on how to bypass anti-bot solutions like Akamai to learn more.
In the meantime, we'll optimize Puppeteer's performance.
How to Optimize Puppeteer's Performance
You've probably wondered why crawling with headless browsers is much slower than other methods. It's no secret that React crawling puts a significant load on your infrastructure, so the goal is to streamline the headless browser's work.
Here's some of the extra work a headless browser might be doing:
- Loading images
- Firing XHR requests
- Applying CSS rules
However, what counts as extra work depends on your project's goal. For instance, if that goal is to screenshot a few pages, you might not consider loading images extra at all.
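For instance, if images, stylesheets, and fonts are irrelevant to your goal, you can skip them with Puppeteer's request interception. Here's a minimal sketch of that idea, using the same demo URL as before; the list of blocked resource types is an assumption you'd adjust to your project:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Abort requests for resource types we don't need for plain data extraction
  await page.setRequestInterception(true);
  page.on('request', request => {
    const blocked = ['image', 'stylesheet', 'font'];
    if (blocked.includes(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  });

  await page.goto('https://reactstorefront.vercel.app/default-channel/en-US/', {
    waitUntil: 'networkidle0',
  });
  console.log((await page.content()).length);

  await browser.close();
})();

Blocking unneeded resources typically cuts both load time and bandwidth, which adds up quickly across thousands of pages.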
That said, there are different approaches to optimizing a Puppeteer script.
Let's dig into this:
Take, for example, the cache-first strategy. When we launch a new headless browser instance, Puppeteer creates a temporary directory that stores data like cookies and cache. It all disappears once the browser is closed.
The following code forces the library to keep those assets in a custom path. So it can reuse them every time we launch a new instance:
const browser = await puppeteer.launch({
userDataDir: './data',
});
This results in a significant performance boost, as Chrome won't need to re-download those assets on every run.
Conclusion
In this React crawling tutorial, you've learned why headless browsers are often necessary for crawling JavaScript-generated websites.
Here's a quick recap of the steps:
- Install and run Puppeteer.
- Scrape your target URL.
- Extract links using selectors.
- Crawl the new links.
- Go back to step #2.
While writing all this code can be fun, doing this continuously for thousands of pages can be overwhelming, to say the least.
Check out ZenRows and forget all about such troubles and challenges. It bypasses anti-bot protection for you and helps you get your target data with a simple API call.