
JavaScript Web Crawler with Node.js: A Step-By-Step Tutorial

Yuvraj Chandra
Updated: November 15, 2024 · 8 min read

Do you want to take your JavaScript web scraping skills to the next level by building a web crawler? We've got you covered!

In this tutorial, you'll learn how to build a fast and efficient JavaScript web crawler with best practices to optimize your crawler's performance and effectiveness.

What Is a Web Crawler?

A web crawler, also known as a web spider, is a tool that systematically navigates the web to gather information. Like a search engine, it starts from a list of known web pages and follows their links to discover additional URLs. Many developers also use web crawling tools to simplify and streamline this process.

The primary difference between web crawling and web scraping is that web crawling focuses on traversing and indexing a wide array of web pages, often to build a comprehensive database of links. On the other hand, web scraping targets specific web elements for extraction within those pages.

Developing a web spider involves just a few steps and some good practices. You'll want to design your crawler to add new URLs to a queue continuously. The web crawling process can be endless, so it's essential to set limits based on your project needs. For example, you can control the crawl frequency and crawl depth or prioritize certain pages using a priority queue.
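For instance, here's a minimal sketch of that core loop (a sketch only, using axios and cheerio, which you'll install shortly, and example.com as a placeholder start URL). Each queued entry carries its depth so the spider stops following links past a chosen level:

Example
// npm install axios cheerio
const axios = require('axios');
const cheerio = require('cheerio');

// placeholder start URL for illustration
const startUrl = 'https://example.com/';

// queue entries carry the URL and how deep it sits in the link graph
const queue = [{ url: startUrl, depth: 0 }];
const visited = new Set();
const maxDepth = 2; // don't follow links deeper than this

const crawl = async () => {
    while (queue.length > 0) {
        const { url, depth } = queue.shift();
        if (visited.has(url) || depth > maxDepth) continue;
        visited.add(url);

        try {
            const { data } = await axios.get(url);
            const $ = cheerio.load(data);

            // enqueue every same-site link one level deeper
            $('a[href]').each((_, el) => {
                const link = new URL($(el).attr('href'), url).href;
                if (link.startsWith(startUrl)) {
                    queue.push({ url: link, depth: depth + 1 });
                }
            });
        } catch (error) {
            console.error(`Error fetching ${url}: ${error.message}`);
        }
    }
    console.log(`Crawled ${visited.size} page(s)`);
};

crawl();

The tutorial below keeps things simpler and uses a page-count limit (maxCrawlLength) instead of tracking per-URL depth.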

Now, let's build your first Node.js web crawler!

Build Your First JavaScript Web Crawler

You'll now build a JavaScript web crawler by crawling the e-commerce challenge page. Here's what the page looks like:

ScrapingCourse.com Ecommerce homepage

The target site contains many linked pages, including paginated products, carts, checkout pages, etc. You'll crawl some of those links and extract specific data from the paginated product pages.

Let's begin by setting up the prerequisites for building your web crawler.


Prerequisites for JavaScript Crawling

Here are the tools you need to get started with this web crawling tutorial.

  • Node.js and npm: This tutorial requires a Node.js v20+ runtime environment and the Node package manager (npm) v10+. Installing Node.js also installs npm by default, so download and install the latest Node.js version if you haven't already (see the version-check commands right after this list).
  • Axios and Cheerio: You'll use Axios as your HTTP client and Cheerio as an HTML parser.
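
To confirm your setup, check your installed versions from the terminal:

Terminal
node -v
npm -v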

Create a new "crawler" project folder and open your terminal in that directory. Then, initialize a new Node.js project:

Terminal
npm init -y

Now, install Axios and Cheerio using npm:

Terminal
npm install axios cheerio

Finally, create a crawler.js file in your project folder, and you're ready to build your first Node.js web crawler with Axios and Cheerio. 

Open your crawler.js. Let's go through the crawling steps.

Step 1: Follow All the Links on a Website

The simplest JavaScript web crawler you'll build is a basic request to the target site. Import axios and create a crawler function that opens the target website to get the initial HTML response:

crawler.js
// npm install axios cheerio 
const axios = require('axios');

// specify the URL of the site to crawl
const targetUrl = 'https://www.scrapingcourse.com/ecommerce/';

// define a crawler function
const crawler = async () => {
    try {
        // request the target website
        const response = await axios.get(targetUrl);
    } catch (error) {
        // handle any error that occurs during the HTTP request
        console.error(`Error fetching ${targetUrl}: ${error.message}`);
    }
};

That's a simple Axios request. However, this section aims to crawl the website for links, so let's modify the basic crawler above to achieve that.

Add cheerio to your imports and add the target URL to an array of urlsToVisit:

crawler.js
// npm install axios cheerio
// ...
const cheerio = require('cheerio');

// ...

let urlsToVisit = [targetUrl];

// ...

Modify the crawler function to add a for loop that keeps running while the urlsToVisit array isn't empty, shifting the next URL off the queue on each iteration. Then, expand the try block to parse the website's HTML and find all a tags on the page. Convert every relative path to an absolute URL, and only queue a link if it belongs to the target website and isn't already in the queue:

crawler.js
// ...

// define a crawler function
const crawler = async () => {
    for (; urlsToVisit.length > 0;) {
        // get the next URL from the list
        const currentUrl = urlsToVisit.shift();

        try {
            // ...
            // parse the website's HTML
            const $ = cheerio.load(response.data);

            // find all links on the page
            const linkElements = $('a[href]');
            linkElements.each((index, element) => {
                let url = $(element).attr('href');

                // check if the URL is a full link or a relative path
                if (!url.startsWith('http')) {
                    // remove leading slash if present
                    url = targetUrl + url.replace(/^\//, '');
                }

                // follow links within the target website
                if (url.startsWith(targetUrl) && !urlsToVisit.includes(url)) {
                    // update the URLs to visit
                    urlsToVisit.push(url);
                }
            });
        } catch (error) {
           // ... error handling
        }
    }
};

Next, you'll apply a basic crawl limit to prevent the crawl from running indefinitely. It acts like a depth limit: the web spider stops after visiting a specific number of pages.

You can set this limit by specifying a maximum crawl length, after which the crawling process stops. We'll set the crawl length to 20 pages:

crawler.js
// ...

// define the desired crawl limit
const maxCrawlLength = 20;

// ...

Apply this crawl length as a condition in the for loop to stop the process after crawling 20 links. Add a crawledCount variable to track the number of crawled URLs. Then, log the urlsToVisit array to see the URLs discovered so far and execute the crawler function:

crawler.js
// ...

// define a crawler function
const crawler = async () => {
    // track the number of crawled URLs
    let crawledCount = 0;
    for (; urlsToVisit.length > 0 && crawledCount < maxCrawlLength;) {
        // increment the crawl count
        crawledCount++;

        // ... crawling logic
    }
    console.log(urlsToVisit);
};

// execute the crawler function
crawler();

Before moving to the next step, let's combine all the snippets to see what you have:

crawler.js
// npm install axios cheerio
const axios = require('axios');
const cheerio = require('cheerio');

// specify the URL of the site to crawl
const targetUrl = 'https://www.scrapingcourse.com/ecommerce/';

// add the target URL to an array of URLs to visit
let urlsToVisit = [targetUrl];

// define the desired crawl limit
const maxCrawlLength = 20;

// define a crawler function
const crawler = async () => {
    // track the number of crawled URLs
    let crawledCount = 0;

    for (; urlsToVisit.length > 0 && crawledCount < maxCrawlLength;) {
        // get the next URL from the list
        const currentUrl = urlsToVisit.shift();
        // increment the crawl count
        crawledCount++;

        try {
            // request the target website
            const response = await axios.get(currentUrl);
            // parse the website's HTML
            const $ = cheerio.load(response.data);

            // find all links on the page
            const linkElements = $('a[href]');
            linkElements.each((index, element) => {
                let url = $(element).attr('href');

                // check if the URL is a full link or a relative path
                if (!url.startsWith('http')) {
                    // remove leading slash if present
                    url = targetUrl + url.replace(/^\//, '');
                }

                // follow links within the target website
                if (url.startsWith(targetUrl) && !urlsToVisit.includes(url)) {
                    // update the URLs to visit
                    urlsToVisit.push(url);
                }
            });
        } catch (error) {
            // handle any error that occurs during the HTTP request
            console.error(`Error fetching ${currentUrl}: ${error.message}`);
        }
    }
    console.log(urlsToVisit);
};

// execute the crawler function
crawler();

Execute the code, and you'll see the URLs the spider has discovered and queued:

Output
[
    'https://www.scrapingcourse.com/ecommerce/?add-to-cart=2740',
    'https://www.scrapingcourse.com/ecommerce/product/ajax-full-zip-sweatshirt/',

    // ... omitted for brevity,

    'https://www.scrapingcourse.com/ecommerce/page/1/',
    'https://www.scrapingcourse.com/ecommerce/page/5/',

    // ... omitted for brevity,

    // ... 118 more items,
]

You'll scrape product data from specific URLs in the next step.

Step 2: Extract Data From Your Crawler

In this step, you'll extract product data selectively from paginated pages by filtering the URLs to scrape. You'll extract product names, prices, image URLs, and links.

Start by creating an empty array to collect the scraped data. Observe the website's pagination pattern from the previously crawled links. You'll see that the path is /page/<PAGE_NUMBER>.

Configure your crawler to scrape product data only from URLs containing the paginated URL pattern. To achieve that, define a regular expression (regex) to match the pattern and test it against the current URL:

crawler.js
// ...

// to store scraped product data
const productData = [];

// define a crawler function
const crawler = async () => {
    // define a regex to match the pagination pattern
    const pagePattern = /page\/\d+/i;

    // ...
    for (; urlsToVisit.length > 0 && crawledCount < maxCrawlLength;) {
        // ...
        try {
            // ...

            // extract product information from paginated product pages only
            if (pagePattern.test(currentUrl)) {
                // retrieve all product containers
                const productContainers = $('.product');

                // iterate through the product containers to extract data
                productContainers.each((index, product) => {
                    const data = {};

                    data.url =
                        $(product)
                            .find('.woocommerce-LoopProduct-link')
                            .attr('href') || 'N/A';
                    data.image =
                        $(product).find('.product-image').attr('src') || 'N/A';
                    data.name =
                        $(product).find('.product-name').text().trim() || 'N/A';
                    data.price =
                        $(product).find('.price').text().trim() || 'N/A';
                    // append the scraped data to the empty array
                    productData.push(data);
                });
            }
        } catch (error) {
            // ... error handling
        }
    }
    // ...
};

The current crawler extracts product data from paginated links only. It's time to store the data.

Step 3: Export the Scraped Data to CSV

Storing your data after crawling is essential for referencing, further analysis, sharing, and more. You can save your data in CSV, XLSX, JSON, or a remote database. Let's choose CSV to keep it simple.

To export the previously scraped data from your crawler, import Node.js's built-in fs module. Then, write each entry as a new row under the specified headers:

crawler.js
// ...
const fs = require('fs');

// define a crawler function
const crawler = async () => {
    // ...

    // write productData to a CSV file
    const header = 'Url,Image,Name,Price\n';
    const csvRows = productData
        .map((item) => `${item.url},${item.image},${item.name},${item.price}`)
        .join('\n');
    const csvData = header + csvRows;

    fs.writeFileSync('products.csv', csvData);
    console.log('CSV file has been successfully created!');
};
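
Note that this simple string concatenation assumes none of the scraped values contain commas or quotes. If they might (say, a product name with a comma), you can quote and escape each field with a small helper; toCsvValue below is a hypothetical example:

Example
// quote and escape each field so commas or quotes inside values don't break the CSV columns
const toCsvValue = (value) => `"${String(value).replace(/"/g, '""')}"`;

const csvRows = productData
    .map((item) =>
        [item.url, item.image, item.name, item.price].map(toCsvValue).join(',')
    )
    .join('\n');

The complete code below sticks with the simpler unquoted version, which is enough for this target site's values.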

Merge all the snippets from all steps. You'll get the following complete code:

crawler.js
// npm install axios cheerio
const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');

// specify the URL of the site to crawl
const targetUrl = 'https://www.scrapingcourse.com/ecommerce/';

// add the target URL to an array of URLs to visit
let urlsToVisit = [targetUrl];

// define the desired crawl limit
const maxCrawlLength = 20;

// to store scraped product data
const productData = [];

// define a crawler function
const crawler = async () => {
    // track the number of crawled URLs
    let crawledCount = 0;
    // define a regex to match the pagination pattern
    const pagePattern = /page\/\d+/i;
    for (; urlsToVisit.length > 0 && crawledCount < maxCrawlLength;) {
        // get the next URL from the list
        const currentUrl = urlsToVisit.shift();
        // increment the crawl count
        crawledCount++;

        try {
            // request the target website
            const response = await axios.get(currentUrl);
            // parse the website's HTML
            const $ = cheerio.load(response.data);

            // find all links on the page
            const linkElements = $('a[href]');
            linkElements.each((index, element) => {
                let url = $(element).attr('href');

                // check if the URL is a full link or a relative path
                if (!url.startsWith('http')) {
                    // remove leading slash if present
                    url = targetUrl + url.replace(/^\//, '');
                }

                // follow links within the target website
                if (url.startsWith(targetUrl) && !urlsToVisit.includes(url)) {
                    // update the URLs to visit
                    urlsToVisit.push(url);
                }
            });

            // extract product information from paginated product pages only
            if (pagePattern.test(currentUrl)) {
                // retrieve all product containers
                const productContainers = $('.product');

                // iterate through the product containers to extract data
                productContainers.each((index, product) => {
                    const data = {};

                    data.url =
                        $(product)
                            .find('.woocommerce-LoopProduct-link')
                            .attr('href') || 'N/A';
                    data.image =
                        $(product).find('.product-image').attr('src') || 'N/A';
                    data.name =
                        $(product).find('.product-name').text().trim() || 'N/A';
                    data.price =
                        $(product).find('.price').text().trim() || 'N/A';

                    // append the scraped data to the empty array
                    productData.push(data);
                });
            }
        } catch (error) {
            // handle any error that occurs during the HTTP request
            console.error(`Error fetching ${currentUrl}: ${error.message}`);
        }
    }
    // write productData to a CSV file
    const header = 'Url,Image,Name,Price\n';
    const csvRows = productData
        .map((item) => `${item.url},${item.image},${item.name},${item.price}`)
        .join('\n');
    const csvData = header + csvRows;

    fs.writeFileSync('products.csv', csvData);
    console.log('CSV file has been successfully created!');
};

// execute the crawler function
crawler();

The above code extracts product data from crawled paginated links and exports them to a products.csv file, as shown below:

scrapingcourse ecommerce product output csv

Congratulations 🎉! You've built a web crawler that extracts specific information using Axios and Cheerio in Node.js. 

Don't stop here! Your crawler still needs some more improvements.

Optimize Your Web Crawler

While your current JavaScript web crawler follows links and applies depth limits to scrape data from selected pages, it still requires a few optimizations to boost efficiency. You'll learn them in this section.

Avoid Duplicate Links

Although your spider checks whether a queued URL is already in urlsToVisit before crawling it, this check can miss minor formatting differences, such as trailing slashes or letter case, between otherwise identical URLs.

For instance, https://www.scrapingcourse.com and https://www.scrapingcourse.com/ point to the same web page, but the latter has a trailing slash. The current crawling logic doesn't account for this detail when filtering the queued URLs, so duplicates can slip through.

To handle duplicates effectively, you'll track each crawled link with a Set and normalize the current URLs to remove trailing slashes before they're crawled.

First, create a normalizeUrl function to convert the crawled URLs into standardized lowercase formats, remove any trailing slashes, and handle relative URLs using the base URL (the target URL):

crawler.js
// ...

const normalizeUrl = (url, baseUrl) => {
    try {
        // create a URL object and use baseUrl for relative URLs
        const normalized = new URL(url, baseUrl);

        // remove trailing slash from the pathname, if present
        if (normalized.pathname.endsWith('/')) {
            normalized.pathname = normalized.pathname.slice(0, -1);
        }

        // return the normalized URL as a lowercase string
        return normalized.href.toLowerCase();
    } catch (e) {
        console.error(`Invalid URL: ${url}`);
        return null;
    }
};

Now, create a new visitedUrls Set object to allow the crawler to track unique URLs.

crawler.js
// ...

// track visited URLs with a set
const visitedUrls = new Set();

Replace the previous crawl counter (crawledCount) with the Set's size so that only unique URLs count toward the crawl length. The new for loop logic looks like this:

crawler.js
// ...

// define a crawler function
const crawler = async () => {
    // ...
    for (; urlsToVisit.length > 0 && visitedUrls.size < maxCrawlLength;) {

        // ... other crawling logic
    }

    // ... csv export logic
};

Improve the crawler function to normalize the current URL using the normalizeUrl function and update the visitedUrls with the normalized URLs:

crawler.js
// ...

// define a crawler function
const crawler = async () => {
    // ...
    for (; urlsToVisit.length > 0 && visitedUrls.size < maxCrawlLength;) {
        // ...
        // normalize the URLs to an absolute path
        const normalizedUrl = normalizeUrl(currentUrl, targetUrl);
        if (!normalizedUrl || visitedUrls.has(normalizedUrl)) continue;

        // update the visited URLs set
        visitedUrls.add(normalizedUrl);

        // ... other crawling logic
    }
    // ... csv export logic
};

Request the normalized URL instead of the current URL. Replace the previous relative-path logic with a normalizeUrl call that returns an absolute URL (absoluteUrl). Then, update the crawling logic to queue only unique URLs and scrape product data only when the page pattern matches the normalized URL:

crawler.js
// ...
// define a crawler function
const crawler = async () => {
    // ...
    for (; urlsToVisit.length > 0 && visitedUrls.size < maxCrawlLength;) {
        // ...

        try {
            // request the target website
            const response = await axios.get(normalizedUrl);
            // ...

            // find all links on the page
            // ...
            linkElements.each((index, element) => {
                // ...
                // normalize the URLs as they're crawled
                const absoluteUrl = normalizeUrl(url, targetUrl);

                // follow links within the target website
                if (
                    absoluteUrl &&
                    absoluteUrl.startsWith(targetUrl) &&
                    !visitedUrls.has(absoluteUrl) &&
                    !urlsToVisit.includes(absoluteUrl)
                ) {
                    // update the URLs to visit
                    urlsToVisit.push(absoluteUrl);
                }
            });

            // extract product information from product pages only
            if (pagePattern.test(normalizedUrl)) {
                // ... scraping logic
            }
        } catch (error) {
            // ... error handling
        }
    }
    // ... csv export logic
};

Let's also set queue priorities appropriately. 

Prioritize Specific Pages

Prioritizing specific URLs ensures your crawler visits them before the other URLs in the crawl queue.

Since we're interested in extracting data from paginated product pages, let's prioritize and crawl them before the other pages.

First, replace the previous urlsToVisit array with high- and low-priority queues:

crawler.js
// ...

// high-priority and low-priority queues
let highPriorityQueue = [targetUrl]; // seed the crawl with the target URL
let lowPriorityQueue = [];

Adjust the loop condition to check both queues and shift the next URL from the high-priority queue first, falling back to the low-priority one. Then, push paginated links into the high-priority queue so they're crawled before the rest:

crawler.js
// ...

// define a crawler function
const crawler = async () => {
    // ...

    for (
        ;
        (highPriorityQueue.length > 0 || lowPriorityQueue.length > 0) &&
        visitedUrls.size < maxCrawlLength;

    ) {
        // check for URLs in the high-priority queue first
        let currentUrl;
        if (highPriorityQueue.length > 0) {
            currentUrl = highPriorityQueue.shift();
        } else {
            // Otherwise, get the next URL from the low-priority queue
            currentUrl = lowPriorityQueue.shift();
        }
        // ...

        try {
            // ...
            linkElements.each((index, element) => {
                // ...

                // follow links within the target website
                if (
                    // ...
                    !highPriorityQueue.includes(absoluteUrl) &&
                    !lowPriorityQueue.includes(absoluteUrl)
                ) {
                    // prioritize paginated pages
                    if (pagePattern.test(absoluteUrl)) {
                        highPriorityQueue.push(absoluteUrl);
                    } else {
                        lowPriorityQueue.push(absoluteUrl);
                    }
                }
            });

            // ...scraping logic
        } catch (error) {
            // ... error handling
        }
    }

    //  ...csv export logic
};

Next, you'll optimize the crawler to use a single session.

Maintain a Single Crawl Session

Maintaining a single session throughout the crawling process helps avoid frequent reconnections, improves performance, and can prevent server overloading. For instance, you can use sessions to persist cookies across multiple crawling requests.

You can create an Axios session (a reusable instance) using the axios.create() method:

crawler.js
// ...

// create a new Axios instance
const axiosInstance = axios.create();

// define a crawler function
const crawler = async () => {
    // ...
    for (
        ;
        (highPriorityQueue.length > 0 || lowPriorityQueue.length > 0) &&
        visitedUrls.size < maxCrawlLength;

    ) {
        // ...

        try {
            // ...
            // request the target URL with the Axios instance
            const response = await axiosInstance.get(normalizedUrl);

            // ...scraping logic
        } catch (error) {
            // ... error handling
        }
    }

    //  ...csv export logic
};
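
On its own, axios.create() gives you a reusable instance with shared configuration. If you also want to reuse the underlying TCP connections between requests, you can optionally pass keep-alive agents from Node.js's built-in http and https modules. Here's a minimal sketch of that variation:

Example
// reuse TCP connections across requests with keep-alive agents
const http = require('http');
const https = require('https');
const axios = require('axios');

const axiosInstance = axios.create({
    httpAgent: new http.Agent({ keepAlive: true }),
    httpsAgent: new https.Agent({ keepAlive: true }),
});

The final code below keeps the plain axios.create() call for simplicity.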

Combine the code snippets from each section. Here's the final code:

crawler.js
// npm install axios cheerio
const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');

const normalizeUrl = (url, baseUrl) => {
    try {
        // create a URL object and use baseUrl for relative URLs
        const normalized = new URL(url, baseUrl);

        // remove trailing slash from the pathname, if present
        if (normalized.pathname.endsWith('/')) {
            normalized.pathname = normalized.pathname.slice(0, -1);
        }

        // return the normalized URL as a lowercase string
        return normalized.href.toLowerCase();
    } catch (e) {
        console.error(`Invalid URL: ${url}`);
        return null;
    }
};

// specify the URL of the site to crawl
const targetUrl = 'https://www.scrapingcourse.com/ecommerce/';

// high-priority and low-priority queues
let highPriorityQueue = [targetUrl]; // seed the crawl with the target URL
let lowPriorityQueue = [];

// define the desired crawl limit
const maxCrawlLength = 20;

// to store scraped product data
const productData = [];

// track visited URLs with a set
const visitedUrls = new Set();

// create a new Axios instance
const axiosInstance = axios.create();

// define a crawler function
const crawler = async () => {
    // define a regex that matches the pagination pattern
    const pagePattern = /page\/\d+/i;
    for (
        ;
        (highPriorityQueue.length > 0 || lowPriorityQueue.length > 0) &&
        visitedUrls.size < maxCrawlLength;

    ) {
        // check for URLs in the high-priority queue first
        let currentUrl;
        if (highPriorityQueue.length > 0) {
            currentUrl = highPriorityQueue.shift();
        } else {
            // otherwise, get the next URL from the low-priority queue
            currentUrl = lowPriorityQueue.shift();
        }
        // normalize the URLs to an absolute path
        const normalizedUrl = normalizeUrl(currentUrl, targetUrl);
        if (!normalizedUrl || visitedUrls.has(normalizedUrl)) continue;

        // update the visited URLs set
        visitedUrls.add(normalizedUrl);

        try {
            // request the target URL with the Axios instance
            const response = await axiosInstance.get(normalizedUrl);
            // parse the website's HTML
            const $ = cheerio.load(response.data);

            // find all links on the page
            const linkElements = $('a[href]');
            linkElements.each((index, element) => {
                let url = $(element).attr('href');

                // normalize the URLs as they're crawled
                const absoluteUrl = normalizeUrl(url, targetUrl);

                // follow links within the target website
                if (
                    absoluteUrl &&
                    absoluteUrl.startsWith(targetUrl) &&
                    !visitedUrls.has(absoluteUrl) &&
                    !highPriorityQueue.includes(absoluteUrl) &&
                    !lowPriorityQueue.includes(absoluteUrl)
                ) {
                    // prioritize paginated pages                   
                    if (pagePattern.test(absoluteUrl)) {
                        highPriorityQueue.push(absoluteUrl);
                    } else {
                        lowPriorityQueue.push(absoluteUrl);
                    }
                }
            });

            // extract product information from product pages only
            if (pagePattern.test(normalizedUrl)) {
                // retrieve all product containers
                const productContainers = $('.product');

                // iterate through the product containers to extract data
                productContainers.each((index, product) => {
                    const data = {};

                    data.url =
                        $(product)
                            .find('.woocommerce-LoopProduct-link')
                            .attr('href') || 'N/A';
                    data.image =
                        $(product).find('.product-image').attr('src') || 'N/A';
                    data.name =
                        $(product).find('.product-name').text().trim() || 'N/A';
                    data.price =
                        $(product).find('.price').text().trim() || 'N/A';

                    // append the scraped data to the empty array
                    productData.push(data);
                });
            }
        } catch (error) {
            console.error(`Error fetching ${currentUrl}: ${error.message}`);
        }
    }

    // write productData to a CSV file
    const header = 'Url,Image,Name,Price\n';
    const csvRows = productData
        .map((item) => `${item.url},${item.image},${item.name},${item.price}`)
        .join('\n');
    const csvData = header + csvRows;

    fs.writeFileSync('products.csv', csvData);
    console.log('CSV file has been successfully created!');
};

// execute the crawler function
crawler();

You've just built a robust web crawler with JavaScript and Node.js. Great job! You'll still need to deal with edge cases like anti-bots, JavaScript rendering, and more.

Avoid Getting Blocked While Crawling With JavaScript

Web crawling can trigger anti-bot measures since you're visiting many pages in quick succession. While most websites allow specific crawlers, such as search engine bots, they're likely to block a custom-made one.

You can reduce the chances of anti-bot detection by using proxies with Axios, optimizing your request frequency, or setting a custom Axios User-Agent.
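
For example, here's a minimal sketch of a custom User-Agent header plus a small delay between requests on an Axios instance (the User-Agent string and the one-second delay are arbitrary example values):

Example
const axios = require('axios');

// send a browser-like User-Agent with every request
const axiosInstance = axios.create({
    headers: {
        'User-Agent':
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    },
});

// simple helper to pause between requests and reduce the request frequency
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// usage inside the crawl loop:
// const response = await axiosInstance.get(normalizedUrl);
// await delay(1000); // wait about a second before the next request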

However, while these methods work in some cases, they won't prevent detection at scale, especially if dealing with advanced anti-bot measures. 

The easiest way to bypass any anti-bot measure when crawling at scale is via a web scraping API such as ZenRows. The ZenRows Scraper API provides all the toolsets required for efficient scraping, including premium proxy rotation, request header management, cookie support for session persistence, advanced fingerprint spoofing, JavaScript execution, anti-bot auto-bypass, and more.

Let's see how it works by scraping the full-page HTML of this anti-bot challenge page.

Sign up to open the ZenRows Request Builder. Paste the target URL in the link box, and activate Premium Proxies and JS Rendering.

Choose Node.js as your programming language and select the API connection mode. Copy and paste the generated code into your crawler file:

building a scraper with zenrows

Here's the generated code:

Example
// npm install axios
const axios = require('axios');

const url = 'https://www.scrapingcourse.com/antibot-challenge';
const apikey = '<YOUR_ZENROWS_API_KEY>';
axios({
    url: 'https://api.zenrows.com/v1/',
    method: 'GET',
    params: {
        url: url,
        apikey: apikey,
        js_render: 'true',
        premium_proxy: 'true',
    },
})
    .then((response) => console.log(response.data))
    .catch((error) => console.log(error));

The code outputs the protected website's full-page HTML as shown:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>

Way to go! 🎉 You bypassed an anti-bot detection measure with the ZenRows Scraper API.

Web Crawling Tools for JavaScript

Although you've crawled a website with Axios and Cheerio, there are more web crawling tools for Node.js.

1. ZenRows

ZenRows is a top web scraping and crawling solution, especially if you want to bypass anti-bot detection at scale. It's lightweight and has advanced features, including premium proxy rotation, geo-targeting, anti-bot auto-bypass, fingerprinting evasions, request header optimization, and more. ZenRows integrates perfectly with browser automation libraries like Playwright and Puppeteer, allowing you to crawl JavaScript websites.

2. Node Crawler

Node Crawler, also called Crawler, is an open-source library for creating web spiders in Node.js. Its key features include priority queueing, request retry, request frequency optimization, concurrency control, and more. Node Crawler uses Cheerio as its HTML parser under the hood and is easy to use, especially for those familiar with jQuery.

3. Crawlee

Crawlee is another open-source web crawling tool available in Node.js and Python. It supports advanced features such as priority queueing, concurrency control, proxy management, and more. Crawlee runs on Cheerio and JSDOM for HTML parsing under the hood. However, you can integrate it with Playwright and Puppeteer for headless browser support to crawl JavaScript sites.
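
For instance, here's a minimal sketch based on Crawlee's CheerioCrawler quick-start pattern (assuming you've run npm install crawlee); it crawls the e-commerce page, logs each page title, and enqueues same-site links automatically:

Example
// npm install crawlee
const { CheerioCrawler } = require('crawlee');

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 20, // stop after 20 pages
    async requestHandler({ request, $, enqueueLinks, log }) {
        log.info(`Crawling ${request.url}: ${$('title').text()}`);
        // queue the links found on the page (same hostname by default)
        await enqueueLinks();
    },
});

(async () => {
    await crawler.run(['https://www.scrapingcourse.com/ecommerce/']);
})();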

JavaScript Crawling Best Practices and Considerations

Congratulations on building and optimizing your first JavaScript web crawler! Keep in mind that websites often use WAF services like Akamai and PerimeterX to protect against automated access; check out our guides on how to bypass Akamai and how to bypass PerimeterX. Now, let's dive into the best practices and scenarios that will make your web spider more efficient and sharpen your skills.

Parallel Crawling and Concurrency

The current crawler executes the process sequentially. You can introduce concurrency to run the crawling job in parallel and improve overall performance.

To modify your spider with concurrency, introduce a crawlWithConcurrency function that manages parallel crawling by tracking active crawl queue promises in a Set (activePromises). It launches new crawls until the concurrency limit is reached and waits for an active promise to settle before starting the next one.

Here's the modified crawler with concurrency:

Example
// npm install axios cheerio
const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');

const normalizeUrl = (url, baseUrl) => {
    try {
        // create a URL object and use baseUrl for relative URLs
        const normalized = new URL(url, baseUrl);

        // remove trailing slash from the pathname, if present
        if (normalized.pathname.endsWith('/')) {
            normalized.pathname = normalized.pathname.slice(0, -1);
        }

        // return the normalized url as a lowercase string
        return normalized.href.toLowerCase();
    } catch (e) {
        console.error(`invalid url: ${url}`);
        return null;
    }
};

// specify the url of the site to crawl
const targetUrl = 'https://www.scrapingcourse.com/ecommerce/';

// high-priority and low-priority queues
let highPriorityQueue = [targetUrl]; // seed the crawl with the target URL
let lowPriorityQueue = [];

// define the desired crawl limit
const maxCrawlLength = 20;

// to store scraped product data
const productData = [];

// track visited URLs with a set
const visitedUrls = new Set();

// create a new axios instance
const axiosInstance = axios.create();

// set the number of concurrency
const maxConcurrency = 5;

// define a crawler function
const crawler = async () => {
    // define a regex that matches the pagination pattern
    const pagePattern = /page\/\d+/i;

    // helper function to crawl the next url
    const crawlNext = async () => {
        // stop crawling if queues are empty or crawl limit is reached
        if (
            (highPriorityQueue.length === 0 && lowPriorityQueue.length === 0) ||
            visitedUrls.size >= maxCrawlLength
        )
            return;

        // check for URLs in high-priority queue first
        let currentUrl;
        if (highPriorityQueue.length > 0) {
            currentUrl = highPriorityQueue.shift();
        } else {
            // otherwise, get the next url from the low-priority queue
            currentUrl = lowPriorityQueue.shift();
        }

        // normalize the URLs to an absolute path
        const normalizedUrl = normalizeUrl(currentUrl, targetUrl);
        if (!normalizedUrl || visitedUrls.has(normalizedUrl)) return;

        // update the visited URLs set
        visitedUrls.add(normalizedUrl);

        try {
            // request the target URL with the Axios instance
            const response = await axiosInstance.get(normalizedUrl);
            // parse the website's html
            const $ = cheerio.load(response.data);

            // find all links on the page
            const linkElements = $('a[href]');
            linkElements.each((index, element) => {
                let url = $(element).attr('href');

                // normalize the URLs as they're crawled
                const absoluteUrl = normalizeUrl(url, targetUrl);

                // follow links within the target website
                if (
                    absoluteUrl &&
                    absoluteUrl.startsWith(targetUrl) &&
                    !visitedUrls.has(absoluteUrl) &&
                    !highPriorityQueue.includes(absoluteUrl) &&
                    !lowPriorityQueue.includes(absoluteUrl)
                ) {
                    // prioritize paginated pages
                    if (pagePattern.test(absoluteUrl)) {
                        highPriorityQueue.push(absoluteUrl);
                    } else {
                        lowPriorityQueue.push(absoluteUrl);
                    }
                }
            });

            // extract product information from product pages only
            if (pagePattern.test(normalizedUrl)) {
                // retrieve all product containers
                const productContainers = $('.product');

                // iterate through the product containers to extract data
                productContainers.each((index, product) => {
                    const data = {};
                    data.url =
                        $(product)
                            .find('.woocommerce-LoopProduct-link')
                            .attr('href') || 'N/A';
                    data.image =
                        $(product).find('.product-image').attr('src') || 'N/A';
                    data.name =
                        $(product).find('.product-name').text().trim() || 'N/A';
                    data.price =
                        $(product).find('.price').text().trim() || 'N/A';

                    // append the scraped data to the empty array
                    productData.push(data);
                });
            }
        } catch (error) {
            console.error(`error fetching ${currentUrl}: ${error.message}`);
        }
    };

    // manage concurrency by tracking active crawl promises
    const crawlWithConcurrency = async () => {
        const activePromises = new Set();

        // continue crawling as long as there are URLs and crawl limit is not reached
        for (
            ;
            (highPriorityQueue.length > 0 || lowPriorityQueue.length > 0) &&
            visitedUrls.size < maxCrawlLength;

        ) {
            // check if active promises are below max concurrency limit
            if (activePromises.size < maxConcurrency) {
                const crawlPromise = crawlNext().finally(() =>
                    activePromises.delete(crawlPromise)
                );
                activePromises.add(crawlPromise);
            }
            // wait for any of the active promises to resolve
            await Promise.race(activePromises);
        }
        // ensure all ongoing crawls are finished
        await Promise.allSettled(activePromises);
    };

    await crawlWithConcurrency();

    // write productData to a CSV file
    const header = 'Url,Image,Name,Price\n';
    const csvRows = productData
        .map((item) => `${item.url},${item.image},${item.name},${item.price}`)
        .join('\n');
    const csvData = header + csvRows;

    fs.writeFileSync('products.csv', csvData);
    console.log('csv file has been successfully created!');
};

// execute the crawler function
crawler();

Kudos! Your Node.js web spider just got a boost with parallel crawling.

Crawling JavaScript Rendered Pages in Node.js

Your scraper currently uses an HTTP client (Axios) and is only suitable for crawling static websites. It won't work with websites that render content dynamically using JavaScript. 

The best way to crawl dynamic websites is to use headless browser automation libraries, such as Selenium, Playwright, or Puppeteer. These tools support JavaScript execution, allowing you to simulate user interactions, such as clicking, scrolling, hovering, taking screenshots, typing, and more.

Let's quickly see how to set up a basic Puppeteer script to access the JavaScript Rendering challenge page.

First, install Puppeteer:

Terminal
npm install puppeteer

Import the library, navigate to the dynamic website, and use it to grab a screenshot of the loaded page:

Example
// npm install puppeteer
const puppeteer = require('puppeteer');

(async () => {
    // launch a new browser instance
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // navigate to the target URL
    await page.goto('https://www.scrapingcourse.com/javascript-rendering', {
        waitUntil: 'networkidle2', // wait for network to be idle before taking a screenshot
    });

    // take a screenshot and save it
    await page.screenshot({ path: 'screenshot.png', fullPage: true });

    // close the browser
    await browser.close();
})();

That's a basic Puppeteer scraper. Read our complete guide on web scraping with Puppeteer in Node.js to learn more.

Distributed Web Crawling in JavaScript

Distributed web crawling enhances efficiency significantly by spreading your crawl process over multiple nodes or machines. This technique is handy when crawling continuously on a large scale, as it prevents the web spider from overloading a single machine.

In addition to optimized performance, distributed web crawling makes your crawling process more fault-tolerant, ensuring your web spider keeps running even when one node fails.
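
A common way to distribute the work is to back the crawl queue and the visited set with shared storage that every worker process or machine can reach. The sketch below assumes a locally running Redis server and the ioredis package (both are assumptions for illustration, not part of the earlier code): each worker pops URLs from a shared list, records visited URLs in a shared set, and pushes newly discovered links back for any worker to pick up:

Example
// npm install axios cheerio ioredis
const Redis = require('ioredis');
const axios = require('axios');
const cheerio = require('cheerio');

// assumes a Redis server is reachable on localhost:6379
const redis = new Redis();
const targetUrl = 'https://www.scrapingcourse.com/ecommerce/';

const worker = async () => {
    // seed the shared queue (duplicates are filtered by the visited set)
    await redis.rpush('crawl:queue', targetUrl);

    for (;;) {
        // pop the next URL from the shared queue
        const currentUrl = await redis.lpop('crawl:queue');
        if (!currentUrl) break; // queue is empty, so this worker is done

        // SADD returns 0 if another worker already visited this URL
        const isNew = await redis.sadd('crawl:visited', currentUrl);
        if (!isNew) continue;

        try {
            const { data } = await axios.get(currentUrl);
            const $ = cheerio.load(data);

            // push newly discovered same-site links onto the shared queue
            for (const element of $('a[href]').toArray()) {
                const link = new URL($(element).attr('href'), currentUrl).href;
                if (link.startsWith(targetUrl)) {
                    await redis.rpush('crawl:queue', link);
                }
            }
        } catch (error) {
            console.error(`Error fetching ${currentUrl}: ${error.message}`);
        }
    }

    await redis.quit();
};

worker();

You can then run several copies of this worker, locally or on separate machines, against the same Redis instance to spread the crawl load.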

Conclusion

In this tutorial, you've learned to build a Node.js web crawler, including optimization strategies and best practices required for efficient crawling. Here's a recap of what you've covered:

  • Follow all the links on a website.
  • Scrape data from selected product URLs.
  • Export the extracted data to a CSV.
  • Avoid crawling duplicate links.
  • Implement queue prioritization to prioritize specific pages.
  • Maintain a single crawl session.

Remember, no matter how sophisticated your JavaScript web crawler is, it's still prone to anti-bot detection since it's visiting multiple pages simultaneously. We recommend fortifying your spider with ZenRows to crawl any website at scale without limitations.

Try ZenRows for free now without a credit card!
