JavaScript Web Crawler with Node.js: A Step-By-Step Tutorial

August 29, 2022 · 9 min read

Web scrapers and search engines rely on web crawling to extract information from the web. As a result, web crawlers have become increasingly popular. 

Building a web spider with the right libraries in Node.js is easy. Here you'll learn how to build a JavaScript web crawler with the most popular web crawling libraries.

In this tutorial, you'll understand the basics of JavaScript crawling. In addition, you'll see why JavaScript is a good language when it comes to building a web spider. You'll also see some of the best practices for web crawling.

Follow this tutorial and become an expert in web crawling with JavaScript! Let's waste no more time and build our first crawler in Node.js.

What's Web Crawling?

A web crawler, also known as a web spider, is a tool that systematically goes through one or more websites to gather information. Specifically, a web crawler starts from a list of known URLs. While crawling these web pages, the web spider tool discovers other URLs.

Then, the web spider analyzes these new URLs, and the URL discovery process continues. So, the web crawling process can be endless. Also, one webpage associated with a URL might be more important than another. Thus, web spiders generally assign each URL a priority.
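
To make the idea of URL priorities more concrete, here's a minimal sketch of a prioritized URL list, often called a frontier. The URLFrontier class, its method names, and the numeric priorities are assumptions made for illustration. They aren't part of any library used later in this tutorial.

```javascript
// A tiny URL frontier: URLs with a higher priority are returned first.
class URLFrontier {
	constructor() {
		this.items = []; // entries of the form { url, priority }
		this.seen = new Set(); // every URL ever added, to skip duplicates
	}

	// adds a URL unless it was already discovered
	add(url, priority = 0) {
		if (this.seen.has(url)) return false;
		this.seen.add(url);
		this.items.push({ url, priority });
		return true;
	}

	// removes and returns the URL with the highest priority
	next() {
		if (this.items.length === 0) return undefined;
		let best = 0;
		for (let i = 1; i < this.items.length; i++) {
			if (this.items[i].priority > this.items[best].priority) best = i;
		}
		return this.items.splice(best, 1)[0].url;
	}
}

const frontier = new URLFrontier();
frontier.add("https://example.com/", 1);
frontier.add("https://example.com/products", 5);
frontier.add("https://example.com/", 3); // ignored: already discovered
console.log(frontier.next()); // https://example.com/products
```

A linear scan is enough for a small crawl. For a large one, a heap-based priority queue would likely be a better fit.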

Simply put, a web crawler's goal is to discover URLs while reviewing and ranking web pages. Generally, search engines use web spiders to crawl the Web. Similarly, web scrapers use web crawling logic to find the web pages to extract data from.

Is JavaScript Good for Web Crawling?

Using JavaScript on the front end, you can only crawl web pages within the same origin. That's because you would download web pages via AJAX. But the Same-Origin Policy applied by browsers narrows the scope of AJAX.

Let's better understand this with an example. Let's assume your JavaScript web spider runs on a web page from example.com. So, that script could crawl only web pages on the example.com domain.
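
Strictly speaking, browsers compare origins, meaning the combination of scheme, host, and port, rather than just domains. The sketch below shows that comparison with the standard URL class. The isSameOrigin helper is a hypothetical name used only for this example.

```javascript
// Two URLs share an origin only if scheme, host, and port all match.
function isSameOrigin(urlA, urlB) {
	return new URL(urlA).origin === new URL(urlB).origin;
}

console.log(isSameOrigin("https://example.com/a", "https://example.com/b")); // true
console.log(isSameOrigin("https://example.com/", "https://blog.example.com/")); // false
console.log(isSameOrigin("https://example.com/", "http://example.com/")); // false
```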

That doesn't mean that JavaScript isn't good for web crawling. Quite the opposite. Thanks to Node.js, you can run JavaScript on servers and avoid all the problems mentioned before.

So, Node.js allows you to build a web spider that takes advantage of all JavaScript benefits. Specifically, JavaScript is easy to write, has built-in support for asynchronous code, and is backed by thousands of libraries.

Let's now look at some best practices on how to use Node.js when it comes to web crawling.

Best Practices for Web Crawling in Node.js

Let's dig into five best practices for building a JavaScript web crawler in Node.js.

Use Your Web Spider to Retrieve All URLs

You should consider retrieving a web page's entire list of links during crawling. After all, URLs are what allow a web spider to continue crawling a site.

Only some may interest the user, but storing them all makes future iterations easier.

Also, consider using the site sitemap to find the list of all indexed URLs.
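
As a rough illustration, a sitemap is an XML file that lists page URLs inside <loc> elements. The sketch below pulls them out of a made-up sitemap string with a regular expression. The extractSitemapURLs helper and the sample content are assumptions, and a real XML parser would be more robust, for example when URLs contain escaped characters.

```javascript
// Extracts the URLs listed in the <loc> elements of a sitemap XML string.
function extractSitemapURLs(sitemapXML) {
	const urls = [];
	const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
	for (const match of sitemapXML.matchAll(locPattern)) {
		urls.push(match[1]);
	}
	return urls;
}

// made-up sitemap content, for illustration only
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<url><loc>https://example.com/</loc></url>
	<url><loc>https://example.com/page/2/</loc></url>
</urlset>`;

console.log(extractSitemapURLs(sampleSitemap));
```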

Perform Crawling in Parallel

Analyzing each webpage one at a time isn't an efficient process. Web crawling a website sequentially takes a lot of time.

Luckily, you can tweak your web spider to run in parallel to speed up the process. JavaScript's async features let you keep several requests in flight at once. But keep in mind that coordinating them requires extra logic.
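
One possible shape for that extra logic is a small pool that keeps only a fixed number of downloads running at the same time. This is a minimal sketch under stated assumptions: mapWithConcurrency and fakeDownload are hypothetical names, and the fake download stands in for a real HTTP request so the example runs without network access.

```javascript
// Runs an async worker over all items, with at most `limit` running at once.
async function mapWithConcurrency(items, limit, worker) {
	const results = new Array(items.length);
	let nextIndex = 0;

	// each runner repeatedly picks the next unprocessed item
	async function runner() {
		while (nextIndex < items.length) {
			const index = nextIndex++;
			results[index] = await worker(items[index]);
		}
	}

	const runnerCount = Math.min(limit, items.length);
	await Promise.all(Array.from({ length: runnerCount }, () => runner()));
	return results;
}

// stand-in for a real page download (hypothetical, no network involved)
const fakeDownload = (url) =>
	new Promise((resolve) => setTimeout(() => resolve(`HTML of ${url}`), 10));

mapWithConcurrency(["/page/1/", "/page/2/", "/page/3/"], 2, fakeDownload).then(
	(pages) => console.log(pages)
);
```

Capping the number of simultaneous requests also helps you avoid overloading the target site.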

Make Your Web Spider Behave Like a Human Being

Your JavaScript web crawler should appear to the website as a human being, not as a bot. You can achieve this by setting the proper HTTP headers and timeouts.

The goal is to have your Node.js web spider visit web pages as a real user would.
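
Here's a hedged sketch of what that could look like in practice: browser-like headers plus a randomized pause between requests. The header values, the helper names, and the delay range are illustrative assumptions, not recommended settings. The resulting object uses the headers and timeout options that axios accepts in its request config.

```javascript
// Browser-like request headers. The values below are illustrative assumptions.
function buildBrowserHeaders() {
	return {
		"User-Agent":
			"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
		Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
		"Accept-Language": "en-US,en;q=0.9",
	};
}

// Returns a random whole number of milliseconds between minMs and maxMs.
function randomDelayMs(minMs, maxMs) {
	return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

// Resolves after the given number of milliseconds.
function sleep(ms) {
	return new Promise((resolve) => setTimeout(resolve, ms));
}

// options that could be passed as the second argument of axios.get()
const requestOptions = { headers: buildBrowserHeaders(), timeout: 10000 };
console.log(requestOptions.headers["User-Agent"]);

// between two requests, you could call: await sleep(randomDelayMs(1000, 3000))
console.log(`Next request in ${randomDelayMs(1000, 3000)} ms`);
```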

Keep Your Web Crawling Logic Simple

The layout of the website that your web spider targets can change a lot over time. For this reason, you shouldn't over-engineer your Node.js web crawler.

Keep it simple so that you can easily adapt it to a new layout or site.

Keep Your Web Crawler Running

Web crawling performed on Node.js is unlikely to consume a lot of system resources. 

Thus, you should consider keeping your web spider running forever, especially when targeting large websites that are constantly adding new pages.

Before Getting Started

Here's the list of what you need for the simple scraper to work:

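  • Node.js and npm (version 8.0 or later). If you don't have Node.js installed on your system, you can download it from the official Node.js website.

Then, you also need the two following npm libraries:

  • axios
  • cheerio

You can add them to your project's dependencies with the following command:

Terminal
npm install axios cheerio

Alternatively, you can clone the GitHub repository that supports this tutorial and install its dependencies:

Terminal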
git clone https://github.com/Tonel/web-crawler-nodejs 
cd web-crawler-nodejs 
npm install

Follow this tutorial and learn how to build a Node.js web crawler app in JavaScript!

First, you need to set up a Node.js project. If you haven't cloned the repo above, create a web-crawler-nodejs folder and enter it with the command below.

Terminal
mkdir web-crawler-nodejs 
cd web-crawler-nodejs

Now, initialize an npm application with:

Terminal
npm init

Follow the prompts. You should now have a package.json file in your web-crawler-nodejs folder.

You're now ready to add axios and cheerio to your project's dependencies.

Terminal
npm install axios cheerio

Create a server.js file and initialize it as follows:

server.js
function main() { 
	console.log("Hello, World!") 
} 
 
main()

Type the following command to launch the server.js Node.js script:

Terminal
node server.js

You should now see in your terminal the following message:

Output
Hello, World!

Et voilà! You've just initialized a Node.js application. It's now time to define the web crawling logic. Place it inside the main() function.

Here, you'll see how to perform web crawling on https://www.scrapingcourse.com/ecommerce/. That's what the shop looks like:

ScrapingCourse.com Ecommerce homepage

The site contains a paginated list of different clothes and accessories.

Let's build a simple web crawler in Node.js to retrieve all product URLs.

First, you need to download the HTML content from a web page. You can do it with axios as follows:

server.js
const pageHTML = await axios.get("https://www.scrapingcourse.com/ecommerce/")

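Axios allows you to perform an HTTP request and retrieve a web page in just a few lines of code. Keep in mind that await works only inside an async function, so declare main() as async. pageHTML now stores the axios response, whose data field contains the HTML content of the scrapingcourse.com/ecommerce/ webpage.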

It's time to feed it to cheerio with load() as follows:

server.js
const cheerio = require("cheerio"); 
 
// initializing cheerio 
const $ = cheerio.load(pageHTML.data)

Congratulations! You can now start performing web crawling!

Let's now extract the list of all pagination links needed to crawl the entire site. Right-click on the HTML element containing the pagination number and select the "Inspect" option.

scrapingcourse ecommerce homepage inspect

Your browser should open the following DevTools window with the DOM element highlighted:

scrapingcourse ecommerce homepage devtools

Here, you can see that .page-numbers identifies the pagination HTML elements in the DOM. You can now use the .page-numbers a CSS selector to retrieve all the pagination URLs with cheerio.

server.js
// retrieving the pagination URLs 
$(".page-numbers a").each((index, element) => { 
	const paginationURL = $(element).attr("href") 
})

As you can see, cheerio works just like jQuery. paginationURL would contain a URL as below:

Output
https://www.scrapingcourse.com/ecommerce/page/2/

Now, let's retrieve the URL associated with a single product. Right-click on the HTML element of a product. Then, open the DevTools window with the "Inspect" option. This is what you should get:

scrapingcourse ecommerce homepage inspect first product li

You can retrieve the product URL with the li.product a.woocommerce-LoopProduct-link CSS selector as below:

server.js
// retrieving the product URLs 
$("li.product a.woocommerce-LoopProduct-link").each((index, element) => { 
	const productURL = $(element).attr("href") 
})

Now, you only have to implement the JavaScript crawling logic to iterate over each page:

server.js
const axios = require("axios"); 
const cheerio = require("cheerio"); 
 
async function main(maxPages = 50) { 
	// initialized with the first webpage to visit 
	const paginationURLsToVisit = ["https://www.scrapingcourse.com/ecommerce/"]; 
	const visitedURLs = []; 
 
	const productURLs = new Set(); 
 
	// iterating until the queue is empty 
	// or the iteration limit is hit 
	while ( 
		paginationURLsToVisit.length !== 0 && 
		visitedURLs.length < maxPages 
	) { 
		// the current webpage to crawl 
		const paginationURL = paginationURLsToVisit.pop(); 
 
		// retrieving the HTML content from paginationURL 
		const pageHTML = await axios.get(paginationURL); 
 
		// adding the current webpage to the 
		// web pages already crawled 
		visitedURLs.push(paginationURL); 
 
		// initializing cheerio on the current webpage 
		const $ = cheerio.load(pageHTML.data); 
 
		// retrieving the pagination URLs 
		$(".page-numbers a").each((index, element) => { 
			const paginationURL = $(element).attr("href"); 
 
			// adding the pagination URL to the queue 
			// of web pages to crawl, if it wasn't yet crawled 
			if ( 
				!visitedURLs.includes(paginationURL) && 
				!paginationURLsToVisit.includes(paginationURL) 
			) { 
				paginationURLsToVisit.push(paginationURL); 
			} 
		}); 
 
		// retrieving the product URLs 
		$("li.product a.woocommerce-LoopProduct-link").each((index, element) => { 
			const productURL = $(element).attr("href"); 
			productURLs.add(productURL); 
		}); 
	} 
 
	// logging the crawling results 
	console.log([...productURLs]); 
 
	// use productURLs for scraping purposes... 
} 
 
// running the main() function 
main() 
	.then(() => { 
		// successful ending 
		process.exit(0); 
	}) 
	.catch((e) => { 
		// logging the error message 
		console.error(e); 
 
		// unsuccessful ending 
		process.exit(1); 
	});

paginationURLsToVisit and visitedURLs ensure that the web spider won't visit the same page more than once. Notice that productURLs is a Set, so it can't store the same product URL twice.

Also, keep in mind that the crawling logic may be infinite. So, you should provide a way to limit the number of web pages the web spider can visit. In the script above, the maxPages parameter sets this limit.
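
To see the page limit in isolation, here's a minimal sketch of the same loop running on a made-up, in-memory site instead of real web pages. The crawl function, the getLinks callback, and fakeSite are hypothetical names introduced for this example. Unlike the script above, this version takes URLs from the front of the list, so pages are visited in the order they were discovered.

```javascript
// Crawls a link graph breadth-first, visiting at most maxPages pages.
// getLinks is any async function returning the URLs found on a page.
async function crawl(startURL, getLinks, maxPages = 50) {
	const toVisit = [startURL];
	const visited = new Set();

	while (toVisit.length !== 0 && visited.size < maxPages) {
		const url = toVisit.shift(); // take the oldest discovered URL
		visited.add(url);

		for (const link of await getLinks(url)) {
			// queue only the URLs that are neither visited nor already queued
			if (!visited.has(link) && !toVisit.includes(link)) {
				toVisit.push(link);
			}
		}
	}

	return [...visited];
}

// made-up site whose pages link to each other in a cycle
const fakeSite = {
	"/page/1/": ["/page/2/", "/page/3/"],
	"/page/2/": ["/page/1/", "/page/3/"],
	"/page/3/": ["/page/1/"],
};
const getFakeLinks = async (url) => fakeSite[url] ?? [];

// with a limit of 2, only "/page/1/" and "/page/2/" are visited
crawl("/page/1/", getFakeLinks, 2).then((pages) => console.log(pages));
```

Even though the fake pages link to each other in a cycle, the loop ends: either the list of pages to visit runs out or the limit is reached.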

Et voilà! You just learned how to build a simple web crawler in Node.js and JavaScript!

At this point, you should save the crawled URLs to a database. That will allow you to scrape the crawled web pages directly without having to crawl them again. In addition, you should schedule the JavaScript crawling task to run periodically.

That is because the site's owners are likely to add, remove, or change products in the future. These are just a few ideas, and this tutorial stops here. What matters is to understand that crawling is typically only the first step in a larger process.
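
As a simple starting point, you could persist the crawled URLs to a JSON file before moving to a real database. The sketch below is only a stand-in for database storage: the loadURLs and saveURLs helpers and the file location are assumptions made for illustration.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Loads the previously saved URLs, or an empty list if nothing was saved yet.
function loadURLs(filePath) {
	return fs.existsSync(filePath)
		? JSON.parse(fs.readFileSync(filePath, "utf8"))
		: [];
}

// Merges the new URLs with the saved ones, without duplicates, and writes
// the result back to the file.
function saveURLs(filePath, urls) {
	const merged = [...new Set([...loadURLs(filePath), ...urls])];
	fs.writeFileSync(filePath, JSON.stringify(merged, null, 2));
	return merged;
}

// hypothetical file location, used here only for the demo
const demoFile = path.join(os.tmpdir(), "crawled-urls-demo.json");
saveURLs(demoFile, [
	"https://www.scrapingcourse.com/ecommerce/product/aeon-capri/",
]);
console.log(loadURLs(demoFile));
```

Because saveURLs() merges with what is already stored, running the crawler periodically only adds URLs that weren't saved before.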

Run your JavaScript web crawler, and you'll obtain the following data:

Output
[ 
	"https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/", 
	"https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/", 
	"https://www.scrapingcourse.com/ecommerce/product/aeon-capri/", 
 
	// ... 
	 
	"https://www.scrapingcourse.com/ecommerce/product/zing-jump-rope/", 
	"https://www.scrapingcourse.com/ecommerce/product/zoe-tank/", 
	"https://www.scrapingcourse.com/ecommerce/product/zoltan-gym-tee/" 
]

Congratulations! You just crawled scrapingcourse.com/ecommerce/ entirely!

However, sequentially crawling a website may not be the best solution for performance. That's why you should make your JavaScript web crawler work in parallel. For that, follow this tutorial on how to crawl in parallel with Node.js.

Also, don't forget that crawling in parallel means performing many HTTP requests in a short time. Your target site may identify your web spider as a threat and block it. To avoid this, you should configure Axios to use web proxies to rotate your IP.
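
Here's a minimal sketch of how IP rotation could be wired up. It cycles through a proxy list in round-robin order and builds a request options object that uses the proxy setting of the axios request config. The proxy addresses are placeholders from a range reserved for documentation, and createProxyRotator and buildRequestOptions are hypothetical helper names.

```javascript
// hypothetical proxy list: replace it with proxies you actually have access to
const proxies = [
	{ protocol: "http", host: "203.0.113.10", port: 8080 },
	{ protocol: "http", host: "203.0.113.11", port: 8080 },
];

// Returns a function that yields the next proxy in round-robin order.
function createProxyRotator(proxyList) {
	let index = 0;
	return () => {
		const proxy = proxyList[index];
		index = (index + 1) % proxyList.length;
		return proxy;
	};
}

const nextProxy = createProxyRotator(proxies);

// Builds the options for one request: pass them as axios.get(url, options).
function buildRequestOptions() {
	return { proxy: nextProxy(), timeout: 10000 };
}

console.log(buildRequestOptions().proxy.host); // 203.0.113.10
console.log(buildRequestOptions().proxy.host); // 203.0.113.11
console.log(buildRequestOptions().proxy.host); // 203.0.113.10
```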

ZenRows offers premium proxies. Find out more about how you can use them to avoid blocks.

Conclusion

In this guide, you learned the essentials of building a JavaScript web crawler. Specifically, you saw how to create a web spider in Node.js that crawls all URLs from a website.

All you need are the right libraries, and we discussed some of the most popular ones. Implementing a JavaScript web crawler with them isn't difficult and takes only a few lines of code.

To wrap up, here are the three main points you learned:

  • What web crawling is.
  • Tips on how to perform web crawling in JavaScript.
  • How to apply the basics of web crawling to build a web spider in Node.js from scratch.

If you liked this, take a look at the Python web crawling guide.

Thanks for reading! We hope that you found this guide helpful. You can sign up for free, try ZenRows, and let us know any questions, comments, or suggestions.
