JavaScript Web Crawler with Node.js: A Step-By-Step Tutorial

Web scrapers and search engines rely on web crawling to extract information from the web. As a result, web crawlers have become increasingly popular. Building a web spider with the right libraries in Node.js is easy. Here you'll learn how to build a JavaScript web crawler with the most popular web crawling libraries.

In this tutorial, you'll understand the basics of JavaScript crawling. In addition, you'll see why JavaScript is a good language when it comes to building a web spider. You'll also see some of the best practices for web crawling.

Follow this tutorial and become an expert in web crawling with JavaScript! Let's waste no more time and build our first crawler in Node.js.

What's Web Crawling?

A web crawler, also known as a web spider, is a tool that systematically goes through one or more websites to gather information. Specifically, a web crawler starts from a list of known URLs. While crawling these web pages, the web spider tool discovers other URLs.

Then, the web spider analyzes these new URLs, and the URL discovery process continues. So, the web crawling process can be endless. Also, one webpage associated with a URL might be more important than another. Thus, web spiders generally assign each URL a priority.

Simply put, a web crawler's goal is to discover URLs while reviewing and ranking web pages. Generally, search engines use web spiders to crawl the Web. Similarly, web scrapers use web crawling logic to find the web pages to extract data from.
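The priority idea above can be sketched in a few lines of Node.js. This toy frontier (the `CrawlFrontier` name and its priority numbers are purely illustrative) always dequeues the highest-priority URL first and skips URLs it has already seen:

```javascript
// A toy crawl frontier: URLs are stored with a priority and
// always dequeued highest-priority first, skipping duplicates.
class CrawlFrontier {
  constructor() {
    this.queue = [];
    this.seen = new Set();
  }

  // add a URL with a priority, ignoring URLs seen before
  enqueue(url, priority = 0) {
    if (this.seen.has(url)) return;
    this.seen.add(url);
    this.queue.push({ url, priority });
    // keep the highest-priority URL at the front
    this.queue.sort((a, b) => b.priority - a.priority);
  }

  // get the next URL to crawl, or undefined when the frontier is empty
  dequeue() {
    const entry = this.queue.shift();
    return entry && entry.url;
  }
}

const frontier = new CrawlFrontier();
frontier.enqueue("https://example.com/", 1);
frontier.enqueue("https://example.com/about", 0);
frontier.enqueue("https://example.com/", 5); // duplicate, ignored
console.log(frontier.dequeue()); // "https://example.com/"
```

Real crawlers use smarter ranking signals (link depth, freshness, PageRank-style scores), but the queue-plus-seen-set shape stays the same.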

Is JavaScript Good for Web Crawling?

Using JavaScript on the frontend, you can only crawl web pages within the same origin. That's because you would download web pages via AJAX. But the Same-Origin Policy applied by browsers narrows the scope of AJAX.

Let's better understand this with an example. Let's assume your JavaScript web spider runs on a web page from example.com. So, that script could crawl only web pages on the example.com domain.

This doesn't mean that JavaScript isn't good for web crawling. Quite the opposite. Thanks to Node.js, you can run JavaScript on servers and avoid all the problems mentioned before.

So, Node.js allows you to build a web spider that takes advantage of all JavaScript benefits. In detail, JavaScript is an easy-to-write, natively asynchronous language supported by thousands of libraries.

Let's now look at some best practices on how to use Node.js when it comes to web crawling.

Best Practices for Web Crawling in Node.js

Let's dig into five best practices for building a JavaScript web crawler in Node.js.

Use Your Web Spider to Retrieve All URLs

You should consider retrieving the entire list of links on a web page during crawling. After all, URLs are what allow a web spider to continue crawling a site. Only some of them may interest the user, but storing them all makes future iteration easier.

Also, consider using the site sitemap to find the list of all indexed URLs.

Frustrated that your web scrapers are blocked again and again? ZenRows API handles rotating proxies and headless browsers for you.
Try for FREE

Perform Crawling in Parallel

Analyzing each webpage one at a time isn't efficient: crawling a website sequentially takes a lot of time. Luckily, you can tweak your web spider to run in parallel and speed up the process. JavaScript supports async logic, but keep in mind that implementing it correctly requires extra care.
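As a sketch of the idea (`toBatches` and `crawlAll` are hypothetical helpers, not library functions), you can split the URL list into fixed-size batches and download each batch concurrently with Promise.all():

```javascript
// Split an array of URLs into fixed-size batches.
function toBatches(urls, batchSize) {
  const batches = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    batches.push(urls.slice(i, i + batchSize));
  }
  return batches;
}

// Crawl all URLs, batchSize pages at a time.
async function crawlAll(urls, crawlPage, batchSize = 4) {
  const results = [];
  for (const batch of toBatches(urls, batchSize)) {
    // the pages in a batch are downloaded concurrently
    const pages = await Promise.all(batch.map((url) => crawlPage(url)));
    results.push(...pages);
  }
  return results;
}

// usage sketch: crawlPage would typically wrap axios.get()
crawlAll(
  ["https://example.com/1", "https://example.com/2"],
  async (url) => url.toUpperCase(), // stand-in for a real download
).then(console.log);
```

Batching keeps the number of simultaneous requests bounded, which is gentler on the target server than firing every request at once.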

Make Your Web Spider Behave Like a Human Being

Your JavaScript web crawler should appear to the website as a human being, not as a bot. You can achieve this by setting the proper HTTP headers and timeouts. The goal is to have your Node.js web spider visit web pages as a real user would.

Keep Your Web Crawling Logic Simple

The layout of the website that your web spider targets can change a lot over time. For this reason, you shouldn't over-engineer your Node.js web crawler. Keep it simple so that you can easily adapt it to a new layout or site.

Keep Your Web Crawler Running

Web crawling performed on Node.js is unlikely to consume a lot of system resources. Thus, you should consider keeping your web spider running forever. Especially when targeting large websites that are constantly adding new pages.
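One simple way to keep a Node.js crawler alive is a loop that re-runs the crawl and then sleeps. A minimal sketch, where `keepCrawling` and the `maxRuns` cap are illustrative names rather than any library API (`maxRuns` defaults to infinity, i.e. run forever):

```javascript
// Re-run crawlSite forever (or maxRuns times), sleeping
// intervalMs between runs; a failed run is logged, not fatal.
async function keepCrawling(
  crawlSite,
  intervalMs = 60 * 60 * 1000,
  maxRuns = Infinity
) {
  let runs = 0;
  while (runs < maxRuns) {
    try {
      await crawlSite();
    } catch (error) {
      // a single failed run shouldn't kill the long-lived process
      console.error(error);
    }
    runs += 1;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return runs;
}
```

In production you'd more likely rely on a scheduler (cron, a job queue) or a process manager to restart the script, but the loop shows the idea.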

Before getting started

Here's what you need for the simple crawler to work:

  • Node.js and npm: if you don't have Node.js installed on your system, you can download it from the official website.

Then, you also need the following two npm libraries:

  • axios
  • cheerio

You can add them to your project's dependencies with the following command:

npm install axios cheerio

axios is a promise-based JavaScript HTTP client. Axios allows you to make HTTP requests and retrieve data through them.

cheerio is a JavaScript tool for parsing HTML and XML in Node.js. Cheerio provides APIs for traversing and manipulating the DOM of a webpage.

Let's now see how to use these two libraries to build a web crawler in Node.js.

How To Create a Web Crawler in Node.js

You can find the code of the demo JavaScript web crawler in this GitHub repo. Clone it and install the project's dependencies with the following commands:

git clone https://github.com/Tonel/web-crawler-nodejs 
cd web-crawler-nodejs 
npm install


First, you need to set up a Node.js server. If you haven't cloned the repo above, create a web-crawler-nodejs folder and enter it with:

mkdir web-crawler-nodejs 
cd web-crawler-nodejs

Now, initialize an npm application with:

npm init

Follow the process. You should now have a package.json file in your web-crawler-nodejs folder.

You are now ready to add axios and cheerio to your project's dependencies.

npm install axios cheerio

Create a server.js file and initialize it as follows:

function main() { 
	console.log("Hello, World!") 
} 
 
main()

Type the following command to launch the server.js Node.js script:

node server.js

You should now see in your terminal the following message:

Hello, World!

Et voilà! You've just initialized a Node.js application. It is now time to define the web crawling logic. Place it inside the main() function.

Here, you'll see how to perform web crawling on https://scrapeme.live/shop/. That's what the shop looks like:

A general view of scrapeme.live/shop

scrapeme.live/shop is nothing more than a simple paginated list of Pokémon-inspired products.

Let's build a simple web crawler in Node.js to retrieve all product URLs.

First, you need to download the HTML content from a web page. You can do it with axios as follows:

const axios = require("axios"); 
 
// note: await only works inside an async function 
const pageHTML = await axios.get("https://scrapeme.live/shop")

Axios allows you to perform an HTTP request and retrieve a web page in just a few lines of code. pageHTML.data now stores the HTML content of the scrapeme.live/shop webpage. It's time to feed it to cheerio with load() as follows:

const cheerio = require("cheerio"); 
 
// initializing cheerio 
const $ = cheerio.load(pageHTML.data)

Congratulations! You can now start performing web crawling!

Let's now extract the list of all pagination links needed to crawl the entire site. Right-click on the HTML element containing the pagination number and select the "Inspect" option.

Selecting the "Inspect" option to open the DevTools window

Your browser should open the following DevTools window with the DOM element highlighted:

The DevTools Window after selecting a pagination number HTML element

Here, you can see that .page-numbers identifies the pagination HTML elements in the DOM. You can now use the .page-numbers a CSS selector to retrieve all the pagination URLs with cheerio.

// retrieving the pagination URLs 
$(".page-numbers a").each((index, element) => { 
	const paginationURL = $(element).attr("href") 
})

As you can see, cheerio works just like jQuery. paginationURL would contain a URL as below:

https://scrapeme.live/shop/page/2/

Now, let's retrieve the URL associated with a single product. Right-click on the HTML element of a product. Then, open the DevTools window with the "Inspect" option. This is what you should get:

The DevTools window after selecting a product HTML element

You can retrieve the product URL with the li.product a.woocommerce-LoopProduct-link CSS selector as below:

// retrieving the product URLs 
$("li.product a.woocommerce-LoopProduct-link").each((index, element) => { 
	const productURL = $(element).attr("href") 
})

Now, you only have to implement the JavaScript crawling logic to iterate over each page:

const axios = require("axios"); 
const cheerio = require("cheerio"); 
 
async function main(maxPages = 50) { 
	// initialized with the first webpage to visit 
	const paginationURLsToVisit = ["https://scrapeme.live/shop"]; 
	const visitedURLs = []; 
 
	const productURLs = new Set(); 
 
	// iterating until the queue is empty 
	// or the iteration limit is hit 
	while ( 
		paginationURLsToVisit.length !== 0 && 
		visitedURLs.length <= maxPages 
	) { 
		// the current webpage to crawl 
		const paginationURL = paginationURLsToVisit.pop(); 
 
		// retrieving the HTML content from paginationURL 
		const pageHTML = await axios.get(paginationURL); 
 
		// adding the current webpage to the 
		// web pages already crawled 
		visitedURLs.push(paginationURL); 
 
		// initializing cheerio on the current webpage 
		const $ = cheerio.load(pageHTML.data); 
 
		// retrieving the pagination URLs 
		$(".page-numbers a").each((index, element) => { 
			const paginationURL = $(element).attr("href"); 
 
			// adding the pagination URL to the queue 
			// of web pages to crawl, if it wasn't yet crawled 
			if ( 
				!visitedURLs.includes(paginationURL) && 
				!paginationURLsToVisit.includes(paginationURL) 
			) { 
				paginationURLsToVisit.push(paginationURL); 
			} 
		}); 
 
		// retrieving the product URLs 
		$("li.product a.woocommerce-LoopProduct-link").each((index, element) => { 
			const productURL = $(element).attr("href"); 
			productURLs.add(productURL); 
		}); 
	} 
 
	// logging the crawling results 
	console.log([...productURLs]); 
 
	// use productURLs for scraping purposes... 
} 
 
// running the main() function 
main() 
	.then(() => { 
		// successful ending 
		process.exit(0); 
	}) 
	.catch((e) => { 
		// logging the error message 
		console.error(e); 
 
		// unsuccessful ending 
		process.exit(1); 
	});

paginationURLsToVisit and visitedURLs ensure that the web spider won't visit the same page more than once. Notice that productURLs is a Set, so it can't store the same product URL twice.

Also, keep in mind that the crawling logic may be infinite. So, you should provide a way to limit the number of webpages the web spider can visit. In the script above, this happens with the maxPages variable.

Et voilà! You just learned how to build a simple web crawler in Node.js and JavaScript!

At this point, you should save the crawled URLs to a database. This will allow you to scrape the crawled web pages directly without having to crawl them again. In addition, you should schedule the JavaScript crawling task to run periodically.

This is because the target site is likely to add, remove, or change products in the future. These are just a few ideas, and this tutorial stops here. What matters is to understand that crawling is typically only the first step in a larger process.

Run your JavaScript web crawler, and you'll obtain the following data:

[ 
	"https://scrapeme.live/shop/Bulbasaur/", 
	"https://scrapeme.live/shop/Ivysaur/", 
	"https://scrapeme.live/shop/Venusaur/", 
 
	// ... 
	 
	"https://scrapeme.live/shop/Nidoqueen/", 
	"https://scrapeme.live/shop/Nidorino/", 
	"https://scrapeme.live/shop/Nidoking/" 
]

Congratulations! You just crawled scrapeme.live/shop entirely!

However, sequentially crawling a website may not be the best solution for performance. That's why you should make your JavaScript web crawler work in parallel. For that, follow this tutorial on how to crawl in parallel with Node.js.

Also, don't forget that crawling in parallel means performing many HTTP requests in a short time. Your target site may identify your web spider as a threat and block it. To avoid this, you should use web proxies to rotate your IP.

ZenRows offers premium proxies. Find out more about how you can use them to avoid blocks.

Conclusion

Here, you learned everything you should know about building a JavaScript web crawler. Specifically, you saw how to create a web spider in Node.js that crawls all URLs from a website.

All you need are the right libraries, and here you looked at some of the most popular ones. Implementing a JavaScript web crawler with them isn't that difficult and takes only a few lines of code.

In detail, here is what you learned here in three points:
  • Understood what web crawling is.
  • Learned some tips on how to perform web crawling in JavaScript.
  • Applied the basics of web crawling to build a web spider in Node.js from scratch.

If you liked this, take a look at the Python Web Crawling guide.

Thanks for reading! We hope that you found this guide helpful. You can sign up for free, try ZenRows, and let us know any questions, comments, or suggestions.

Did you find the content helpful? Spread the word and share it on Twitter, LinkedIn, or Facebook.


Want to keep learning?

We will be sharing all the insights we have learned through the years in the following blog posts. If you don't want to miss a piece and keep learning, we'd be thrilled to have you in our newsletter.

No spam guaranteed. You can unsubscribe at any time.