Web scrapers and search engines rely on web crawling to extract information from the web. As a result, web crawlers have become increasingly popular.
Building a web spider with the right libraries in Node.js is easy. Here you'll learn how to build a JavaScript web crawler with the most popular web crawling libraries.
In this tutorial, you'll understand the basics of JavaScript crawling. In addition, you'll see why JavaScript is a good language when it comes to building a web spider. You'll also see some of the best practices for web crawling.
Follow this tutorial and become an expert in web crawling with JavaScript! Let's waste no more time and build our first crawler in Node.js.
What's Web Crawling?
A web crawler, also known as a web spider, is a tool that systematically goes through one or more websites to gather information. Specifically, a web crawler starts from a list of known URLs. While crawling these web pages, the web spider tool discovers other URLs.
Then, the web spider analyzes these new URLs, and the URL discovery process continues. So, the web crawling process can be endless. Also, one webpage associated with a URL might be more important than another. Thus, web spiders generally assign each URL a priority.
Simply put, a web crawler's goal is to discover URLs while reviewing and ranking web pages. Generally, search engines use web spiders to crawl the Web. Similarly, web scrapers use web crawling logic to find the web pages to extract data from.
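To make the idea of URL discovery and prioritization more concrete, here is a minimal sketch of a prioritized list of URLs to visit. The CrawlFrontier class, its method names, and the numeric priority scheme are assumptions made for illustration, not part of any library.

```javascript
// A minimal crawl frontier: URLs waiting to be visited, ordered by priority.
// The class name and the numeric priority scheme are illustrative assumptions.
class CrawlFrontier {
  constructor() {
    this.queue = []; // entries of the form { url, priority }
    this.seen = new Set(); // every URL ever added, to avoid duplicates
  }

  // Adds a URL unless it was already discovered; returns true if added.
  add(url, priority = 0) {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.queue.push({ url, priority });
    // keep the highest-priority URL at the front
    this.queue.sort((a, b) => b.priority - a.priority);
    return true;
  }

  // Removes and returns the highest-priority URL, or undefined when empty.
  next() {
    const entry = this.queue.shift();
    return entry ? entry.url : undefined;
  }

  get size() {
    return this.queue.length;
  }
}

const frontier = new CrawlFrontier();
frontier.add("https://example.com/", 1);
frontier.add("https://example.com/products", 5);
frontier.add("https://example.com/", 3); // ignored: already discovered
console.log(frontier.next()); // https://example.com/products
```

Sorting on every insertion is fine for a small crawl. A larger crawler would likely use a heap-based priority queue instead.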
Is JavaScript Good for Web Crawling?
Using JavaScript on the front end, you can only crawl web pages within the same origin. That's because you would download web pages via AJAX. But the Same-Origin Policy applied by browsers narrows the scope of AJAX.
Let's better understand this with an example. Let's assume your JavaScript web spider runs on a web page from example.com. So, that script could crawl only web pages on the example.com domain.
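As a rough illustration of what "same origin" means, the sketch below compares two URLs with the standard URL class. Note that an origin is the combination of scheme, host, and port, so it is slightly stricter than the domain alone. The isSameOrigin helper is a hypothetical name chosen for this example.

```javascript
// Returns true when two URLs share the same origin (scheme + host + port).
function isSameOrigin(urlA, urlB) {
  return new URL(urlA).origin === new URL(urlB).origin;
}

console.log(isSameOrigin("https://example.com/a", "https://example.com/b/c")); // true
console.log(isSameOrigin("https://example.com/a", "https://other.com/a")); // false
console.log(isSameOrigin("https://example.com/a", "http://example.com/a")); // false (different scheme)
```

A browser may still allow a cross-origin request when the target server opts in through CORS headers, but a crawler can't count on that.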
That doesn't mean that JavaScript isn't good for web crawling. Quite the opposite. Thanks to Node.js, you can run JavaScript on servers and avoid all the problems mentioned before.
So, Node.js allows you to build a web spider that takes advantage of all JavaScript benefits. In detail, JavaScript is easy to write, handles asynchronous operations natively, and is supported by thousands of libraries.
Let's now look at some best practices on how to use Node.js when it comes to web crawling.
Best Practices for Web Crawling in Node.js
Let's dig into five best practices for building a JavaScript web crawler in Node.js.
Use Your Web Spider to Retrieve All URLs
You should consider retrieving a web page's entire list of links during crawling. After all, URLs are what allow a web spider to continue crawling a site.
Only some may interest the user, but storing them all makes future iterations easier.
Also, consider using the site sitemap to find the list of all indexed URLs.
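For the sitemap approach, a sketch like the following could pull the URLs out of a sitemap's XML. It assumes you have already downloaded the sitemap as a string. The extractSitemapURLs helper and the sample content are hypothetical, and a regular expression is used only to keep the example dependency-free.

```javascript
// Extracts the URLs listed in <loc> elements of a sitemap XML string.
// A regex keeps the sketch dependency-free; a real XML parser is more robust.
function extractSitemapURLs(sitemapXML) {
  const urls = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  for (const match of sitemapXML.matchAll(locPattern)) {
    urls.push(match[1]);
  }
  return urls;
}

// hypothetical sitemap content, for illustration only
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/page/2/</loc></url>
</urlset>`;

console.log(extractSitemapURLs(sampleSitemap));
```

Sitemaps may escape characters such as & as XML entities, so a production crawler would probably decode those or rely on a proper XML parser.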
Perform Crawling in Parallel
Analyzing each webpage one at a time isn't an efficient process. Web crawling a website sequentially takes a lot of time.
Luckily, you can tweak your web spider to run in parallel to speed up the process. JavaScript supports async logic. But keep in mind that implementing it requires extra logic.
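One possible way to add that extra logic is a small helper that processes several URLs at once while capping the number of simultaneous requests. This is only a sketch: runWithConcurrency and fakeFetch are hypothetical names, and fakeFetch stands in for a real download so the example runs without network access.

```javascript
// Runs an async task for every item, with at most `limit` tasks in flight.
async function runWithConcurrency(items, limit, task) {
  const results = new Array(items.length);
  let nextIndex = 0;

  // each worker keeps pulling the next unprocessed item until none are left
  async function worker() {
    while (nextIndex < items.length) {
      const current = nextIndex++;
      results[current] = await task(items[current]);
    }
  }

  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// stand-in for a real page download, e.g. axios.get(url)
async function fakeFetch(url) {
  await new Promise((resolve) => setTimeout(resolve, 10));
  return `content of ${url}`;
}

const urls = [1, 2, 3, 4, 5].map((n) => `https://example.com/page/${n}/`);
runWithConcurrency(urls, 2, fakeFetch).then((pages) => {
  console.log(pages.length); // 5
});
```

Capping concurrency keeps the crawler from flooding the target site, which also lowers the chance of being blocked.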
Make Your Web Spider Behave Like a Human Being
Your JavaScript web crawler should appear to the website as a human being, not as a bot. You can achieve this by setting the proper HTTP headers and timeouts.
The goal is to have your Node.js web spider visit web pages as a real user would.
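As a sketch of what this could look like with Axios, the snippet below builds a request config with browser-like headers and a request timeout, plus a helper that pauses for a random interval between requests. The header values, the 10-second timeout, and the helper names buildRequestConfig and randomDelay are assumptions for illustration. Only the headers and timeout options are part of the Axios request config.

```javascript
// Builds an Axios request config with browser-like headers and a timeout.
// The header values are illustrative assumptions; copy real ones from your browser.
function buildRequestConfig() {
  return {
    headers: {
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
      "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      "Accept-Language": "en-US,en;q=0.9",
    },
    timeout: 10000, // abort the request after 10 seconds
  };
}

// Resolves after a random delay between minMs and maxMs (inclusive).
function randomDelay(minMs, maxMs) {
  const ms = minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Possible usage inside an async crawl loop (requires axios):
//   await randomDelay(1000, 3000);
//   const pageHTML = await axios.get(paginationURL, buildRequestConfig());
```

Randomizing the pause between requests avoids the perfectly regular timing that tends to give bots away.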
Keep Your Web Crawling Logic Simple
The layout of the website that your web spider targets can change a lot over time. For this reason, you shouldn't over-engineer your Node.js web crawler.
Keep it simple so that you can easily adapt it to a new layout or site.
Keep Your Web Crawler Running
Web crawling performed on Node.js is unlikely to consume a lot of system resources.
Thus, you should consider keeping your web spider running forever, especially when targeting large websites that are constantly adding new pages.
Before Getting Started
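Here's the list of what you need for the simple crawler to work:
- Node.js and npm 8.0 or later. If you don't have Node.js installed on your system, you can download it from the official Node.js website.
Then, you also need the two following npm libraries:
- axios: a promise-based HTTP client for Node.js and the browser.
- cheerio: a library for parsing HTML and querying it with a jQuery-like syntax.
You can add them to your project's dependencies with the following command:
npm install axios cheerio
Alternatively, you can clone the tutorial's repository and install its dependencies:
git clone https://github.com/Tonel/web-crawler-nodejs
cd web-crawler-nodejs
npm install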
Follow this tutorial and learn how to build a Node.js web crawler app in JavaScript!
First, you need to set up a Node.js server. If you haven't cloned the repo above, create a web-crawler-nodejs folder and enter it with the command below.
mkdir web-crawler-nodejs
cd web-crawler-nodejs
Now, initialize an npm application with:
npm init
Follow the process. You should now have a package.json file in your web-crawler-nodejs folder.
You're now ready to add axios and cheerio to your project's dependencies.
npm install axios cheerio
Create a server.js file and initialize it as follows:
function main() {
  console.log("Hello, World!")
}
main()
Type the following command to launch the server.js Node.js script:
node server.js
You should now see in your terminal the following message:
Hello, World!
Et voilà! You've just initialized a Node.js application. It's now time to define the web crawling logic. Place it inside the main() function.
Here, you'll see how to perform web crawling on https://www.scrapingcourse.com/ecommerce/. The site contains a paginated list of different clothes and accessories.
Let's build a simple web crawler in Node.js to retrieve all product URLs.
First, you need to download the HTML content from a web page. You can do it with axios as follows. The call uses await, so the function that contains it, such as main(), must be declared async:
const pageHTML = await axios.get("https://www.scrapingcourse.com/ecommerce/")
Axios allows you to perform an HTTP request in just a few lines of code. pageHTML now stores the Axios response, and its data property contains the HTML content of the scrapingcourse.com/ecommerce/ webpage.
It's time to feed it to cheerio with load() as follows:
const cheerio = require("cheerio");
// initializing cheerio
const $ = cheerio.load(pageHTML.data)
Congratulations! You can now start performing web crawling!
Let's now extract the list of all pagination links needed to crawl the entire site. Right-click on the HTML element containing the pagination number and select the "Inspect" option.
Your browser should open the DevTools window with the DOM element highlighted.
Here, you can see that .page-numbers identifies the pagination HTML elements in the DOM. You can now use the .page-numbers a CSS selector to retrieve all the pagination URLs with cheerio.
// retrieving the pagination URLs
$(".page-numbers a").each((index, element) => {
const paginationURL = $(element).attr("href")
})
As you can see, cheerio works just like jQuery. paginationURL would contain a URL as below:
https://www.scrapingcourse.com/ecommerce/page/2/
Now, let's retrieve the URL associated with a single product. Right-click on the HTML element of a product. Then, open the DevTools window with the "Inspect" option.
You can retrieve the product URL with the li.product a.woocommerce-LoopProduct-link CSS selector as below:
// retrieving the product URLs
$("li.product a.woocommerce-LoopProduct-link").each((index, element) => {
const productURL = $(element).attr("href")
})
Now, you only have to implement the JavaScript crawling logic to iterate over each page:
const axios = require("axios");
const cheerio = require("cheerio");

async function main(maxPages = 50) {
  // initialized with the first webpage to visit
  const paginationURLsToVisit = ["https://www.scrapingcourse.com/ecommerce/"];
  const visitedURLs = [];
  const productURLs = new Set();

  // iterating until there are no more pages to visit
  // or the iteration limit is hit
  while (
    paginationURLsToVisit.length !== 0 &&
    visitedURLs.length < maxPages
  ) {
    // the current webpage to crawl
    const paginationURL = paginationURLsToVisit.pop();

    // retrieving the HTML content from paginationURL
    const pageHTML = await axios.get(paginationURL);

    // adding the current webpage to the
    // web pages already crawled
    visitedURLs.push(paginationURL);

    // initializing cheerio on the current webpage
    const $ = cheerio.load(pageHTML.data);

    // retrieving the pagination URLs
    $(".page-numbers a").each((index, element) => {
      const paginationURL = $(element).attr("href");

      // adding the pagination URL to the list
      // of web pages to crawl, if it wasn't yet crawled
      if (
        !visitedURLs.includes(paginationURL) &&
        !paginationURLsToVisit.includes(paginationURL)
      ) {
        paginationURLsToVisit.push(paginationURL);
      }
    });

    // retrieving the product URLs
    $("li.product a.woocommerce-LoopProduct-link").each((index, element) => {
      const productURL = $(element).attr("href");
      productURLs.add(productURL);
    });
  }

  // logging the crawling results
  console.log([...productURLs]);

  // use productURLs for scraping purposes...
}

// running the main() function
main()
  .then(() => {
    // successful ending
    process.exit(0);
  })
  .catch((e) => {
    // logging the error message
    console.error(e);
    // unsuccessful ending
    process.exit(1);
  });
paginationURLsToVisit and visitedURLs ensure that the web spider won't visit the same page more than once. Notice that productURLs is a Set, so it can't store the same product URL twice.
Also, keep in mind that the crawling logic may be infinite. So, you should provide a way to limit the number of web pages the web spider can visit. In the script above, this happens with the maxPages parameter.
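Duplicate detection only works when equivalent links are written identically. On other sites, links may be relative or carry a fragment, so one option is to normalize each URL before comparing it. The sketch below assumes a hypothetical normalizeURL helper built on the standard URL class and uses a Set for the visited pages.

```javascript
// Resolves an href against the page it was found on and strips the fragment,
// so that equivalent links map to the same string.
function normalizeURL(href, baseURL) {
  const url = new URL(href, baseURL);
  url.hash = "";
  return url.href;
}

const visited = new Set();
const base = "https://www.scrapingcourse.com/ecommerce/";

// three different spellings of the same page
const hrefs = [
  "page/2/",
  "/ecommerce/page/2/#top",
  "https://www.scrapingcourse.com/ecommerce/page/2/",
];

for (const href of hrefs) {
  visited.add(normalizeURL(href, base));
}

console.log(visited.size); // 1
```

A Set also makes the "already visited" check faster than Array.includes() once the number of crawled pages grows.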
Et voilà! You just learned how to build a simple web crawler in Node.js and JavaScript!
At this point, you should save the crawled URLs to a database. That will allow you to scrape the crawled web pages directly without having to crawl them again. In addition, you should schedule the JavaScript crawling task to run periodically.
That is because the site is likely to add, remove, or change products in the future. These are just a few ideas, and this tutorial stops here. What matters is to understand that crawling is typically only the first step in a larger process.
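As a very small stand-in for a database, the following sketch persists the crawled URLs to a JSON file and loads them back. The saveURLs and loadURLs helpers and the file location are hypothetical. A real project would more likely use an actual database.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Saves the crawled URLs to a JSON file (a stand-in for a real database).
function saveURLs(filePath, urls) {
  fs.writeFileSync(filePath, JSON.stringify([...urls], null, 2));
}

// Loads previously saved URLs, or an empty Set when the file doesn't exist yet.
function loadURLs(filePath) {
  if (!fs.existsSync(filePath)) return new Set();
  return new Set(JSON.parse(fs.readFileSync(filePath, "utf8")));
}

// hypothetical file location, for illustration only
const storePath = path.join(os.tmpdir(), "product-urls.json");

const productURLs = new Set([
  "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/",
]);
saveURLs(storePath, productURLs);
console.log(loadURLs(storePath).size); // 1
```

On a later run, the scraper could call loadURLs() first and skip the crawl entirely when the stored list is still fresh enough.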
Run your JavaScript web crawler, and you'll obtain the following data:
[
"https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/",
"https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/",
"https://www.scrapingcourse.com/ecommerce/product/aeon-capri/",
// ...
"https://www.scrapingcourse.com/ecommerce/product/zing-jump-rope/",
"https://www.scrapingcourse.com/ecommerce/product/zoe-tank/",
"https://www.scrapingcourse.com/ecommerce/product/zoltan-gym-tee/"
]
Congratulations! You just crawled scrapingcourse.com/ecommerce/ entirely!
However, sequentially crawling a website may not be the best solution for performance. That's why you should make your JavaScript web crawler work in parallel. For that, follow this tutorial on how to crawl in parallel with Node.js.
Also, don't forget that crawling in parallel means performing many HTTP requests in a short time. Your target site may identify your web spider as a threat and block it. To avoid this, you should configure Axios to use web proxies to rotate your IP.
ZenRows offers premium proxies. Find out more about how you can use them to avoid blocks.
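For reference, Axios accepts a proxy option in its request config, so a simple rotation could look like the sketch below. The proxy addresses are placeholders from a reserved documentation range and won't work as-is, and the nextProxyConfig helper is a hypothetical name. Treat this as an outline rather than a complete anti-blocking setup.

```javascript
// Hypothetical proxy servers; replace with proxies you actually have access to.
const proxies = [
  { protocol: "http", host: "203.0.113.10", port: 8080 },
  { protocol: "http", host: "203.0.113.11", port: 8080 },
  { protocol: "http", host: "203.0.113.12", port: 8080 },
];

let proxyIndex = 0;

// Returns an Axios request config that uses the next proxy in round-robin order.
function nextProxyConfig() {
  const proxy = proxies[proxyIndex];
  proxyIndex = (proxyIndex + 1) % proxies.length;
  return { proxy };
}

// Possible usage inside the crawl loop (requires axios):
//   const pageHTML = await axios.get(paginationURL, nextProxyConfig());
console.log(nextProxyConfig().proxy.host); // 203.0.113.10
console.log(nextProxyConfig().proxy.host); // 203.0.113.11
```

Round-robin is the simplest rotation strategy. Picking a proxy at random or dropping proxies that start failing are common refinements.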
Conclusion
In this guide, you learned the essentials of building a JavaScript web crawler. Specifically, you saw how to create a web spider in Node.js that crawls all URLs from a website.
All you need are the right libraries, and we discussed some of the most popular ones. Implementing a JavaScript web crawler with them isn't difficult and takes only a few lines of code.
To wrap up, here's what you learned in three main points:
- What web crawling is.
- Tips on how to perform web crawling in JavaScript.
- How to apply the basics of web crawling to build a web spider in Node.js from scratch.
If you liked this, take a look at the Python web crawling guide.
Thanks for reading! We hope that you found this guide helpful. You can sign up for free, try ZenRows, and let us know any questions, comments, or suggestions.