Web Scraping in Javascript and NodeJS

By Ander · September 1, 2021 · 12 min read
Ander is a web developer who has been working for several startups for more than 10 years, having worked with a wide variety of sectors and technologies. Engineer turned entrepreneur.

Javascript and web scraping are both on the rise. We will combine them to build a scraper and crawler from scratch using Javascript in NodeJS.

Avoiding blocks is an essential part of website scraping. So we will also add some features to help in that regard. And finally, parallelize the tasks to go faster thanks to Node's event loop.

Follow this tutorial to learn how to web scrape with Node and Javascript!

Prerequisites

For the code to work, you will need Node (or nvm) and npm installed. Some systems have them pre-installed. After that, create a project and install the necessary libraries. Running npm init -y will create a package.json file, and npm install will download the libraries and add them as dependencies.

npm init -y 
npm install axios cheerio playwright

Introduction to JS scraping tools

We are using Node v12, but you can always check the compatibility of each feature.

Axios is a "promise based HTTP client" that we will use to get the HTML from a URL. It allows several options, such as headers and proxies, which we will cover later. If you use TypeScript, they include "definitions and a type guard for Axios errors."

Cheerio is a "fast, flexible & lean implementation of core jQuery" Javascript library. It lets us find DOM nodes using selectors, get text or attributes, and many other things. We will pass the HTML to cheerio and then query it to extract data, just as we would in a browser environment.

Playwright "is a Node.js library to automate Chromium, Firefox and WebKit with a single API." When Axios is not enough, we will get the HTML using a headless browser. It will then parse the content, execute Javascript and wait for the async content to load.

Is NodeJS good for web scraping?

As you've seen above, tools are available, and the technology is consolidated. All of them are widely used and properly maintained.

Apart from these, there are several alternatives to each of them. And many more focused on one task, such as table scraper. The Javascript web scraping ecosystem is huge!

How to web scrape with Javascript?

The first thing we need is the HTML. We installed Axios for that, and its usage is straightforward. We'll use scrapeme.live as an example, a fake web page prepared for scraping.

const axios = require('axios'); 
axios.get('https://scrapeme.live/shop/') 
	.then(({ data }) => console.log(data));

Nice! Then, we can query the two things we want right now using cheerio: pagination links and products. We will look at the page with Chrome DevTools open to know how to do that. All modern web browsers offer developer tools such as these. Pick your favorite.

Pagination in DevTools

We marked the interesting parts in red, but you can open the page and try it yourself. In this case, all the CSS selectors are straightforward and do not need nesting. Check the guide if you are looking for a different outcome or cannot select it. You can also use DevTools to get the selector.

Copy Selector from DevTools

On the Elements tab, right-click on the node ➡ Copy ➡ Copy selector. But the outcome is usually very coupled to the HTML, as in this case: #main > div:nth-child(2) > nav > ul > li:nth-child(2) > a.

This approach might be a problem in the future because it will stop working after any minimal change. Besides, it will only capture one of the pagination links, not all of them.

You can execute Javascript on the Console tab and check if the selectors are working correctly. Pass the selector to the document.querySelector function and check the output. Remember this trick while web scraping. 😉

We could capture all the links on the page and then filter them by content. If we were to write a full-site crawler, that would be the right approach.
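As a sketch of that full-site approach (the helper name and the '/shop/page/' pattern here are illustrative, not part of the final code): collect every href on the page and keep only the pagination ones by matching their content.

```javascript
// Sketch of the "capture all, then filter" approach for a full-site crawler.
// The helper name and the '/shop/page/' pattern are assumptions for illustration.
const keepPaginationLinks = hrefs =>
	[...new Set(hrefs)] // Remove duplicates (two pagination bars repeat the links)
		.filter(href => href && href.includes('/shop/page/')); // Keep only pagination URLs

const allHrefs = [
	'https://scrapeme.live/shop/page/2/',
	'https://scrapeme.live/shop/page/2/', // Duplicate from the second pagination bar
	'https://scrapeme.live/shop/charmander/',
	'https://scrapeme.live/shop/page/3/',
];

console.log(keepPaginationLinks(allHrefs));
// ['https://scrapeme.live/shop/page/2/', 'https://scrapeme.live/shop/page/3/']
```

The same idea generalizes: for a crawler, the filter would allow any same-domain URL instead of just pagination.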

In our case, we only want the pagination links. Using the provided class, .page-numbers a will capture them all. Then extract the URLs (hrefs) from those. The CSS selector will match all the link nodes with an ancestor containing the class page-numbers.

const axios = require('axios'); 
const cheerio = require('cheerio'); 
 
const extractLinks = $ => [ 
	...new Set( 
		$('.page-numbers a') // Select pagination links 
			.map((_, a) => $(a).attr('href')) // Extract the href (url) from each link 
			.toArray() // Convert cheerio object to array 
	), 
]; 
 
axios.get('https://scrapeme.live/shop/').then(({ data }) => { 
	const $ = cheerio.load(data); // Initialize cheerio 
	const links = extractLinks($); 
 
	console.log(links); 
	// ['https://scrapeme.live/shop/page/2/', 'https://scrapeme.live/shop/page/3/', ... ] 
});

Store the content above in a file and execute it in NodeJS to see the results.

As for the products (Pokémon in this case), we will get id, name, and price. Check the image below for details on selectors, or try again on your own. We will only log the scraped data for now. Check the final code for adding them to an array.

Product (Charmander) in DevTools

As you can see above, all the products contain the class product, which makes our job easier. And for each of them, the h2 tag and price node hold the content we want.

As for the product ID, we need to match an attribute instead of a class or DOM node type. That can be done using the syntax node[attribute="value"]. We are looking only for the DOM node with the attribute, so there is no need to match it to any particular value.

const extractContent = $ => 
	$('.product') 
		.map((_, product) => { 
			const $product = $(product); 
			return { 
				id: $product.find('a[data-product_id]').attr('data-product_id'), 
				title: $product.find('h2').text(), 
				price: $product.find('.price').text(), 
			}; 
		}) 
		.toArray(); 
// ... 
 
const content = extractContent($); 
console.log(content); 
// [{ id: '759', title: 'Bulbasaur', price: '£63.00' }, ...]

There is no error handling, as you can see above. We will omit it for brevity in the snippets but take it into account in real life. Most of the time, returning the default value (i.e., empty array) should do the trick.
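One hedged way to do that (withDefault is a made-up helper name, not part of the final code): wrap the risky call and fall back to a default value when it fails.

```javascript
// Illustrative helper (not in the final code): run an async task and
// return a default value instead of throwing when it fails.
const withDefault = (fn, defaultValue) => async (...params) => {
	try {
		return await fn(...params);
	} catch (error) {
		console.error('Task failed:', error.message);
		return defaultValue; // e.g., an empty array for extractors
	}
};

// A failing task returns the default instead of crashing the crawler
const failing = async () => { throw new Error('timeout'); };
withDefault(failing, [])().then(result => console.log(result)); // []
```

Wrapping extractLinks or the Axios call this way keeps a single bad page from stopping the whole crawl.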

Now that we have some pagination links, we should also visit them. If you run the whole code, you'll see that they appear twice - there are two pagination bars.

We will add two sets to keep track of what we already visited and the newly discovered links. Sets have existed in Javascript since ES2015, and all modern NodeJS versions support them.

We are using them instead of arrays to avoid dealing with duplicates, but either one would work. To avoid crawling too much, we'll also include a maximum.
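As a quick refresher on why Sets fit here: adding an existing value is a no-op, which is exactly the dedupe behavior we want.

```javascript
// Adding a value a Set already contains does nothing: no manual dedupe needed
const seen = new Set();
seen.add('https://scrapeme.live/shop/page/2/');
seen.add('https://scrapeme.live/shop/page/2/'); // Duplicate: ignored

console.log(seen.size); // 1
console.log(seen.has('https://scrapeme.live/shop/page/2/')); // true
```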

const maxVisits = 5; 
const visited = new Set(); 
const toVisit = new Set(); 
toVisit.add('https://scrapeme.live/shop/page/1/'); // Add initial URL

We will use async/await for the next part to avoid callbacks and nesting. An async function is an alternative to writing promise-based functions as chains. Again, supported in all modern versions of Node.js.

In this case, the Axios call will remain asynchronous. It might take around 1 second per page, but we write the code sequentially without needing callbacks.

There is a small gotcha with this: await is only valid inside async functions. That will force us to wrap the initial code inside an IIFE (Immediately Invoked Function Expression). The syntax is a bit weird: it creates a function and then calls it immediately.

const crawl = async url => { 
	visited.add(url); 
	const { data } = await axios.get(url); 
	const $ = cheerio.load(data); 
	const content = extractContent($); 
	const links = extractLinks($); 
	links 
		.filter(link => !visited.has(link)) // Filter out already visited links 
		.forEach(link => toVisit.add(link)); 
}; 
 
(async () => { // IIFE 
	// Loop over a set's values 
	for (const next of toVisit.values()) { 
		if (visited.size >= maxVisits) { 
			break; 
		} 
 
		toVisit.delete(next); 
		await crawl(next); 
	} 
 
	console.log(visited); 
	// Set { 'https://scrapeme.live/shop/page/1/', '.../2/', ... } 
	console.log(toVisit); 
	// Set { 'https://scrapeme.live/shop/page/47/', '.../48/', ... } 
})(); // The final set of parenthesis will call the function

Avoid blocks when web scraping

As said before, we need mechanisms to avoid blocks, captchas, login walls, and other defensive techniques. It is complicated to prevent them 100% of the time. But we can achieve a high success rate with simple efforts. We will apply two tactics: adding proxies and full-set headers.

Frustrated that your web scrapers are blocked again and again? ZenRows API handles rotating proxies and headless browsers for you.
Try for FREE

Proxies

There are free proxies, even though we do not recommend them. They might work for testing but are not reliable. We will use some of them in the examples below, but note that they are short-lived, so the ones listed might no longer work by the time you read this.

Paid proxy services, on the other hand, offer IP Rotation. Our web scraper will work the same, but the target website will see a different IP. In some cases, they rotate for every request or every few minutes. In any case, they are much harder to ban. And when it happens, we'll get a new IP after a short time.
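A minimal sketch of how rotation looks on our side, with made-up proxy hosts: each request picks the next proxy from a pool.

```javascript
// Round-robin over a proxy pool; the hosts below are placeholders, not real proxies.
const rotatingProxy = proxies => {
	let index = 0;
	return () => proxies[index++ % proxies.length]; // Next proxy, wrapping around
};

const nextProxy = rotatingProxy([
	{ protocol: 'http', host: 'proxy-1.example.com', port: 80 },
	{ protocol: 'http', host: 'proxy-2.example.com', port: 80 },
]);

console.log(nextProxy().host); // proxy-1.example.com
console.log(nextProxy().host); // proxy-2.example.com
console.log(nextProxy().host); // proxy-1.example.com

// With Axios, each request would then use: axios.get(url, { proxy: nextProxy() })
```

Paid services do this rotation server-side, so your code keeps a single entry point while the exit IP changes.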

We will use httpbin for testing. It offers an API with several endpoints that will respond with headers, IP addresses, and more.

const axios = require('axios'); 
 
const proxy = { 
	protocol: 'http', 
	host: '202.212.123.44', // Free proxy from the list 
	port: 80, 
}; 
 
(async () => { 
	const { data } = await axios.get('https://httpbin.org/ip', { proxy }); 
 
	console.log(data); 
	// { origin: '202.212.123.44' } 
})(); 
	

HTTP request headers

The next step would be to check our request's HTTP headers. The best known one is User-Agent (UA for short), but there are many more. Many software tools send their own default UA, for example Axios (axios/0.21.1).

In general, it is a good practice to send actual headers along with the UA. That means we need a real-world set of headers because not all browsers and versions use the same ones. We include two in the snippet: Chrome 92 and Firefox 90 on a Linux machine.

const axios = require('axios'); 
 
// Helper function to get a random item from an array 
const sample = array => array[Math.floor(Math.random() * array.length)]; 
 
const headers = [ 
	{ 
		Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9', 
		'Accept-Encoding': 'gzip, deflate, br', 
		'Accept-Language': 'en-US,en;q=0.9', 
		'Sec-Ch-Ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"', 
		'Sec-Ch-Ua-Mobile': '?0', 
		'Sec-Fetch-Dest': 'document', 
		'Sec-Fetch-Mode': 'navigate', 
		'Sec-Fetch-Site': 'none', 
		'Sec-Fetch-User': '?1', 
		'Upgrade-Insecure-Requests': '1', 
		'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36', 
	}, 
	{ 
		Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 
		'Accept-Encoding': 'gzip, deflate, br', 
		'Accept-Language': 'en-US,en;q=0.5', 
		'Sec-Fetch-Dest': 'document', 
		'Sec-Fetch-Mode': 'navigate', 
		'Sec-Fetch-Site': 'none', 
		'Sec-Fetch-User': '?1', 
		'Upgrade-Insecure-Requests': '1', 
		'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0', 
	}, 
]; 
 
(async () => { 
	const { data } = await axios.get('https://httpbin.org/anything', { headers: sample(headers) }); 
 
	console.log(data); 
	// { 'User-Agent': '...Chrome/92...', ... } 
})();

Headless browsers for dynamic HTML

Until now, we fetched every page with axios.get, which can be inadequate in some cases. Say we need Javascript to load and execute, or we need to interact with the browser (via mouse or keyboard).

While avoiding headless browsers would be preferable for performance reasons, sometimes there is no other choice. Selenium, Puppeteer, and Playwright are the most used and best-known libraries in the Javascript and NodeJS world.

The snippet below shows only the User-Agent. But since it is a real browser, the headers will include the entire set (Accept, Accept-Encoding, etcetera).

const playwright = require('playwright'); 
 
(async () => { 
	// 'webkit' is also supported, but there is a problem on Linux 
	for (const browserType of ['chromium', 'firefox']) { 
		const browser = await playwright[browserType].launch(); 
		const context = await browser.newContext(); 
		const page = await context.newPage(); 
		await page.goto('https://httpbin.org/headers'); 
		console.log(await page.locator('pre').textContent()); 
		await browser.close(); 
	} 
})(); 
 
// "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/94.0.4595.0 Safari/537.36", 
// "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0",

This approach comes with its own problem: look at the User-Agents. The Chromium one includes "HeadlessChrome". It will tell the target web page, well, that it's a headless browser. They might act upon that.

As with Axios, we can set headers, proxies, and other options to customize requests. An excellent choice to hide our "HeadlessChrome" User-Agent. And since this is a real web browser, we can intercept requests, block others (like CSS files or images), take screenshots or videos, and more. Really handy for web scraping!

const playwright = require('playwright'); 
 
(async () => { 
	const browser = await playwright.chromium.launch({ 
		proxy: { server: 'http://91.216.164.251:80' }, // Another free proxy from the list 
	}); 
	const context = await browser.newContext(); 
	const page = await context.newPage(); 
	await page.setExtraHTTPHeaders({ referer: 'https://news.ycombinator.com/' }); // The standard header name is "Referer" 
	await page.goto('http://httpbin.org/anything'); 
	console.log(await page.locator('pre').textContent()); // Print the complete response 
	await browser.close(); 
})(); 
 
// "Referer": "https://news.ycombinator.com/" 
// "origin": "91.216.164.251"

Now we can separate getting the HTML in a couple of functions, one using Playwright and the other Axios. We would then need a way to select which one is appropriate for the case at hand. For now, it is hardcoded.

The output, by the way, is the same in both cases, but Axios is considerably faster.

const playwright = require('playwright'); 
const axios = require('axios'); 
const cheerio = require('cheerio'); 
 
const getHtmlPlaywright = async url => { 
	const browser = await playwright.chromium.launch(); 
	const context = await browser.newContext(); 
	const page = await context.newPage(); 
	await page.goto(url); 
	const html = await page.content(); 
	await browser.close(); 
 
	return html; 
}; 
 
const getHtmlAxios = async url => { 
	const { data } = await axios.get(url); 
 
	return data; 
}; 
 
(async () => { 
	const html = await getHtmlPlaywright('https://scrapeme.live/shop/page/1/'); 
	const $ = cheerio.load(html); 
	const content = extractContent($); 
	console.log('getHtmlPlaywright', content); 
})(); 
 
(async () => { 
	const html = await getHtmlAxios('https://scrapeme.live/shop/page/1/'); 
	const $ = cheerio.load(html); 
	const content = extractContent($); 
	console.log('getHtmlAxios', content); 
})();

Using Javascript's async for parallel crawling

We already introduced async/await when crawling several links sequentially. If we were to crawl them in parallel, removing the await would be enough, right? Well... not so fast.

Without the await, the loop would start the first crawl and immediately take the following item from the toVisit set. The problem is that the set is still empty, since crawling the first page hasn't finished yet and no new links have been added. The crawl keeps running in the background, but we have already exited the main function.

To do this properly, we need to create a queue that will execute tasks when available. To avoid many requests simultaneously, we will limit its concurrency.

Neither Javascript nor NodeJS offers a built-in queue, so we will write a simple one. For web scraping at scale, look for a battle-tested library instead.

const queue = (concurrency = 4) => { 
	let running = 0; 
	const tasks = []; 
 
	return { 
		enqueue: async (task, ...params) => { 
			tasks.push({ task, params }); // Add task to the list 
			if (running >= concurrency) { 
				return; // Do not run if we are above the concurrency limit 
			} 
 
			running += 1; // "Block" one concurrent task 
			while (tasks.length > 0) { 
				const { task, params } = tasks.shift(); // Take task from the list 
				await task(...params); // Execute task with the provided params 
			} 
			running -= 1; // Release a spot 
		}, 
	}; 
}; 
 
// Just a helper function, JS has no sleep function 
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms)); 
 
const printer = async num => { 
	await sleep(2000); 
	console.log(num, Date.now()); 
}; 
 
const q = queue(); 
// Add 8 tasks that will sleep and print a number 
for (let num = 0; num < 8; num++) { 
	q.enqueue(printer, num); 
}

Running the code above will print the numbers from 0 to 3 almost immediately (with a timestamp), and then 4 to 7 after 2 seconds. It might be the hardest snippet to understand, so take your time reviewing it.

We define the queue first. It returns an object with an enqueue function that adds a task to the list and then checks whether we are above the concurrency limit. If we are not, it increments running and enters a loop that takes a task from the list and runs it with the provided params until the list is empty; then it decrements running. That variable marks whether we can execute more tasks, only allowing it below the concurrency limit. After that come the helper functions sleep and printer. Finally, we instantiate the queue and enqueue eight items, which immediately starts running four of them.

You just created a queue using JS in a few lines of code!

We now have to use the queue instead of a for loop to run several pages concurrently. The code below is partial, showing only the parts that change.

const crawl = async url => { 
	// ... 
	links 
		.filter(link => !visited.has(link)) 
		.forEach(link => { 
			q.enqueue(crawlTask, link); // Add to queue instead of to the list 
		}); 
}; 
 
// Helper function that will call crawl after some checks 
const crawlTask = async url => { 
	if (visited.size >= maxVisits) { 
		console.log('Over Max Visits, exiting'); 
		return; 
	} 
 
	if (visited.has(url)) { 
		return; 
	} 
 
	await crawl(url); 
}; 
 
const q = queue(); 
// Add the first link to the process 
q.enqueue(crawlTask, url);

Remember that Node.js runs in a single thread. We can take advantage of its event loop but cannot use multiple CPUs/threads. What we've seen works fine because the thread is idle most of the time - network requests do not consume CPU time.
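We can see that overlap with plain timers standing in for network requests: three 100 ms waits run in roughly 100 ms total, not 300 ms, because the event loop handles them concurrently.

```javascript
// Timers stand in for network requests: the event loop overlaps the waits.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

const start = Date.now();
Promise.all([sleep(100), sleep(100), sleep(100)]).then(() => {
	const elapsed = Date.now() - start;
	console.log(`3 x 100ms waits took ~${elapsed}ms`); // ~100ms, not 300ms
});
```

This is why a single-threaded crawler with a concurrency of 4 still gets close to a 4x speedup: the CPU is idle while responses are in flight.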

To build this further, we would need some storage (database, CSV, or JSON file) or a distributed queue system. Right now, we rely on in-memory variables, which are not shared between Node processes. For the moment, showing the scraped data is enough for a demo.

It is not overly complicated, but we covered enough ground in this blog post. Well done!

Final code

All the code is in the same js file for the demo. Consider splitting it for a real-world use case. You can also see it on GitHub.

const axios = require('axios'); 
const playwright = require('playwright'); 
const cheerio = require('cheerio'); 
 
const url = 'https://scrapeme.live/shop/page/1/'; 
const useHeadless = false; // "true" to use playwright 
const maxVisits = 30; // Arbitrary number for the maximum of links visited 
const visited = new Set(); 
const allProducts = []; 
 
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms)); 
 
const getHtmlPlaywright = async url => { 
	const browser = await playwright.chromium.launch(); 
	const context = await browser.newContext(); 
	const page = await context.newPage(); 
	await page.goto(url); 
	const html = await page.content(); 
	await browser.close(); 
 
	return html; 
}; 
 
const getHtmlAxios = async url => { 
	const { data } = await axios.get(url); 
 
	return data; 
}; 
 
const getHtml = async url => { 
	return useHeadless ? await getHtmlPlaywright(url) : await getHtmlAxios(url); 
}; 
 
const extractContent = $ => 
	$('.product') 
		.map((_, product) => { 
			const $product = $(product); 
 
			return { 
				id: $product.find('a[data-product_id]').attr('data-product_id'), 
				title: $product.find('h2').text(), 
				price: $product.find('.price').text(), 
			}; 
		}) 
		.toArray(); 
 
const extractLinks = $ => [ 
	...new Set( 
		$('.page-numbers a') 
			.map((_, a) => $(a).attr('href')) 
			.toArray() 
	), 
]; 
 
const crawl = async url => { 
	visited.add(url); 
	console.log('Crawl: ', url); 
	const html = await getHtml(url); 
	const $ = cheerio.load(html); 
	const content = extractContent($); 
	const links = extractLinks($); 
	links 
		.filter(link => !visited.has(link)) 
		.forEach(link => { 
			q.enqueue(crawlTask, link); 
		}); 
	allProducts.push(...content); 
 
	// We can see how the list grows. Gotta catch 'em all! 
	console.log(allProducts.length); 
}; 
 
// Change the default concurrency or pass it as a param 
const queue = (concurrency = 4) => { 
	let running = 0; 
	const tasks = []; 
 
	return { 
		enqueue: async (task, ...params) => { 
			tasks.push({ task, params }); 
			if (running >= concurrency) { 
				return; 
			} 
 
			++running; 
			while (tasks.length) { 
				const { task, params } = tasks.shift(); 
				await task(...params); 
			} 
			--running; 
		}, 
	}; 
}; 
 
const crawlTask = async url => { 
	if (visited.size >= maxVisits) { 
		console.log('Over Max Visits, exiting'); 
		return; 
	} 
 
	if (visited.has(url)) { 
		return; 
	} 
 
	await crawl(url); 
}; 
 
const q = queue(); 
q.enqueue(crawlTask, url);

Conclusion

We'd like you to leave with four main points:
  1. Understand the basics of website parsing, crawling, and how to extract data.
  2. Separate responsibilities and use abstractions when necessary.
  3. Apply the required techniques to avoid blocks.
  4. Be able to figure out the following steps to scale up.

With the pieces we've seen, we can build a custom web scraper using Javascript and NodeJS. It might not scale to thousands of websites, but it will be enough for a few. And moving to distributed crawling is not that far from here. And then to automation.

If you liked it, you might be interested in the ultimate guide to web scraping in Python.



Want to keep learning?

We will be sharing all the insights we have learned through the years in the following blog posts. If you don't want to miss a piece and keep learning, we'd be thrilled to have you in our newsletter.

No spam guaranteed. You can unsubscribe at any time.