Web Scraping in JavaScript and Node.js

August 22, 2022 · 12 min read

JavaScript and web scraping are both on the rise. We'll combine them to build a scraper and crawler from scratch using JavaScript in Node.js.

Avoiding blocks is an essential part of website scraping. So we'll also add some features to help in that regard. And finally, we'll parallelize the tasks to go faster with Node.js's event loop.

Follow this tutorial to learn how to web scrape with Node.js and JavaScript!

Prerequisites

You'll need Node.js (or nvm) and npm installed for the code to work. 

After that, initialize the project and install the necessary libraries. npm init creates the package.json file, and npm install adds the dependencies to it.

npm init -y 
npm install axios cheerio playwright

Introduction to JS Scraping Tools

We're using Node.js v12, but you can always check the compatibility of each feature.

Axios is a "promise-based HTTP client" that we'll use to get the HTML from a URL. It allows several options, such as headers and proxies, which we'll cover later. If you use TypeScript, they include "definitions and a type guard for Axios errors."

Cheerio is a "fast, flexible & lean implementation of core jQuery" JavaScript library. It lets us find DOM nodes using selectors, get text or attributes, and many other things. We'll pass the HTML to Cheerio and then query it to extract data, just as we would in a browser environment.

Playwright "is a Node.js library to automate Chromium, Firefox, and WebKit with a single API." When Axios isn't enough, we'll get the HTML using a headless browser. It'll then parse the content, execute JavaScript and wait for the async content to load.

Is Node.js Good for Web Scraping?

As you've seen above, tools are available, and the technology is consolidated. All of them are widely used and properly maintained.

Furthermore, there are several alternatives to each of them. And many more focused on one task, such as table-scraper. The JavaScript web scraping ecosystem is huge!

How to Web Scrape with JavaScript?

The first thing we need is the HTML. We installed Axios for that, and its usage is straightforward. 

We'll use scrapeme.live as an example, a demo web page prepared for scraping.

const axios = require('axios'); 
axios.get('https://scrapeme.live/shop/') 
	.then(({ data }) => console.log(data));

Nice! Then, we can query the two things we want using Cheerio: pagination links and products. 

We'll look at the page with Chrome DevTools open to learn how to do that. All modern web browsers offer developer tools such as these. Pick your favorite.

ScrapeMe Paginator

We marked the interesting parts in red, but you can try it on your own. In this case, all the CSS selectors are straightforward and don't need nesting.

Check a CSS selectors guide if you want a different outcome or can't select the node. You can also use DevTools to get the selector.

ScrapeMe Copy Selector

On the Elements tab, right-click on the node ➡ Copy ➡ Copy selector. But the outcome is usually very coupled to the HTML, as in this case: #main > div:nth-child(2) > nav > ul > li:nth-child(2) > a.

This approach might be a problem in the future because it'll stop working after any minimal change. Besides, it'll only capture one of the pagination links, not all.

You can execute JavaScript on the Console tab and check if the selectors work correctly. Pass the selector to the document.querySelector function and check the output. Remember this trick while web scraping. 😉

We could capture all the links on the page and then filter them by content. If we were to write a full-site crawler, that would be the right approach.
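That filter-by-content idea can be sketched like this (the hrefs below are illustrative, as a crawler might collect them from every anchor on the page):

```javascript
// Illustrative hrefs, as collected from all <a> nodes on a page
const allHrefs = [
	'https://scrapeme.live/shop/',
	'https://scrapeme.live/shop/page/2/',
	'https://scrapeme.live/shop/page/3/',
	'https://scrapeme.live/contact/',
];

// Keep only the links whose path looks like pagination
const paginationLinks = allHrefs.filter(href => /\/page\/\d+\/$/.test(href));

console.log(paginationLinks);
// ['https://scrapeme.live/shop/page/2/', 'https://scrapeme.live/shop/page/3/']
```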

In our case, we only want the pagination links. Using the provided class, .page-numbers a will capture them all. Then extract the URLs (hrefs) from those. The CSS selector will match all the link nodes with an ancestor containing the class page-numbers.

const axios = require('axios'); 
const cheerio = require('cheerio'); 
 
const extractLinks = $ => [ 
	...new Set( 
		$('.page-numbers a') // Select pagination links 
			.map((_, a) => $(a).attr('href')) // Extract the href (url) from each link 
			.toArray() // Convert cheerio object to array 
	), 
]; 
 
axios.get('https://scrapeme.live/shop/').then(({ data }) => { 
	const $ = cheerio.load(data); // Initialize cheerio 
	const links = extractLinks($); 
 
	console.log(links); 
	// ['https://scrapeme.live/shop/page/2/', 'https://scrapeme.live/shop/page/3/', ... ] 
});

Store the content above in a file and execute it in Node.js to see the results.

As for the products (Pokémon, in this case), we'll get their ID, name, and price. Check the image below for details on selectors, or try again on your own. We'll only log the scraped data for now. Check the final code for adding them to an array.

Scraped Data About Charmander

As you can see above, all the products contain the class product, which makes our job easier. And for each of them, the h2 tag and price node hold the content we want.

As for the product ID, we need to match an attribute instead of a class or DOM node type. That can be done using the syntax node[attribute="value"]. We're looking only for the DOM node with the attribute, so there is no need to match it to any particular value.

const extractContent = $ => 
	$('.product') 
		.map((_, product) => { 
			const $product = $(product); 
			return { 
				id: $product.find('a[data-product_id]').attr('data-product_id'), 
				title: $product.find('h2').text(), 
				price: $product.find('.price').text(), 
			}; 
		}) 
		.toArray(); 
// ... 
 
const content = extractContent($); 
console.log(content); 
// [{ id: '759', title: 'Bulbasaur', price: '£63.00' }, ...]

There is no error handling, as you can see above. We'll omit it for brevity in the snippets but take it into account in real life. Most of the time, returning the default value (i.e., empty array) should do the trick.
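A minimal sketch of that default-value approach: wrap the extraction in a try/catch and fall back to an empty array. The safeExtract helper name is ours, not part of the final code.

```javascript
// Hypothetical helper: run an extractor, return [] if anything throws
const safeExtract = (extractor, ...args) => {
	try {
		return extractor(...args);
	} catch (error) {
		console.error('Extraction failed:', error.message);
		return []; // Default value keeps the crawler going
	}
};

// A parser that throws, e.g. because a selector matched nothing
const broken = () => { throw new Error('selector not found'); };

console.log(safeExtract(broken)); // []
```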

Now that we have some pagination links, we should also visit them. If you run the whole code, you'll see that they appear twice, as there are two pagination bars.

We'll add two sets to track what we have already visited and the newly discovered links. Sets have existed in JavaScript since ES2015, and all modern Node.js versions support them.

We use them instead of arrays to avoid duplicates, but either would work. To avoid crawling too much, we'll also include a maximum.
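To see why a Set helps here, note how it silently drops duplicates (the URLs below are illustrative, such as the repeated link from the second pagination bar):

```javascript
const links = [
	'https://scrapeme.live/shop/page/2/',
	'https://scrapeme.live/shop/page/3/',
	'https://scrapeme.live/shop/page/2/', // Duplicate from the second pagination bar
];

// A Set keeps only unique values; an array would need manual filtering
const unique = new Set(links);

console.log(unique.size); // 2
console.log(unique.has('https://scrapeme.live/shop/page/2/')); // true
```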

const maxVisits = 5; 
const visited = new Set(); 
const toVisit = new Set(); 
toVisit.add('https://scrapeme.live/shop/page/1/'); // Add initial URL

We'll use async/await for the next part to avoid callbacks and nesting. The async function is an alternative to writing promise-based functions as chains. Again, supported in all modern versions of Node.js.

In this case, the Axios call will remain asynchronous. It might take around one second per page, but we write the code sequentially without needing callbacks.

There is a small gotcha with this: await is only valid in an async function. That will force us to wrap the initial code inside an IIFE (Immediately Invoked Function Expression). The syntax is a bit weird. It creates a function and then calls it immediately.
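In its simplest form, the pattern looks like this (with a tiny sleep helper of our own for illustration):

```javascript
// Helper that resolves after the given delay
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// An async IIFE: define an async function and call it immediately.
// Without the wrapper, the await below would be a syntax error.
(async () => {
	await sleep(100);
	console.log('~100 ms later');
})();
```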

const crawl = async url => { 
	visited.add(url); 
	const { data } = await axios.get(url); 
	const $ = cheerio.load(data); 
	const content = extractContent($); 
	const links = extractLinks($); 
	links 
		.filter(link => !visited.has(link)) // Filter out already visited links 
		.forEach(link => toVisit.add(link)); 
}; 
 
(async () => { // IIFE 
	// Loop over a set's values 
	for (const next of toVisit.values()) { 
		if (visited.size >= maxVisits) { 
			break; 
		} 
 
		toVisit.delete(next); 
		await crawl(next); 
	} 
 
	console.log(visited); 
	// Set { 'https://scrapeme.live/shop/page/1/', '.../2/', ... } 
	console.log(toVisit); 
	// Set { 'https://scrapeme.live/shop/page/47/', '.../48/', ... } 
})(); // The final set of parenthesis will call the function

Avoid Blocks When Web Scraping

As mentioned, we need mechanisms to avoid blocks, captchas, login walls, and other defensive techniques. It's complicated to prevent them 100% of the time. 

But we can achieve a high success rate with simple efforts. We'll apply two tactics: adding proxies and full-set headers.


Proxies

There are free proxies, even though we don't recommend them. They might work for testing but aren't reliable, so we'll only use some of them in the examples below.

Note that these free proxies might not work for you. They're short-lived.

Paid proxy services, on the other hand, offer IP rotation. Our web scraper will work the same, but the target website will see a different IP. In some cases, they rotate for every request or every few minutes. 

In any case, they're much harder to ban. But if it happens, we'll get a new IP after a short time.
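On the client side, rotation can be sketched as keeping a pool and picking one proxy per request. The IPs below are placeholders, not working proxies:

```javascript
// Placeholder proxy pool; paid services provide their own endpoints
const proxies = [
	{ protocol: 'http', host: '10.0.0.1', port: 8080 },
	{ protocol: 'http', host: '10.0.0.2', port: 8080 },
];

// Pick a random proxy from the pool for each request
const randomProxy = () => proxies[Math.floor(Math.random() * proxies.length)];

// Usage with Axios would be: axios.get(url, { proxy: randomProxy() })
console.log(randomProxy().host); // One of the pool hosts
```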

We'll use HTTPBin for testing. It offers an API with several endpoints that will respond with headers, IP addresses, and more.

const axios = require('axios'); 
 
const proxy = { 
	protocol: 'http', 
	host: '202.212.123.44', // Free proxy from the list 
	port: 80, 
}; 
 
(async () => { 
	const { data } = await axios.get('https://httpbin.org/ip', { proxy }); 
 
	console.log(data); 
	// { origin: '202.212.123.44' } 
})(); 
	

HTTP Request Headers

The next step would be to check our request's HTTP headers. The most known one is User-Agent (UA for short), but there are many more. Many software tools send their own, for example, Axios (axios/0.21.1).

In general, it's a good practice to send actual headers along with the UA. That means we need a real-world set of headers because not all browsers and versions use the same ones. We include two in the snippet: Chrome 92 and Firefox 90 on a Linux machine.

const axios = require('axios'); 
 
// Helper function to get a random item from an array 
const sample = array => array[Math.floor(Math.random() * array.length)]; 
 
const headers = [ 
	{ 
		Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9', 
		'Accept-Encoding': 'gzip, deflate, br', 
		'Accept-Language': 'en-US,en;q=0.9', 
		'Sec-Ch-Ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"', 
		'Sec-Ch-Ua-Mobile': '?0', 
		'Sec-Fetch-Dest': 'document', 
		'Sec-Fetch-Mode': 'navigate', 
		'Sec-Fetch-Site': 'none', 
		'Sec-Fetch-User': '?1', 
		'Upgrade-Insecure-Requests': '1', 
		'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36', 
	}, 
	{ 
		Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 
		'Accept-Encoding': 'gzip, deflate, br', 
		'Accept-Language': 'en-US,en;q=0.5', 
		'Sec-Fetch-Dest': 'document', 
		'Sec-Fetch-Mode': 'navigate', 
		'Sec-Fetch-Site': 'none', 
		'Sec-Fetch-User': '?1', 
		'Upgrade-Insecure-Requests': '1', 
		'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0', 
	}, 
]; 
 
(async () => { 
	const { data } = await axios.get('https://httpbin.org/anything', { headers: sample(headers) }); 
 
	console.log(data); 
	// { 'User-Agent': '...Chrome/92...', ... } 
})();

Headless Browsers for Dynamic HTML

Until now, we fetched every page with axios.get, which can sometimes be inadequate. Say we need JavaScript to load and execute, or to interact with the page via mouse or keyboard.

While avoiding headless browsers would be preferable for performance reasons, sometimes there is no other choice. Selenium, Puppeteer, and Playwright are the most used and known libraries in the JavaScript and Node.js world.

The snippet below shows only the User-Agent. But since it's a real browser, the headers will include the entire set (Accept, Accept-Encoding, etc.).

const playwright = require('playwright'); 
 
(async () => { 
	// 'webkit' is also supported, but there is a problem on Linux 
	for (const browserType of ['chromium', 'firefox']) { 
		const browser = await playwright[browserType].launch(); 
		const context = await browser.newContext(); 
		const page = await context.newPage(); 
		await page.goto('https://httpbin.org/headers'); 
		console.log(await page.locator('pre').textContent()); 
		await browser.close(); 
	} 
})(); 
 
// "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/94.0.4595.0 Safari/537.36", 
// "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0",

This approach comes with its own problem: look at the User-Agents. The Chromium one includes HeadlessChrome. It'll tell the target web page that it's a headless browser. They might act upon that.

As with Axios, we can set headers, proxies, and other options to customize requests, which makes it an excellent way to hide our HeadlessChrome User-Agent. 

And since this is a real web browser, we can intercept requests, block others (like CSS files or images), take screenshots or videos, and more. Really handy for web scraping!

const playwright = require('playwright'); 
 
(async () => { 
	const browser = await playwright.chromium.launch({ 
		proxy: { server: 'http://91.216.164.251:80' }, // Another free proxy from the list 
	}); 
	const context = await browser.newContext(); 
	const page = await context.newPage(); 
	await page.setExtraHTTPHeaders({ referrer: 'https://news.ycombinator.com/' }); 
	await page.goto('http://httpbin.org/anything'); 
	console.log(await page.locator('pre').textContent()); // Print the complete response 
	await browser.close(); 
})(); 
 
// "Referrer": "https://news.ycombinator.com/" 
// "origin": "91.216.164.251"

Now we can separate getting the HTML in a couple of functions, one using Playwright and the other Axios. We would then need a way to select which is appropriate for the case. For now, it's hardcoded.

The output, by the way, is the same in both cases, but Axios gets it much faster.

const playwright = require('playwright'); 
const axios = require('axios'); 
const cheerio = require('cheerio'); 
 
const getHtmlPlaywright = async url => { 
	const browser = await playwright.chromium.launch(); 
	const context = await browser.newContext(); 
	const page = await context.newPage(); 
	await page.goto(url); 
	const html = await page.content(); 
	await browser.close(); 
 
	return html; 
}; 
 
const getHtmlAxios = async url => { 
	const { data } = await axios.get(url); 
 
	return data; 
}; 
 
(async () => { 
	const html = await getHtmlPlaywright('https://scrapeme.live/shop/page/1/'); 
	const $ = cheerio.load(html); 
	const content = extractContent($); 
	console.log('getHtmlPlaywright', content); 
})(); 
 
(async () => { 
	const html = await getHtmlAxios('https://scrapeme.live/shop/page/1/'); 
	const $ = cheerio.load(html); 
	const content = extractContent($); 
	console.log('getHtmlAxios', content); 
})();

Using JavaScript's Async for Parallel Crawling

We already introduced async/await when crawling several links sequentially. If we were to crawl them in parallel, removing the await would be enough, right? Well... not so fast.

The function would call the first crawl and take the following item from the toVisit set. The problem is that the set is empty since the crawling of the first page hasn't occurred yet. 

So we added no new links to the list. The function keeps running in the background, but we have already exited from the main one.

To do this properly, we must create a queue to execute tasks when available. To avoid many requests simultaneously, we'll limit its concurrency.

Neither JavaScript nor Node.js offers a built-in queue. For web scraping at scale, you can search for libraries that do it better.

const queue = (concurrency = 4) => { 
	let running = 0; 
	const tasks = []; 
 
	return { 
		enqueue: async (task, ...params) => { 
			tasks.push({ task, params }); // Add task to the list 
			if (running >= concurrency) { 
				return; // Do not run if we are above the concurrency limit 
			} 
 
			running += 1; // "Block" one concurrent task 
			while (tasks.length > 0) { 
				const { task, params } = tasks.shift(); // Take task from the list 
				await task(...params); // Execute task with the provided params 
			} 
			running -= 1; // Release a spot 
		}, 
	}; 
}; 
 
// Just a helper function, JS has no sleep function 
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms)); 
 
const printer = async num => { 
	await sleep(2000); 
	console.log(num, Date.now()); 
}; 
 
const q = queue(); 
// Add 8 tasks that will sleep and print a number 
for (let num = 0; num < 8; num++) { 
	q.enqueue(printer, num); 
}

Running the code above will print numbers from zero to three almost immediately with a timestamp, then four to seven after two seconds. It might be the hardest snippet to understand, so take your time to review it.

We define queue first. It returns an object with a single method, enqueue, which adds a task to the list and then checks if we're above the concurrency limit.

If we're not, it adds one to running and enters a loop that takes a task from the list and runs it with the provided parameters. Once the task list is empty, it subtracts one from running.

This variable marks when we can or can't execute any more tasks, only allowing it below the concurrency limit. After that come the helper functions sleep and printer. Finally, we instantiate the queue and enqueue eight items, which starts running four of them.

You just created a queue using JS in a few lines of code!

We have to use the queue now instead of a for loop to run several pages concurrently. The code below is partial with the parts that change.

const crawl = async url => { 
	// ... 
	links 
		.filter(link => !visited.has(link)) 
		.forEach(link => { 
			q.enqueue(crawlTask, link); // Add to queue instead of to the list 
		}); 
}; 
 
// Helper function that will call crawl after some checks 
const crawlTask = async url => { 
	if (visited.size >= maxVisits) { 
		console.log('Over Max Visits, exiting'); 
		return; 
	} 
 
	if (visited.has(url)) { 
		return; 
	} 
 
	await crawl(url); 
}; 
 
const q = queue(); 
// Add the first link to the process 
q.enqueue(crawlTask, url);

Remember that Node.js runs in a single thread. We can take advantage of its event loop but can't use multiple CPUs/threads. What we've seen works fine because the thread is idle most of the time, so network requests don't consume CPU time.

To build this further, we need to use some storage (database, CSV, or JSON file) or distributed queue system. Right now, we rely on variables that aren't shared between threads in Node.js. For the moment, showing the scraped data is enough for a demo.

It's not overly complicated, but we covered enough ground in this blog post. Well done!

Final Code

All the code is on the same .js file for the demo. Consider splitting it for a real-world use case. You can also see it on GitHub.

const axios = require('axios'); 
const playwright = require('playwright'); 
const cheerio = require('cheerio'); 
 
const url = 'https://scrapeme.live/shop/page/1/'; 
const useHeadless = false; // "true" to use playwright 
const maxVisits = 30; // Arbitrary number for the maximum of links visited 
const visited = new Set(); 
const allProducts = []; 
 
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms)); 
 
const getHtmlPlaywright = async url => { 
	const browser = await playwright.chromium.launch(); 
	const context = await browser.newContext(); 
	const page = await context.newPage(); 
	await page.goto(url); 
	const html = await page.content(); 
	await browser.close(); 
 
	return html; 
}; 
 
const getHtmlAxios = async url => { 
	const { data } = await axios.get(url); 
 
	return data; 
}; 
 
const getHtml = async url => { 
	return useHeadless ? await getHtmlPlaywright(url) : await getHtmlAxios(url); 
}; 
 
const extractContent = $ => 
	$('.product') 
		.map((_, product) => { 
			const $product = $(product); 
 
			return { 
				id: $product.find('a[data-product_id]').attr('data-product_id'), 
				title: $product.find('h2').text(), 
				price: $product.find('.price').text(), 
			}; 
		}) 
		.toArray(); 
 
const extractLinks = $ => [ 
	...new Set( 
		$('.page-numbers a') 
			.map((_, a) => $(a).attr('href')) 
			.toArray() 
	), 
]; 
 
const crawl = async url => { 
	visited.add(url); 
	console.log('Crawl: ', url); 
	const html = await getHtml(url); 
	const $ = cheerio.load(html); 
	const content = extractContent($); 
	const links = extractLinks($); 
	links 
		.filter(link => !visited.has(link)) 
		.forEach(link => { 
			q.enqueue(crawlTask, link); 
		}); 
	allProducts.push(...content); 
 
	// We can see how the list grows. Gotta catch 'em all! 
	console.log(allProducts.length); 
}; 
 
// Change the default concurrency or pass it as a param 
const queue = (concurrency = 4) => { 
	let running = 0; 
	const tasks = []; 
 
	return { 
		enqueue: async (task, ...params) => { 
			tasks.push({ task, params }); 
			if (running >= concurrency) { 
				return; 
			} 
 
			++running; 
			while (tasks.length) { 
				const { task, params } = tasks.shift(); 
				await task(...params); 
			} 
			--running; 
		}, 
	}; 
}; 
 
const crawlTask = async url => { 
	if (visited.size >= maxVisits) { 
		console.log('Over Max Visits, exiting'); 
		return; 
	} 
 
	if (visited.has(url)) { 
		return; 
	} 
 
	await crawl(url); 
}; 
 
const q = queue(); 
q.enqueue(crawlTask, url);

Conclusion

We'd like you to leave with four main points:

  1. Understand the basics of website parsing, crawling, and how to extract data.
  2. Separate responsibilities and use abstractions when necessary.
  3. Apply the required techniques to avoid blocks.
  4. Be able to figure out the following steps to scale up.

We can build a custom web scraper with JavaScript and Node.js using the pieces we've seen. It might not scale to thousands of websites, but it'll be enough for a few. And moving to distributed crawling and automation isn't that far from here.

If you liked it, you might be interested in the ultimate Python web scraping guide.
