JavaScript, Node.js and Cheerio Integration

Learn how to integrate the ZenRows API with Axios and Cheerio to scrape any website, from the most basic calls to advanced features such as concurrency and auto-retry. From installation to the final code, we will go step by step, explaining everything we code.

To just grab the code, go to the final snippet and copy it. It is commented with the parts that must be filled in and helpful remarks for the complicated details.

For the code to work, you will need Node.js (or nvm) and npm installed. Some systems have them pre-installed. After that, install all the necessary libraries by running npm install.

npm install axios axios-retry cheerio

You will also need to register to get your API Key.
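If you prefer not to hard-code the key in your scripts, a common pattern is to read it from an environment variable. This is a minimal sketch; the variable name ZENROWS_API_KEY and the helper getApiKey are our own choices, not part of any library:

```javascript
// Read the API Key from an environment variable instead of hard-coding it.
// ZENROWS_API_KEY is an arbitrary name chosen for this example.
const getApiKey = () => {
	const apikey = process.env.ZENROWS_API_KEY;
	if (!apikey) {
		throw new Error("Missing ZENROWS_API_KEY environment variable");
	}
	return apikey;
};
```

You would then run the script with something like `ZENROWS_API_KEY=YOUR_KEY node script.js`.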

Using Axios to Get a Page

The first library we will see is Axios, a "promise based HTTP client for the browser and node.js". It exposes a get method that calls a URL and returns a response with the HTML. For the moment, we won't be using any parameters; this is just a demo to see how it works.

Careful! This script will run without any proxy, and the server will see your real IP. You don't need to execute this snippet.

const axios = require("axios"); 
 
const url = ""; // ... your URL here 
axios.get(url).then((response) => { 
	console.log(response.data); // page's HTML 
});

Calling ZenRows API with Axios

Connecting Axios to the ZenRows API is straightforward. axios.get's target will be the API base, and the second parameter is an object with params: apikey for authentication and url. URLs must be encoded, but Axios handles that when using params.

With this simple change, we will handle all the hassles of scraping, such as proxy rotation, bypassing CAPTCHAs and anti-bot solutions, setting correct headers, and many more. However, there are still some challenges that we will address now. Continue reading.

const axios = require("axios"); 
 
const url = ""; // ... your URL here 
const apikey = "YOUR_KEY"; // paste your API Key here 
const zenrowsApiBase = "https://api.zenrows.com/v1/"; 
 
axios 
	.get(zenrowsApiBase, { 
		params: { apikey, url }, 
	}) 
	.then((response) => { 
		console.log(response.data); // page's HTML 
	});

Extracting Basic Data with Cheerio

We will now parse the page's HTML with Cheerio and extract some data. We'll create a simple function extractContent to return URL, title, and h1 content. Your custom extracting logic goes there.

Cheerio offers a "jQuery-like" syntax, and it is designed to work on the server. Its load method receives plain HTML and creates a querying function that allows us to find elements. You can then query with CSS selectors and navigate, manipulate, or extract content as a browser would. The resulting selection exposes a text method, which gives us the content in plain text, without tags. Check the docs for more advanced features.

const axios = require("axios"); 
const cheerio = require("cheerio"); 
 
const url = ""; // ... your URL here 
const apikey = "YOUR_KEY"; // paste your API Key here 
const zenrowsApiBase = "https://api.zenrows.com/v1/"; 
 
const extractContent = (url, $) => ({ 
	// extracting logic goes here 
	url, 
	title: $("title").text(), 
	h1: $("h1").text(), 
}); 
 
axios 
	.get(zenrowsApiBase, { 
		params: { apikey, url }, 
	}) 
	.then((response) => { 
		const $ = cheerio.load(response.data); 
		const content = extractContent(url, $); 
		console.log(content); // custom scraped content 
	});

List of URLs with Concurrency

We've seen how to scrape a single URL. We will now introduce a list of URLs, which is closer to an actual use case. We'll also set up concurrency so we don't have to wait for a sequential process to finish. It allows the script to process several URLs simultaneously, always up to a maximum. That number depends on the plan you are on.
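For context, this is roughly what limiting concurrency by hand looks like in plain JavaScript. It is a simplified sketch for illustration only (the function name runWithConcurrency is ours), and the SDK introduced below takes care of all of it for us:

```javascript
// Minimal concurrency pool: run async tasks with at most `limit` in flight.
// Simplified sketch for illustration; not part of any library.
const runWithConcurrency = async (tasks, limit) => {
	const results = [];
	let next = 0;

	// Each worker pulls the next pending task until none are left
	const worker = async () => {
		while (next < tasks.length) {
			const index = next++;
			results[index] = await tasks[index]();
		}
	};

	// Start `limit` workers and wait for all of them to drain the queue
	await Promise.all(Array.from({ length: limit }, worker));
	return results;
};
```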

The ZenRows JavaScript SDK provides full concurrency support out of the box, since plain JavaScript's support for limiting concurrency is limited.

npm i zenrows

It will enqueue and execute all our requests, handling the parallelism for us and keeping several requests going simultaneously, but never over the limit (10 in the example). Once all the requests finish, we will print the results. In a real case, you would, for example, store them in a database.

const { ZenRows } = require("zenrows"); 
const cheerio = require("cheerio"); 
 
const apikey = "YOUR_KEY"; // paste your API Key here 
const urls = [ 
	// ... your URLs here 
]; 
(async () => { 
	const client = new ZenRows(apikey, { concurrency: 10 }); 
 
	const extractContent = (url, $) => ({ 
		// extracting logic goes here 
		url, 
		title: $("title").text(), 
		h1: $("h1").text(), 
	}); 
 
	const scrapeUrl = async (url) => { 
		try { 
			const response = await client.get(url); 
			const $ = cheerio.load(response.data); 
 
			return extractContent(url, $); 
		} catch (error) { 
			return { url, error: error.message }; 
		} 
	}; 
 
	const promises = urls.map((url) => scrapeUrl(url)); 
 
	const results = await Promise.allSettled(promises); 
	console.log(results); 
	/* 
		[ 
			{ 
				status: "fulfilled", 
				value: { 
					url: "YOUR_FIRST_URL", 
					title: "First Title", 
					h1: "Some Important H1" 
				} 
			}, 
			... 
		] 
	*/ 
})();

Auto-Retry Failed Requests

The last step to having a robust scraper is to retry failed requests. We could use axios-retry, but the SDK already does that.

The basic idea goes like this:
  1. Identify the failed requests based on the returned status code.
  2. Wait an arbitrary amount of time. Using the library's exponentialDelay, the wait grows exponentially between attempts, plus a random margin.
  3. Retry the request until it succeeds or reaches a maximum amount of retries.

Keep in mind that all the retries will take place on the same concurrency slot, effectively blocking it. Not all errors are temporary, so retrying might not solve the issue. For those cases, a better strategy is to store the URL as failed and enqueue it again after some minutes.
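A minimal sketch of that store-and-requeue strategy, built on the Promise.allSettled results from the previous snippet. The helpers collectFailedUrls and requeueLater, and the 5-minute delay, are our own choices, not part of the SDK:

```javascript
// Collect URLs whose scrape failed so they can be enqueued again later.
// Works on Promise.allSettled results shaped like the snippet above,
// where a failed scrape fulfills with an { url, error } object.
const collectFailedUrls = (results) =>
	results
		.filter((r) => r.status === "fulfilled" && r.value.error)
		.map((r) => r.value.url);

// Re-enqueue the failed URLs after a delay (5 minutes here, as an example)
const requeueLater = (failedUrls, scrapeUrl, delayMs = 5 * 60 * 1000) => {
	if (failedUrls.length === 0) return;
	setTimeout(() => {
		failedUrls.forEach((url) => scrapeUrl(url));
	}, delayMs);
};
```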

Passing an integer value in the SDK constructor options is enough to set the number of retries you want. Visit the article on Retry Failed Requests for more info.

const { ZenRows } = require("zenrows"); 
 
// same snippet as above 
 
const apikey = "YOUR_KEY"; // paste your API Key here 
const client = new ZenRows(apikey, { concurrency: 10, retries: 3 });

If the implementation does not work for your use case or you have any problem, contact us and we'll gladly help you.