How to Scrape Web Pages with Cheerio in Node.js

October 6, 2022 · 6 min read

In this article, you'll learn how to do web scraping with Cheerio and Node.js. Web scraping is the extraction of data from websites for specific use cases and analysis.

This tutorial will take you through how to use Cheerio to scrape Pokémon data from an example website and then create a JSON file from the data. The goal is to show you how to use Cheerio in Node.js to scrape any kind of web page, using Axios as an HTTP request client.

Let's get right into it!

What is Cheerio?

Cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server. It's a JavaScript library built on top of htmlparser2 that mirrors jQuery's API for server-side DOM manipulation, and it provides robust APIs for parsing and traversing markup data.

If you're new to web scraping in JavaScript and Node.js, the article introduction to web scraping with JavaScript and Node.js is a good starting point. And if you're unfamiliar with the jQuery syntax, a good Cheerio alternative for web scraping is Puppeteer.

Prerequisites

Before you begin this article, you'll need to have a good knowledge of writing code in JavaScript and Node.js. Also, make sure you have the following installed on your computer before continuing with the article:
  1. Node.js
  2. npm
  3. A code editor, such as VS Code or Atom

If you're unsure whether Node.js or npm are installed on your computer, you can confirm by running node -v and npm -v in your terminal, respectively.

Cheerio example usage

Here we'll demonstrate a basic usage of Cheerio in web scraping to help you understand how the tool works. To get started with scraping web data, you need to pass markup data for Cheerio to load to build a DOM. This is usually done using the load function, which accepts a string of markup data.

This is the recommended way to load HTML data for scraping. After loading the markup and initializing Cheerio, you can begin manipulating and traversing the resulting data structure with Cheerio's API.

An example is shown below:

const cheerio = require('cheerio'); 
 
const $ = cheerio.load('<h2 class="title">Hello world</h2>') // Load markup 
 
// Use a selector to grab the title class from the markup and change its text 
$('h2.title').text('Hello there!'); 
$('h2').addClass('welcome'); // Add a 'welcome' class to the markup 
 
console.log($.html()); // The html() method renders the document

This logs the following to the console:

<html><head></head><body><h2 class="title welcome">Hello there!</h2></body></html>

It's also worth noting that Cheerio automatically wraps the rendered markup in <html>, <head>, and <body> elements, just like a browser does, if they're not already present. However, you can disable this behavior by passing false as the third argument to the load function, like below:

// ... 
const $ = cheerio.load('<h2 class="title">Hello world</h2>', null, false) 
// ...

Cheerio's selector API

In this section, we'll look at some of Cheerio's selectors that can be used to traverse and manipulate markup data. If you've used jQuery before, the selector API will feel very familiar.

The function has the following structure: $(selector, [context], [root])
  1. selector - Used for targeting specific elements in the markup data. It's the starting point for traversing and manipulating the markup. It can be a string, a DOM element, an array of elements, or a Cheerio object.
  2. context - Optional. Defines the scope, i.e. where to begin looking for the target elements. It can be a string, a DOM element, an array of elements, or a Cheerio object.
  3. root - Optional. Usually the markup string you want to traverse or manipulate.

Optionally, you can load the markup data directly with the selector API like below:

const cheerio = require('cheerio'); 
 
const $ = cheerio.load('') 
 
// This loads the HTML data, selects the last list item and returns its text content. 
console.log($('li:last', 'ul', '<ul id="fruits"><li class="apple">Apple</li><li class="orange">Orange</li><li class="pear">Pear</li></ul>').text())

Note that the above method is not recommended for loading data, so you should only use it in rare cases.

For example, given the following HTML markup:

<ul id="fruits"> 
	<li class="apple">Apple</li> 
	<li class="orange">Orange</li> 
	<li class="pear">Pear</li> 
</ul>

We can target the list item with a class name of apple like so:

const cheerio = require('cheerio'); 
 
const $ = cheerio.load('<ul id="fruits"><li class="apple">Apple</li><li class="orange">Orange</li><li class="pear">Pear</li></ul>', null, false) // Load markup 
 
// Target the list item with 'apple' class name, then return its text content 
console.log($('li.apple').text())
Below are additional examples of selectors we can use:
  • $('li:last') - returns the last list item.
  • $('li:even') - returns all even <li> elements.

Almost all jQuery selectors are compatible, so you're not limited in what you can 'find' in a given piece of markup. You can refer to a comprehensive list of jQuery selectors here.

Extracting data with regex

It's also possible to search for data in Cheerio with regex patterns using JavaScript's string match() method.

For example, given the following HTML markup, which contains a list of usernames:

<ul class='usernames'> 
	<li>janedoe</li> 
	<li>maxweber</li> 
	<li>greengoblin</li> 
	<li>maxweber34</li> 
	<li>alpha123</li> 
	<li>chrisjones</li> 
	<li>amelia</li> 
	<li>mrjohn34</li> 
	<li>matjoe212</li> 
	<li>eliza</li> 
	<li>commando007</li> 
</ul>

We can search and return usernames that include digits using a regular expression like so:

const cheerio = require('cheerio'); 
 
const markup = '<ul class="usernames"><li>janedoe</li><li>maxweber</li><li>greengoblin</li><li>maxweber34</li><li>alpha123</li><li>chrisjones</li><li>amelia</li><li>mrjohn34</li><li>matjoe212</li><li>eliza</li><li>commando007</li></ul>' 
 
const $ = cheerio.load(markup); // Load markup and initialize Cheerio 
 
const usernames = $('.usernames li'); // Get all list items 
 
const usernamesWithDigits = []; 
 
usernames.each((index, el) => { 
	const regex = /\d/; // Search for usernames that contain digits 
	const hasNumber = $(el).text().match(regex); 
	if (hasNumber !== null) { 
		usernamesWithDigits.push(hasNumber.input) 
	} 
}) 
 
console.log(usernamesWithDigits) // Log usernames that contain digits to the console.

The above code logs the following to the console:

[ 'maxweber34', 'alpha123', 'mrjohn34', 'matjoe212', 'commando007' ]

How to use Cheerio to scrape a web page

In this section, you'll learn how to scrape Pokémon data from ScrapeMe and then turn the resulting data into a JSON file. The web page looks like the image below:

ScrapeMe homepage

Let's get started.

Step 1 - Create a working directory

To get started, you'll need to create a project directory. Run the command below in your terminal to create it and move into it:

mkdir cheerio-web-scraping && cd cheerio-web-scraping

Obviously, you don't have to name your project cheerio-web-scraping. You can always choose a name you're comfortable with.

Step 2 - Initialize a node project

Firstly, make sure you are in the project directory, and then run the following command to initialize a node project:

npm init -y

The above command will create a new project with a package.json file for configuration.

Note that using the -y flag will automatically skip the interactive prompts and configure the project.

Step 3 - Install project dependencies

In this part, we'll install the necessary dependencies for our project. Run the command below to install the packages:

npm install cheerio axios
The above command will install the following packages:
  1. Cheerio - The main package we'll be using for scraping.
  2. Axios - A promise-based HTTP client for the browser and Node.js.

Step 4 - Inspect the target website

Before writing any code for web scraping, it's crucial that you understand the structure and content of the website you're scraping. It will be your guide when writing your code, so you don't waste time scraping useless data. Every modern browser comes with Developer Tools for inspecting websites and applications.

In this tutorial, we will use Chrome's DevTools to inspect our target website. We want to scrape three things from every Pokémon on our target website:
  • Picture URL
  • Name
  • Price

The image below shows how the website is structured using Chrome's DevTools:

Pokemon product on DevTools

As you can see from the image above, a <ul> element with a class name of products contains the list of all Pokémon. And in the image below, we can also see that each list element has a class name of product and contains all the data we want to scrape.

Pokemon list items
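Putting the two screenshots together, the markup we'll traverse looks roughly like the sketch below. The products and product class names come from the inspection above; the product name, price, and URL shown are purely illustrative:

```html
<ul class="products">
	<li class="product">
		<a href="https://scrapeme.live/shop/Pikachu/">
			<img src="...">
			<h2>Pikachu</h2>
			<span class="price"><span class="amount">£60.00</span></span>
		</a>
	</li>
	<!-- ...more product list items... -->
</ul>
```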

Now that we understand our target website structure, we can dive into writing some code. Let's get started!

Step 5 - Write the code

The next step is to create an index.js file in your project root directory. Open the index.js file and add the following code to it:

const axios = require('axios'); 
const cheerio = require('cheerio'); 
const fs = require('fs'); 
 
const targetURL = 'https://scrapeme.live/shop/'; 
 
const getPokemons = ($) => { 
	// Get all list items from the unordered list with a class name of 'products' 
	const pokemons = $('.products li'); 
	const pokemonData = []; 
	// The 'each()' method loops over all pokemon list items 
	pokemons.each((index, el) => { 
		// Get the image, name, and price of each pokemon and create an object 
		const pokemon = {} 
 
		// Selector to get the image 'src' value of a pokemon 
		pokemon.img = $(el).find('a > img').attr('src'); 
		pokemon.name = $(el).find('h2').text(); // Selector to get the name of a pokemon 
		pokemon.price = $(el).find('.amount').text(); // Selector to get the price of a pokemon 
		pokemonData.push(pokemon) 
	}) 
 
	// Create a 'pokemon.json' file in the root directory with the scraped pokemonData 
	fs.writeFile("pokemon.json", JSON.stringify(pokemonData, null, 2), (err) => { 
		if (err) { 
			console.error(err); 
			return; 
		} 
		console.log("Data written to file successfully!"); 
	}); 
} 
 
// axios function to fetch HTML Markup from target URL 
axios.get(targetURL).then((response) => { 
	const body = response.data; 
	const $ = cheerio.load(body); // Load HTML data and initialize cheerio 
	getPokemons($) 
});

Let's understand what's going on in the code above.

Our scraper code has two main parts: the getPokemons function and an Axios request that fetches the page's HTML. In getPokemons, we write the actual scraping code that uses Cheerio's selectors to search for the Pokémon data we're looking for.

Firstly, we use the selector $('.products li') to get all <li> items from the element with a class name of .products (in this case, a <ul> element). This returns a Cheerio object containing all list items. We then use Cheerio's each() method to iterate over the Cheerio object and run a callback function that executes for each list item in the Cheerio object.

In the callback, we have some selectors that are searching through the markup for the pokemon data we want. Let's go through them:
  • $(el).find('a > img').attr('src') - to get the image URL of a pokémon, this selector goes through a list item, searches for an <img> element with the find() traversing method, and returns the value of the src attribute using Cheerio's attr() method.
  • $(el).find('h2').text() - to get the name of a pokémon, this selector searches for an <h2> element in a list item using Cheerio's find() traversing method and returns the text content of the element.
  • $(el).find('.amount').text() - to get the price of a pokémon, this selector finds an element with a class name of .amount (in this case, a <span> element) and returns its text content.

The pokemon object holds the data for each Pokémon, which we then push into a pokemonData array. This array will contain all the scraped Pokémon.

The next step is creating a JSON file from the pokemonData array. For this, we use fs.writeFile, which writes data to a file asynchronously. We pass it the result of JSON.stringify(), which converts the pokemonData array into a JSON string, along with a callback that runs once the write completes.
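To see what JSON.stringify() produces on its own, here's a small standalone sketch with hypothetical data (the third argument, 2, pretty-prints the output with 2-space indentation):

```javascript
const pokemonData = [
	{ img: 'https://example.com/pikachu.png', name: 'Pikachu', price: '£60.00' },
];

const json = JSON.stringify(pokemonData, null, 2); // Indented JSON string
console.log(json);

// JSON.parse recovers the original structure from the string
const parsed = JSON.parse(json);
console.log(parsed[0].name); // Pikachu
```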

Now a final step. Open the package.json file and add the following code to it:

// ... 
"scripts": { 
	"dev": "node index.js" 
}, 
// ...

Now, if you run npm run dev in your terminal, a pokemon.json file should be created automatically, and the message "Data written to file successfully!" should also be logged to the console.

This is what the created pokemon.json file looks like:

Pokemon JSON result

Congratulations, you just built a web scraper with Cheerio in Node.js! You're now ready to take on every piece of data on the web!

Conclusion

In summary, you've learned how to scrape web data using Cheerio and Axios in Node.js. We walked you through the process of web scraping, what you need to know before scraping websites, and the tools for getting it done.

We hope this article helps you get started with building your own web scrapers, and we are excited to see what you build next!

A frequent challenge you'll encounter is being blocked by websites, so we recommend you take a look at our Web Scraping Without Getting Blocked guide.


