Headless Browser in NodeJS with Puppeteer [2024]

May 20, 2024 · 9 min read

Using a headless browser in NodeJS allows developers to control a browser such as Chrome with code, which makes it possible to interact with web pages and simulate human behavior programmatically.

Today, we'll look into how to use Puppeteer, the most popular headless browser library for NodeJS, for web scraping.

What Is a Headless Browser in NodeJS?

A headless browser in NodeJS is an automated browser that runs without a Graphical User Interface (GUI), so it uses fewer resources and runs faster than a regular browser. It renders JavaScript and performs actions (submitting forms, scrolling, etc.) like a human would.

How to Run a Headless Browser in NodeJS with Puppeteer

Now that you know what a headless browser is, let's dig into running one with Puppeteer to interact with elements on the page and scrape data.

As a target site, we'll use ScrapingCourse.com, a demo website with e-commerce features.

ScrapingCourse.com Ecommerce homepage

Prerequisites

Ensure you have NodeJS installed (npm ships with it) before moving forward.

Create a new directory and initialize a NodeJS project using npm init -y. Then, install Puppeteer with the command below:

Terminal
npm i puppeteer

Note: Puppeteer downloads a recent, compatible version of Chromium when you run the installation command. If you'd rather manage browsers yourself or connect to a remote browser, use the puppeteer-core package instead, which doesn't download Chromium by default.

Then, create a new scraper.js file inside the headless browser JavaScript project you initialized above.

Terminal
touch scraper.js

We're ready to get started now!

Step 1: Open the Page

Let's begin by opening the site we want to scrape. For that, launch a browser instance, create a new page and navigate to our target site.

scraper.js
const puppeteer = require('puppeteer');

(async () => {
  // launches a browser instance
  const browser = await puppeteer.launch();
  // creates a new page in the default browser context
  const page = await browser.newPage();
  // navigates to the page to be scraped 
  const response = await page.goto('https://www.scrapingcourse.com/ecommerce/');

  // logs the status of the request to the page
  console.log('Request status: ', response?.status(), '\n\n\n\n');

  // closes the browser instance
  await browser.close();
})();

Note: The close() method is called at the end to close Chromium and all its pages.
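
If an error is thrown between launch() and close(), the script exits its main function without closing Chromium. One way to guard against that is a try/finally wrapper. The following is a minimal sketch, not part of the Puppeteer API: withBrowser is a hypothetical helper name, and launch is assumed to be any function that returns a browser, such as () => puppeteer.launch().

```javascript
// Hypothetical helper: runs `task` with a browser and always closes it afterwards.
// `launch` is any function returning a browser, e.g. () => puppeteer.launch().
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    // runs on success and on error, so Chromium is not left running
    await browser.close();
  }
}

// Usage inside scraper.js (requires the puppeteer package):
// const puppeteer = require('puppeteer');
// withBrowser(() => puppeteer.launch(), async (browser) => {
//   const page = await browser.newPage();
//   const response = await page.goto('https://www.scrapingcourse.com/ecommerce/');
//   console.log('Request status: ', response?.status());
// });
```

The helper returns whatever the task returns and rethrows any error after the browser has been closed.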

Run the code using node scraper on the terminal. It'll log the status code of the request to ScrapingCourse.com, as shown below:

Output
Request status:  200 

Congratulations! 200 shows your request was successful. Now, you're ready to do some scraping.

Step 2: Scrape the Data

Our goal is to scrape all product names on the homepage and display them in a list. Here's what you need to do:

Use your regular browser to go to ScrapingCourse.com and locate any product card, then right-click on the name of the product and select "Inspect" to open Chrome DevTools. The browser will highlight the selected element, as shown below.

scrapingcourse ecommerce homepage selected product h2 class

The selected element holding the product name is an h2 with the woocommerce-loop-product__title class. If you inspect other ones on that page, you'll see they all have the same class. We can use that to target all the name elements and, in turn, scrape them.

The Puppeteer Page API provides several methods to select elements on a page. One example is Page.$$eval(selector, pageFunction, ...args), which runs document.querySelectorAll with the selector inside the page and passes the matching elements as an array to its second argument, the pageFunction callback. $$eval() then resolves to whatever that callback returns.

Let's leverage this. Update your scraper.js file with the below code:

scraper.js
const puppeteer = require('puppeteer');

(async () => {
  // launches a browser instance
  const browser = await puppeteer.launch();
  // creates a new page in the default browser context
  const page = await browser.newPage();

  // remove timeout limit
  page.setDefaultNavigationTimeout(0); 

  // navigates to the page to be scraped 
  await page.goto('https://www.scrapingcourse.com/ecommerce/');

  // gets an array of all product names
  const names = await page.$$eval('.woocommerce-loop-product__title', (nodes) => nodes.map((n) => n.textContent));
  
  console.log('Number of products: ', names.length);
  console.log('List of products: ', names.join(', '), '\n\n\n');

  // closes the browser instance
  await browser.close();
})();

As in the last example, the script creates a browser instance and a page. This time, however, page.setDefaultNavigationTimeout(0) disables the navigation timeout (a value of 0 means no limit) instead of keeping the default of 30,000 ms (30 seconds), so slow page loads no longer throw timeout errors.
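
Disabling the timeout entirely means a page that never finishes loading can keep the script waiting forever. An alternative is to keep a finite timeout and retry when it expires. The sketch below is one possible approach under a few assumptions: gotoWithRetry is a hypothetical helper name, the 60,000 ms and 2-retry defaults are arbitrary, and timeouts are recognized by the error's name being 'TimeoutError', which is how Puppeteer's TimeoutError class identifies itself.

```javascript
// Hypothetical helper: navigates with a finite timeout and retries on timeouts only.
async function gotoWithRetry(page, url, { timeoutMs = 60000, retries = 2 } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      // `timeout` is the maximum navigation time in milliseconds
      return await page.goto(url, { timeout: timeoutMs });
    } catch (error) {
      const isTimeout = error && error.name === 'TimeoutError';
      // rethrow anything that is not a timeout, or when retries are used up
      if (!isTimeout || attempt >= retries) throw error;
      console.warn(`Navigation timed out, retrying (${attempt + 1}/${retries})`);
    }
  }
}

// Usage inside scraper.js, instead of calling page.goto() directly:
// await gotoWithRetry(page, 'https://www.scrapingcourse.com/ecommerce/');
```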

Furthermore, n.textContent gets the text of each node (element) with the woocommerce-loop-product__title class, and the $$eval() function returns an array of the product names.

Finally, the code logs the number of products scraped and creates a comma-separated list of the names.

Run the script again, and you'll see an output like this:

Output
Number of products:  16
List of products:  Abominable Hoodie, Adrienne Trek Jacket, Aeon Capri, Aero Daily Fitness Tee, Aether Gym Pant, Affirm Water Bottle, Aim Analog Watch, Ajax Full-Zip Sweatshirt, Ana Running Short, Angel Light Running Short, Antonia Racer Tank, Apollo Running Short, Arcadio Gym Short, Argus All-Weather Tank, Ariel Roll Sleeve Sweatshirt, Artemis Running Short

Cool!
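
The callback passed to $$eval() runs inside the browser page, not in NodeJS, so it can only use its arguments and browser globals. It can also be written as a named function, which makes it easier to reuse. Here's a small sketch, assuming we also want to trim whitespace and skip empty titles; extractNames is a hypothetical name.

```javascript
// Page function for $$eval(): it runs inside the browser, so it must not rely
// on variables from the NodeJS scope.
function extractNames(nodes) {
  return nodes
    .map((n) => (n.textContent || '').trim())
    .filter((name) => name.length > 0);
}

// Usage inside scraper.js, after page.goto():
// const names = await page.$$eval('.woocommerce-loop-product__title', extractNames);
```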

Next, let's see how to interact with the webpage using Puppeteer, another capability a headless browser gives us.

Interact with Elements on the Page

There are some Page APIs for interacting with elements on a page. For example, the Page.type(selector, text) method can send keydown, keyup and input events.

Take a look at the search field on the top right of the target site, which we can use. Inspect the element, and you'll see this:

ScrapingCourse search field inspection

The search field has the woocommerce-product-search-field-0 ID. We can select the element with this and trigger input events on it. To do so, add the below code between the page.goto() and browser.close() methods in your scraper.js file.

scraper.js
  const searchFieldSelector = '#woocommerce-product-search-field-0';

  const getSearchFieldValue = async () => await page.$eval(searchFieldSelector, el => el.value);
  
  console.log('Search field value before: ', await getSearchFieldValue());
  // type instantly into the search field
  await page.type(searchFieldSelector, 'Atlas Fitness Tank');
  console.log('Search field value after: ', await getSearchFieldValue());

We used the page.type() method to type the phrase "Atlas Fitness Tank" into the field.

Rerun the scraper file, and you should get this output:

Output
Search field value before:  
Search field value after:  Atlas Fitness Tank

The value of the search field changed, indicating the input events were successfully triggered.
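
page.type() also accepts a delay option, which waits the given number of milliseconds between key presses instead of typing instantly. A minimal sketch, where typeSlowly is a hypothetical helper name and the 100 ms default is an arbitrary choice:

```javascript
// Hypothetical helper: types `text` with a pause between keystrokes and
// returns the resulting value of the field.
async function typeSlowly(page, selector, text, delayMs = 100) {
  // `delay` is the time to wait between key presses, in milliseconds
  await page.type(selector, text, { delay: delayMs });
  return page.$eval(selector, (el) => el.value);
}

// Usage inside scraper.js, between page.goto() and browser.close():
// const value = await typeSlowly(page, '#woocommerce-product-search-field-0', 'Atlas Fitness Tank');
// console.log('Search field value after: ', value);
```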

Great! Let's explore other useful capabilities now.


Advanced Headless Browsing with Puppeteer in NodeJS

In this section, you'll learn how to up your Puppeteer headless browser game.

Take a Screenshot

Imagine you want to take screenshots, for instance to check visually that your scraper is working properly. The good news is that Puppeteer can do this with the screenshot() method.

scraper.js
  // takes a screenshot of the search term in the search box
  await page.screenshot({ path: 'scrapingcourse-search-result.png' });
  console.log('Screenshot taken');

Note: The path option specifies the screenshot's location and filename.

Run the scraper file again, and you'll see a "scrapingcourse-search-result.png" image file created in the root directory of your project upon execution:

ScrapingCourse search term
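
By default, screenshot() captures only the visible viewport. The sketch below shows two variations: the fullPage option for the whole scrollable page, and a screenshot of a single element through its element handle. captureScreenshots and the file name suffixes are hypothetical choices made for this example.

```javascript
// Hypothetical helper: saves a full-page screenshot and, if the element exists,
// a screenshot of just that element. Returns true when the element was found.
async function captureScreenshots(page, selector, basePath) {
  // the full scrollable page, not only the visible viewport
  await page.screenshot({ path: `${basePath}-full.png`, fullPage: true });
  // page.$() resolves to null when nothing matches the selector
  const element = await page.$(selector);
  if (element) {
    await element.screenshot({ path: `${basePath}-element.png` });
  }
  return Boolean(element);
}

// Usage inside scraper.js, between page.goto() and browser.close():
// await captureScreenshots(page, '#woocommerce-product-search-field-0', 'scrapingcourse');
```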

Wait for the Content to Load

It's a best practice to wait for the whole page or part of it to load when web scraping to make sure everything has been displayed. Let's see an example of why.

Assume you want to get the description of the first product on the target homepage. For that, we can simulate a click event on its image, which will trigger another page load that will contain its description.

Inspecting the first product's image on the homepage reveals a link with the woocommerce-LoopProduct-link and woocommerce-loop-product__link classes.

scrapingcourse ecommerce homepage selected product ahref

And, on the page that loads after clicking the first product's image, the description reveals a div element with a woocommerce-product-details__short-description class.

ScrapingCourse first product page inspection

We'll use these classes as selectors for the elements. So you need to update the code between the page.goto() and browser.close() methods with the one below:

scraper.js
  // selectors
  const productDetailsSelector = '.woocommerce-product-details__short-description',
    productLinkSelector = '.woocommerce-LoopProduct-link.woocommerce-loop-product__link';
  // clicks on the first product image link (triggers a new page load)
  await page.$$eval(productLinkSelector, (links) => links[0]?.click());
  // gets the content of the description from the element
  const description = await page.$eval(productDetailsSelector, (node) => node.textContent);
  // logs the description of the product
  console.log('Description: ', description);

There, the $$eval() method selects all available product links and clicks the first one, and the $eval() method targets the description element and gets its content.

Now it's time to run the scraper. Unfortunately, we get an error:

Output
#error = new Errors_js_1.ProtocolError();
             ^

ProtocolError: Protocol error (DOM.describeNode): Cannot find context with specified id

It occurred because Puppeteer was trying to get the description element before it was loaded.

To fix this, add the waitForSelector(selector) method to wait for the description element. The promise it returns resolves only once the element is present on the page. We could also wait for the page to load with waitForNavigation. Either will do the job, but we recommend waiting for a selector if possible.

scraper.js
  // selectors
  const productDetailsSelector = '.woocommerce-product-details__short-description',
    productLinkSelector = '.woocommerce-LoopProduct-link.woocommerce-loop-product__link';
  // clicks on the first product image link (triggers a new page load)
  await page.$$eval(productLinkSelector, (links) => links[0]?.click());
  // waits for the element with the description of the product
  await page.waitForSelector(productDetailsSelector);
  // gets the content of the description from the element
  const description = await page.$eval(productDetailsSelector, (node) => node.textContent);
  // logs the description of the product
  console.log('Description: ', description);

Run the scraper again. This time, no error appears, and the description of the product is logged.

Output
Description: This is a variable product called a Abominable Hoodie
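
For comparison, the waitForNavigation alternative could look like the sketch below. The important detail is to start waiting before the click; otherwise a fast navigation may finish before the wait begins. clickAndWaitForNavigation is a hypothetical helper name, and page.click() clicks the first element matching the selector.

```javascript
// Hypothetical helper: clicks the first element matching `selector` and waits
// for the navigation that the click triggers.
async function clickAndWaitForNavigation(page, selector) {
  // waitForNavigation() is called first so the wait is already active when
  // the click happens
  const [response] = await Promise.all([
    page.waitForNavigation(),
    page.click(selector),
  ]);
  return response;
}

// Usage inside scraper.js, between page.goto() and browser.close():
// await clickAndWaitForNavigation(page, '.woocommerce-LoopProduct-link.woocommerce-loop-product__link');
```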

Scrape Multiple Pages

Do you remember we scraped a list of products earlier on? We can also scrape the descriptions of each one from their respective page.

For that, use the array of product names and links to loop through, updating the code between the page.goto() and browser.close() methods:

scraper.js
  // selectors
  const productDetailsSelector = '.woocommerce-product-details__short-description',
    productLinkSelector = '.woocommerce-LoopProduct-link.woocommerce-loop-product__link';
  // get a  list of product names and links
  const list = await page.$$eval(productLinkSelector,
    ((links) => links.map(link => {
        return {
            name: link.querySelector('h2').textContent,
            link: link.href
        };
    }))
  );
  for (const { name, link } of list) {
    // page.goto() already waits for the navigation to finish
    await page.goto(link);
    // waits for the description element on the newly loaded page
    await page.waitForSelector(productDetailsSelector);
    const description = await page.$eval(productDetailsSelector, (node) => node.textContent);
    console.log(name + ': ' + description);
  }

This is the complete code:

scraper.js
const puppeteer = require('puppeteer');

(async () => {
  // launches a browser instance
  const browser = await puppeteer.launch({headless:'new'});
  // creates a new page in the default browser context
  const page = await browser.newPage();

  // remove timeout limit
  page.setDefaultNavigationTimeout(0); 

  // navigates to the page to be scraped 
  await page.goto('https://www.scrapingcourse.com/ecommerce/');

  const productDetailsSelector = '.woocommerce-product-details__short-description',
    productLinkSelector = '.woocommerce-LoopProduct-link.woocommerce-loop-product__link';
  // get a  list of product names and links
  const list = await page.$$eval(productLinkSelector,
    ((links) => links.map(link => {
        return {
            name: link.querySelector('h2').textContent,
            link: link.href
        };
    }))
  );
  for (const { name, link } of list) {
    // page.goto() already waits for the navigation to finish
    await page.goto(link);
    // waits for the description element on the newly loaded page
    await page.waitForSelector(productDetailsSelector);
    const description = await page.$eval(productDetailsSelector, (node) => node.textContent);
    console.log(name + ': ' + description);
  }
  await browser.close();
})();

When you run the scraper file, you should start seeing the products and their descriptions logged on the terminal.

Output
Abominable Hoodie: This is a variable product called a Abominable Hoodie

Adrienne Trek Jacket: This is a variable product called a Adrienne Trek Jacket

//... other products omitted for brevity

Ariel Roll Sleeve Sweatshirt: This is a variable product called a Ariel Roll Sleeve Sweatshirt

Artemis Running Short: This is a variable product called a Artemis Running Short
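
When looping over many pages, a single failed page would otherwise stop the whole run. The sketch below wraps each iteration in try/catch and collects the results in an array instead of logging them directly. scrapeDescriptions and the shape of the result objects are hypothetical choices for this example.

```javascript
// Hypothetical helper: visits every link in `list` and collects the descriptions.
// A failure on one page is recorded and the loop continues with the next one.
async function scrapeDescriptions(page, list, detailsSelector) {
  const results = [];
  for (const { name, link } of list) {
    try {
      await page.goto(link);
      await page.waitForSelector(detailsSelector);
      const description = await page.$eval(detailsSelector, (node) => node.textContent.trim());
      results.push({ name, link, description });
    } catch (error) {
      // record the failure instead of aborting the whole run
      results.push({ name, link, error: error.message });
    }
  }
  return results;
}

// Usage inside scraper.js, after building `list`:
// const results = await scrapeDescriptions(page, list, productDetailsSelector);
// console.log(JSON.stringify(results, null, 2));
```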

Optimize Puppeteer Scripts

Like most tools, Puppeteer can be optimized to improve its general speed and performance. Here are some of the ways to do so:

Block Unnecessary Requests

Blocking requests you don't need reduces bandwidth use and speeds up page loads. In Puppeteer, you can create an interceptor for the types of resources you don't need.

Since we only need the HTML documents when targeting ScrapingCourse.com, blocking other resource types, like images or stylesheets, makes sense.

scraper.js
  // allows interception of requests
  await page.setRequestInterception(true);
  // listens for requests being triggered
  page.on('request', (request) => {
    if (request.resourceType() === 'document') {
      // allow request to be made
      request.continue();
    } else {
      // cancel request
      request.abort();
    }
  });
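
Allowing only documents also blocks scripts and XHR/fetch calls, which can break pages that render their content with JavaScript. A gentler variant blocks a specific set of resource types and lets everything else through. In this sketch, handleRequest and the choice of blocked types are assumptions to adapt to your target site.

```javascript
// Resource types to block; everything else (documents, scripts, XHR, ...) is allowed.
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'stylesheet', 'font', 'media']);

// Hypothetical request handler: aborts blocked resource types, continues the rest.
function handleRequest(request) {
  if (BLOCKED_RESOURCE_TYPES.has(request.resourceType())) {
    return request.abort();
  }
  return request.continue();
}

// Usage inside scraper.js, before page.goto():
// await page.setRequestInterception(true);
// page.on('request', handleRequest);
```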

Cache Resources

Caching resources on disk lets the Puppeteer headless browser reuse them instead of requesting them again. By default, every new browser instance creates a temporary user data directory, which houses the cache, so nothing is reused between runs.

We can specify a permanent directory shared by all browser instances with the userDataDir option of the puppeteer.launch() method.

scraper.js
  // launches a browser instance
  const browser = await puppeteer.launch({
    userDataDir: './user_data',
  });

Set the Headless Mode

Puppeteer allows you to set the browser mode using the headless option of the puppeteer.launch() method.

The headless option is true by default. Changing the value to false stops Puppeteer from running in headless mode; instead, it runs with a GUI.

scraper.js
  // launches a browser instance
  const browser = await puppeteer.launch({
    headless: false,
  });

Note: You should perform scraping in headless mode when in production since a graphical interface is just for testing.
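
To switch between the two modes without editing the code, the option can be derived from an environment variable. This is a minimal sketch: HEADFUL is a hypothetical variable name chosen for this example, not something Puppeteer reads by itself.

```javascript
// Hypothetical helper: headless by default, headful only when HEADFUL=1 is set.
function launchOptionsFromEnv(env = process.env) {
  return { headless: env.HEADFUL !== '1' };
}

// Usage inside scraper.js:
// const browser = await puppeteer.launch(launchOptionsFromEnv());
```

Running HEADFUL=1 node scraper would then open a visible browser window for debugging, while a plain node scraper stays headless.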

Avoid Being Blocked with Puppeteer

A common issue web scrapers face is getting blocked, because many websites have measures in place to block visitors that behave like bots. Here are some ways to prevent that:

  1. Use proxies.
  2. Limit requests.
  3. Use a valid User-Agent.
  4. Mimic user behavior.
  5. Implement Puppeteer's Stealth plugin.
  6. Use a web scraping API like ZenRows.

For more in-depth information, check out our guide on how to avoid detection with Puppeteer.
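
As a rough illustration of two items from the list (limiting requests and setting a User-Agent), the sketch below pauses for a random interval before each navigation and optionally sets a User-Agent with page.setUserAgent(). politeGoto, randomDelay, and the 1 to 3 second default range are assumptions for this example, and this alone won't defeat dedicated anti-bot systems.

```javascript
// Resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Returns a random integer between minMs and maxMs (both inclusive).
function randomDelay(minMs, maxMs) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

// Hypothetical helper: optionally sets a User-Agent, waits a random interval,
// then navigates to `url`.
async function politeGoto(page, url, { userAgent, minDelayMs = 1000, maxDelayMs = 3000 } = {}) {
  if (userAgent) {
    await page.setUserAgent(userAgent);
  }
  await sleep(randomDelay(minDelayMs, maxDelayMs));
  return page.goto(url);
}

// Usage inside scraper.js (myUserAgent is a placeholder for a real browser User-Agent string):
// await politeGoto(page, 'https://www.scrapingcourse.com/ecommerce/', { userAgent: myUserAgent });
```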

Conclusion

In this tutorial, we looked at what a headless browser in NodeJS is. More specifically, you now know how to use Puppeteer for headless browser web scraping and can benefit from its advanced features.

However, running Puppeteer at scale or avoiding getting blocked will prove to be challenging, so you should consider a tool like ZenRows to ease your web scraping operations. It has a built-in anti-bot feature, and you can try it for free now.

Frequent Questions

What Are Some Examples of Headless Browsers in NodeJS?

Some examples of headless browsers in NodeJS include:

  • Puppeteer is a library that allows you to control and automate headless Chrome or any Chromium-based browser, and it's the most popular option in NodeJS.
  • Selenium is a suite of tools used for automating web browsers, and its WebDriver enables users to interact with web pages. It's more popular when using other languages.
  • Playwright is a library similar to Selenium. However, it stands out for its broad browser support, automating Chromium, Firefox, and WebKit through a single API.
  • NightmareJS is a high-level browser automation library built on top of Electron, which provides the underlying browser it controls.
  • PhantomJS is a headless web browser that is scriptable with JavaScript. It allows developers to perform web interactions and automation via a CLI.
  • CasperJS is a scripting utility based on PhantomJS that simplifies the process of automating interactions with web pages.

What Is the Best Headless Browser for NodeJS?

Puppeteer is the best headless browser for NodeJS. Its APIs are easy to use and provide full control over the headless browser. It also has a large and active community backing it.

Is Puppeteer a Headless Browser?

Strictly speaking, Puppeteer is a library rather than a browser: it provides a high-level API to control Chrome or any Chromium-based browser, which it runs in headless mode by default.

How Do I Run Puppeteer in Headless Mode?

Puppeteer runs in headless mode by default, but it can be configured to run with a full GUI by setting the headless option to false when launching a new browser instance.
