Using a headless browser in NodeJS lets developers control a browser such as Chrome with code, providing extra functionality to interact with web pages and simulate human behavior.
Today, we'll look into how to use Puppeteer, the most popular headless browser library for NodeJS, for web scraping.
What Is a Headless Browser in NodeJS?
A headless browser in NodeJS is an automated browser that runs without a Graphical User Interface (GUI), which makes it faster and less resource-intensive. It can render JavaScript and perform actions (submitting forms, scrolling, etc.) like a human would.
How to Run a Headless Browser in NodeJS with Puppeteer
Now that you know what a headless browser is, let's dig into running one with Puppeteer to interact with elements on the page and scrape data.
As a target site, we'll use ScrapingCourse.com, a demo website with e-commerce features.
Prerequisites
Ensure you have NodeJS installed (npm ships with it) before moving forward.
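You can verify both are available by checking their versions in your terminal:
node -v
npm -v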
Create a new directory and initialize a NodeJS project using npm init -y. Then, install Puppeteer with the command below:
npm i puppeteer@19.7.2
Note: Puppeteer will download a recent version of Chromium when you run the installation command. If you'd rather manage browsers yourself or connect to a remote one, use the puppeteer-core package instead, which doesn't download Chromium by default.
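For reference, here's a minimal sketch of how puppeteer-core can attach to a browser you manage yourself; the WebSocket endpoint below is a placeholder you'd replace with your own browser's debugging endpoint:
const puppeteer = require('puppeteer-core');

(async () => {
  // connects to an already running browser over the DevTools protocol
  // (placeholder endpoint; get the real one from your browser's debug output)
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'ws://localhost:9222/devtools/browser/<id>',
  });
  const page = await browser.newPage();
  await page.goto('https://www.scrapingcourse.com/ecommerce/');
  // disconnects without closing the remote browser
  browser.disconnect();
})();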
Then, create a new scraper.js file inside the headless browser JavaScript project you initialized above.
touch scraper.js
We're ready to get started now!
Step 1: Open the Page
Let's begin by opening the site we want to scrape. For that, launch a browser instance, create a new page and navigate to our target site.
const puppeteer = require('puppeteer');
(async () => {
// launches a browser instance
const browser = await puppeteer.launch();
// creates a new page in the default browser context
const page = await browser.newPage();
// navigates to the page to be scraped
const response = await page.goto('https://www.scrapingcourse.com/ecommerce/');
// logs the status of the request to the page
console.log('Request status: ', response?.status(), '\n\n\n\n');
// closes the browser instance
await browser.close();
})();
Note: The close() method is called at the end to close Chromium and all its pages.
Run the code with node scraper in the terminal. It'll log the status code of the request to ScrapingCourse.com, as shown below:
Request status: 200
Congratulations! 200 shows your request was successful. Now, you're ready to do some scraping.
Step 2: Scrape the Data
Our goal is to scrape all product names on the homepage and display them in a list. Here's what you need to do:
Use your regular browser to go to ScrapingCourse.com and locate any product card, then right-click on the product's name and select "Inspect" to open the Chrome DevTools. The browser will highlight the selected element, as shown below.
The selected element holding the product name is an h2 with the woocommerce-loop-product__title class. If you inspect the other products on that page, you'll see they all share the same class. We can use it to target all the name elements and, in turn, scrape them.
The Puppeteer Page API provides several methods to select elements on a page. One example is Page.$$eval(selector, pageFunction, args), where $$eval() runs document.querySelectorAll against its first argument, the selector. It then passes the result to its second argument, the callback page function, for further operations.
Let's leverage this. Update your scraper.js file with the code below:
const puppeteer = require('puppeteer');
(async () => {
// launches a browser instance
const browser = await puppeteer.launch();
// creates a new page in the default browser context
const page = await browser.newPage();
// remove timeout limit
page.setDefaultNavigationTimeout(0);
// navigates to the page to be scraped
await page.goto('https://www.scrapingcourse.com/ecommerce/');
// gets an array of all product names
const names = await page.$$eval('.woocommerce-loop-product__title', (nodes) => nodes.map((n) => n.textContent));
console.log('Number of products: ', names.length);
console.log('List of products: ', names.join(', '), '\n\n\n');
// closes the browser instance
await browser.close();
})();
Like in the last example, we see similar operations of creating a browser instance and page. However, to disable navigation timeouts and their errors, page.setDefaultNavigationTimeout(0); sets the navigation timeout to zero instead of the default 30,000 ms.
Furthermore, n.textContent gets the text of each node with the woocommerce-loop-product__title class, while the $$eval() call returns an array of the product names.
Finally, the code logs the number of products scraped and prints a comma-separated list of the names.
Run the script again, and you'll see an output like this:
Number of products: 16
List of products: Abominable Hoodie, Adrienne Trek Jacket, Aeon Capri, Aero Daily Fitness Tee, Aether Gym Pant, Affirm Water Bottle, Aim Analog Watch, Ajax Full-Zip Sweatshirt, Ana Running Short, Angel Light Running Short, Antonia Racer Tank, Apollo Running Short, Arcadio Gym Short, Argus All-Weather Tank, Ariel Roll Sleeve Sweatshirt, Artemis Running Short
Cool!
Next, let's see how to interact with the webpage using Puppeteer, one of the extra capabilities a headless browser provides.
Interact with Elements on the Page
The Page API offers several methods for interacting with elements on a page. For example, the Page.type(selector, text) method can send keydown, keyup, and input events.
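It also accepts an options object. For instance, a delay (in milliseconds) between keystrokes makes the typing look more human. A quick sketch with a hypothetical selector:
// types with a 100 ms pause between keystrokes to better mimic a human
await page.type('#some-input', 'some text', { delay: 100 });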
Take a look at the search field at the top right of the target site, which we can use for this. Inspect the element, and you'll see this:
The search field has the woocommerce-product-search-field-0 ID. We can select the element with it and trigger input events on it. To do so, add the code below between the page.goto() and browser.close() calls in your scraper.js file.
const searchFieldSelector = '#woocommerce-product-search-field-0';
const getSearchFieldValue = async () => await page.$eval(searchFieldSelector, el => el.value);
console.log('Search field value before: ', await getSearchFieldValue());
// type instantly into the search field
await page.type(searchFieldSelector, 'Atlas Fitness Tank');
console.log('Search field value after: ', await getSearchFieldValue());
We used the page.type() method to type "Atlas Fitness Tank" into the field.
Rerun the scraper file, and you should get this output:
Search field value before:
Search field value after: Atlas Fitness Tank
The value of the search field changed, indicating the input events were successfully triggered.
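If you also wanted to submit the search, Puppeteer's Keyboard API can press Enter. A short sketch, assuming the site submits the form on Enter:
// presses Enter in the focused search field to submit the form
await page.keyboard.press('Enter');
// waits for the results page to finish loading
await page.waitForNavigation();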
Great! Let's explore other useful capabilities now.
Advanced Headless Browsing with Puppeteer in NodeJS
In this section, you'll learn how to up your Puppeteer headless browser game.
Take a Screenshot
Imagine you want to capture screengrabs, for instance, to visually check that your scraper is working properly. The good news is that taking screenshots with Puppeteer is as easy as calling the screenshot() method.
// takes a screenshot of the search term in the search box
await page.screenshot({ path: 'scrapingcourse-search-result.png' });
console.log('Screenshot taken');
Note: The path option specifies the screenshot's location and filename.
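By default, only the current viewport is captured. If you wanted the whole page instead, the fullPage option covers that (the filename here is just an example):
// captures the entire scrollable page, not just the viewport
await page.screenshot({ path: 'full-page.png', fullPage: true });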
Run the scraper file again, and you'll see a "scrapingcourse-search-result.png" image file created in the root directory of your project.
Wait for the Content to Load
It's a best practice to wait for the whole page or part of it to load when web scraping to make sure everything has been displayed. Let's see an example of why.
Assume you want to get the description of the first product on the target homepage. For that, we can simulate a click event on its image, which will trigger another page load that will contain its description.
Inspecting the first product's image on the homepage reveals a link with the woocommerce-LoopProduct-link and woocommerce-loop-product__link classes.
And on the page that loads after clicking the first product's image, inspecting the description reveals a div element with a woocommerce-product-details__short-description class.
We'll use these classes as selectors for the elements. So you need to update the code between the page.goto() and browser.close() calls with the one below:
// selectors
const productDetailsSelector = '.woocommerce-product-details__short-description',
productLinkSelector = '.woocommerce-LoopProduct-link.woocommerce-loop-product__link';
// clicks on the first product image link (triggers a new page load)
await page.$$eval(productLinkSelector, (links) => links[0]?.click());
// gets the content of the description from the element
const description = await page.$eval(productDetailsSelector, (node) => node.textContent);
// logs the description of the product
console.log('Description: ', description);
There, the $$eval() method selects all the product links and clicks the first one, and the $eval() method targets the description element and gets its content.
Now it's time to run the scraper. Unfortunately, we get an error:
#error = new Errors_js_1.ProtocolError();
^
ProtocolError: Protocol error (DOM.describeNode): Cannot find context with specified id
It occurred because Puppeteer was trying to get the description element before it was loaded.
To fix this, add the waitForSelector(selector) method to wait for the description element's selector. It resolves only when the description is available. We could also wait for the page to load with waitForNavigation(). Either will do the job, but we recommend waiting for a selector if possible.
// selectors
const productDetailsSelector = '.woocommerce-product-details__short-description',
productLinkSelector = '.woocommerce-LoopProduct-link.woocommerce-loop-product__link';
// clicks on the first product image link (triggers a new page load)
await page.$$eval(productLinkSelector, (links) => links[0]?.click());
// waits for the element with the description of the product
await page.waitForSelector(productDetailsSelector);
// gets the content of the description from the element
const description = await page.$eval(productDetailsSelector, (node) => node.textContent);
// logs the description of the product
console.log('Description: ', description);
Run the scraper again. This time, no error appears, and the description of the product is logged.
Description: This is a variable product called a Abominable Hoodie
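For reference, the waitForNavigation()-based alternative would look roughly like this sketch, reusing the selectors defined above:
// clicks the first product link and waits for the resulting navigation to complete
await Promise.all([
  page.waitForNavigation(),
  page.$$eval(productLinkSelector, (links) => links[0]?.click()),
]);
Starting the wait and the click together inside Promise.all() avoids a race where the navigation finishes before the wait begins.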
Scrape Multiple Pages
Do you remember we scraped a list of products earlier on? We can also scrape the description of each one from its respective page.
For that, use the array of product names and links to loop through, updating the code between the page.goto() and browser.close() calls:
// selectors
const productDetailsSelector = '.woocommerce-product-details__short-description',
productLinkSelector = '.woocommerce-LoopProduct-link.woocommerce-loop-product__link';
// get a list of product names and links
const list = await page.$$eval(productLinkSelector,
((links) => links.map(link => {
return {
name: link.querySelector('h2').textContent,
link: link.href
};
}))
);
for (const { name, link } of list) {
await Promise.all([
page.waitForNavigation(),
page.goto(link),
page.waitForSelector(productDetailsSelector),
]);
const description = await page.$eval(productDetailsSelector, (node) => node.textContent);
console.log(name + ': ' + description);
}
This is the complete code:
const puppeteer = require('puppeteer');
(async () => {
// launches a browser instance
const browser = await puppeteer.launch({headless:'new'});
// creates a new page in the default browser context
const page = await browser.newPage();
// remove timeout limit
page.setDefaultNavigationTimeout(0);
// navigates to the page to be scraped
await page.goto('https://www.scrapingcourse.com/ecommerce/');
const productDetailsSelector = '.woocommerce-product-details__short-description',
productLinkSelector = '.woocommerce-LoopProduct-link.woocommerce-loop-product__link';
// get a list of product names and links
const list = await page.$$eval(productLinkSelector,
((links) => links.map(link => {
return {
name: link.querySelector('h2').textContent,
link: link.href
};
}))
);
for (const { name, link } of list) {
await Promise.all([
page.waitForNavigation(),
page.goto(link),
page.waitForSelector(productDetailsSelector),
]);
const description = await page.$eval(productDetailsSelector, (node) => node.textContent);
console.log(name + ': ' + description);
}
await browser.close();
})();
When you run the scraper file, you should start seeing the products and their descriptions logged in the terminal.
Abominable Hoodie: This is a variable product called a Abominable Hoodie
Adrienne Trek Jacket: This is a variable product called a Adrienne Trek Jacket
//... other products omitted for brevity
Ariel Roll Sleeve Sweatshirt: This is a variable product called a Ariel Roll Sleeve Sweatshirt
Artemis Running Short: This is a variable product called a Artemis Running Short
Optimize Puppeteer Scripts
Like most tools, Puppeteer can be optimized to improve its general speed and performance. Here are some of the ways to do so:
Block Unnecessary Requests
Blocking requests you don't need reduces the total number of requests made. In Puppeteer, you can create an interceptor for the types of resources you don't need.
Since we only need the HTML document when targeting ScrapingCourse.com, it makes sense to block other resource types, like images or stylesheets.
// allows interception of requests
await page.setRequestInterception(true);
// listens for requests being triggered
page.on('request', (request) => {
if (request.resourceType() === 'document') {
// allow request to be made
request.continue();
} else {
// cancel request
request.abort();
}
});
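You could also flip the logic and block only an explicit list of resource types, which helps when you still need some non-document resources. A sketch under that assumption:
// allows interception of requests
await page.setRequestInterception(true);
// blocks only the listed resource types and lets everything else through
const blockedTypes = ['image', 'stylesheet', 'font', 'media'];
page.on('request', (request) => {
  if (blockedTypes.includes(request.resourceType())) {
    request.abort();
  } else {
    request.continue();
  }
});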
Cache Resources
Caching resources prevents the Puppeteer headless browser from requesting them again. Every new browser instance creates a temporary directory for its user data, which houses the cache directory.
We can set a permanent directory for all browser instances via the userDataDir option of the puppeteer.launch() method.
// launches a browser instance
const browser = await puppeteer.launch({
userDataDir: './user_data',
});
Set the Headless Mode
The headless option is true by default. Setting it to false stops Puppeteer from running in headless mode; instead, it'll run with a GUI.
Puppeteer lets you choose the browser mode using the headless option of the puppeteer.launch() method.
// launches a browser instance
const browser = await puppeteer.launch({
headless: false,
});
Note: You should perform scraping in headless mode in production since the graphical interface is mainly useful for development and debugging.
Avoid Being Blocked with Puppeteer
A common issue web scrapers face is getting blocked because many websites have measures in place to detect and block visitors that behave like bots. Here are some of the ways you can prevent that:
- Use proxies.
- Limit requests.
- Use a valid User-Agent (see the sketch after this list).
- Mimic user behavior.
- Implement Puppeteer's Stealth plugin.
- Use a web scraping API like ZenRows.
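As an example of the User-Agent point, here's a minimal sketch of overriding a page's User-Agent; the UA string below is only an example, so use a current one in practice:
// makes the headless browser identify itself as a regular desktop Chrome
await page.setUserAgent(
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
);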
For more in-depth information, check out our guide on how to avoid detection with Puppeteer.
Conclusion
In this tutorial, we looked at what a headless browser in NodeJS is. More specifically, you now know how to use Puppeteer for headless browser web scraping and can benefit from its advanced features.
However, running Puppeteer at scale or avoiding getting blocked will prove to be challenging, so you should consider a tool like ZenRows to ease your web scraping operations. It has a built-in anti-bot feature, and you can try it for free now.
Frequent Questions
What Are Some Examples of Headless Browsers in NodeJS?
Some examples of headless browsers in NodeJS include:
- Puppeteer is a library that lets you control and automate headless Chrome or any Chromium-based browser. It's the most popular option in NodeJS.
- Selenium is a suite of tools for automating web browsers, and its WebDriver lets users interact with web pages. It's more commonly used with other languages.
- Playwright is a library similar to Selenium. However, it has some unique features, such as built-in support for multiple browser engines (Chromium, Firefox, and WebKit).
- NightmareJS is a high-level library built on top of Electron. It uses the Chrome DevTools protocol to control a headless version of the Chrome browser.
- PhantomJS is a headless web browser scriptable with JavaScript. It allows developers to perform web interactions and automation via a CLI.
- CasperJS is a scripting utility based on PhantomJS that simplifies the process of automating interactions with web pages.
What Is the Best Headless Browser for NodeJS?
Puppeteer is the best headless browser for NodeJS. Its APIs are easy to use and provide full control over the headless browser. It also has a large and active community backing it.
Is Puppeteer a Headless Browser?
Yes, Puppeteer is a headless browser. It provides a high-level API to control Chrome or any Chromium browser.
How Do I Run Puppeteer in Headless Mode?
Puppeteer runs in headless mode by default but can be configured to run in full (headed) mode using the headless option when launching a new browser instance.
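For example, matching the launch call used earlier in this tutorial:
// launches the browser with the new headless mode
const browser = await puppeteer.launch({ headless: 'new' });
Set headless: false instead to watch the browser work with a GUI.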