
Puppeteer Extra: Comprehensive Tutorial 2024

August 18, 2023 · 4 min read

Puppeteer Extra is an open-source framework that extends the functionality of Puppeteer, a popular browser automation and web scraping tool. We'll explore its plugins in this tutorial.

What Is Puppeteer Extra?

Puppeteer Extra is a Node.js library that augments Puppeteer with plugin functionality. These add-ons aim to fix individual shortcomings and increase Puppeteer's viability, especially for web scraping. For example, puppeteer-extra-plugin-stealth applies various evasion techniques to make it harder for websites to detect that requests come from a bot.

Getting Started with any puppeteer-extra Plugin

Using any puppeteer-extra plugin requires base Puppeteer plus the puppeteer-extra wrapper. Follow the steps below to set both up:

Start a Node.js Project

First, ensure Node.js is installed on your machine by typing node -v in the terminal.

Navigate to your desired project location, and create a new Node.js project using the following command:

Terminal
$ npm init -y

Install Dependencies

Install Puppeteer and Puppeteer Extra.

Terminal
$ npm install puppeteer puppeteer-extra 

Import Puppeteer Extra and open an async function where you'll write your code.

program.js
const puppeteer = require('puppeteer-extra');

(async () => {
  //..
})();

Create Initial Script

Launch a browser, then open a new page. Puppeteer runs in headless mode by default, so setting headless to true just makes that explicit; set it to false if you want to watch the browser window.

program.js
(async () => {
  const browser = await puppeteer.launch({
    headless: true, // Headless is the default; set to false to see the browser window
  });
 
  const page = await browser.newPage();
 
 //..
 
})();

Add your web scraping logic and close the browser. For example, you can navigate to a URL and take a screenshot.

program.js
(async () => {
    //..
    
    // Navigate to a sample website
    await page.goto('https://www.example.com');
 
    // Take a screenshot of the page
    await page.screenshot({ path: 'screenshot.png' });
    
    console.log('Screenshot saved as screenshot.png');
 
    await browser.close();
})();

Putting everything together, you'll get the following complete code.

program.js
const puppeteer = require('puppeteer-extra');
 
(async () => {
  // Launch a headless browser
  const browser = await puppeteer.launch({
    headless: true,
  });
 
  // Create a new page
  const page = await browser.newPage();
 
  // Navigate to a sample website
  await page.goto('https://www.example.com');
 
  // Take a screenshot of the page
  await page.screenshot({ path: 'screenshot.png' });
 
  console.log(`Screenshot saved as screenshot.png`);
 
  // Close the browser
  await browser.close();
})();

You can run the script with Node.js using node program.js (or whatever you named your file). It will take a screenshot of the web page and save it in your project folder.

Once you've set up this base script, you can use any plugin by importing it and activating it using the puppeteer.use() method.

If you need to refresh your fundamentals, check out our Puppeteer web scraping tutorial.

Let's explore the plugins next.

Puppeteer Extra Plugins for Web Scraping

Here are the Puppeteer Extra plugins for web scraping ordered by popularity to help you in your data extraction operations.

1. puppeteer-extra-plugin-stealth

Base Puppeteer exposes default automation properties that flag it as a bot, so anti-scraping measures can detect it easily. Fortunately, puppeteer-extra-plugin-stealth masks those properties using various evasion modules to reduce the risk of getting blocked.

Check out our guide on how to use Puppeteer Stealth to learn more. 
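
As a minimal sketch of how the plugin is activated, the base script only needs the plugin import and a puppeteer.use() call. This assumes you've also run npm install puppeteer-extra-plugin-stealth. The navigator.webdriver check is only an illustrative probe of one property that detection scripts commonly inspect, not proof that a site won't block you.

```javascript
// Sketch: enabling the stealth plugin on top of the base script.
// Assumes puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth are installed.
async function main() {
  // Dependencies are loaded here so a missing package is reported by the catch below
  const puppeteer = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');

  // Register the plugin before launching the browser
  puppeteer.use(StealthPlugin());

  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto('https://www.example.com');

    // Illustrative probe: read a property that anti-bot scripts commonly check
    const webdriver = await page.evaluate(() => navigator.webdriver);
    console.log('navigator.webdriver:', webdriver);
  } finally {
    // Always release the browser, even if navigation fails
    await browser.close();
  }
}

main().catch(console.error);
```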

2. puppeteer-extra-plugin-proxy

puppeteer-extra-plugin-proxy enables you to route your requests through proxies to avoid rate limiting in web scraping. Its main advantage is that the proxy settings live in a single plugin configuration, which keeps the code readable and easier to maintain.

Check out our tutorial on how to use proxies in Puppeteer for more details.
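
Here's a rough sketch of what that configuration might look like, assuming the plugin is installed (npm install puppeteer-extra-plugin-proxy) and accepts an options object with address, port, and credentials, as we understand its README; verify this against the version you install. The proxyOptionsFromUrl helper, the PROXY_URL environment variable, and the placeholder proxy address are hypothetical, and https://httpbin.org/ip is used only as a convenient page that echoes the caller's IP.

```javascript
// Sketch: routing traffic through a proxy with puppeteer-extra-plugin-proxy.
// Assumes puppeteer, puppeteer-extra and puppeteer-extra-plugin-proxy are installed.

// Hypothetical helper: convert a proxy URL into the options object we assume the plugin accepts.
// The URL should carry an explicit, non-default port, because URL drops a scheme's default port.
function proxyOptionsFromUrl(proxyUrl) {
  const { hostname, port, username, password } = new URL(proxyUrl);
  const options = { address: hostname, port: Number(port) };
  if (username) {
    options.credentials = {
      username: decodeURIComponent(username),
      password: decodeURIComponent(password),
    };
  }
  return options;
}

async function main() {
  // Dependencies are loaded here so a missing package is reported by the catch below
  const puppeteer = require('puppeteer-extra');
  const ProxyPlugin = require('puppeteer-extra-plugin-proxy');

  // Placeholder proxy: set PROXY_URL (hypothetical variable) to a real proxy
  const proxyUrl = process.env.PROXY_URL || 'http://user:pass@proxy.example.com:8080';
  puppeteer.use(ProxyPlugin(proxyOptionsFromUrl(proxyUrl)));

  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // The page body should show the proxy's IP rather than your own
    await page.goto('https://httpbin.org/ip');
    console.log(await page.evaluate(() => document.body.innerText));
  } finally {
    await browser.close();
  }
}

main().catch(console.error);
```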

3. puppeteer-extra-plugin-recaptcha

Websites use CAPTCHAs to prevent automated access, posing a major challenge for web scrapers. However, puppeteer-extra-plugin-recaptcha enables Puppeteer to solve reCAPTCHA and hCaptcha challenges, two of the most common types.

It achieves that by automating the use of third-party CAPTCHA-solving services, like 2Captcha. Once a challenge is detected, the plugin submits the CAPTCHA to the chosen service's API, which automatically returns the solution. 

Here's an example of how to use puppeteer-extra-plugin-recaptcha:

program.js
const puppeteer = require('puppeteer-extra');
const RecaptchaPlugin = require('puppeteer-extra-plugin-recaptcha');
 
// Configure the plugin with a CAPTCHA-solving service provider (e.g., 2Captcha)
puppeteer.use(RecaptchaPlugin({
  provider: { id: '2captcha', token: 'YOUR_2CAPTCHA_API_KEY' },
}));
 
(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
 
  // Navigate to a web page with a reCAPTCHA challenge
  await page.goto('https://www.example.com');
 
  // Solve reCAPTCHA challenges on the page (stay on the same page afterwards)
  const recaptchaSolutions = await page.solveRecaptchas();
 
  // Take a screenshot of the page
  await page.screenshot({ path: 'screenshot.png' });
 
  console.log(`Screenshot saved as screenshot.png`);
 
  // Close the browser
  await browser.close();
})();

You can check out the plugin's documentation to learn more.

4. puppeteer-extra-plugin-anonymize-ua

This plugin anonymizes the User Agent (UA) header that Puppeteer sends with its requests. The UA typically contains information about the web client, which websites can use to identify and block requests.

With puppeteer-extra-plugin-anonymize-ua, you can replace Puppeteer's default UA with one that looks like a regular browser's, or supply your own function to customize or randomize it, making your requests appear human-made. The plugin also supports dynamic replacing: it rewrites parts of the browser's real UA string instead of hardcoding a new one, so the Chrome version stays intact across the different UAs.

Find more information in its GitHub repository.
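
The dynamic replacing idea can be sketched as follows, assuming the plugin is installed (npm install puppeteer-extra-plugin-anonymize-ua) and that it accepts a customFn option that receives the current UA string and returns the one to use. The useMacPlatform function and the macOS platform token are hypothetical choices for illustration.

```javascript
// Sketch: customizing the User Agent with puppeteer-extra-plugin-anonymize-ua.
// Assumes puppeteer, puppeteer-extra and puppeteer-extra-plugin-anonymize-ua are installed.

// Hypothetical custom function: swap the first parenthesized (platform) token
// while leaving the rest of the string, including the Chrome version, untouched
function useMacPlatform(ua) {
  return ua.replace(/\(([^)]+)\)/, '(Macintosh; Intel Mac OS X 10_15_7)');
}

async function main() {
  // Dependencies are loaded here so a missing package is reported by the catch below
  const puppeteer = require('puppeteer-extra');
  const AnonymizeUAPlugin = require('puppeteer-extra-plugin-anonymize-ua');

  // Pass the custom function through the plugin's customFn option
  puppeteer.use(AnonymizeUAPlugin({ customFn: useMacPlatform }));

  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto('https://www.example.com');

    // Print the UA the page actually sees
    console.log(await page.evaluate(() => navigator.userAgent));
  } finally {
    await browser.close();
  }
}

main().catch(console.error);
```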

Useful puppeteer-extra Tools for Other Purposes

Puppeteer Extra also offers plugins for other essential purposes. Here's a list of the most important ones and what they're used for:

| Plugin | Use |
| --- | --- |
| puppeteer-extra-plugin-block-resources | Blocks specific resource types (like images and scripts) to make Puppeteer faster and reduce unnecessary data usage. |
| puppeteer-extra-plugin-adblocker | Like conventional adblockers, this plugin blocks ads and trackers, leading to reduced data consumption and faster load times. |
| puppeteer-extra-plugin-devtools | Enables browser debugging. It grants access to the Chrome DevTools protocol, allowing you to interact with it programmatically. |
| puppeteer-extra-plugin-repl | Adds a REPL (Read Eval Print Loop) to Puppeteer, allowing you to run Puppeteer commands interactively from the command line. |
| puppeteer-extra-plugin-user-preferences | Lets you control the browser environment by setting custom preferences, such as enabling geolocation. |
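
As an illustration of the first plugin in the list, the sketch below blocks a few resource types and logs each aborted request. It assumes puppeteer-extra-plugin-block-resources is installed and takes a blockedTypes set of Puppeteer resource type names; the particular set chosen here is just an example.

```javascript
// Sketch: skipping heavy resources with puppeteer-extra-plugin-block-resources.
// Assumes puppeteer, puppeteer-extra and puppeteer-extra-plugin-block-resources are installed.

// Example selection of resource types to block
const BLOCKED_TYPES = new Set(['image', 'stylesheet', 'font', 'media']);

async function main() {
  // Dependencies are loaded here so a missing package is reported by the catch below
  const puppeteer = require('puppeteer-extra');
  const BlockResourcesPlugin = require('puppeteer-extra-plugin-block-resources');

  puppeteer.use(BlockResourcesPlugin({ blockedTypes: BLOCKED_TYPES }));

  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();

    // Aborted requests surface as failed requests, so log them to see what was skipped
    page.on('requestfailed', (request) => {
      console.log('Blocked or failed:', request.resourceType(), request.url());
    });

    await page.goto('https://www.example.com');
    console.log('Page title:', await page.title());
  } finally {
    await browser.close();
  }
}

main().catch(console.error);
```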

Common Questions

TypeScript with puppeteer-extra-plugin

Using TypeScript with Puppeteer Extra enhances code readability and improves productivity. Enable it following the steps below: 

  1. Install TypeScript to add it to your project.
Terminal
$ npm install typescript
  2. Initialize TypeScript. This creates a tsconfig.json file containing TypeScript configurations in your project root.
Terminal
$ npx tsc --init
  3. Rename your script from .js to .ts and update the imports accordingly. For example, replace const puppeteer = require('puppeteer-extra') with import puppeteer from 'puppeteer-extra'.

Scaling up: Multiple Puppeteers with Different Plugins

Scraping can quickly become complex and time-consuming, particularly in large-scale projects. Fortunately, you can use multiple Puppeteers with different plugins to enhance efficiency and scale up operations.

You can achieve that by using the `addExtra` function from Puppeteer Extra to create separate Puppeteer instances, each representing a distinct browser environment. Then, add the required plugins to each instance by calling its own `use()` method.
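
A minimal sketch of that setup might look like this, assuming the stealth and anonymize-ua plugins are installed next to puppeteer and puppeteer-extra. The createInstances helper and the instance names are hypothetical; the point is that each wrapped instance keeps its own plugin list.

```javascript
// Sketch: two independent Puppeteer Extra instances, each with its own plugin.
// Assumes puppeteer, puppeteer-extra, puppeteer-extra-plugin-stealth and
// puppeteer-extra-plugin-anonymize-ua are installed.

// Hypothetical helper that builds both instances from the same base Puppeteer
function createInstances() {
  const vanillaPuppeteer = require('puppeteer');
  const { addExtra } = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');
  const AnonymizeUAPlugin = require('puppeteer-extra-plugin-anonymize-ua');

  // Each addExtra() call wraps base Puppeteer in a fresh plugin container
  const stealthPuppeteer = addExtra(vanillaPuppeteer);
  stealthPuppeteer.use(StealthPlugin());

  const anonymizedPuppeteer = addExtra(vanillaPuppeteer);
  anonymizedPuppeteer.use(AnonymizeUAPlugin());

  return { stealthPuppeteer, anonymizedPuppeteer };
}

async function main() {
  const { stealthPuppeteer, anonymizedPuppeteer } = createInstances();

  // Launch one browser per instance and print the UA each one reports
  for (const [name, instance] of [
    ['stealth', stealthPuppeteer],
    ['anonymized', anonymizedPuppeteer],
  ]) {
    const browser = await instance.launch({ headless: true });
    try {
      const page = await browser.newPage();
      await page.goto('https://www.example.com');
      console.log(name, await page.evaluate(() => navigator.userAgent));
    } finally {
      await browser.close();
    }
  }
}

main().catch(console.error);
```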

Concurrency: Puppeteer Extra with puppeteer-cluster

Puppeteer-cluster alongside Puppeteer Extra enables concurrency support, the ability to perform multiple tasks simultaneously. Puppeteer-cluster allows you to create a cluster of Puppeteer workers, and it integrates well with puppeteer-extra. 

To achieve concurrency, use addExtra to create a custom Puppeteer instance that incorporates the necessary plugins. Then, initialize the cluster with the custom Puppeteer instance, define the task handler using the cluster.task function, and queue the tasks using cluster.queue.
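
The steps above can be sketched roughly as follows, assuming puppeteer-cluster and the stealth plugin are installed alongside puppeteer and puppeteer-extra. The URL list and the concurrency limit of 2 are arbitrary values picked for illustration.

```javascript
// Sketch: running Puppeteer Extra tasks concurrently with puppeteer-cluster.
// Assumes puppeteer, puppeteer-extra, puppeteer-extra-plugin-stealth and
// puppeteer-cluster are installed.

// Example targets; replace with the pages you want to scrape
const URLS = ['https://www.example.com', 'https://www.example.org'];

async function main() {
  // Dependencies are loaded here so a missing package is reported by the catch below
  const { Cluster } = require('puppeteer-cluster');
  const vanillaPuppeteer = require('puppeteer');
  const { addExtra } = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');

  // Custom Puppeteer instance that carries the plugins
  const puppeteer = addExtra(vanillaPuppeteer);
  puppeteer.use(StealthPlugin());

  // Hand the custom instance to the cluster instead of base Puppeteer
  const cluster = await Cluster.launch({
    puppeteer,
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
    puppeteerOptions: { headless: true },
  });

  // Log failures of individual tasks without stopping the others
  cluster.on('taskerror', (err, data) => {
    console.error(`Error crawling ${data}: ${err.message}`);
  });

  // Task handler: runs once per queued URL on a page provided by the cluster
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    console.log(`${url} -> ${await page.title()}`);
  });

  for (const url of URLS) {
    cluster.queue(url);
  }

  // Wait for the queue to drain, then shut everything down
  await cluster.idle();
  await cluster.close();
}

main().catch(console.error);
```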

Conclusion

Puppeteer Extra allows you to enhance the capabilities of Puppeteer by integrating different plugins. Those handle specific use cases, the most common one being to avoid bot detection.

Learn more about Puppeteer Stealth, the most popular plugin to avoid getting blocked.
