Puppeteer Extra is an open-source framework that extends the functionality of Puppeteer, a popular browser automation and web scraping tool. We'll explore its most useful plugins below.
What Is Puppeteer Extra?
Puppeteer Extra is a Node.js library that augments Puppeteer with plugin functionality. These add-ons aim to fix individual shortcomings and increase Puppeteer's viability, especially for web scraping. For example, `puppeteer-extra-plugin-stealth` applies various evasion techniques to make it harder for websites to detect that requests come from a bot.
Getting Started with any puppeteer-extra Plugin
Using any puppeteer-extra plugin requires both base Puppeteer and the puppeteer-extra wrapper. Follow the steps below to set them up:
Start a Node.js Project
First, ensure Node.js is installed on your machine by typing `node -v` in the terminal.
Navigate to your desired project location, and create a new Node.js project using the following command:
$ npm init -y
Install Dependencies
Install Puppeteer and Puppeteer Extra.
$ npm install puppeteer puppeteer-extra
Import Puppeteer Extra and open an async function where you'll write your code.
const puppeteer = require('puppeteer-extra');

(async () => {
  // ...
})();
Create Initial Script
Launch a browser and open a new page. Puppeteer runs in headless mode by default, so setting `headless: true` only makes that choice explicit; set it to `false` to watch the browser window while debugging.
(async () => {
  const browser = await puppeteer.launch({
    headless: true, // Run without a visible browser window
  });
  const page = await browser.newPage();
  // ...
})();
Add your web scraping logic and close the browser. For example, you can navigate to a URL and take a screenshot.
(async () => {
  // ...
  // Navigate to a sample website
  await page.goto('https://www.example.com');
  // Take a screenshot of the page
  await page.screenshot({ path: 'screenshot.png' });
  console.log('Screenshot saved as screenshot.png');
  await browser.close();
})();
Putting everything together, you'll get the following complete code.
const puppeteer = require('puppeteer-extra');

(async () => {
  // Launch a headless browser
  const browser = await puppeteer.launch({
    headless: true,
  });

  // Create a new page
  const page = await browser.newPage();

  // Navigate to a sample website
  await page.goto('https://www.example.com');

  // Take a screenshot of the page
  await page.screenshot({ path: 'screenshot.png' });
  console.log('Screenshot saved as screenshot.png');

  // Close the browser
  await browser.close();
})();
You can run the script with `node projectFileName.js`. It will take a screenshot of the web page and save it in your project folder.
Once you've set up this base script, you can use any plugin by importing it and activating it with the `puppeteer.use()` method before launching the browser.
If you need to refresh your fundamentals, check out our Puppeteer web scraping tutorial.
Let's explore the plugins next.
Puppeteer Extra Plugins for Web Scraping
Here are the Puppeteer Extra plugins for web scraping ordered by popularity to help you in your data extraction operations.
1. puppeteer-extra-plugin-stealth
Base Puppeteer exposes default automation properties that anti-scraping measures use to flag it as a bot, so it's easy to detect. Fortunately, `puppeteer-extra-plugin-stealth` masks those properties using various evasion modules to reduce the risk of getting blocked.
Check out our guide on how to use Puppeteer Stealth to learn more.
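As a minimal sketch of how the plugin plugs into the base script, the helper below registers it with `puppeteer.use()` and then takes a screenshot. The function name `screenshotWithStealth` and its parameters are made up for illustration, and the sketch assumes you've installed the plugin with `npm install puppeteer-extra-plugin-stealth`.

```javascript
// Illustrative helper: the name and parameters are assumptions, not library API.
// `puppeteer` is the object exported by puppeteer-extra;
// `StealthPlugin` is the factory exported by puppeteer-extra-plugin-stealth.
async function screenshotWithStealth(puppeteer, StealthPlugin, url, path) {
  // Register the plugin before launching so it applies to the new browser
  puppeteer.use(StealthPlugin());

  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url);
    await page.screenshot({ path });
  } finally {
    // Always release the browser, even if navigation fails
    await browser.close();
  }
}

// Usage, once the packages are installed:
//   const puppeteer = require('puppeteer-extra');
//   const StealthPlugin = require('puppeteer-extra-plugin-stealth');
//   screenshotWithStealth(puppeteer, StealthPlugin, 'https://www.example.com', 'stealth.png');
```

Passing the library objects in as arguments is only a convenience that keeps the sketch easy to try with stand-ins; in a real script you'd typically call `puppeteer.use(StealthPlugin())` once at the top of the file.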
2. puppeteer-extra-plugin-proxy
`puppeteer-extra-plugin-proxy` enables you to route your requests through proxies to avoid rate limiting in web scraping. Because the proxy is configured in a single plugin call rather than through launch arguments and per-page authentication, the code stays more readable and easier to maintain.
Check out our tutorial on how to use proxies in Puppeteer for more details.
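Here's a hedged sketch of what configuring the plugin might look like. The `{ address, port, credentials }` option shape follows the plugin's README as we understand it, so verify it against the version you install. The helper names `proxyOptionsFromUrl` and `launchThroughProxy`, as well as the proxy host, are hypothetical.

```javascript
// Converts a proxy URL such as "http://user:pass@proxy.example.com:8080"
// into the option object handed to the plugin.
// ASSUMPTION: the plugin accepts { address, port, credentials: { username, password } }.
function proxyOptionsFromUrl(proxyUrl) {
  const { hostname, port, username, password } = new URL(proxyUrl);
  if (!port) {
    // The URL parser drops missing and scheme-default ports, so ask for an explicit one
    throw new Error('Proxy URL must include an explicit port');
  }
  const options = { address: hostname, port: Number(port) };
  if (username) {
    // URL credentials are percent-encoded, so decode them first
    options.credentials = {
      username: decodeURIComponent(username),
      password: decodeURIComponent(password),
    };
  }
  return options;
}

// Registers the proxy plugin and launches a browser that uses it.
// `puppeteer` is puppeteer-extra; `ProxyPlugin` is the plugin's exported factory.
async function launchThroughProxy(puppeteer, ProxyPlugin, proxyUrl) {
  puppeteer.use(ProxyPlugin(proxyOptionsFromUrl(proxyUrl)));
  return puppeteer.launch({ headless: true });
}

// Usage, once the packages are installed (inside an async function):
//   const puppeteer = require('puppeteer-extra');
//   const ProxyPlugin = require('puppeteer-extra-plugin-proxy');
//   const browser = await launchThroughProxy(
//     puppeteer, ProxyPlugin, 'http://user:pass@proxy.example.com:8080');
```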
3. puppeteer-extra-plugin-recaptcha
Websites use CAPTCHAs to prevent automated access, posing a major challenge for web scrapers. However, `puppeteer-extra-plugin-recaptcha` enables Puppeteer to solve reCAPTCHA and hCaptcha challenges, some of the most common types.
It achieves that by automating the use of third-party CAPTCHA-solving services, like 2Captcha. Once a challenge is detected, the plugin submits the CAPTCHA to the chosen service's API, which automatically returns the solution.
Here's an example of how to use puppeteer-extra-plugin-recaptcha:
const puppeteer = require('puppeteer-extra');
const RecaptchaPlugin = require('puppeteer-extra-plugin-recaptcha');

// Configure the plugin with a CAPTCHA-solving service provider (e.g., 2Captcha)
puppeteer.use(RecaptchaPlugin({
  provider: { id: '2captcha', token: 'YOUR_2CAPTCHA_API_KEY' },
}));

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Navigate to a web page with a reCAPTCHA challenge
  await page.goto('https://www.example.com');

  // Solve the reCAPTCHA challenges found on the page
  const { solved } = await page.solveRecaptchas();
  console.log(`Solved ${solved.length} CAPTCHA(s)`);

  // Take a screenshot of the page after solving (no second navigation,
  // since reloading the page would discard the solution)
  await page.screenshot({ path: 'screenshot.png' });
  console.log('Screenshot saved as screenshot.png');

  // Close the browser
  await browser.close();
})();
You can check out the documentation to learn more.
4. puppeteer-extra-plugin-anonymize-ua
This plugin anonymizes the User Agent (UA) header that Puppeteer sends with its requests. The UA typically contains information about the web client, which websites can use to identify and block requests.
With `puppeteer-extra-plugin-anonymize-ua`, you can rewrite the User Agent so it resembles that of a regular browser, for example by removing the headless marker and changing the reported platform, making your requests look less automated. The plugin also supports dynamic replacing through a custom function, so the real Chrome version stays intact while other parts of the UA change.
Find more information in its GitHub repository.
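To illustrate dynamic replacing, here's a small sketch of a custom replacer. It assumes the plugin's `stripHeadless`, `makeWindows`, and `customFn` options behave as described in its README; the helper names `presentAsMac` and `useAnonymizedUserAgent` and the macOS platform string are illustrative choices.

```javascript
// Custom replacer for the plugin's `customFn` option: swaps the first
// parenthesized platform token for a macOS one and leaves the rest of the
// string, including the Chrome version, untouched.
function presentAsMac(ua) {
  return ua.replace(/\(([^)]+)\)/, '(Macintosh; Intel Mac OS X 10_15_7)');
}

// Registers the plugin on a puppeteer-extra object.
// `AnonymizeUaPlugin` is the factory exported by puppeteer-extra-plugin-anonymize-ua.
function useAnonymizedUserAgent(puppeteer, AnonymizeUaPlugin) {
  return puppeteer.use(
    AnonymizeUaPlugin({
      stripHeadless: true,    // replace "HeadlessChrome" with "Chrome"
      makeWindows: false,     // skip the built-in Windows platform rewrite
      customFn: presentAsMac, // apply our own platform rewrite instead
    })
  );
}

// Usage, once the packages are installed:
//   const puppeteer = require('puppeteer-extra');
//   const AnonymizeUaPlugin = require('puppeteer-extra-plugin-anonymize-ua');
//   useAnonymizedUserAgent(puppeteer, AnonymizeUaPlugin);
```

Because the replacer is a plain function from string to string, you can unit-test it against sample UA strings without launching a browser.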
Useful puppeteer-extra Tools for Other Purposes
Puppeteer Extra also offers plugins for other essential purposes. Here's a list of the most important ones and what they're used for:
Plugin | Use |
---|---|
`puppeteer-extra-plugin-block-resources` | Blocks specific resource types (like images and scripts) to make Puppeteer faster and reduce unnecessary data usage. |
`puppeteer-extra-plugin-adblocker` | Like conventional adblockers, this plugin blocks ads and trackers, leading to reduced data consumption and faster load times. |
`puppeteer-extra-plugin-devtools` | Enables browser debugging. It grants access to the Chrome DevTools protocol, allowing you to interact with it programmatically. |
`puppeteer-extra-plugin-repl` | Adds a REPL (Read-Eval-Print Loop) to Puppeteer, letting you inspect and run commands against Puppeteer objects interactively from the terminal. |
`puppeteer-extra-plugin-user-preferences` | Lets you control the browser environment by setting custom preferences, such as enabling geolocation. |
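As one example from the table, here's a minimal sketch of registering the resource blocker. It assumes the plugin takes a `blockedTypes` option holding a `Set` of Puppeteer resource type names, as its README describes; the helper name `blockHeavyResources` and the default list of types are illustrative.

```javascript
// Registers puppeteer-extra-plugin-block-resources on a puppeteer-extra object.
// `BlockResourcesPlugin` is the factory exported by the plugin package.
// ASSUMPTION: the plugin accepts { blockedTypes: Set<string> }.
function blockHeavyResources(puppeteer, BlockResourcesPlugin, types = ['image', 'media', 'font']) {
  // A Set removes accidental duplicates in the list of resource types
  return puppeteer.use(BlockResourcesPlugin({ blockedTypes: new Set(types) }));
}

// Usage, once the packages are installed:
//   const puppeteer = require('puppeteer-extra');
//   const BlockResourcesPlugin = require('puppeteer-extra-plugin-block-resources');
//   blockHeavyResources(puppeteer, BlockResourcesPlugin);
//   // Pages opened after puppeteer.launch() skip images, media, and fonts
```

Blocking stylesheets or scripts as well can speed things up further, but it may break pages that render their content with JavaScript, so it's worth starting with a conservative list.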
Common Questions
TypeScript with puppeteer-extra-plugin
Using TypeScript with Puppeteer Extra adds static type checking and editor autocompletion to your scripts, which helps catch mistakes before they run. Enable it following the steps below:
- Install TypeScript to add it to your project.
$ npm install typescript
- Initialize TypeScript. This creates a `tsconfig.json` file containing TypeScript configurations in your project root.
$ npx tsc --init
- Rename your script from `.js` to `.ts` and update the imports accordingly. For example, replace `const puppeteer = require('puppeteer-extra')` with `import puppeteer from 'puppeteer-extra'`.
Scaling up: Multiple Puppeteers with Different Plugins
Scraping can quickly become complex and time-consuming, particularly in large-scale projects. Fortunately, you can use multiple Puppeteers with different plugins to enhance efficiency and scale up operations.
You can achieve that by using the `addExtra` function from Puppeteer Extra to wrap base Puppeteer several times, creating independent instances that each represent a distinct browser environment. Then, add the required plugins to each instance by calling its own `use()` method.
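The idea can be sketched as follows. The helper name `createInstances` and the instance labels are hypothetical; `addExtra` is the function exported by puppeteer-extra, and the plugins shown in the usage comment are only examples.

```javascript
// Builds one independent puppeteer-extra instance per entry in `pluginsByName`.
// `vanillaPuppeteer` is the object exported by the base `puppeteer` package;
// `addExtra` is the function exported by puppeteer-extra.
function createInstances(vanillaPuppeteer, addExtra, pluginsByName) {
  const instances = {};
  for (const [name, plugins] of Object.entries(pluginsByName)) {
    // Each call wraps base Puppeteer anew, so plugin lists are not shared
    const instance = addExtra(vanillaPuppeteer);
    plugins.forEach((plugin) => instance.use(plugin));
    instances[name] = instance;
  }
  return instances;
}

// Usage, once the packages are installed:
//   const vanillaPuppeteer = require('puppeteer');
//   const { addExtra } = require('puppeteer-extra');
//   const StealthPlugin = require('puppeteer-extra-plugin-stealth');
//   const AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');
//   const { stealthy, lean } = createInstances(vanillaPuppeteer, addExtra, {
//     stealthy: [StealthPlugin()],
//     lean: [AdblockerPlugin({ blockTrackers: true })],
//   });
//   // stealthy.launch() and lean.launch() now start differently configured browsers
```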
Concurrency: Puppeteer Extra with puppeteer-cluster
Puppeteer-cluster alongside Puppeteer Extra enables concurrency support, the ability to perform multiple tasks simultaneously. Puppeteer-cluster allows you to create a cluster of Puppeteer workers, and it integrates well with puppeteer-extra.
To achieve concurrency, use `addExtra` to create a custom Puppeteer instance that incorporates the necessary plugins. Then, initialize the cluster with the custom Puppeteer instance, define the task handler using the `cluster.task` function, and queue the tasks using `cluster.queue`.
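Those steps might look like the sketch below, which collects page titles for a list of URLs. The function name `crawlConcurrently` and the title-collecting task are illustrative; `Cluster` is the class exported by puppeteer-cluster, and `puppeteer` is assumed to be a custom instance created with `addExtra` that already has its plugins registered.

```javascript
// Runs one task per URL across a pool of workers and returns { url: title }.
async function crawlConcurrently(Cluster, puppeteer, urls, maxConcurrency = 2) {
  const cluster = await Cluster.launch({
    puppeteer, // custom puppeteer-extra instance instead of base Puppeteer
    concurrency: Cluster.CONCURRENCY_CONTEXT, // separate incognito context per job
    maxConcurrency, // how many jobs run at the same time
  });

  const titles = {};

  // Define what every queued job does with its page
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    titles[url] = await page.title();
  });

  // Queue the jobs, wait until all of them finish, then shut down
  urls.forEach((url) => cluster.queue(url));
  await cluster.idle();
  await cluster.close();

  return titles;
}

// Usage, once the packages are installed (inside an async function):
//   const { Cluster } = require('puppeteer-cluster');
//   const { addExtra } = require('puppeteer-extra');
//   const StealthPlugin = require('puppeteer-extra-plugin-stealth');
//   const puppeteer = addExtra(require('puppeteer'));
//   puppeteer.use(StealthPlugin());
//   const titles = await crawlConcurrently(Cluster, puppeteer, ['https://www.example.com']);
```

In a real crawl you'd likely also listen for the cluster's `taskerror` event, so that one failing URL gets logged instead of passing silently.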
Conclusion
Puppeteer Extra allows you to enhance the capabilities of Puppeteer by integrating different plugins. These handle specific use cases, the most common one being to avoid bot detection.
Learn more about Puppeteer Stealth, the most popular plugin to avoid getting blocked.