Web scraping dynamic websites requires more than sending HTTP requests. You need a tool that can handle JavaScript, interact with elements, and behave like a real browser. Playwright is that tool. It offers a powerful way to automate browser interactions and extract data from modern web applications.
In this tutorial, you'll learn how to scrape data from websites using Playwright in Node.js through a practical, step-by-step approach. Here's what you'll learn:
- Getting started with Playwright and setting up your environment.
- Advanced Playwright features.
- Playwright vs. Puppeteer vs. Selenium.
- Handling anti-bot measures and avoiding blocks.
Let's get started!
What Is Playwright?
Playwright is an open-source browser automation framework originally built for Node.js. You can also use it with other popular programming languages, including Python, Java, and .NET. It's compatible with all major browser engines, covering Google Chrome, Microsoft Edge, Firefox, and Safari (via WebKit).
You can use Playwright for a wide range of scraping and automation tasks. Playwright scripts can help you navigate pages, click buttons, fill out forms, extract text, capture screenshots, and more. You can even configure it to bypass CAPTCHA challenges and automate complex workflows.
In short, Playwright offers a user-friendly syntax that makes it accessible even if you're new to programming. You'll find its headless browser mode particularly useful. It runs without a graphical interface, which significantly reduces page loading times and memory usage.
Ready to start using Playwright?
How to Use Playwright for Web Scraping
In this section, you'll explore the essential steps of web scraping with Playwright, starting with environment setup, moving through basic scraping and data parsing, and concluding with exporting the scraped data. Let's jump right in!
Step 1: Prerequisites
Before you start scraping with Playwright, let's get your development environment ready.
Here's what you'll need:
- Ensure you have Node.js installed. Run the command node -v in your terminal to confirm the installation.
- Create a new project folder in your desired location and install Playwright using the following npm command:

npm install playwright

- Next, install the browser binaries using the following command:

npx playwright install

- Your favorite code editor. We'll use VS Code, but feel free to use any IDE you're comfortable with.
If you're new to web scraping with JavaScript, don't forget to check out our in-depth guide on web scraping in JavaScript and Node.js.
You're all set up!
Step 2: Build a Basic Playwright Web Scraper
Let's build your first Playwright scraper by targeting the ScrapingCourse's JS Rendering demo page.
You'll launch a browser in headless mode (no visible UI), navigate to the target website, and extract its HTML content. While this is a simple example, it forms the foundation for more complex scraping tasks.
Create a new file called scraper.js and add the following code:
const { chromium } = require("playwright");
(async () => {
try {
// launch the browser
const browser = await chromium.launch();
// playwright runs in headless mode by default for better performance
// create a new browser context and page
const page = await browser.newPage();
// navigate to the target website
await page.goto("https://www.scrapingcourse.com/javascript-rendering");
// extract the full HTML content
const html = await page.content();
console.log(html);
// close the browser
await browser.close();
} catch (error) {
console.error("An error occurred:", error);
}
})();
To run your scraper, use this command in your terminal:
node scraper.js
You'll see the complete HTML of the target page:
<html lang="en">
<head>
<!-- ... -->
<title>JS Rendering Challenge to Learn Web Scraping - ScrapingCourse.com</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<div class="product-info" ...>
<span class="product-name" ...>
Chaz Kangeroo Hoodie
</span>
</div>
<!-- ... -->
</body>
</html>
While headless mode makes scraping faster and more resource-efficient, it can trigger website anti-bot systems. This happens because headless browsers have certain properties that make them distinguishable from regular browsers.
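For example, one well-known giveaway is the navigator.webdriver flag, which Playwright-controlled browsers typically expose by default. You can verify it yourself from any page object in this tutorial:

// check a common automation giveaway from the page context
const isAutomated = await page.evaluate(() => navigator.webdriver);
console.log("navigator.webdriver:", isAutomated); // true in a vanilla Playwright session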
Don't worry if you encounter blocks! Later in this guide, we'll cover advanced techniques for avoiding detection in our section on handling anti-bot measures. For now, let's focus on extracting specific data from the page.
Step 3: Parse Data from the Page With Playwright
Now that you've accessed the page, let's extract specific data elements from our target page. The first step is identifying the HTML elements you want to scrape. Open Chrome DevTools (F12 or right-click and select Inspect) to examine the page structure.
Let's start by extracting a single product name. Right-click the first product name and select Inspect. You'll notice each product is contained within a div with the class product-item. Looking at the HTML structure, product names are within span elements with the class product-name.
Playwright offers multiple ways to select elements, including XPath and CSS selectors. In this case, we'll use CSS selectors, as they're simple and straightforward.
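For reference, the same element can be targeted with either selector type. Here's a quick sketch (the CSS version is what we'll use below):

// CSS selector (used in this tutorial)
const byCss = page.locator(".product-name");
// equivalent XPath selector
const byXpath = page.locator("xpath=//span[contains(@class, 'product-name')]");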
Here's how to extract the first product name from the page using CSS selectors and Playwright's locator method:
const { chromium } = require("playwright");
(async () => {
try {
// launch the browser
const browser = await chromium.launch();
// playwright runs in headless mode by default for better performance
// create a new browser context and page
const page = await browser.newPage();
// navigate to the target website
await page.goto("https://www.scrapingcourse.com/javascript-rendering");
// extract the first product name using CSS selector
const firstProduct = page.locator(".product-name").first();
const productName = await firstProduct.innerText();
console.log("First product:", productName);
// close the browser
await browser.close();
} catch (error) {
console.error("An error occurred:", error);
}
})();
You'll get the following output on running this code:
First product: Chaz Kangeroo Hoodie
After successfully extracting a single element, you can expand the script to handle multiple products.
The page structure reveals several key CSS selectors you'll need: .product-name for the product name, .product-price for pricing information, and .product-image for the product image. Use Playwright's evaluateAll method to process all matched products in a single browser-side call.
const { chromium } = require("playwright");
(async () => {
try {
// launch the browser
const browser = await chromium.launch();
// playwright runs in headless mode by default for better performance
// create a new browser context and page
const page = await browser.newPage();
// navigate to the target website
await page.goto("https://www.scrapingcourse.com/javascript-rendering");
// extract all products
const products = page.locator(".product-item");
const productData = await products.evaluateAll((items) => {
return items.map((item) => ({
name: item.querySelector(".product-name").innerText,
price: item.querySelector(".product-price").innerText,
image: item.querySelector(".product-image").getAttribute("src"),
}));
});
console.log(productData);
// close the browser
await browser.close();
} catch (error) {
console.error("An error occurred:", error);
}
})();
Here's what the output looks like:
[
{
name: 'Chaz Kangeroo Hoodie',
price: '$52',
image: 'https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh01-gray_main.jpg'
},
// other items omitted for brevity
]
Awesome! Now, you have a robust scraper that extracts product information efficiently!
Step 4: Export Scraped Data to a CSV File
Exporting scraped data to CSV format offers a versatile way to store and analyze your collected information. CSV files are lightweight, widely supported, and can be easily imported into spreadsheet software, databases, or data analysis tools.
Node.js provides a built-in fs module that makes file operations straightforward. You need to convert your structured JavaScript objects into comma-separated rows and add headers.
Here's how to transform your scraping output into a clean, structured CSV file:
const { chromium } = require("playwright");
const fs = require("fs");
(async () => {
try {
// launch the browser
const browser = await chromium.launch();
// playwright runs in headless mode by default for better performance
// create a new browser context and page
const page = await browser.newPage();
// navigate to the target website
await page.goto("https://www.scrapingcourse.com/javascript-rendering");
// extract all products
const products = page.locator(".product-item");
const productData = await products.evaluateAll((items) => {
return items.map((item) => ({
name: item.querySelector(".product-name").innerText,
price: item.querySelector(".product-price").innerText,
image: item.querySelector(".product-image").getAttribute("src"),
}));
});
// format data as a CSV string, quoting fields so embedded commas don't break columns
const escapeCSV = (value) => `"${String(value).replace(/"/g, '""')}"`;
const headers = ["name", "price", "imageURL"].join(",");
const rows = productData.map((product) =>
  [product.name, product.price, product.image].map(escapeCSV).join(",")
);
// combine headers and data rows
const csvContent = [headers, ...rows].join("\n");
// write to file synchronously
fs.writeFileSync("products.csv", csvContent, "utf-8");
console.log("Data successfully exported to products.csv!");
// close the browser
await browser.close();
} catch (error) {
console.error("An error occurred:", error);
}
})();
When you run this script, it creates a products.csv file containing the headers followed by your scraped data.
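The first few lines of the file should look something like this (the fields are quoted by the escape helper):

name,price,imageURL
"Chaz Kangeroo Hoodie","$52","https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh01-gray_main.jpg"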
Congratulations! 🎉 You've built a complete web scraper using Playwright in Node.js that extracts product data and saves it in a structured format.
Playwright Features
Playwright isn't just a data extraction tool. It's a complete browser automation powerhouse. Beyond basic data extraction, it provides a rich set of features for interacting with web pages programmatically, which makes it ideal for handling dynamic content, form submissions, complex user interactions, and more.
The upcoming sections will explore key Playwright capabilities such as page navigation, screenshot capture, request interception, and resource blocking. These features make Playwright particularly effective for scraping modern web applications where simple HTTP requests aren't enough.
Page Navigation and Waiting
When scraping modern web applications, handling page navigation and content loading becomes crucial for reliable data extraction. Unlike static websites, modern web apps load content dynamically, implement pagination mechanisms, and require specific user interactions.
Navigation in Playwright goes beyond simple URL loading. The page.goto() method supports options like waitUntil to ensure proper page readiness, while page.click() handles complex scenarios, such as clicking elements that trigger route changes. These navigation features automatically manage page transitions, handle redirects, and maintain browser history.
Playwright's waiting mechanisms are intelligent and context-aware. Rather than using fixed delays, you can wait for specific events, such as network requests completing (waitForLoadState), elements becoming visible (waitForSelector), or DOM changes occurring (waitForFunction). This approach makes your scraper both faster and more reliable.
Here's a simple example of page navigation and waiting:
async function navigateAndWait(page) {
// navigate with custom timeout and wait until network is idle
await page.goto('https://www.scrapingcourse.com/infinite-scrolling', {
timeout: 30000,
waitUntil: 'networkidle'
});
// wait for product grid to be visible and clickable
await page.waitForSelector('.product-item');
// click first product and wait for details page
const productLink = page.locator('.product-item a').first();
await productLink.click();
// wait for product details to load
await page.waitForSelector('.product-info');
}
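The example above waits on selectors, but waitForFunction shines for conditions a selector can't express. Here's a minimal sketch for the same infinite-scrolling demo, assuming the page keeps appending .product-item elements as you scroll:

async function scrollAndWaitForMore(page) {
  // count the products currently rendered
  const before = await page.locator(".product-item").count();
  // scroll to the bottom to trigger the next batch
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  // resolve only once the DOM contains more products than before
  await page.waitForFunction(
    (previousCount) =>
      document.querySelectorAll(".product-item").length > previousCount,
    before
  );
}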
You can learn more in our in-depth guide on Playwright pagination.
Taking Screenshots with Playwright
Playwright offers screenshot capabilities that work seamlessly in both headless and headed browser modes. It allows you to capture full pages, viewports, or individual elements with precision.
Playwright's screenshot API offers three distinct approaches to capture screenshots:
- Viewport screenshot using the screenshot() method, which captures only the visible area (like what you see in your browser window).
- Full-page screenshot using the fullPage: true option with the screenshot() method, which includes all scrollable content.
- Specific element screenshot using the locator().screenshot() method, which lets you target particular page components.
Let's use a demo product page as our target to demonstrate the various screenshot capabilities. The script below captures the viewport, the entire page, and the product summary section.
const { chromium } = require("playwright");
(async () => {
try {
// launch the browser
const browser = await chromium.launch();
// playwright runs in headless mode by default for better performance
// create a new browser context and page
const page = await browser.newPage();
// set consistent viewport size
await page.setViewportSize({ width: 1280, height: 720 });
// navigate to the target website
await page.goto(
"https://www.scrapingcourse.com/ecommerce/product/chaz-kangeroo-hoodie"
);
await page.waitForLoadState("networkidle");
// capture full-page screenshot
await page.screenshot({
path: "./full-page.png",
fullPage: true,
});
console.log("Full page screenshot saved as full-page.png");
// capture viewport screenshot
await page.screenshot({
path: "./viewport.png",
});
console.log("Viewport screenshot saved as viewport.png");
// capture specific element screenshot
const productInfo = page.locator(".entry-summary");
await productInfo.screenshot({
path: "./specific-element.png",
});
console.log("Specific element screenshot saved as product-summary.png");
// close the browser
await browser.close();
} catch (error) {
console.error("An error occurred:", error);
}
})();
You will get all three screenshots in PNG format when you run this code.
To learn more, read our detailed tutorial on how to take screenshots in Playwright.
Request and Response Intercepting
Request and response interception in Playwright provides powerful control over network traffic during web scraping. By intercepting network requests, you can modify headers in Playwright to mimic real browsers, bypass certain security checks, transform responses before they reach the page, and more.
The code below demonstrates a practical implementation of request interception to customize HTTP headers. It intercepts all outgoing requests and modifies them to include a custom User-Agent, language preferences, and referer information.
This technique is particularly useful when websites implement browser fingerprinting or require specific header configurations to allow access. By modifying these headers, your scraper can better emulate legitimate browser behavior. Let's look at the implementation:
// import playwright's chromium browser
const { chromium } = require("playwright");
// main function to demonstrate request interception
async function scrapeWithInterception() {
// initialize browser instance
const browser = await chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();
// set up request interception
await page.route("**/*", async (route) => {
// get the request details
const request = route.request();
// modify headers for all requests
const headers = {
...request.headers(),
// simulate chrome browser
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
// set preferred language
"Accept-Language": "en-US,en;q=0.9",
// simulate coming from google search
Referer: "https://www.google.com/",
};
// continue with modified headers
await route.continue({ headers });
});
try {
// navigate to test endpoint
await page.goto("https://httpbin.io/headers");
// extract the full HTML content
const html = await page.content();
console.log("Page loaded with custom headers!");
console.log(html);
} catch (error) {
// handle navigation failures
console.error("Navigation failed:", error);
} finally {
// ensure browser cleanup
await browser.close();
}
}
// execute the scraping function
scrapeWithInterception();
You'll get the following output on running this code (notice the modified headers):
{
"headers": {
"Accept": [
"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"
],
"Accept-Encoding": [
"gzip, deflate, br, zstd"
],
"Accept-Language": [
"en-US,en;q=0.9"
],
"Cache-Control": [
"no-cache"
],
"Connection": [
"keep-alive"
],
"Host": [
"httpbin.io"
],
"Pragma": [
"no-cache"
],
"Referer": [
"https://www.google.com/"
],
"Sec-Ch-Ua": [
"\"HeadlessChrome\";v=\"131\", \"Chromium\";v=\"131\", \"Not_A Brand\";v=\"24\""
],
"Sec-Ch-Ua-Mobile": [
"?0"
],
"Sec-Ch-Ua-Platform": [
"\"Windows\""
],
"Sec-Fetch-Dest": [
"document"
],
"Sec-Fetch-Mode": [
"navigate"
],
"Sec-Fetch-Site": [
"none"
],
"Sec-Fetch-User": [
"?1"
],
"Upgrade-Insecure-Requests": [
"1"
],
"User-Agent": [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
]
}
}
Playwright vs. Puppeteer vs. Selenium
How does Playwright compare with Selenium and Puppeteer, the other two most popular browser automation tools for web scraping?
Playwright can run seamlessly across multiple browsers using a single API and has extensive documentation to help you get going. It allows the use of different programming languages like Python, Node.js, Java, and .NET, but not Ruby.
Meanwhile, Selenium has a slightly wider range of language compatibility as it works with Ruby, but it needs third-party add-ons for parallel execution and video recording.
On the other hand, Puppeteer is a more limited tool but about 60% faster than Selenium, and slightly faster than Playwright.
Let's take a look at this comparison table:

| | Playwright | Puppeteer | Selenium |
|---|---|---|---|
| Language support | JavaScript, Python, Java, .NET | JavaScript/TypeScript | JavaScript, Python, Java, Ruby, C#, and more |
| Browser support | Chromium, Firefox, WebKit | Chromium (limited Firefox support) | Chrome, Firefox, Safari, Edge, and more |
| Relative speed | Fast | Fastest | Slowest of the three |
As you can see, Playwright certainly wins that competition for most use cases. But if you're still not convinced, here's a summary of Playwright features to consider:
- It has cross-browser, cross-platform, and cross-language support.
- Playwright can isolate browser contexts for each test or scraping loop you run. You can customize settings like cookies, proxies, and JavaScript on a per-context basis to tailor the browser experience (see the sketch after this list).
- Its auto-waiting feature determines when an element is ready for interaction. Complementing await page.click() with APIs such as await page.waitForSelector() or await page.waitForFunction() keeps your scraper from acting on elements before they're ready.
- Playwright supports proxy servers, letting you route requests through different IP addresses.
- It's also possible to lower your bandwidth usage by blocking resources in Playwright.
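Here's a minimal sketch combining the last three points: an isolated context routed through a proxy, with heavy resources blocked to save bandwidth. The proxy endpoint and credentials are placeholders, so swap in your own:

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch();
  // isolated context with its own proxy, cookies, and settings
  const context = await browser.newContext({
    proxy: {
      server: "http://proxy.example.com:8080", // placeholder endpoint
      username: "<YOUR_USERNAME>",
      password: "<YOUR_PASSWORD>",
    },
  });
  const page = await context.newPage();
  // block images, stylesheets, and fonts to reduce bandwidth
  await page.route("**/*", (route) =>
    ["image", "stylesheet", "font"].includes(route.request().resourceType())
      ? route.abort()
      : route.continue()
  );
  await page.goto("https://www.scrapingcourse.com/ecommerce/");
  console.log(await page.title());
  await browser.close();
})();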
If you want to dig deeper, we wrote direct comparisons of these tools.
For more details on working with Playwright and other web scraping libraries in Node.js, check out this guide on top Node.js web scraping libraries.
Avoid Getting Blocked While Scraping with Playwright
Web scraping with Playwright often faces a significant challenge: anti-bot systems. Modern anti-bot systems employ sophisticated detection methods like browser fingerprinting, behavioral analysis, request patterns, IP reputation, machine learning models, and more to distinguish between real users and automated scripts.
Let's test the anti-bot bypass capability of Playwright by scraping the full-page HTML of this Anti-bot Challenge page:
const { chromium } = require("playwright");
(async () => {
try {
// launch the browser
const browser = await chromium.launch();
// playwright runs in headless mode by default for better performance
// create a new browser context and page
const page = await browser.newPage();
// navigate to the target website
await page.goto("https://www.scrapingcourse.com/antibot-challenge");
// extract the full HTML content
const html = await page.content();
console.log(html);
// close the browser
await browser.close();
} catch (error) {
console.error("An error occurred:", error);
}
})();
Playwright got blocked by the anti-bot. Here's the output:
<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Just a moment...</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<!-- ... -->
</head>
<body>
<!-- ... -->
<h2 class="h2" id="challenge-running">
Checking if the site connection is secure
</h2>
<!-- ... -->
</body>
</html>
This was expected: Playwright and other headless browsers present bot-like attributes that make them easily detectable. While you can implement basic anti-blocking techniques, like custom request headers or proxy rotation, they're often ineffective against modern anti-bot systems.
This is where ZenRows Scraping Browser comes in. It fortifies your Playwright browser instance with continually updated advanced evasions to mimic an actual user and bypass anti-bot checks.
It provides a cloud-based managed infrastructure, which removes local memory overhead and makes your web scraping highly scalable. It also handles other tasks under the hood, such as residential proxy auto-rotation to distribute your requests efficiently and evade IP bans or geo-restrictions.
Integrating the Scraping Browser into your existing Playwright scraper requires only a single line of code.
Let's see how it works by requesting the Anti-bot Challenge page that previously blocked our Playwright scraper.
Sign up to open the ZenRows Request Builder. Then, go to the Scraping Browser Builder dashboard and copy your Browser URL.
Update the previous code by replacing the launch() call with connectOverCDP() pointed at the ZenRows Scraping Browser connection URL. Here's the updated code:
const { chromium } = require("playwright");
// define your connection URL
const connectionURL = "wss://browser.zenrows.com?apikey=<YOUR_ZENROWS_API_KEY>";
(async () => {
try {
// launch the browser
const browser = await chromium.connectOverCDP(connectionURL);
// playwright runs in headless mode by default for better performance
// create a new browser context and page
const page = await browser.newPage();
// navigate to the target website
await page.goto("https://www.scrapingcourse.com/antibot-challenge");
// extract the full HTML content
const html = await page.content();
console.log(html);
// close the browser
await browser.close();
} catch (error) {
console.error("An error occurred:", error);
}
})();
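Run the script again. This time, ZenRows handles the challenge, and you'll get the page's full HTML: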
<html lang="en">
<head>
<!-- ... -->
<title>Antibot Challenge - ScrapingCourse.com</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
<h2>
You bypassed the Antibot challenge! :D
</h2>
<!-- other content omitted for brevity -->
</body>
</html>
Congratulations! 🎉 You have successfully bypassed the anti-bot measures using a one-liner integration of Playwright and ZenRows.
Conclusion
Throughout this guide, you've learned essential Playwright scraping techniques, from basic setup and data extraction to advanced features like request interception and robust waiting mechanisms.
Although headless browsers like Playwright offer several benefits, they can't bypass anti-bot mechanisms on their own. We recommend ZenRows for reliable web scraping at any scale: it removes these limitations while retaining all the benefits of Playwright. Try ZenRows for free!