Playwright Web Scraping: A Complete Guide (2025)

Yuvraj Chandra
Updated: December 31, 2024 · 7 min read

Web scraping dynamic websites requires more than sending HTTP requests. You need a tool that can handle JavaScript, interact with elements, and behave like a real browser. Playwright is that tool. It offers a powerful way to automate browser interactions and extract data from modern web applications.

In this tutorial, you'll learn how to scrape data from websites using Playwright in Node.js through a practical, step-by-step approach.

Let's get started!

What Is Playwright?

Playwright is an open-source browser automation framework developed by Microsoft. You can use it with popular programming languages, including Python, Node.js, Java, and .NET. It's also compatible with all major browser engines: Chromium (which powers Google Chrome and Microsoft Edge), Firefox, and WebKit (the engine behind Safari).

You can use Playwright for a wide range of scraping and automation tasks. Playwright scripts can help you navigate pages, click buttons, fill out forms, extract text, capture screenshots, and more. You can even configure it to bypass CAPTCHA challenges and automate complex workflows.

Playwright offers a user-friendly syntax that makes it accessible even if you're new to browser automation. You'll find its headless mode particularly useful: the browser runs without a graphical interface, which reduces page load times and memory usage.
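
Headless is the default, but you can switch to a visible (headed) browser while developing or debugging. Here's a minimal sketch; the slowMo value is optional and simply slows actions down so you can follow along:

Example
const { chromium } = require("playwright");

(async () => {
  // headless: false opens a visible browser window, handy for debugging
  // slowMo pauses briefly between actions so you can watch what the script does
  const browser = await chromium.launch({ headless: false, slowMo: 250 });

  const page = await browser.newPage();
  await page.goto("https://www.scrapingcourse.com/javascript-rendering");

  await browser.close();
})();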

Ready to start using Playwright?

How to Use Playwright for Web Scraping

In this section, you'll explore the essential steps of web scraping with Playwright, starting with environment setup, moving through basic scraping and data parsing, and concluding with exporting the scraped data. Let's jump right in!

Step 1: Prerequisites

Before you start scraping with Playwright, let's get your development environment ready.

Here's what you'll need:

  • Node.js installed on your machine. Run node -v in your terminal to confirm the installation.
  • A new project folder in your desired location, with Playwright installed using the following npm command:
Terminal
npm install playwright
  • The Playwright browser binaries, installed with the following command:
Terminal
npx playwright install
  • Your favorite code editor. We'll use VS Code, but feel free to use any IDE you're comfortable with.

If you're new to web scraping with JavaScript, don't forget to check out our in-depth guide on web scraping in JavaScript and Node.js.

You're all set up!

Step 2: Build a Basic Playwright Web Scraper

Let's build your first Playwright scraper by targeting ScrapingCourse's JS Rendering demo page.

You'll launch a browser in headless mode (no visible UI), navigate to the target website, and extract its HTML content. While this is a simple example, it forms the foundation for more complex scraping tasks.

Create a new file called scraper.js and add the following code:

scraper.js
const { chromium } = require("playwright");

(async () => {
  try {
    // launch the browser
    const browser = await chromium.launch();
    // playwright runs in headless mode by default for better performance

    // create a new browser context and page
    const page = await browser.newPage();

    // navigate to the target website
    await page.goto("https://www.scrapingcourse.com/javascript-rendering");

    // extract the full HTML content
    const html = await page.content();
    console.log(html);

    // close the browser
    await browser.close();
  } catch (error) {
    console.error("An error occurred:", error);
  }
})();

To run your scraper, use this command in your terminal:

Terminal
node scraper.js

You'll see the complete HTML of the target page:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>JS Rendering Challenge to Learn Web Scraping - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
            <div class="product-info" ...>
                <span class="product-name" ...>
                    Chaz Kangeroo Hoodie
                </span>
            </div>
    <!-- ... -->
</body>
</html>

While headless mode makes scraping faster and more resource-efficient, it can trigger website anti-bot systems. This happens because headless browsers have certain properties that make them distinguishable from regular browsers.
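
You can observe one of these tell-tale properties yourself by evaluating JavaScript in the page. The sketch below checks the navigator.webdriver flag, which automated browsers typically expose as true and which anti-bot systems commonly look for:

Example
const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://www.scrapingcourse.com/javascript-rendering");

  // navigator.webdriver is a common automation signal checked by anti-bot systems
  const isAutomated = await page.evaluate(() => navigator.webdriver);
  console.log("navigator.webdriver:", isAutomated);

  await browser.close();
})();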

Don't worry if you encounter blocks! Later in this guide, we'll cover advanced techniques for avoiding detection in our section on handling anti-bot measures. For now, let's focus on extracting specific data from the page.

Step 3: Parse Data from the Page With Playwright

Now that you've accessed the page, let's extract specific data elements from our target page. The first step is identifying the HTML elements you want to scrape. Open Chrome DevTools (F12 or right-click and select Inspect) to examine the page structure.

Let's start by extracting a single product name. Right-click the first product name and select Inspect. You'll notice each product is contained within a div with the class product-item, and the product names sit inside span elements with the class product-name.

[Image: the first product name inspected in Chrome DevTools on the JS Rendering page]

Playwright offers multiple ways to select elements, including XPath and CSS selectors. In this case, we'll use CSS selectors, as they're simple and straightforward.
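
For reference, both selector styles can target the same element. Here's a minimal sketch, assuming it runs inside an async function where page has already navigated to the demo page:

Example
// CSS selector
const nameByCss = page.locator(".product-name").first();

// equivalent XPath selector (Playwright treats selectors starting with // as XPath)
const nameByXpath = page.locator("//span[contains(@class, 'product-name')]").first();

console.log(await nameByCss.innerText());
console.log(await nameByXpath.innerText());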

Here's how to extract the first product name from the page using CSS selectors and Playwright's locator method:

scraper.js
const { chromium } = require("playwright");

(async () => {
  try {
    // launch the browser
    const browser = await chromium.launch();
    // playwright runs in headless mode by default for better performance

    // create a new browser context and page
    const page = await browser.newPage();

    // navigate to the target website
    await page.goto("https://www.scrapingcourse.com/javascript-rendering");

    // extract the first product name using CSS selector
    const firstProduct = page.locator(".product-name").first();
    const productName = await firstProduct.innerText();
    console.log("First product:", productName);

    // close the browser
    await browser.close();
  } catch (error) {
    console.error("An error occurred:", error);
  }
})();

You'll get the following output on running this code:

Output
First product: Chaz Kangeroo Hoodie

After successfully extracting a single element, you can expand the script to handle multiple products.

The page structure reveals several key CSS selectors you'll need: .product-name for the product name, .product-price for pricing information, and .product-image for the product image. Use the locator's evaluateAll() method to process all matching products in a single pass.

scraper.js
const { chromium } = require("playwright");

(async () => {
  try {
    // launch the browser
    const browser = await chromium.launch();
    // playwright runs in headless mode by default for better performance

    // create a new browser context and page
    const page = await browser.newPage();

    // navigate to the target website
    await page.goto("https://www.scrapingcourse.com/javascript-rendering");

    // extract all products
    const products = page.locator(".product-item");
    const productData = await products.evaluateAll((items) => {
      return items.map((item) => ({
        name: item.querySelector(".product-name").innerText,
        price: item.querySelector(".product-price").innerText,
        image: item.querySelector(".product-image").getAttribute("src"),
      }));
    });

    console.log(productData);

    // close the browser
    await browser.close();
  } catch (error) {
    console.error("An error occurred:", error);
  }
})();

Here's what the output looks like:

Output
[
  {
    name: 'Chaz Kangeroo Hoodie',
    price: '$52',
    image: 'https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh01-gray_main.jpg'    
  },
  // other items omitted for brevity
]

Awesome! Now, you have a robust scraper that extracts product information efficiently!

Step 4: Export Scraped Data to a CSV File

Exporting scraped data to CSV format offers a versatile way to store and analyze your collected information. CSV files are lightweight, widely supported, and can be easily imported into spreadsheet software, databases, or data analysis tools.

Node.js provides a built-in fs module that makes file operations straightforward. You need to convert your structured JavaScript objects into comma-separated rows and add headers.

Here's how to transform your scraping output into a clean, structured CSV file:

scraper.js
const { chromium } = require("playwright");
const fs = require("fs");

(async () => {
  try {
    // launch the browser
    const browser = await chromium.launch();
    // playwright runs in headless mode by default for better performance

    // create a new browser context and page
    const page = await browser.newPage();

    // navigate to the target website
    await page.goto("https://www.scrapingcourse.com/javascript-rendering");

    // extract all products
    const products = page.locator(".product-item");
    const productData = await products.evaluateAll((items) => {
      return items.map((item) => ({
        name: item.querySelector(".product-name").innerText,
        price: item.querySelector(".product-price").innerText,
        image: item.querySelector(".product-image").getAttribute("src"),
      }));
    });

    // format data as CSV string
    const headers = ["name", "price", "imageURL"].join(",");
    const rows = productData.map((product) => {
      return [product.name, product.price, product.image].join(",");
    });

    // combine headers and data rows
    const csvContent = [headers, ...rows].join("\n");

    // write to file synchronously
    fs.writeFileSync("products.csv", csvContent, "utf-8");
    console.log("Data successfully exported to products.csv!");

    // close the browser
    await browser.close();
  } catch (error) {
    console.error("An error occurred:", error);
  }
})();

When you run this script, it will create a CSV file with headers and your scraped data:

[Image: the JS Rendering page's product data exported to a CSV file]
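
One caveat: the simple join() above assumes no field contains a comma or quote. If your data might, a small helper can quote and escape each value. Here's a minimal sketch (escapeCsv is a hypothetical helper, not part of the tutorial's code):

Example
// wrap each value in double quotes and escape embedded quotes (hypothetical helper)
const escapeCsv = (value) => `"${String(value).replace(/"/g, '""')}"`;

const safeRows = productData.map((product) =>
  [product.name, product.price, product.image].map(escapeCsv).join(",")
);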

Congratulations! 🎉 You've built a complete web scraper using Playwright in Node.js that extracts product data and saves it in a structured format. 

Playwright Features

Playwright isn't just a data extraction tool. It's a complete browser automation powerhouse. Beyond basic data extraction, it provides a rich set of features for interacting with web pages programmatically, which makes it ideal for handling dynamic content, form submissions, complex user interactions, and more.

The upcoming sections will explore key Playwright capabilities such as page navigation, screenshot capture, request interception, and resource blocking. These features make Playwright particularly effective for scraping modern web applications where simple HTTP requests aren't enough.

Page Navigation and Waiting Mechanisms

When scraping modern web applications, handling page navigation and content loading becomes crucial for reliable data extraction. Unlike static websites, modern web apps load content dynamically, implement pagination mechanisms, and require specific user interactions.

Navigation in Playwright goes beyond simple URL loading. The page.goto() method supports options like waitUntil to ensure proper page readiness, while page.click() handles complex scenarios, such as clicking elements that trigger route changes. These navigation features automatically manage page transitions, handle redirects, and maintain browser history.

Playwright's waiting mechanisms are intelligent and context-aware. Rather than using fixed delays, you can wait for specific events, such as network requests completing (waitForLoadState), elements becoming visible (waitForSelector), or DOM changes occurring (waitForFunction). This approach makes your scraper faster and more reliable.

Here's a simple example of page navigation and waiting:

Example
async function navigateAndWait(page) {
    // navigate with custom timeout and wait until network is idle
    await page.goto('https://www.scrapingcourse.com/infinite-scrolling', {
        timeout: 30000,
        waitUntil: 'networkidle'
    });
    
    // wait for product items to be visible
    await page.waitForSelector('.product-item');

    // click the first product and wait for the details page
    const productLink = page.locator('.product-item a').first();
    await productLink.click();
    
    // wait for product details to load
    await page.waitForSelector('.product-info');
}
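
The example above doesn't use waitForFunction, so here's a minimal sketch, assuming the same infinite-scrolling demo page and .product-item selector, that scrolls down and waits until more products have rendered:

Example
async function loadMoreProducts(page) {
    // count the products currently rendered on the page
    const initialCount = await page.locator('.product-item').count();

    // scroll to the bottom to trigger loading of the next batch
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

    // wait until more products exist than before the scroll
    await page.waitForFunction(
        (count) => document.querySelectorAll('.product-item').length > count,
        initialCount
    );
}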

You can learn more in our in-depth guide on Playwright pagination.

Taking Screenshots with Playwright

Playwright offers screenshot capabilities that work seamlessly in both headless and headed browser modes. It allows you to capture full pages, viewports, or individual elements with precision.

Playwright's screenshot API offers three distinct approaches to capture screenshots: 

  • Viewport screenshot using the screenshot() method that captures only the visible area (like what you see in your browser window). 
  • Full-page screenshot using the fullPage: true parameter along with the screenshot() method, which includes all scrollable content.
  • Specific element screenshot using the locator().screenshot() method, which lets you target particular page components.

Let's use a demo product page as our target to demonstrate the various screenshot capabilities. The script below captures the viewport, the entire page, and the product summary section.

scraper.js
const { chromium } = require("playwright");

(async () => {
  try {
    // launch the browser
    const browser = await chromium.launch();
    // playwright runs in headless mode by default for better performance

    // create a new browser context and page
    const page = await browser.newPage();

    // set consistent viewport size
    await page.setViewportSize({ width: 1280, height: 720 });

    // navigate to the target website
    await page.goto(
      "https://www.scrapingcourse.com/ecommerce/product/chaz-kangeroo-hoodie"
    );
    await page.waitForLoadState("networkidle");

    // capture full-page screenshot
    await page.screenshot({
      path: "./full-page.png",
      fullPage: true,
    });
    console.log("Full page screenshot saved as full-page.png");

    // capture viewport screenshot
    await page.screenshot({
      path: "./viewport.png",
    });
    console.log("Viewport screenshot saved as viewport.png");

    // capture specific element screenshot
    const productInfo = page.locator(".entry-summary");
    await productInfo.screenshot({
      path: "./specific-element.png",
    });
    console.log("Specific element screenshot saved as specific-element.png");

    // close the browser
    await browser.close();
  } catch (error) {
    console.error("An error occurred:", error);
  }
})();

You will get all three screenshots in PNG format when you run this code.

To learn more, read our detailed tutorial on how to take screenshots in Playwright.

Request and Response Intercepting

Request and response interception in Playwright provides powerful control over network traffic during web scraping. By intercepting network requests, you can modify headers in Playwright to mimic real browsers, bypass certain security checks, transform responses before they reach the page, and more.

The code below demonstrates a practical implementation of request interception to customize HTTP headers. It intercepts all outgoing requests and modifies them to include a custom User-Agent, language preferences, and referer information.

This technique is particularly useful when websites implement browser fingerprinting or require specific header configurations to allow access. By modifying these headers, your scraper can better emulate legitimate browser behavior. Let's look at the implementation:

scraper.js
// import playwright's chromium browser
const { chromium } = require("playwright");

// main function to demonstrate request interception
async function scrapeWithInterception() {
  // initialize browser instance
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  // set up request interception
  await page.route("**/*", async (route) => {
    // get the request details
    const request = route.request();

    // modify headers for all requests
    const headers = {
      ...request.headers(),
      // simulate chrome browser
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
      // set preferred language
      "Accept-Language": "en-US,en;q=0.9",
      // simulate coming from google search
      Referer: "https://www.google.com/",
    };

    // continue with modified headers
    await route.continue({ headers });
  });

  try {
    // navigate to test endpoint
    await page.goto("https://httpbin.io/headers");

    // extract the full HTML content
    const html = await page.content();
    console.log("Page loaded with custom headers!");
    console.log(html);
  } catch (error) {
    // handle navigation failures
    console.error("Navigation failed:", error);
  } finally {
    // ensure browser cleanup
    await browser.close();
  }
}

// execute the scraping function
scrapeWithInterception();

You'll get the following output on running this code (notice the modified headers):

Output
{
  "headers": {
    "Accept": [
      "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"
    ],
    "Accept-Encoding": [
      "gzip, deflate, br, zstd"
    ],
    "Accept-Language": [
      "en-US,en;q=0.9"
    ],
    "Cache-Control": [
      "no-cache"
    ],
    "Connection": [
      "keep-alive"
    ],
    "Host": [
      "httpbin.io"
    ],
    "Pragma": [
      "no-cache"
    ],
    "Referer": [
      "https://www.google.com/"
    ],
    "Sec-Ch-Ua": [
      "\"HeadlessChrome\";v=\"131\", \"Chromium\";v=\"131\", \"Not_A Brand\";v=\"24\""
    ],
    "Sec-Ch-Ua-Mobile": [
      "?0"
    ],
    "Sec-Ch-Ua-Platform": [
      "\"Windows\""
    ],
    "Sec-Fetch-Dest": [
      "document"
    ],
    "Sec-Fetch-Mode": [
      "navigate"
    ],
    "Sec-Fetch-Site": [
      "none"
    ],
    "Sec-Fetch-User": [
      "?1"
    ],
    "Upgrade-Insecure-Requests": [
      "1"
    ],
    "User-Agent": [
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
    ]
  }
}
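
The same routing API can also transform responses before they reach the page, as mentioned above. Here's a minimal sketch, reusing the httpbin.io test endpoint, that fetches the original response and rewrites its body before fulfilling the request:

Example
const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // intercept the headers endpoint and rewrite the response body
  await page.route("**/headers", async (route) => {
    // fetch the original response from the server
    const response = await route.fetch();
    const body = await response.text();

    // fulfill the request with a modified body
    await route.fulfill({
      response,
      body: body.replace("httpbin.io", "intercepted-by-playwright"),
    });
  });

  await page.goto("https://httpbin.io/headers");
  console.log(await page.content());

  await browser.close();
})();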

Playwright vs. Puppeteer vs. Selenium

How does Playwright compare with Selenium and Puppeteer, the other two most popular headless browsers for web scraping?

[Image: Selenium vs. Playwright vs. Puppeteer]

Playwright can run seamlessly across multiple browsers using a single API and has extensive documentation to help you get going. It allows the use of different programming languages like Python, Node.js, Java, and .NET, but not Ruby.

Meanwhile, Selenium has a slightly wider range of language compatibility as it works with Ruby, but it needs third-party add-ons for parallel execution and video recording.

On the other hand, Puppeteer is a more limited tool but about 60% faster than Selenium, and slightly faster than Playwright.

Let's take a look at this comparison table:

[Table image: Playwright vs. Selenium vs. Puppeteer feature comparison]

As you can see, Playwright certainly wins that competition for most use cases. But if you're still not convinced, here's a summary of Playwright features to consider:

  • It has cross-browser, cross-platform and cross-language support.
  • Playwright can isolate browser contexts for each test or scraping loop you run. You can customize settings like cookies, proxies, and JavaScript on a per-context basis to tailor the browser experience.
  • Its auto-waiting feature determines when elements are ready for interaction. Combined with explicit waits such as await page.waitForSelector() or await page.waitForFunction(), this makes your scraper far less likely to miss dynamically loaded data.
  • Playwright supports proxy configuration, letting you route requests through proxy servers to mask your IP address.
  • It's also possible to lower your bandwidth usage by blocking resources in Playwright (see the sketch after this list).
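
Here's a minimal sketch tying a few of these points together: it creates an isolated browser context with its own user agent (reusing the Chrome user agent from earlier) and blocks images and fonts to cut bandwidth. The commented-out proxy line is a placeholder you'd fill in with a real proxy:

Example
const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch();

  // each context behaves like an isolated incognito profile with its own settings
  const context = await browser.newContext({
    userAgent:
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    locale: "en-US",
    // proxy: { server: "http://your-proxy:8080" }, // placeholder, set a real proxy if needed
  });
  const page = await context.newPage();

  // block images and fonts to reduce bandwidth usage
  await page.route("**/*", (route) =>
    ["image", "font"].includes(route.request().resourceType())
      ? route.abort()
      : route.continue()
  );

  await page.goto("https://www.scrapingcourse.com/javascript-rendering");
  console.log(await page.content());

  await browser.close();
})();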

If you want to dig deeper, we've also written direct comparisons of these tools.

Avoid Getting Blocked While Scraping with Playwright

Web scraping with Playwright often faces a significant challenge: anti-bot systems. Modern anti-bot systems employ sophisticated detection methods like browser fingerprinting, behavioral analysis, request patterns, IP reputation, machine learning models, and more to distinguish between real users and automated scripts.

Let's test the anti-bot bypass capability of Playwright by scraping the full-page HTML of this Anti-bot Challenge page:

scraper.js
const { chromium } = require("playwright");

(async () => {
  try {
    // launch the browser
    const browser = await chromium.launch();
    // playwright runs in headless mode by default for better performance

    // create a new browser context and page
    const page = await browser.newPage();

    // navigate to the target website
    await page.goto("https://www.scrapingcourse.com/antibot-challenge");

    // extract the full HTML content
    const html = await page.content();
    console.log(html);

    // close the browser
    await browser.close();
  } catch (error) {
    console.error("An error occurred:", error);
  }
})();

Playwright got blocked by the anti-bot. Here's the output:

Output
<!DOCTYPE html>
<html lang="en-US">
<head>
    <title>Just a moment...</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2 class="h2" id="challenge-running">
        Checking if the site connection is secure
    </h2>
    <!-- ... -->
</body>
</html>

This was expected: Playwright and other headless browsers present bot-like attributes that make them easily detectable. While you can implement basic anti-blocking techniques like custom request headers or proxy rotation, these are often ineffective against modern anti-bot systems.
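
For reference, here's what a basic proxy setup looks like in Playwright. This is only a sketch: the server address and credentials are placeholders, and even with a proxy, anti-bot systems can still flag the underlying browser fingerprint:

Example
const { chromium } = require("playwright");

(async () => {
  // launch the browser with a proxy (placeholder server and credentials)
  const browser = await chromium.launch({
    proxy: {
      server: "http://proxy.example.com:8080",
      username: "<PROXY_USERNAME>",
      password: "<PROXY_PASSWORD>",
    },
  });

  const page = await browser.newPage();

  // this endpoint echoes the IP address the target server sees
  await page.goto("https://httpbin.io/ip");
  console.log(await page.content());

  await browser.close();
})();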

This is where ZenRows Scraping Browser comes in. It fortifies your Playwright browser instance with continually updated advanced evasions to mimic an actual user and bypass anti-bot checks.

It provides a cloud-based managed infrastructure, which removes local memory overhead and makes your web scraping highly scalable. It also handles other tasks under the hood, such as residential proxy auto-rotation to distribute your requests efficiently and evade IP bans or geo-restrictions.

Integrating the Scraping Browser into your existing Playwright scraper requires only a single line of code.

Let's see how it works by requesting the Anti-bot Challenge page that previously blocked our Playwright scraper.

Sign up to open the ZenRows Request Builder. Then, go to the Scraping Browser Builder dashboard and copy your Browser URL:

[Image: ZenRows Scraping Browser dashboard with the connection URL]

Update the previous code by replacing the launch() call with connectOverCDP(), passing the ZenRows Scraping Browser connection URL. Here's the updated code:

scraper.js
const { chromium } = require("playwright");

// define your connection URL
const connectionURL = "wss://browser.zenrows.com?apikey=<YOUR_ZENROWS_API_KEY>";

(async () => {
  try {
    // connect to the ZenRows Scraping Browser over CDP
    const browser = await chromium.connectOverCDP(connectionURL);

    // create a new browser context and page
    const page = await browser.newPage();

    // navigate to the target website
    await page.goto("https://www.scrapingcourse.com/antibot-challenge");

    // extract the full HTML content
    const html = await page.content();
    console.log(html);

    // close the browser
    await browser.close();
  } catch (error) {
    console.error("An error occurred:", error);
  }
})();

You'll get the following output, confirming that the challenge was bypassed:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>

Congratulations! 🎉 You have successfully bypassed the anti-bot measures using a one-liner integration of Playwright and ZenRows.

Conclusion

Throughout this guide, you've learned essential Playwright scraping techniques, from basic setup and data extraction to advanced features like request interception and robust waiting mechanisms.

Although headless browsers like Playwright offer several benefits, they can't bypass anti-bot mechanisms on their own. For reliable web scraping at any scale, we recommend pairing Playwright with ZenRows, which fixes these limitations while retaining all of Playwright's benefits. Try ZenRows for free!
