Are you trying to scrape data from HTML tables using Puppeteer? You're in the right place! Table parsing can be challenging, especially with complex layouts or tables that rely on JavaScript rendering.
This article teaches you how to parse and scrape an HTML table with Puppeteer in Node.js. We'll also introduce you to a simpler, more efficient alternative to streamline your workflow.
How to Parse Tables With Puppeteer
You can parse and scrape static HTML tables with tools like Axios and Cheerio. However, when dealing with dynamically rendered tables, you need a browser-enabled tool. That's where a headless browser like Puppeteer comes in.
Puppeteer lets you interact with web elements via a browser, making it ideal for parsing dynamically rendered HTML tables that rely on JavaScript to populate their content.
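For contrast, here's a minimal sketch of the static approach with Axios and Cheerio. It only works when the table markup ships in the initial HTML response:
// npm install axios cheerio
const axios = require('axios');
const cheerio = require('cheerio');
(async () => {
// fetch the raw HTML (no JavaScript execution happens here)
const { data: html } = await axios.get('https://www.scrapingcourse.com/table-parsing');
// load the HTML into Cheerio and grab every table cell's text
const $ = cheerio.load(html);
const cells = $('table tr td')
.map((_, td) => $(td).text().trim())
.get();
console.log(cells);
})();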
In this section, you'll use Puppeteer to scrape the HTML Table Challenge, a demo page for learning HTML table parsing. We'll show you the two methods of parsing HTML tables in Puppeteer: the page.$$eval and page.evaluate functions.
First, you'll visit the target page. Then, you'll extract the table elements based on their CSS selectors.
First, start a Node.js project by running the following command:
npm init -y
Install Puppeteer using npm:
npm install puppeteer
Before moving on, let's inspect the table elements. Open the target page in a browser like Chrome. Right-click the table and select Inspect to open the developer tools.
The table has an ID of product-catalog, with rows (tr) and data cells (td) bearing unique class names.
Let's start with the page.$$eval method.
Use Puppeteer's page.$$eval Method
Puppeteer's page.$$eval method selects all elements matching a CSS or tag selector, maps through them, and retrieves the innerText of each matching cell. Its major disadvantage is that it provides less flexibility for customizing the output.
Import Puppeteer into your JavaScript file, launch a browser instance, create a new page, and open the target website:
// npm install puppeteer
const puppeteer = require('puppeteer');
(async () => {
// launch a new browser instance
const browser = await puppeteer.launch();
const page = await browser.newPage();
// open the target table URL
await page.goto('https://www.scrapingcourse.com/table-parsing');
// ... scraping logic
await browser.close();
})();
To scrape the HTML table data with the page.$$eval method, map through the matching cells like so:
// ...
(async () => {
// ...
// scrape the table data
const tableData = await page.$$eval('table tr td', (tds) =>
tds.map((td) => {
return td.innerText;
})
);
//... log the extracted data
// close the browser
await browser.close();
})();
Merge the snippets. The complete code looks like this:
// npm install puppeteer
const puppeteer = require('puppeteer');
(async () => {
// launch a new browser instance
const browser = await puppeteer.launch();
const page = await browser.newPage();
// open the target table URL
await page.goto('https://www.scrapingcourse.com/table-parsing');
// scrape the table data
const tableData = await page.$$eval('table tr td', (tds) =>
tds.map((td) => {
return td.innerText;
})
);
// log the extracted data
console.log(tableData);
// close the browser
await browser.close();
})();
You'll get the following output:
[
'001', 'Laptop', 'Electronics',
'$999.99', 'Yes', '002',
'Smartphone', 'Electronics', '$599.99',
// ... other products omitted for brevity,
'No', '015', 'Gaming Console',
'Electronics', '$399.99', 'Yes'
]
Keep going!
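If you need row structure from this flat output, you can regroup the cells yourself. Here's a quick sketch, assuming the table's five columns:
// regroup the flat cell array into rows of five columns
const chunkRows = (cells, size = 5) => {
const rows = [];
for (let i = 0; i < cells.length; i += size) {
rows.push(cells.slice(i, i + size));
}
return rows;
};
console.log(chunkRows(tableData));
// [ [ '001', 'Laptop', 'Electronics', '$999.99', 'Yes' ], ... ]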
However, the page.evaluate method gives you more control over the output. You'll see how to use it below.
Parse HTML Tables With Puppeteer's page.evaluate
The page.evaluate method leverages the Chrome DevTools Protocol (CDP) to execute JavaScript directly within the browser's context. Its upside is that it gives you fine-grained control over the output.
Using the same mapping principle, you can also implement the page.evaluate method to iterate through the table and extract data from each row. However, like the previous page.$$eval method, this approach doesn't return the table headers:
// ...
(async () => {
// ...
// scrape the table data
const tableData = await page.evaluate(() => {
const tds = Array.from(document.querySelectorAll('table tr td'));
return tds.map((td) => td.innerText);
});
//... log the extracted data
// close the browser
await browser.close();
})();
The above gives the same output as the previous page.$$eval method:
[
'001', 'Laptop', 'Electronics',
'$999.99', 'Yes', '002',
'Smartphone', 'Electronics', '$599.99',
// ... other products omitted for brevity,
'No', '015', 'Gaming Console',
'Electronics', '$399.99', 'Yes'
]
To gain more control over the table and define the output headers, we'll scrape each column's data by locating its class selector.
Select the table rows using querySelectorAll, then iterate through them, extracting each cell's content with querySelector. Push the extracted data into an empty array, return the array, and log it:
// ...
(async () => {
// ...
// scrape the table data
const tableData = await page.evaluate(() => {
const rows = document.querySelectorAll('#product-catalog tbody tr');
// empty array to collect scraped data
const data = [];
// iterate through the rows to collect their data
rows.forEach((row) => {
const product = {
id: row.querySelector('.product-id').textContent.trim(),
name: row.querySelector('.product-name').textContent.trim(),
category: row
.querySelector('.product-category')
.textContent.trim(),
price: row.querySelector('.product-price').textContent.trim(),
inStock: row.querySelector('.product-stock').textContent.trim(),
};
data.push(product);
});
// return the extracted data
return data;
});
// log the extracted data
console.log(tableData);
// close the browser
await browser.close();
})();
Merge the snippets, and you'll get the following final code:
// npm install puppeteer
const puppeteer = require('puppeteer');
(async () => {
// launch a new browser instance
const browser = await puppeteer.launch();
const page = await browser.newPage();
// open the target table URL
await page.goto('https://www.scrapingcourse.com/table-parsing');
// scrape the table data
const tableData = await page.evaluate(() => {
const rows = document.querySelectorAll('#product-catalog tbody tr');
// empty array to collect scraped data
const data = [];
// iterate through the rows to collect their data
rows.forEach((row) => {
const product = {
id: row.querySelector('.product-id').textContent.trim(),
name: row.querySelector('.product-name').textContent.trim(),
category: row
.querySelector('.product-category')
.textContent.trim(),
price: row.querySelector('.product-price').textContent.trim(),
inStock: row.querySelector('.product-stock').textContent.trim(),
};
data.push(product);
});
// return the extracted data
return data;
});
// log the extracted data
console.log(tableData);
// close the browser
await browser.close();
})();
The above code extracts the desired data from the table with the custom table headers, as shown:
[
{
id: '001',
name: 'Laptop',
category: 'Electronics',
price: '$999.99',
inStock: 'Yes',
},
{
id: '002',
name: 'Smartphone',
category: 'Electronics',
price: '$599.99',
inStock: 'Yes',
},
// ... other products omitted for brevity,
{
id: '014',
name: 'Air Purifier',
category: 'Home',
price: '$129.99',
inStock: 'No',
},
{
id: '015',
name: 'Gaming Console',
category: 'Electronics',
price: '$399.99',
inStock: 'Yes',
},
]
Another Puppeteer table-parsing technique in the bag!
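If you want to persist this structured result, a minimal follow-up using Node's built-in fs module (writing to a hypothetical products.json file) could look like this:
// inside the same async function, after scraping tableData
const fs = require('fs');
// write the scraped objects to disk as pretty-printed JSON
fs.writeFileSync('products.json', JSON.stringify(tableData, null, 2));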
However, if you want to avoid locating and maintaining CSS selectors, Puppeteer offers a more straightforward approach: the puppeteer-table-parser library. Let's see how it works in the next section.
Pro-Alternative #1: puppeteer-table-parser
The puppeteer-table-parser is a dedicated JavaScript library that overcomes the complexities of parsing tables with Puppeteer. It gives you a built-in table-parsing feature while still letting you use Puppeteer's JavaScript rendering capability.
To use puppeteer-table-parser, you only need to supply the target table's selector and a mapping of its column names. The library then returns the table data in a semicolon-delimited format.
To start, install the puppeteer-table-parser library with npm:
npm install puppeteer-table-parser
Now, import the library into your scraper:
// npm install puppeteer puppeteer-table-parser
// ...
const { tableParser } = require('puppeteer-table-parser');
Pass the page instance to tableParser along with an options object. The allowedColNames field maps each column header on the page to the name it should have in the output, while selector specifies the table's CSS selector. For simplicity, we've used the table tag name in this case:
(async () => {
// ...
// scrape the table data
const tableData = await tableParser(page, {
selector: 'table',
allowedColNames: {
'Product ID': 'Product ID',
Name: 'Name',
Category: 'Category',
Price: 'Price',
'In Stock': 'In Stock',
},
});
// log the extracted data
console.log(tableData);
// close the browser
await browser.close();
})();
Merge the snippets, and you'll get the following final code:
// npm install puppeteer puppeteer-table-parser
const puppeteer = require('puppeteer');
const { tableParser } = require('puppeteer-table-parser');
(async () => {
// launch a new browser instance
const browser = await puppeteer.launch();
const page = await browser.newPage();
// open the target table URL
await page.goto('https://www.scrapingcourse.com/table-parsing');
// scrape the table data
const tableData = await tableParser(page, {
selector: 'table',
allowedColNames: {
'Product ID': 'Product ID',
Name: 'Name',
Category: 'Category',
Price: 'Price',
'In Stock': 'In Stock',
},
});
// log the extracted data
console.log(tableData);
// close the browser
await browser.close();
})();
The above code returns the following output:
Product ID;Name;Category;Price;In Stock
001;Laptop;Electronics;$999.99;Yes
002;Smartphone;Electronics;$599.99;Yes
// ... other products omitted for brevity
014;Air Purifier;Home;$129.99;No
015;Gaming Console;Electronics;$399.99;Yes
Bravo! You just simplified your Puppeteer table parser further.
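If you'd rather work with structured objects than a semicolon-delimited string, a small post-processing step can convert the output. A quick sketch, assuming the format shown above:
// split the delimited string into an array of product objects
const [headerLine, ...rowLines] = tableData.trim().split('\n');
const headers = headerLine.split(';');
const products = rowLines.map((line) => {
const values = line.split(';');
// pair each header with its corresponding cell value
return Object.fromEntries(headers.map((header, i) => [header, values[i]]));
});
console.log(products);
// [ { 'Product ID': '001', Name: 'Laptop', ... }, ... ]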
While Puppeteer is a powerful web scraping tool with browser automation features, it has shortcomings that may warrant considering alternatives.
One of Puppeteer's setbacks is that its browser instances are memory-demanding, making scaling computationally expensive. Plus, it leaks bot-like signals, such as the HeadlessChrome User Agent flag and the WebDriver property, making it prone to anti-bot detection. Puppeteer also has a steeper learning curve than simple HTTP clients and parsers like Cheerio.
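You can observe these leaks yourself. For instance, a quick check inside the page context:
// inspect the fingerprints a default headless launch exposes
const signals = await page.evaluate(() => ({
userAgent: navigator.userAgent, // typically contains 'HeadlessChrome'
webdriver: navigator.webdriver, // true in automated browsers
}));
console.log(signals);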
We'll now explore a better alternative to overcome these limitations.
Pro-Alternative #2: ZenRows
ZenRows is the best alternative to Puppeteer if you want an easy way to scrape table data with only a few lines of code. It's a web scraping solution with a complete toolkit to scrape any website at scale without limitations and is compatible with any programming language.
ZenRows offers a table auto-parsing feature, removing the complexity of updating table selectors due to HTML layout changes or obfuscation. It also acts as a headless browser, allowing you to scrape dynamically rendered tables. And if your target website uses some form of anti-bot to prevent table parsing, rest assured that ZenRows will bypass it for you.
While parsing tables with ZenRows, you get your result in JSON format, which is code-friendly and easy to manipulate. You only need to make a single API call and watch it complete the table parsing tasks under the hood.
Let's see how ZenRows works by scraping the HTML table on the previous target page (HTML Table Challenge).
Sign up to open the ZenRows Request Builder dashboard. Paste the target URL in the link box, and activate Premium Proxies and JS Rendering. Select Node.js as your programming language and choose the API connection mode. Copy and paste the generated code into your Node.js script and add the outputs: 'tables' option to the existing request parameters.
Here's what your code should look like. Pay attention to the tables[0] index used to access the first parsed table in the response output:
// npm install axios
const axios = require('axios');
const url = 'https://www.scrapingcourse.com/table-parsing';
const apikey = '<YOUR_ZENROWS_API_KEY>';
axios({
url: 'https://api.zenrows.com/v1/',
method: 'GET',
params: {
url: url,
apikey: apikey,
js_render: 'true',
premium_proxy: 'true',
outputs: 'tables',
},
})
.then((response) => console.log(response.data.tables[0]))
.catch((error) => console.log(error));
The above code parses the table data and scrapes its content, including metadata, as shown:
{
Dimensions: { Rows: 15, Cols: 5, Headings: true },
Headings: ['Product ID', 'Name', 'Category', 'Price', 'In Stock'],
Content: [
{
Category: 'Electronics',
'In Stock': 'Yes',
Name: 'Laptop',
Price: '$999.99',
'Product ID': '001',
},
{
Category: 'Electronics',
'In Stock': 'Yes',
Name: 'Smartphone',
Price: '$599.99',
'Product ID': '002',
},
//... other products omitted for brevity,
{
Category: 'Home',
'In Stock': 'No',
Name: 'Air Purifier',
Price: '$129.99',
'Product ID': '014',
},
{
Category: 'Electronics',
'In Stock': 'Yes',
Name: 'Gaming Console',
Price: '$399.99',
'Product ID': '015',
},
],
};
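Because the result is plain JSON, downstream processing is straightforward. For example, assuming the response shape above, you could filter for in-stock products:
// example: keep only products marked as in stock
const rows = response.data.tables[0].Content;
const inStock = rows.filter((row) => row['In Stock'] === 'Yes');
console.log(inStock.map((row) => row.Name));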
That's it 🎉! You just scraped an HTML table with ZenRows in JavaScript without inspecting web elements or writing any scraping logic.
Conclusion
In this article, you've learned how to parse an HTML table with Puppeteer in Node.js. Despite Puppeteer's ability to interact with web pages, its dependence on browser instances, steep learning curve, and inability to bypass anti-bots make it unreliable for parsing tables at scale.
You've seen how ZenRows simplifies table parsing, handling the scraping logic for you while extracting metadata and table content in a clean JSON format. ZenRows is the go-to solution for scraping the internet without getting blocked.
Try ZenRows for free now without a credit card!