Are you trying to scrape data from HTML tables using Puppeteer? You're in the right place! Table parsing can be challenging, especially with complex layouts or tables that rely on JavaScript rendering.
This article teaches you how to parse and scrape an HTML table with Puppeteer in Node.js. We'll also introduce you to a simpler, more efficient alternative to streamline your workflow.
How to Parse Tables With Puppeteer
You can parse and scrape static HTML tables with tools like Axios and Cheerio. However, when dealing with dynamically rendered tables, you need a browser-enabled tool. That's where a headless browser like Puppeteer comes in.
Puppeteer lets you interact with web elements via a browser, making it ideal for parsing dynamically rendered HTML tables that rely on JavaScript to populate their content.
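For contrast, here's a minimal sketch of the static approach with Axios and Cheerio. It only works when the table markup ships in the initial HTML response:
// npm install axios cheerio
const axios = require('axios');
const cheerio = require('cheerio');
(async () => {
// fetch the raw HTML (no JavaScript execution happens here)
const { data: html } = await axios.get('https://www.scrapingcourse.com/table-parsing');
// load the HTML into Cheerio and grab every table cell's text
const $ = cheerio.load(html);
const cells = $('table tr td')
.map((_, td) => $(td).text().trim())
.get();
console.log(cells);
})();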
In this section, you'll use Puppeteer to scrape the HTML Table Challenge, a demo page for learning HTML table parsing. We'll show you the two methods of parsing HTML tables in Puppeteer: the page.$$eval and page.evaluate functions.
First, you'll visit the target page. Then, you'll extract the table elements based on their CSS selectors.
First, start a Node.js project by running the following command:
npm init -y
Install Puppeteer using npm:
npm install puppeteer
Before moving on, let's inspect the table elements. Open the target page in a browser like Chrome. Right-click the table and select Inspect to open the developer tools.
The table has an ID of product-catalog, with rows (tr) and data cells (td) bearing unique class names.
Let's start with the page.$$eval method.
Use Puppeteer's page.$$eval Method
Puppeteer's page.$$eval method selects all elements matching a CSS or tag selector, maps through them, and retrieves the innerText of each matching cell. Its major disadvantage is that it provides less flexibility for customizing the output.
Import Puppeteer into your JavaScript file, launch a browser instance, create a new page, and open the target website:
// npm install puppeteer
const puppeteer = require('puppeteer');
(async () => {
// launch a new browser instance
const browser = await puppeteer.launch();
const page = await browser.newPage();
// open the target table URL
await page.goto('https://www.scrapingcourse.com/table-parsing');
// ... scraping logic
await browser.close();
})();
To scrape the HTML table data with the page.$$eval method, map through the matching cells like so:
// ...
(async () => {
// ...
// scrape the table data
const tableData = await page.$$eval('table tr td', (tds) =>
tds.map((td) => {
return td.innerText;
})
);
//... log the extracted data
// close the browser
await browser.close();
})();
Merge the snippets. The complete code looks like this:
// npm install puppeteer
const puppeteer = require('puppeteer');
(async () => {
// launch a new browser instance
const browser = await puppeteer.launch();
const page = await browser.newPage();
// open the target table URL
await page.goto('https://www.scrapingcourse.com/table-parsing');
// scrape the table data
const tableData = await page.$$eval('table tr td', (tds) =>
tds.map((td) => {
return td.innerText;
})
);
// log the extracted data
console.log(tableData);
// close the browser
await browser.close();
})();
You'll get the following output:
[
'001', 'Laptop', 'Electronics',
'$999.99', 'Yes', '002',
'Smartphone', 'Electronics', '$599.99',
// ... other products omitted for brevity,
'No', '015', 'Gaming Console',
'Electronics', '$399.99', 'Yes'
]
Keep going!
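If you need row structure from this flat output, you can regroup the cells yourself. Here's a quick sketch, assuming the table's five columns:
// regroup the flat cell array into rows of five columns
const chunkRows = (cells, size = 5) => {
const rows = [];
for (let i = 0; i < cells.length; i += size) {
rows.push(cells.slice(i, i + size));
}
return rows;
};
console.log(chunkRows(tableData));
// [ [ '001', 'Laptop', 'Electronics', '$999.99', 'Yes' ], ... ]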
However, the page.evaluate method gives you more control over the output. You'll see how to use it below.
Parse HTML Tables With Puppeteer's page.evaluate
The page.evaluate method leverages the Chrome DevTools Protocol (CDP) to execute JavaScript directly within the browser's context. Its upside is that it gives you fine-grained control over the output.
Using the same mapping principle, you can also implement the page.evaluate method to iterate through the table and extract data from each row. However, like the previous page.$$eval method, this approach doesn't return the table headers:
// ...
(async () => {
// ...
// scrape the table data
const tableData = await page.evaluate(() => {
const tds = Array.from(document.querySelectorAll('table tr td'));
return tds.map((td) => td.innerText);
});
//... log the extracted data
// close the browser
await browser.close();
})();
The above gives the same output as the previous page.$$eval method:
[
'001', 'Laptop', 'Electronics',
'$999.99', 'Yes', '002',
'Smartphone', 'Electronics', '$599.99',
// ... other products omitted for brevity,
'No', '015', 'Gaming Console',
'Electronics', '$399.99', 'Yes'
]
To gain more control over the table and define the output headers, we'll scrape each column's data by locating its class selector.
Select the table rows using querySelectorAll, then iterate through them, extracting each cell's content with querySelector. Push the extracted data into an empty array, return the array, and log it:
// ...
(async () => {
// ...
// scrape the table data
const tableData = await page.evaluate(() => {
const rows = document.querySelectorAll('#product-catalog tbody tr');
// empty array to collect scraped data
const data = [];
// iterate through the rows to collect their data
rows.forEach((row) => {
const product = {
id: row.querySelector('.product-id').textContent.trim(),
name: row.querySelector('.product-name').textContent.trim(),
category: row
.querySelector('.product-category')
.textContent.trim(),
price: row.querySelector('.product-price').textContent.trim(),
inStock: row.querySelector('.product-stock').textContent.trim(),
};
data.push(product);
});
// return the extracted data
return data;
});
// log the extracted data
console.log(tableData);
// close the browser
await browser.close();
})();
Merge the snippets, and you'll get the following final code:
// npm install puppeteer
const puppeteer = require('puppeteer');
(async () => {
// launch a new browser instance
const browser = await puppeteer.launch();
const page = await browser.newPage();
// open the target table URL
await page.goto('https://www.scrapingcourse.com/table-parsing');
// scrape the table data
const tableData = await page.evaluate(() => {
const rows = document.querySelectorAll('#product-catalog tbody tr');
// empty array to collect scraped data
const data = [];
// iterate through the rows to collect their data
rows.forEach((row) => {
const product = {
id: row.querySelector('.product-id').textContent.trim(),
name: row.querySelector('.product-name').textContent.trim(),
category: row
.querySelector('.product-category')
.textContent.trim(),
price: row.querySelector('.product-price').textContent.trim(),
inStock: row.querySelector('.product-stock').textContent.trim(),
};
data.push(product);
});
// return the extracted data
return data;
});
// log the extracted data
console.log(tableData);
// close the browser
await browser.close();
})();
The above code extracts the desired data from the table with the custom table headers, as shown:
[
{
id: '001',
name: 'Laptop',
category: 'Electronics',
price: '$999.99',
inStock: 'Yes',
},
{
id: '002',
name: 'Smartphone',
category: 'Electronics',
price: '$599.99',
inStock: 'Yes',
},
// ... other products omitted for brevity,
{
id: '014',
name: 'Air Purifier',
category: 'Home',
price: '$129.99',
inStock: 'No',
},
{
id: '015',
name: 'Gaming Console',
category: 'Electronics',
price: '$399.99',
inStock: 'Yes',
},
]
Another Puppeteer table-parsing technique in the bag!
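If you want to persist this structured result, a minimal follow-up using Node's built-in fs module (writing to a hypothetical products.json file) could look like this:
// inside the same async function, after scraping tableData
const fs = require('fs');
// write the scraped objects to disk as pretty-printed JSON
fs.writeFileSync('products.json', JSON.stringify(tableData, null, 2));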
However, if you want to avoid locating and maintaining CSS selectors, Puppeteer offers a more straightforward approach: the puppeteer-table-parser library. Let's see how it works in the next section.
Pro-Alternative #1: puppeteer-table-parser
The puppeteer-table-parser is a dedicated JavaScript library that overcomes the complexities of parsing tables with Puppeteer. It gives you a built-in table-parsing feature while still letting you use Puppeteer's JavaScript rendering capability.
To use puppeteer-table-parser, you only need to supply the target table's selector and a mapping of its column names. The library then returns the table data in a semicolon-delimited format.
To start, install the puppeteer-table-parser library with npm:
npm install puppeteer-table-parser
Now, import the library into your scraper:
// npm install puppeteer puppeteer-table-parser
// ...
const { tableParser } = require('puppeteer-table-parser');
Pass the page instance to tableParser along with an options object. The allowedColNames field maps each column header on the page to the name it should have in the output, while selector specifies the table's CSS selector. For simplicity, we've used the table tag name in this case:
(async () => {
// ...
// scrape the table data
const tableData = await tableParser(page, {
selector: 'table',
allowedColNames: {
'Product ID': 'Product ID',
Name: 'Name',
Category: 'Category',
Price: 'Price',
'In Stock': 'In Stock',
},
});
// log the extracted data
console.log(tableData);
// close the browser
await browser.close();
})();
Merge the snippets, and you'll get the following final code:
// npm install puppeteer puppeteer-table-parser
const puppeteer = require('puppeteer');
const { tableParser } = require('puppeteer-table-parser');
(async () => {
// launch a new browser instance
const browser = await puppeteer.launch();
const page = await browser.newPage();
// open the target table URL
await page.goto('https://www.scrapingcourse.com/table-parsing');
// scrape the table data
const tableData = await tableParser(page, {
selector: 'table',
allowedColNames: {
'Product ID': 'Product ID',
Name: 'Name',
Category: 'Category',
Price: 'Price',
'In Stock': 'In Stock',
},
});
// log the extracted data
console.log(tableData);
// close the browser
await browser.close();
})();
The above code returns the following output:
Product ID;Name;Category;Price;In Stock
001;Laptop;Electronics;$999.99;Yes
002;Smartphone;Electronics;$599.99;Yes
// ... other products omitted for brevity
014;Air Purifier;Home;$129.99;No
015;Gaming Console;Electronics;$399.99;Yes
Bravo! You just simplified your Puppeteer table parser further.
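If you'd rather work with structured objects than a semicolon-delimited string, a small post-processing step can convert the output. A quick sketch, assuming the format shown above:
// split the delimited string into an array of product objects
const [headerLine, ...rowLines] = tableData.trim().split('\n');
const headers = headerLine.split(';');
const products = rowLines.map((line) => {
const values = line.split(';');
// pair each header with its corresponding cell value
return Object.fromEntries(headers.map((header, i) => [header, values[i]]));
});
console.log(products);
// [ { 'Product ID': '001', Name: 'Laptop', ... }, ... ]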
While Puppeteer is a powerful web scraping tool with browser automation features, it has shortcomings that may warrant considering alternatives.
One of Puppeteer's setbacks is that its browser instances are memory-demanding, making scaling computationally expensive. Plus, it leaks bot-like signals, such as the HeadlessChrome User Agent flag and the WebDriver property, making it prone to anti-bot detection. Puppeteer also has a steeper learning curve than simple HTTP clients and parsers like Cheerio.
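You can observe these leaks yourself. For instance, a quick check inside the page context:
// inspect the fingerprints a default headless launch exposes
const signals = await page.evaluate(() => ({
userAgent: navigator.userAgent, // typically contains 'HeadlessChrome'
webdriver: navigator.webdriver, // true in automated browsers
}));
console.log(signals);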
We'll now explore a better alternative to overcome these limitations.
Pro-Alternative #2: ZenRows
ZenRows is the best alternative to Puppeteer if you want an easy way to scrape table data with only a few lines of code. It's a web scraping solution with a complete toolkit to scrape any website at scale without limitations and is compatible with any programming language.
ZenRows offers a table auto-parsing feature, removing the complexity of updating table selectors due to HTML layout changes or obfuscation. It also acts as a headless browser, allowing you to scrape dynamically rendered tables. And if your target website uses some form of anti-bot to prevent table parsing, rest assured that ZenRows will bypass it for you.
While parsing tables with ZenRows, you get your result in JSON format, which is code-friendly and easy to manipulate. You only need to make a single API call and watch it complete the table parsing tasks under the hood.
Let's see how ZenRows works by scraping the HTML table on the previous target page (HTML Table Challenge).
Sign up to open the ZenRows Request Builder dashboard. Paste the target URL in the link box, and activate Premium Proxies and JS Rendering. Select Node.js as your programming language and choose the API connection mode. Copy and paste the generated code into your Node.js script and add the outputs: 'tables' option to the existing request parameters.
Here's what your code should look like. Pay attention to the tables[0] index used to access the first parsed table in the response output:
// npm install axios
const axios = require('axios');
const url = 'https://www.scrapingcourse.com/table-parsing';
const apikey = '<YOUR_ZENROWS_API_KEY>';
axios({
url: 'https://api.zenrows.com/v1/',
method: 'GET',
params: {
url: url,
apikey: apikey,
js_render: 'true',
premium_proxy: 'true',
outputs: 'tables',
},
})
.then((response) => console.log(response.data.tables[0]))
.catch((error) => console.log(error));
The above code parses the table data and scrapes its content, including metadata, as shown:
{
Dimensions: { Rows: 15, Cols: 5, Headings: true },
Headings: ['Product ID', 'Name', 'Category', 'Price', 'In Stock'],
Content: [
{
Category: 'Electronics',
'In Stock': 'Yes',
Name: 'Laptop',
Price: '$999.99',
'Product ID': '001',
},
{
Category: 'Electronics',
'In Stock': 'Yes',
Name: 'Smartphone',
Price: '$599.99',
'Product ID': '002',
},
//... other products omitted for brevity,
{
Category: 'Home',
'In Stock': 'No',
Name: 'Air Purifier',
Price: '$129.99',
'Product ID': '014',
},
{
Category: 'Electronics',
'In Stock': 'Yes',
Name: 'Gaming Console',
Price: '$399.99',
'Product ID': '015',
},
],
};
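Because the result is plain JSON, downstream processing is straightforward. For example, assuming the response shape above, you could filter for in-stock products:
// example: keep only products marked as in stock
const rows = response.data.tables[0].Content;
const inStock = rows.filter((row) => row['In Stock'] === 'Yes');
console.log(inStock.map((row) => row.Name));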
That's it 🎉! You just scraped an HTML table with ZenRows in JavaScript without inspecting web elements or writing any scraping logic.
Conclusion
In this article, you've learned how to parse an HTML table with Puppeteer in Node.js. Despite Puppeteer's ability to interact with web pages, its dependence on browser instances, steep learning curve, and inability to bypass anti-bots make it unreliable for parsing tables at scale.
You've seen how ZenRows simplifies table parsing, handling the scraping logic for you while extracting metadata and table content in a clean JSON format. ZenRows is the go-to solution for scraping the internet without getting blocked.
Try ZenRows for free now without a credit card!