HTML tables are a common format for displaying structured data on websites, but they can pose significant challenges for web scraping—especially with complex or irregular layouts.
In this guide, we'll walk you through how to parse HTML tables using Node.js, and we'll suggest more efficient solutions for handling complex cases.
Why Is Table Parsing Difficult With NodeJS?
Parsing HTML tables can get complicated due to several factors. Here are the three most common scenarios you're likely to encounter.
Complex Table Structure
HTML tables can have complex layouts that make it difficult to extract valuable data. For instance, some website designs include nested tables (tables within tables). Retrieving information from such structures can be complicated, as you have to account for multiple layers of table elements (<table>, <tr>, etc.).
Similarly, HTML tables can include merged cells or irregular formatting, such as an inconsistent number of columns per row. These attributes and styles alter the table structure, making parsing challenging, especially for beginners in web scraping.
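For illustration, here's a hypothetical markup snippet combining both issues: a table nested inside a cell of another, plus a cell merged across two columns with colspan:
<table class="outer">
    <tr>
        <td>
            <!-- a nested table inside a cell of the outer table -->
            <table class="inner">
                <tr><td>Nested value</td></tr>
            </table>
        </td>
        <!-- a merged cell spanning two columns -->
        <td colspan="2">Merged value</td>
    </tr>
</table>
A naive selector like td matches cells from both layers, and the colspan cell means column counts no longer line up row by row.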
Dynamic Tables
Many modern websites display content dynamically using JavaScript or AJAX, meaning the table data may not be present in the initial HTML. Instead, you must take additional steps to execute JavaScript before you can access your desired data.
This can complicate the scraping process, as you need to integrate headless browsers like Puppeteer or Selenium to render dynamic content.
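As a minimal sketch (the URL is a placeholder, not a real target), here's how you might render such a page with Puppeteer before parsing it:
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // illustrative URL; replace it with your dynamic target
    await page.goto('https://example.com/dynamic-table');
    // wait until JavaScript has injected the table into the DOM
    await page.waitForSelector('table');
    const html = await page.content(); // fully rendered HTML
    await browser.close();
    // hand `html` to any of the parsers discussed below
})();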
Anti-bot Systems and Obfuscated HTML
In today's internet, most websites route requests through anti-bot solutions like Cloudflare and other WAF systems or obfuscate their data to prevent malicious attacks. These techniques create significant challenges for web scrapers, as they are specifically designed to block automated access.
To parse tables from such sites, you must first bypass the anti-bot protection or deobfuscate the HTML—an understandably daunting task.
Best HTML Table Parsers for NodeJS
Let's explore four of the best NodeJS HTML parsers.
ZenRows
Unlike most tools on this list, ZenRows is more than just an HTML parser. It's a web scraping API for extracting data at scale without getting blocked. It offers numerous features, including premium proxies, anti-CAPTCHA, auto-user agent rotation, advanced anti-bot bypass, and more, depending on your use case.
For parsing tables, ZenRows offers a more straightforward approach. It requires less code compared to other options on this list. You only need a single API call to extract all the table data on a web page.
This tool returns data in JSON format and automatically sections the output into dimensions, headings, and content, allowing for easy processing and transmission of table values via APIs.
That's not all.
ZenRows' advanced techniques automatically identify table elements even when page layouts change or in the event of dynamic class names. Also, its bypass features allow you to extract data from protected websites.
All this and more makes ZenRows the best NodeJS table parser.
Pros
- Returns table data in JSON format.
- Bypasses anti-bot systems.
- Resilient to layout changes and dynamic class names.
- Headless browser functionality automatically executes JavaScript.
- Automatically handles obfuscation and WAF bypass.
- Auto-rotating proxies.
- Support for XPath and CSS selectors.
- Easy to use and requires less code compared to open-source alternatives.
- Compatible with any programming language.
Cons
- While ZenRows offers a free trial, it's primarily a paid service.
Cheerio
Cheerio is arguably the most popular HTML parsing framework in NodeJS, and here's why.
First, its lightweight design makes it fast and efficient for everyday use. Second, it creates a Cheerio object, or DOM tree, that lets you navigate and manipulate HTML elements using jQuery syntax. This makes parsing HTML pretty straightforward, as jQuery syntax is easy to understand and use.
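As a quick, self-contained illustration of the jQuery-style API (the HTML string is made up for the example):
const cheerio = require('cheerio');

// load an HTML string and query it with jQuery-style selectors
const $ = cheerio.load('<ul><li>First</li><li>Second</li></ul>');
console.log($('li').first().text()); // First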
However, Cheerio has limitations. It doesn't execute JavaScript, so it can't see dynamically rendered content on its own, and DOM manipulation can become slow and inefficient on very large or complex documents.
Pros
- jQuery syntax.
- Lightweight and fast.
- Flexible parsing with parse5 and htmlparser2.
- Supports CSS selectors.
Cons
- It can be slow when dealing with complex HTML.
- Requires integration with headless browsers to render dynamic content.
Node-HTML-Parser
node-html-parser, also known as Fast HTML Parser, is a NodeJS library that generates a simplified DOM tree with support for element queries. It's optimized for parsing large HTML files with minimal overhead.
While its performance focus enables you to parse data quickly, it may struggle with certain malformed HTML. It does, however, account for common HTML errors, such as HTML4-style missing closing tags on elements like <li> and <td>.
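Here's a minimal sketch of how it works (the HTML string is illustrative):
const { parse } = require('node-html-parser');

// parse an HTML string into a simplified DOM tree
const root = parse('<table><tr><td>001</td><td>Laptop</td></tr></table>');

// query elements much like the browser DOM API
const cells = root.querySelectorAll('td').map((td) => td.text);
console.log(cells); // [ '001', 'Laptop' ]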
Pros
- Element query support.
- Simplified DOM tree.
- Handles basic malformed HTML.
- High performance.
- Optimized for handling large HTML files.
Cons
- Struggles to handle certain malformed HTML.
- Limited community support, compared to other open-source alternatives.
Puppeteer
Puppeteer is a NodeJS library that can be used to parse HTML tables. However, it's much more than that.
It offers a high-level API for controlling Chromium-based browsers via the DevTools protocol. Although initially developed for testing, Puppeteer's ability to render JavaScript and interact with web page elements extends its application to web scraping and other browser-related tasks.
With Puppeteer, you can simulate user actions such as clicking buttons, scrolling, and filling in inputs. This can be useful when dealing with dynamic or complex websites. Puppeteer also offers built-in methods that can simplify your parsing process.
However, anti-bot systems can detect these automation properties and block your requests.
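For instance, here's a minimal sketch that uses page.$$eval to pull table rows straight out of the rendered page, using this tutorial's target page:
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.scrapingcourse.com/table-parsing');
    // $$eval runs the callback inside the browser and returns serializable data
    const rows = await page.$$eval('table tbody tr', (trs) =>
        trs.map((tr) =>
            Array.from(tr.querySelectorAll('td')).map((td) => td.textContent.trim())
        )
    );
    console.log(rows); // [ [ '001', 'Laptop', ... ], ... ]
    await browser.close();
})();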
Pros
- Can render JavaScript.
- Web page interaction.
- Provides built-in methods for parsing HTML tables.
- Maintained by Google and over 400 contributors.
- Can run multiple instances in parallel.
- Relatively easy to use.
Cons
- It can get slow when running multiple browser instances.
- Resource intensive.
- Anti-bot systems can detect Puppeteer's automation properties.
How to Parse HTML Tables in NodeJS
Now that you know some of the best HTML table parsers in NodeJS, let's put these tools to work.
Using the table parsing challenge page, you'll learn how to parse tables with Cheerio, the most popular NodeJS HTML parser. Then, we'll move to ZenRows, a more efficient solution for every use case.
Here's what the target page for this tutorial looks like:

Using Cheerio
You need a NodeJS HTTP client to fetch the web page containing the table we'll be parsing using Cheerio. For this tutorial, we'll use Axios.
You can install both libraries using the following command:
npm install axios cheerio
Once everything's set up, import both libraries. Then, make a GET request to the target server and retrieve the response (HTML file) using Axios.
const axios = require('axios');
const cheerio = require('cheerio');

axios.get('https://www.scrapingcourse.com/table-parsing')
    .then(response => {
        const html = response.data;
    })
    .catch(error => {
        console.error(error);
    });
This code snippet retrieves the HTML file and stores it in the html variable.
Now that you have the HTML file, let's parse the table data using Cheerio.
Start by loading the HTML file into Cheerio.
// ...
const $ = cheerio.load(html);
This allows you to navigate and manipulate the HTML document using jQuery syntax.
After that, locate the table element containing your desired data and extract its text content.
In this case, since the test web page contains only one table, we can easily select it using basic HTML elements (<table>, <th>, and <tr>).
However, most real-world cases require you to inspect the page via the DevTools to identify the right selectors.
While using the selectors provided by the DevTools may work, they're sometimes less meaningful and can easily break. Thus, it's best to identify your selectors manually.
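For example, compare a typical DevTools-generated selector with a handwritten one. Both are illustrative and target the same cell, using the $ object loaded above:
// brittle: a DevTools "Copy selector" path that breaks if the layout shifts
const brittle = $('body > div:nth-child(3) > table > tbody > tr:nth-child(2) > td:nth-child(1)').text();

// resilient: targets the cell by its role within any table row
const resilient = $('table tbody tr td:first-child').first().text();

console.log(brittle, resilient);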
For simplicity, let's separate the headers from the table body.
The headers are the <th> elements nested in the first <tr>. Select all <th> elements and loop through each, extracting their text content.
// ...

// extract the headers
const headers = [];

// select all th elements and loop through each
$('th').each((i, header) => {
    headers.push($(header).text());
});
To get the table body, select all <tr> elements in <tbody> and loop through each. Within this loop, select each table cell (<td>) and extract its text content.
// ...

// extract the table body
const rows = [];

// select all tr in tbody and loop through each
$('tbody tr').each((i, row) => {
    const data = [];
    // select all td in each tr and loop through each
    $(row).find('td').each((j, cell) => {
        data.push($(cell).text());
    });
    rows.push(data);
});
That's it.
Now, combine all the steps to get the complete code and log the results to verify everything works.
const axios = require('axios');
const cheerio = require('cheerio');

axios.get('https://www.scrapingcourse.com/table-parsing')
    .then(response => {
        const html = response.data;
        const $ = cheerio.load(html);

        // extract the headers
        const headers = [];
        // select all th elements and loop through each
        $('th').each((i, header) => {
            headers.push($(header).text());
        });

        // extract the table body
        const rows = [];
        // select all tr in tbody and loop through each
        $('tbody tr').each((i, row) => {
            const data = [];
            // select all td in each tr and loop through each
            $(row).find('td').each((j, cell) => {
                data.push($(cell).text());
            });
            rows.push(data);
        });

        console.log('Headers', headers);
        console.log('Rows', rows);
    })
    .catch(error => {
        console.error(error);
    });
Here's the result:
Headers [ 'Product ID', 'Name', 'Category', 'Price', 'In Stock' ]
Rows [
  [ '001', 'Laptop', 'Electronics', '$999.99', 'Yes' ],
  [ '002', 'Smartphone', 'Electronics', '$599.99', 'Yes' ],
  [ '003', 'Headphones', 'Audio', '$149.99', 'No' ],
  [ '004', 'Coffee Maker', 'Appliances', '$79.99', 'Yes' ],
  // ... truncated for brevity
]
Congratulations! You've parsed the table using NodeJS.
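As an optional next step, you can zip the headers and rows arrays into an array of objects, which is often easier to work with downstream. This sketch assumes it runs inside the same .then callback as the code above:
// combine each row with the headers to produce keyed records
const records = rows.map((row) =>
    Object.fromEntries(row.map((value, i) => [headers[i], value]))
);

console.log(records[0]);
// { 'Product ID': '001', Name: 'Laptop', Category: 'Electronics', Price: '$999.99', 'In Stock': 'Yes' }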
However, you should know that real-world pages often involve complex scenarios, such as dynamic layouts, WAFs (Web Application Firewalls), and obfuscated HTML, that can be challenging to handle with Cheerio or any open-source tool.
In such cases, and for efficient results, you need robust parsing features like those ZenRows offers.
Below is a step-by-step guide on how to parse HTML tables using ZenRows.
Using ZenRows
To use ZenRows, you need an API key. Sign up for free to get yours. You'll be redirected to the Request Builder page, where you'll find your ZenRows API key at the top right.

Input the target URL (https://www.scrapingcourse.com/table-parsing) and activate Premium Proxies and JS Rendering to handle anti-bot systems and dynamic tables.
After that, select the NodeJS language option and choose the API connection mode.
That'll generate your request code on the right. Copy it to your code editor. Lastly, set the outputs parameter to tables.
Your code should look like this:
// npm install axios
const axios = require('axios');

const url = 'https://www.scrapingcourse.com/table-parsing';
const apikey = '<YOUR_ZENROWS_API_KEY>';

axios({
    url: 'https://api.zenrows.com/v1/',
    method: 'GET',
    params: {
        'url': url,
        'apikey': apikey,
        'js_render': 'true',
        'premium_proxy': 'true',
        'outputs': 'tables',
    },
})
    .then(response => console.log(response.data))
    .catch(error => console.log(error));
ZenRows will automatically parse the tables on the page, returning dimensions, headings, and content.
Here's the result:
{
    "Dimensions": {
        "rows": 15,
        "columns": 5,
        "heading": true
    },
    "Headings": ["Product ID", "Name", "Category", "Price", "In Stock"],
    "content": [
        {"Category": "Electronics", "In Stock": "Yes", "Name": "Laptop", "Price": "$999.99", "Product ID": "001"},
        {"Category": "Electronics", "In Stock": "Yes", "Name": "Smartphone", "Price": "$599.99", "Product ID": "002"},
        {"Category": "Audio", "In Stock": "No", "Name": "Headphones", "Price": "$149.99", "Product ID": "003"},
        // ... truncated for brevity
    ]
}
Awesome, right? That's how easy it is to parse tables using ZenRows.
Conclusion
Now that you've learned to parse HTML tables in NodeJS, you're well equipped to scrape structured data from virtually any website.
Just remember that while all the tools discussed in this article have their use cases, only ZenRows guarantees success in every scenario, regardless of layout complexity or anti-bot protection.