5 Best Rust HTML Parsers for Web Scraping

June 13, 2024 · 10 min read

A good web scraper must efficiently navigate and manipulate the HTML structure of a page to extract relevant data. To achieve that with Rust, you need a dedicated Rust HTML parser.

There are many popular Rust HTML parsers out there, each with its own set of features and capabilities. The variety can be overwhelming, but it also lets you choose the right fit for your project. This article reviews some of the best Rust HTML parsers and highlights their main use cases.

Let's go!

What Is the Best HTML Parser for Rust?

As you probably expected, there's no definitive answer to this question. The best HTML parser is the one that meets your Rust web scraping requirements.

Below, you can find a comparison table outlining the characteristics of the most popular options. It will help you compare the parsers based on your priorities.

| Library | Ease of Use | Speed | Popularity |
|---|---|---|---|
| html5ever | Requires more code to process tokens. | Fast | High |
| pulldown-cmark | Requires additional configuration and external libraries to convert and parse HTML. | Slow | High |
| Select.rs | Easy to use, with a jQuery-like interface for selecting elements and extracting data. | Fast | Low |
| Scraper | Relatively easy to use as it provides a high-level interface to Servo's html5ever and Selectors crates. | Fast | Medium |
| Kuchiki | Easy to use with CSS selector syntax. However, it isn't actively maintained. | Fast | Medium |

Now, let's examine each HTML parser in detail. You'll also explore how they perform when tasked with parsing real-world HTML. Below, you can find a sample Rust script that retrieves the web content, which will be used to test each parser.

scraper.rs
// define async main function using Tokio
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // make GET request to target URL and retrieve response
    let resp = reqwest::get("https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/")
        .await?
        .text()
        .await?;
    println!("{resp:#?}");
    Ok(())
}

This code snippet uses Tokio to define the asynchronous main function. It then sends a GET request to the target URL with reqwest and prints the raw HTML of the response.


1. Html5ever: High-Performing HTML5 Parsing


Html5ever is a widely used Rust HTML parser developed as a part of the Servo project. Its ability to parse and serialize HTML according to the WHATWG specifications makes it a universally reliable option.

Html5ever is written in Rust, so it benefits from the language's built-in memory safety. This combination gives html5ever the high-grade performance you'd expect from an HTML parser written in C while mitigating the security issues often associated with that language.

Unlike most parsers, which build a DOM tree representation of the HTML document, html5ever uses callbacks to manipulate the DOM. This allows for event-driven parsing, where a callback function is triggered by a specific event, like the closing of an HTML tag. This parsing type is memory-efficient and ultimately drives better performance.
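The callback style described above can be illustrated with a small, standard-library-only sketch. It is not html5ever's API: `Event`, `Sink`, `tokenize`, `TextCollector`, and `collect_text` are hypothetical names invented for this example, and the toy tokenizer only handles simple, attribute-free tags. The point is that the sink reacts to each event as it arrives, and no tree is ever built.

```rust
// Toy illustration of callback-based (event-driven) parsing.
// NOTE: every name here is hypothetical; this is not html5ever's API.

/// Events the toy tokenizer can emit while scanning markup.
#[derive(Debug, PartialEq)]
enum Event<'a> {
    Start(&'a str),
    End(&'a str),
    Text(&'a str),
}

/// A sink receives each event as soon as it is produced.
trait Sink {
    fn on_event(&mut self, event: Event<'_>);
}

/// Scans very simple markup (no attributes, comments, or entities)
/// and pushes events into the sink without building a tree.
fn tokenize<S: Sink>(input: &str, sink: &mut S) {
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(after_lt) = rest.strip_prefix('<') {
            // find the end of the tag; stop if the tag is unterminated
            let end = match after_lt.find('>') {
                Some(i) => i,
                None => break,
            };
            let tag = &after_lt[..end];
            match tag.strip_prefix('/') {
                Some(name) => sink.on_event(Event::End(name)),
                None => sink.on_event(Event::Start(tag)),
            }
            rest = &after_lt[end + 1..];
        } else {
            // everything up to the next '<' is text
            let end = rest.find('<').unwrap_or(rest.len());
            sink.on_event(Event::Text(&rest[..end]));
            rest = &rest[end..];
        }
    }
}

/// Collects the text inside a target element, tracking state with a depth counter.
struct TextCollector {
    target: &'static str,
    depth: usize,
    text: String,
}

impl Sink for TextCollector {
    fn on_event(&mut self, event: Event<'_>) {
        match event {
            Event::Start(name) if name == self.target => self.depth += 1,
            Event::End(name) if name == self.target => {
                self.depth = self.depth.saturating_sub(1)
            }
            Event::Text(t) if self.depth > 0 => self.text.push_str(t),
            _ => {}
        }
    }
}

/// Convenience wrapper: run the tokenizer with a TextCollector sink.
fn collect_text(input: &str, target: &'static str) -> String {
    let mut sink = TextCollector { target, depth: 0, text: String::new() };
    tokenize(input, &mut sink);
    sink.text
}

fn main() {
    let html = "<div><p>Price: </p><b>$57.00</b></div>";
    println!("{}", collect_text(html, "b")); // prints "$57.00"
}
```

The only state kept is a counter and the collected string, which is why this style tends to be memory-efficient. The trade-off is that the caller has to track context by hand.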

Html5ever is also the most popular Rust HTML parser in this list, with over 12 million crate downloads.

👍 Pros:

  • Adheres to WHATWG specifications.
  • Uses callbacks to manipulate the DOM.
  • Designed to be compatible with Rust's official stable releases.
  • Passes all HTML5 tokenizer tests.
  • Provides all hooks needed by a production web browser, e.g., document.write.
  • Enjoys a large user base, active community, and comprehensive documentation.

👎 Cons:

  • Does not provide a DOM tree representation of the HTML document.
  • Uses a tokenizer, which can result in verbose code, especially during large-scale parsing.
  • Parsing and querying can get complex as HTML elements are divided into tokens.
  • Some html5ever optimizations are only supported on nightly releases.
  • Acknowledges some differences from the WHATWG specs in its current actual behavior.

⚙️ Features:

  • UTF-8 string representation
  • Callback-based DOM manipulation
  • WHATWG specification compliant
  • HTML parsing and serialization

👨‍💻 Example:

The code below shows how to parse HTML using html5ever.

scraper.rs
// import necessary crates
extern crate html5ever;
extern crate reqwest;
 
use std::default::Default;
 
// import necessary modules from html5ever
use html5ever::tendril::*;
use html5ever::tokenizer::BufferQueue;
use html5ever::tokenizer::{TagToken, StartTag, EndTag};
use html5ever::tokenizer::{Token, TokenSink, TokenSinkResult, Tokenizer, TokenizerOpts,};
use html5ever::tokenizer::CharacterTokens;
 
// define a struct to hold the state of the parser
struct TokenPrinter {
    // define flags to track token location. 
    in_price_tag: bool,  
    in_span_tag: bool,   
    in_bdi_tag: bool,    
    price: String,       // string to hold the price
}
 
// implement the TokenSink trait for TokenPrinter
impl TokenSink for TokenPrinter {
    type Handle = ();
 
    // define function to process each token in the HTML document
    fn process_token(&mut self, token: Token, _line_number: u64) -> TokenSinkResult<()> {
        match token {
            TagToken(tag) => {
                // if the token is a start tag...
                if tag.kind == StartTag {
                    // ...and the tag is a <p> tag with class "price"...
                    if tag.name.to_string() == "p" {
                        for attr in tag.attrs {
                            if attr.name.local.to_string() == "class" && attr.value.to_string() == "price" {
                                // ...set the in_price_tag flag to true
                                self.in_price_tag = true;
                            }
                        }
                    // if we're inside a <p class="price"> tag and the tag is a <span> tag...
                    } else if self.in_price_tag && tag.name.to_string() == "span" {
                        // ...set the in_span_tag flag to true
                        self.in_span_tag = true;
                    // if we're inside a <p class="price"> tag and the tag is a <bdi> tag...
                    } else if self.in_price_tag && tag.name.to_string() == "bdi" {
                        // ...set the in_bdi_tag flag to true
                        self.in_bdi_tag = true;
                    }
                // if the token is an end tag...
                } else if tag.kind == EndTag {
                    // ...and the tag is a <p>, <span>, or <bdi> tag...
                    if tag.name.to_string() == "p" {
                        // ...set the corresponding flag to false
                        self.in_price_tag = false;
                    } else if tag.name.to_string() == "span" {
                        self.in_span_tag = false;
                    } else if tag.name.to_string() == "bdi" {
                        self.in_bdi_tag = false;
                    }
                }
            },
            // if the token is a character token (i.e., text)...
            CharacterTokens(s) => {
                // ...and we're inside a <p class="price"> tag...
                if self.in_price_tag {
                    // ...and we're inside a <span> tag...
                    if self.in_span_tag {
                        // ...add the text to the price string
                        self.price = format!("price: {}", s);
                    // ...and we're inside a <bdi> tag...
                    } else if self.in_bdi_tag {
                        // ...append the text to the price string and print it
                        self.price = format!("{}{}", self.price, s);
                        println!("{}", self.price);
                        // clear the price string for the next price
                        self.price.clear();
                    }
                }
            },         
            // ignore all other tokens
            _ => {},
        }
        // continue processing tokens
        TokenSinkResult::Continue
    }    
}
 
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // initialize the TokenPrinter
    let sink = TokenPrinter { in_price_tag: false, in_span_tag: false, in_bdi_tag: false, price: String::new() };
 
    // retrieve HTML content from target website
    let resp = reqwest::get("https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/")
        .await?
        .text()
        .await?;
 
    // convert the HTML content to a ByteTendril
    let chunk = ByteTendril::from(resp.as_bytes());
    let mut input = BufferQueue::new();
    input.push_back(chunk.try_reinterpret::<fmt::UTF8>().unwrap());
 
    // initialize the Tokenizer with the TokenPrinter
    let mut tok = Tokenizer::new(
        sink,
        TokenizerOpts::default(),
    );
    // feed the HTML content to the Tokenizer
    let _ = tok.feed(&mut input);
    assert!(input.is_empty());
    // end tokenization
    tok.end();
 
    Ok(())
}

The code defines a struct, TokenPrinter, that implements the TokenSink trait. It then creates a tokenizer with that struct as its sink and feeds the fetched HTML into it, which breaks the document down into tokens representing start tags, end tags, and text.

The struct processes these tokens as they arrive. When it encounters a <p> start tag with the class price, it sets a flag, tracks the nested <span> and <bdi> tags the same way, and assembles the price from the text tokens found inside them.

2. Scraper: Fast Web Scraping


Scraper is a popular Rust library for parsing HTML and extracting relevant data from a target webpage. It's built on top of two other Rust crates, html5ever and selectors, which are part of the Servo project.

These two libraries enable Scraper to achieve browser-grade parsing and querying. In other words, it's designed to handle real-world HTML, which isn't always standard-compliant.

On top of these crates, Scraper provides a high-level API that builds a DOM tree representation of the HTML document and lets you find elements with CSS selectors.

👍 Pros:

  • Offers browser-grade parsing and querying.
  • Can handle malformed HTML.
  • Uses html5ever and selectors under the hood.
  • Creates a DOM tree representation of the HTML document.
  • Allows you to traverse and manipulate the DOM using CSS selectors.
  • Has an active community and extensive documentation.

👎 Cons:

  • Uses up a lot of memory when creating a DOM tree representation of large HTML documents.
  • Depends on external crates.

⚙️ Features:

  • DOM tree representation
  • CSS selectors
  • HTML parsing and serialization
  • High-level API
  • External integrations

👨‍💻 Example:

The following code parses the fetched HTML response, creates a Selector that matches the element with the class price, finds the target element using the Selector, and extracts the product price.

scraper.rs
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // make GET request to target URL and retrieve response
    let resp = reqwest::get("https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/")
        .await?
        .text()
        .await?;

    // parse the full HTML page into a document
    let document = scraper::Html::parse_document(&resp);

    // define CSS selector for the price element
    let price_selector = scraper::Selector::parse(".price").unwrap();

    // extract the price from the first matching element, or return an error
    let price_element = document
        .select(&price_selector)
        .next()
        .ok_or("price element not found")?;
    let price = price_element.text().collect::<String>();
 
    println!("Price: {}", price);
 
    Ok(())
}

3. Pulldown-cmark: Markdown Parsing With HTML Support


Unlike most libraries on this list, pulldown-cmark isn't a traditional HTML parser but a pull parser for CommonMark, a standardized Markdown specification. It takes Markdown as input and renders HTML, but it doesn't extract data from HTML.

So, why is pulldown-cmark here? While it's designed to parse Markdown, you can use it on web content by first converting the HTML to Markdown with another crate. Most importantly, its pull parser architecture makes it a valuable tool when memory is critical to your project. A pull parser typically uses less memory than push parsers or tree-based parsers, and it lets you parse only what's needed, when it's needed, which can lead to better performance, particularly for large documents.
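To make the pull-parser idea concrete, here is a minimal, standard-library-only sketch. It does not use pulldown-cmark: `PullParser`, `Event`, and `first_price` are hypothetical names invented for illustration, and the line-based Markdown handling is deliberately simplistic. It shows the caller driving the parser through an iterator and stopping as soon as it has what it needs.

```rust
// Toy illustration of a pull parser: the caller asks for the next event,
// so input past the point of interest is never scanned.
// NOTE: these types are hypothetical; they are not pulldown-cmark's API.

#[derive(Debug, PartialEq)]
enum Event<'a> {
    Heading(&'a str),
    Text(&'a str),
}

struct PullParser<'a> {
    lines: std::str::Lines<'a>,
    lines_read: usize, // how many input lines have been consumed so far
}

impl<'a> PullParser<'a> {
    fn new(input: &'a str) -> Self {
        PullParser { lines: input.lines(), lines_read: 0 }
    }
}

impl<'a> Iterator for PullParser<'a> {
    type Item = Event<'a>;

    fn next(&mut self) -> Option<Event<'a>> {
        loop {
            // pull exactly one more line from the input; None ends iteration
            let line = self.lines.next()?.trim();
            self.lines_read += 1;
            if line.is_empty() {
                continue; // skip blank lines
            }
            return Some(match line.strip_prefix("# ") {
                Some(title) => Event::Heading(title),
                None => Event::Text(line),
            });
        }
    }
}

/// Returns the first text line containing '$' and the number of lines scanned.
fn first_price(input: &str) -> (Option<String>, usize) {
    let mut parser = PullParser::new(input);
    let mut found = None;
    for event in parser.by_ref() {
        if let Event::Text(text) = event {
            if text.contains('$') {
                found = Some(text.to_string());
                break; // stop pulling: the rest of the input is never parsed
            }
        }
    }
    (found, parser.lines_read)
}

fn main() {
    let markdown =
        "# Adrienne Trek Jacket\n\nPrice: $57.00\n\nA long description follows...\nMore text";
    let (price, lines_read) = first_price(markdown);
    println!(
        "{price:?} found after scanning {lines_read} of {} lines",
        markdown.lines().count()
    );
}
```

With this input, the search stops after 3 of the 6 lines. A tree-based parser would have processed the whole document before the first query could run.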

👍 Pros:

  • Fast.
  • Memory-efficient.
  • Fully compliant with CommonMark specifications.
  • Optionally supports parsing footnotes.
  • Easy to use.
  • Written in pure Rust with no unsafe blocks.

👎 Cons:

  • Requires additional configurations and other crates to parse HTML.
  • Does not support all HTML tags, attributes, and features.
  • Parsing complex HTML can be challenging.
  • Potential loss of data due to some unsupported HTML features.

⚙️ Features:

  • Pull parser architecture
  • CommonMark spec compliance
  • Supports external integrations
  • Rust Safety

👨‍💻 Example:

The following code converts the fetched HTML to Markdown using an external crate (html2md). It then creates a Markdown parser and iterates over each event to locate and extract the price.

scraper.rs
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
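    // make GET request to target URL and retrieve response
    let resp = reqwest::get("https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/")
        .await?
        .text()
        .await?;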
 
    // convert HTML to Markdown
    let md = html2md::parse_html(&resp);
 
    // create a Markdown parser
    let parser = pulldown_cmark::Parser::new(&md);
 
    // iterate over the events in the parser
    for event in parser {
        match event {
            pulldown_cmark::Event::Text(text) => {
                // check if the text contains the price
                if text.contains("$") {
                    println!("Price: {}", text);
                }
            },
            _ => {},
        }
    }
 
    Ok(())
}

4. Select: Comprehensive HTML Parsing


Select.rs is a robust Rust library for extracting data from HTML documents. Like Scraper, this library uses html5ever under the hood but provides a jQuery-like interface. Its high-level API lets you select specific elements by combining predicates such as Name, Class, and Attr.

Additionally, Select.rs offers easy-to-use methods for traversing nodes that let you navigate through HTML structures quickly. You can also modify nodes by setting HTML attributes, tags, and text. The library supports output in multiple formats, including HTML string, plain text, YAML data, and JSON data.

👍 Pros:

  • Easy to use.
  • Feature-rich library.
  • Selects elements with composable predicates such as Name and Class.
  • Multiple output formats, including YAML and JSON.
  • jQuery-like interface.
  • Compliant with HTML5 specifications.
  • Supports in-memory cache.
  • Has extensive documentation.

👎 Cons:

  • Inefficient memory usage, especially when dealing with large HTML documents.
  • It uses other libraries under the hood, which can increase the overall application size.

⚙️ Features:

  • HTML parsing and serialization
  • Node traversal and modification
  • Predicate-based element selection
  • Supports multiple output formats
  • Supports external integrations

👨‍💻 Example:

The following code parses the HTML response into a document using select.rs. Then, it finds and extracts the price using the HTML tag and class.

scraper.rs
use select::document::Document;
use select::predicate::*;
 
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
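    // make GET request to target URL and retrieve response
    let resp = reqwest::get("https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/")
        .await?
        .text()
        .await?;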
 
    // parse the response text into a Document
    let document = Document::from(resp.as_str());
 
    // find the price element using its HTML tag and class
    if let Some(price_node) = document.find(Name("p").and(Class("price"))).next() {
        println!("Price: {}", price_node.text());
    }
 
    Ok(())
}

5. Kuchiki: Efficient XML and HTML Parsing


The last solution is Kuchiki, a powerful Rust library for HTML/XML tree manipulation. Like most tools on this list, it uses html5ever under the hood. However, it adds features that make manipulating the DOM easier.

With structs like Node, NodeRef, ElementData, DocumentData, etc., for representing and working with nodes in a DOM-like tree, Kuchiki allows you to easily traverse the DOM and modify elements. Plus, it provides easy-to-use functions to parse HTML using html5ever. Additionally, its Selectors struct for working with familiar CSS selector syntax makes it easy to find and extract data.

However, Kuchiki isn't actively maintained: the owner archived the repository in January 2023. This means that while the tool remains usable, you shouldn't expect any updates or bug fixes.

👍 Pros:

  • HTML/XML tree manipulation.
  • Easy to use.
  • Supports CSS selectors.
  • DOM-like tree structure.
  • It provides easy-to-use functions for parsing HTML using html5ever.
  • Can integrate with external crates.
  • Extensive documentation.

👎 Cons:

  • Depends on external crates.
  • Hasn't been actively maintained since 2023.

⚙️ Features:

  • HTML parsing and serialization
  • DOM manipulation
  • Node traversal and modification
  • CSS selectors
  • Traits

👨‍💻 Example:

The example below shows how to parse HTML using Kuchiki.

It parses the HTML response using parse_html(), selects the <p> tag with class price using document.select_first(), and extracts the text content.

scraper.rs
use kuchiki::traits::*;
use kuchiki::parse_html;
 
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
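    // make GET request to target URL and retrieve response
    let resp = reqwest::get("https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/")
        .await?
        .text()
        .await?;

    // parse the HTML response into a DOM-like tree
    let document = parse_html().one(resp);

    // select the "p" element with class "price"
    if let Ok(price_node) = document.select_first("p.price") {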
        let price = price_node.text_contents();
        println!("Price: {}", price);
    } else {
        println!("Price not found");
    }
 
    Ok(())
}

Benchmark: Which Rust HTML Parser Is the Fastest?

Now that you've learned about the pros and cons of each tool, let's compare their overall performance.

The test scenario for this benchmark remains the same as in the code examples above. Each library will parse the fetched HTML response and extract the product price.

The table below shows the result.

| Library | Time (combined mean, ms) |
|---|---|
| html5ever | 1.66 |
| scraper | 1.58 |
| pulldown-cmark | 3.62 |
| select.rs | 1.62 |
| Kuchiki | 1.70 |

As expected, html5ever, scraper, select.rs, and Kuchiki record similar performances, as they all parse HTML using html5ever under the hood. However, pulldown-cmark fell far behind: at 3.62 ms, it took roughly 2.2 times as long as the other tools, which averaged about 1.64 ms.
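As a quick sanity check on that gap, the sketch below recomputes it from the figures in the table above. The helper name `mean` and the grouping of the four html5ever-based libraries are choices made for this illustration. They are not part of the original benchmark code.

```rust
/// Arithmetic mean of a non-empty slice.
fn mean(values: &[f64]) -> f64 {
    values.iter().sum::<f64>() / values.len() as f64
}

fn main() {
    // combined mean times in ms, copied from the benchmark table:
    // html5ever, scraper, select.rs, Kuchiki
    let html5ever_based = [1.66, 1.58, 1.62, 1.70];
    let pulldown_cmark = 3.62;

    let baseline = mean(&html5ever_based);
    let ratio = pulldown_cmark / baseline;

    println!("html5ever-based mean: {baseline:.2} ms"); // 1.64 ms
    println!("pulldown-cmark took {ratio:.1}x as long"); // 2.2x
}
```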

Here's a quick visualization of the overall result, from best to worst performer.

[Figure: Benchmark results]

To benchmark these libraries, we used Criterion.rs, a statistics-driven benchmarking library for Rust.
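For readers who want a feel for what a mean-time measurement involves, here is a simplified, standard-library-only stand-in. It is not Criterion.rs and not the harness behind the numbers above: `mean_time` and `find_price` are hypothetical helpers, and the workload is a naive string search rather than a real parser. Criterion.rs adds warm-up runs and statistical analysis that this sketch leaves out.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` `iterations` times and returns the mean wall-clock time per call.
/// (Hypothetical helper; a real benchmark should use a dedicated harness.)
fn mean_time<T>(iterations: u32, mut f: impl FnMut() -> T) -> Duration {
    assert!(iterations > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iterations {
        // black_box keeps the optimizer from discarding the result
        black_box(f());
    }
    start.elapsed() / iterations
}

/// Stand-in workload: returns the text from the first '$' up to the next '<'.
fn find_price(html: &str) -> Option<&str> {
    let start = html.find('$')?;
    let end = html[start..].find('<').map_or(html.len(), |i| start + i);
    Some(&html[start..end])
}

fn main() {
    let page = "<p class=\"price\"><bdi>$57.00</bdi></p>".repeat(1_000);
    let html = page.as_str();

    let per_call = mean_time(1_000, || find_price(black_box(html)).map(str::len));
    println!("mean time per call: {per_call:?}");
}
```

Timings from a loop like this vary between runs and machines, so treat them as rough indications only.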

Conclusion

All the Rust HTML parsers presented above enable you to access and manipulate HTML documents. Each parser has its own set of features that you should carefully review before picking the best tool for your project.

While html5ever, scraper, select.rs, and Kuchiki perform similarly, they each have unique strengths. For example, if you require a library specifically designed for web scraping, Scraper and Select.rs are the best choices. If you're working with Markdown and memory is a critical factor, go with pulldown-cmark. If manipulating the HTML tree is your main requirement, choose Kuchiki. Lastly, if you need a high-performance Rust HTML parser and verbose code isn't an issue, html5ever is the right fit.

That said, gaining access to the parseable HTML file can be challenging, even with the best parser out there. Most websites implement anti-bot measures that may prevent your web scraper from accessing them. If you'd like to learn how to successfully bypass blocks and bans, check out this guide to web scraping without getting blocked.
