
C# HTML Parser: Best Options to Parse Content

November 15, 2023 · 15 min read

When you develop a web scraper, you aim to extract specific data from a web page. For that, you need an HTML parser in C#. We'll explore and compare the most popular ones to help you make an informed decision.

What Is the Best HTML Parser for C#?

Here's a quick comparison table of the most popular C# HTML parsers:

| Library | Popularity | Ease of Use | Speed |
| --- | --- | --- | --- |
| HTML Agility Pack | High | User-friendly | Moderate |
| AngleSharp | High | User-friendly | Fast |
| Fizzler | Moderate | Moderate | Moderate |
| CSQuery | Moderate | Moderate | Moderate |
| Selenium WebDriver | High | Requires additional setup | Slow |
| MSHTML | Low | Steep learning curve | Slow with large HTML content |
| Majestic-12 | Low | Steep learning curve | Fast |

Let's see each one in detail next. For all of them, we'll use a common base scraper: the code below sends a GET request to the target URL and retrieves its raw HTML, which will then be parsed to extract specific data from the page.

scraper.cs
using System;
using System.Net.Http;
using System.Threading.Tasks;
 
class Scraper
{
    static async Task Main()
    {
        string url = "https://scrapeme.live/shop/Pikachu/"; 
 
        using (HttpClient client = new HttpClient())
        {
            HttpResponseMessage response = await client.GetAsync(url);
 
            string htmlContent = await response.Content.ReadAsStringAsync();
            Console.WriteLine(htmlContent);
        }
    }
}

1. HTML Agility Pack: The Swiss Army Knife for C# HTML Parsing

HTML Agility Pack (HAP) Homepage

HTML Agility Pack (HAP) is a popular open-source C# parser that provides a flexible and easy-to-use API for navigating, manipulating, and extracting data from the DOM (Document Object Model). One of its major attributes is its ability to handle malformed HTML. That's particularly advantageous for web pages that may not adhere to strict HTML standards, ensuring you can extract data in any case.

Unlike most libraries on this list, HTML Agility Pack also doubles as an HTTP client, which allows you to retrieve the HTML source file with the same library. All this and more makes it one of the best C# HTML parsers.
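As a sketch of that dual role, HAP's built-in HtmlWeb class can fetch and parse a page in one call. The example below reuses the scrapeme.live product URL from the base scraper; the //h1 XPath is our own assumption about the page's markup, not part of the original article:

```csharp
using System;
using HtmlAgilityPack;

class HapFetchExample
{
    static void Main()
    {
        // HtmlWeb is HAP's built-in HTTP client: it downloads the page
        // and returns an already-parsed HtmlDocument in a single call.
        var web = new HtmlWeb();
        HtmlDocument document = web.Load("https://scrapeme.live/shop/Pikachu/");

        // Query the parsed DOM with XPath, no separate HttpClient needed.
        // The //h1 selector is a hypothetical example element.
        HtmlNode title = document.DocumentNode.SelectSingleNode("//h1");
        Console.WriteLine(title?.InnerText);
    }
}
```

This replaces the HttpClient plus LoadHtml two-step shown in the base scraper with one call, at the cost of less control over the HTTP request itself.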

πŸ‘ Pros:

  • Supports HTTP requests.
  • Tolerates malformed HTML.
  • It is actively maintained, with over 32 GitHub contributors.
  • Well-structured documentation with online examples.
  • Integrates with Fizzler to add CSS selector functionality.

πŸ‘Ž Cons:

  • Natively supports only XPath or XSLT.
  • Can get slow when parsing large and complex HTML.
  • May not have the same level of support as other libraries.
  • Can consume significant memory because it creates an in-memory documentation of the entire document it's parsing.

βš™οΈ Features:

  • XPath support.
  • HTML manipulation.
  • HTML cleaning.
  • Encoding and decoding HTML entities.
  • HTML traversing.

πŸ‘¨β€πŸ’» Example: The example below shows how to parse HTML content using HTML Agility Pack.

It starts by creating an HtmlDocument instance and loading it with the HTML content retrieved by the scraper. Then, it uses the SelectSingleNode method to extract the stock amount with XPath.

scraper.cs
using HtmlAgilityPack;
 
class Scraper
{
    static async Task Main()
    {
        //..
        {
            //...
 
            HtmlDocument htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(htmlContent);
 
            HtmlNode stockElement = htmlDocument.DocumentNode.SelectSingleNode("//*[@id='product-752']/div[2]/p[2]");
            string stockAmount = stockElement.InnerText;
 
            Console.WriteLine("Stock Amount: " + stockAmount);
        }
 
    }
}

2. AngleSharp: The Ultimate Parsing Toolkit

AngleSharp homepage

AngleSharp is a .NET library for parsing HTML and any hypertext that uses angle brackets (<>) to define and structure content, including SVG and MathML. It uses a fully implemented parser that goes through the given HTML source and builds a DOM, which you can then manipulate using HTML tag selectors.

Additionally, this DOM uses the official W3C-specified API so that even advanced features like querySelectorAll are readily accessible within AngleSharp. Moreover, its parsing engine aligns with the HTML 5.1 specification, which governs how modern HTML documents are processed, including error handling and element correction.

πŸ‘ Pros:

  • Extensive documentation.
  • Compliance and standard-driven documentation.
  • Performs better with large or complex HTML than other C# libraries.Β 
  • Allows you to handle DOM events in your code.
  • Can be used on many platforms, including .NET (Core / FX), Unity, and Xamarin.

πŸ‘Ž Cons:

  • Requires extension libraries for XPath and CSS selectors support.Β 
  • Its advanced features can imply a steep learning curve.Β 

βš™οΈ Features:

  • Allows you to query HTML using LINQΒ 
  • Exception handling.
  • Supports external integrations.
  • Fully functional DOM.Β 
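Because the DOM AngleSharp builds is a regular .NET object graph, you can combine QuerySelectorAll with LINQ. Here is a minimal, self-contained sketch; the inline HTML snippet and the in-stock filter are invented for illustration:

```csharp
using System;
using System.Linq;
using AngleSharp.Html.Parser;

class AngleSharpLinqExample
{
    static void Main()
    {
        // Hypothetical markup mimicking a product listing
        var html = @"<ul>
            <li class='product'>Pikachu <span class='stock'>45 in stock</span></li>
            <li class='product'>Bulbasaur <span class='stock'>0 in stock</span></li>
        </ul>";

        // Parse the string directly, without needing a BrowsingContext
        var document = new HtmlParser().ParseDocument(html);

        // LINQ over the element collection: keep only products still in stock
        var inStock = document.QuerySelectorAll("li.product")
            .Where(li => !li.QuerySelector(".stock").TextContent.StartsWith("0"))
            .Select(li => li.TextContent.Trim());

        // Prints only the product that has stock remaining
        foreach (var product in inStock)
            Console.WriteLine(product);
    }
}
```

The same Where/Select pipeline works on any AngleSharp node collection, which is what the "query HTML using LINQ" feature above refers to.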

πŸ‘¨β€πŸ’» Example: The following code loads the HTML content into AngleSharp, creates a BrowsingContext, and extracts the stock amount using the QuerySelector method. An intuitive C# HTML parser.

scraper.cs
using AngleSharp;
using AngleSharp.Dom;
 
class Scraper
{
    static async Task Main()
    {
        //..
        {
            //...
 
            var config = Configuration.Default.WithDefaultLoader();
            var context = BrowsingContext.New(config);
            var document = await context.OpenAsync(req => req.Content(htmlContent));
 
            var stockElement = document.QuerySelector(".stock"); // Replace with the actual CSS selector for the stock element
            string stockAmount = stockElement.TextContent;
 
            Console.WriteLine("Stock Amount: " + stockAmount);
        }
 
    }
}

3. Fizzler: Streamlining HTML Parsing with CSS Selectors

Fizzler homepage

Fizzler is a .NET CSS selector engine based on HTML Agility Pack. It's mainly used to extend the capabilities of parsing libraries like CSQuery, AngleSharp, and HAP.

Although this C# HTML parser is still maintained, its last significant update was a while ago. That's a sign it's stable but unlikely to see major new features.

πŸ‘ Pros:

  • Easy-to-understand syntax for querying HTML documents using CSS selectors.
  • Integrates with other C# HTML parsers.

πŸ‘Ž Cons:

  • Limited documentation.
  • Poor overall support.
  • Using Fizzler with other HTML parsers can introduce some overhead.Β 
  • Possible compatibility issues with other libraries as they're updated independently.Β 

βš™οΈ Features:

  • CSS selector support.
  • HTML manipulation.
  • HTML traversing.
  • Extensibility.

πŸ‘¨β€πŸ’» Example: The code above starts by loading the HTML document into HTML Agility Pack. Then, it uses Fizzler to select the stock amount using the QuerySelector method.

scraper.cs
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;
 
class Scraper
{
    static async Task Main()
    {
        //..
        {
            //...
 
            // Load the HTML content into HtmlAgilityPack
            HtmlDocument htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(htmlContent);
 
            // Use Fizzler to select the stock element using a CSS selector
            var stockElement = htmlDocument.DocumentNode.QuerySelector(".stock");
                
            string stockAmount = stockElement.InnerText;
            Console.WriteLine("Stock Amount: " + stockAmount);
        }
 
    }
}

4. CSQuery: Parsing with jQuery-like Syntax

CSQuery homepage

CSQuery is a C# HTML parser designed to bring jQuery-like features to .NET for HTML document manipulation, allowing you to navigate and select elements as if you were using jQuery on the client side.

It's relatively fast, easy to use, and ensures standards-compliant HTML parsing by employing a C# port of the validator.nu HTML parser. However, while the current release on NuGet is stable, it's not being actively maintained. That said, active forks and community contributions keep the project relevant.

πŸ‘ Pros:

  • Easy to use.
  • jQuery-like syntax for querying HTML documents using CSS selectors.
  • Standards compliant.
  • Relatively faster performance of selectors.
  • Contains the entire suite from Sizzle (the jQuery CSS selector engine) and jQuery (1.6.2).
  • Implements all CSS2/CSS3 selectors and filters.

πŸ‘Ž Cons:

  • Limited documentation.
  • Not actively maintained.
  • Not as widely used and popular as other C# libraries.
  • Poor overall support.
  • Using Fizzler with other HTML parsers can introduce some overhead.Β 
  • Possible compatibility issues with other libraries as they're updated independently

βš™οΈ Features:

  • Standard-compliant HTML parsing.
  • CSS2 and CSS3 selectors.
  • All jQuery DOM manipulation methods.

πŸ‘¨β€πŸ’» Example: The following code loads the HTML content into CsQuery using CQ.Create(htmlContent). Then, it uses the .Find method to locate the element with the stock amount and extract it using a CSS selector.

scraper.cs
using CsQuery;
 
class Scraper
{
    static async Task Main()
    {
        //..
        {
            //...
 
            // Load the HTML content into CsQuery
            CQ dom = CQ.Create(htmlContent);
 
            // Find the stock amount element using a CSS selector
            CQ stockElement = dom.Find(".stock");
 
            // Extract the stock amount text
            string stockAmount = stockElement.Text();
            Console.WriteLine("Stock Amount: " + stockAmount);   
        }
 
    }
}

5. Selenium WebDriver: The HTML Parsing Maestro

Selenium WebDriver homepage

Unlike most C# HTML parsers on this list, Selenium WebDriver is primarily a browser automation tool. However, it also brings powerful HTML parsing capabilities to C# developers. That, combined with its ability to simulate natural user behaviour, makes it a popular web scraping tool, especially for JavaScript-driven websites.

Also, it supports both CSS selectors and XPath for locating and interacting with HTML elements. This flexibility allows you to choose the best method that meets your project needs. You can check out our comparison of XPath vs CSS selectors.

However, Selenium WebDriver may be too resource-intensive as an HTML parser. Thus, if its headless browser functionality isn't a requirement for your project, it's more efficient to use parsers like AngleSharp and HTML Agility Pack.

πŸ‘ Pros:

  • JavaScript rendering.
  • Supports CSS selectors and XPath.
  • Large and active developer community.
  • Extensive documentation.
  • Supports HTTP requests.

πŸ‘Ž Cons:

  • Resource-intensive.
  • Slow script execution.
  • Difficult to scale.
  • Requires additional setup.

βš™οΈ Features:

  • CSS selectors.
  • XPath.
  • Headless browser functionality.
  • Element validation.
  • Waits and timeouts

πŸ‘¨β€πŸ’» Example: Here, Selenium WebDriver (in this case, the ChromeDriver) navigates to the URL and locates the stock amount element using XPath.

scraper.cs
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
 
class Scraper
{
    static void Main()
    {
        string url = "https://scrapeme.live/shop/Pikachu/";
 
        // Set up Chrome WebDriver
        IWebDriver driver = new ChromeDriver();
 
        // Navigate to the URL
        driver.Navigate().GoToUrl(url);
 
        // Find the stock amount element using XPath
        IWebElement stockElement = driver.FindElement(By.XPath("//*[@id='product-752']/div[2]/p[2]"));
 
        // Extract the stock amount text
        string stockAmount = stockElement.Text;
        Console.WriteLine("Stock Amount: " + stockAmount);
 
        // Close the WebDriver
        driver.Quit();
    }
}

6. MSHTML: The Built-in Engine

MSHTML is Microsoft's built-in engine for HTML parsing. This component is part of Windows and is used by various Microsoft applications to render HTML.

It's lightweight and easy to implement. It also lets you use DOM methods familiar from JavaScript, like `getElementById()`. However, MSHTML has some limitations when handling large or complex HTML content.

πŸ‘ Pros:

  • Supports JavaScript functions for selecting HTML elements.
  • Event handling.
  • Legacy compatibility.

πŸ‘Ž Cons:

  • Limited documentation.
  • Complex API.
  • Dependency on legacy code.
  • Can't efficiently handle large or complex HTML content.

βš™οΈ Features:

  • CSS selectors.
  • JavaScript rendering.
  • Rich-text editing.
  • Asynchronous loading.

πŸ‘¨β€πŸ’» Example: Below is an example showing how to parse and extract HTML content using MSHTML.

It creates an HTMLDocument instance and casts it to IHTMLDocument2, then writes the retrieved HTML content into the newly created document. Next, it iterates over the document's elements and extracts the stock amount from the one whose class is stock.

scraper.cs
using mshtml;
 
class Scraper
{
    static async Task Main()
    {
        //..
        {
            //...
 
            // Create an HTMLDocument and cast it to IHTMLDocument2
            var htmlDoc = new HTMLDocument();
            var ihtmlDoc = (IHTMLDocument2)htmlDoc;
 
            // Open the document and write the HTML content
            ihtmlDoc.open();
            ihtmlDoc.write(htmlContent);
            ihtmlDoc.close();
 
            // Access the document's DOM
            var allElements = ihtmlDoc.all;
 
            // Find the element with class "stock"
            foreach (IHTMLElement element in allElements)
            {
                if (element.className == "stock")
                {
                    // Extract the stock amount text
                    string stockAmount = element.innerText;
                    Console.WriteLine("Stock Amount: " + stockAmount);
                }
            }
        }
 
    }
}

7. Majestic-12: The Parsing Superhero

Majestic-12 homepage

While Majestic-12 is primarily known for its web crawling and SEO-related tools and services, it also offers a C# HTML parser as one of its projects. This tool is an open-source .NET module designed for parsing HTML for links, indexing, and other purposes.

According to its documentation, it processes over 3TB of HTML daily, so it's proven against large HTML content. It's also known for delivering a fast parsing experience. However, its API can get a bit clunky to work with.

πŸ‘ Pros:

  • Handles large or complex HTML.
  • High-performance parsing.

πŸ‘Ž Cons:

  • Limited documentation.
  • Complex API.
  • Dependency on legacy code.Β 

βš™οΈ Features:

  • HTML chunks.
  • Encoding.
  • Thread safety.

πŸ‘¨β€πŸ’» Example: The example below initializes the Majestic-12 parser, loads the HTML content, and then iterates through the HTML chunks to find and extract the stock amount element by its class.Β 

scraper.cs
using Majestic12;
 
class Scraper
{
    static async Task Main()
    {
        //..
        {
            //...
 
            // Create an HTML parser instance
            HTMLparser oP = new HTMLparser();
 
            // Load HTML content
            oP.Init(htmlContent);
 
            // Find the stock amount element by its class
            HTMLchunk oChunk = null;
            while ((oChunk = oP.ParseNext()) != null)
            {
                if (oChunk.oType == HTMLchunkType.Text)
                {
                    if (oChunk.oHTML.Contains("in stock"))
                    {
                        string stockAmount = oChunk.oHTML;
                        Console.WriteLine("Stock Amount: " + stockAmount);
                        break;
                    }
                }
 
            }
 
        }
 
    }
}

Benchmark: Which Is Faster?

We've run a benchmark of the above C# HTML parsers to provide insights into their speed and efficiency.

Below are the results for two HTML parsing scenarios: text extraction first and table data next.

| Library | Text Extraction (μs) | Table Data Extraction (μs) | Combined Mean (μs) |
| --- | --- | --- | --- |
| Majestic-12 | 211.6 | 176.80 | 194.20 |
| HtmlAgilityPack | 817.7 | 11.16 | 414.43 |
| Fizzler | 819.0 | 14.81 | 416.91 |
| AngleSharp | 932.0 | 31.45 | 481.73 |
| CSQuery | 1,000.4 | 41.48 | 520.94 |
| Selenium WebDriver | 2,304.6 | 62.46 | 1,183.53 |
| MSHTML | 1,923,274.8 | 98,706.92 | 961,106.86 |

Let's see the overall results in a graph, from best to worst performer:


As expected, Majestic-12 was the fastest but the least comfortable to code with. HTML Agility Pack (HAP) and Fizzler came next, slightly ahead of AngleSharp and CSQuery. Meanwhile, Selenium WebDriver's speed was poor, and MSHTML fell so far behind that it couldn't be included in the graph.

To benchmark these libraries, we used BenchmarkDotNet, a project that transforms methods into benchmarks.

The measurements were made on an AMD Ryzen 9 6900HX with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores, and .NET SDK 8.0.100-rc.2.23502.2. However, we expect the relative performance to be similar on other machine configurations.
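To give a flavor of the setup, here is a minimal BenchmarkDotNet harness; the method name and the inline HTML sample are our own illustration, not the original benchmark code:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using HtmlAgilityPack;

public class ParserBenchmarks
{
    // Hypothetical sample document; the real benchmark parsed full pages
    private const string Html = "<div><p class='stock'>45 in stock</p></div>";

    // BenchmarkDotNet runs each [Benchmark] method many times and
    // reports the mean execution time, like the table above.
    [Benchmark]
    public string HtmlAgilityPackExtract()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        return doc.DocumentNode.SelectSingleNode("//p[@class='stock']").InnerText;
    }
}

class Program
{
    static void Main() => BenchmarkRunner.Run<ParserBenchmarks>();
}
```

Adding one [Benchmark] method per library to a class like this is all it takes to reproduce a comparison of this shape.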

Conclusion

We've explored powerful C# HTML parsers that enable you to access and manipulate HTML content efficiently. While AngleSharp is the most complete one when it comes to features, HTML Agility Pack offered the best balance of speed and ease of use in our benchmark.

Bear in mind that retrieving the parseable HTML requires web scraping, which can be challenging since getting blocked by websites is a common issue. To mitigate that, you can check out our tips on web scraping without getting blocked.

Frequently Asked Questions

How to Parse HTML Code in C#?

Parsing HTML code in C# typically involves loading raw HTML content and building a DOM (Document Object Model) that you can access, navigate, and manipulate to retrieve your desired data. Various C# libraries, including HTML Agility Pack and AngleSharp, exist to handle this.
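A minimal sketch of that load-then-query flow with HTML Agility Pack (the inline HTML string is just an illustration):

```csharp
using System;
using HtmlAgilityPack;

class MinimalParseExample
{
    static void Main()
    {
        // Load raw HTML into a DOM you can navigate
        var document = new HtmlDocument();
        document.LoadHtml("<html><body><h1>Hello, parser!</h1></body></html>");

        // Query the DOM with XPath to retrieve the desired data
        HtmlNode heading = document.DocumentNode.SelectSingleNode("//h1");
        Console.WriteLine(heading.InnerText); // prints "Hello, parser!"
    }
}
```

In a real scraper, the hardcoded string would be the HTML you downloaded with an HTTP client, as shown in the base scraper at the top of this article.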

Can C# Be Used with HTML?

Yes, C# can be used with HTML. It's a versatile programming language used for various web-related tasks, including web development, web scraping, web automation, and more. Regarding manipulating or interacting with HTML content, libraries like AngleSharp, HtmlAgilityPack, and Fizzler provide powerful tools to work with HTML documents.


