When you develop a web scraper, you aim to extract specific data from a web page. For that, you need an HTML parser in C#. We'll explore and compare the most popular ones to help you make an informed decision.
What Is the Best HTML Parser for C#?
Here's a quick comparison table of the most popular C# HTML parsers:
Library | Popularity | Ease of Use | Speed |
---|---|---|---|
HTML Agility Pack | High | User-friendly | Moderate |
AngleSharp | High | User-friendly | Fast |
Fizzler | Moderate | Moderate | Moderate |
CSQuery | Moderate | Moderate | Moderate |
Selenium WebDriver | High | Requires additional set-up | Slow |
MSHTML | Low | Steep learning curve | Slow with large HTML content |
Majestic-12 | Low | Steep learning curve | Fast |
Let's look at each one in detail, using a common scraper as the starting point. The code below sends a GET request to the target URL and retrieves the page's raw HTML, which each parser will then process to extract specific data.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class Scraper
{
    static async Task Main()
    {
        string url = "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/";
        // Fetch the raw HTML of the target page
        using (HttpClient client = new HttpClient())
        {
            HttpResponseMessage response = await client.GetAsync(url);
            string htmlContent = await response.Content.ReadAsStringAsync();
            Console.WriteLine(htmlContent);
        }
    }
}
1. HTML Agility Pack: The Swiss Army Knife for C# HTML Parsing
HTML Agility Pack (HAP) is a popular open-source C# parser that provides a flexible and easy-to-use API for navigating, manipulating, and extracting data from the DOM (Document Object Model). One of its major attributes is its ability to handle malformed HTML. That's particularly advantageous for web pages that may not adhere to strict HTML standards, ensuring you can extract data in any case.
Unlike most libraries on this list, HTML Agility Pack also acts as an HTTP client, which allows you to retrieve the HTML source using the same library. All this and more makes it one of the best C# HTML parsers.
Pros:
- Supports HTTP requests.
- Tolerates malformed HTML.
- It is actively maintained, with over 32 GitHub contributors.
- Well-structured documentation with online examples.
- Integrates with Fizzler to add CSS selector functionality.
Cons:
- Natively supports only XPath or XSLT.
- Can get slow when parsing large and complex HTML.
- May not have the same level of support as other libraries.
- Can consume significant memory because it builds an in-memory representation of the entire document it's parsing.
Features:
- XPath support.
- HTML manipulation.
- HTML cleaning.
- Encoding and decoding HTML entities.
- HTML traversing.
Example: The example below shows how to parse HTML content using HTML Agility Pack. It starts by creating an `HtmlDocument` instance and loading it with the HTML content retrieved by the scraper. Then, it uses the `SelectSingleNode` method to extract the stock amount with XPath.
using HtmlAgilityPack;

class Scraper
{
    static async Task Main()
    {
        //..
        {
            //...
            // Load the fetched HTML into an HtmlDocument
            HtmlDocument htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(htmlContent);
            // Select the stock element with XPath
            HtmlNode stockElement = htmlDocument.DocumentNode.SelectSingleNode("//*[@id='product-752']/div[2]/p[2]");
            string stockAmount = stockElement.InnerText;
            Console.WriteLine("Stock Amount: " + stockAmount);
        }
    }
}
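Because HAP also works as an HTTP client, the fetch and parse steps can be handled by the one library. The sketch below is a minimal illustration, assuming HAP's `HtmlWeb` class and the same product page and XPath used above. `HapWebScraper` and `ExtractStock` are hypothetical names introduced here for illustration, and the null check covers the case where the page layout no longer matches the XPath.

```csharp
using System;
using HtmlAgilityPack;

class HapWebScraper
{
    // Hypothetical helper: returns the stock text, or null if the XPath matches nothing
    public static string ExtractStock(HtmlDocument document)
    {
        HtmlNode stockElement = document.DocumentNode
            .SelectSingleNode("//*[@id='product-752']/div[2]/p[2]");
        return stockElement?.InnerText.Trim();
    }

    static void Main()
    {
        string url = "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/";

        // HtmlWeb downloads the page and parses it in a single call
        HtmlWeb web = new HtmlWeb();
        HtmlDocument document = web.Load(url);

        string stockAmount = ExtractStock(document);
        Console.WriteLine(stockAmount != null
            ? "Stock Amount: " + stockAmount
            : "Stock element not found");
    }
}
```

Keeping the extraction in its own method means the same logic can be reused on HTML loaded with `LoadHtml`, which is convenient when a separate HTTP client is preferred.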
2. AngleSharp: The Ultimate Parsing Toolkit
AngleSharp is a .NET library for parsing HTML and any other hypertext that uses angle brackets (<>) to define and structure content, including SVG and MathML. It uses a fully implemented parser, which goes through the given HTML source and builds a DOM that can be queried and manipulated using selectors.
Additionally, this DOM uses the official W3C-specified API so that even advanced features like querySelectorAll are readily accessible within AngleSharp. Moreover, its parsing engine aligns with the HTML 5.1 specification, which governs how modern HTML documents are processed, including error handling and element correction.
Pros:
- Extensive documentation.
- Standards-compliant, specification-driven design.
- Performs better with large or complex HTML than other C# libraries.
- Allows you to handle DOM events in your code.
- Can be used on many platforms, including .NET (Core / FX), Unity, and Xamarin.
Cons:
- Requires an extension library (AngleSharp.XPath) for XPath support; CSS selectors work out of the box.
- Its advanced features can mean a steep learning curve.
Features:
- Allows you to query HTML using LINQ.
- Exception handling.
- Supports external integrations.
- Fully functional DOM.
Example:
The following code loads the HTML content into AngleSharp through a `BrowsingContext`, then extracts the stock amount using the `QuerySelector` method. The result is an intuitive C# HTML parser.
using AngleSharp;
using AngleSharp.Dom;

class Scraper
{
    static async Task Main()
    {
        //..
        {
            //...
            var config = Configuration.Default.WithDefaultLoader();
            var context = BrowsingContext.New(config);
            var document = await context.OpenAsync(req => req.Content(htmlContent));
            var stockElement = document.QuerySelector(".stock"); // Replace with the actual CSS selector for the stock element
            string stockAmount = stockElement.TextContent;
            Console.WriteLine("Stock Amount: " + stockAmount);
        }
    }
}
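The LINQ support listed among the features can be sketched as follows. This is a minimal illustration rather than a recommended pattern: it assumes AngleSharp's `HtmlParser` class and the `All` collection on the parsed document, and the inline HTML string, the `LinqQueryDemo` class, and the `FindStockTexts` method are hypothetical stand-ins for the fetched product page and your own code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Parser;

class LinqQueryDemo
{
    // Hypothetical helper: collects the text of every <p> whose class list includes "stock"
    public static List<string> FindStockTexts(string html)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);

        // document.All enumerates every element, so standard LINQ operators apply
        return document.All
            .Where(e => e.LocalName == "p" && e.ClassList.Contains("stock"))
            .Select(e => e.TextContent.Trim())
            .ToList();
    }

    static void Main()
    {
        // Hypothetical markup standing in for the fetched page
        string html = "<div><p class=\"price\">$69.00</p><p class=\"stock in-stock\">5 in stock</p></div>";

        foreach (string text in FindStockTexts(html))
        {
            Console.WriteLine("Stock Amount: " + text);
        }
    }
}
```

Filtering on `ClassList` rather than comparing the raw class attribute keeps the match working when an element carries several classes.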
3. Fizzler: Streamlining HTML Parsing with CSS Selectors
Fizzler is a .NET CSS selector engine based on HTML Agility Pack. It's mainly used to extend the capabilities of parsing libraries like CSQuery, AngleSharp, and HAP.
Although this C# HTML parser is still actively maintained, its last significant update was a while ago. That's a sign that it's stable but might have little room for improvement.
Pros:
- Easy-to-understand syntax for querying HTML documents using CSS selectors.
- Integrates with other C# HTML parsers.
Cons:
- Limited documentation.
- Poor overall support.
- Using Fizzler with other HTML parsers can introduce some overhead.
- Possible compatibility issues with other libraries as they're updated independently.
Features:
- CSS selector support.
- HTML manipulation.
- HTML traversing.
- Extensibility.
Example:
The code below starts by loading the HTML document into HTML Agility Pack. Then, it uses Fizzler to select the stock amount using the `QuerySelector` method.
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

class Scraper
{
    static async Task Main()
    {
        //..
        {
            //...
            // Load the HTML content into HtmlAgilityPack
            HtmlDocument htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(htmlContent);
            // Use Fizzler to select the stock element using a CSS selector
            var stockElement = htmlDocument.DocumentNode.QuerySelector(".stock");
            string stockAmount = stockElement.InnerText;
            Console.WriteLine("Stock Amount: " + stockAmount);
        }
    }
}
4. CSQuery: Parsing with jQuery-like Syntax
CSQuery is a C# HTML parser designed to provide jQuery-like features to .NET for HTML document manipulation, allowing you to navigate and select elements as you would with jQuery on the client side.
It's relatively fast, easy to use, and ensures standards-compliant HTML parsing by employing a C# port of the validator.nu HTML parser. However, while the current release on NuGet is stable, it's not being actively maintained. That said, there are active forks and community contributions that keep the project relevant.
Pros:
- Easy to use.
- jQuery-like syntax for querying HTML documents using CSS selectors.
- Standards compliant.
- Relatively fast selector performance.
- Contains the entire suite from Sizzle (the jQuery CSS selector engine) and jQuery (1.6.2).
- Implements all CSS2/CSS3 selectors and filters.
Cons:
- Limited documentation.
- Not actively maintained.
- Not as widely used and popular as other C# libraries.
- Poor overall support.
Features:
- Standard-compliant HTML parsing.
- CSS2 and CSS3 selectors.
- All jQuery DOM manipulation methods.
Example:
The following code loads the HTML content into CsQuery using `CQ.Create(htmlContent)`. Then, it uses the `.Find` method with a CSS selector to locate the element holding the stock amount and extracts its text.
using CsQuery;

class Scraper
{
    static async Task Main()
    {
        //..
        {
            //...
            // Load the HTML content into CsQuery
            CQ dom = CQ.Create(htmlContent);
            // Find the stock amount element using a CSS selector
            CQ stockElement = dom.Find(".stock");
            // Extract the stock amount text
            string stockAmount = stockElement.Text();
            Console.WriteLine("Stock Amount: " + stockAmount);
        }
    }
}
5. Selenium WebDriver: HTML Parsing Maestro
Unlike most C# HTML parsers on this list, Selenium WebDriver is primarily a browser automation tool. However, it also brings powerful HTML parsing capabilities to C# developers. That, combined with its ability to simulate natural user behaviour, makes it a popular web scraping tool, especially for JavaScript-driven websites.
Also, it supports both CSS selectors and XPath for locating and interacting with HTML elements. This flexibility allows you to choose the best method that meets your project needs. You can check out our comparison of XPath vs CSS selectors.
However, Selenium WebDriver may be too resource-intensive as an HTML parser. Thus, if its headless browser functionality isn't a requirement for your project, it's more efficient to use parsers like AngleSharp and HTML Agility Pack.
Pros:
- JavaScript rendering.
- Supports CSS selectors and XPath.
- Large and active developer community.
- Extensive documentation.
- Supports HTTP requests.
Cons:
- Resource-intensive.
- Slow script execution.
- Difficult to scale.
- Requires additional setup.
Features:
- CSS selectors.
- XPath.
- Headless browser functionality.
- Element validation.
- Waits and timeouts.
Example: Here, Selenium WebDriver (in this case, ChromeDriver) navigates to the URL and locates the stock amount element using XPath. Ensure you install the `Selenium.WebDriver` and `Selenium.WebDriver.ChromeDriver` NuGet packages.
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class Scraper
{
    static void Main()
    {
        string url = "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/";
        // Set up Chrome WebDriver
        IWebDriver driver = new ChromeDriver();
        // Navigate to the URL
        driver.Navigate().GoToUrl(url);
        // Find the stock amount element using XPath (single quotes inside the C# string)
        IWebElement stockElement = driver.FindElement(By.XPath("//*[@id='product-752']/div[2]/p[2]"));
        // Extract the stock amount text
        string stockAmount = stockElement.Text;
        Console.WriteLine("Stock Amount: " + stockAmount);
        // Close the WebDriver
        driver.Quit();
    }
}
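Since Selenium supports CSS selectors as well as XPath, and headless mode and waits are among its listed features, the same lookup can be sketched with those options. This is a hedged illustration, not a tuned configuration: the `--headless=new` flag assumes a recent Chrome version, the 5-second implicit wait is an arbitrary value, the `.stock` selector is the one used in the other examples, and `HeadlessScraper` and `BuildOptions` are hypothetical names introduced here.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class HeadlessScraper
{
    // Hypothetical helper: builds Chrome options for running without a visible window
    public static ChromeOptions BuildOptions()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless=new"); // assumes a recent Chrome release
        return options;
    }

    static void Main()
    {
        string url = "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/";
        IWebDriver driver = new ChromeDriver(BuildOptions());
        try
        {
            // Retry element lookups for up to 5 seconds (arbitrary value) before failing
            driver.Manage().Timeouts().ImplicitWait = TimeSpan.FromSeconds(5);
            driver.Navigate().GoToUrl(url);

            // Locate the stock element with a CSS selector instead of XPath
            IWebElement stockElement = driver.FindElement(By.CssSelector(".stock"));
            Console.WriteLine("Stock Amount: " + stockElement.Text);
        }
        finally
        {
            // Always shut the browser down, even if the lookup throws
            driver.Quit();
        }
    }
}
```

Wrapping the work in `try`/`finally` avoids leaving browser processes behind when an element is not found.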
6. MSHTML: The Built-in Engine
MSHTML is a built-in Microsoft engine for HTML parsing. This component is a part of Windows and is used by various Microsoft applications to render HTML.
It's lightweight and easy to implement. Also, it allows you to use JavaScript functions/selectors, like `getElementById()`. However, MSHTML has some limitations when handling large or complex HTML content.
Pros:
- Supports JavaScript functions for selecting HTML elements.
- Event handling.
- Legacy compatibility.
Cons:
- Limited documentation.
- Complex API.
- Dependency on legacy code.
- Can't efficiently handle large or complex HTML content.
Features:
- CSS selectors.
- JavaScript rendering.
- Rich-text editing.
- Asynchronous loading.
Example: Below is an example showing how to parse and extract HTML content using MSHTML. It creates an `HTMLDocument` instance and casts it to `IHTMLDocument2`, then writes the retrieved HTML content into the newly created document. Next, it iterates over all elements in the document and extracts the stock amount from the one whose class name is `stock`.
using mshtml;

class Scraper
{
    static async Task Main()
    {
        //..
        {
            //...
            // Create an HTMLDocument and cast it to IHTMLDocument2
            var htmlDoc = new HTMLDocument();
            var ihtmlDoc = (IHTMLDocument2)htmlDoc;
            // Open the document and write the HTML content
            ihtmlDoc.open();
            ihtmlDoc.write(htmlContent);
            ihtmlDoc.close();
            // Access the document's DOM
            var allElements = ihtmlDoc.all;
            // Find the element whose class attribute is exactly "stock"
            foreach (IHTMLElement element in allElements)
            {
                if (element.className == "stock")
                {
                    // Extract the stock amount text
                    string stockAmount = element.innerText;
                    Console.WriteLine("Stock Amount: " + stockAmount);
                }
            }
        }
    }
}
Ensure you have the `mshtml.dll` library or reference added to your project.
7. Majestic-12: Parsing Super Hero
While Majestic-12 is primarily known for its web crawling and SEO-related tools and services, it also offers a C# HTML parser as one of its projects. This tool is an open-source .NET module designed for parsing HTML for links, indexing, and other purposes.
According to its documentation, it processes over 3TB of HTML daily. Therefore, it's proven against large HTML content. Also, it is known for delivering a fast parsing experience. However, its handling can get a bit clunky.
Pros:
- Handles large or complex HTML.
- High-performance parsing.
Cons:
- Limited documentation.
- Complex API.
- Dependency on legacy code.
Features:
- HTML chunks.
- Encoding.
- Thread safety.
Example: The example below initializes the Majestic-12 parser, loads the HTML content, and then iterates through the HTML chunks to find the text chunk that contains the stock amount.
using Majestic12;

class Scraper
{
    static async Task Main()
    {
        //..
        {
            //...
            // Create an HTML parser instance
            HTMLparser oP = new HTMLparser();
            // Load HTML content
            oP.Init(htmlContent);
            // Walk the chunks until a text chunk containing "in stock" is found
            HTMLchunk oChunk = null;
            while ((oChunk = oP.ParseNext()) != null)
            {
                if (oChunk.oType == HTMLchunkType.Text)
                {
                    if (oChunk.oHTML.Contains("in stock"))
                    {
                        string stockAmount = oChunk.oHTML;
                        Console.WriteLine("Stock Amount: " + stockAmount);
                        break;
                    }
                }
            }
        }
    }
}
Benchmark: Which Is Faster?
We've run a benchmark of the above C# HTML parsers to provide insights into their speed and efficiency.
Below are the results for two HTML parsing scenarios: text extraction first and table data next.
Library | Text Extraction (µs) | Table Data Extraction (µs) | Combined Mean (µs) |
---|---|---|---|
Majestic-12 | 211.6 | 176.80 | 194.2 |
HtmlAgilityPack | 817.7 | 11.16 | 414.43 |
Fizzler | 819.0 | 14.81 | 416.91 |
AngleSharp | 932.0 | 31.45 | 481.73 |
CSQuery | 1,000.4 | 41.48 | 520.94 |
Selenium WebDriver | 2,304.6 | 62.46 | 1,183.53 |
MSHTML | 1,923,274.8 | 98,706.92 | 961,106.86 |
1 µs equals 1 microsecond (0.000001 seconds).
Let's see the overall results in a graph, from best to worst performer:
As expected, Majestic-12 was the fastest but the least comfortable to code with. HTML Agility Pack (HAP) and Fizzler came next, slightly ahead of AngleSharp and CSQuery. Meanwhile, Selenium WebDriver's speed was poor, and MSHTML fell so far behind that it couldn't be included in the graph.
To benchmark these libraries, we used BenchmarkDotNet, a project that transforms methods into benchmarks.
The measurements were made on an AMD Ryzen 9 6900HX with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores, and .NET SDK 8.0.100-rc.2.23502.2. Absolute timings will vary with hardware, but we expect the relative ranking to be similar on other machine configurations.
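For readers who want to run a similar measurement, here is a minimal sketch of how a method becomes a benchmark with BenchmarkDotNet. It is not the code behind the results above: the `ParserBenchmarks` class, the inline sample HTML, and the single HTML Agility Pack method are assumptions made for illustration, and a real comparison would add one `[Benchmark]` method per library and use the actual page content.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using HtmlAgilityPack;

public class ParserBenchmarks
{
    private string _html = "";

    // Runs once before the measurements; the markup is a hypothetical stand-in
    [GlobalSetup]
    public void Setup()
    {
        _html = "<html><body><p class=\"stock\">5 in stock</p></body></html>";
    }

    // Each method marked [Benchmark] is executed repeatedly and timed
    [Benchmark]
    public string HtmlAgilityPackText()
    {
        var document = new HtmlDocument();
        document.LoadHtml(_html);
        return document.DocumentNode.SelectSingleNode("//p[@class='stock']").InnerText;
    }
}

public class Program
{
    public static void Main()
    {
        // Build in Release mode; BenchmarkDotNet warns about non-optimized builds
        BenchmarkRunner.Run<ParserBenchmarks>();
    }
}
```

Returning the extracted value from the benchmark method helps keep the compiler from optimizing the parsing work away.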
Conclusion
We've explored powerful C# HTML parsers that enable you to access and manipulate HTML content efficiently. While AngleSharp is the most complete one when it comes to features, HTML Agility Pack struck the best balance between speed and feature availability.
Bear in mind that retrieving the HTML to parse requires web scraping, which can be challenging because getting blocked by websites is a common issue. To mitigate that, you can check out our tips on web scraping without getting blocked.
Frequent Questions
How to Parse HTML Code in C#?
Parsing HTML code in C# typically involves loading raw HTML content and building a DOM (Document Object Model) that you can access, navigate, and manipulate to retrieve your desired data. Various C# libraries, including HTML Agility Pack and AngleSharp, exist to handle this.
Can C# Be Used with HTML?
Yes, C# can be used with HTML. It's a versatile programming language used for various web-related tasks, including web development, web scraping, web automation, and more. Regarding manipulating or interacting with HTML content, libraries like AngleSharp, HtmlAgilityPack, and Fizzler provide powerful tools to work with HTML documents.