Web Scraping in C#: Complete Guide 2024

May 15, 2024 · 14 min read

More and more companies take advantage of data extracted from the web nowadays, and one of the most suitable programming languages for this purpose is C#. In this step-by-step tutorial, you'll see how to do web scraping in C# using libraries like Selenium and Html Agility Pack.

Let's get started!

Prerequisites

Set Up the Environment

Here are the prerequisites you need to meet to follow this C# scraping guide:

  • Visual Studio Code (or another IDE with C# support).
  • The .NET SDK (version 8 or later).

To save time, you can directly install the .NET Coding Pack. It includes Visual Studio Code with the essential .NET extensions and the .NET SDK. Otherwise, follow the links above to download the required tools.

You should now be all set to follow our web scraping C# tutorial.

However, let's first verify that you installed .NET correctly. Launch a PowerShell window, and run the command below.

Terminal
dotnet --list-sdks

This should print the version of the .NET SDK installed on your machine.

Output
8.0.205 [C:\Program Files\dotnet\sdk]

If you receive a 'dotnet' is not recognized as an internal or external command error, then something went wrong. Restart your machine and try again. If the command above returns the same error, you'll need to reinstall .NET.

Initialize a C# Project

Let's create a .NET console application in Visual Studio Code. In case of problems, consult the official guide.

First, create an empty folder called SimpleWebScraper for your C# project.

Terminal
mkdir SimpleWebScraper

Now, launch Visual Studio Code and select "File > Open Folder..." from the top menu.

VS Code open folder

Select SimpleWebScraper and wait for Visual Studio Code to open the folder. Then, reach the Terminal window by selecting "View > Terminal" from the main menu.

VS Code terminal launcher

In the Visual Studio Code terminal, launch the following command:

Terminal
dotnet new console --framework net8.0

This will initialize a .NET 8.0 console project. Specifically, it will create a .csproj project file and a Program.cs C# file.

Now, replace the content ofย Program.csย with the code below.

program.cs
namespace SimpleWebScraper 
{ 
	class Program 
	{ 
		static void Main(string[] args) 
		{ 
			Console.WriteLine("Hello, World!"); 
 
			// scraping logic... 
		} 
	} 
}

This is what a simple console script looks like in C#. Note that the Main() function will contain the C# data scraping logic.

Run the script by launching the command you see next:

Terminal
dotnet run

Which should print:

Output
"Hello, World!"

Great, your initial C# script works as expected!

You're about to learn the basics of web scraping in C#.


How to Scrape a Website in C#

We'll learn how to build a data scraper with C# by extracting data from ScrapingCourse.com, a demo site with real e-commerce features built for testing web scrapers. The C# spider will automatically visit every page of the site and extract the product data from each of them.

This is what the target website looks like:

ScrapingCourse.com Ecommerce homepage

Let's install some dependencies and start scraping data from the web.

Step 1: Install Html Agility Pack and Its CSS Selector Extension

Html Agility Pack (HAP) is a powerful open-source .NET library for parsing HTML documents. It offers a flexible API for web scraping, allowing you to download an HTML page and parse it. You can also select HTML elements and extract data from them.

Install Html Agility Pack through the NuGet HtmlAgilityPack package.

Terminal
dotnet add package HtmlAgilityPack

Although Html Agility Pack natively supports XPath and XSLT, these aren't the most popular approaches when it comes to selecting HTML elements from the DOM. Fortunately, there's the HtmlAgilityPack CSS Selector extension.

Install it via the NuGet HtmlAgilityPack.CssSelectors library.

Terminal
dotnet add package HtmlAgilityPack.CssSelectors

HAP will now be able to understand CSS selectors via extension methods.

Now, import Html Agility Pack in your C# web spider by adding the following line on top of your Program.cs file.

program.cs
using HtmlAgilityPack;

If Visual Studio Code doesn't report errors, then you're good to go.

Time to see how to use HAP for web scraping in C#!

Step 2: Load the Target Web Page

Start by initializing an Html Agility Pack object.

program.cs
var web = new HtmlWeb();

HtmlWeb gives you access to the web scraping capabilities offered by HAP.

Then, use HtmlWeb's Load() method to get the HTML from a URL.

program.cs
// loading the target web page 
var document = web.Load("https://www.scrapingcourse.com/ecommerce/");

Behind the scenes, HAP performs an HTTP GET request to download the web page and parse its HTML content. It raises an HtmlAgilityPack.HtmlWebException in case of error, and provides an HAP HtmlDocument object if everything works as expected.
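
For instance, you can guard the loading call explicitly. Here's a minimal sketch; the error handling is illustrative and not part of the tutorial's final script:

program.cs
// illustrative: catching HAP's loading errors explicitly
HtmlDocument document;
try
{
	document = web.Load("https://www.scrapingcourse.com/ecommerce/");
}
catch (HtmlWebException e)
{
	// log the failure and stop (or retry, depending on your needs)
	Console.Error.WriteLine($"Failed to load the page: {e.Message}");
	return;
}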

You're now ready to use HtmlDocument to extract data from HTML elements. But first, let's study the code of the target page to define an effective strategy for selecting HTML elements.

Step 3: Inspecting the Target Page

Explore the target web page to see how it's structured. We'll start with the target HTML nodes, which are the product elements. Right-click on one and access the browser DevTools by selecting the "Inspect" option:

scrapingcourse ecommerce homepage inspect first product li

Here, you can clearly see that a single li.product HTML element consists of the following four elements:

  • The product URL in an a.
  • The product image in an img.
  • The product name in an h2.
  • The product price in a .price span HTML element.

Inspect other product HTML elements, and you'll see they all share the same structure. What changes are the values stored in the underlying HTML elements. This means you can scrape them all programmatically.

Next, we'll learn how to scrape data from these product HTML elements with HAP in C#.

Step 4: Extract Data From HTML Elements

You need to define a custom C# class to help you store the scraped data. For this purpose, initialize a nested Product class inside the Program class as follows:

program.cs
public class Product 
{ 
	public string? Url { get; set; } 
	public string? Image { get; set; } 
	public string? Name { get; set; } 
	public string? Price { get; set; } 
}

This custom class contains the Url, Image, Name, and Price fields. These match what you're interested in scraping from every product.

Now, initialize a list of Product in your Main() function with the line below:

program.cs
var products = new List<Product>();

This will contain the scraped data stored in Product instances.

It's time to use HAP to extract the list of all li.product HTML elements from the DOM, like this:

program.cs
// selecting all HTML product elements from the current page 
var productHTMLElements = document.DocumentNode.QuerySelectorAll("li.product");

QuerySelectorAll() allows you to retrieve HTML nodes from the DOM with a CSS selector. Here, the method applies the li.product CSS selector to get all product elements. Specifically, QuerySelectorAll() returns a list of HAP HtmlNode objects.

Note that QuerySelectorAll() comes from the HAP CSS selector extension, so you won't find it in Html Agility Pack's original interface.
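
For reference, vanilla HAP would express the same selection with its native XPath API. Here's a quick sketch, where the XPath expression is an approximate equivalent of the CSS selector:

program.cs
// approximate XPath equivalent of the "li.product" CSS selector
// note: SelectNodes() returns null when nothing matches
var productHTMLElements = document.DocumentNode.SelectNodes("//li[contains(@class, 'product')]");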

Use a foreach loop to iterate over the list of HTML elements and scrape data from each product.

program.cs
// iterating over the list of product elements 
foreach (var productHTMLElement in productHTMLElements) 
{ 
	// scraping the interesting data from the current HTML element 
	var url = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("a").Attributes["href"].Value); 
	var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("img").Attributes["src"].Value); 
	var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("h2").InnerText); 
	var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".price").InnerText); 
	// instancing a new Product object 
	var product = new Product() { Url = url, Image = image, Name = name, Price = price }; 
	// adding the object containing the scraped data to the list 
	products.Add(product); 
}

Incredible! You just implemented C# web scraping logic!

The QuerySelector() method applies a CSS selector to the HtmlNode's child nodes to get just one: the first matching node.

Then, we select an HTML attribute from Attributes and extract its data with Value. Wrap each value with HtmlEntity.DeEntitize() to replace known HTML entities.

Again, note that QuerySelector() comes from the Html Agility Pack CSS Selector extension. You won't find that method in vanilla HAP.
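
If you're curious about what HtmlEntity.DeEntitize() actually does, here's a minimal illustration with a made-up input string:

example.cs
// DeEntitize() turns known HTML entities back into plain characters
var decoded = HtmlEntity.DeEntitize("Bikes &amp; Helmets &#8211; 20% off");
Console.WriteLine(decoded); // Bikes & Helmets – 20% off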

Awesome! Time to learn how to export the scraped data in an easy-to-read format, such as CSV.

Step 5: Export the Scraped Data to CSV

You can convert scraped data to CSV with native C# functions, but a library will make it easier.

CsvHelper is a fast, flexible, and reliable .NET library for reading and writing CSV files.

Install it by adding the NuGet CsvHelper package to your project's dependencies with:

Terminal
dotnet add package CsvHelper

Import it into your project by adding this line to the top of yourย Program.csย file:

program.cs
using CsvHelper;

Convert the scraped data to a CSV output file with CsvHelper as below:

program.cs
// initializing the CSV output file 
using (var writer = new StreamWriter("products.csv")) 
// initializing the CSV writer 
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture)) 
{ 
	// populating the CSV file 
	csv.WriteRecords(products); 
}

The snippet here initializes a products.csv file. Then, CsvHelper's WriteRecords() writes all the product records to that CSV file. Thanks to the C# using statements, the script will automatically free the resources associated with the writer objects.

Note that the CsvWriter constructor requires a CultureInfo parameter. This defines the formatting specs, including the delimiter and line-ending characters to use. InvariantCulture ensures that any software can parse the produced CSV regardless of the user's locale settings.

To use CultureInfo values, you need the following extra import:

program.cs
using System.Globalization;
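
As a side note, if you ever need a different CSV format, CsvHelper also accepts a CsvConfiguration object in place of the bare CultureInfo. A minimal sketch, assuming you want a semicolon delimiter:

example.cs
using CsvHelper.Configuration;

// customizing the CSV format; the ";" delimiter is just an example
var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
	Delimiter = ";"
};
using (var writer = new StreamWriter("products.csv"))
using (var csv = new CsvWriter(writer, config))
{
	csv.WriteRecords(products);
}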

Fantastic! All that remains is to launch the C# web scraper!

Step 6: Launch the Scraper

This is what the Program.cs C# data scraper implemented so far looks like:

program.cs
using HtmlAgilityPack; 
using CsvHelper; 
using System.Globalization; 
 
namespace SimpleWebScraper 
{ 
	public class Program 
	{ 
		// defining a custom class to store the scraped data 
		public class Product 
		{ 
			public string? Url { get; set; } 
			public string? Image { get; set; } 
			public string? Name { get; set; } 
			public string? Price { get; set; } 
		} 
 
		public static void Main() 
		{ 
			// creating the list that will keep the scraped data 
			var products = new List<Product>(); 
 
			// creating the HAP object 
			var web = new HtmlWeb(); 
 
			// visiting the target web page 
			var document = web.Load("https://www.scrapingcourse.com/ecommerce/"); 
 
			// getting the list of HTML product nodes 
			var productHTMLElements = document.DocumentNode.QuerySelectorAll("li.product"); 
			// iterating over the list of product HTML elements 
			foreach (var productHTMLElement in productHTMLElements) 
			{ 
				// scraping logic 
				var url = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("a").Attributes["href"].Value); 
				var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("img").Attributes["src"].Value); 
				var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("h2").InnerText); 
				var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".price").InnerText); 
 
				var product = new Product() { Url = url, Image = image, Name = name, Price = price }; 
				products.Add(product); 
			} 
 
			// creating the CSV output file 
			using (var writer = new StreamWriter("products.csv")) 
			using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture)) 
			{ 
				// populating the CSV file 
				csv.WriteRecords(products); 
			} 
		} 
	} 
}

Run the script with the command below:

Terminal
dotnet run

It might take a while to complete depending on the response time of the target page's server. When it's done, you'll find a products.csv file in the root folder of your C# project. Open it to explore the data below:

scrapingcourse ecommerce product output csv

Wow! In 50 lines of code, you built a fully functional C# data scraper!

Advanced Web Scraping in C#

Web scraping in C# is much more than the fundamentals you just saw. Now, you'll learn about more advanced techniques to help you become a C# scraping expert!

Web Crawling in .NET

Don't forget that ScrapingCourse.com shows a paginated list of products. To scrape all products, you need to visit the whole website, which is what web crawling is about.

To do web crawling in C#, you must follow all pagination links. Let's retrieve them all!

Inspect the pagination HTML element to understand how to extract the pages' URLs. Right-click on the number and select "Inspect":

scrapingcourse ecommerce homepage inspect

You should be able to see something like this in the browser DevTools:

scrapingcourse ecommerce homepage devtools

Here, note that all pagination HTML elements share the page-numbers CSS class. In detail, only the a nodes contain a URL, while the span elements are placeholders. So, you can select all pagination link elements with the a.page-numbers CSS selector.

To avoid scraping a page twice, you'll need a couple of extra data structures:

  • pagesDiscovered: A List to keep track of the URLs discovered by the crawler.
  • pagesToScrape: A Queue containing the list of pages the spider will scrape soon.

Also, a limit variable will prevent the C# spider from crawling pages forever.

program.cs
// the URL of the first pagination web page 
var firstPageToScrape = "https://www.scrapingcourse.com/ecommerce/page/1/"; 
 
// the list of pages discovered during the crawling task 
var pagesDiscovered = new List<string> { firstPageToScrape }; 
 
// the list of pages that remains to be scraped 
var pagesToScrape = new Queue<string>(); 
 
// initializing the list with firstPageToScrape 
pagesToScrape.Enqueue(firstPageToScrape); 
 
// current crawling iteration 
int i = 1; 
 
// the maximum number of pages to scrape before stopping 
int limit = 12; 
 
// until there are no pages to scrape or limit is hit 
while (pagesToScrape.Count != 0 && i <= limit) 
{ 
	// extracting the current page to scrape from the queue 
	var currentPage = pagesToScrape.Dequeue(); 
 
	// loading the page 
	var currentDocument = web.Load(currentPage); 
 
	// selecting the list of pagination HTML elements 
	var paginationHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("a.page-numbers"); 
 
	// to avoid visiting a page twice 
	foreach (var paginationHTMLElement in paginationHTMLElements) 
	{ 
		// extracting the current pagination URL 
		var newPaginationLink = paginationHTMLElement.Attributes["href"].Value; 
 
		// if the page discovered is new 
		if (!pagesDiscovered.Contains(newPaginationLink)) 
		{ 
			// if the page discovered needs to be scraped 
			if (!pagesToScrape.Contains(newPaginationLink)) 
			{ 
				pagesToScrape.Enqueue(newPaginationLink); 
			} 
			pagesDiscovered.Add(newPaginationLink); 
		} 
	} 
 
	// scraping logic... 
	 
	// incrementing the crawling counter 
	i++; 
}

The data crawler above does the following:

  1. Starts from the first page of the pagination list.
  2. Looks for new pagination URLs on the current page.
  3. Adds them to the scraping queue.
  4. Scrapes data from the current page.
  5. Repeats steps 2 to 4 for each page in the queue until the queue is empty or the limit of visited pages is reached.

Since ScrapingCourse.com consists of 12 pages, set limit to 12 to scrape data from all products. In this case, products.csv will have a record for each of the 188 products.

Here's the complete code:

scraper.cs
using HtmlAgilityPack; 
using System.Globalization; 
using CsvHelper;  
namespace SimpleWebScraper 
{ 
	public class Program 
	{ 
		// defining a custom class to store 
		// the scraped data 
		public class Product 
		{ 
			public string? Url { get; set; } 
			public string? Image { get; set; } 
			public string? Name { get; set; } 
			public string? Price { get; set; } 
		} 
		public static void Main() 
		{ 
			// initializing HAP 
			var web = new HtmlWeb(); 
			 
			// creating the list that will keep the scraped data 
			var products = new List<Product>(); 
			// the URL of the first pagination web page 
			var firstPageToScrape = "https://www.scrapingcourse.com/ecommerce/page/1/"; 
			// the list of pages discovered during the crawling task 
			var pagesDiscovered = new List<string> { firstPageToScrape }; 
			// the list of pages that remains to be scraped 
			var pagesToScrape = new Queue<string>(); 
			// initializing the list with firstPageToScrape 
			pagesToScrape.Enqueue(firstPageToScrape); 
			// current crawling iteration 
			int i = 1; 
			// the maximum number of pages to scrape before stopping 
			int limit = 12; 
			// until there are no pages left to scrape or the limit is hit 
			while (pagesToScrape.Count != 0 && i <= limit) 
			{ 
				// getting the current page to scrape from the queue 
				var currentPage = pagesToScrape.Dequeue(); 
				// loading the page 
				var currentDocument = web.Load(currentPage); 
				// selecting the list of pagination HTML elements 
				var paginationHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("a.page-numbers"); 
				// to avoid visiting a page twice 
				foreach (var paginationHTMLElement in paginationHTMLElements) 
				{ 
					// extracting the current pagination URL 
					var newPaginationLink = paginationHTMLElement.Attributes["href"].Value; 
					// if the page discovered is new 
					if (!pagesDiscovered.Contains(newPaginationLink)) 
					{ 
						// if the page discovered needs to be scraped 
						if (!pagesToScrape.Contains(newPaginationLink)) 
						{ 
							pagesToScrape.Enqueue(newPaginationLink); 
						} 
						pagesDiscovered.Add(newPaginationLink); 
					} 
				} 
				// getting the list of HTML product nodes 
				var productHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("li.product"); 
				// iterating over the list of product HTML elements 
				foreach (var productHTMLElement in productHTMLElements) 
				{ 
					// scraping logic 
					var url = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("a").Attributes["href"].Value); 
					var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("img").Attributes["src"].Value); 
					var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("h2").InnerText); 
					var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".price").InnerText); 
					var product = new Product() { Url = url, Image = image, Name = name, Price = price }; 
					products.Add(product); 
				} 
				// incrementing the crawling counter 
				i++; 
			} 
			// opening the CSV stream writer 
			using (var writer = new StreamWriter("products.csv")) 
			using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture)) 
			{ 
				// populating the CSV file 
				csv.WriteRecords(products); 
			} 
		} 
	} 
}

Way to go! You're now able to build a web scraping C# app that can scrape a complete website!

Avoid Being Blocked

Your data scraper in C# may fail due to the several anti-scraping mechanisms websites might adopt. There are many anti-scraping techniques your script should be ready for. Use ZenRows to easily get around them!

The most basic technique is to block HTTP requests based on the value of their headers. This generally happens when the requests use an invalid User-Agent value.

The User-Agent header identifies the application and operating system a request comes from. Typically, the accepted values refer to popular browsers and OSes. Scraping libraries tend to use placeholder User-Agent values that can easily expose your spider.

You can globally set a valid User-Agent in Html Agility Pack with the line below:

program.cs
// setting a global User-Agent header in HAP 
web.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36";
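
If you need to customize more than the User-Agent, HtmlWeb also exposes a PreRequest hook that runs before every request. The snippet below is an optional sketch (the Accept-Language value is an arbitrary example), not part of the final script:

program.cs
// optional: tweaking every outgoing request via HAP's PreRequest hook
web.PreRequest = request =>
{
	// the header value below is just an example
	request.Headers.Add("Accept-Language", "en-US,en;q=0.9");
	// returning true tells HAP to proceed with the request
	return true;
};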

The final code looks like this after adding the User-Agent:

scraper.cs
using HtmlAgilityPack; 
using System.Globalization; 
using CsvHelper;  
namespace SimpleWebScraper 
{ 
	public class Program 
	{ 
		// defining a custom class to store 
		// the scraped data 
		public class Product 
		{ 
			public string? Url { get; set; } 
			public string? Image { get; set; } 
			public string? Name { get; set; } 
			public string? Price { get; set; } 
		} 
		public static void Main() 
		{ 
			// initializing HAP 
			var web = new HtmlWeb(); 
			// setting a global User-Agent header 
			web.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"; 
			// creating the list that will keep the scraped data 
 
			var products = new List<Product>(); 
			// the URL of the first pagination web page 
			var firstPageToScrape = "https://www.scrapingcourse.com/ecommerce/page/1/"; 
			// the list of pages discovered during the crawling task 
			var pagesDiscovered = new List<string> { firstPageToScrape }; 
			// the list of pages that remains to be scraped 
			var pagesToScrape = new Queue<string>(); 
			// initializing the list with firstPageToScrape 
			pagesToScrape.Enqueue(firstPageToScrape); 
			// current crawling iteration 
			int i = 1; 
			// the maximum number of pages to scrape before stopping 
			int limit = 12; 
			// until there are no pages left to scrape or the limit is hit 
			while (pagesToScrape.Count != 0 && i <= limit) 
			{ 
				// getting the current page to scrape from the queue 
				var currentPage = pagesToScrape.Dequeue(); 
				// loading the page 
				var currentDocument = web.Load(currentPage); 
				// selecting the list of pagination HTML elements 
				var paginationHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("a.page-numbers"); 
				// to avoid visiting a page twice 
				foreach (var paginationHTMLElement in paginationHTMLElements) 
				{ 
					// extracting the current pagination URL 
					var newPaginationLink = paginationHTMLElement.Attributes["href"].Value; 
					// if the page discovered is new 
					if (!pagesDiscovered.Contains(newPaginationLink)) 
					{ 
						// if the page discovered needs to be scraped 
						if (!pagesToScrape.Contains(newPaginationLink)) 
						{ 
							pagesToScrape.Enqueue(newPaginationLink); 
						} 
						pagesDiscovered.Add(newPaginationLink); 
					} 
				} 
				// getting the list of HTML product nodes 
				var productHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("li.product"); 
				// iterating over the list of product HTML elements 
				foreach (var productHTMLElement in productHTMLElements) 
				{ 
					// scraping logic 
					var url = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("a").Attributes["href"].Value); 
					var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("img").Attributes["src"].Value); 
					var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("h2").InnerText); 
					var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".price").InnerText); 
					var product = new Product() { Url = url, Image = image, Name = name, Price = price }; 
					products.Add(product); 
				} 
				// incrementing the crawling counter 
				i++; 
			} 
			// opening the CSV stream writer 
			using (var writer = new StreamWriter("products.csv")) 
			using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture)) 
			{ 
				// populating the CSV file 
				csv.WriteRecords(products); 
			} 
		} 
	} 
}

Wonderful! Less than 100 lines of code are enough to build a web scraper in C#! Now, all HTTP requests performed by HAP will seem to come from Chrome 124.

Parallel Web Scraping in C#

The performance of web scraping with C# depends on the target web server's speed. Tackle this by making parallel requests and scraping several pages simultaneously. Avoiding dead time and taking the speed of your scraper to the next level is what parallel web scraping in C# is about!

Store the list of all pages your C# data crawler should visit in a ConcurrentBag:

program.cs
var pagesToScrape = new ConcurrentBag<string> { 
	"https://www.scrapingcourse.com/ecommerce/page/1/", 
	"https://www.scrapingcourse.com/ecommerce/page/2/", 
	"https://www.scrapingcourse.com/ecommerce/page/3/", 
	// ... 
	"https://www.scrapingcourse.com/ecommerce/page/11/", 
	"https://www.scrapingcourse.com/ecommerce/page/12/" 
};

In C#, List isn't thread-safe, so you shouldn't use it for parallel tasks. Replace it with ConcurrentBag, its unordered, thread-safe alternative.
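
Here's a tiny standalone illustration of the difference (the numbers are arbitrary):

example.cs
// ConcurrentBag accepts concurrent writes safely
var bag = new ConcurrentBag<int>();
Parallel.For(0, 1000, i => bag.Add(i));
// always prints 1000; a List<int> could lose items or throw here
Console.WriteLine(bag.Count);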

For the same reason, make products a ConcurrentBag:

program.cs
var products = new ConcurrentBag<Product>();

Let's perform parallel web scraping with C#! Use Parallel.ForEach() to run a foreach loop in parallel and scrape several pages at the same time:

program.cs
// the import required to use ConcurrentBag 
using System.Collections.Concurrent; 
 
// ... 
 
Parallel.ForEach( 
	pagesToScrape, 
	// limiting the parallelization level to 4 pages at a time 
	new ParallelOptions { MaxDegreeOfParallelism = 4 }, 
	currentPage => { 
		// visiting the current page of the loop 
		var currentDocument = web.Load(currentPage); 
	 
		// complete scraping logic 
		var productHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("li.product"); 
		foreach (var productHTMLElement in productHTMLElements) 
		{ 
			var url = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("a").Attributes["href"].Value); 
			var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("img").Attributes["src"].Value); 
			var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("h2").InnerText); 
			var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".price").InnerText); 
	 
			var product = new Product() { Url = url, Image = image, Name = name, Price = price }; 
	 
			// storing the scraped product data in parallel 
			products.Add(product); 
		} 
	} 
);

Great! Your web scraper in C# is now lightning-fast! But don't forget to limit the level of parallelization to avoid stressing the server. Your goal is to extract data from a website, not to perform a DoS attack.
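
As a rule of thumb (an assumption, not a hard rule), you can cap the parallelism at a polite constant or the machine's core count, whichever is lower:

program.cs
// heuristic: at most 4 concurrent requests, fewer on small machines
var parallelOptions = new ParallelOptions
{
	MaxDegreeOfParallelism = Math.Min(4, Environment.ProcessorCount)
};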

These snippets served as examples of how to achieve parallel scraping in C#. Take a look at the entire parallel C# data spider here:

program.cs
using HtmlAgilityPack; 
using CsvHelper; 
using System.Globalization; 
using System.Collections.Concurrent; 
 
namespace SimpleWebScraper 
{ 
	public class Program 
	{ 
		public class Product 
		{ 
			public string? Url { get; set; } 
			public string? Image { get; set; } 
			public string? Name { get; set; } 
			public string? Price { get; set; } 
		} 
 
		public static void Main() 
		{ 
			// initializing HAP 
			var web = new HtmlWeb(); 
		 
			// this can't be a List because it's not thread-safe 
			var products = new ConcurrentBag<Product>(); 
		 
			// the complete list of pages to scrape 
			var pagesToScrape = new ConcurrentBag<string> { 
				"https://www.scrapingcourse.com/ecommerce/page/1/", 
				"https://www.scrapingcourse.com/ecommerce/page/2/", 
				// ... 
				"https://www.scrapingcourse.com/ecommerce/page/12/" 
			}; 
 
			// performing parallel web scraping 
			Parallel.ForEach( 
				pagesToScrape, 
				new ParallelOptions { MaxDegreeOfParallelism = 4 }, 
				currentPage => 
				{ 
					var currentDocument = web.Load(currentPage); 
 
					var productHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("li.product"); 
					foreach (var productHTMLElement in productHTMLElements) 
					{ 
						var url = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("a").Attributes["href"].Value); 
						var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("img").Attributes["src"].Value); 
						var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("h2").InnerText); 
						var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".price").InnerText); 
 
						var product = new Product() { Url = url, Image = image, Name = name, Price = price }; 
 
						products.Add(product); 
					} 
				} 
			); 
 
			// exporting to CSV 
			using (var writer = new StreamWriter("products.csv")) 
			using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture)) 
			{ 
				csv.WriteRecords(products); 
			} 
		} 
	} 
}

Scraping a Dynamic-Content Website with a Headless Browser in C#

Static-content sites have all their content embedded in the HTML pages returned by the server. This makes them an easy scraping target for any HTML parsing library.

Dynamic-content websites, on the other hand, rely on JavaScript to render or retrieve all or part of their content. Scraping such websites requires a tool that can run JavaScript, like a headless browser. If you're not familiar with the term, a headless browser is a programmable browser with no GUI.

With more than 65 million downloads, the most used headless browser library for C# is Selenium. Install the Selenium.WebDriver NuGet package:

Terminal
dotnet add package Selenium.WebDriver

Use Selenium in headless mode to scrape data from ScrapingCourse.com with the following logic:

program.cs
using CsvHelper; 
using System.Globalization; 
using OpenQA.Selenium; 
using OpenQA.Selenium.Chrome; 
 
namespace SimpleWebScraper 
{ 
	public class Program 
	{ 
		public class Product 
		{ 
			public string? Url { get; set; } 
			public string? Image { get; set; } 
			public string? Name { get; set; } 
			public string? Price { get; set; } 
		} 
 
		public static void Main() 
		{ 
			var products = new List<Product>(); 
 
			// to open Chrome in headless mode 
			var chromeOptions = new ChromeOptions(); 
			chromeOptions.AddArguments("headless"); 
 
			// starting a Selenium instance 
			using (var driver = new ChromeDriver(chromeOptions)) 
			{ 
				// navigating to the target page in the browser 
				driver.Navigate().GoToUrl("https://www.scrapingcourse.com/ecommerce/"); 
 
				// getting the HTML product elements 
				var productHTMLElements = driver.FindElements(By.CssSelector("li.product")); 
				// iterating over them to scrape the data of interest 
				foreach (var productHTMLElement in productHTMLElements) 
				{ 
					// scraping logic 
					var url = productHTMLElement.FindElement(By.CssSelector("a")).GetAttribute("href"); 
					var image = productHTMLElement.FindElement(By.CssSelector("img")).GetAttribute("src"); 
					var name = productHTMLElement.FindElement(By.CssSelector("h2")).Text; 
					var price = productHTMLElement.FindElement(By.CssSelector(".price")).Text; 
 
					var product = new Product() { Url = url, Image = image, Name = name, Price = price }; 
 
					products.Add(product); 
				} 
			} 
 
			// export logic 
			using (var writer = new StreamWriter("products.csv")) 
			using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture)) 
			{ 
				csv.WriteRecords(products); 
			} 
		} 
	} 
}

The Selenium FindElements() method instructs the browser to look for HTML nodes. Thanks to it, you can select the product HTML elements via a CSS selector query. Then, iterate over them in a foreach loop, applying GetAttribute() and Text to extract the data of interest.

Scraping a website in C# with HAP or Selenium is about the same, code-wise. The difference is in the way they run the scraping logic: HAP parses HTML pages to extract data from them, while Selenium runs the scraping statements in a headless browser.

Thanks to Selenium, you can crawl dynamic-content websites and interact with web pages in a browser as a real user would. This also means that your script is less likely to be detected as a bot since Selenium makes it easier to scrape a web page without getting blocked.
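
One practical note: on heavily dynamic pages, the data may not be in the DOM right after navigation. In that case, you'd typically wait for it with WebDriverWait, as in this sketch (recent Selenium.WebDriver versions ship the class; older ones need the Selenium.Support package):

program.cs
using OpenQA.Selenium.Support.UI;

// waiting up to 10 seconds for JavaScript to render the product nodes
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
wait.Until(d => d.FindElements(By.CssSelector("li.product")).Count > 0);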

Html Agility Pack doesn't come with browser functionality, so you can only use HAP to scrape static-content websites. On the other hand, it doesn't involve the resource overhead of running a browser, as Selenium does.

Other Web Scraping Libraries in C#

Other tools to consider when it comes to web scraping with C# are:

  • ZenRows: A fully featured, easy-to-use API for extracting data from web pages. ZenRows offers an automatic bypass for any anti-bot or anti-scraping system. Plus, it comes with rotating proxies, headless browser functionality, and a 99% uptime guarantee.
  • Puppeteer Sharp: The .NET port of the popular Puppeteer Node.js library. With it, you can instruct a headless Chromium browser to perform testing and scraping.
  • AngleSharp: An open-source .NET library for parsing and manipulating XML and HTML. It allows you to extract data from a website and select HTML elements via CSS selectors (see the sketch below).
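
For a taste of AngleSharp's API, here's a minimal sketch of the same product-name extraction used throughout this tutorial (to be run inside an async method):

example.cs
using AngleSharp;

// fetching and parsing the target page with AngleSharp
var context = BrowsingContext.New(Configuration.Default.WithDefaultLoader());
var document = await context.OpenAsync("https://www.scrapingcourse.com/ecommerce/");
// selecting the product name elements via a CSS selector
foreach (var nameElement in document.QuerySelectorAll("li.product h2"))
{
	Console.WriteLine(nameElement.TextContent);
}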

This was a short reminder that there are other useful tools for data scraping with C#. Read our guide on the best C# web scraping libraries.

Conclusion

Our step-by-step tutorial covered everything you need to know about web scraping in C#. First, we learned the basics and then tackled the most advanced C# web scraping concepts.

As a recap, you now know:

  • How to do basic web scraping in C# with Html Agility Pack.
  • How to scrape an entire website through web crawling.
  • When you need to use a C# headless browser solution.
  • How to extract data from dynamic-content websites with Selenium.

Web data scraping using C# is a challenge due to the many anti-scraping technologies websites now use. Bypassing them all isn't easy, and you always need to find a workaround. Avoid all this with a complete C# web scraping API, like ZenRows. Thanks to it, you can perform data scraping via API calls and forget about anti-bot protections.

Frequently Asked Questions

How Do You Scrape Data From a Website in C#?

Scraping data from the web in C# works the same way as in other programming languages: with a C# web scraping library, you can connect to the desired website, select HTML elements from its DOM, and retrieve data.

Is C# Good for Web Scraping?

Yes, it is! C# is a general-purpose programming language that enables you to do web scraping. C# has a large and active community that developed many libraries to help you achieve your scraping goals.

What Is the Best Way to Scrape With C#?

Using one of the many NuGet libraries for scraping in C# makes everything easier. Some of the most popular C# libraries to support your data crawling project are Selenium, ScrapySharp, and Html Agility Pack.
