
Web Scraping in C#: Complete Guide 2023

February 7, 2023 · 14 min read

More and more companies take advantage of data extracted from the web nowadays, and one of the most suitable programming languages for this purpose is C#. In this step-by-step tutorial, you'll see how to do web scraping in C# using libraries like Selenium and Html Agility Pack.

Let's get started!

Prerequisites

Set Up the Environment

To follow this C# scraping guide, you need the .NET SDK and Visual Studio Code with its .NET extensions.

To save time, you can directly install the .NET Coding Pack. It includes Visual Studio Code with the essential .NET extensions and the .NET SDK. Otherwise, download and install the required tools separately.

You should now be all set to follow our web scraping C# tutorial.

However, let's first verify that you installed .NET correctly. Launch a PowerShell window, and run the command below.

dotnet --list-sdks

This should print the version of the .NET SDK installed on your machine.

7.0.101 [C:\Program Files\dotnet\sdk]

If you receive a 'dotnet' is not recognized as an internal or external command error, then something went wrong. Restart your machine and try again. If the command above returns the same error, you'll need to reinstall .NET.

Initialize a C# Project

Let's create a .NET console application in Visual Studio Code. In case of problems, consult the official guide.

First, create an empty folder called SimpleWebScraper for your C# project.

mkdir SimpleWebScraper

Now, launch Visual Studio Code and select "File > Open Folder..." from the top menu.

VS Code open folder

Select SimpleWebScraper and wait for Visual Studio Code to open the folder. Then, reach the Terminal window by selecting "View > Terminal" from the main menu.

VS Code launch terminal

In the Visual Studio Code terminal, launch the following command:

dotnet new console --framework net7.0

This will initialize a .NET 7.0 console project. Specifically, it will create a .csproj project file and a Program.cs C# file.

Now, replace the content of Program.cs with the code below.

namespace SimpleWebScraper 
{ 
	class Program 
	{ 
		static void Main(string[] args) 
		{ 
			Console.WriteLine("Hello, World!"); 
 
			// scraping logic... 
		} 
	} 
}

This is what a simple console script looks like in C#. Note that the Main() function will contain the C# data scraping logic.

Run the script by launching the command you see next:

dotnet run

This should print:

Hello, World!

Great, your initial C# script works as expected!

You're about to learn the basics of web scraping in C#.


How to Scrape a Website in C#

We'll learn how to build a data scraper with C# by extracting data from ScrapeMe, a website that showcases a list of Pokemon-inspired elements spread over several pages. The C# spider will automatically visit and extract the product data from every one of them.

This is what ScrapeMe looks like:

ScrapeMe Homepage

Let's install some dependencies and start scraping data from the web.

Step 1: Install Html Agility Pack and Its CSS Selector Extension

Html Agility Pack (HAP) is a powerful open-source .NET library for parsing HTML documents. It offers a flexible API for web scraping, allowing you to download an HTML page and parse it. You can also select HTML elements and extract data from them.

Install Html Agility Pack through the NuGet HtmlAgilityPack package.

dotnet add package HtmlAgilityPack

Although Html Agility Pack natively supports XPath and XSLT, these aren't the most popular approaches when it comes to selecting HTML elements from the DOM. Fortunately, there's the HtmlAgilityPack CSS Selector extension.

Install it via the NuGet HtmlAgilityPack.CssSelectors library.

dotnet add package HtmlAgilityPack.CssSelectors

HAP will now be able to understand CSS selectors via extension methods.
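For comparison, here is a minimal sketch of the native XPath route mentioned above, using only the core HtmlAgilityPack package. The class and method names are made up for illustration, and the HTML is parsed from a string so the example doesn't depend on the network.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HAP
public static class XPathSelectionSketch
{
    // Returns the text of every <h2> inside an <li> whose class list contains "product"
    public static List<string> GetProductNames(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // XPath 1.0 has no class selector, so "product" is matched as a whole word in @class
        var nodes = document.DocumentNode.SelectNodes(
            "//li[contains(concat(' ', normalize-space(@class), ' '), ' product ')]//h2");

        var names = new List<string>();

        // SelectNodes() returns null, not an empty collection, when nothing matches
        if (nodes == null)
        {
            return names;
        }

        foreach (var node in nodes)
        {
            names.Add(node.InnerText.Trim());
        }

        return names;
    }
}
```

The equivalent CSS selector is just li.product h2, which is why the rest of this tutorial relies on the CSS selector extension.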

Now, import Html Agility Pack in your C# web spider by adding the following line on top of your Program.cs file.

using HtmlAgilityPack;

If Visual Studio Code doesn't report errors, then you're good to go.

Time to see how to use HAP for web scraping in C#!

Step 2: Load the Target Web Page

Start by initializing an Html Agility Pack object.

var web = new HtmlWeb();

HtmlWeb gives you access to the web scraping capabilities offered by HAP.

Then, use HtmlWeb's Load() method to get the HTML from a URL.

// loading the target web page 
var document = web.Load("https://scrapeme.live/shop/");

Behind the scenes, HAP performs an HTTP GET request to download the web page and parse its HTML content. It raises an HtmlAgilityPack.HtmlWebException in case of error, and provides an HAP HtmlDocument object if everything works as expected.

You're now ready to use HtmlDocument to extract data from HTML elements. But first, let's study the code of the target page to define an effective strategy for selecting HTML elements.

Step 3: Inspect the Target Page

Explore the target web page to see how it's structured. We'll start with the target HTML nodes, which are the product elements. Right-click on one and access the browser DevTools by selecting the "Inspect" option:

ScrapeMe product on DevTools

Here, you can clearly see that a single li.product HTML element consists of the following four elements:

  • The product URL in an a.
  • The product image in an img.
  • The product name in an h2.
  • The product price in a .price span HTML element.

Inspect other product HTML elements, and you'll see they all share the same structure. What changes are the values stored in the underlying HTML elements. This means that you can scrape them all programmatically.

Next, we'll learn how to scrape data from these product HTML elements with HAP in C#.

Step 4: Extract Data From HTML Elements

You need to define a custom C# class to help you store the scraped data. For this purpose, define a nested PokemonProduct class inside the Program class as follows:

public class PokemonProduct 
{ 
	public string? Url { get; set; } 
	public string? Image { get; set; } 
	public string? Name { get; set; } 
	public string? Price { get; set; } 
}

This custom class contains the Url, Image, Name, and Price properties. These match what you're interested in scraping from every product.

Now, initialize a list of PokemonProduct in your Main() function with the line below:

var pokemonProducts = new List<PokemonProduct>();

This will contain the scraped data stored in PokemonProduct instances.

It's time to use HAP to extract the list of all li.product HTML elements from the DOM, like this:

// selecting all HTML product elements from the current page 
var productHTMLElements = document.DocumentNode.QuerySelectorAll("li.product");

QuerySelectorAll() allows you to retrieve HTML nodes from the DOM with a CSS selector. Here, the method applies the li.product CSS selector strategy to get all product elements. Specifically, QuerySelectorAll() returns a list of HAP HtmlNode objects.

Note that QuerySelectorAll() comes from the HAP CSS selector extension, so you won't find it in Html Agility Pack's original interface.

Use a foreach loop to iterate over the list of HTML elements and scrape data from each product.

// iterating over the list of product elements 
foreach (var productHTMLElement in productHTMLElements) 
{ 
	// scraping the interesting data from the current HTML element 
	var url = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("a").Attributes["href"].Value); 
	var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("img").Attributes["src"].Value); 
	var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("h2").InnerText); 
	var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".price").InnerText); 
	// instancing a new PokemonProduct object 
	var pokemonProduct = new PokemonProduct() { Url = url, Image = image, Name = name, Price = price }; 
	// adding the object containing the scraped data to the list 
	pokemonProducts.Add(pokemonProduct); 
}

Incredible! You just implemented C# web scraping logic!

The QuerySelector() method applies a CSS selector to the descendants of an HtmlNode and returns the first match.

Then, we select an HTML attribute from Attributes and extract its data with Value. Wrap each value with HtmlEntity.DeEntitize() to replace known HTML entities with the characters they represent.

Again, note that QuerySelector() comes from the Html Agility Pack CSS Selector extension. You won't find that method in vanilla HAP.
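The loop above assumes every product has an a, img, h2, and .price node. If one is missing, QuerySelector() returns null and the chained access throws. Below is a minimal sketch of null-tolerant helpers. The class and method names are hypothetical, and it assumes both HtmlAgilityPack and HtmlAgilityPack.CssSelectors are installed.

```csharp
#nullable enable
using HtmlAgilityPack;

// Hypothetical helpers, not part of HAP
public static class SafeExtractionSketch
{
    // Reads an attribute from the first node matching the CSS selector.
    // Returns null instead of throwing when the node or the attribute is missing.
    public static string? AttributeOrNull(HtmlNode parent, string cssSelector, string attribute)
    {
        var node = parent.QuerySelector(cssSelector);
        var value = node?.Attributes[attribute]?.Value;
        return value == null ? null : HtmlEntity.DeEntitize(value);
    }

    // Reads the decoded, trimmed text of the first node matching the CSS selector
    public static string? TextOrNull(HtmlNode parent, string cssSelector)
    {
        var node = parent.QuerySelector(cssSelector);
        return node == null ? null : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```

Inside the foreach loop, you could then write var image = SafeExtractionSketch.AttributeOrNull(productHTMLElement, "img", "src"); and decide what to do with products whose image is null.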

Awesome! Time to learn how to export the scraped data in an easy-to-read format, such as CSV.

Step 5: Export the Scraped Data to CSV

You can convert scraped data to CSV with native C# functions, but a library will make it easier.
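To see what the native route involves, here is a minimal sketch that builds CSV lines by hand using only the base class library. The class and method names are illustrative, and the quoting rules follow the common RFC 4180 convention rather than anything specific to this tutorial.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper showing manual CSV generation
public static class ManualCsvSketch
{
    // Wraps a field in quotes when it contains a comma, a quote, or a line break,
    // and doubles any quote characters inside it
    public static string EscapeField(string? field)
    {
        var value = field ?? "";
        var needsQuotes = value.Contains(',') || value.Contains('"')
            || value.Contains('\n') || value.Contains('\r');

        return needsQuotes ? "\"" + value.Replace("\"", "\"\"") + "\"" : value;
    }

    // Joins the escaped fields into a single CSV record
    public static string ToCsvLine(IEnumerable<string?> fields)
    {
        return string.Join(",", fields.Select(EscapeField));
    }
}
```

You would still need to write the header row, loop over the products, and save the lines with something like File.WriteAllLines(). A library takes care of all of that for you.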

CsvHelper is a fast, flexible, and reliable .NET library for reading and writing CSV files.

Install it by adding the NuGet CsvHelper package to your project's dependencies with:

dotnet add package CsvHelper

Import it into your project by adding this line to the top of your Program.cs file:

using CsvHelper;

Convert the scraped data to a CSV output file with CsvHelper as below:

// initializing the CSV output file 
using (var writer = new StreamWriter("pokemon-products.csv")) 
// initializing the CSV writer 
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture)) 
{ 
	// populating the CSV file 
	csv.WriteRecords(pokemonProducts); 
}

The snippet here initializes a pokemon-products.csv file. Then, CsvHelper's WriteRecords() writes all the product records to that CSV file. Thanks to the C# using statement, the script will automatically free the resources associated with the writing objects.

Note that the constructor requires a CultureInfo parameter. This defines the formatting specs and the delimiter and line-ending character to use. InvariantCulture ensures that any software can parse the produced CSV regardless of the user's local settings.
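To make the effect of the culture concrete, here is a small sketch using only System.Globalization. The class and method names are illustrative. Rather than depending on a specific locale being installed, it clones InvariantCulture and switches the decimal separator to a comma, which is how many European locales format numbers.

```csharp
using System.Globalization;

// Hypothetical helper to compare number formatting across cultures
public static class CultureSketch
{
    // Formats a price with the number format of the given culture
    public static string FormatPrice(decimal price, CultureInfo culture)
    {
        return price.ToString(culture);
    }

    // Builds a stand-in for a locale that writes decimals with a comma
    public static CultureInfo CommaDecimalCulture()
    {
        var culture = (CultureInfo)CultureInfo.InvariantCulture.Clone();
        culture.NumberFormat.NumberDecimalSeparator = ",";
        return culture;
    }
}

// CultureSketch.FormatPrice(63.5m, CultureInfo.InvariantCulture)      gives "63.5"
// CultureSketch.FormatPrice(63.5m, CultureSketch.CommaDecimalCulture()) gives "63,5"
// CultureInfo.InvariantCulture.TextInfo.ListSeparator                 is ","
```

A value like 63,5 in a comma-delimited file is ambiguous unless it's quoted, which is the kind of problem a fixed culture helps you avoid.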

To use CultureInfo values, you need the following extra import:

using System.Globalization;

Fantastic! All that remains is to launch the C# web scraper!

Step 6: Launch the Scraper

This is what the Program.cs C# data scraper implemented so far looks like:

using HtmlAgilityPack; 
using CsvHelper; 
using System.Globalization; 
 
namespace SimpleWebScraper 
{ 
	public class Program 
	{ 
		// defining a custom class to store the scraped data 
		public class PokemonProduct 
		{ 
			public string? Url { get; set; } 
			public string? Image { get; set; } 
			public string? Name { get; set; } 
			public string? Price { get; set; } 
		} 
 
		public static void Main() 
		{ 
			// creating the list that will keep the scraped data 
			var pokemonProducts = new List<PokemonProduct>(); 
 
			// creating the HAP object 
			var web = new HtmlWeb(); 
 
			// visiting the target web page 
			var document = web.Load("https://scrapeme.live/shop/"); 
 
			// getting the list of HTML product nodes 
			var productHTMLElements = document.DocumentNode.QuerySelectorAll("li.product"); 
			// iterating over the list of product HTML elements 
			foreach (var productHTMLElement in productHTMLElements) 
			{ 
				// scraping logic 
				var url = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("a").Attributes["href"].Value); 
				var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("img").Attributes["src"].Value); 
				var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("h2").InnerText); 
				var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".price").InnerText); 
 
				var pokemonProduct = new PokemonProduct() { Url = url, Image = image, Name = name, Price = price }; 
				pokemonProducts.Add(pokemonProduct); 
			} 
 
			// creating the CSV output file 
			using (var writer = new StreamWriter("pokemon-products.csv")) 
			using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture)) 
			{ 
				// populating the CSV file 
				csv.WriteRecords(pokemonProducts); 
			} 
		} 
	} 
}

Run the script with the command below:

dotnet run

It might take a while to complete depending on the response time of the target page's server. When it's done, you'll find a pokemon-products.csv file in the root folder of your C# project. Open it to explore the data below:

ScrapeMe Output

Wow! In 50 lines of code, you built a fully functional C# data scraper!

Advanced Web Scraping in C#

Web scraping in C# is much more than the fundamentals you just saw. Now, you'll learn about more advanced techniques to help you become a C# scraping expert!

Web Crawling in .NET

Don't forget that ScrapeMe shows a paginated list of products. That is, the target website consists of several web pages. To scrape all products, you need to visit the whole website, which is what web crawling is about.

To do web crawling in C#, you must follow all pagination links. Let's retrieve them all!

Inspect the pagination HTML element to understand how to extract the pages' URLs. Right-click on the number and select "Inspect":

Open Inspect in DevTools

You should be able to see something like this in the browser DevTools:

DevTools window open

Here, note that all pagination HTML elements share the page-numbers CSS class. In detail, only the a HTML nodes contain a URL, while the span elements are placeholders. So, you can select all pagination links with the a.page-numbers CSS selector.

To avoid scraping a page twice, you'll need a couple of extra data structures:

  • pagesDiscovered: A List to keep track of the URLs discovered by the crawler.
  • pagesToScrape: A Queue containing the list of pages the spider will scrape soon.

Also, a limit variable will prevent the C# spider from crawling pages forever.

// the URL of the first pagination web page 
var firstPageToScrape = "https://scrapeme.live/shop/page/1/"; 
 
// the list of pages discovered during the crawling task 
var pagesDiscovered = new List<string> { firstPageToScrape }; 
 
// the list of pages that remains to be scraped 
var pagesToScrape = new Queue<string>(); 
 
// initializing the list with firstPageToScrape 
pagesToScrape.Enqueue(firstPageToScrape); 
 
// current crawling iteration 
int i = 1; 
 
// the maximum number of pages to scrape before stopping 
int limit = 5; 
 
// while there are pages to scrape and the limit isn't hit 
// (i starts at 1, so <= is needed to visit exactly limit pages) 
while (pagesToScrape.Count != 0 && i <= limit) 
{ 
	// extracting the current page to scrape from the queue 
	var currentPage = pagesToScrape.Dequeue(); 
 
	// loading the page 
	var currentDocument = web.Load(currentPage); 
 
	// selecting the list of pagination HTML elements 
	var paginationHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("a.page-numbers"); 
 
	// to avoid visiting a page twice 
	foreach (var paginationHTMLElement in paginationHTMLElements) 
	{ 
		// extracting the current pagination URL 
		var newPaginationLink = paginationHTMLElement.Attributes["href"].Value; 
 
		// if the page discovered is new 
		if (!pagesDiscovered.Contains(newPaginationLink)) 
		{ 
			// if the page discovered needs to be scraped 
			if (!pagesToScrape.Contains(newPaginationLink)) 
			{ 
				pagesToScrape.Enqueue(newPaginationLink); 
			} 
			pagesDiscovered.Add(newPaginationLink); 
		} 
	} 
 
	// scraping logic... 
	 
	// incrementing the crawling counter 
	i++; 
}

The data crawler above does the following:

  1. Starts from the first page of the pagination list.
  2. Looks for new pagination URLs on the current page.
  3. Adds them to the scraping queue.
  4. Scrapes data from the current page.
  5. Repeats steps 2 to 4 for each page in the queue until the queue is empty or the crawler reaches the page limit.
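The steps above can also be sketched independently of any HTML library. The version below is a minimal illustration, not the tutorial's crawler: getLinks is a hypothetical callback that stands in for loading a page and extracting its pagination URLs, and a HashSet replaces the List so that the "already discovered" check and the insertion happen in a single call.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, library-agnostic version of the crawling loop
public static class CrawlerSketch
{
    // Returns the pages visited, in order, starting from firstPage and
    // stopping after at most limit pages
    public static List<string> Crawl(
        string firstPage,
        int limit,
        Func<string, IEnumerable<string>> getLinks)
    {
        var visited = new List<string>();
        var pagesDiscovered = new HashSet<string> { firstPage };
        var pagesToScrape = new Queue<string>();
        pagesToScrape.Enqueue(firstPage);

        while (pagesToScrape.Count != 0 && visited.Count < limit)
        {
            var currentPage = pagesToScrape.Dequeue();

            foreach (var link in getLinks(currentPage))
            {
                // HashSet.Add() returns false when the URL is already known
                if (pagesDiscovered.Add(link))
                {
                    pagesToScrape.Enqueue(link);
                }
            }

            // the scraping logic for currentPage would go here
            visited.Add(currentPage);
        }

        return visited;
    }
}
```

With HAP, getLinks could wrap web.Load() and the a.page-numbers selector shown earlier.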

Since ScrapeMe consists of 48 pages, set limit to 48 to scrape data from all products. In this case, pokemon-products.csv will have a record for each of the 755 products contained on the website.

Way to go! You're now able to build a web scraping C# app that can scrape a complete website!

Avoid Being Blocked

Your C# data scraper may fail because of the anti-scraping mechanisms websites adopt, and there are many such techniques your script should be ready for. Use ZenRows to easily get around them!

The most basic technique is to block HTTP requests based on the value of their headers. This generally happens when the requests use an invalid User-Agent value.

The User-Agent header contains info to qualify where the request comes from. Typically, the accepted ones refer to popular browsers and OS. Scraping libraries tend to use placeholder User-Agents that can easily expose your spider.

You can globally set a valid User-Agent in Html Agility Pack with the line below:

// setting a global User-Agent header in HAP 
web.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36";

Now, all HTTP requests performed by HAP will seem to come from Chrome 109.

Parallel Web Scraping in C#

The performance of web scraping with C# depends on the target web server's speed. Tackle this by making parallel requests and scraping pages simultaneously. Avoid dead time and take the speed of your scraper to the next level... this is what parallel web scraping in C# is about!

Store the list of all pages your C# data crawler should visit in a ConcurrentBag:

var pagesToScrape = new ConcurrentBag<string> { 
	"https://scrapeme.live/shop/page/1/", 
	"https://scrapeme.live/shop/page/2/", 
	"https://scrapeme.live/shop/page/3/", 
	// ... 
	"https://scrapeme.live/shop/page/47/", 
	"https://scrapeme.live/shop/page/48/" 
};

In C#, List isn't thread-safe, so you shouldn't use it in parallel tasks. Replace it with ConcurrentBag, its unordered, thread-safe alternative.
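Rather than typing all 48 URLs by hand, one option is to generate them. The sketch below is an illustration with made-up names: BuildPageUrls() assumes the /shop/page/N/ URL pattern shown above, and ScrapeAll() takes a hypothetical scrapePage callback in place of the real scraping logic. The input can stay a plain list because the parallel loop only reads it, while the results go into a ConcurrentBag because several threads write to it.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helpers for parallel scraping
public static class ParallelSketch
{
    // Builds the URLs from /shop/page/1/ to /shop/page/{lastPage}/
    public static List<string> BuildPageUrls(int lastPage)
    {
        return Enumerable.Range(1, lastPage)
            .Select(page => $"https://scrapeme.live/shop/page/{page}/")
            .ToList();
    }

    // Runs scrapePage on up to maxParallelism pages at a time and
    // collects everything it returns in a thread-safe bag
    public static ConcurrentBag<TResult> ScrapeAll<TResult>(
        IEnumerable<string> pages,
        int maxParallelism,
        Func<string, IEnumerable<TResult>> scrapePage)
    {
        var results = new ConcurrentBag<TResult>();

        Parallel.ForEach(
            pages,
            new ParallelOptions { MaxDegreeOfParallelism = maxParallelism },
            page =>
            {
                foreach (var item in scrapePage(page))
                {
                    results.Add(item);
                }
            });

        return results;
    }
}
```

Keep in mind that a ConcurrentBag doesn't preserve insertion order, so the products won't necessarily come out in page order.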

For the same reason, make pokemonProducts a ConcurrentBag:

var pokemonProducts = new ConcurrentBag<PokemonProduct>();

Let's perform parallel web scraping with C#! Use Parallel.ForEach() to run a foreach loop in parallel and scrape several pages at the same time:

// the import statement required to use ConcurrentBag 
using System.Collections.Concurrent; 
 
// ... 
 
Parallel.ForEach( 
	pagesToScrape, 
	// limiting the parallelization level to 4 pages at a time 
	new ParallelOptions { MaxDegreeOfParallelism = 4 }, 
	currentPage => { 
		// visiting the current page of the loop 
		var currentDocument = web.Load(currentPage); 
	 
		// complete scraping logic 
		var productHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("li.product"); 
		foreach (var productHTMLElement in productHTMLElements) 
		{ 
			var url = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("a").Attributes["href"].Value); 
			var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("img").Attributes["src"].Value); 
			var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("h2").InnerText); 
			var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".price").InnerText); 
	 
			var pokemonProduct = new PokemonProduct() { Url = url, Image = image, Name = name, Price = price }; 
	 
			// storing the scraped product data in parallel 
			pokemonProducts.Add(pokemonProduct); 
		} 
	} 
);

Great! Your web scraper in C# is now lightning-fast! But don't forget to limit the level of parallelization to avoid stressing the server. Your goal is to extract data from a website, not to perform a DoS attack.

The snippet above served as an example to understand how to achieve parallel crawling in C#. Take a look at the entire parallel C# data spider here:

using HtmlAgilityPack; 
using CsvHelper; 
using System.Globalization; 
using System.Collections.Concurrent; 
 
namespace SimpleWebScraper 
{ 
	public class Program 
	{ 
		public class PokemonProduct 
		{ 
			public string? Url { get; set; } 
			public string? Image { get; set; } 
			public string? Name { get; set; } 
			public string? Price { get; set; } 
		} 
 
		public static void Main() 
		{ 
			// initializing HAP 
			var web = new HtmlWeb(); 
		 
			// this can't be a List because it's not thread-safe 
			var pokemonProducts = new ConcurrentBag<PokemonProduct>(); 
		 
			// the complete list of pages to scrape 
			var pagesToScrape = new ConcurrentBag<string> { 
				"https://scrapeme.live/shop/page/1/", 
				"https://scrapeme.live/shop/page/2/", 
				// ... 
				"https://scrapeme.live/shop/page/48/" 
			}; 
 
			// performing parallel web scraping 
			Parallel.ForEach( 
				pagesToScrape, 
				new ParallelOptions { MaxDegreeOfParallelism = 4 }, 
				currentPage => 
				{ 
					var currentDocument = web.Load(currentPage); 
 
					var productHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("li.product"); 
					foreach (var productHTMLElement in productHTMLElements) 
					{ 
						var url = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("a").Attributes["href"].Value); 
						var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("img").Attributes["src"].Value); 
						var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("h2").InnerText); 
						var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".price").InnerText); 
 
						var pokemonProduct = new PokemonProduct() { Url = url, Image = image, Name = name, Price = price }; 
 
						pokemonProducts.Add(pokemonProduct); 
					} 
				} 
			); 
 
			// exporting to CSV 
			using (var writer = new StreamWriter("pokemon-products.csv")) 
			using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture)) 
			{ 
				csv.WriteRecords(pokemonProducts); 
			} 
		} 
	} 
}

Scraping a Dynamic-Content Website with a Headless Browser in C#

Static-content sites have all their content embedded in the HTML pages returned by the server. This makes them an easy scraping target for any HTML parsing library.

Dynamic-content websites rely on JavaScript to render or retrieve all or part of their content. Scraping such websites requires a tool that can run JavaScript, like a headless browser. If you're not familiar with this term, a headless browser is a programmable browser with no GUI.

With more than 65 million downloads, the most used headless browser library for C# is Selenium. Install Selenium.WebDriver's NuGet package.

dotnet add package Selenium.WebDriver

Use Selenium in headless mode to scrape data from ScrapeMe with the following logic:

using CsvHelper; 
using System.Globalization; 
using OpenQA.Selenium; 
using OpenQA.Selenium.Chrome; 
 
namespace SimpleWebScraper 
{ 
	public class Program 
	{ 
		public class PokemonProduct 
		{ 
			public string? Url { get; set; } 
			public string? Image { get; set; } 
			public string? Name { get; set; } 
			public string? Price { get; set; } 
		} 
 
		public static void Main() 
		{ 
			var pokemonProducts = new List<PokemonProduct>(); 
 
			// to open Chrome in headless mode 
			var chromeOptions = new ChromeOptions(); 
			chromeOptions.AddArguments("headless"); 
 
			// starting a Selenium instance 
			using (var driver = new ChromeDriver(chromeOptions)) 
			{ 
				// navigating to the target page in the browser 
				driver.Navigate().GoToUrl("https://scrapeme.live/shop/"); 
 
				// getting the HTML product elements 
				var productHTMLElements = driver.FindElements(By.CssSelector("li.product")); 
				// iterating over them to scrape the data of interest 
				foreach (var productHTMLElement in productHTMLElements) 
				{ 
					// scraping logic 
					var url = productHTMLElement.FindElement(By.CssSelector("a")).GetAttribute("href"); 
					var image = productHTMLElement.FindElement(By.CssSelector("img")).GetAttribute("src"); 
					var name = productHTMLElement.FindElement(By.CssSelector("h2")).Text; 
					var price = productHTMLElement.FindElement(By.CssSelector(".price")).Text; 
 
					var pokemonProduct = new PokemonProduct() { Url = url, Image = image, Name = name, Price = price }; 
 
					pokemonProducts.Add(pokemonProduct); 
				} 
			} 
 
			// export logic 
			using (var writer = new StreamWriter("pokemon-products.csv")) 
			using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture)) 
			{ 
				csv.WriteRecords(pokemonProducts); 
			} 
		} 
	} 
}

The Selenium FindElements() function allows instructing the browser to look for HTML nodes. Thanks to it, you can select the product HTML elements via a CSS selector query. Then, iterate over them in a foreach loop. Apply GetAttribute() and use Text to extract the data of interest.
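On pages that add their elements with JavaScript after the initial load, FindElements() may run before those elements exist and return an empty collection. A minimal sketch of an explicit wait follows. The class and method names are hypothetical. WebDriverWait lives in the OpenQA.Selenium.Support.UI namespace; depending on your Selenium version, you may need to add the Selenium.Support NuGet package to use it.

```csharp
#nullable enable
using System;
using System.Collections.ObjectModel;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

// Hypothetical helpers, not part of Selenium
public static class SeleniumWaitSketch
{
    // Same headless setup used in the scraper above
    public static ChromeOptions BuildHeadlessOptions()
    {
        var options = new ChromeOptions();
        options.AddArgument("headless");
        return options;
    }

    // Polls the page until at least one element matches the selector.
    // Throws a WebDriverTimeoutException if the timeout expires first.
    public static ReadOnlyCollection<IWebElement> WaitForElements(
        IWebDriver driver, string cssSelector, TimeSpan timeout)
    {
        var wait = new WebDriverWait(driver, timeout);

        return wait.Until(d =>
        {
            var elements = d.FindElements(By.CssSelector(cssSelector));
            // returning null tells WebDriverWait to keep polling
            return elements.Count > 0 ? elements : null;
        })!;
    }
}
```

In the scraper above, you could replace the FindElements() call with SeleniumWaitSketch.WaitForElements(driver, "li.product", TimeSpan.FromSeconds(10)). ScrapeMe itself is static, so the wait returns right away there.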

Scraping a website in C# with HAP or Selenium is about the same, code-wise. The difference is in the way they run the scraping logic. HAP parses HTML pages to extract data from them and Selenium runs the scraping statements in a headless browser.

Thanks to Selenium, you can crawl dynamic-content websites and interact with web pages in a browser as a real user would. This also means that your script is less likely to be detected as a bot, since its requests come from a real browser.

Html Agility Pack doesn't come with complete browser functionality, so you can only use HAP to scrape static-content websites. On the other hand, it doesn't involve the resource overhead of running a browser, which is typical of Selenium.

Putting It All Together: Final Code

Here is the complete code of the C# scraper with crawling and basic anti-block logic built with Html Agility Pack:

using HtmlAgilityPack; 
using System.Globalization; 
using CsvHelper; 
using System.Collections.Concurrent; 
namespace SimpleWebScraper 
{ 
	public class Program 
	{ 
		// defining a custom class to store 
		// the scraped data 
		public class PokemonProduct 
		{ 
			public string? Url { get; set; } 
			public string? Image { get; set; } 
			public string? Name { get; set; } 
			public string? Price { get; set; } 
		} 
		public static void Main() 
		{ 
			// initializing HAP 
			var web = new HtmlWeb(); 
			// setting a global User-Agent header 
			web.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"; 
			// creating the list that will keep the scraped data 
 
			var pokemonProducts = new List<PokemonProduct>(); 
			// the URL of the first pagination web page 
			var firstPageToScrape = "https://scrapeme.live/shop/page/1/"; 
			// the list of pages discovered during the crawling task 
			var pagesDiscovered = new List<string> { firstPageToScrape }; 
			// the list of pages that remains to be scraped 
			var pagesToScrape = new Queue<string>(); 
			// initializing the list with firstPageToScrape 
			pagesToScrape.Enqueue(firstPageToScrape); 
			// current crawling iteration 
			int i = 1; 
			// the maximum number of pages to scrape before stopping 
			int limit = 5; 
			// while there are pages to scrape and the limit isn't hit 
			// (i starts at 1, so <= is needed to visit exactly limit pages) 
			while (pagesToScrape.Count != 0 && i <= limit) 
			{ 
				// getting the current page to scrape from the queue 
				var currentPage = pagesToScrape.Dequeue(); 
				// loading the page 
				var currentDocument = web.Load(currentPage); 
				// selecting the list of pagination HTML elements 
				var paginationHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("a.page-numbers"); 
				// to avoid visiting a page twice 
				foreach (var paginationHTMLElement in paginationHTMLElements) 
				{ 
					// extracting the current pagination URL 
					var newPaginationLink = paginationHTMLElement.Attributes["href"].Value; 
					// if the page discovered is new 
					if (!pagesDiscovered.Contains(newPaginationLink)) 
					{ 
						// if the page discovered needs to be scraped 
						if (!pagesToScrape.Contains(newPaginationLink)) 
						{ 
							pagesToScrape.Enqueue(newPaginationLink); 
						} 
						pagesDiscovered.Add(newPaginationLink); 
					} 
				} 
				// getting the list of HTML product nodes 
				var productHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("li.product"); 
				// iterating over the list of product HTML elements 
				foreach (var productHTMLElement in productHTMLElements) 
				{ 
					// scraping logic 
					var url = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("a").Attributes["href"].Value); 
					var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("img").Attributes["src"].Value); 
					var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("h2").InnerText); 
					var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".price").InnerText); 
					var pokemonProduct = new PokemonProduct() { Url = url, Image = image, Name = name, Price = price }; 
					pokemonProducts.Add(pokemonProduct); 
				} 
				// incrementing the crawling counter 
				i++; 
			} 
			// opening the CSV output stream writer 
			using (var writer = new StreamWriter("pokemon-products.csv")) 
			using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture)) 
			{ 
				// populating the CSV file 
				csv.WriteRecords(pokemonProducts); 
			} 
		} 
	} 
}

Wonderful! Fewer than 100 lines of code are enough to build a web scraper in C#!

Other Web Scraping Libraries in C#

Other tools to consider when it comes to web scraping with C# are:

  • ZenRows: A fully-featured easy-to-use API to make extracting data from web pages easy. ZenRows offers an automatic bypass for any anti-bot or anti-scraping system. Plus, it comes with rotating proxies, headless browser functionality, and a 99% uptime guarantee.
  • Puppeteer Sharp: The .NET port of the popular Puppeteer Node.js library. With it, you can instruct a headless Chromium browser to perform testing and scraping.
  • AngleSharp: An open-source .NET library for parsing and manipulating XML and HTML. It allows you to extract data from a website and select HTML elements via CSS selectors.

This was a short reminder that there are other useful tools for data scraping with C#. Read our guide on the best C# web scraping libraries.

Conclusion

Our step-by-step tutorial covered everything you need to know about web scraping in C#. First, we learned the basics and then tackled the most advanced C# web scraping concepts.

As a recap, you now know:

  • How to do basic web scraping in C# with Html Agility Pack.
  • How to scrape an entire website through web crawling.
  • When you need to use a C# headless browser solution.
  • How to extract data from dynamic-content websites with Selenium.

Web data scraping using C# is a challenge. That's due to the many anti-scraping technologies websites now use. Bypassing them all isn't easy, and you always need to find a workaround. Avoid all this with a complete C# web scraping API, like ZenRows. Thanks to it, you perform data scraping via API calls and forget about anti-bot protections.

Frequently Asked Questions

How Do You Scrape Data From a Website in C#?

Scraping data from the web in C# works the same way as in other programming languages. With a C# web scraping library, you can connect to the desired website, select HTML elements from its DOM, and retrieve data.

Is C# Good for Web Scraping?

Yes, it is! C# is a general-purpose programming language that enables you to do web scraping. C# has a large and active community that developed many libraries to help you achieve your scraping goals.

What Is the Best Way to Scrape With C#?

Using one of the many NuGet libraries for scraping in C# makes everything easier. Some of the most popular C# libraries to support your data crawling project are Selenium, ScrapySharp, and Html Agility Pack.

