More and more companies take advantage of data extracted from the web nowadays, and one of the most suitable programming languages for this purpose is C#. In this step-by-step tutorial, you'll see how to do web scraping in C# using libraries like Selenium and Html Agility Pack.
Let's get started!
Prerequisites
Set Up the Environment
Here are the prerequisites you need to meet to follow this C# scraping guide:
- .NET 8+: The most recent version of the .NET SDK will do. At the time of writing, this is 8.0.205.
- An IDE for coding in C#: Visual Studio 2022 Community Edition is a complete solution. If you prefer a lighter option, Visual Studio Code with the C# extension is perfect.
To save time, you can directly install the .NET Coding Pack. It includes Visual Studio Code with the essential .NET extensions and the .NET SDK. Otherwise, follow the links above to download the required tools.
You should now be all set to follow our C# web scraping tutorial.
First, though, let's verify that you installed .NET correctly. Launch a PowerShell window and run the command below:
dotnet --list-sdks
This should print the version of the .NET SDK installed on your machine:
8.0.205 [C:\Program Files\dotnet\sdk]
If you receive a 'dotnet' is not recognized as an internal or external command error, something went wrong. Restart your machine and try again. If the command still returns the same error, you'll need to reinstall .NET.
Initialize a C# Project
Let's create a .NET console application in Visual Studio Code. In case of problems, consult theย official guide.
First, create an empty folder called SimpleWebScraper for your C# project.
mkdir SimpleWebScraper
Now, launch Visual Studio Code and select "File > Open Folder..." from the top menu.
Select SimpleWebScraper and wait for Visual Studio Code to open the folder. Then, reach the terminal by selecting "View > Terminal" from the main menu.
In the Visual Studio Code terminal, launch the following command:
dotnet new console --framework net8.0
This will initialize a .NET 8.0 console project. Specifically, it will create a .csproj project file and a Program.cs C# file.
Now, replace the content of Program.cs with the code below.
namespace SimpleWebScraper
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Hello, World!");
// scraping logic...
}
}
}
This is what a simple console script looks like in C#. Note that the Main() function will contain the C# data scraping logic.
Run the script by launching the command you see next:
dotnet run
This should print:
"Hello, World!"
Great, your initial C# script works as expected!
You're about to learn the basics of web scraping in C#.
How to Scrape a Website in C#
We'll learn how to build a data scraper with C# by extracting data from ScrapingCourse.com, a demo site with real e-commerce features dedicated to testing web scrapers. The C# spider will visit each of its paginated product pages and extract the data from every product.
This is what the target website looks like:
Let's install some dependencies and start scraping data from the web.
Step 1: Install Html Agility Pack and Its CSS Selector Extension
Html Agility Pack (HAP) is a powerful open-source .NET library for parsing HTML documents. It offers a flexible API for web scraping, allowing you to download an HTML page and parse it. You can also select HTML elements and extract data from them.
Install Html Agility Pack through the NuGet HtmlAgilityPack package:
dotnet add package HtmlAgilityPack
Although Html Agility Pack natively supports XPath and XSLT, these aren't the most popular approaches for selecting HTML elements from the DOM. Fortunately, there's the HtmlAgilityPack CSS Selector extension.
Install it via the NuGet HtmlAgilityPack.CssSelectors library:
dotnet add package HtmlAgilityPack.CssSelectors
HAP will now be able to understand CSS selectors via extension methods.
Now, import Html Agility Pack in your C# web spider by adding the following line at the top of your Program.cs file:
using HtmlAgilityPack;
If Visual Studio Code doesn't report errors, then you're good to go.
Time to see how to use HAP for web scraping in C#!
Step 2: Load the Target Web Page
Start by initializing an Html Agility Pack object.
var web = new HtmlWeb();
HtmlWeb gives you access to the web scraping capabilities offered by HAP.
Then, use HtmlWeb's Load() method to get the HTML from a URL:
// loading the target web page
var document = web.Load("https://www.scrapingcourse.com/ecommerce/");
Behind the scenes, HAP performs an HTTP GET request to download the web page and parses its HTML content. It raises an HtmlAgilityPack.HtmlWebException in case of error and returns an HAP HtmlDocument object if everything works as expected.
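If the download fails (for example, because the site is unreachable or returns an error status), you may want to guard the call. Here's an optional, minimal sketch of that error handling; the log message is just illustrative:
HtmlDocument document;
try
{
    // downloading and parsing the target web page
    document = web.Load("https://www.scrapingcourse.com/ecommerce/");
}
catch (HtmlWebException e)
{
    // the download or parsing step failed
    Console.WriteLine($"Failed to load the page: {e.Message}");
    return;
}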
You're now ready to use HtmlDocument to extract data from HTML elements. But first, let's study the code of the target page to define an effective strategy for selecting HTML elements.
Step 3: Inspect the Target Page
Explore the target web page to see how it's structured. We'll start with the target HTML nodes, which are the product elements. Right-click on one and access the browser DevTools by selecting the "Inspect" option:
Here, you can clearly see that a single li.product HTML element consists of the following four elements:
- The product URL in an a.
- The product image in an img.
- The product name in an h2.
- The product price in a span with the price class.
Inspect the other product elements, and you'll see they all share the same structure. Only the values stored in the underlying HTML elements change, which means you can scrape them all programmatically.
Next, we'll learn how to scrape data from these product HTML elements with HAP in C#.
Step 4: Extract Data From HTML Elements
You need to define a custom C# class to store the scraped data. For this purpose, initialize a nested Product class inside the Program class as follows:
public class Product
{
public string? Url { get; set; }
public string? Image { get; set; }
public string? Name { get; set; }
public string? Price { get; set; }
}
This custom class contains the Url, Image, Name, and Price fields. These match the data you want to scrape from every product.
Now, initialize a list of Product in your Main() function with the line below:
var products = new List<Product>();
This list will contain the scraped data stored in Product instances.
It's time to use HAP to extract the list of all li.product HTML elements from the DOM, like this:
// selecting all HTML product elements from the current page
var productHTMLElements = document.DocumentNode.QuerySelectorAll("li.product");
QuerySelectorAll() allows you to retrieve HTML nodes from the DOM with a CSS selector. Here, the method applies the li.product CSS selector to get all product elements. Specifically, QuerySelectorAll() returns a list of HAP HtmlNode objects.
Note that QuerySelectorAll() comes from the HAP CSS selector extension, so you won't find it in Html Agility Pack's original interface.
Use a foreach loop to iterate over the list of HTML elements and scrape data from each product:
// iterating over the list of product elements
foreach (var productHTMLElement in productHTMLElements)
{
// scraping the interesting data from the current HTML element
var url = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("a").Attributes["href"].Value);
var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("img").Attributes["src"].Value);
var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("h2").InnerText);
var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".price").InnerText);
// instancing a new Product object
var product = new Product() { Url = url, Image = image, Name = name, Price = price };
// adding the object containing the scraped data to the list
products.Add(product);
}
Incredible! You just implemented C# web scraping logic!
The QuerySelector() method applies a CSS selector to the HtmlNode's child nodes to get a single one. Then, we select an HTML attribute from Attributes and extract its value with Value. Each value is wrapped with HtmlEntity.DeEntitize() to replace known HTML entities.
Again, note that QuerySelector() comes from the Html Agility Pack CSS Selector extension. You won't find that method in vanilla HAP.
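If you'd rather stick to vanilla HAP, you could achieve the same selection with the XPath methods it natively supports. This is just an equivalent sketch, not part of the tutorial's main flow:
// selecting all product elements with XPath instead of CSS selectors
var productNodes = document.DocumentNode.SelectNodes("//li[contains(@class, 'product')]");
foreach (var productNode in productNodes)
{
    // ".//" limits the search to the current node's descendants
    var url = HtmlEntity.DeEntitize(productNode.SelectSingleNode(".//a").Attributes["href"].Value);
    var name = HtmlEntity.DeEntitize(productNode.SelectSingleNode(".//h2").InnerText);
}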
Awesome! Time to learn how to export the scraped data in an easy-to-read format, such as CSV.
Step 5: Export the Scraped Data to CSV
You can convert scraped data to CSV with native C# functions, but a library will make it easier.
CsvHelper is a fast, flexible, and reliable .NET library for reading and writing CSV files.
Install it by adding the NuGet CsvHelper package to your project's dependencies with:
dotnet add package CsvHelper
Import it into your project by adding this line to the top of your Program.cs file:
using CsvHelper;
Convert the scraped data to a CSV output file with CsvHelper as below:
// initializing the CSV output file
using (var writer = new StreamWriter("products.csv"))
// initializing the CSV writer
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
{
// populating the CSV file
csv.WriteRecords(products);
}
The snippet above initializes a products.csv file. Then, CsvHelper's WriteRecords() writes all the product records to that CSV file. Thanks to the C# using statements, the script automatically frees the resources associated with the writer objects.
Note that the CsvWriter constructor requires a CultureInfo parameter. This defines the formatting specs, as well as the delimiter and line-ending characters to use. InvariantCulture ensures that any software can parse the produced CSV regardless of the user's local settings.
To use CultureInfo values, you need the following extra import:
using System.Globalization;
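If you ever need a delimiter other than the culture's default, CsvHelper also accepts a CsvConfiguration object in place of the plain CultureInfo. Here's a minimal sketch; the semicolon delimiter is just an example:
// requires: using CsvHelper.Configuration;
var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
    // overriding the default delimiter
    Delimiter = ";"
};
using (var writer = new StreamWriter("products.csv"))
using (var csv = new CsvWriter(writer, config))
{
    // populating the CSV file with the custom configuration
    csv.WriteRecords(products);
}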
Fantastic! All that remains is to launch the C# web scraper!
Step 6: Launch the Scraper
This is what the Program.cs C# data scraper implemented so far looks like:
using HtmlAgilityPack;
using CsvHelper;
using System.Globalization;
namespace SimpleWebScraper
{
public class Program
{
// defining a custom class to store the scraped data
public class Product
{
public string? Url { get; set; }
public string? Image { get; set; }
public string? Name { get; set; }
public string? Price { get; set; }
}
public static void Main()
{
// creating the list that will keep the scraped data
var products = new List<Product>();
// creating the HAP object
var web = new HtmlWeb();
// visiting the target web page
var document = web.Load("https://www.scrapingcourse.com/ecommerce/");
// getting the list of HTML product nodes
var productHTMLElements = document.DocumentNode.QuerySelectorAll("li.product");
// iterating over the list of product HTML elements
foreach (var productHTMLElement in productHTMLElements)
{
// scraping logic
var url = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("a").Attributes["href"].Value);
var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("img").Attributes["src"].Value);
var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("h2").InnerText);
var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".price").InnerText);
var product = new Product() { Url = url, Image = image, Name = name, Price = price };
products.Add(product);
}
// creating the CSV output file
using (var writer = new StreamWriter("products.csv"))
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
{
// populating the CSV file
csv.WriteRecords(products);
}
}
}
}
Run the script with the command below:
dotnet run
It might take a while to complete, depending on the response time of the target page's server. Once it's done, you'll find a products.csv file in the root folder of your C# project. Open it to explore the scraped data.
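If you want to sanity-check the export from code, CsvHelper can also read the file back into Product instances. This is an optional sketch, not required by the tutorial:
// reading the CSV back to verify the export
using (var reader = new StreamReader("products.csv"))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
    var records = csv.GetRecords<Product>().ToList();
    Console.WriteLine($"Exported {records.Count} products");
}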
Wow! In 50 lines of code, you built a fully functional C# data scraper!
Advanced Web Scraping in C#
Web scraping in C# is much more than the fundamentals you just saw. Now, you'll learn about more advanced techniques to help you become a C# scraping expert!
Web Crawling in .NET
Don't forget that ScrapingCourse.com shows a paginated list of products. To scrape all products, you need to visit the whole website, which is what web crawling is about.
To do web crawling in C#, you must follow all pagination links. Let's retrieve them all!
Inspect the pagination HTML element to understand how to extract the pages' URLs. Right-click on a page number and select "Inspect":
You should be able to see something like this in the browser DevTools:
Here, note that all pagination HTML elements share the page-numbers CSS class. In detail, only the a nodes contain a URL, while the span elements are placeholders. So, you can select all pagination links with the a.page-numbers CSS selector.
To avoid scraping a page twice, you'll need a couple of extra data structures:
- pagesDiscovered: A List to keep track of the URLs discovered by the crawler.
- pagesToScrape: A Queue containing the list of pages the spider will scrape soon.
Also, a limit variable will prevent the C# spider from crawling pages forever.
// the URL of the first pagination web page
var firstPageToScrape = "https://www.scrapingcourse.com/ecommerce/page/1/";
// the list of pages discovered during the crawling task
var pagesDiscovered = new List<string> { firstPageToScrape };
// the list of pages that remains to be scraped
var pagesToScrape = new Queue<string>();
// initializing the list with firstPageToScrape
pagesToScrape.Enqueue(firstPageToScrape);
// current crawling iteration
int i = 1;
// the maximum number of pages to scrape before stopping
int limit = 12;
// until there are no pages to scrape or limit is hit
while (pagesToScrape.Count != 0 && i <= limit)
{
// extracting the current page to scrape from the queue
var currentPage = pagesToScrape.Dequeue();
// loading the page
var currentDocument = web.Load(currentPage);
// selecting the list of pagination HTML elements
var paginationHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("a.page-numbers");
// to avoid visiting a page twice
foreach (var paginationHTMLElement in paginationHTMLElements)
{
// extracting the current pagination URL
var newPaginationLink = paginationHTMLElement.Attributes["href"].Value;
// if the page discovered is new
if (!pagesDiscovered.Contains(newPaginationLink))
{
// if the page discovered needs to be scraped
if (!pagesToScrape.Contains(newPaginationLink))
{
pagesToScrape.Enqueue(newPaginationLink);
}
pagesDiscovered.Add(newPaginationLink);
}
}
// scraping logic...
// incrementing the crawling counter
i++;
}
The data crawler above does the following:
- Starts from the first page of the pagination list.
- Looks for new pagination URLs on the current page.
- Adds them to the scraping queue.
- Scrapes data from the current page.
- Repeats the previous four steps for each page in the queue until the queue is empty or it has visited the limit number of pages.
Since ScrapingCourse.com consists of 12 pages, set limit to 12 to scrape data from all products. In this case, products.csv will have a record for each of the 188 products.
Here's the complete code:
using HtmlAgilityPack;
using System.Globalization;
using CsvHelper;
namespace SimpleWebScraper
{
public class Program
{
// defining a custom class to store
// the scraped data
public class Product
{
public string? Url { get; set; }
public string? Image { get; set; }
public string? Name { get; set; }
public string? Price { get; set; }
}
public static void Main()
{
// initializing HAP
var web = new HtmlWeb();
// creating the list that will keep the scraped data
var products = new List<Product>();
// the URL of the first pagination web page
var firstPageToScrape = "https://www.scrapingcourse.com/ecommerce/page/1/";
// the list of pages discovered during the crawling task
var pagesDiscovered = new List<string> { firstPageToScrape };
// the list of pages that remains to be scraped
var pagesToScrape = new Queue<string>();
// initializing the list with firstPageToScrape
pagesToScrape.Enqueue(firstPageToScrape);
// current crawling iteration
int i = 1;
// the maximum number of pages to scrape before stopping
int limit = 12;
// until there is a page to scrape or limit is hit
while (pagesToScrape.Count != 0 && i <= limit)
{
// getting the current page to scrape from the queue
var currentPage = pagesToScrape.Dequeue();
// loading the page
var currentDocument = web.Load(currentPage);
// selecting the list of pagination HTML elements
var paginationHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("a.page-numbers");
// to avoid visiting a page twice
foreach (var paginationHTMLElement in paginationHTMLElements)
{
// extracting the current pagination URL
var newPaginationLink = paginationHTMLElement.Attributes["href"].Value;
// if the page discovered is new
if (!pagesDiscovered.Contains(newPaginationLink))
{
// if the page discovered needs to be scraped
if (!pagesToScrape.Contains(newPaginationLink))
{
pagesToScrape.Enqueue(newPaginationLink);
}
pagesDiscovered.Add(newPaginationLink);
}
}
// getting the list of HTML product nodes
var productHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("li.product");
// iterating over the list of product HTML elements
foreach (var productHTMLElement in productHTMLElements)
{
// scraping logic
var url = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("a").Attributes["href"].Value);
var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("img").Attributes["src"].Value);
var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("h2").InnerText);
var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".price").InnerText);
var product = new Product() { Url = url, Image = image, Name = name, Price = price };
products.Add(product);
}
// incrementing the crawling counter
i++;
}
// opening the CSV stream writer
using (var writer = new StreamWriter("products.csv"))
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
{
// populating the CSV file
csv.WriteRecords(products);
}
}
}
}
Way to go! You're now able to build a web scraping C# app that can scrape a complete website!
Avoid Being Blocked
Your data scraper in C# may fail due to the several anti-scraping mechanisms websites might adopt. There are many anti-scraping techniques your script should be ready for. Use ZenRows to easily get around them!
The most basic technique is to block HTTP requests based on the value of their headers. This generally happens when the requests use an invalid User-Agent value.
The User-Agent header contains information that identifies where the request comes from. Typically, the accepted values refer to popular browsers and operating systems. Scraping libraries tend to use placeholder User-Agent strings that can easily expose your spider.
You can globally set a valid User-Agent in Html Agility Pack with the line below:
// setting a global User-Agent header in HAP
web.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36";
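If you need to customize more than the User-Agent, HtmlWeb also exposes a PreRequest callback that runs before every request it sends. Here's an optional sketch; the extra Accept-Language header is just an example:
// tweaking every outgoing request before HAP sends it
web.PreRequest = request =>
{
    // adding an extra header as an example
    request.Headers.Add("Accept-Language", "en-US,en;q=0.9");
    // returning true tells HAP to proceed with the request
    return true;
};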
The final code looks like this after adding the User Agent:
using HtmlAgilityPack;
using System.Globalization;
using CsvHelper;
namespace SimpleWebScraper
{
public class Program
{
// defining a custom class to store
// the scraped data
public class Product
{
public string? Url { get; set; }
public string? Image { get; set; }
public string? Name { get; set; }
public string? Price { get; set; }
}
public static void Main()
{
// initializing HAP
var web = new HtmlWeb();
// setting a global User-Agent header
web.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36";
// creating the list that will keep the scraped data
var products = new List<Product>();
// the URL of the first pagination web page
var firstPageToScrape = "https://www.scrapingcourse.com/ecommerce/page/1/";
// the list of pages discovered during the crawling task
var pagesDiscovered = new List<string> { firstPageToScrape };
// the list of pages that remains to be scraped
var pagesToScrape = new Queue<string>();
// initializing the list with firstPageToScrape
pagesToScrape.Enqueue(firstPageToScrape);
// current crawling iteration
int i = 1;
// the maximum number of pages to scrape before stopping
int limit = 12;
// until there is a page to scrape or limit is hit
while (pagesToScrape.Count != 0 && i <= limit)
{
// getting the current page to scrape from the queue
var currentPage = pagesToScrape.Dequeue();
// loading the page
var currentDocument = web.Load(currentPage);
// selecting the list of pagination HTML elements
var paginationHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("a.page-numbers");
// to avoid visiting a page twice
foreach (var paginationHTMLElement in paginationHTMLElements)
{
// extracting the current pagination URL
var newPaginationLink = paginationHTMLElement.Attributes["href"].Value;
// if the page discovered is new
if (!pagesDiscovered.Contains(newPaginationLink))
{
// if the page discovered needs to be scraped
if (!pagesToScrape.Contains(newPaginationLink))
{
pagesToScrape.Enqueue(newPaginationLink);
}
pagesDiscovered.Add(newPaginationLink);
}
}
// getting the list of HTML product nodes
var productHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("li.product");
// iterating over the list of product HTML elements
foreach (var productHTMLElement in productHTMLElements)
{
// scraping logic
var url = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("a").Attributes["href"].Value);
var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("img").Attributes["src"].Value);
var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("h2").InnerText);
var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".price").InnerText);
var product = new Product() { Url = url, Image = image, Name = name, Price = price };
products.Add(product);
}
// incrementing the crawling counter
i++;
}
// opening the CSV stream writer
using (var writer = new StreamWriter("products.csv"))
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
{
// populating the CSV file
csv.WriteRecords(products);
}
}
}
}
Wonderful! Less than 100 lines of code are enough to build a web scraper in C#! Now, all HTTP requests performed by HAP will seem to come from Chrome 124.
Parallel Web Scraping in C#
The performance of web scraping with C# depends on the target web server's speed. Tackle this by making parallel requests and scraping pages simultaneously. Avoid dead time and take the speed of your scraper to the next level: that's what parallel web scraping in C# is about!
Store the list of all pages your C# data crawler should visit in a ConcurrentBag:
var pagesToScrape = new ConcurrentBag<string> {
"https://www.scrapingcourse.com/ecommerce/page/1/",
"https://www.scrapingcourse.com/ecommerce/page/2/",
"https://www.scrapingcourse.com/ecommerce/page/3/",
// ...
"https://www.scrapingcourse.com/ecommerce/page/11/",
"https://www.scrapingcourse.com/ecommerce/page/12/"
};
In C#, List isn't thread-safe, so you shouldn't use it for parallel tasks. Replace it with ConcurrentBag, its unordered, thread-safe alternative.
For the same reason, make products a ConcurrentBag:
var products = new ConcurrentBag<Product>();
Let's perform parallel web scraping with C#! Use Parallel.ForEach() to run a foreach loop in parallel and scrape several pages at the same time:
// the import required to use ConcurrentBag
using System.Collections.Concurrent;
// ...
Parallel.ForEach(
pagesToScrape,
// limiting the parallelization level to 4 pages at a time
new ParallelOptions { MaxDegreeOfParallelism = 4 },
currentPage => {
// visiting the current page of the loop
var currentDocument = web.Load(currentPage);
// complete scraping logic
var productHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("li.product");
foreach (var productHTMLElement in productHTMLElements)
{
var url = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("a").Attributes["href"].Value);
var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("img").Attributes["src"].Value);
var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("h2").InnerText);
var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".price").InnerText);
var product = new Product() { Url = url, Image = image, Name = name, Price = price };
// storing the scraped product data in parallel
products.Add(product);
}
}
);
Great! Your web scraper in C# is now lightning-fast! But don't forget to limit the level of parallelization to avoid stressing the server. Your goal is to extract data from a website, not to perform a DoS attack.
The snippet above shows how to achieve parallel scraping in C#. Take a look at the entire parallel C# data spider below:
using HtmlAgilityPack;
using CsvHelper;
using System.Globalization;
using System.Collections.Concurrent;
namespace SimpleWebScraper
{
public class Program
{
public class Product
{
public string? Url { get; set; }
public string? Image { get; set; }
public string? Name { get; set; }
public string? Price { get; set; }
}
public static void Main()
{
// initializing HAP
var web = new HtmlWeb();
// this can't be a List because it's not thread-safe
var products = new ConcurrentBag<Product>();
// the complete list of pages to scrape
var pagesToScrape = new ConcurrentBag<string> {
"https://www.scrapingcourse.com/ecommerce/page/1/",
"https://www.scrapingcourse.com/ecommerce/page/2/",
// ...
"https://www.scrapingcourse.com/ecommerce/page/12/"
};
// performing parallel web scraping
Parallel.ForEach(
pagesToScrape,
new ParallelOptions { MaxDegreeOfParallelism = 4 },
currentPage =>
{
var currentDocument = web.Load(currentPage);
var productHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("li.product");
foreach (var productHTMLElement in productHTMLElements)
{
var url = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("a").Attributes["href"].Value);
var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("img").Attributes["src"].Value);
var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector("h2").InnerText);
var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".price").InnerText);
var product = new Product() { Url = url, Image = image, Name = name, Price = price };
products.Add(product);
}
}
);
// exporting to CSV
using (var writer = new StreamWriter("products.csv"))
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
{
csv.WriteRecords(products);
}
}
}
}
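One caveat: HtmlWeb isn't documented as thread-safe, as it keeps per-request state (such as the last status code) on the instance. If you run into odd results, a defensive option is to create a local HtmlWeb inside each parallel iteration. A sketch of that variant:
Parallel.ForEach(
    pagesToScrape,
    new ParallelOptions { MaxDegreeOfParallelism = 4 },
    currentPage =>
    {
        // one HtmlWeb instance per iteration avoids sharing request state across threads
        var localWeb = new HtmlWeb();
        var currentDocument = localWeb.Load(currentPage);
        // ...same scraping logic as above
    }
);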
Scraping a Dynamic-Content Website with a Headless Browser in C#
Static-content sites have all their content embedded in the HTML pages returned by the server. This makes them an easy scraping target for any HTML parsing library.
Dynamic-content websites rely on JavaScript to dynamically retrieve or render all or part of their content. Scraping them requires a tool that can execute JavaScript, like a headless browser. If you're not familiar with the term, a headless browser is a programmable browser with no GUI.
With more than 65 million downloads, the most used headless browser library for C# is Selenium. Install it via the Selenium.WebDriver NuGet package:
dotnet add package Selenium.WebDriver
Use Selenium in headless mode to scrape data from ScrapingCourse.com with the following logic:
using CsvHelper;
using System.Globalization;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
namespace SimpleWebScraper
{
public class Program
{
public class Product
{
public string? Url { get; set; }
public string? Image { get; set; }
public string? Name { get; set; }
public string? Price { get; set; }
}
public static void Main()
{
var products = new List<Product>();
// to open Chrome in headless mode
var chromeOptions = new ChromeOptions();
chromeOptions.AddArguments("headless");
// starting a Selenium instance
using (var driver = new ChromeDriver(chromeOptions))
{
// navigating to the target page in the browser
driver.Navigate().GoToUrl("https://www.scrapingcourse.com/ecommerce/");
// getting the HTML product elements
var productHTMLElements = driver.FindElements(By.CssSelector("li.product"));
// iterating over them to scrape the data of interest
foreach (var productHTMLElement in productHTMLElements)
{
// scraping logic
var url = productHTMLElement.FindElement(By.CssSelector("a")).GetAttribute("href");
var image = productHTMLElement.FindElement(By.CssSelector("img")).GetAttribute("src");
var name = productHTMLElement.FindElement(By.CssSelector("h2")).Text;
var price = productHTMLElement.FindElement(By.CssSelector(".price")).Text;
var product = new Product() { Url = url, Image = image, Name = name, Price = price };
products.Add(product);
}
}
// export logic
using (var writer = new StreamWriter("products.csv"))
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
{
csv.WriteRecords(products);
}
}
}
}
The Selenium FindElements() method instructs the browser to look for HTML nodes. Thanks to it, you can select the product HTML elements via a CSS selector. Then, iterate over them in a foreach loop, applying GetAttribute() and Text to extract the data of interest.
Scraping a website in C# with HAP or Selenium is about the same, code-wise. The difference is in how they run the scraping logic: HAP parses HTML pages to extract data from them, while Selenium executes the scraping statements in a headless browser.
Thanks to Selenium, you can crawl dynamic-content websites and interact with web pages in a browser as a real user would. This also means your script is less likely to be detected as a bot, since Selenium makes it easier to scrape a web page without getting blocked.
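Keep in mind that JavaScript-rendered elements may not be in the DOM right after navigation. A simple option is an implicit wait, which tells Selenium to retry element lookups for a while before giving up. A minimal sketch (the 10-second timeout is arbitrary), to be placed right after creating the ChromeDriver:
// making every FindElement()/FindElements() call retry for up to 10 seconds
driver.Manage().Timeouts().ImplicitWait = TimeSpan.FromSeconds(10);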
Html Agility Pack doesn't come with browser functionality, so you can only use it to scrape static-content websites. On the other hand, it doesn't carry the resource overhead of running a browser, as Selenium does.
Other Web Scraping Libraries in C#
Other tools to consider when it comes to web scraping with C# are:
- ZenRows: A fully-featured, easy-to-use API for extracting data from web pages. ZenRows automatically bypasses any anti-bot or anti-scraping system. Plus, it comes with rotating proxies, headless browser functionality, and a 99% uptime guarantee.
- Puppeteer Sharp: The .NET port of the popular Puppeteer Node.js library. With it, you can instruct a headless Chromium browser to perform testing and scraping.
- AngleSharp: An open-source .NET library for parsing and manipulating XML and HTML. It allows you to extract data from a website and select HTML elements via CSS selectors.
This was a short reminder that there are other useful tools for data scraping with C#. Read our guide on the best C# web scraping libraries.
Conclusion
Our step-by-step tutorial covered everything you need to know about web scraping in C#. First, we learned the basics and then tackled the most advanced C# web scraping concepts.
As a recap, you now know:
- How to do basic web scraping in C# with Html Agility Pack.
- How to scrape an entire website through web crawling.
- When you need to use a C# headless browser solution.
- How to extract data from dynamic-content websites with Selenium.
Web data scraping in C# is challenging due to the many anti-scraping technologies websites now use. Bypassing them all isn't easy, and you always need to find a workaround. Avoid all this with a complete C# web scraping API, like ZenRows. Thanks to it, you can perform data scraping via API calls and forget about anti-bot protections.
Frequent Questions
How Do You Scrape Data From a Website in C#?
Scraping data from the web in C# works as in other programming languages: with a C# web scraping library, you can connect to the desired website, select HTML elements from its DOM, and retrieve data.
Is C# Good for Web Scraping?
Yes, it is! C# is a general-purpose programming language that enables you to do web scraping. C# has a large and active community that developed many libraries to help you achieve your scraping goals.
What Is the Best Way to Scrape With C#?
Using one of the many NuGet libraries for scraping in C# makes everything easier. Some of the most popular C# libraries to support your data crawling project are Selenium, ScrapySharp, and Html Agility Pack.