Web Crawler in C#: Step-by-Step Tutorial [2024]

Yuvraj Chandra
November 21, 2024 · 9 min read

Web crawling is a powerful technique for automatically discovering and visiting web pages. Building a C# web crawler allows you to systematically explore websites to gather data at scale.

In this tutorial, you'll learn how to build a web crawler in C# from scratch. We'll cover everything from setting up your development environment to implementing advanced features like parallel crawling, handling JavaScript-rendered pages, and more.

By the end, you'll have a robust C# crawler capable of navigating websites, following links, and extracting data—all while adhering to best practices to avoid getting blocked.

What Is Web Crawling?

Web crawling is a technique for automatically discovering and navigating through web pages by following links. Unlike web scraping, which focuses on extracting specific data from web pages, crawling is about finding and visiting pages systematically.

Think of web crawling as drawing a map of a website's structure, while web scraping is about collecting specific information from each location on that map.

In practice, crawling and scraping often work together as complementary techniques. A crawler discovers and visits pages, maintaining a list of URLs to explore, while a scraper extracts the desired data from each discovered page.

Let's put this knowledge into practice by building a C# web crawler step by step. You'll create a crawler that can discover pages, manage a queue of URLs to visit, and extract data along the way.

Build Your First C# Web Crawler

You'll use the ScrapingCourse e-commerce demo site, which features a paginated list of products, as the target website.

ScrapingCourse.com Ecommerce homepage

You'll build a crawler to visit every page of the product catalog and retrieve the name, price, and image URL from each product. Since the products are spread across multiple pages, this is a perfect example to demonstrate web crawling in action.

If you're new to web data extraction in C#, you might want to first check out our guide on web scraping with C# to understand the basics.

Let's dive in and build our crawler step by step!

Step 1: Prerequisites for Building a C# Web Crawler

Before getting started, you'll need:

  • The .NET 8.0 SDK (or later) installed on your machine.
  • A code editor or IDE, such as Visual Studio or Visual Studio Code.

Let's initialize a C# project and install the necessary libraries. Create a folder called web-crawler, navigate into it, and run the following command in your terminal:

Terminal
dotnet new console --framework net8.0

This will initialize a .NET 8.0 console project. For web crawling, you'll need Html Agility Pack to perform HTTP requests and parse HTML. Install it via NuGet:

Terminal
dotnet add package HtmlAgilityPack

Next, install the CSS Selector extension for Html Agility Pack to make selecting HTML elements easier:

Terminal
dotnet add package HtmlAgilityPack.CssSelectors

Great! Your development environment is now ready for web crawling in C#.

Step 2: Follow All the Links on a Website

Let's start by building a basic crawler to visit our target website and discover links. In the Program.cs file, import Html Agility Pack and load the target page:

Example
using HtmlAgilityPack;

var web = new HtmlWeb();
var document = web.Load("https://www.scrapingcourse.com/ecommerce/");
Console.WriteLine("Page loaded successfully!");

Run the script with the command below:

Terminal
dotnet run

You'll see the following output, which confirms our basic setup is working and we can start building out the crawler functionality:

Output
Page loaded successfully!

Now, let's enhance the crawler to discover and follow links. You'll need to maintain two lists: one for pages you've discovered and another for pages you still need to visit.

Use a List to store discovered URLs (to prevent processing the same page multiple times) and a Queue for pages to visit (ensuring pages are processed in the order we find them). Initialize both data structures with the first pagination page URL to begin the crawling process.

Example
using HtmlAgilityPack;

// initializing HAP
var web = new HtmlWeb();

// the URL of the first pagination web page
var firstPageToScrape = "https://www.scrapingcourse.com/ecommerce/page/1/";

// the list of pages discovered during the crawling task
var pagesDiscovered = new List<string> { firstPageToScrape };

// the list of pages that remain to be scraped
var pagesToScrape = new Queue<string>();

// initializing the list with firstPageToScrape
pagesToScrape.Enqueue(firstPageToScrape);

The core of our crawler needs to visit pages systematically while avoiding infinite loops and excessive server requests. To achieve this, we'll cap the crawl with an iteration counter and a maximum page limit.

To inspect the pagination HTML elements, right-click on one of the page numbers and select "Inspect":

scrapingcourse ecommerce homepage inspect

You'll see the following DevTools screen:

scrapingcourse ecommerce homepage devtools

There are a total of 12 pages, and all the pagination link elements share the same page-numbers CSS class. This means you can select them all with the a.page-numbers CSS selector.

Since we know the target website has exactly 12 pages of products, setting the page limit to 12 ensures we cover every product page while avoiding unnecessary crawling of other site sections.

Ensure the crawler continues running until either there are no more pages to visit or you've reached the page limit. For each page, we'll extract pagination links using Html Agility Pack's CSS selector functionality, add any new URLs to the queue, and keep track of all discovered URLs to avoid duplicates.

Here's the complete code:

Example
using HtmlAgilityPack;

// initializing HAP
var web = new HtmlWeb();

// the URL of the first pagination web page
var firstPageToScrape = "https://www.scrapingcourse.com/ecommerce/page/1/";

// the list of pages discovered during the crawling task
var pagesDiscovered = new List<string> { firstPageToScrape };

// the list of pages that remain to be scraped
var pagesToScrape = new Queue<string>();

// initializing the list with firstPageToScrape
pagesToScrape.Enqueue(firstPageToScrape);

// current crawling iteration
int i = 0;

// the maximum number of pages to scrape before stopping
int limit = 12;

// until there is a page to scrape or limit is hit
while (pagesToScrape.Count != 0 && i < limit)
{
    // getting the current page to scrape from the queue
    var currentPage = pagesToScrape.Dequeue();

    // loading the page
    var currentDocument = web.Load(currentPage);

    // selecting the list of pagination HTML elements
    var paginationHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("a.page-numbers");

    // to avoid visiting a page twice
    foreach (var paginationHTMLElement in paginationHTMLElements)
    {
        // extracting the current pagination URL
        var newPaginationLink = paginationHTMLElement.Attributes["href"].Value;

        // if the page discovered is new
        if (!pagesDiscovered.Contains(newPaginationLink))
        {
            // if the page discovered needs to be scraped
            if (!pagesToScrape.Contains(newPaginationLink))
            {
                pagesToScrape.Enqueue(newPaginationLink);
            }
            pagesDiscovered.Add(newPaginationLink);
        }
    }

    Console.WriteLine($"Processed page: {currentPage}");

    // incrementing the crawling counter
    i++;
}

You'll get the following output:

Output
Processed page: https://www.scrapingcourse.com/ecommerce/page/1/
# ...
Processed page: https://www.scrapingcourse.com/ecommerce/page/10/
# ...
Processed page: https://www.scrapingcourse.com/ecommerce/page/12/
# ...
Processed page: https://www.scrapingcourse.com/ecommerce/page/9/

Step 3: Extract Data From HTML Elements

Now that our crawler can navigate through the website's pages, let's extract product data from each page we visit. To do this, we first need to analyze the HTML structure of the product elements on the page.

By inspecting the page source, you can see that each product is contained in a li element with the class product. Within that element:

  • The product name is inside an h2 element with the class product-name.
  • The price is found in a span with the class product-price.
  • The product image is located in an img element with the product-image class.

scrapingcourse ecommerce homepage inspect first product li

To organize the extracted data, we'll create a custom C# class. In C#, classes are typically wrapped in a namespace, which helps organize code and avoid naming conflicts. We'll name our namespace SimpleWebScraper. 

Within this namespace, define the Product class to store each product's information in a structured way. Here's how you can do it:

Example
using HtmlAgilityPack;

namespace SimpleWebScraper
{
    public class Program
    {
        public class Product
        {
            public string? Name { get; set; }
            public string? Price { get; set; }
            public string? Image { get; set; }
        }
        // ...
    }
}

Now, create a list to store all the products we find and add the scraping logic inside the crawling loop. On each page, we'll use CSS selectors to find all product elements and extract the name, price, and image URL from each one.

Use the HtmlEntity.DeEntitize() method to convert HTML entities (like &amp;) into their proper characters (like &). This ensures the extracted text is clean and readable.

Example
using HtmlAgilityPack;

namespace SimpleWebScraper
{
   public class Program
   {
       // defining a custom class to store 
       // the scraped data
       public class Product
       {
           public string? Name { get; set; }
           public string? Price { get; set; }
           public string? Image { get; set; }
       }
       
       public static void Main()
       {
           // initializing HAP
           var web = new HtmlWeb();

           // creating the list that will keep the scraped data
           var products = new List<Product>();
           // the URL of the first pagination web page
           var firstPageToScrape = "https://www.scrapingcourse.com/ecommerce/page/1/";
           // the list of pages discovered during the crawling task
           var pagesDiscovered = new List<string> { firstPageToScrape };
           // the list of pages that remain to be scraped
           var pagesToScrape = new Queue<string>();
           // initializing the list with firstPageToScrape
           pagesToScrape.Enqueue(firstPageToScrape);
           
           // current crawling iteration
           int i = 0;
           // the maximum number of pages to scrape before stopping
           int limit = 12;
           
           // until there is a page to scrape or limit is hit
           while (pagesToScrape.Count != 0 && i < limit)
           {
               // getting the current page to scrape from the queue
               var currentPage = pagesToScrape.Dequeue();
               // loading the page
               var currentDocument = web.Load(currentPage);
               
               // pagination logic remains the same...
               
               // getting the list of HTML product nodes
               var productHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("li.product");
               // iterating over the list of product HTML elements
               foreach (var productHTMLElement in productHTMLElements)
               {
                   // scraping logic
                   var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".product-name").InnerText);
                   var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".product-price").InnerText);
                   var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".product-image").Attributes["src"].Value);
                   
                   var product = new Product() { Name = name, Price = price, Image = image };
                   products.Add(product);
                   
                   Console.WriteLine($"Found product: {name}");
               }
               
               // incrementing the crawling counter
               i++;
           }
           
           Console.WriteLine($"\nTotal products found: {products.Count}");
       }
   }
}

Here's the complete code putting everything together:

Example
using HtmlAgilityPack;

namespace SimpleWebScraper
{
   public class Program
   {
       // defining a custom class to store 
       // the scraped data
       public class Product
       {
           public string? Name { get; set; }
           public string? Price { get; set; }
           public string? Image { get; set; }
       }
       
       public static void Main()
       {
           // initializing HAP
           var web = new HtmlWeb();

           // creating the list that will keep the scraped data
           var products = new List<Product>();
           // the URL of the first pagination web page
           var firstPageToScrape = "https://www.scrapingcourse.com/ecommerce/page/1/";
           // the list of pages discovered during the crawling task
           var pagesDiscovered = new List<string> { firstPageToScrape };
           // the list of pages that remain to be scraped
           var pagesToScrape = new Queue<string>();
           // initializing the list with firstPageToScrape
           pagesToScrape.Enqueue(firstPageToScrape);
           
           // current crawling iteration
           int i = 0;
           // the maximum number of pages to scrape before stopping
           int limit = 12;
           
           // until there is a page to scrape or limit is hit
           while (pagesToScrape.Count != 0 && i < limit)
           {
               // getting the current page to scrape from the queue
               var currentPage = pagesToScrape.Dequeue();
               // loading the page
               var currentDocument = web.Load(currentPage);
               
               // selecting the list of pagination HTML elements
               var paginationHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("a.page-numbers");
               
               // to avoid visiting a page twice
               foreach (var paginationHTMLElement in paginationHTMLElements)
               {
                   // extracting the current pagination URL
                   var newPaginationLink = paginationHTMLElement.Attributes["href"].Value;
                   
                   // if the page discovered is new
                   if (!pagesDiscovered.Contains(newPaginationLink))
                   {
                       // if the page discovered needs to be scraped
                       if (!pagesToScrape.Contains(newPaginationLink))
                       {
                           pagesToScrape.Enqueue(newPaginationLink);
                       }
                       pagesDiscovered.Add(newPaginationLink);
                   }
               }
               
               // getting the list of HTML product nodes
               var productHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("li.product");
               // iterating over the list of product HTML elements
               foreach (var productHTMLElement in productHTMLElements)
               {
                   // scraping logic
                   var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".product-name").InnerText);
                   var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".product-price").InnerText);
                   var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".product-image").Attributes["src"].Value);
                   
                   var product = new Product() { Name = name, Price = price, Image = image };
                   products.Add(product);
                   
                   Console.WriteLine($"Found product: {name}");
               }
               
               // incrementing the crawling counter
               i++;
           }
           
           Console.WriteLine($"\nTotal products found: {products.Count}");
       }
   }
}

You'll get the following output on running this script:

Output
Found product: Abominable Hoodie
Found product: Adrienne Trek Jacket
Found product: Aeon Capri
# ... 

Total products found: 188

Congratulations! Your crawler is now not just finding pages, but also extracting and organizing product data from each page it visits.

It navigates through all pagination pages, finds all product elements on each page, extracts each product's name, price, and image URL, and finally stores the data in the structured Product objects.

The next step will be to save this data into a CSV file.

Step 4: Export the Scraped Data to CSV

Now that we're successfully crawling pages and extracting product data, we need to store this information in a format that's easy to work with and analyze. CSV (Comma-Separated Values) is a popular choice as it can be opened in spreadsheet software like Excel or imported into other programs for further processing.

To work with CSV files in C#, we'll use CsvHelper, a fast and flexible library for reading and writing CSV files. Install it via NuGet with the following command:

Terminal
dotnet add package CsvHelper

Add the required imports to the top of your file and modify the end of the crawler to export the data:

Example
// ...
using CsvHelper;
using System.Globalization;

// ... rest of the crawler code inside Main()

// opening the CSV stream writer
using (var writer = new StreamWriter("products.csv"))
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
{
    // populating the CSV file
    csv.WriteRecords(products);
}
// ...

The code uses C#'s using statement to manage resources, and CultureInfo.InvariantCulture ensures consistent formatting regardless of system settings. Here's the complete code that combines crawling, data extraction, and CSV export:

Example
using HtmlAgilityPack;
using System.Globalization;
using CsvHelper;

namespace SimpleWebScraper
{
    public class Program
    {
        // defining a custom class to store 
        // the scraped data
        public class Product
        {
            public string? Name { get; set; }
            public string? Price { get; set; }
            public string? Image { get; set; }
        }

        public static void Main()
        {
            // initializing HAP
            var web = new HtmlWeb();

            // creating the list that will keep the scraped data
            var products = new List<Product>();
            // the URL of the first pagination web page
            var firstPageToScrape = "https://www.scrapingcourse.com/ecommerce/page/1/";
            // the list of pages discovered during the crawling task
            var pagesDiscovered = new List<string> { firstPageToScrape };
            // the list of pages that remain to be scraped
            var pagesToScrape = new Queue<string>();
            // initializing the list with firstPageToScrape
            pagesToScrape.Enqueue(firstPageToScrape);

            // current crawling iteration
            int i = 0;
            // the maximum number of pages to scrape before stopping
            int limit = 12;

            // until there is a page to scrape or limit is hit
            while (pagesToScrape.Count != 0 && i < limit)
            {
                // getting the current page to scrape from the queue
                var currentPage = pagesToScrape.Dequeue();
                // loading the page
                var currentDocument = web.Load(currentPage);

                // selecting the list of pagination HTML elements
                var paginationHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("a.page-numbers");

                // to avoid visiting a page twice
                foreach (var paginationHTMLElement in paginationHTMLElements)
                {
                    // extracting the current pagination URL
                    var newPaginationLink = paginationHTMLElement.Attributes["href"].Value;

                    // if the page discovered is new
                    if (!pagesDiscovered.Contains(newPaginationLink))
                    {
                        // if the page discovered needs to be scraped
                        if (!pagesToScrape.Contains(newPaginationLink))
                        {
                            pagesToScrape.Enqueue(newPaginationLink);
                        }
                        pagesDiscovered.Add(newPaginationLink);
                    }
                }

                // getting the list of HTML product nodes
                var productHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("li.product");
                // iterating over the list of product HTML elements
                foreach (var productHTMLElement in productHTMLElements)
                {
                    // scraping logic
                    var name = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".product-name").InnerText);
                    var price = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".product-price").InnerText);
                    var image = HtmlEntity.DeEntitize(productHTMLElement.QuerySelector(".product-image").Attributes["src"].Value);

                    var product = new Product() { Name = name, Price = price, Image = image };
                    products.Add(product);

                    Console.WriteLine($"Found product: {name}");
                }

                // incrementing the crawling counter
                i++;
            }

            // opening the CSV stream writer
            using (var writer = new StreamWriter("products.csv"))
            using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
            {
                // populating the CSV file
                csv.WriteRecords(products);
            }

            Console.WriteLine($"\nTotal products found: {products.Count}");
            Console.WriteLine("Data exported to products.csv");
        }
    }
}

When you run the script, it will create a products.csv file in your project directory. You'll see something like this after opening it in a spreadsheet application:

Extracted Data in CSV File

Congratulations! You've built a complete web crawler in C# that can navigate through pages, extract product information, and save the data in a structured format.

This is a significant achievement and forms the foundation for more advanced web crawling projects. You can now modify this crawler to target other websites or extract different data types based on your needs.

Optimize Your C# Web Crawler

Let's explore some key optimizations that can make your C# web crawler more efficient and robust. These improvements will help you handle larger websites and more complex crawling scenarios.

Avoid Duplicate Links

Duplicate links are a common challenge in web crawling because websites often have multiple paths leading to the same page, and the same links frequently appear across different pages.

Our crawler already handles this using two data structures: pagesDiscovered to track all found URLs and pagesToScrape to manage unvisited pages.

Here's the relevant code that prevents duplicate crawling:

Example
if (!pagesDiscovered.Contains(newPaginationLink))
{
    if (!pagesToScrape.Contains(newPaginationLink))
    {
        pagesToScrape.Enqueue(newPaginationLink);
    }
    pagesDiscovered.Add(newPaginationLink);
}
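
This works well for a small site, but List<T>.Contains() performs a linear scan, so lookups slow down as the number of discovered URLs grows. For larger crawls, you can swap the list for a HashSet<string>, whose Add() method doubles as the duplicate check. Here's a minimal sketch reusing the variable names from the crawler above:

Example
// a HashSet gives constant-time duplicate checks as the crawl grows
var pagesDiscovered = new HashSet<string> { firstPageToScrape };
var pagesToScrape = new Queue<string>();
pagesToScrape.Enqueue(firstPageToScrape);

// ...inside the crawling loop:
// Add() returns false if the URL was already discovered,
// so each new link is enqueued exactly once
if (pagesDiscovered.Add(newPaginationLink))
{
    pagesToScrape.Enqueue(newPaginationLink);
}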

Prioritize Specific Pages

Not all pages have equal importance when crawling a website. For example, on an e-commerce site, product listing pages are generally more valuable than FAQ or contact pages.

Our crawler already prioritizes product listing pages by targeting only the pagination links (the a.page-numbers elements), whose URLs contain "page/" in their path, ensuring we collect all product data before exploring other site sections.
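
If you want to make that prioritization explicit, one simple approach is to route newly discovered URLs into separate queues and drain the high-priority one first. Here's a minimal sketch, assuming the newPaginationLink variable from the crawler loop (the lowPriorityPages queue is illustrative):

Example
// keep a separate queue for less important pages
var lowPriorityPages = new Queue<string>();

// inside the crawling loop: product listing pages go to the main queue,
// everything else waits in the low-priority queue
if (newPaginationLink.Contains("/page/"))
{
    pagesToScrape.Enqueue(newPaginationLink);
}
else
{
    lowPriorityPages.Enqueue(newPaginationLink);
}

// after the main loop, drain the low-priority queue if the page limit allows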

Include Subdomains in Your Crawl

Currently, our crawler only handles URLs from the main domain (scrapingcourse.com). However, many websites spread their content across subdomains, like blog.example.com or shop.example.com. For an e-commerce site, you might want to crawl both the main product pages and the blog subdomain for product reviews and articles.

To include subdomains, you can validate each discovered URL against a regular expression that matches the main domain and any of its subdomains:

Example
using System.Text.RegularExpressions;

// match scrapingcourse.com and any of its subdomains
var domainPattern = @"^https?:\/\/([\w\.-]+\.)?scrapingcourse\.com";
if (Regex.IsMatch(newPaginationLink, domainPattern))
{
    if (!pagesDiscovered.Contains(newPaginationLink))
    {
        // ... rest of the crawling logic
    }
}

Maintain a Single Crawl Session

Maintaining a single session across requests can improve performance and help avoid detection by making your crawler appear more like a regular user. You can modify the HtmlWeb instance to maintain cookies and headers consistently:

Example
using System.Net;

var web = new HtmlWeb();

// a single cookie container shared across all requests
var cookieContainer = new CookieContainer();

web.PreRequest = (request) =>
{
    // reuse the same cookies and User-Agent for every request
    request.CookieContainer = cookieContainer;
    request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";
    return true;
};

All these optimizations make the crawler more efficient and systematic, but they may not be effective if the website blocks the requests. Many sites employ sophisticated anti-bot measures that can detect and block automated crawling attempts, even with proper session management.

These systems track patterns in your requests, look for suspicious headers, use JavaScript challenges to verify human users, and employ additional detection measures. In the next section, we'll explore how to handle these anti-bot measures and keep our crawler running smoothly.

Avoid Getting Blocked While Crawling With C#

Web crawling is particularly vulnerable to blocking because it involves making many requests in a pattern that's easily distinguishable from human behavior. Unlike single-page scraping, crawlers follow links systematically and rapidly, making them more likely to trigger anti-bot defenses.

There are several basic approaches to avoid blocks: rotating your IP addresses through proxies, changing request headers to mimic different browsers, and adding delays between requests. You can implement these in your C# crawler by adding proxy support, randomizing User Agents, and using Thread.Sleep() with random intervals.
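
As a rough illustration, here's a minimal sketch of randomized delays and User-Agent rotation with Html Agility Pack (the User-Agent strings are just examples):

Example
using HtmlAgilityPack;

var random = new Random();

// a small pool of User-Agent strings to rotate through (examples only)
var userAgents = new[]
{
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
};

var web = new HtmlWeb();
web.PreRequest = (request) =>
{
    // pick a random User-Agent for each request
    request.UserAgent = userAgents[random.Next(userAgents.Length)];
    return true;
};

// inside the crawling loop: pause 1-3 seconds between requests
Thread.Sleep(random.Next(1000, 3000));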

However, these solutions can quickly become complex to manage and maintain, especially when dealing with sophisticated anti-bot systems.

This is where ZenRows comes in, providing a robust solution for scalable web crawling. Instead of building and maintaining your own anti-blocking infrastructure, you can leverage ZenRows' web scraping API to handle all the complexity.

ZenRows simplifies web crawling with automatic anti-bot bypass, premium rotating proxies, JavaScript rendering, actual user spoofing, request header management, and more.

Let's use ZenRows to access the Anti-bot Challenge page, a web page protected by heavy anti-bot measures.

First, sign up for a free account and grab your API key. Once you're in the dashboard, you'll see the Request Builder:

building a scraper with zenrows

In the left sidebar, paste the target URL (https://www.scrapingcourse.com/antibot-challenge) and enable Premium Proxies and JS Rendering.

On the right sidebar, select C# as your language and API as the connection mode. The builder will generate the following code:

Example
using RestSharp;

namespace TestApplication
{
    class Test
    {
        static void Main(string[] args)
        {
            var client = new RestClient("https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.scrapingcourse.com%2Fantibot-challenge&js_render=true&premium_proxy=true");
            var request = new RestRequest();

            var response = client.Get(request);
            Console.WriteLine(response.Content);
        }
    }
}

Install the RestSharp HTTP client library:

Terminal
dotnet add package RestSharp

Next, run the script. It'll print the source HTML of the target anti-bot-protected web page:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>

Success! You've just bypassed anti-bot protection that would typically block standard crawling attempts. You can combine this with your existing crawling logic to create a robust, scalable crawler that can handle even the most protected websites.
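
As a sketch of how that combination might look, you could fetch each page through the ZenRows API and feed the returned HTML into the Html Agility Pack parsing logic you already have (the parameters mirror the ones generated by the Request Builder):

Example
using HtmlAgilityPack;
using RestSharp;

// fetch the page through the ZenRows API instead of loading it directly
var apiKey = "<YOUR_ZENROWS_API_KEY>";
var targetUrl = Uri.EscapeDataString("https://www.scrapingcourse.com/ecommerce/page/1/");
var client = new RestClient($"https://api.zenrows.com/v1/?apikey={apiKey}&url={targetUrl}&js_render=true&premium_proxy=true");
var response = client.Get(new RestRequest());

// hand the returned HTML to Html Agility Pack and reuse the existing scraping logic
var currentDocument = new HtmlDocument();
currentDocument.LoadHtml(response.Content);
var productHTMLElements = currentDocument.DocumentNode.QuerySelectorAll("li.product");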

Web Crawling Tools for C#

There are several powerful web crawling tools available for C# developers that can simplify the process of discovering and processing web pages. Each tool offers different features and capabilities suited for various crawling scenarios.

  1. ZenRows: A complete scraping and crawling API that handles the complex challenges of web crawling out of the box, including anti-bot bypassing, proxy rotation, JavaScript rendering, and more. It's particularly valuable for protected websites or large-scale scraping or crawling operations.
  2. PuppeteerSharp: A .NET port of the popular Puppeteer library, offering a high-level API to control Chrome/Chromium over the DevTools Protocol. It's particularly useful when you need programmatic control over a browser while crawling complex web applications.
  3. Selenium: A powerful browser automation tool that can handle JavaScript-rendered content and complex user interactions. Its ability to control a real browser makes it excellent for crawling dynamic websites that require JavaScript execution or user interaction.
  4. Abot: An open-source C# web crawler framework built specifically for .NET. It offers features like crawl rate limiting, domain whitelisting/blacklisting, and respect for robots.txt rules, making it a reliable choice for basic crawling needs.

C# Crawling Best Practices and Considerations

Let's explore key considerations and best practices that can significantly improve your C# web crawler's performance and capabilities.

Parallel Crawling and Concurrency

When crawling websites, your script spends most of its time waiting for HTTP responses. By implementing parallel crawling, you can process multiple pages simultaneously, dramatically improving efficiency. C#'s async/await features and thread management make it easy to implement concurrent crawling while keeping resource usage under control.

Here's how you can modify the crawler to process pages in parallel:

Example
using System.Collections.Concurrent;
using System.Threading.Tasks;

// replace List with thread-safe collection
var products = new ConcurrentBag<Product>();
var pagesToScrape = new ConcurrentQueue<string>();
pagesToScrape.Enqueue(firstPageToScrape);

// process multiple pages concurrently
var tasks = new List<Task>();
int maxConcurrency = 4;

// keep looping while pages are queued or tasks are still running
while (pagesToScrape.Count > 0 || tasks.Count > 0)
{
    while (tasks.Count < maxConcurrency && pagesToScrape.TryDequeue(out string currentPage))
    {
        // capture a per-iteration copy of the URL for the closure
        var pageUrl = currentPage;
        tasks.Add(Task.Run(async () =>
        {
            // use a separate HtmlWeb per task to avoid sharing one instance across threads
            var taskWeb = new HtmlWeb();
            var document = await taskWeb.LoadFromWebAsync(pageUrl);
            // existing scraping logic here...
        }));
    }
    
    // wait for at least one task to complete before continuing
    await Task.WhenAny(tasks);
    tasks.RemoveAll(t => t.IsCompleted);
}

// wait for remaining tasks
await Task.WhenAll(tasks);

For a deeper understanding of concurrency and how to use it effectively in web scraping, check out our guide on concurrency in C#.

Crawling JavaScript Rendered Pages in C#

Our current crawler, built on Html Agility Pack, only processes static HTML, which makes it unsuitable for modern websites that rely on JavaScript to load content dynamically. Many e-commerce sites, for example, use JavaScript to load product data, render pricing, or handle pagination.

To properly crawl JavaScript-rendered pages, you need a headless browser solution that can execute JavaScript and wait for dynamic content to load. Popular options include Selenium, PuppeteerSharp, and Playwright. Each offers different approaches to handling JavaScript rendering.
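
For example, a minimal PuppeteerSharp sketch (after running dotnet add package PuppeteerSharp) could render a page and hand the resulting HTML back to Html Agility Pack:

Example
using HtmlAgilityPack;
using PuppeteerSharp;

// download a compatible Chromium build on first run
await new BrowserFetcher().DownloadAsync();

// launch a headless browser and render the page, executing its JavaScript
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
await using var page = await browser.NewPageAsync();
await page.GoToAsync("https://www.scrapingcourse.com/ecommerce/");

// grab the fully rendered HTML and parse it with Html Agility Pack
var html = await page.GetContentAsync();
var document = new HtmlDocument();
document.LoadHtml(html);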

Learn more about implementing these solutions in our guides on C# headless browsers, PuppeteerSharp, and Selenium with C#. 

Distributed Web Crawling in C#

As your crawling needs grow, you may need to scale beyond what a single machine can handle efficiently. Distributed crawling involves spreading the crawling workload across multiple machines or processes. It enables you to handle larger websites and reduce crawling time. This becomes particularly important when dealing with rate limits, large datasets, or time-sensitive data collection needs.
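
A full distributed setup usually relies on shared infrastructure, such as a message queue or database, for the URL frontier, but one common building block is partitioning the URL space so each worker handles a disjoint slice. Here's a minimal sketch of hash-based partitioning, reusing the newPaginationLink and pagesToScrape variables from the crawler (the worker IDs are illustrative):

Example
using System.Security.Cryptography;
using System.Text;

// each worker only crawls URLs whose hash falls into its own bucket,
// so several machines can split the URL space without coordinating
bool BelongsToWorker(string url, int workerId, int totalWorkers)
{
    var hash = SHA256.HashData(Encoding.UTF8.GetBytes(url));
    return hash[0] % totalWorkers == workerId;
}

// e.g., worker 0 of 3 only enqueues the URLs assigned to it
var workerId = 0;
var totalWorkers = 3;
if (BelongsToWorker(newPaginationLink, workerId, totalWorkers))
{
    pagesToScrape.Enqueue(newPaginationLink);
}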

Conclusion

In this guide, you've learned everything needed to build an effective web crawler in C#. You now know:

  • How to build a basic web crawler that systematically discovers and visits pages.
  • How to extract structured data from the pages your crawler visits.
  • How to manage the crawling state and avoid duplicate pages.
  • How to store the collected data in CSV format.
  • How to implement parallel crawling for better performance.
  • Best practices for handling JavaScript-rendered content and scaling your crawler.

Remember that while building a crawler is straightforward, maintaining its reliability against modern websites can be challenging due to anti-bot measures. Instead of spending time maintaining proxy rotations and handling anti-bot bypasses, consider using ZenRows to automatically manage these complexities at scale.

Sign up for free to try ZenRows and see how it can make your web crawling projects easier and more reliable.
