ScrapySharp: Comprehensive Tutorial for Scrapy in C# [2024]

January 4, 2024 · 10 min read

ScrapySharp is a popular C# scraping framework to get HTML content and parse it for data extraction. In this tutorial, you'll start with the basics and then see how to use it to address the challenges in web scraping.

Let's dive in!

Why You Should Use ScrapySharp

ScrapySharp is a powerful C# web scraping framework built on top of HtmlAgilityPack. It has a browser-like web client to perform HTTP requests and a complete API to parse HTML. This provides developers with a robust toolbox for extracting data from static sites.

One of its main benefits is its flexibility and ease of use. Thanks to a clean and intuitive API, it makes HTML parsing as natural as possible. That keeps the library accessible for both beginners and experienced developers.

ScrapySharp's feature set extends beyond basic HTML parsing. It allows for easy navigation through the DOM with both XPath and CSS Selectors. Also, its built-in web client can handle headers, cookies, and redirects like a web browser.

How to Use ScrapySharp

You're about to take your first steps with ScrapySharp. The target site will be ScrapeMe, an e-commerce site with a paginated list of Pokémon-inspired products.

Pokémon ecommerce

Follow the instructions below to see how to build a C# web scraper!

Step 1: Install ScrapySharp

Before diving into this tutorial, make sure you have the .NET SDK installed on your computer. Download the .NET 8+ installer, execute it, and follow the wizard.

You're now ready to set up a ScrapySharp C# project. Use PowerShell to create a ScrapySharpProject folder and enter it:

Terminal
mkdir ScrapySharpProject
cd ScrapySharpProject

Then, run the new console command below to initialize a .NET 8 C# application:

Terminal
dotnet new console --framework net8.0

The project folder will now contain program.cs, along with other files. If you're not familiar with that, program.cs is the main C# file that will contain the web scraping logic.

Next, add the ScrapySharp NuGet package to your project's dependencies:

Terminal
dotnet add package ScrapySharp

Add the following imports on top of your program.cs file to get access to the API provided by the library:

program.cs
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;

Bear in mind that you can launch your scraping script with this command:

Terminal
dotnet run

Awesome! You're ready to scrape some data from the web!

Step 2: Access a Web Page with ScrapySharp

Paste the line below in the Main function of program.cs to create a ScrapingBrowser object. That exposes the ScrapySharp browser-like API for handling cookies, session headers, and redirects.

program.cs
ScrapingBrowser browser = new ScrapingBrowser();

Next, use browser.NavigateToPage() to perform a GET request to the target page. Again, the function's name suggests that the library is navigating to the target page in a browser. That is just an illusion to make the API more intuitive.

program.cs
WebPage page = browser.NavigateToPage(new Uri("https://scrapeme.live/shop/"));

Then, access the Content property to get the source HTML of the page as a string. Print it in the terminal with this instruction:

program.cs
var html = page.Content;
Console.WriteLine(html);

This is what the current program.cs file should now contain:

program.cs
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;

class Program
{
  static void Main(string[] args)
  {
    // initialize the browser-like ScrapySharp API
    ScrapingBrowser browser = new ScrapingBrowser();
    // retrieve the HTML target page
    WebPage page = browser.NavigateToPage(new Uri("https://scrapeme.live/shop/"));

    // get the HTML source code
    // and print it
    var html = page.Content;
    Console.WriteLine(html);
  }
}

Execute the script, and it'll log the following content in PowerShell:

Output
<!doctype html>
<html lang="en-GB">
  <head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=2.0">
  <link rel="profile" href="http://gmpg.org/xfn/11">
  <link rel="pingback" href="https://scrapeme.live/xmlrpc.php">
  
  <title>Products &#8211; ScrapeMe</title>
  <link rel='dns-prefetch' href='//fonts.googleapis.com' />
  <link rel='dns-prefetch' href='//s.w.org' />
  <link rel="alternate" type="application/rss+xml" title="ScrapeMe &raquo; Feed" href="https://scrapeme.live/feed/" />
  <!-- Omitted for brevity... -->

Well done, that's exactly the HTML code of the target page!

Learn how to extract specific data from it in the next step.

Step 3: Extract Specific Data by Parsing

ScrapySharp offers a way to connect to web pages and can also parse their HTML and extract data from it. That's what web scraping is all about!

For example, let's set the scraping goal as getting information fields from each product on the page. Reaching that involves a 3-step process:

  1. Find the product HTML nodes with an effective DOM selection strategy.
  2. Extract the data of interest from each of them.
  3. Store the retrieved data in a custom C# data structure.

In most cases, a DOM selection strategy is nothing more than a CSS Selector or an XPath expression. These are two practical ways to select HTML nodes. XPath expressions are complex yet powerful, while CSS selectors are intuitive. Learn more in our guide on CSS Selector vs XPath.

ScrapySharp comes with the XPath-based API for data parsing exposed by HTML Agility Pack but extends it with CSS Selector support. This means there's room for choice between the two approaches.
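To make the two approaches concrete, here is a minimal sketch that selects the same nodes once with XPath (through HTML Agility Pack's SelectNodes()) and once with a CSS selector (through ScrapySharp's CssSelect()). The SelectorDemo class and the sample markup are hypothetical and only mimic a simplified product list:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

public static class SelectorDemo
{
  // hypothetical sample markup that mimics a simplified product list
  public const string SampleHtml = @"
<ul>
  <li class=""product""><h2>Bulbasaur</h2></li>
  <li class=""product""><h2>Ivysaur</h2></li>
</ul>";

  // select the product names with an XPath expression (HTML Agility Pack API)
  public static string[] NamesWithXPath(string html)
  {
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    var nodes = doc.DocumentNode.SelectNodes("//li[@class='product']/h2");
    // SelectNodes() returns null when nothing matches
    return nodes == null ? Array.Empty<string>() : nodes.Select(n => n.InnerText).ToArray();
  }

  // select the same nodes with a CSS selector (ScrapySharp extension method)
  public static string[] NamesWithCss(string html)
  {
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    return doc.DocumentNode.CssSelect("li.product > h2").Select(n => n.InnerText).ToArray();
  }
}

class Program
{
  static void Main(string[] args)
  {
    Console.WriteLine(string.Join(", ", SelectorDemo.NamesWithXPath(SelectorDemo.SampleHtml)));
    Console.WriteLine(string.Join(", ", SelectorDemo.NamesWithCss(SelectorDemo.SampleHtml)));
  }
}
```

Both methods should return the same product names. The CSS version is shorter, which is one reason to prefer it for simple selections.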

To keep things simple, opt for CSS selectors. You can't devise an effective CSS selector strategy without first getting familiar with the structure of the HTML. So, open the target page in your browser and inspect a product HTML element with the DevTools.

Analyze the HTML code and notice that each product is an HTML <li> node with a "product" class. Given a product element, the pieces of data to extract are:

  • The product URL in the <a>.
  • The product image in the <img>.
  • The product name in the <h2>.
  • The product price in the <span> with a “price” class.

You have enough information to select the product nodes and get data from them via CSS selectors. But first, define a custom C# class to store that data:

program.cs
public class Product
{
  public string? Url { get; set; }
  public string? Image { get; set; }
  public string? Name { get; set; }
  public string? Price { get; set; }
}

Next, initialize an empty list of Product objects in Main(). This is the data structure that will store all scraped information.

program.cs
var products = new List<Product>();

Use the CssSelect() method from ScrapySharp to apply a CSS selector on the HTML and get elements.

program.cs
HtmlNode[] productHTMLElements = page.Html.CssSelect(".product").ToArray();

After selecting the product nodes, iterate over them and apply the data extraction logic.

program.cs
foreach (var productHTMLElement in productHTMLElements)
{
  // scrape the data of interest from the current HTML element
  var url = productHTMLElement.CssSelect("a").First().GetAttributeValue("href");
  var image = productHTMLElement.CssSelect("img").First().GetAttributeValue("src");
  var name = WebUtility.HtmlDecode(productHTMLElement.CssSelect("h2").First().InnerText);
  var price = WebUtility.HtmlDecode(productHTMLElement.CssSelect(".price").First().InnerText);

  // instantiate a new Product object and
  // add it to the list
  var product = new Product() { Url = url, Image = image, Name = name, Price = price };
  products.Add(product);
}

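GetAttributeValue() returns the value of the specified HTML attribute. In contrast, the InnerText property returns the text of the current node. That's all you need to accomplish the data scraping goal.

WebUtility.HtmlDecode() converts HTML entities back to plain characters. It comes from the System.Net namespace, so add this import on top of program.cs: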

program.cs
using System.Net;
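Keep in mind that First() throws an exception when a selector matches nothing, which can happen if a product lacks one of the fields. A more defensive variant could look like the sketch below. The SafeExtract helper and the sample markup are hypothetical names introduced only for illustration:

```csharp
using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

public static class SafeExtract
{
  // return the decoded text of the first node matching the selector,
  // or null when nothing matches (First() would throw instead)
  public static string? Text(HtmlNode node, string cssSelector)
  {
    var match = node.CssSelect(cssSelector).FirstOrDefault();
    return match == null ? null : WebUtility.HtmlDecode(match.InnerText).Trim();
  }

  // return the attribute value of the first matching node,
  // or null when the node or the attribute is missing
  public static string? Attribute(HtmlNode node, string cssSelector, string attributeName)
  {
    var match = node.CssSelect(cssSelector).FirstOrDefault();
    return match?.Attributes[attributeName]?.Value;
  }
}

class Program
{
  static void Main(string[] args)
  {
    // hypothetical product markup without a price element
    var doc = new HtmlDocument();
    doc.LoadHtml("<li class=\"product\"><a href=\"/shop/Bulbasaur/\"><h2>Bulbasaur</h2></a></li>");
    var productNode = doc.DocumentNode.SelectSingleNode("//li");

    Console.WriteLine(SafeExtract.Text(productNode, "h2") ?? "(missing)");
    Console.WriteLine(SafeExtract.Text(productNode, ".price") ?? "(missing)");
    Console.WriteLine(SafeExtract.Attribute(productNode, "a", "href") ?? "(missing)");
  }
}
```

With helpers like these, a missing field ends up as null in the Product object instead of stopping the whole scraping loop.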

This is the code of program.cs so far:

program.cs
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;
using System.Net;

class Program
{
  // custom class to represent the
  // product objects to scrape
  public class Product
  {
    public string? Url { get; set; }
    public string? Image { get; set; }
    public string? Name { get; set; }
    public string? Price { get; set; }
  }

  static void Main(string[] args)
  {
    // initialize the browser-like ScrapySharp API
    ScrapingBrowser browser = new ScrapingBrowser();
    // retrieve the HTML target page
    WebPage page = browser.NavigateToPage(new Uri("https://scrapeme.live/shop/"));

    // select all HTML product elements with a CSS selector
    HtmlNode[] productHTMLElements = page.Html.CssSelect(".product").ToArray();

    // initialize a list of Product objects where to
    // store the scraped data
    var products = new List<Product>();

    // iterate over each HTML product element and
    // scrape data from them
    foreach (var productHTMLElement in productHTMLElements)
    {
      // scrape the data of interest from the current HTML element
      var url = productHTMLElement.CssSelect("a").First().GetAttributeValue("href");
      var image = productHTMLElement.CssSelect("img").First().GetAttributeValue("src");
      var name = WebUtility.HtmlDecode(productHTMLElement.CssSelect("h2").First().InnerText);
      var price = WebUtility.HtmlDecode(productHTMLElement.CssSelect(".price").First().InnerText);

      // instantiate a new Product object and
      // add it to the list
      var product = new Product() { Url = url, Image = image, Name = name, Price = price };
      products.Add(product);
    }

    // print the scraped products
    foreach (var product in products)
    {
      Console.WriteLine($"Url: {product.Url}, Image: {product.Image}, Name: {product.Name}, Price: {product.Price}");
    }
  }
}

Launch it, and it'll print this output:

Output
Url: https://scrapeme.live/shop/Bulbasaur/, Image: https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png, Name: Bulbasaur, Price: £63.00
Url: https://scrapeme.live/shop/Ivysaur/, Image: https://scrapeme.live/wp-content/uploads/2018/08/002-350x350.png, Name: Ivysaur, Price: £87.00
// omitted for brevity...
Url: https://scrapeme.live/shop/Beedrill/, Image: https://scrapeme.live/wp-content/uploads/2018/08/015-350x350.png, Name: Beedrill, Price: £168.00
Url: https://scrapeme.live/shop/Pidgey/, Image: https://scrapeme.live/wp-content/uploads/2018/08/016-350x350.png, Name: Pidgey, Price: £159.00

Fantastic! The ScrapySharp C# parsing logic works like a charm!

Step 4: Export the Data to CSV

Don't waste time and energy trying to export data to CSV with vanilla C#. It's way easier to use a library like CsvHelper to convert the collected data to CSV. Add it to your project's dependencies with:

Terminal
dotnet add package CsvHelper

Add the line that follows on top of your program.cs file to import it:

program.cs
using CsvHelper;

Create a products.csv file in C# and harness the API exposed by CsvHelper to populate it. WriteRecords() transforms the product list to CSV and writes it to the output file.

program.cs
// initialize the CSV output file
// and the CSV writer
using (var writer = new StreamWriter("products.csv"))
using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
{
  // populate the CSV file
  csv.WriteRecords(products);
}

CultureInfo comes from the System.Globalization namespace, so add this import as well:

program.cs
using System.Globalization;
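To preview what WriteRecords() produces without touching the file system, you can point the CsvWriter at a StringWriter instead. This is only a sketch: the CsvPreview class is a hypothetical helper, and the sample record reuses the first product from the output shown earlier:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using CsvHelper;

public class Product
{
  public string? Url { get; set; }
  public string? Image { get; set; }
  public string? Name { get; set; }
  public string? Price { get; set; }
}

public static class CsvPreview
{
  // serialize the products to a CSV string instead of a file
  public static string ToCsv(IEnumerable<Product> products)
  {
    using (var writer = new StringWriter())
    using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
    {
      // write a header row from the property names, then one row per product
      csv.WriteRecords(products);
      // make sure buffered data reaches the underlying writer
      csv.Flush();
      return writer.ToString();
    }
  }
}

class Program
{
  static void Main(string[] args)
  {
    var products = new List<Product>
    {
      new Product
      {
        Url = "https://scrapeme.live/shop/Bulbasaur/",
        Image = "https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png",
        Name = "Bulbasaur",
        Price = "£63.00"
      }
    };

    Console.WriteLine(CsvPreview.ToCsv(products));
  }
}
```

The same CsvWriter calls work with the StreamWriter used in the tutorial. Only the underlying writer changes.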

Put it all together, and you'll get the final code of your C# scraper:

program.cs
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;
using System.Net;
using CsvHelper;
using System.Globalization;

class Program
{
  // custom class to represent the
  // product objects to scrape
  public class Product
  {
    public string? Url { get; set; }
    public string? Image { get; set; }
    public string? Name { get; set; }
    public string? Price { get; set; }
  }

  static void Main(string[] args)
  {
    // initialize the browser-like ScrapySharp API
    ScrapingBrowser browser = new ScrapingBrowser();
    // retrieve the HTML target page
    WebPage page = browser.NavigateToPage(new Uri("https://scrapeme.live/shop/"));

    // select all HTML product elements with a CSS selector
    HtmlNode[] productHTMLElements = page.Html.CssSelect(".product").ToArray();

    // initialize a list of Product objects where to
    // store the scraped data
    var products = new List<Product>();

    // iterate over each HTML product element and
    // scrape data from them
    foreach (var productHTMLElement in productHTMLElements)
    {
      // scrape the data of interest from the current HTML element
      var url = productHTMLElement.CssSelect("a").First().GetAttributeValue("href");
      var image = productHTMLElement.CssSelect("img").First().GetAttributeValue("src");
      var name = WebUtility.HtmlDecode(productHTMLElement.CssSelect("h2").First().InnerText);
      var price = WebUtility.HtmlDecode(productHTMLElement.CssSelect(".price").First().InnerText);

      // instantiate a new Product object and
      // add it to the list
      var product = new Product() { Url = url, Image = image, Name = name, Price = price };
      products.Add(product);
    }

    // initialize the CSV output file
    // and the CSV writer
    using (var writer = new StreamWriter("products.csv"))
    using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
    {
      // populate the CSV file
      csv.WriteRecords(products);
    }
  }
}

Run the scraping script:

Terminal
dotnet run

Wait for the application to terminate, and a products.csv file will appear in the project's folder. Open it, and you'll see a header row followed by one line per scraped product.

Congrats, you just learned how to perform web scraping with ScrapySharp! 🎉

Alternatives to ScrapySharp

ScrapySharp is a great web scraping framework but comes with some major limitations. The library's authors aren't very active, and the last commit was in 2020. Also, it can't deal with sites that use JavaScript for rendering or data retrieval.

That's why you'll find the following list of ScrapySharp alternatives interesting! For a more complete analysis, take a look at our article on the best C# web scraping libraries.

PuppeteerSharp

PuppeteerSharp is a .NET library that mirrors the Puppeteer API for Node.js. Its goal is to control Chrome in headless mode for browser automation. Initialize a browser instance, navigate to a page, and use the API to interact with it.

You can use it to scrape dynamic content sites and to automate tasks like taking screenshots, submitting forms, and clicking elements. To find out all possible interactions available, check out our guide on PuppeteerSharp.

Selenium

Selenium is a widely used cross-browser and cross-language automation tool supporting C#. It enables developers to automate web browsers, scripting actions, and interactions. It communicates with browsers via specific drivers, translating commands into executable actions.

Selenium is a versatile tool that proves useful in a range of tasks, from testing to scraping. Use it to simulate complex interactions with dynamic Web elements. For a complete tutorial, follow our guide on Selenium C#.

Playwright .NET

Playwright .NET is a browser automation library that supports Chromium, Firefox, and WebKit. It's an official port from Microsoft of the popular Playwright library in TypeScript. It provides a unified API for browser automation tasks, enhancing cross-browser compatibility.

Compared to other tools, it relies on browser contexts with isolation. The library's main goal is to prove reliable and consistent with all browser engines. That opens the door to a modern approach for efficient data scraping on dynamic sites.

Avoid Getting Blocked when Web Scraping with ScrapySharp

The biggest challenge to web scraping is getting blocked by anti-bot measures. An effective way to mitigate that issue is to randomize your requests as much as possible. How? Set real-world User-Agent header values and use proxies to change your exit IP.

To configure a custom user agent in ScrapySharp, add a new element to the Headers field of browser:

program.cs
browser.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36");

Why is this important? Find out more in our article on User Agents for web scraping.

Setting a proxy involves passing a WebProxy object to the Proxy field of browser:

program.cs
// the URL of a free proxy
var proxyUri = new Uri("http://162.248.225.226:80");
browser.Proxy = new WebProxy(proxyUri);
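Randomizing the User-Agent could be sketched as picking a value from a small pool before each run. The UserAgentPool class below is a hypothetical helper. Its first entry is the string used in this tutorial, while the other two are assumed examples of desktop browser strings that you should replace with up-to-date values:

```csharp
using System;
using System.Collections.Generic;

public static class UserAgentPool
{
  // illustrative pool of User-Agent strings (entries 2 and 3 are assumptions)
  public static readonly IReadOnlyList<string> UserAgents = new[]
  {
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0"
  };

  // pick one User-Agent at random from the pool
  public static string Pick(Random random)
  {
    return UserAgents[random.Next(UserAgents.Count)];
  }
}

class Program
{
  static void Main(string[] args)
  {
    var random = new Random();
    // the picked value is what you would set as the "user-agent" header
    var userAgent = UserAgentPool.Pick(random);
    Console.WriteLine(userAgent);
  }
}
```

To use it with ScrapySharp, pass the picked string as the value of the "user-agent" entry in browser.Headers instead of a hard-coded one. The same idea applies to a list of proxy URLs.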

No matter how randomized your requests are, these are just baby steps to bypass sophisticated anti-bots. For example, try to get the source HTML from a G2 review page:

program.cs
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;
using System.Net;

class Program
{
  static void Main(string[] args)
  {
    // initialize the browser-like ScrapySharp API
    ScrapingBrowser browser = new ScrapingBrowser();

    // set a real-world user-agent
    browser.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36");

    // set a proxy server
    var proxyUri = new Uri("http://162.248.225.226:80");
    browser.Proxy = new WebProxy(proxyUri);

    // retrieve the HTML target page
    WebPage page = browser.NavigateToPage(new Uri("https://www.g2.com/products/jira/reviews"));

    // get the HTML source code
    // and print it
    var html = page.Content;
    Console.WriteLine(html);
  }
}

Run the above script, and it'll print:

Output
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>
<title>Attention Required! | Cloudflare</title>
<!-- omitted for brevity -->

The G2 server responded with a 403 Forbidden error page from Cloudflare! This proves that advanced solutions such as Cloudflare can still see your script as a bot.

How to proceed, then? Use ZenRows! It integrates seamlessly with ScrapySharp, offers browser automation capabilities, and provides automatic User-Agent and IP rotation via premium residential proxies. Plus, its complete anti-bot toolkit will avoid any blocks and IP bans for you.

To get started with ZenRows, sign up for free. You'll reach the Request Builder page.

Next, follow the steps below:

  1. Paste your target URL (https://www.g2.com/products/jira/reviews) in the "URL to Scrape" field.
  2. Select the "AI Anti-bot" option for maximum effectiveness.
  3. Check the "Premium Proxy" option to enable IP rotation (User-Agent rotation is included by default).
  4. Select the "cURL" radio button on the right and then "API". "cURL" gives you the raw API endpoint, which you can call with any HTTP client regardless of the programming language.
  5. Copy the generated link and pass it to the Uri constructor in the NavigateToPage() call:

program.cs
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;

class Program
{
  static void Main(string[] args)
  {
    // initialize the browser-like ScrapySharp API
    ScrapingBrowser browser = new ScrapingBrowser();

    // retrieve the HTML target page
    WebPage page = browser.NavigateToPage(new Uri("https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fjira%2Freviews&js_render=true&antibot=true&premium_proxy=true"));

    // get the HTML source code
    // and print it
    var html = page.Content;
    Console.WriteLine(html);
  }
}
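If you want to scrape more than one target, you may prefer to build the API URL in code rather than copy it from the Request Builder each time. The sketch below assumes the same query parameters that appear in the generated link above (apikey, url, js_render, antibot, premium_proxy). The ZenRowsUrl helper and the ZENROWS_API_KEY environment variable are hypothetical names:

```csharp
using System;

public static class ZenRowsUrl
{
  // build the API endpoint for a target page, URL-encoding the
  // API key and the target URL so they are safe inside the query string
  public static string Build(string apiKey, string targetUrl)
  {
    return "https://api.zenrows.com/v1/"
      + "?apikey=" + Uri.EscapeDataString(apiKey)
      + "&url=" + Uri.EscapeDataString(targetUrl)
      + "&js_render=true&antibot=true&premium_proxy=true";
  }
}

class Program
{
  static void Main(string[] args)
  {
    // read the key from an environment variable (hypothetical name)
    // so it doesn't end up hard-coded in the source
    var apiKey = Environment.GetEnvironmentVariable("ZENROWS_API_KEY") ?? "YOUR_ZENROWS_API_KEY";
    var apiUrl = ZenRowsUrl.Build(apiKey, "https://www.g2.com/products/jira/reviews");

    // pass new Uri(apiUrl) to browser.NavigateToPage()
    Console.WriteLine(apiUrl);
  }
}
```

Uri.EscapeDataString() produces the same encoded form of the target URL as the generated link, so the result can go straight into the Uri constructor.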

Execute the script, and this time it'll print the source HTML of the G2 page:

Output
<!DOCTYPE html>
<head>
  <meta charset="utf-8" />
  <link href="https://www.g2.com/assets/favicon-fdacc4208a68e8ae57a80bf869d155829f2400fa7dd128b9c9e60f07795c4915.ico" rel="shortcut icon" type="image/x-icon" />
  <title>Jira Reviews 2023: Details, Pricing, &amp; Features | G2</title>
  <!-- omitted for brevity ... -->

Amazing! You just integrated ZenRows into ScrapySharp.

Conclusion

In this tutorial, you saw the fundamentals of parsing HTML documents in C#. You started with the basics and then learned how to overcome the scraping challenges.

Now you know:

  • What the ScrapySharp C# library is.
  • How to use it to download a web page, parse its HTML, and collect data from it.
  • What its biggest limitations are.
  • What its main alternative libraries are.

ScrapySharp is a great tool but only works with static HTML pages. Aiming for web scraping without getting blocked with such a limited tool isn't easy. Extend it with ZenRows, a web scraping API with headless browser functionality, IP rotation, and an advanced built-in toolkit to avoid any anti-bot systems. Scraping dynamic content is easier now. Try ZenRows for free!

Frequent Questions

Does ScrapySharp Support Dynamic Content Scraping?

No, ScrapySharp does not support dynamic content scraping. Even though its API involves a ScrapingBrowser, that's only a trick for making the code easier to write and read. The tool doesn't support browser automation in any form. As a result, it can't interact with pages involving dynamic content or interaction.

Is ScrapySharp a Port of the Scrapy Library in Python?

No, ScrapySharp isn't a direct port of the Scrapy library in Python. It's a C# web scraping framework inspired by Scrapy but has its own design and implementation. The two projects are independent and come with different syntax and functionality. To see the differences, explore our guide to Scrapy in Python.

Is It Possible to Integrate Splash into ScrapySharp?

No, integrating Splash into ScrapySharp isn't possible. Unlike Scrapy, the C# library can't be extended with that popular rendering service. For scraping dynamic content pages, you may need to explore other solutions like Selenium C# or PuppeteerSharp.
