Web Scraping with C in 2024

June 1, 2024 · 7 min read

C is one of the most efficient programming languages around, and that performance makes it a great fit for web scraping jobs that involve tons of pages or very large ones. In this step-by-step tutorial, you'll learn how to do web scraping in C with the libcurl and libxml2 libraries.

Let's dive in!

Can You Do Web Scraping with C?

When you think of the online world, C isn't the first language that comes to mind. For web scraping, most developers prefer Python because of its extensive packages. Or they use JavaScript in Node.js because of its large community.

At the same time, C is a viable option for doing web scraping, especially when resource use is critical. A web scraping application in C can achieve extreme performance thanks to the low-level nature of the language.

Learn more about the best programming languages for web scraping.

How to Do Web Scraping in C

Web scraping using C involves three simple steps:

  1. Download the HTML content of the target page with libcurl.
  2. Parse the HTML and scrape data from it with the HTML parser libxml2.
  3. Export the collected data to a file.

As a target site, we'll use ScrapingCourse.com, a demo website with e-commerce features:

Scrapingcourse Ecommerce Store

The C scraper you're going to build will be able to retrieve all product data from each page of the site.

Let's build a C web scraping program!

Step 1: Install the Necessary Tools

Coding in C requires an environment for compilation and execution. On Windows, rely on Visual Studio. On macOS or Linux, go for Visual Studio Code with the C/C++ extension. Open the IDE and follow the instructions to create a C project based on your local compiler.

Then, install the C package manager vcpkg and set it up in Visual Studio as explained in the official guide. This package manager allows you to install the dependencies required to build a web scraper in C:

  • libcurl: An open-source, easy-to-use HTTP client library for C, maintained by the cURL project.
  • libxml2: An HTML and XML parser with a complete element selection API based on XPath.

libcurl helps you retrieve the HTML of the target pages, which you can then parse with libxml2 to extract the desired data.

To install libcurl and libxml2, run the command below in the root folder of the project:

Terminal
    vcpkg install curl libxml2
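
The Visual Studio setup mentioned above typically boils down to enabling vcpkg's user-wide MSBuild integration with a single command, as described in the official guide:

Terminal
    vcpkg integrate install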

Fantastic! You're now fully set up.

Time to initialize your C web scraping script. Create a scraper.c file in your project as follows. This is the simplest possible C program, but its main() function will soon contain the scraping logic.

scraper.c
    #include <stdio.h>
    #include <stdlib.h>
    
    int main() {
        printf("Hello, World!\n");
        return 0;
    }
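
If you prefer the command line to an IDE on macOS or Linux, a minimal sketch of the compile-and-run step looks like this (assuming gcc is installed and pkg-config can locate libcurl and libxml2; adjust the command to your toolchain):

Terminal
    gcc scraper.c $(pkg-config --cflags --libs libcurl libxml-2.0) -o scraper
    ./scraper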

Import the two libraries installed earlier by adding the three lines below at the top of the scraper.c file. The first import is for libcurl, while the other two come from libxml2. In detail, HTMLparser.h exposes functions to parse an HTML document, and xpath.h lets you select the desired elements from it via XPath.

scraper.c
    #include <curl/curl.h>
    #include <libxml/HTMLparser.h>
    #include <libxml/xpath.h>

Great! You're now ready to learn the basics of web scraping with C!

Step 2: Get the HTML of Your Target Webpage

Requests with libcurl involve boilerplate operations you don't want to repeat every time. Encapsulate them in a reusable function that:

  1. Receives a cURL instance as a parameter.
  2. Uses it to make an HTTP GET request to the URL passed as a parameter.
  3. Returns the HTML document produced by the server in a custom CURLResponse data structure.

This is how:

scraper.c
    struct CURLResponse
    {
        char *html;
        size_t size;
    };
    
    static size_t WriteHTMLCallback(void *contents, size_t size, size_t nmemb, void *userp)
    {
        size_t realsize = size * nmemb;
        struct CURLResponse *mem = (struct CURLResponse *)userp;
        char *ptr = realloc(mem->html, mem->size + realsize + 1);
      
        if (!ptr)
        {
            printf("Not enough memory available (realloc returned NULL)\n");
            return 0;
        }
      
        mem->html = ptr;
        memcpy(&(mem->html[mem->size]), contents, realsize);
        mem->size += realsize;
        mem->html[mem->size] = 0;
      
        return realsize;
    }
    
    struct CURLResponse GetRequest(CURL *curl_handle, const char *url)
    {
        CURLcode res;
        struct CURLResponse response;
      
        // initialize the response
        response.html = malloc(1);
        response.size = 0;
      
        // specify URL to GET
        curl_easy_setopt(curl_handle, CURLOPT_URL, url);
        // send all data returned by the server to WriteHTMLCallback
        curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteHTMLCallback);
        // pass "response" to the callback function
        curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&response);
        // set a User-Agent header
        curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36");
        // perform the GET request
        res = curl_easy_perform(curl_handle);
      
        // check for request errors
        if (res != CURLE_OK)
        {
            fprintf(stderr, "GET request failed: %s\n", curl_easy_strerror(res));
        }
      
        return response;
    }
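
Note that curl_easy_perform() only reports transport-level failures: a 403 or 404 response still comes back as CURLE_OK. If you also want to inspect the HTTP status code, a small optional addition to GetRequest() (after curl_easy_perform()) could look like this:

scraper.c
    // optionally check the HTTP status code of the response
    long http_code = 0;
    curl_easy_getinfo(curl_handle, CURLINFO_RESPONSE_CODE, &http_code);
    if (http_code >= 400)
    {
        fprintf(stderr, "Server replied with HTTP status %ld\n", http_code);
    }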

Now, use GetRequest() in the main() function of scraper.c to retrieve the HTML document of the target page:

scraper.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <curl/curl.h>
    #include <libxml/HTMLparser.h>
    #include <libxml/xpath.h>
    
    // ... 
    // struct CURLResponse GetRequest(CURL *curl_handle, const char *url) ...
    
    int main() {
        // initialize curl globally
        curl_global_init(CURL_GLOBAL_ALL);
        
        // initialize a CURL instance
        CURL *curl_handle = curl_easy_init();
    
        // retrieve the HTML document of the target page
        struct CURLResponse response = GetRequest(curl_handle, "https://www.scrapingcourse.com/ecommerce/");
        // print the HTML content
        printf("%s\n", response.html);
    
        // scraping logic...
    
        // cleanup the curl instance
        curl_easy_cleanup(curl_handle);
        // cleanup the curl resources
        curl_global_cleanup();
    
        return 0;
    }

This is your initial script. Compile and run it to produce the following output in your terminal:

Output
    <!DOCTYPE html>
    <html lang="en-US">
    <head>
        <!-- ... -->
        <title>Ecommerce Test Site to Learn Web Scraping – ScrapingCourse.com</title>
        <!-- ... -->
    </head>
    <body class="home archive ...">
        <p class="woocommerce-result-count">Showing 1–16 of 188 results</p>
        <ul class="products columns-4">
            <!-- ... -->
        </ul>
    </body>
    </html>

Wonderful! That's the HTML code of the target page!

Frustrated that your web scrapers are blocked again and again?
ZenRows API handles rotating proxies and headless browsers for you.
Try for FREE

Step 3: Extract Specific Data from the Page

After retrieving the HTML code, feed it to libxml2, where htmlReadMemory() parses the HTML char * content and produces a tree you can explore via XPath expressions.

scraper.c
    htmlDocPtr doc = htmlReadMemory(response.html, (unsigned long)response.size, NULL, NULL, HTML_PARSE_NOERROR);
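
htmlReadMemory() returns NULL if the document can't be parsed, so a defensive check (not strictly required for this demo page, but cheap insurance) is a good habit:

scraper.c
    if (doc == NULL)
    {
        fprintf(stderr, "Failed to parse the HTML document\n");
        return 1;
    }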

The next step is to define an effective selection strategy. To do so, you need to inspect the target site and familiarize yourself with its structure.

Open the target site in the browser, right-click on a product HTML node, and choose the "Inspect" option. The following DevTools window will open:

Scrapingcourse Ecommerce Homepage Inspect First Page

Take a look at the DOM of the page and note that all products are <li> elements with the product class. Thus, you can retrieve them all with the XPath query below:

scraper.c
    //li[contains(@class, 'product')]
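
Before wiring the selector into C, you can sanity-check it directly in the browser: the DevTools console in Chrome and Firefox exposes an $x() helper that evaluates XPath expressions against the current page.

DevTools console
    $x("//li[contains(@class, 'product')]")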

Apply the XPath selector in libxml2 to retrieve all the product HTML elements. xmlXPathNewContext() creates an XPath evaluation context for the entire document, and xmlXPathEvalExpression() then applies the selection strategy defined above:

scraper.c
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);

Note that each product on the page contains this information:

  • A link to the detail page in an <a>.
  • An image in an <img>.
  • A name in an <h2>.
  • A price in a <span>.

To scrape that data and keep track of it, define a custom data structure at the top of scraper.c with typedef. C doesn't support classes but has structs, collections of data fields grouped under the same name.

scraper.c
    typedef struct
    {
        char *url;
        char *image;
        char *name;
        char *price;
    } Product;

Each pagination page contains several products, so you'll need an array of Product:

scraper.c
    Product products[MAX_PRODUCTS];

MAX_PRODUCTS is a macro storing the number of products on a page:

scraper.c
    #define MAX_PRODUCTS 16

Time to iterate over the selected product nodes and extract the desired info from each of them. At the end of the for loop, products will contain all the product data of interest!

scraper.c
    for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
    {
        // get the current element of the loop
        xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];
    
        // set the context to restrict XPath selectors
        // to the children of the current element
        xmlXPathSetContextNode(productHTMLElement, context);
        xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
        char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
        xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
        char *image = (char *)(xmlGetProp(imageHTMLElement, (xmlChar *)"src"));
        xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
        char *name = (char *)(xmlNodeGetContent(nameHTMLElement));
        xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
        char *price = (char *)(xmlNodeGetContent(priceHTMLElement));
    
        // store the scraped data in a Product instance
        Product product;
        product.url = strdup(url);
        product.image = strdup(image);
        product.name = strdup(name);
        product.price = strdup(price);
    
        // free up the resources you no longer need
        free(url);
        free(image);
        free(name);
        free(price);
    
        // add a new product to the array
        products[productCount] = product;
        productCount++;
    }
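
One refinement the loop above skips for brevity: every xmlXPathEvalExpression() call allocates an xmlXPathObject that is never released, which leaks a little memory on each iteration (the same goes for productHTMLElements once the loop is done). If you want to plug that leak, keep each result object in a variable and release it with xmlXPathFreeObject() after reading the node. A sketch for the URL field:

scraper.c
    // keep the XPath result so it can be released after use
    xmlXPathObjectPtr urlResult = xmlXPathEvalExpression((xmlChar *)".//a", context);
    xmlNodePtr urlHTMLElement = urlResult->nodesetval->nodeTab[0];
    char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
    // ...
    xmlXPathFreeObject(urlResult);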

After the loop, remember to free up the resources you allocated along the way:

scraper.c
    free(response.html);
    // free up libxml2 resources
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    xmlCleanupParser();

At this point, your scraper.c file contains:

scraper.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <curl/curl.h>
    #include <libxml/HTMLparser.h>
    #include <libxml/xpath.h>
    
    #define MAX_PRODUCTS 16
    
    // initialize a data structure to
    // store the scraped data
    typedef struct
    {
        char *url;
        char *image;
        char *name;
        char *price;
    } Product;
    
    // ... 
    // struct CURLResponse GetRequest(CURL *curl_handle, const char *url) ...
    
    int main(void)
    {
        // initialize curl globally
        curl_global_init(CURL_GLOBAL_ALL);
      
        // initialize a CURL instance
        CURL *curl_handle = curl_easy_init();
      
        // initialize the array that will contain
        // the scraped data
        Product products[MAX_PRODUCTS];
        int productCount = 0;
      
        // get the HTML document associated with the page
        struct CURLResponse response = GetRequest(curl_handle, "https://www.scrapingcourse.com/ecommerce/");
    
        // parse the HTML document returned by the server
        htmlDocPtr doc = htmlReadMemory(response.html, (unsigned long)response.size, NULL, NULL, HTML_PARSE_NOERROR);
        xmlXPathContextPtr context = xmlXPathNewContext(doc);
    
        // get the product HTML elements on the page
        xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);
    
        // iterate over them and scrape data from each of them
        for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
        {
            // get the current element of the loop
            xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];
      
            // set the context to restrict XPath selectors
            // to the children of the current element
            xmlXPathSetContextNode(productHTMLElement, context);
            xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
            char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
            xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
            char *image = (char *)(xmlGetProp(imageHTMLElement, (xmlChar *)"src"));
            xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
            char *name = (char *)(xmlNodeGetContent(nameHTMLElement));
            xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
            char *price = (char *)(xmlNodeGetContent(priceHTMLElement));
      
            // store the scraped data in a Product instance
            Product product;
            product.url = strdup(url);
            product.image = strdup(image);
            product.name = strdup(name);
            product.price = strdup(price);
      
            // free up the resources you no longer need
            free(url);
            free(image);
            free(name);
            free(price);
      
            // add a new product to the array
            products[productCount] = product;
            productCount++;
        }
    
        // free up the allocated resources
        free(response.html);
        xmlXPathFreeContext(context);
        xmlFreeDoc(doc);
        xmlCleanupParser();
      
        // cleanup the curl instance
        curl_easy_cleanup(curl_handle);
        // cleanup the curl resources
        curl_global_cleanup();
      
        return 0;
    }

Good job! Now that you know how to extract data from HTML in C, all that's left is to export the output. You'll see how in the next step, together with the final code.

Step 4: Export Data to CSV

Right now, the scraped data is stored in an array of C structs. That's not the best format to share data with other users. Instead, export it to a more useful format, such as CSV.

And you don't even need an extra library to achieve that. All you have to do is open a .csv file, convert Product instances to CSV records, and append them to the file:

scraper.c
    // open a CSV file for writing
    FILE *csvFile = fopen("products.csv", "w");
    if (csvFile == NULL)
    {
        perror("Failed to open the CSV output file!");
        return 1;
    }
    
    // write the CSV header
    fprintf(csvFile, "url,image,name,price\n");
    // write each product's data to the CSV file
    
    for (int i = 0; i < productCount; i++)
    {
        fprintf(csvFile, "%s,%s,%s,%s\n", products[i].url, products[i].image, products[i].name, products[i].price);
    }
    
    // close the CSV file
    fclose(csvFile);
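
The product data on this demo site doesn't contain commas or quotes, so a plain fprintf() is enough here. If your fields might contain them, a small quoting helper (a hypothetical addition, not part of the original script) keeps the CSV valid:

scraper.c
    // hypothetical helper: wrap a field in quotes and escape embedded quotes
    void writeCSVField(FILE *file, const char *value)
    {
        fputc('"', file);
        for (const char *p = value; *p != '\0'; ++p)
        {
            if (*p == '"')
            {
                fputc('"', file); // double up quotes to escape them
            }
            fputc(*p, file);
        }
        fputc('"', file);
    }

You'd then call writeCSVField() for each value and write the commas yourself, instead of relying on a single fprintf() format string.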

Remember to free up the memory allocated for the struct fields:

scraper.c
    for (int i = 0; i < productCount; i++)
    {
        free(products[i].url);
        free(products[i].image);
        free(products[i].name);
        free(products[i].price);
    }

Put it all together, and you'll get the final code for your C web scraping script:

scraper.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <curl/curl.h>
    #include <libxml/HTMLparser.h>
    #include <libxml/xpath.h>
    
    #define MAX_PRODUCTS 16
    
    // initialize a data structure to
    // store the scraped data
    typedef struct
    {
        char *url;
        char *image;
        char *name;
        char *price;
    } Product;
    
    struct CURLResponse
    {
        char *html;
        size_t size;
    };
    
    static size_t
    WriteHTMLCallback(void *contents, size_t size, size_t nmemb, void *userp)
    {
        size_t realsize = size * nmemb;
        struct CURLResponse *mem = (struct CURLResponse *)userp;
      
        char *ptr = realloc(mem->html, mem->size + realsize + 1);
        if (!ptr)
        {
            printf("Not enough memory available (realloc returned NULL)\n");
            return 0;
        }
      
        mem->html = ptr;
        memcpy(&(mem->html[mem->size]), contents, realsize);
        mem->size += realsize;
        mem->html[mem->size] = 0;
      
        return realsize;
    }
    
    struct CURLResponse GetRequest(CURL *curl_handle, const char *url)
    {
        CURLcode res;
        struct CURLResponse response;
      
        // initialize the response
        response.html = malloc(1);
        response.size = 0;
      
        // specify URL to GET
        curl_easy_setopt(curl_handle, CURLOPT_URL, url);
      
        // send all data returned by the server to WriteHTMLCallback
        curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteHTMLCallback);
      
        // pass "response" to the callback function
        curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&response);
      
        // set a User-Agent header
        curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36");
      
        // perform the GET request
        res = curl_easy_perform(curl_handle);
      
        // check for request errors
        if (res != CURLE_OK)
        {
            fprintf(stderr, "GET request failed: %s\n", curl_easy_strerror(res));
        }
      
        return response;
    }
    
    int main(void)
    {
        // initialize curl globally
        curl_global_init(CURL_GLOBAL_ALL);
      
        // initialize a CURL instance
        CURL *curl_handle = curl_easy_init();
      
        // initialize the array that will contain
        // the scraped data
        Product products[MAX_PRODUCTS];
        int productCount = 0;
      
        // get the HTML document associated with the page
        struct CURLResponse response = GetRequest(curl_handle, "https://www.scrapingcourse.com/ecommerce/");
    
        // parse the HTML document returned by the server
        htmlDocPtr doc = htmlReadMemory(response.html, (unsigned long)response.size, NULL, NULL, HTML_PARSE_NOERROR);
        xmlXPathContextPtr context = xmlXPathNewContext(doc);
    
        // get the product HTML elements on the page
        xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);
    
        // iterate over them and scrape data from each of them
        for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
        {
            // get the current element of the loop
            xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];
      
            // set the context to restrict XPath selectors
            // to the children of the current element
            xmlXPathSetContextNode(productHTMLElement, context);
            xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
            char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
            xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
            char *image = (char *)(xmlGetProp(imageHTMLElement, (xmlChar *)"src"));
            xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
            char *name = (char *)(xmlNodeGetContent(nameHTMLElement));
            xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
            char *price = (char *)(xmlNodeGetContent(priceHTMLElement));
      
            // store the scraped data in a Product instance
            Product product;
            product.url = strdup(url);
            product.image = strdup(image);
            product.name = strdup(name);
            product.price = strdup(price);
      
            // free up the resources you no longer need
            free(url);
            free(image);
            free(name);
            free(price);
      
            // add a new product to the array
            products[productCount] = product;
            productCount++;
        }
    
        // free up the allocated resources
        free(response.html);
        xmlXPathFreeContext(context);
        xmlFreeDoc(doc);
        xmlCleanupParser();
      
        // cleanup the curl instance
        curl_easy_cleanup(curl_handle);
        // cleanup the curl resources
        curl_global_cleanup();
      
        // open a CSV file for writing
        FILE *csvFile = fopen("products.csv", "w");
        if (csvFile == NULL)
        {
            perror("Failed to open the CSV output file!");
            return 1;
        }
        // write the CSV header
        fprintf(csvFile, "url,image,name,price\n");
        // write each product's data to the CSV file
        for (int i = 0; i < productCount; i++)
        {
            fprintf(csvFile, "%s,%s,%s,%s\n", products[i].url, products[i].image, products[i].name, products[i].price);
        }
      
        // close the CSV file
        fclose(csvFile);
      
        // free the resources associated with each product
        for (int i = 0; i < productCount; i++)
        {
            free(products[i].url);
            free(products[i].image);
            free(products[i].name);
            free(products[i].price);
        }
      
        return 0;
    }

Compile the scraper application and run it. A products.csv file will appear in your project's folder, containing the products from the first page of ScrapingCourse:

ScrapingCourse csv output with lower case headings

Amazing! You've just learned the basics of web scraping with C, but there's still a lot to cover. For example, you still need to get the data from the remaining e-commerce pages. Keep reading to become a C data scraping expert.

Advanced Web Scraping in C

Scraping requires more than the basics. Time to dig into the advanced concepts of web scraping in C!

Scrape Multiple Pages: Web Crawling with C

The script built above retrieves products from a single page. However, the target site consists of several pages. To scrape them all, you need to go through each of them via web crawling. In other words, you have to discover all the links on the site and visit them automatically. That involves sets, supporting data structures, and custom logic to avoid visiting a page twice.
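
To give you an idea of the bookkeeping involved, a general-purpose crawler in C would need at least a visited-URL list and a check before queuing each new link. A minimal, hypothetical sketch (not used in the final script):

Example
    #define MAX_VISITED 1024
    
    char *visited[MAX_VISITED];
    int visitedCount = 0;
    
    // return 1 if the URL has already been crawled
    int alreadyVisited(const char *url)
    {
        for (int i = 0; i < visitedCount; i++)
        {
            if (strcmp(visited[i], url) == 0)
            {
                return 1;
            }
        }
        return 0;
    }
    
    // record a URL so it won't be crawled twice
    void markVisited(const char *url)
    {
        if (visitedCount < MAX_VISITED)
        {
            visited[visitedCount++] = strdup(url);
        }
    }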

Implementing web crawling with automatic page discovery in C is possible but also complex and error-prone. To avoid a headache, you should go for a smart approach. Take a look at the URLs of the pagination pages. These all follow the format below:

Example
    https://www.scrapingcourse.com/ecommerce/page/<page>/

ScrapingCourse page number demo

As there are 12 pages on the site, scrape them all by applying the following scraping logic to each pagination URL:

scraper.c
    for (int page = 1; page <= NUM_PAGES; ++page)
    {
        // build the URL of the target page
        char url[256];
        snprintf(url, sizeof(url), "https://www.scrapingcourse.com/ecommerce/page/%d/", page);
      
        // get the HTML document associated with the current page
        struct CURLResponse response = GetRequest(curl_handle, url);
      
        // scraping logic...
    }
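
If you want to go easy on the target server, you can also pause briefly between requests. A minimal sketch for Linux/macOS using sleep() from <unistd.h> (on Windows, you'd use Sleep() from <windows.h> instead):

scraper.c
    // requires #include <unistd.h> at the top of the file
    // at the end of each iteration of the pagination loop:
    sleep(1); // wait one second before requesting the next page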

NUM_PAGES is a macro that contains the number of pages the spider will visit. You'll also need to adapt MAX_PRODUCTS accordingly:

scraper.c
    #define NUM_PAGES 12
    #define MAX_PRODUCTS (NUM_PAGES * 16)

The scraper.c file will now contain:

scraper.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <curl/curl.h>
    #include <libxml/HTMLparser.h>
    #include <libxml/xpath.h>
    
    #define NUM_PAGES 5 // limit to 5 to avoid crawling the entire site
    #define MAX_PRODUCTS (NUM_PAGES * 16)
    
    // initialize a data structure to
    // store the scraped data
    typedef struct
    {
        char *url;
        char *image;
        char *name;
        char *price;
    } Product;
    
    struct CURLResponse
    {
        char *html;
        size_t size;
    };
    
    static size_t
    WriteHTMLCallback(void *contents, size_t size, size_t nmemb, void *userp)
    {
        size_t realsize = size * nmemb;
        struct CURLResponse *mem = (struct CURLResponse *)userp;
      
        char *ptr = realloc(mem->html, mem->size + realsize + 1);
        if (!ptr)
        {
            printf("Not enough memory available (realloc returned NULL)\n");
            return 0;
        }
      
        mem->html = ptr;
        memcpy(&(mem->html[mem->size]), contents, realsize);
        mem->size += realsize;
        mem->html[mem->size] = 0;
      
        return realsize;
    }
    
    struct CURLResponse GetRequest(CURL *curl_handle, const char *url)
    {
        CURLcode res;
        struct CURLResponse response;
      
        // initialize the response
        response.html = malloc(1);
        response.size = 0;
      
        // specify URL to GET
        curl_easy_setopt(curl_handle, CURLOPT_URL, url);
      
        // send all data returned by the server to WriteHTMLCallback
        curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteHTMLCallback);
      
        // pass "response" to the callback function
        curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&response);
      
        // set a User-Agent header
        curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36");
      
        // perform the GET request
        res = curl_easy_perform(curl_handle);
      
        // check for request errors
        if (res != CURLE_OK)
        {
            fprintf(stderr, "GET request failed: %s\n", curl_easy_strerror(res));
        }
      
        return response;
    }
    
    int main(void)
    {
        // initialize curl globally
        curl_global_init(CURL_GLOBAL_ALL);
      
        // initialize a CURL instance
        CURL *curl_handle = curl_easy_init();
      
        // initialize the array that will contain
        // the scraped data
        Product products[MAX_PRODUCTS];
        int productCount = 0;
      
        // iterate over the pages to scrape
        for (int page = 1; page <= NUM_PAGES; ++page)
        {
            // build the URL of the target page
            char url[256];
            snprintf(url, sizeof(url), "https://www.scrapingcourse.com/ecommerce/page/%d/", page);
        
            // get the HTML document associated with the current page
            struct CURLResponse response = GetRequest(curl_handle, url);
        
            // parse the HTML document returned by the server
            htmlDocPtr doc = htmlReadMemory(response.html, (unsigned long)response.size, NULL, NULL, HTML_PARSE_NOERROR);
            xmlXPathContextPtr context = xmlXPathNewContext(doc);
        
            // get the product HTML elements on the page
            xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);
        
            // iterate over them and scrape data from each of them
            for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
            {
                // get the current element of the loop
                xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];
          
                // set the context to restrict XPath selectors
                // to the children of the current element
                xmlXPathSetContextNode(productHTMLElement, context);
                xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
                char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
                xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
                char *image = (char *)(xmlGetProp(imageHTMLElement, (xmlChar *)"src"));
                xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
                char *name = (char *)(xmlNodeGetContent(nameHTMLElement));
                xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
                char *price = (char *)(xmlNodeGetContent(priceHTMLElement));
          
                // store the scraped data in a Product instance
                Product product;
                product.url = strdup(url);
                product.image = strdup(image);
                product.name = strdup(name);
                product.price = strdup(price);
          
                // free up the resources you no longer need
                free(url);
                free(image);
                free(name);
                free(price);
          
                // add a new product to the array
                products[productCount] = product;
                productCount++;
            }
        
            // free up the resources allocated for the current page
            free(response.html);
            xmlXPathFreeContext(context);
            xmlFreeDoc(doc);
        }
      
        // cleanup the libxml2 parser only once, after all pages are processed
        xmlCleanupParser();
      
        // cleanup the curl instance
        curl_easy_cleanup(curl_handle);
        // cleanup the curl resources
        curl_global_cleanup();
      
        // open a CSV file for writing
        FILE *csvFile = fopen("products.csv", "w");
        if (csvFile == NULL)
        {
            perror("Failed to open the CSV output file!");
            return 1;
        }
        // write the CSV header
        fprintf(csvFile, "url,image,name,price\n");
        // write each product's data to the CSV file
        for (int i = 0; i < productCount; i++)
        {
            fprintf(csvFile, "%s,%s,%s,%s\n", products[i].url, products[i].image, products[i].name, products[i].price);
        }
      
        // close the CSV file
        fclose(csvFile);
      
        // free the resources associated with each product
        for (int i = 0; i < productCount; i++)
        {
            free(products[i].url);
            free(products[i].image);
            free(products[i].name);
            free(products[i].price);
        }
      
        return 0;
    }

This C web scraping script crawls the paginated product listing, getting the data from each product on every page it visits. Run it, and the resulting CSV file will also contain the products discovered on the newly visited pages.

Congrats, you just reached your data extraction goal!

Avoid Getting Blocked

Data is the new oil, and companies know that. That's why many websites protect their data with anti-scraping measures, which can block requests coming from automated software like your C scraper.

Take, for example, the G2.com site, which uses the Cloudflare WAF to prevent bots from accessing its pages, and try to make a request to it:

scraper.c
    struct CURLResponse response = GetRequest(curl_handle, "https://www.g2.com/products/asana/reviews");
    printf("%s\n", response.html)

That'll print the following anti-bot page:

Output
    <!DOCTYPE html>
    <html lang="en-US">
      <head>
        <title>Just a moment...</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <meta http-equiv="X-UA-Compatible" content="IE=Edge">
        <meta name="robots" content="noindex,nofollow">
        <meta name="viewport" content="width=device-width,initial-scale=1">
        <link href="/cdn-cgi/styles/challenges.css" rel="stylesheet">
      </head>
    <!-- Omitted for brevity... -->

Anti-bot measures represent the biggest challenge when performing web scraping in C. There are, of course, some solutions. Find out more in our in-depth guide on how to do web scraping without getting blocked.

At the same time, most of those techniques are tricks that work only for a while or not consistently. A better alternative to avoid any blocks is ZenRows, a full-featured web scraping API that provides premium proxies, headless browser capabilities, and a complete anti-bot toolkit.

Follow the steps below to get started with ZenRows:

Sign up for free to get your 1,000 free credits, and you'll land on the Request Builder page.

building a scraper with zenrows

Paste your target URL (https://www.g2.com/products/asana/reviews). Then, activate "Premium Proxies" and enable the "JS Rendering" boost mode.

On the right of the screen, select the "cURL" option and then the "API" connection mode. Next, pass the generated URL to your GetRequest() function:

scraper.c
    struct CURLResponse response = GetRequest(curl_handle, "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fasana%2Freviews&js_render=true&premium_proxy=true");
    printf("%s\n", response.html)

That snippet will result in the following output:

Output
    <!DOCTYPE html>
    <head>
      <meta charset="utf-8" />
      <link href="https://www.g2.com/assets/favicon-fdacc4208a68e8ae57a80bf869d155829f2400fa7dd128b9c9e60f07795c4915.ico" rel="shortcut icon" type="image/x-icon" />
      <title>Asana Reviews 2023: Details, Pricing, &amp; Features | G2</title>
    <!-- omitted for brevity ... -->

Wow! Bye-bye, anti-bot limitations!
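
One practical note: the ZenRows URL above embeds the target address percent-encoded. If you'd rather build that query string at runtime than paste it, libcurl's curl_easy_escape() can do the encoding for you. A sketch with a placeholder API key:

scraper.c
    // percent-encode the target URL and build the ZenRows request URL
    char *encodedTarget = curl_easy_escape(curl_handle, "https://www.g2.com/products/asana/reviews", 0);
    char apiUrl[1024];
    snprintf(apiUrl, sizeof(apiUrl),
             "https://api.zenrows.com/v1/?apikey=%s&url=%s&js_render=true&premium_proxy=true",
             "<YOUR_ZENROWS_API_KEY>", encodedTarget);
    curl_free(encodedTarget);
    
    struct CURLResponse response = GetRequest(curl_handle, apiUrl);
    printf("%s\n", response.html);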

Render JavaScript: Headless Browser Scraping in C

Many modern pages rely on JavaScript for rendering or data retrieval. To scrape them, you need a tool that can execute JS: a headless browser. The problem is that, as of this writing, there's no headless browser library for C.

The closest project you can find is webdriverxx, but it only works with C++. You can explore our C++ web scraping tutorial to learn more. But if you don't want to change the programming language, the solution is to rely on ZenRows' JS rendering capabilities.

ZenRows works with C and any other programming language, and it can render JavaScript. It also offers JavaScript actions to interact with pages as a human user would. You don't need to adopt a different language to deal with dynamic content pages in C.

Conclusion

This step-by-step tutorial taught you how to build a C web scraping application. You started from the basics and then dug into more complex topics. You have become a web scraping C ninja!

Now, you know:

  • Why C is great for efficient scraping.
  • The basics of scraping with C.
  • How to do web crawling in C.
  • How to use C to deal with JavaScript-rendered sites.

However, no matter how sophisticated your script is, anti-scraping technologies can still block it. Bypass them all with ZenRows, a scraping tool with the best built-in anti-bot bypass features on the market. A single API call allows you to get your desired data.

Ready to get started?

Up to 1,000 URLs for free are waiting for you