C is one of the most efficient programming languages on the planet, and its performance makes it ideal for web scraping, which involves tons of pages or very large ones! In this step-by-step tutorial, you'll learn how to do web scraping in C with the libcurl and libxml2 libraries.
Let's dive in!
Can You Do Web Scraping with C?
When you think of the online world, C isn't the first language that comes to mind. For web scraping, most developers prefer Python for its extensive ecosystem of packages, or JavaScript in Node.js for its large community.
At the same time, C is a viable option for doing web scraping, especially when resource use is critical. A web scraping application in C can achieve extreme performance thanks to the low-level nature of the language.
Learn more about the best programming languages for web scraping.
How to Do Web Scraping in C
Web scraping using C involves three simple steps:
- Download the HTML content of the target page with libcurl.
- Parse the HTML and scrape data from it with the HTML parser libxml2.
- Export the collected data to a file.
As a target site, we'll use ScrapingCourse.com, a demo website with e-commerce features:
The C scraper you're going to build will be able to retrieve all product data from each page of the site.
Let's build a C web scraping program!
Step 1: Install the Necessary Tools
Coding in C requires an environment for compilation and execution. On Windows, rely on Visual Studio. On macOS or Linux, go for Visual Studio Code with the C/C++ extension. Open the IDE and follow the instructions to create a C project based on your local compiler.
Then, install the C package manager vcpkg and set it up in Visual Studio as explained in the official guide. This package manager allows you to install the dependencies required to build a web scraper in C:
- libcurl: An open-source and easy-to-use HTTP client for C built on top of cURL.
- libxml2: An HTML and XML parser with a complete element selection API based on XPath.
libcurl helps you retrieve the HTML of the target pages, which you can then parse with libxml2 to extract the desired data.
To install libcurl and libxml2, run the command below in the root folder of the project:
vcpkg install curl libxml2
Fantastic! You're now fully set up.
Time to initialize your web scraping C script. Create a scraper.c file in your project as follows. This is the simplest possible C program, but its main() function will soon contain some scraping logic.
#include <stdio.h>
#include <stdlib.h>
int main() {
printf("Hello, World!\n");
return 0;
}
Import the two libraries installed earlier by adding the three lines below at the top of the scraper.c file. The first import is for libcurl, while the other two come from libxml2. In detail, HTMLparser.h exposes functions to parse an HTML document, and xpath.h lets you select the desired elements from it.
#include <curl/curl.h>
#include "libxml/HTMLparser.h"
#include "libxml/xpath.h"
Great! You're now ready to learn the basics of web scraping with C!
Step 2: Get the HTML of Your Target Webpage
Requests with libcurl involve boilerplate operations you don't want to repeat every time. Encapsulate them in a reusable function that:
- Receives a cURL instance as a parameter.
- Uses it to make an HTTP GET request to the URL passed as a parameter.
- Returns the HTML document produced by the server inside a custom CURLResponse data structure.
This is how:
struct CURLResponse
{
char *html;
size_t size;
};
static size_t WriteHTMLCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
size_t realsize = size * nmemb;
struct CURLResponse *mem = (struct CURLResponse *)userp;
char *ptr = realloc(mem->html, mem->size + realsize + 1);
if (!ptr)
{
printf("Not enough memory available (realloc returned NULL)\n");
return 0;
}
mem->html = ptr;
memcpy(&(mem->html[mem->size]), contents, realsize);
mem->size += realsize;
mem->html[mem->size] = 0;
return realsize;
}
struct CURLResponse GetRequest(CURL *curl_handle, const char *url)
{
CURLcode res;
struct CURLResponse response;
// initialize the response
response.html = malloc(1);
response.size = 0;
// specify URL to GET
curl_easy_setopt(curl_handle, CURLOPT_URL, url);
// send all data returned by the server to WriteHTMLCallback
curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteHTMLCallback);
// pass "response" to the callback function
curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&response);
// set a User-Agent header
curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36");
// perform the GET request
res = curl_easy_perform(curl_handle);
// check for request errors
if (res != CURLE_OK)
{
fprintf(stderr, "GET request failed: %s\n", curl_easy_strerror(res));
}
return response;
}
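Depending on the target site, the server may reply with an HTTP redirect instead of the page itself. If you want the handle to follow redirects automatically, you can set one extra option alongside the others in GetRequest(). This is an optional addition, not part of the original function:
// follow HTTP 3xx redirects automatically (optional addition)
curl_easy_setopt(curl_handle, CURLOPT_FOLLOWLOCATION, 1L);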
Now, use GetRequest() in the main() function of scraper.c to retrieve the target HTML document as a char *:
#include <stdio.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
// ...
// struct CURLResponse GetRequest(CURL *curl_handle, const char *url) ...
int main() {
// initialize curl globally
curl_global_init(CURL_GLOBAL_ALL);
// initialize a CURL instance
CURL *curl_handle = curl_easy_init();
// retrieve the HTML document of the target page
struct CURLResponse response = GetRequest(curl_handle, "https://www.scrapingcourse.com/ecommerce/");
// print the HTML content
printf("%s\n", response.html);
// scraping logic...
// cleanup the curl instance
curl_easy_cleanup(curl_handle);
// cleanup the curl resources
curl_global_cleanup();
return 0;
}
That's the initial version of your script. Compile and run it, and you'll see the following output in your terminal:
<!DOCTYPE html>
<html lang="en-US">
<head>
<!--- ... --->
<title>Ecommerce Test Site to Learn Web Scraping – ScrapingCourse.com</title>
<!--- ... --->
</head>
<body class="home archive ...">
<p class="woocommerce-result-count">Showing 1–16 of 188 results</p>
<ul class="products columns-4">
<!--- ... --->
</ul>
</body>
</html>
Wonderful! That's the HTML code of the target page!
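Before moving on, note that curl_easy_init() returns NULL when it can't create a handle, and the script above doesn't check for that. Below is a minimal defensive version of the initialization; the check is an addition, not part of the tutorial code:
// make sure the CURL instance was actually created
CURL *curl_handle = curl_easy_init();
if (!curl_handle)
{
    fprintf(stderr, "Failed to initialize the CURL instance\n");
    curl_global_cleanup();
    return 1;
}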
Step 3: Extract Specific Data from the Page
After retrieving the HTML code, feed it to libxml2: htmlReadMemory() parses the HTML char * content and produces a tree you can explore via XPath expressions.
htmlDocPtr doc = htmlReadMemory(response.html, (int)response.size, NULL, NULL, HTML_PARSE_NOERROR);
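htmlReadMemory() returns NULL when it can't build a document tree from the buffer, so a quick check is cheap insurance. A minimal sketch, an addition that isn't part of the original script:
// stop early if libxml2 couldn't build the document tree
if (doc == NULL)
{
    fprintf(stderr, "Failed to parse the HTML document\n");
    free(response.html);
    curl_easy_cleanup(curl_handle);
    curl_global_cleanup();
    return 1;
}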
The next step is to define an effective selection strategy. To do so, you need to inspect the target site and familiarize yourself with its structure.
Open the target site in the browser, right-click on a product HTML node, and choose the "Inspect" option. The following DevTools window will open:
Take a look at the DOM of the page and note that all products are <li> elements with the product class. Thus, you can retrieve them all with the XPath query below:
//li[contains(@class, 'product')]
Apply the XPath selector in libxml2 to retrieve all the product HTML elements. xmlXPathNewContext() sets the XPath context to the entire document, while xmlXPathEvalExpression() applies the selection strategy defined above.
xmlXPathContextPtr context = xmlXPathNewContext(doc);
xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);
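Before iterating over the result, it's worth verifying that the evaluation actually matched something: if the returned object or its node set were NULL, dereferencing them in the loop would crash the scraper. A minimal sketch of the check, an addition that isn't part of the tutorial code:
// make sure the XPath query returned a non-empty node set
if (productHTMLElements == NULL || productHTMLElements->nodesetval == NULL || productHTMLElements->nodesetval->nodeNr == 0)
{
    fprintf(stderr, "No product elements found on the page\n");
    return 1; // cleanup omitted for brevity
}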
You can learn about XPath for web scraping in our tutorial.
Note that each product on the page contains this information:
- A link to the detail page in an <a>.
- An image in an <img>.
- A name in an <h2>.
- A price in a <span>.
To scrape that data and keep track of it, define a custom data structure at the top of scraper.c with typedef. C doesn't support classes, but it has structs: collections of data fields grouped under the same name.
typedef struct
{
char *url;
char *image;
char *name;
char *price;
} Product;
There are several products on a single pagination page, so you'll need an array of Product elements along with a counter that tracks how many have been scraped:
Product products[MAX_PRODUCTS];
int productCount = 0;
MAX_PRODUCTS is a macro storing the maximum number of products on a page:
#define MAX_PRODUCTS 16
Time to iterate over the selected product nodes and extract the desired info from each of them. At the end of the for loop, products will contain all the product data of interest!
for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
{
// get the current element of the loop
xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];
// set the context to restrict XPath selectors
// to the children of the current element
xmlXPathSetContextNode(productHTMLElement, context);
xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
char *image = (char *)(xmlGetProp(imageHTMLElement, (xmlChar *)"src"));
xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
char *name = (char *)(xmlNodeGetContent(nameHTMLElement));
xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
char *price = (char *)(xmlNodeGetContent(priceHTMLElement));
// store the scraped data in a Product instance
Product product;
product.url = strdup(url);
product.image = strdup(image);
product.name = strdup(name);
product.price = strdup(price);
// free up the resources you no longer need
free(url);
free(image);
free(name);
free(price);
// add a new product to the array
products[productCount] = product;
productCount++;
}
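One extra precaution: products is a fixed-size array, so if a page ever contained more than MAX_PRODUCTS elements, the loop above would write past its end. A minimal guard you could place at the top of the loop body, an addition that isn't part of the original loop:
// stop collecting once the fixed-size array is full
if (productCount >= MAX_PRODUCTS)
{
    break;
}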
After the loop, remember to free the resources you allocated along the way:
free(response.html);
// free up libxml2 resources
xmlXPathFreeObject(productHTMLElements);
xmlXPathFreeContext(context);
xmlFreeDoc(doc);
xmlCleanupParser();
The current scraper.c file for C web scraping contains:
#include <stdio.h>
#include <stdlib.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
#define MAX_PRODUCTS 16
// initialize a data structure to
// store the scraped data
typedef struct
{
char *url;
char *image;
char *name;
char *price;
} Product;
// ...
// struct CURLResponse GetRequest(CURL *curl_handle, const char *url) ...
int main(void)
{
// initialize curl globally
curl_global_init(CURL_GLOBAL_ALL);
// initialize a CURL instance
CURL *curl_handle = curl_easy_init();
// initialize the array that will contain
// the scraped data
Product products[MAX_PRODUCTS];
int productCount = 0;
// get the HTML document associated with the page
struct CURLResponse response = GetRequest(curl_handle, "https://www.scrapingcourse.com/ecommerce/");
// parse the HTML document returned by the server
htmlDocPtr doc = htmlReadMemory(response.html, (int)response.size, NULL, NULL, HTML_PARSE_NOERROR);
xmlXPathContextPtr context = xmlXPathNewContext(doc);
// get the product HTML elements on the page
xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);
// iterate over them and scrape data from each of them
for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
{
// get the current element of the loop
xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];
// set the context to restrict XPath selectors
// to the children of the current element
xmlXPathSetContextNode(productHTMLElement, context);
xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
char *image = (char *)(xmlGetProp(imageHTMLElement, (xmlChar *)"src"));
xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
char *name = (char *)(xmlNodeGetContent(nameHTMLElement));
xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
char *price = (char *)(xmlNodeGetContent(priceHTMLElement));
// store the scraped data in a Product instance
Product product;
product.url = strdup(url);
product.image = strdup(image);
product.name = strdup(name);
product.price = strdup(price);
// free up the resources you no longer need
free(url);
free(image);
free(name);
free(price);
// add a new product to the array
products[productCount] = product;
productCount++;
}
// free up the allocated resources
free(response.html);
xmlXPathFreeObject(productHTMLElements);
xmlXPathFreeContext(context);
xmlFreeDoc(doc);
xmlCleanupParser();
// cleanup the curl instance
curl_easy_cleanup(curl_handle);
// cleanup the curl resources
curl_global_cleanup();
return 0;
}
Good job! Now that you know how to extract data from HTML in C, all that remains is exporting the scraped data. You'll see how in the next step, together with the final code.
Step 4: Export Data to CSV
Right now, the scraped data is stored in an array of C structs. That's not the best format to share data with other users. Instead, export it to a more useful format, such as CSV.
And you don't even need an extra library to achieve that. All you have to do is open a .csv file, convert the Product instances to CSV records, and append them to the file:
// open a CSV file for writing
FILE *csvFile = fopen("products.csv", "w");
if (csvFile == NULL)
{
perror("Failed to open the CSV output file!");
return 1;
}
// write the CSV header
fprintf(csvFile, "url,image,name,price\n");
// write each product's data to the CSV file
for (int i = 0; i < productCount; i++)
{
fprintf(csvFile, "%s,%s,%s,%s\n", products[i].url, products[i].image, products[i].name, products[i].price);
}
// close the CSV file
fclose(csvFile);
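Note that this simple export assumes none of the scraped values contain commas, quotes, or newlines. If they might, wrap each field in double quotes and double any embedded quotes, as the CSV format requires. Here's a minimal helper sketch; WriteCsvField is a hypothetical function, not part of the tutorial code:
// write a single CSV field, quoting it and doubling embedded quotes
void WriteCsvField(FILE *csvFile, const char *value)
{
    fputc('"', csvFile);
    for (const char *p = value; *p != '\0'; ++p)
    {
        if (*p == '"')
        {
            fputc('"', csvFile); // CSV escapes a quote by doubling it
        }
        fputc(*p, csvFile);
    }
    fputc('"', csvFile);
}
You would then call it once per field and print the comma and newline separators yourself, instead of relying on a single fprintf().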
Remember to free up the memory allocated for the struct fields:
for (int i = 0; i < productCount; i++)
{
free(products[i].url);
free(products[i].image);
free(products[i].name);
free(products[i].price);
}
Put it all together, and you'll get the final code for your C web scraping script:
#include <stdio.h>
#include <stdlib.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
#define MAX_PRODUCTS 16
// initialize a data structure to
// store the scraped data
typedef struct
{
char *url;
char *image;
char *name;
char *price;
} Product;
struct CURLResponse
{
char *html;
size_t size;
};
static size_t
WriteHTMLCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
size_t realsize = size * nmemb;
struct CURLResponse *mem = (struct CURLResponse *)userp;
char *ptr = realloc(mem->html, mem->size + realsize + 1);
if (!ptr)
{
printf("Not enough memory available (realloc returned NULL)\n");
return 0;
}
mem->html = ptr;
memcpy(&(mem->html[mem->size]), contents, realsize);
mem->size += realsize;
mem->html[mem->size] = 0;
return realsize;
}
struct CURLResponse GetRequest(CURL *curl_handle, const char *url)
{
CURLcode res;
struct CURLResponse response;
// initialize the response
response.html = malloc(1);
response.size = 0;
// specify URL to GET
curl_easy_setopt(curl_handle, CURLOPT_URL, url);
// send all data returned by the server to WriteHTMLCallback
curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteHTMLCallback);
// pass "response" to the callback function
curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&response);
// set a User-Agent header
curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36");
// perform the GET request
res = curl_easy_perform(curl_handle);
// check for request errors
if (res != CURLE_OK)
{
fprintf(stderr, "GET request failed: %s\n", curl_easy_strerror(res));
}
return response;
}
int main(void)
{
// initialize curl globally
curl_global_init(CURL_GLOBAL_ALL);
// initialize a CURL instance
CURL *curl_handle = curl_easy_init();
// initialize the array that will contain
// the scraped data
Product products[MAX_PRODUCTS];
int productCount = 0;
// get the HTML document associated with the page
struct CURLResponse response = GetRequest(curl_handle, "https://www.scrapingcourse.com/ecommerce/");
// parse the HTML document returned by the server
htmlDocPtr doc = htmlReadMemory(response.html, (int)response.size, NULL, NULL, HTML_PARSE_NOERROR);
xmlXPathContextPtr context = xmlXPathNewContext(doc);
// get the product HTML elements on the page
xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);
// iterate over them and scrape data from each of them
for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
{
// get the current element of the loop
xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];
// set the context to restrict XPath selectors
// to the children of the current element
xmlXPathSetContextNode(productHTMLElement, context);
xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
char *image = (char *)(xmlGetProp(imageHTMLElement, (xmlChar *)"src"));
xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
char *name = (char *)(xmlNodeGetContent(nameHTMLElement));
xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
char *price = (char *)(xmlNodeGetContent(priceHTMLElement));
// store the scraped data in a Product instance
Product product;
product.url = strdup(url);
product.image = strdup(image);
product.name = strdup(name);
product.price = strdup(price);
// free up the resources you no longer need
free(url);
free(image);
free(name);
free(price);
// add a new product to the array
products[productCount] = product;
productCount++;
}
// free up the allocated resources
free(response.html);
xmlXPathFreeObject(productHTMLElements);
xmlXPathFreeContext(context);
xmlFreeDoc(doc);
xmlCleanupParser();
// cleanup the curl instance
curl_easy_cleanup(curl_handle);
// cleanup the curl resources
curl_global_cleanup();
// open a CSV file for writing
FILE *csvFile = fopen("products.csv", "w");
if (csvFile == NULL)
{
perror("Failed to open the CSV output file!");
return 1;
}
// write the CSV header
fprintf(csvFile, "url,image,name,price\n");
// write each product's data to the CSV file
for (int i = 0; i < productCount; i++)
{
fprintf(csvFile, "%s,%s,%s,%s\n", products[i].url, products[i].image, products[i].name, products[i].price);
}
// close the CSV file
fclose(csvFile);
// free the resources associated with each product
for (int i = 0; i < productCount; i++)
{
free(products[i].url);
free(products[i].image);
free(products[i].name);
free(products[i].price);
}
return 0;
}
Compile the scraper application and run it. The products.csv file below will appear in your project's folder, containing the products from the first page of ScrapingCourse.com.
Amazing! You just learned the basics of web scraping with C, but there's still a lot more to cover. For example, you still need to get the data from the remaining e-commerce pages. Keep reading to become a C data scraping expert.
Advanced Web Scraping in C
Scraping requires more than the basics. Time to dig into the advanced concepts of web scraping in C!
Scrape Multiple Pages: Web Crawling with C
The script built above retrieves products from a single page. However, the target site consists of several pages. To scrape them all, you need to go through each of them with web crawling. In other words, you'd have to discover all the links on the site and visit them automatically, which involves sets, other support data structures, and custom logic to avoid visiting a page twice.
Implementing web crawling with automatic page discovery in C is possible but also complex and error-prone. To avoid a headache, you should go for a smart approach. Take a look at the URLs of the pagination pages. These all follow the format below:
https://www.scrapingcourse.com/ecommerce/page/<page>/
As there are 12 pages on the site, scrape them all by applying the following scraping logic to each pagination URL:
for (int page = 1; page <= NUM_PAGES; ++page)
{
// build the URL of the target page
char url[256];
snprintf(url, sizeof(url), "https://www.scrapingcourse.com/ecommerce/page/%d/", page);
// get the HTML document associated with the current page
struct CURLResponse response = GetRequest(curl_handle, &url);
// scraping logic...
}
NUM_PAGES is a macro that contains the number of pages the spider will visit. You'll also need to adapt MAX_PRODUCTS accordingly:
#define NUM_PAGES 12
#define MAX_PRODUCTS (NUM_PAGES * 16)
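Since the loop fires one request per page in quick succession, you may also want to pause briefly between iterations to be gentle with the target server. A minimal, optional sketch using the POSIX sleep() call (on Windows, use Sleep() from windows.h instead):
#include <unistd.h> // for sleep() on macOS/Linux
// ... at the end of each loop iteration ...
sleep(1); // wait one second before requesting the next page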
The scraper.c file will now contain:
#include <stdio.h>
#include <stdlib.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
#define NUM_PAGES 12 // all the pagination pages on the site
#define MAX_PRODUCTS (NUM_PAGES * 16)
// initialize a data structure to
// store the scraped data
typedef struct
{
char *url;
char *image;
char *name;
char *price;
} Product;
struct CURLResponse
{
char *html;
size_t size;
};
static size_t
WriteHTMLCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
size_t realsize = size * nmemb;
struct CURLResponse *mem = (struct CURLResponse *)userp;
char *ptr = realloc(mem->html, mem->size + realsize + 1);
if (!ptr)
{
printf("Not enough memory available (realloc returned NULL)\n");
return 0;
}
mem->html = ptr;
memcpy(&(mem->html[mem->size]), contents, realsize);
mem->size += realsize;
mem->html[mem->size] = 0;
return realsize;
}
struct CURLResponse GetRequest(CURL *curl_handle, const char *url)
{
CURLcode res;
struct CURLResponse response;
// initialize the response
response.html = malloc(1);
response.size = 0;
// specify URL to GET
curl_easy_setopt(curl_handle, CURLOPT_URL, url);
// send all data returned by the server to WriteHTMLCallback
curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteHTMLCallback);
// pass "response" to the callback function
curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&response);
// set a User-Agent header
curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36");
// perform the GET request
res = curl_easy_perform(curl_handle);
// check for request errors
if (res != CURLE_OK)
{
fprintf(stderr, "GET request failed: %s\n", curl_easy_strerror(res));
}
return response;
}
int main(void)
{
// initialize curl globally
curl_global_init(CURL_GLOBAL_ALL);
// initialize a CURL instance
CURL *curl_handle = curl_easy_init();
// initialize the array that will contain
// the scraped data
Product products[MAX_PRODUCTS];
int productCount = 0;
// iterate over the pages to scrape
for (int page = 1; page <= NUM_PAGES; ++page)
{
// build the URL of the target page
char url[256];
snprintf(url, sizeof(url), "https://www.scrapingcourse.com/ecommerce/page/%d/", page);
// get the HTML document associated with the current page
struct CURLResponse response = GetRequest(curl_handle, &url);
// parse the HTML document returned by the server
htmlDocPtr doc = htmlReadMemory(response.html, (int)response.size, NULL, NULL, HTML_PARSE_NOERROR);
xmlXPathContextPtr context = xmlXPathNewContext(doc);
// get the product HTML elements on the page
xmlXPathObjectPtr productHTMLElements = xmlXPathEvalExpression((xmlChar *)"//li[contains(@class, 'product')]", context);
// iterate over them and scrape data from each of them
for (int i = 0; i < productHTMLElements->nodesetval->nodeNr; ++i)
{
// get the current element of the loop
xmlNodePtr productHTMLElement = productHTMLElements->nodesetval->nodeTab[i];
// set the context to restrict XPath selectors
// to the children of the current element
xmlXPathSetContextNode(productHTMLElement, context);
xmlNodePtr urlHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a", context)->nodesetval->nodeTab[0];
char *url = (char *)xmlGetProp(urlHTMLElement, (xmlChar *)"href");
xmlNodePtr imageHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/img", context)->nodesetval->nodeTab[0];
char *image = (char *)(xmlGetProp(imageHTMLElement, (xmlChar *)"src"));
xmlNodePtr nameHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/h2", context)->nodesetval->nodeTab[0];
char *name = (char *)(xmlNodeGetContent(nameHTMLElement));
xmlNodePtr priceHTMLElement = xmlXPathEvalExpression((xmlChar *)".//a/span", context)->nodesetval->nodeTab[0];
char *price = (char *)(xmlNodeGetContent(priceHTMLElement));
// store the scraped data in a Product instance
Product product;
product.url = strdup(url);
product.image = strdup(image);
product.name = strdup(name);
product.price = strdup(price);
// free up the resources you no longer need
free(url);
free(image);
free(name);
free(price);
// add a new product to the array
products[productCount] = product;
productCount++;
}
// free up the resources allocated for the current page
free(response.html);
xmlXPathFreeObject(productHTMLElements);
xmlXPathFreeContext(context);
xmlFreeDoc(doc);
}
// cleanup the libxml2 parser
xmlCleanupParser();
// cleanup the curl instance
curl_easy_cleanup(curl_handle);
// cleanup the curl resources
curl_global_cleanup();
// open a CSV file for writing
FILE *csvFile = fopen("products.csv", "w");
if (csvFile == NULL)
{
perror("Failed to open the CSV output file!");
return 1;
}
// write the CSV header
fprintf(csvFile, "url,image,name,price\n");
// write each product's data to the CSV file
for (int i = 0; i < productCount; i++)
{
fprintf(csvFile, "%s,%s,%s,%s\n", products[i].url, products[i].image, products[i].name, products[i].price);
}
// close the CSV file
fclose(csvFile);
// free the resources associated with each product
for (int i = 0; i < productCount; i++)
{
free(products[i].url);
free(products[i].image);
free(products[i].name);
free(products[i].price);
}
return 0;
}
This C web scraping script crawls the entire site, getting the data from each product on every pagination page. Run it, and the resulting CSV file will contain all the products discovered on the newly visited pages, too.
Congrats, you just reached your data extraction goal!
Avoid Getting Blocked
Data is the new oil, and companies know that. That's why many websites protect their data with anti-scraping measures, which can block requests coming from automated software like your C scraper.
Take, for example, the G2.com site, which uses the Cloudflare WAF to prevent bots from accessing its pages, and try to make a request to it:
struct CURLResponse response = GetRequest(curl_handle, "https://www.g2.com/products/asana/reviews");
printf("%s\n", response.html)
That'll print the following anti-bot page:
<!DOCTYPE html>
<html lang="en-US">
<head>
<title>Just a moment...</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<meta name="robots" content="noindex,nofollow">
<meta name="viewport" content="width=device-width,initial-scale=1">
<link href="/cdn-cgi/styles/challenges.css" rel="stylesheet">
</head>
<!-- Omitted for brevity... -->
Anti-bot measures represent the biggest challenge when performing web scraping in C. There are, of course, some solutions. Find out more in our in-depth guide on how to do web scraping without getting blocked.
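A first, limited mitigation you can apply directly in libcurl is sending a few extra browser-like headers along with the User-Agent. This only helps against basic checks and won't get past a WAF like Cloudflare; the header values below are just examples:
// attach a few browser-like request headers (limited effectiveness)
struct curl_slist *headers = NULL;
headers = curl_slist_append(headers, "Accept: text/html,application/xhtml+xml");
headers = curl_slist_append(headers, "Accept-Language: en-US,en;q=0.9");
curl_easy_setopt(curl_handle, CURLOPT_HTTPHEADER, headers);
// ... perform the request, then release the header list
curl_slist_free_all(headers);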
At the same time, most of those techniques are tricks that work only for a while or not consistently. A better alternative to avoid any blocks is ZenRows, a full-featured web scraping API that provides premium proxies, headless browser capabilities, and a complete anti-bot toolkit.
Follow the steps below to get started with ZenRows:
Sign up to get your free 1,000 credits, and you'll land on the Request Builder page.
Paste your target URL (https://www.g2.com/products/asana/reviews). Then, activate "Premium Proxies" and enable the "JS Rendering" boost mode.
On the right of the screen, select the "cURL" option and then the "API" connection mode. Next, pass the generated URL to your GetRequest() function:
struct CURLResponse response = GetRequest(curl_handle, "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fasana%2Freviews&js_render=true&premium_proxy=true");
printf("%s\n", response.html)
That snippet will result in the following output:
<!DOCTYPE html>
<head>
<meta charset="utf-8" />
<link href="https://www.g2.com/assets/favicon-fdacc4208a68e8ae57a80bf869d155829f2400fa7dd128b9c9e60f07795c4915.ico" rel="shortcut icon" type="image/x-icon" />
<title>Asana Reviews 2023: Details, Pricing, & Features | G2</title>
<!-- omitted for brevity ... -->
Wow! Bye-bye, anti-bot limitations!
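One practical tip: instead of hardcoding the API key in the source, you can read it from an environment variable and build the request URL at runtime. Here's a minimal sketch; the ZENROWS_API_KEY variable name is just an example, and <stdlib.h> is already included in scraper.c:
// read the API key from the environment instead of hardcoding it
const char *apiKey = getenv("ZENROWS_API_KEY");
if (apiKey == NULL)
{
    fprintf(stderr, "ZENROWS_API_KEY is not set\n");
    return 1;
}
// build the full request URL around the already-encoded target URL
char zenrowsUrl[1024];
snprintf(zenrowsUrl, sizeof(zenrowsUrl),
         "https://api.zenrows.com/v1/?apikey=%s&url=%s&js_render=true&premium_proxy=true",
         apiKey, "https%3A%2F%2Fwww.g2.com%2Fproducts%2Fasana%2Freviews");
struct CURLResponse response = GetRequest(curl_handle, zenrowsUrl);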
Render JavaScript: Headless Browser Scraping in C
Most pages use JavaScript for rendering or data retrieval. To scrape them, you need a tool that can execute JS: a headless browser. The problem is that, as of this writing, there's no headless browser library for C.
The closest project you can find is webdriverxx, but it only works with C++. You can explore our C++ web scraping tutorial to learn more. But if you don't want to change the programming language, the solution is to rely on ZenRows' JS rendering capabilities.
ZenRows works with C and any other programming language, and it can render JavaScript. It also offers JavaScript actions to interact with pages as a human user would. So you don't need to adopt a different language to deal with dynamic-content pages in C.
Conclusion
This step-by-step tutorial taught you how to build a C web scraping application. You started from the basics and then dug into more complex topics. You have become a web scraping C ninja!
Now, you know:
- Why C is great for efficient scraping.
- The basics of scraping with C.
- How to do web crawling in C.
- How to use C to deal with JavaScript-rendered sites.
However, no matter how sophisticated your script is, anti-scraping technologies can still block it. Bypass them all with ZenRows, a scraping tool with the best built-in anti-bot bypass features on the market. A single API call allows you to get your desired data.