How to Use Curl_cffi for Web Scraping

July 10, 2024 · 7 min read

Table of contents

Why use curl_cffi?
How to build a curl_cffi scraper
- Prerequisites
- Get the page's HTML
- Extract product data
- Export the data as a CSV file
Advanced Web scraping techniques
- Integrate with asyncio
- Handling sessions and authentication
Limitations and best alternative
Conclusion

Curl_cffi, a modified cURL version that mimics browsers, can boost your scraper's ability to avoid blocks.

In this tutorial, you'll learn how curl_cffi works and how to use it for content extraction, from basic tasks to concurrent requests and handling sessions and authentication.

Why Use Curl_cffi for Web Scraping?

Curl_cffi is a Python binding for curl-impersonate, a patched build of cURL that mimics popular browsers to avoid anti-bot detection. Curl_cffi makes curl-impersonate easier to integrate, letting developers call it directly from Python rather than through a command-line tool.

Since curl_cffi inherits its core features from curl-impersonate, it can mimic popular browsers, including Chrome, Safari, and Edge. However, unlike curl-impersonate, curl_cffi doesn't support Firefox at the time of writing.

One of its anti-bot bypass strategies is replacing easily detectable signals like the cURL OpenSSL library with browser fingerprints like Chrome's BoringSSL. Curl_cffi also modifies TLS and SSL options to resemble a legitimate browser.

Ready to build a scraper with curl_cffi in Python? Follow the sections below.


Tutorial: How to Build Your Scraper With Curl_cffi

Curl_cffi is an alternative to popular HTTP clients like Python's Requests. Thanks to its browser features, it boasts a higher success rate.

Let's see how it works by extracting product information from Scraping Course, a paginated demo e-commerce website.

[Image: ScrapingCourse.com Ecommerce homepage]

Prerequisites

Curl_cffi is compatible with Python 3+. This tutorial uses Python 3.12.1 on a Windows operating system.

Once you have Python up and running, install curl_cffi. You'll also need BeautifulSoup to extract specific elements. Install both libraries using pip:

                    Terminal
                
pip install curl_cffi --upgrade beautifulsoup4


Now, create a project folder with a scraper.py file and open it via any code editor like VS Code.

Step 1: Get the Page's HTML

In this first step, you'll build a basic scraper with curl_cffi by extracting the target website's full-page HTML. You'll also use curl_cffi's impersonate feature to mimic a mainstream browser.

The code below makes curl_cffi request the target website while impersonating the iOS Safari browser. Pay attention to the impersonate keyword:

                    scraper.py
                
# import the required libraries
from curl_cffi import requests

# add an impersonate parameter
response = requests.get(
    "https://www.scrapingcourse.com/ecommerce/",
    impersonate="safari_ios"
)


Now, extend the code to verify the response status and retrieve the website's full-page HTML:

                    scraper.py
                
# ...

# verify response status
if response.status_code != 200:
    print(f"An error occurred with status {response.status_code}")
else:
    # print the website's HTML if the response status is okay  
    print(response.text)


Combine both snippets to get this final code:

                    scraper.py
                
# import the required libraries
from curl_cffi import requests

# add an impersonate parameter
response = requests.get(
    "https://www.scrapingcourse.com/ecommerce/",
    impersonate="safari_ios"
)

# verify response status
if response.status_code != 200:
    print(f"An error occurred with status {response.status_code}")
else:
    # print the website's HTML if the response status is okay  
    print(response.text)

  
  

  

The code above outputs the target website's HTML. Here's its prettified version:

                    Output
                
<!DOCTYPE html>
<html lang="en-US">
<head>
    <!--- ... --->
  
    <title>Ecommerce Test Site to Learn Web Scraping &#8211; ScrapingCourse.com</title>
    <!--- ... --->
    
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count" id="result-count" data-testid="result-count" data-sorting="count"> Showing 1&ndash;16 of 188 results</p>
    <!--- ... --->
    
    <ul class="products columns-4" id="product-list" data-testid="product-list" data-products="list">
    <!--- ... --->

    </ul>
    
    <!--- ... --->
</body>
</html>

  
  

  

We've just built a basic curl_cffi scraper. Now, let's take it a step further and extract specific elements.
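A single impersonation target won't suit every site, so it can help to try a few in turn. The sketch below is one possible way to do that and is not part of curl_cffi itself: fetch_with_fallback and its default targets are hypothetical, the "safari_ios", "chrome", and "safari" aliases are assumed to be available in your installed curl_cffi version, and the get parameter exists only so the helper can be exercised with a stand-in instead of a real network call.

```python
from typing import Callable, Iterable, Optional


def fetch_with_fallback(
    url: str,
    targets: Iterable[str] = ("safari_ios", "chrome", "safari"),
    get: Optional[Callable] = None,
    timeout: float = 30,
):
    """Try each impersonation target in order and return the first 200 response.

    Returns a (target, response) tuple, or (None, last_response) when no
    target succeeds. `get` defaults to curl_cffi's requests.get.
    """
    if get is None:
        # imported lazily so the helper can be tested with a stand-in `get`
        from curl_cffi import requests

        get = requests.get

    last_response = None
    for target in targets:
        last_response = get(url, impersonate=target, timeout=timeout)
        if last_response.status_code == 200:
            return target, last_response

    # every target failed; hand back the last response for inspection
    return None, last_response
```

Calling fetch_with_fallback("https://www.scrapingcourse.com/ecommerce/") returns the target that worked together with its response, so you can read response.text exactly as in the scraper above.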

Step 2: Extract Product Data

In this section, you'll extract specific product information from the target website. First, you'll obtain the target website's HTML with curl_cffi. Then, you'll use BeautifulSoup as the HTML parser to extract product names, prices, and URLs by their CSS classes.

Let's start by inspecting the website's HTML to expose the target elements. Open the website in a browser like Chrome. Right-click the first product and select Inspect. You'll see that each product is inside a list item (li) tag:

[Image: ScrapingCourse ecommerce homepage with the first product's li element inspected]

To extract product information, add BeautifulSoup to your imported libraries. Modify the else statement from the previous code to parse the web page HTML and extract its product containers:

                    scraper.py
                
# ...
from bs4 import BeautifulSoup

# ...

else:
    # parse the website's HTML with Beautifulsoup
    soup = BeautifulSoup(response.text, features="html.parser")

    # extract all product containers
    products = soup.find_all(class_="product")


Create an empty product_data list to collect the extracted data. Then, use a for loop to iterate through each product container and extract its content by class name. Append the extracted data to the list and print it:

                    scraper.py
                
# ...
   
    product_data = []

    # loop through each product container to extract its content
    for product in products:

        data = {
            "name": product.find(class_="woocommerce-loop-product__title").text,
            "price": product.find(class_="price").text,
            "url": product.find("a").get("href"),
        }

        # append the extracted data to the product data array
        product_data.append(data)

    # output the scraped products (kept inside the else block, where product_data exists)
    print(product_data)


Here's the updated full code:

                    scraper.py
                
# import the required libraries
from curl_cffi import requests
from bs4 import BeautifulSoup

# add an impersonate parameter
response = requests.get(
    "https://www.scrapingcourse.com/ecommerce/",
    impersonate="safari_ios"
)

# verify response status
if response.status_code != 200:
    print(f"An error occurred with status {response.status_code}")
else:
    # parse the website's HTML with Beautifulsoup
    soup = BeautifulSoup(response.text, features="html.parser")

    # extract all product containers
    products = soup.find_all(class_="product")

    # specify an empty array to collect the extracted data
    product_data = []

    # loop through each product container to extract its content
    for product in products:

        data = {
            "name": product.find(class_="woocommerce-loop-product__title").text,
            "price": product.find(class_="price").text,
            "url": product.find("a").get("href"),
        }

        # append the extracted data to the product data array
        product_data.append(data)

    # output the scraped products (kept inside the else block, where product_data exists)
    print(product_data)

  
  

  

Run the code, and you'll get the following result:

                    Output
                
[
    {
        'name': 'Abominable Hoodie', 
        'price': '$69.00', 
        'url': 'https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/'
    },

    # ... other products omitted for brevity

    {
        'name': 'Artemis Running Short', 
        'price': '$45.00', 
        'url': 'https://www.scrapingcourse.com/ecommerce/product/artemis-running-short/'
    }
]

  
  

  

Your curl_cffi Python scraper now scrapes specific product data. Let's export it to a CSV file.
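The scraper above calls .text on whatever find returns, which raises an AttributeError if a product lacks one of the fields. As a more defensive variant, the sketch below uses BeautifulSoup's select and select_one methods with CSS selectors and falls back to None for missing fields. The parse_products helper and SAMPLE_HTML are hypothetical: the markup is a simplified stand-in for the target page that reuses the class names from this tutorial.

```python
from bs4 import BeautifulSoup

# simplified, hypothetical markup modeled on the tutorial's class names
SAMPLE_HTML = """
<ul class="products">
  <li class="product">
    <a href="https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/">
      <h2 class="woocommerce-loop-product__title">Abominable Hoodie</h2>
      <span class="price">$69.00</span>
    </a>
  </li>
  <li class="product">
    <h2 class="woocommerce-loop-product__title">No Link Or Price</h2>
  </li>
</ul>
"""


def parse_products(html):
    """Extract name, price, and URL per product, using None for missing fields."""
    soup = BeautifulSoup(html, "html.parser")
    products = []

    # "li.product" only matches list items that carry the product class
    for item in soup.select("li.product"):
        name = item.select_one(".woocommerce-loop-product__title")
        price = item.select_one(".price")
        link = item.select_one("a[href]")

        products.append(
            {
                "name": name.get_text(strip=True) if name else None,
                "price": price.get_text(strip=True) if price else None,
                "url": link["href"] if link else None,
            }
        )

    return products


print(parse_products(SAMPLE_HTML))
```

In the real scraper, you would pass response.text to parse_products instead of the sample markup.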

Step 3: Export the Data as a CSV File

It's time to export the extracted data to a CSV file. Import Python's built-in csv module. Then, modify the else statement again with the following lines of code, which create a products.csv file and write the extracted product information into it:

                    scraper.py
                
# ...
import csv

    # ...

    # save the data to a CSV file
    keys = product_data[0].keys()
    with open("products.csv", "w", newline="", encoding="utf-8") as output_file:
        dict_writer = csv.DictWriter(output_file, fieldnames=keys)
        dict_writer.writeheader()
        dict_writer.writerows(product_data)
        print("CSV created successfully")

  
  

  

Merge the snippet above with the previous scraper. Your final code should look like this:

                    scraper.py
                
# import the required libraries
from curl_cffi import requests
from bs4 import BeautifulSoup
import csv

# add an impersonate parameter
response = requests.get(
    "https://www.scrapingcourse.com/ecommerce/",
    impersonate="safari_ios"
)

# verify response status
if response.status_code != 200:
    print(f"An error occurred with status {response.status_code}")
else:
    # parse the website's HTML with Beautifulsoup
    soup = BeautifulSoup(response.text, features="html.parser")

    # extract all product containers
    products = soup.find_all(class_="product")

    # specify an empty array to collect the extracted data
    product_data = []

    # loop through each product container to extract its content
    for product in products:

        data={
        "name": product.find(class_="woocommerce-loop-product__title").text,
        "price": product.find(class_="price").text,
        "url": product.find("a").get("href")
        }

        # append the extracted data to the product data array
        product_data.append(data)

    # save the data to a CSV file
    keys = product_data[0].keys()
    with open("products.csv", "w", newline="", encoding="utf-8") as output_file:
        dict_writer = csv.DictWriter(output_file, fieldnames=keys)
        dict_writer.writeheader()
        dict_writer.writerows(product_data)
        print("CSV created successfully")


  
  

  

The code produces a CSV file containing the extracted data. You'll find it in your project's root directory as products.csv:

[Image: products.csv with the ScrapingCourse name, price, and url columns]

Great job! You've just built your first web scraper with curl_cffi and BeautifulSoup in Python.
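The export step above reads its column names from product_data[0], which raises an IndexError if the page returned no products. The following sketch shows one way around that, assuming a fixed column order. The write_products_csv helper and FIELDNAMES constant are hypothetical names, and the demo writes to an in-memory buffer rather than a real file.

```python
import csv
import io

# fixed column order, so the header doesn't depend on the first product
FIELDNAMES = ["name", "price", "url"]


def write_products_csv(products, output_file):
    """Write a list of product dicts to an open text file and return the row count."""
    writer = csv.DictWriter(output_file, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(products)
    return len(products)


# demonstrate with an in-memory buffer instead of a file on disk
buffer = io.StringIO()
rows_written = write_products_csv(
    [
        {
            "name": "Abominable Hoodie",
            "price": "$69.00",
            "url": "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/",
        }
    ],
    buffer,
)
print(buffer.getvalue())
```

To write a real file, pass the object returned by open("products.csv", "w", newline="", encoding="utf-8") as output_file, as in the tutorial code.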

Still, curl_cffi has a few advanced features, including concurrency for scraping multiple pages simultaneously and session management for scraping behind a login. They will significantly enhance your scraper's efficiency, so let's explore them in the next section.

Advanced Web Scraping Techniques With Curl_cffi

With functionalities such as concurrent requests and session management, curl_cffi makes it easy to scrape multiple pages or scrape behind a login.

In this section, you'll learn how to do it.

Integrate With Asyncio for Concurrent Requests

Concurrency speeds up scraping, especially when you need to fetch many pages. Curl_cffi's AsyncSession works with Python's asyncio package to request a list of URLs concurrently instead of one after another.

Our target website, ScrapingCourse.com, is paginated and distributes products across 12 pages.

Each page's URL is formatted like this:

                    Example
                
https://www.scrapingcourse.com/ecommerce/page/<page_number>/


For instance, this is the third page's URL:

                    Example
                
https://www.scrapingcourse.com/ecommerce/page/3/


To confirm, open the website in a browser like Chrome, navigate to the third page, and observe the URL format in the address bar:

[Image: ScrapingCourse page number in the URL]

To scrape all those pages concurrently with curl_cffi, you must supply each page's URL in a list and access them concurrently.
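The page count of 12 follows from the result counter in the HTML shown earlier ("Showing 1–16 of 188 results"): 188 products at 16 per page round up to 12 pages. Here is a small sketch of that arithmetic and of the URL list it leads to. The count_pages and page_urls helpers are hypothetical names, and the code assumes the counter text always contains exactly three integers in that order.

```python
import math
import re

PAGE_URL = "https://www.scrapingcourse.com/ecommerce/page/{page}/"


def count_pages(result_text):
    """Derive the number of pages from text like 'Showing 1–16 of 188 results'."""
    # assumption: the text holds exactly three integers (first, last, total)
    first, last, total = (int(number) for number in re.findall(r"\d+", result_text))
    per_page = last - first + 1
    return math.ceil(total / per_page)


def page_urls(page_count):
    """Build the URL of every page from 1 to page_count inclusive."""
    return [PAGE_URL.format(page=page) for page in range(1, page_count + 1)]


page_count = count_pages("Showing 1–16 of 188 results")
urls = page_urls(page_count)
print(page_count)
print(urls[2])
```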

First, define a scraper function that contains your scraping logic. This function is similar to the previous scraper but accepts a response argument, representing the HTML response returned by each page. It then returns a product data array containing the scraped products:

                    scraper.py
                
# import the required libraries
from bs4 import BeautifulSoup

def scraper(response):
    # parse the website's HTML with Beautifulsoup
    soup = BeautifulSoup(response.text, features="html.parser")

    # extract all product containers
    products = soup.find_all(class_="product")

    # specify an empty array to collect the extracted data
    product_data = []

    # loop through each product container to extract its content
    for product in products:

        data={
        "name": product.find(class_="woocommerce-loop-product__title").text,
        "price": product.find(class_="price").text,
        "url": product.find("a").get("href")
        }

        # append the extracted data to the product data array
        product_data.append(data)

    return product_data

  
  

  

That function doesn't do anything on its own yet. You need to pass it a response object to make it useful. Let's implement concurrency to supply those responses.

Add asyncio and curl_cffi's AsyncSession to your imports. Define a function that opens an asynchronous session as curl. Then, build a list of all 12 product page URLs with a list comprehension:

                    scraper.py
                
# import the required libraries

# ...
import asyncio
from curl_cffi.requests import AsyncSession

async def concurrentScraper():
    async with AsyncSession() as curl:

        # create a list of all the URLs (pages 1 to 12)
        urls = [f"https://www.scrapingcourse.com/ecommerce/page/{i}/" for i in range(1, 13)]


Hint: Increasing the page number iteratively in a list comprehension is useful when dealing with a paginated website with plenty of pages. However, if handling different domains with distinct URLs, you'll have to list the URLs manually.

The next step is to create an empty task list and an empty list to collect the scraped data. Add a request for each URL to the task list using a for loop. Then, await all the tasks together with asyncio.gather to get a list of response objects:

                    scraper.py
                
async def concurrentScraper():

        # ...

        # create an empty task list
        tasks = []

        # an empty array to collect the scraped products
        all_products = []

        # run the concurrent request through asyncio
        for url in urls:
            task = curl.get(url)
            tasks.append(task)

        # dump all responses in an iterable
        responses = await asyncio.gather(*tasks)


Now, run the scraper function on each response in that list. This action applies the scraping logic to all 12 pages.

Since the scraper function already returns a scraped product list, extend the all_products array with newly scraped data to avoid creating a nested list. Print the scraped products:

                    scraper.py
                
async def concurrentScraper():

        # ...

        # run the scraper function through each response
        for response in responses:
            all_products.extend(scraper(response))

        # output all products
        print(all_products)

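The difference between extend and append mentioned above is easy to see in isolation. This stand-alone sketch uses made-up page lists (page_one and page_two are illustrative, not scraped data) to show why append would produce a nested list while extend keeps it flat:

```python
# illustrative per-page results, shaped like the scraper function's return value
page_one = [{"name": "Abominable Hoodie"}, {"name": "Adrienne Trek Jacket"}]
page_two = [{"name": "Zoe Tank"}]

nested = []
flat = []

for page in (page_one, page_two):
    nested.append(page)  # adds the whole page list as a single element
    flat.extend(page)    # adds each product of the page individually

print(len(nested))  # one entry per page
print(len(flat))    # one entry per product
```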

Finally, run the concurrentScraper function using asyncio.

If you're using Windows, include the event loop policy setting, as done below. The default Windows event loop (the proactor loop) doesn't implement the add_reader and add_writer methods that curl_cffi's asynchronous mode relies on, so this setting switches asyncio to the selector event loop, which does.

                    scraper.py
                
if __name__ == "__main__":

    # WindowsSelectorEventLoopPolicy only exists on Windows,
    # so check for it before setting the event loop policy
    if hasattr(asyncio, "WindowsSelectorEventLoopPolicy"):
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

    # run the concurrentScraper function
    asyncio.run(concurrentScraper())


Combine all the snippets. Here's the full code:

                    scraper.py
                
# import the required libraries
from bs4 import BeautifulSoup
import asyncio
from curl_cffi.requests import AsyncSession

def scraper(response):
    # parse the website's HTML with Beautifulsoup
    soup = BeautifulSoup(response.text, features="html.parser")

    # extract all product containers
    products = soup.find_all(class_="product")

    # specify an empty array to collect the extracted data
    product_data = []

    # loop through each product container to extract its content
    for product in products:

        data={
        "name": product.find(class_="woocommerce-loop-product__title").text,
        "price": product.find(class_="price").text,
        "url": product.find("a").get("href")
        }

        # append the extracted data to the product data array
        product_data.append(data)

    return product_data

async def concurrentScraper():
    async with AsyncSession() as curl:

        # create a list of all the URLs (pages 1 to 12)
        urls = [f"https://www.scrapingcourse.com/ecommerce/page/{i}/" for i in range(1, 13)]

        # create an empty task list
        tasks = []

        # an empty array to collect the scraped products
        all_products = []

        # run the concurrent request through asyncio
        for url in urls:
            task = curl.get(url)
            tasks.append(task)
        responses = await asyncio.gather(*tasks)

        # run the scraper function through each response
        for response in responses:
            all_products.extend(scraper(response))

        # output all products
        print(all_products)

if __name__ == "__main__":

    # WindowsSelectorEventLoopPolicy only exists on Windows,
    # so check for it before setting the event loop policy
    if hasattr(asyncio, "WindowsSelectorEventLoopPolicy"):
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

    # run the concurrentScraper function
    asyncio.run(concurrentScraper())

  
  

  

The above code browses all the product pages concurrently to extract the target data. See the result below:

                    Output
                
[
    {'name': 'Abominable Hoodie', 'price': '$69.00', 'url': 'https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/'}, 
    {'name': 'Adrienne Trek Jacket', 'price': '$57.00', 'url': 'https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/'},

    # other products omitted for brevity

    {'name': 'Zoe Tank', 'price': '$29.00', 'url': 'https://www.scrapingcourse.com/ecommerce/product/zoe-tank/'}, 
    {'name': 'Zoltan Gym Tee', 'price': '$29.00', 'url': 'https://www.scrapingcourse.com/ecommerce/product/zoltan-gym-tee/'}
]


Congratulations, you've just scraped multiple pages concurrently with curl_cffi.
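asyncio.gather starts every request at once, which is fine for 12 pages but can overwhelm a server when the URL list grows. One common way to cap the number of in-flight requests is an asyncio.Semaphore. The sketch below also passes return_exceptions=True so that one failed page doesn't discard the others. The fetch_all and fake_fetch names are hypothetical, and fake_fetch is a stand-in that makes no network calls. With curl_cffi, you would pass the get method of an open AsyncSession as fetch instead.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrency=4):
    """Run fetch(url) for every URL with at most max_concurrency in flight.

    Returns two dicts keyed by URL: successful results and raised exceptions.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        # wait for a free slot before starting the request
        async with semaphore:
            return await fetch(url)

    results = await asyncio.gather(
        *(bounded_fetch(url) for url in urls),
        return_exceptions=True,
    )

    succeeded, failed = {}, {}
    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            failed[url] = result
        else:
            succeeded[url] = result
    return succeeded, failed


async def fake_fetch(url):
    """Stand-in for a real request: fails for page 2, succeeds otherwise."""
    await asyncio.sleep(0)
    if url.endswith("/2/"):
        raise RuntimeError(f"simulated failure for {url}")
    return f"html for {url}"


demo_urls = [f"https://example.com/page/{i}/" for i in range(1, 4)]
succeeded, failed = asyncio.run(fetch_all(demo_urls, fake_fetch, max_concurrency=2))
print(sorted(succeeded))
print(sorted(failed))
```

Keeping failures in their own dict makes it straightforward to retry only the pages that didn't load.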

Handling Sessions and Authentication

Session handling helps maintain a user's state in order to scrape behind a login and keep consistent headers across multiple requests.

Let's first check your current cookies at https://httpbin.io/cookies, a web page that returns the cookies sent with your request.

Open that web page in a browser like Chrome. It returns an empty JSON object because you haven't set any cookies yet.

Let's set session cookies with curl_cffi using https://httpbin.io/ as a test website. The endpoint to set cookies on the test website looks like this:

                    Example
                
https://httpbin.io/cookies/set?<cookie_key=cookie_value>


So, assume you want to set k1=v1&k2=v2 as your session cookies. The URL becomes:

                    Example
                
https://httpbin.io/cookies/set?k1=v1&k2=v2


Create a new curl_cffi session in your scraper file and set your session cookies:

                    scraper.py
                
# import the required libraries
from curl_cffi import requests

# create a new session
session = requests.Session()

# the following endpoint makes the server set cookies
session.get("https://httpbin.io/cookies/set?k1=v1&k2=v2")


Now, verify the session cookies parameter on https://httpbin.io/cookies:

                    scraper.py
                
# ...

# retrieve cookies to verify
response = session.get("https://httpbin.io/cookies")

print(response.json())


Merge both snippets. You'll get the complete code:

                    scraper.py
                
# import the required libraries
from curl_cffi import requests

# create a new session
session = requests.Session()

# the following endpoint makes the server set cookies
session.get("https://httpbin.io/cookies/set?k1=v1&k2=v2")

# retrieve cookies to verify
response = session.get("https://httpbin.io/cookies")

print(response.json())


The above code outputs the following cookies, indicating that the session kept the cookies between requests.

                    Output
                
{'k1': 'v1', 'k2': 'v2'}


That's it! You now know how to set session cookies with curl_cffi.
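The cookie-setting URL above is hardcoded, which only works while keys and values contain no special characters. As a small sketch, the query string can instead be built from a dictionary with the standard library's urlencode, which escapes values for you. The cookie_set_url helper and COOKIE_SET_ENDPOINT constant are hypothetical names.

```python
from urllib.parse import urlencode

COOKIE_SET_ENDPOINT = "https://httpbin.io/cookies/set"


def cookie_set_url(cookies):
    """Build the cookie-setting URL from a dict of cookie names and values."""
    return f"{COOKIE_SET_ENDPOINT}?{urlencode(cookies)}"


url = cookie_set_url({"k1": "v1", "k2": "v2"})
print(url)
```

Passing the result to session.get behaves like the hardcoded URL in the snippets above.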

However, despite dealing well with these advanced issues, curl_cffi still has a few shortcomings. You'll learn more about them and find a solution to overcome them in the next section.

Limitations and Best Alternative for curl_cffi

Curl_cffi is a powerful HTTP client. However, it lacks a few features that are essential for web scraping.

Curl_cffi impersonates browsers at the network level, but it isn't a browser: it can't execute JavaScript or automate user actions like scrolling and clicking, so it can't scrape content that dynamic websites render on the client side.

Additionally, while curl_cffi patches the standard cURL library to mimic mainstream browsers, it lacks browser-level capabilities such as the Firefox Remote Debugging Protocol, the Chrome DevTools Protocol, a WebGL renderer, and more. This limitation reduces your scraper's chances of passing anti-bot browser fingerprinting checks.

Consequently, it may not bypass sophisticated anti-bot systems like Cloudflare and Akamai, which combine advanced fingerprinting with machine learning to detect bots. That means your scraper is likely to get blocked by most heavily protected websites.

For instance, the curl_cffi scraper below tries to access and extract content from G2 Reviews, a Cloudflare-protected web page.

                    Example
                
# import the required libraries
from curl_cffi import requests

# add an impersonate parameter
response = requests.get(
    "https://www.g2.com/products/asana/reviews",
    impersonate="safari_ios"
)

# verify response status
if response.status_code != 200:
    print(f"An error occurred with status {response.status_code}")
else:
    # extract the website's HTML
    print(response.text)

  
  

  

The above scraper gets blocked with a 403 Forbidden response despite impersonating the iOS Safari browser:

                    Output
                
An error occurred with status 403


These limitations make curl_cffi unsuitable for scraping at scale, especially if you're planning to access websites with strong protection systems.

If you want to scrape without getting blocked, the best alternative is to use a web scraping API like ZenRows. It's an all-in-one scraping solution that acts as a headless browser and provides full JavaScript support. ZenRows also fixes advanced fingerprinting issues, allowing you to evade all CAPTCHAs and sophisticated anti-bots.

Let's use ZenRows to scrape the G2 Reviews page that blocked you previously to see how it works.

Sign up to open the ZenRows Request Builder. Paste the target URL in the link box, activate Premium Proxies, and select JS Rendering. Choose Python as your preferred language and select the API connection mode. Copy and paste the generated code into your Python script.

[Image: Building a scraper with the ZenRows Request Builder]

Here's what the generated code looks like:

                    Example
                
# pip install requests
import requests

url = "https://www.g2.com/products/asana/reviews"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)

  
  

  

The above code bypasses the website's Cloudflare protection and extracts its full-page HTML, as shown:

                    Output
                
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
</head>
<body>
    <!-- other content omitted for brevity -->
</body>
</html>

  
  

  

Bravo! Your scraper now bypasses anti-bot protection with ZenRows.

Conclusion

You've seen how curl_cffi works and used its basic and advanced features for content extraction. You now know how to:

- Get a website's full-page HTML with curl_cffi.
- Pair curl_cffi with BeautifulSoup to extract specific elements from a web page.
- Export extracted content to a CSV file.
- Scrape multiple pages concurrently with curl_cffi.
- Set session cookies with curl_cffi's Session object.

Remember that while you can perform all these actions with curl_cffi, the tool still misses some essential web scraping features. For example, it can't reliably avoid blocks on heavily protected websites. We recommend using ZenRows to bypass any anti-bot detection mechanism, regardless of its complexity.

Try ZenRows for free today!