How to Use Curl Impersonate for Web Scraping? [2024 Guide]

May 27, 2024 · 8 min read

Are you getting blocked while scraping with cURL from the command line? That's because plain cURL doesn't reproduce a real browser's fingerprint, which makes its requests easy to detect.

Luckily, Curl Impersonate lets you send browser-like requests straight from your terminal, which helps you deal with those blocks.

In this article, you'll learn how Curl Impersonate works and how to use it for content extraction. Let's roll!

What Is Curl Impersonate?

Curl Impersonate is a modified version of the standard cURL library. It behaves like a real browser by replicating the TLS and HTTP/2 handshake of popular browsers like Firefox, Chrome, Edge, and Safari.

Curl Impersonate also patches other underlying parts of the cURL library. For instance, it swaps cURL's OpenSSL for Mozilla's NSS (Network Security Services) or Chrome's BoringSSL to mimic the Firefox or Chrome TLS stack, respectively.

It also adjusts cURL's HTTP/2 connection settings, modifies its TLS extension and SSL option configuration, and runs cURL with non-default flags such as --ciphers, --curves, and custom -H headers. Talk about a full package!

All the functionalities above make Curl Impersonate's requests identical to those of real browsers, reducing the chance of getting blocked during web scraping with cURL.
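To get a feel for why the TLS layer matters, here's a stdlib Python sketch (not how Curl Impersonate itself is implemented) showing that a TLS client's cipher configuration, one part of what servers fingerprint, is both visible and configurable:

```python
import ssl

# every TLS client advertises an ordered cipher list in its ClientHello;
# anti-bot systems fingerprint clients partly by that list
ctx = ssl.create_default_context()
default_ciphers = [c["name"] for c in ctx.get_ciphers()]

# narrowing the list changes the fingerprint a server would observe
ctx.set_ciphers("ECDHE+AESGCM")
restricted_ciphers = [c["name"] for c in ctx.get_ciphers()]

print(f"default: {len(default_ciphers)} ciphers")
print(f"restricted: {len(restricted_ciphers)} ciphers")
```

Curl Impersonate goes much further than this sketch: it pins the whole cipher order, TLS extensions, and HTTP/2 settings to match a specific browser build.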

Frustrated that your web scrapers are blocked again and again?
ZenRows API handles rotating proxies and headless browsers for you.
Try for FREE

How to Scrape With Curl Impersonate

In this section, you'll learn how to use Curl Impersonate by scraping content from ScrapingCourse.com, a demo website with real e-commerce page features.

Here's what the target web page looks like:

Scrapingcourse Ecommerce Store

Let's start!

Step 1: Install Curl Impersonate

There are a few ways to install Curl Impersonate: you can download a pre-compiled binary, install a distro package, build it from source, or run it via a Docker image.

However, the pre-compiled binary and distro package options have system-specific requirements. Since the Docker image works the same way across platforms, let's use it.

First, ensure you've downloaded and installed the latest version of Docker Desktop.

Curl Impersonate features patches for Firefox and Chrome. Each is available in a separate image. You can pull any of these images, depending on your browser choice.

To pull Chrome's version of Curl Impersonate, use the following:

Terminal
docker pull lwthiker/curl-impersonate:0.6-chrome

To get the Firefox version:

Terminal
docker pull lwthiker/curl-impersonate:0.6-ff

That was easy! You're now all set to scrape with Curl Impersonate.

Step 2: Scrape Your Target Page's HTML

Scraping a website with Curl Impersonate requires running its Docker image. Each browser version has a dedicated wrapper script (such as curl_chrome110) that tells the container which browser to impersonate. To target a particular version, include that script name in the Docker command.

Let's use Chrome to extract the full-page HTML of the target website. You'll use Chrome version 110 in this example since it's one of the versions this image ships a wrapper script for.

Below, you can see the basic Docker command to run Chrome version 110. The command consists of the Docker image source, the browser version, and the target website:

Terminal
docker run --rm lwthiker/curl-impersonate:0.6-chrome curl_chrome110 https://www.scrapingcourse.com/ecommerce/

The command above generates the target website's full-page HTML, as shown:

Output
<!DOCTYPE html>
<html lang="en-US">
<head>
    <!--- ... --->
 
    <title>Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com</title>
   
  <!--- ... --->
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count">Showing 1-16 of 188 results</p>
    <ul class="products columns-4">
        <!--- ... --->
     
        <li>
            <h2 class="woocommerce-loop-product__title">Abominable Hoodie</h2>
            <span class="price">
                <span class="woocommerce-Price-amount amount">
                    <bdi>
                        <span class="woocommerce-Price-currencySymbol">$</span>69.00
                    </bdi>
                </span>
            </span>
            <a aria-describedby="This product has multiple variants. The options may ...">Select options</a>
        </li>
     
        <!--- ... other products omitted for brevity --->
    </ul>
</body>
</html>

You've just scraped a web page with Curl Impersonate. Congratulations!
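If you plan to script these scrapes rather than type them by hand, you can wrap the Docker invocation from Python. A minimal sketch, assuming the same image tag and wrapper script as above:

```python
import subprocess

def build_command(url, browser="curl_chrome110",
                  image="lwthiker/curl-impersonate:0.6-chrome"):
    """Assemble the Docker argv for a Curl Impersonate request."""
    return ["docker", "run", "--rm", image, browser, url]

def curl_impersonate(url, **kwargs):
    """Run the command and return the page HTML (requires Docker and the pulled image)."""
    result = subprocess.run(build_command(url, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout

print(build_command("https://www.scrapingcourse.com/ecommerce/"))
```

You'd then call `curl_impersonate("https://www.scrapingcourse.com/ecommerce/")` to get the HTML as a string for further processing.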

Would you like to know how this tool modifies your request headers? Let's learn more in the next section.

Bonus: Check the Request Headers

One of Curl Impersonate's strategies for bypassing anti-bots is to make your request headers mimic those of a real browser.

To confirm, let's check your default header by replacing the URL in the previous command with https://httpbin.io/headers, a web page that returns your current request headers:

Terminal
docker run --rm lwthiker/curl-impersonate:0.6-chrome curl_chrome110 https://httpbin.io/headers

You'll see that Curl Impersonate uses a real User Agent for the specified Chrome version (Chrome/110.0.0.0). It also reflects the same browser version in the secure client hint User Agent (Sec-Ch-Ua) header. The consistency between these two headers lowers the chances of anti-bot detection.

In addition, other request headers, such as Accept-Encoding and Accept-Language, match those of a legitimate browser. See the result below:

Output
{
    "headers": {
        "Accept": [
            "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"
        ],
        "Accept-Encoding": [
            "gzip, deflate, br"
        ],
        "Accept-Language": [
            "en-US,en;q=0.9"
        ],
        "Host": [
            "httpbin.io"
        ],
        "Sec-Ch-Ua": [
            "\"Chromium\";v=\"110\", \"Not A(Brand\";v=\"24\", \"Google Chrome\";v=\"110\""
        ],
        "Sec-Ch-Ua-Mobile": [
            "?0"
        ],
        "Sec-Ch-Ua-Platform": [
            "\"Windows\""
        ],
        "Sec-Fetch-Dest": [
            "document"
        ],
        "Sec-Fetch-Mode": [
            "navigate"
        ],
        "Sec-Fetch-Site": [
            "none"
        ],
        "Sec-Fetch-User": [
            "?1"
        ],
        "Upgrade-Insecure-Requests": [
            "1"
        ],
        "User-Agent": [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
        ]
    }
}
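You can verify that version consistency programmatically. Here's a small stdlib sketch that checks the Chrome version in the User-Agent against the one advertised in Sec-Ch-Ua, using the values shown above:

```python
import re

# header values taken from the httpbin.io output above
user_agent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/110.0.0.0 Safari/537.36")
sec_ch_ua = '"Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"'

# major version from the User-Agent string
ua_version = re.search(r"Chrome/(\d+)", user_agent).group(1)

# major version claimed by the "Google Chrome" brand in Sec-Ch-Ua
ch_version = re.search(r'"Google Chrome";v="(\d+)"', sec_ch_ua).group(1)

# a mismatch between these two values is a classic bot signal
print(ua_version == ch_version)
```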

Check our article on the common HTTP headers for web scraping to learn more about the importance of request headers for content extraction.

You can also use Curl Impersonate with Python. You'll see how it works in the next section.

Use Curl Impersonate With Python

Scaling a content extraction project with the base Curl Impersonate library is difficult because it only works via the command line. However, you can get the same capabilities in Python with a library like curl_cffi.

curl_cffi is a Python binding that exposes Curl Impersonate's features. It supports session management, which lets you persist cookies and stay logged in while scraping behind a login, and asynchronous requests for concurrent scraping.

To begin using Curl Impersonate with Python, install the package using pip:

Terminal
pip install curl_cffi --upgrade

Let's scrape the full-page HTML from the ScrapingCourse e-commerce demo website.

Import the library's requests module and send a GET request to the target website. The get method accepts an impersonate argument that specifies which browser version to emulate. For instance, to mimic Chrome version 110, set impersonate to "chrome110". To use the latest supported version, pass "chrome" without a version number:

scraper.py
# import the required library
from curl_cffi import requests

# send your request and choose a browser type/version
response = requests.get(
    "https://www.scrapingcourse.com/ecommerce/",
    impersonate="chrome"
)

# output the response text to view the full HTML
print(response.text)

The code above extracts the website's full-page HTML using the latest supported Chrome browser. See the output below:

Output
<!DOCTYPE html>
<html lang="en-US">
<head>
    <!--- ... --->
 
    <title>Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com</title>
   
  <!--- ... --->
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count">Showing 1-16 of 188 results</p>
    <ul class="products columns-4">
        <!--- ... --->
     
        <li>
            <h2 class="woocommerce-loop-product__title">Abominable Hoodie</h2>
            <span class="price">
                <span class="woocommerce-Price-amount amount">
                    <bdi>
                        <span class="woocommerce-Price-currencySymbol">$</span>69.00
                    </bdi>
                </span>
            </span>
            <a aria-describedby="This product has multiple variants. The options may ...">Select options</a>
        </li>
     
        <!--- ... other products omitted for brevity --->
    </ul>
</body>
</html>
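With the raw HTML in hand, the next step is usually extraction. As an illustration (using Python's built-in html.parser rather than a third-party library), here's how you could pull product titles out of markup like the above:

```python
from html.parser import HTMLParser

class ProductTitleParser(HTMLParser):
    """Collect the text of <h2 class="woocommerce-loop-product__title"> tags."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if tag == "h2" and "woocommerce-loop-product__title" in classes:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

# sample snippet from the scraped page
html = '<h2 class="woocommerce-loop-product__title">Abominable Hoodie</h2>'
parser = ProductTitleParser()
parser.feed(html)
print(parser.titles)  # ['Abominable Hoodie']
```

In a real project, you'd feed `response.text` to the parser instead of the sample snippet.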

Congratulations! You've just learned to scrape a website's HTML with Curl Impersonate via Python's curl_cffi library.

But despite all its handy capabilities, Curl Impersonate isn't a perfect web scraping tool. It has a few limitations you need to be aware of before starting your project.

Limitations of Curl Impersonate and the Best Alternative

Curl Impersonate offers a significant web scraping advantage over the regular cURL library, but it still has a few serious limitations.

Firstly, it lags behind new browser releases. This is a significant drawback since some anti-bots flag outdated browser versions.

Secondly, Curl Impersonate can't handle JavaScript-rendered web pages, like those with infinite scrolling, since it's only a request impersonator and doesn't execute JavaScript.

These limitations lower the library's chances of bypassing sophisticated anti-bot measures like DataDome and Cloudflare.

Let's test its bypass ability by running Curl Impersonate against a Cloudflare-protected website, such as the G2 Reviews page.

Try it out with the following command:

Terminal
docker run --rm lwthiker/curl-impersonate:0.6-chrome curl_chrome110 https://www.g2.com/products/asana/reviews

The command above returns HTML whose title says "Just a moment…", indicating that Curl Impersonate couldn't get past the Cloudflare challenge:

Output
<!DOCTYPE html>
<html lang="en-US">
<head>
    <title>Just a moment...</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <!-- ... other header content omitted for brevity -->
</head>
<body>
    <!-- ... content omitted for brevity -->
</body>
</html>
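When you automate requests like this, it helps to detect the challenge page in code instead of eyeballing the output. A simple, admittedly heuristic check:

```python
import re

def looks_blocked(html):
    """Heuristic: Cloudflare titles its challenge page 'Just a moment...'."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else ""
    return title.lower().startswith("just a moment")

blocked_page = "<html><head><title>Just a moment...</title></head><body></body></html>"
real_page = "<html><head><title>Ecommerce Test Site</title></head><body></body></html>"

print(looks_blocked(blocked_page), looks_blocked(real_page))  # True False
```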

However, there is a way to overcome this hurdle and scrape any web page without getting blocked by using a web scraping API, such as ZenRows. ZenRows optimizes your request headers, auto-rotates premium proxies, and bypasses CAPTCHAs and other anti-bot systems at scale.

ZenRows also works perfectly with cURL and features JavaScript instructions, allowing it to act as a headless browser for scraping dynamic web pages.

Let's use ZenRows to scrape the G2 Reviews page that blocked you previously.

Sign up to open the ZenRows Request Builder. Paste the target URL in the link box, toggle the Boost mode to JS Rendering, and activate Premium Proxies. Select cURL as your language and choose the API connection mode. Finally, copy the generated code.

ZenRows Request Builder

The generated cURL code looks like this:

Terminal
curl "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fasana%2Freviews&js_render=true&premium_proxy=true"

Paste and run it in your terminal and watch ZenRows bypass the anti-bot measure.

The command scrapes the Cloudflare-protected website's full-page HTML. See the result below, with some content omitted for brevity:

Output
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
</head>
<body>
    <!-- other content omitted for brevity -->
</body>

Congratulations! You've just scraped a Cloudflare-protected website with ZenRows.
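If you'd rather call the API from Python than from cURL, the same request can be assembled with the standard library. A sketch assuming the parameters shown in the generated command above (the API key is a placeholder you'd replace with your own):

```python
from urllib.parse import urlencode

# the same parameters as the generated cURL command above
params = {
    "apikey": "<YOUR_ZENROWS_API_KEY>",
    "url": "https://www.g2.com/products/asana/reviews",
    "js_render": "true",
    "premium_proxy": "true",
}
api_url = "https://api.zenrows.com/v1/?" + urlencode(params)
print(api_url)

# fetch it with, e.g., urllib.request.urlopen(api_url) or the requests library
```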

Conclusion

In this tutorial, you've learned how to scrape a website with the Curl Impersonate library. You know:

  • How Curl Impersonate works and how to install it via Docker image.
  • How to scrape a full-page HTML with Curl Impersonate.
  • How to use Curl Impersonate with Python via the curl_cffi library.

Remember that complex anti-bots may block Curl Impersonate, preventing you from accessing your target data. We recommend using ZenRows to bypass anti-bot measures and scrape any website without getting blocked. Try ZenRows now, and get your API key with up to 1,000 free request credits!

Ready to get started?

Up to 1,000 URLs for free are waiting for you