How to Use Ferret for Web Scraping: Tutorial [2024]

July 4, 2024 · 10 min read

Do you want to take advantage of Golang's scraping power without writing a Go program? Ferret, a declarative query language, makes it possible with a simpler syntax.

In this article, you'll learn how to extract content from static and dynamic web pages using Ferret.

Let's go!

What Is Ferret?

Ferret is an open-source web scraping tool written in Golang. It uses the Ferret Query Language (FQL), a declarative query language, and features a command-line interface (CLI) package for running scraping queries within the system's terminal.

The tool runs headless Chromium under the hood, which gives it access to Chrome DevTools Protocol (CDP) and makes it suitable for dynamic content extraction.

Although Ferret provides a Go library for running FQL queries directly from Go code, the library is still under development and lacks documentation. Thus, the CLI remains the easiest way to work with Ferret.

How to Create Your First Ferret Scraper

In this section, you'll create your first Ferret scraper by extracting product information from ScrapingCourse, an e-commerce demo website.

Scrapingcourse Ecommerce Store

Are you ready to scrape this website with Ferret? Let's start with the prerequisites.

Prerequisites

You'll use Ferret's CLI, which is an independent package for running scraping queries. Go to the Ferret CLI release page, download the latest version for your operating system, and extract the downloaded file. If you use Windows, add the folder containing the extracted .exe file to your system's PATH environment variable.

Ferret depends on Chromium, so you'll need a recent build of the browser. You can install it on your local machine. However, for cross-platform compatibility, we recommend pulling the Chromium image published by the Ferret project (montferret/chromium) and running it with Docker.

Download and install Docker Desktop if you haven't already. Then, pull Chromium's image from Ferret by running the following command in a terminal or command prompt:

Terminal
docker pull montferret/chromium

Then, start a container from the image in detached mode and map its port 9222 to the same port on your machine:

Terminal
docker run -d -p 9222:9222 montferret/chromium
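
Before moving on, it can help to confirm that the container is actually listening. The Go sketch below is one way to check, assuming the container exposes Chromium's DevTools HTTP endpoint (/json/version) on localhost:9222. The function and type names are illustrative and not part of Ferret.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// versionInfo holds the two fields we read from the /json/version response.
type versionInfo struct {
	Browser              string `json:"Browser"`
	WebSocketDebuggerURL string `json:"webSocketDebuggerUrl"`
}

// parseVersion decodes the JSON body returned by the /json/version endpoint.
func parseVersion(body []byte) (versionInfo, error) {
	var info versionInfo
	err := json.Unmarshal(body, &info)
	return info, err
}

// checkChromium asks the DevTools HTTP endpoint for its version details.
func checkChromium(baseURL string) (versionInfo, error) {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get(baseURL + "/json/version")
	if err != nil {
		return versionInfo{}, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return versionInfo{}, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return versionInfo{}, err
	}
	return parseVersion(body)
}

func main() {
	// Assumption: the container was started with -p 9222:9222 as shown above.
	info, err := checkChromium("http://localhost:9222")
	if err != nil {
		fmt.Println("Chromium is not reachable:", err)
		return
	}
	fmt.Println("Connected to", info.Browser)
}
```

If the program reports that Chromium isn't reachable, check that the container is running and that nothing else is using port 9222.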

You can write the Ferret Query Language (FQL) directly inside the command line to run simple and quick scraping tasks. However, that approach is unsuitable for large projects since writing blocks of code inside your terminal is difficult to maintain and reduces readability. A recommended approach is to write your FQL code in a Ferret file and execute that file via the command line.

Create a project directory, open your code editor to that folder, and create a new scraper.fql file. This tutorial uses VS Code, but you can follow along with any code editor.

You're now ready to build your first Ferret scraper!

How to Scrape Full HTML With Ferret

Ferret has a built-in HTTP client and can easily parse a website's HTML.

Let's start by scraping the target website's full-page HTML using the CLI package you downloaded previously.

To do this, write the following code in your scraper.fql file:

scraper.fql
// open the target website with Chromium
LET doc = DOCUMENT('https://www.scrapingcourse.com/ecommerce/', {
    driver: 'cdp'
}) 

// print the full-page HTML
RETURN doc

Open your terminal to your project root folder and run the scraper file with the following command:

Terminal
ferret exec scraper.fql

The code outputs the website's HTML, as shown. We've prettified the output to make it more readable:

Output
<head>
    <!--- ... --->
 
    <title>Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com</title>
   
  <!--- ... --->
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count">Showing 1-16 of 188 results</p>
    <ul class="products columns-4">
        <!--- ... --->
     
        <li>
            <h2 class="woocommerce-loop-product__title">Abominable Hoodie</h2>
            <span class="price">
                <span class="woocommerce-Price-amount amount">
                    <bdi>
                        <span class="woocommerce-Price-currencySymbol">$</span>69.00
                    </bdi>
                </span>
            </span>
            <a aria-describedby="This product has multiple variants. The options may ...">Select options</a>
        </li>
     
        <!--- ... other products omitted for brevity --->
    </ul>
</body>
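
If you'd rather trigger the scraper from a Go program than from the terminal, one option is to call the CLI as a subprocess. This is a minimal sketch that assumes the ferret binary is on your PATH and that scraper.fql is in the current directory. It doesn't use Ferret's Go library, and the helper name is made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// ferretExec builds the same command the tutorial runs by hand:
// ferret exec <script>
func ferretExec(script string) *exec.Cmd {
	return exec.Command("ferret", "exec", script)
}

func main() {
	// Output runs the command and returns what it wrote to standard output.
	out, err := ferretExec("scraper.fql").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "ferret failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

Keeping the query in its own .fql file means the Go side only has to manage the process and its output.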

You've just run your first Ferret scraper! Now, let's scale the code up to extract specific product elements.

How to Extract Specific Web Elements With Ferret

In this section, you'll extract product names and prices from the same target website (ScrapingCourse).

Let's quickly inspect the website's elements before scraping them. Open the target website via a browser like Chrome, right-click the first product, and select Inspect.

You'll see that each product sits inside its own list item (li) tag.

Inspect Element

Let's modify the previous Ferret scraper to extract each product's information by iterating over its containers (the li tag).

Obtain all the product containers via their class attribute (.product). Then, use a FOR loop to iterate over the containers and extract each product's name and price:

scraper.fql
// ...

// extract the product containers
LET products = ELEMENTS(doc, '.product')

// iterate through each container to extract product names and prices
FOR product IN products
    RETURN {
        Name: TRIM(INNER_TEXT(product, '.woocommerce-loop-product__title')),
        Price: TRIM(INNER_TEXT(product, '.price'))
    }

Merge the above snippet with the previous one to get the following complete code:

scraper.fql
// open the target website with Chromium
LET doc = DOCUMENT('https://www.scrapingcourse.com/ecommerce/', {
    driver: 'cdp'
})

// extract the product containers
LET products = ELEMENTS(doc, '.product')

// iterate through each container to extract product names and prices
FOR product IN products
    RETURN {
        Name: TRIM(INNER_TEXT(product, '.woocommerce-loop-product__title')),
        Price: TRIM(INNER_TEXT(product, '.price'))
    }
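
The query returns prices as text, such as the $69.00 shown in the page's HTML earlier. If you plan to sort or compare prices later, you may want numbers instead. The Go helper below is a rough sketch of that post-processing step. It assumes prices start with a dollar sign and may contain thousands separators, and parsePrice is a hypothetical name.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePrice converts a scraped price such as "$69.00" into a float64.
// Assumption: a leading dollar sign and optional comma separators.
func parsePrice(raw string) (float64, error) {
	cleaned := strings.TrimSpace(raw)
	cleaned = strings.TrimPrefix(cleaned, "$")
	cleaned = strings.ReplaceAll(cleaned, ",", "")
	return strconv.ParseFloat(cleaned, 64)
}

func main() {
	price, err := parsePrice("$69.00")
	if err != nil {
		fmt.Println("could not parse price:", err)
		return
	}
	fmt.Printf("%.2f\n", price) // prints 69.00
}
```

Floating-point values are fine for sorting and filtering. For exact money arithmetic, storing cents as integers is usually safer.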

You've just scraped an e-commerce website with Ferret. Good job! However, the example website is static and only renders pre-loaded content.

In the next section, you'll see how to handle a dynamic website with Ferret.

How to Scrape JavaScript Rendered Pages

JavaScript-rendered content doesn't immediately load after opening the website. Fortunately, Ferret's built-in headless browser feature allows you to extract it.

To see how Ferret handles JavaScript rendering, let's scrape the ScrapingCourse JavaScript Rendering challenge page, a demo website that loads content asynchronously.

Notice that the website loads content only after some time passes:

async loading of a webpage

You'll extract product names and prices from that website by implementing a wait mechanism that holds off extraction until the elements have loaded.

Inspect the target website's elements. Each product's name and price are inside a div tag with the product-info class (.product-info). Let's scrape them!

Inspect Element

First, open the target website with the Chromium instance and wait up to five seconds for the product containers to appear in the DOM:

scraper.fql
// open the target website with Chromium
LET doc = DOCUMENT('https://www.scrapingcourse.com/javascript-rendering', {
    driver: 'cdp'
})

// wait up to 5 seconds for the product containers to load
WAIT_ELEMENT(doc, '.product-info', 5000)

Extract the product containers and iterate through each to extract the desired product information:

scraper.fql
// ...

// extract the product containers
LET products = ELEMENTS(doc, '.product-info')

// iterate through each container to extract product names and prices
FOR product IN products
    RETURN {
        Name: TRIM(INNER_TEXT(product, '.product-name')),
        Price: TRIM(INNER_TEXT(product, '.product-price'))
    }

Here's the full code after combining both snippets:

scraper.fql
// open the target website with Chromium
LET doc = DOCUMENT('https://www.scrapingcourse.com/javascript-rendering', {
    driver: 'cdp'
})

// wait up to 5 seconds for the product containers to load
WAIT_ELEMENT(doc, '.product-info', 5000)

// extract the product containers
LET products = ELEMENTS(doc, '.product-info')

// iterate through each container to extract product names and prices
FOR product IN products
    RETURN {
        Name: TRIM(INNER_TEXT(product, '.product-name')),
        Price: TRIM(INNER_TEXT(product, '.product-price'))
    }

Execute the scraper with the Ferret command:

Terminal
ferret exec scraper.fql

The code implements the wait mechanism and extracts the names and prices of all the products from the dynamic web page:

Output
[
    {"Name":"Chaz Kangeroo Hoodie","Price":"$52"},
    {"Name":"Teton Pullover Hoodie","Price":"$70"},

    // ... other products omitted for brevity,

    {"Name":"Grayson Crewneck Sweatshirt","Price":"$64"},
    {"Name":"Ajax Full-Zip Sweatshirt","Price":"$69"}
]

You've just scraped a JavaScript-rendered page with Ferret. Congratulations!

While Ferret is perfectly capable of extracting data from pages, it has a few limitations you should learn about before using it for web scraping.

Limitations of Ferret for Web Scraping

Ferret aims to simplify web scraping by abstracting away Golang's stricter syntax. However, it has a few drawbacks that can affect your project.

First, the query language powering Ferret is still under development and has poor documentation, which complicates its learning curve.

Ferret also lacks advanced features suitable for handling complex extraction tasks. For instance, it can't scrape dynamic pages with infinite scrolling, and its CLI prints results to the terminal as JSON rather than exporting them to files in formats like CSV.

Ferret also can't bypass CAPTCHAs and other anti-bot systems like Cloudflare despite its dependence on Chromium via the Chrome DevTools Protocol.

Although Ferret handles dynamic content extraction with Chromium, it requires consistent updates to maintain compatibility with new Chromium releases. This recurring setup can be difficult to sustain at scale, especially when executing scraping tasks across multiple machines or browser versions.

Fortunately, there's a solution to all of these. Let's look at it in the next section.

Best Solution for Web Scraping (Using Go)

You can overcome Ferret's limitations and scrape any protected website using a web scraping API, such as ZenRows. It's an all-in-one scraping solution that acts as a headless browser, auto-rotates premium proxies, optimizes your request headers, and bypasses all CAPTCHAs and any other anti-bot system at scale. As such, it mitigates all of Ferret's shortcomings.

Let's show you how ZenRows works by scraping a Cloudflare-protected website like this G2 Reviews page.

Sign up to open the ZenRows Request Builder. Paste the target URL in the link box, toggle the Boost mode to JS Rendering, and activate Premium Proxies. Select Go as your programming language and choose the API connection mode. Then, copy and paste the generated code into your scraper file.

ZenRows Request Builder

The generated code should look like this in your Go file:

scraper.go
package main

import (
    "io"
    "log"
    "net/http"
)

func main() {
    client := &http.Client{}
    req, err := http.NewRequest(
        "GET",
        "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fasana%2Freviews&js_render=true&premium_proxy=true",
        nil,
    )
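    // stop early if the request could not be built
    if err != nil {
        log.Fatalln(err)
    }
    // send the request through the ZenRows API
    resp, err := client.Do(req)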
    if err != nil {
        log.Fatalln(err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatalln(err)
    }

    log.Println(string(body))
}

Run the above code with the following command:

Terminal
go run scraper.go

The code accesses the protected website and gets its full-page HTML, as shown:

Output
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
</head>
<body>
    <!-- other content omitted for brevity -->
</body>

Congratulations! Your Go scraper now bypasses Cloudflare's anti-bot protection with ZenRows.
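
The generated code embeds a hand-encoded target URL in the request string. If you want to swap targets without re-encoding them yourself, you could build the query string with Go's net/url package instead. This sketch reuses the parameter names from the generated code above (apikey, url, js_render, premium_proxy). The helper name is made up for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildZenRowsURL assembles the API request URL from its query parameters.
// url.Values percent-encodes the values and sorts the keys alphabetically.
func buildZenRowsURL(apiKey, target string) string {
	params := url.Values{}
	params.Set("apikey", apiKey)
	params.Set("url", target)
	params.Set("js_render", "true")
	params.Set("premium_proxy", "true")
	return "https://api.zenrows.com/v1/?" + params.Encode()
}

func main() {
	// Replace the placeholder with your own API key.
	fmt.Println(buildZenRowsURL("<YOUR_ZENROWS_API_KEY>", "https://www.g2.com/products/asana/reviews"))
}
```

You can pass the returned string to http.NewRequest in place of the hard-coded URL. The parameter order differs from the generated code, which shouldn't matter for a query string.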

Conclusion

You now know how Ferret works and how to use it for web scraping. Here's what you can now do:

  • Scrape a page's full HTML with Ferret.
  • Extract specific products from a static web page with the Ferret CLI.
  • Use the Ferret CLI to get content from a dynamic web page.

While Ferret provides a straightforward way to scrape with Golang, remember that it can't handle more complex data extraction tasks and is powerless against most anti-bot detectors. To deal with these issues, go for a web scraping API, such as ZenRows, to bypass all blocks and scrape any protected website at scale.

Try ZenRows for free now without a credit card!
