How to Use Geziyor for Web Scraping

August 12, 2024 · 8 min read

Are you looking for a simple way to implement advanced crawling features in your Golang scraping project? Try Geziyor, a dedicated Golang crawling framework.

In this article, you'll learn about Geziyor's features, how they work, and how to use the tool for web scraping.

Let's get started!

How Can Geziyor Help With Web Scraping?

Geziyor is a framework for web scraping and crawling in Golang. It features a built-in HTTP client for easy requests and an HTML parser for specific data extraction with CSS selectors.

One of Geziyor's strengths is that it simplifies concurrent web scraping. To extract data from several pages simultaneously, you just need to pass multiple target URLs to Geziyor's start URLs. You can also limit the number of concurrent requests to avoid getting blocked due to server overload.
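
Here's a minimal sketch of what that looks like. It assumes Geziyor's ConcurrentRequests option and a placeholder parseTitle function; the second URL is purely illustrative:

Example
package main

import (
    "fmt"

    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
)

func main() {
    // a minimal concurrency sketch: several start URLs, capped concurrency
    geziyor.NewGeziyor(&geziyor.Options{
        StartURLs: []string{
            "https://www.scrapingcourse.com/ecommerce/",
            "https://www.scrapingcourse.com/ecommerce/page/2/", // illustrative second URL
        },
        ParseFunc:          parseTitle,
        ConcurrentRequests: 2, // limit simultaneous requests
        RobotsTxtDisabled:  true,
    }).Start()
}

// parseTitle is a placeholder parse function that prints each page's title
func parseTitle(g *geziyor.Geziyor, r *client.Response) {
    fmt.Println(r.HTMLDoc.Find("title").Text())
}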

Geziyor supports content caching, helping you store previous web page versions for quicker access during subsequent requests. It also has a cookie feature, which lets you manage sessions while scraping behind a login.

Another notable feature is automatic data export. You can save data locally in your chosen formats, such as JSON and CSV.
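
For instance, here's a minimal sketch that writes the same items to both a CSV file and a JSON-lines file. It assumes the JSONLine exporter from Geziyor's export package; the file names and the placeholder parsePage function are just examples:

Example
package main

import (
    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
    "github.com/geziyor/geziyor/export"
)

func main() {
    // a minimal export sketch: items sent to g.Exports land in both files
    geziyor.NewGeziyor(&geziyor.Options{
        StartURLs: []string{"https://www.scrapingcourse.com/ecommerce/"},
        ParseFunc: parsePage,
        Exporters: []export.Exporter{
            &export.CSV{FileName: "sample.csv"},
            &export.JSONLine{FileName: "sample.jsonl"},
        },
        RobotsTxtDisabled: true,
    }).Start()
}

// parsePage sends a single field (the page title) to Geziyor's export pipeline
func parsePage(g *geziyor.Geziyor, r *client.Response) {
    g.Exports <- map[string]interface{}{
        "Title": r.HTMLDoc.Find("title").Text(),
    }
}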

Although Geziyor supports dynamic scraping and lets you delay requests so content can load into the DOM, it doesn't support page interactions such as clicking and scrolling. In the next section, you'll see how scraping with Geziyor works in practice.


Tutorial: How to Scrape With Geziyor

In this section, you'll learn how to use Geziyor by scraping the ScrapingCourse e-commerce demo site. You'll start with the full-page HTML before extracting specific product data and exporting it to a CSV file.

Here's what the target website looks like:

The ScrapingCourse e-commerce demo store

Prerequisites

This tutorial uses Go 1.22+. Ensure you download and install the latest version from the Golang download page.
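
You can verify the installation from your terminal:

Terminal
go version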

Create a new project folder using VS Code or any code editor of your choice. Open your terminal in that directory and initialize a Go module for your scraper project:

Example
go mod init scraper

Create a new scraper.go file in that project root folder. Then, install Geziyor with the following command:

Example
go get -u github.com/geziyor/geziyor

Now, let's build your first Geziyor scraper!

Step 1: Make Your First Request to Get HTML

Let's build a basic scraper to understand how Geziyor works. Import Geziyor and its client package, then create a new Geziyor object. This object accepts a list of target URLs, a scraper function, and an option to disable robots.txt rules. The Start method then triggers the HTTP client to send the request.

The scraper function is a separate callback that parses the website's HTML and prints the content of its body element, selected with Geziyor's built-in CSS selector support.

Example
package main

// import the required libraries
import (
    "fmt"

    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
)

func main() {

    // create a new Geziyor object and specify the URLs and scraper function
    geziyor.NewGeziyor(&geziyor.Options{
        StartURLs:         []string{"https://www.scrapingcourse.com/ecommerce/"},
        ParseFunc:         scraper,
        RobotsTxtDisabled: true,
    }).Start()
}

// define the scraper function
func scraper(g *geziyor.Geziyor, r *client.Response) {
    fmt.Println("HTML Content:", r.HTMLDoc.Find("body").Text())
}

Execute the above code in your project root terminal with the following command:

Terminal
go run scraper.go

The scraper code extracts the target website's full-page HTML, as shown:

Output
HTML Content: 

<!--- ... --->
 
Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com
   
  <!--- ... --->

    Abominable Hoodie
    $69.00
  Select options

    Adrienne Trek Jacket
    $57.00
  Select options

  <!--- ... other products omitted for brevity --->


    Showing 1-16 of 188 results

  <!--- ... --->

You now know how to build a basic web scraper with Geziyor. Keep reading to learn how to extract specific elements.

Step 2: Extract Product Data

Geziyor's CSS selector simplifies specific element extraction. Let's see how it works by extracting the names and prices of all products from the target website's first page.

Open the target website via a browser like Chrome. Right-click the first product and select Inspect. You'll see that each product is inside a list item (li) element with the class product.

Inspecting a product element on the ScrapingCourse store

Import the goquery package for element selection within each match. Modify the previous scraper function to select all the product containers (the li elements). Then, iterate through each container to extract the product's name and price:

scraper.go
package main

// import the required libraries
import (

    "github.com/PuerkitoBio/goquery"

    // ...
)

// define the scraper function
func scraper(g *geziyor.Geziyor, r *client.Response) {

    // loop through the product containers to extract names and prices
    r.HTMLDoc.Find("li.product").Each(func(_ int, s *goquery.Selection) {
        fmt.Println(
            "Name:", s.Find("h2.woocommerce-loop-product__title").Text(),
            "Price:", s.Find("span.price").Text(),
        )
    })
}

Combine the code above with the main function to get the following:

Example
package main

// import the required libraries
import (
    "fmt"

    "github.com/PuerkitoBio/goquery"
    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
)

func main() {

    // create a new Geziyor object and specify the URLs and scraper function
    geziyor.NewGeziyor(&geziyor.Options{

        StartURLs:         []string{"https://www.scrapingcourse.com/ecommerce/"},
        ParseFunc:         scraper,
        RobotsTxtDisabled: true,
    }).Start()
}

// define the scraper function
func scraper(g *geziyor.Geziyor, r *client.Response) {

    // loop through the product containers to extract names and prices
    r.HTMLDoc.Find("li.product").Each(func(_ int, s *goquery.Selection) {
        fmt.Println(
            "Name:", s.Find("h2.woocommerce-loop-product__title").Text(),
            "Price:", s.Find("span.price").Text(),
        )
    })
}

Run the code. The output will look like this:

Output
Name: Abominable Hoodie Price: $69.00
Name: Adrienne Trek Jacket Price: $57.00

// ... other products omitted for brevity

Name: Ariel Roll Sleeve Sweatshirt Price: $39.00
Name: Artemis Running Short Price: $45.00

You've just scraped specific elements from a single page using Geziyor! The next step is collecting the extracted data into a CSV file.

Step 3: Export as a CSV File

Geziyor provides a straightforward auto-export feature, which requires adding an Exporters option to the Geziyor object.

Add Geziyor's export package to your imports. Then, specify the CSV file name in the Geziyor object's Exporters option:

Example
package main

// import the required libraries
import (
    "github.com/geziyor/geziyor/export"

    // ...
)

func main() {

    // create a new Geziyor object and specify the URLs and scraper function
    geziyor.NewGeziyor(&geziyor.Options{

        StartURLs:         []string{"https://www.scrapingcourse.com/ecommerce/"},
        ParseFunc:         scraper,
        RobotsTxtDisabled: true,
        Exporters:         []export.Exporter{&export.CSV{FileName: "products.csv"}},
    }).Start()
}

You must also modify your scraper function to make the data export work. During iteration, send each product's data to the g.Exports channel so Geziyor writes it to the CSV file.

Your new scraper function should look like this:

Example
// define the scraper function
func scraper(g *geziyor.Geziyor, r *client.Response) {

    // loop through the product containers to extract names and prices
    r.HTMLDoc.Find("li.product").Each(func(_ int, s *goquery.Selection) {

        // export the data
        g.Exports <- map[string]interface{}{
            "Name":  s.Find("h2.woocommerce-loop-product__title").Text(),
            "Price": s.Find("span.price").Text(),
        }
    })
}

Merge the snippets to get this complete code:

Example
package main

// import the required libraries
import (
    "github.com/PuerkitoBio/goquery"
    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
    "github.com/geziyor/geziyor/export"
)

func main() {

    // create a new Geziyor object and specify the URLs and scraper function
    geziyor.NewGeziyor(&geziyor.Options{

        StartURLs:         []string{"https://www.scrapingcourse.com/ecommerce/"},
        ParseFunc:         scraper,
        RobotsTxtDisabled: true,
        Exporters:         []export.Exporter{&export.CSV{FileName: "products.csv"}},
    }).Start()
}

// define the scraper function
func scraper(g *geziyor.Geziyor, r *client.Response) {

    // loop through the product containers to extract names and prices
    r.HTMLDoc.Find("li.product").Each(func(_ int, s *goquery.Selection) {

        // export the data
        g.Exports <- map[string]interface{}{
            "Name":  s.Find("h2.woocommerce-loop-product__title").Text(),
            "Price": s.Find("span.price").Text(),
        }
    })
}

The code above extracts the product names and prices into a products.csv file. You'll find it in your project root folder.
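
If you open the file, the first few rows should look something like this (the header row and column order may vary, since the exported items are maps):

Output
Name,Price
Abominable Hoodie,$69.00
Adrienne Trek Jacket,$57.00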


Congratulations! You've just learned to export data to a CSV file with Geziyor. However, Geziyor has more advanced features, such as scraping multiple pages and dynamic websites. Let's see how they work in the next section.

Advanced Scraping With Geziyor

One of Geziyor's strengths is its ability to scrape paginated websites and dynamic pages. In this section, you'll learn how to do it.

Scrape Multiple Pages

Geziyor offers a straightforward crawling method to scrape multiple pages. The target website (scrapingcourse.com) distributes products over several pages using pagination.

Scraping each page requires following the next page link in the navigation bar.

Let's inspect the navigation button first. Right-click the “Next” button (the arrow) on the navigation bar and choose “Inspect” to expose its element structure. You'll see that it's a link with the class name next, which is present in the DOM on every page except the last:

Inspecting the ScrapingCourse next-page button

Let's modify the previous scraper function to implement pagination with Geziyor. Add a condition that will follow all the subsequent page links. This logic ensures that Geziyor crawls all 12 pages on the target website:

Example
// define the scraper function
func scraper(g *geziyor.Geziyor, r *client.Response) {

    // ...

    // follow all next page links to scrape all pages
    if href, ok := r.HTMLDoc.Find("a.next").Attr("href"); ok {

        // open the next page and execute the scraper function
        g.Get(r.JoinURL(href), scraper)
    }
}

Combine the above code with the previous scraper function. Here's the final code:

Example
package main

// import the required libraries
import (
    "github.com/PuerkitoBio/goquery"
    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
    "github.com/geziyor/geziyor/export"
)

func main() {

    // create a new Geziyor object and specify the URLs and scraper function
    geziyor.NewGeziyor(&geziyor.Options{

        StartURLs:         []string{"https://www.scrapingcourse.com/ecommerce/"},
        ParseFunc:         scraper,
        RobotsTxtDisabled: true,
        Exporters:         []export.Exporter{&export.CSV{FileName: "products.csv"}},
    }).Start()
}

// define the scraper function
func scraper(g *geziyor.Geziyor, r *client.Response) {

    // loop through the product containers to extract names and prices
    r.HTMLDoc.Find("li.product").Each(func(_ int, s *goquery.Selection) {

        // export the data
        g.Exports <- map[string]interface{}{
            "Name":  s.Find("h2.woocommerce-loop-product__title").Text(),
            "Price": s.Find("span.price").Text(),
        }
    })

    // follow all next page links to scrape all pages
    if href, ok := r.HTMLDoc.Find("a.next").Attr("href"); ok {

        // open the next page and execute the scraper function
        g.Get(r.JoinURL(href), scraper)
    }
}

The above code opens each page and scrapes its product names and prices. The new products.csv file now contains the product data from all 12 pages:

The products.csv file with products from all pages

Congratulations! You've just scraped content from multiple pages with Geziyor. You'll see how to handle dynamic content in the next section.

Scrape JavaScript Rendered Pages

Geziyor doesn't support infinite scrolling, so it can't extract content from websites such as the ScrapingCourse Infinite Scrolling page.

However, you can combine its built-in request delay with its dynamic renderer to give JavaScript-rendered content time to load.

Let's see how Geziyor's dynamic scraping works by extracting content from the ScrapingCourse JavaScript Rendering challenge page, a demo website that loads content asynchronously.

Here's how that page renders content:

The JavaScript Rendering page with JavaScript disabled

Let's scrape the names and prices of all products from that page. First, inspect the first product. You'll see that each product's data sits inside a div element with the class product-info.


To give the target content time to load, add a RequestDelay option to the Geziyor object. The option takes a time.Duration, so specify the unit explicitly:

Example
func main() {

    // create a new Geziyor object and specify the URLs and scraper function
    geziyor.NewGeziyor(&geziyor.Options{
        StartRequestsFunc: requestFunc,
        ParseFunc:         scraper,
        RequestDelay:      10 * time.Second, // RequestDelay is a time.Duration
    }).Start()
}

Now, define a dedicated start-requests function that calls Geziyor's GetRendered method. GetRendered opens the page in a headless browser, adding JavaScript support to Geziyor's client (it requires a local Chrome installation):

Example
// use a dedicated JavaScript rendering function for request
func requestFunc(g *geziyor.Geziyor) {
    g.GetRendered("https://www.scrapingcourse.com/javascript-rendering", g.Opt.ParseFunc)
}

Next, change your scraper function. Modify the parser to iterate through each product container using the CSS selector (.product-info) and extract each product's name and price:

Example
// define the scraper function
func scraper(g *geziyor.Geziyor, r *client.Response) {

    // loop through the product containers to extract names and prices
    r.HTMLDoc.Find("div.product-info").Each(func(_ int, s *goquery.Selection) {
        fmt.Println(
            "Name:", s.Find(".product-name").Text(),
            "Price:", s.Find(".product-price").Text(),
        )
    })
}

Combine the three snippets, and you'll get this complete code:

Example
package main


// import the required libraries
import (
    "fmt"
    "time"

    "github.com/PuerkitoBio/goquery"
    "github.com/geziyor/geziyor"
    "github.com/geziyor/geziyor/client"
)

func main() {

    // create a new Geziyor object and specify the URLs and scraper function
    geziyor.NewGeziyor(&geziyor.Options{
        StartRequestsFunc: requestFunc,
        ParseFunc:         scraper,
        RequestDelay:      10 * time.Second, // RequestDelay is a time.Duration
    }).Start()
}

// use a dedicated JavaScript rendering function for request
func requestFunc(g *geziyor.Geziyor) {
    g.GetRendered("https://www.scrapingcourse.com/javascript-rendering", g.Opt.ParseFunc)
}

// define the scraper function
func scraper(g *geziyor.Geziyor, r *client.Response) {

    // loop through the product containers to extract names and prices
    r.HTMLDoc.Find("div.product-info").Each(func(_ int, s *goquery.Selection) {
        fmt.Println(
            "Name:", s.Find(".product-name").Text(),
            "Price:", s.Find(".product-price").Text(),
        )
    })
}

The above scraper extracts the names and prices of all products on the dynamic web page:

Output
Name: Chaz Kangeroo Hoodie Price: $52
Name: Teton Pullover Hoodie Price: $70

// ... other products omitted for brevity

Name: Grayson Crewneck Sweatshirt Price: $64
Name: Ajax Full-Zip Sweatshirt Price: $69

You've just used Geziyor's dynamic scraping feature to obtain data from a JavaScript-rendered website. Good job!
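
If you'd rather save this dynamic data than print it, you can reuse the Exporters option and the g.Exports channel from Step 3. Here's a sketch of the adjusted scraper function; the js-products.csv file name is just an example:

Example
// sketch: send the JavaScript-rendered products to Geziyor's export pipeline
// (add Exporters: []export.Exporter{&export.CSV{FileName: "js-products.csv"}} to the Options)
func scraper(g *geziyor.Geziyor, r *client.Response) {

    // loop through the product containers to extract names and prices
    r.HTMLDoc.Find("div.product-info").Each(func(_ int, s *goquery.Selection) {

        // export the data instead of printing it
        g.Exports <- map[string]interface{}{
            "Name":  s.Find(".product-name").Text(),
            "Price": s.Find(".product-price").Text(),
        }
    })
}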

Limitations of Geziyor for Web Scraping

You've tested Geziyor's capabilities in different scraping tasks. While it comes in handy for many popular use cases, it still has a few significant limitations that can hinder your web scraping efforts, especially at a large scale.

Geziyor's most significant shortcoming is that it lacks an evasion mechanism to bypass CAPTCHAs and advanced anti-bot systems. This means you can't use it to scrape protected web pages.

Additionally, the library has had only one release and few updates. Its community is relatively small, and the documentation is sparse, so it's not beginner-friendly. And while Geziyor features proxy management, adding authenticated proxies may require extra steps.

However, if you'd still like to use this tool for scraping, there's good news: a web scraping API is a simple solution to all these drawbacks.

Avoid Getting Blocked While Scraping With Geziyor

The best way to overcome Geziyor's limitations and scrape any website is to use a web scraping API like ZenRows. It's an all-in-one web scraping solution that acts as a headless browser, modifies your request headers, auto-rotates premium proxies, and bypasses CAPTCHAs and other anti-bot mechanisms at scale.

Let's see how ZenRows works by scraping a Cloudflare-protected website, in this case, a G2 Reviews page.

Sign up to open the ZenRows Request Builder and grab your API key, which comes with up to 1,000 free URLs.

Once you're in the Builder, paste the target URL in the link box. Then, activate Premium Proxies and JS Rendering Boost mode. Set your preferred language to Go and choose the API connection mode. Copy and paste the generated code into your scraper file.

Building the scraper request in the ZenRows Request Builder

Your full code should look like this:

Example
package main

import (
    "io"
    "log"
    "net/http"
)

func main() {
    client := &http.Client{}
    // build the ZenRows API request with your API key and target URL
    req, err := http.NewRequest(
        "GET",
        "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fasana%2Freviews&js_render=true&premium_proxy=true",
        nil,
    )
    if err != nil {
        log.Fatalln(err)
    }

    // send the request
    resp, err := client.Do(req)
    if err != nil {
        log.Fatalln(err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatalln(err)
    }

    log.Println(string(body))
}

This code accesses and scrapes the heavily protected website. Check out its full-page HTML output:

Output
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
</head>
<body>
    <!-- other content omitted for brevity -->
</body>

Congratulations! You've just bypassed Cloudflare protection with ZenRows.

Conclusion

In this tutorial, you've explored how Geziyor works and used it to scrape different websites. You now know how to:

  • Scrape a website's full-page HTML with Geziyor.
  • Extract specific elements from a single web page.
  • Use Geziyor's built-in function to export the extracted data to a CSV file.
  • Scrape all the pages of a paginated website.
  • Extract content from a JavaScript-rendered web page.

Geziyor has many useful features, such as crawling and concurrent scraping. However, some of its limitations, such as the inability to bypass anti-bots, can prevent you from accessing your target data. To scrape any website at scale and automatically avoid blocks and bans, it's best to use a web scraping API, such as ZenRows.

Try ZenRows for free today without a credit card!
