How to Use GoSpider for Web Crawling

Sergio Nonide
February 6, 2025 · 7 min read

GoSpider is a command-line web crawling framework known for its speed. It offers numerous features, all embedded in an intuitive interface that makes it easy to collect data from basic targets that do not require complex crawling logic.

This tutorial will walk you through crawling websites, following links, and scraping valuable data using GoSpider. By the end, you'll be able to discover page URLs and extract data from these links.

Build Your First GoSpider Web Crawler

Real-world examples are the best learning tools. In this tutorial, we'll crawl the ScrapingCourse E-commerce Test site.

ScrapingCourse.com Ecommerce homepage

We'll find and follow product links on the website and also scrape valuable information (product name, price, and image URL) as the crawler navigates each product page.

Step 1: Set up GoSpider

To follow along with this tutorial, you'll need Go installed on your machine. The steps below will help you set up your Go project and install GoSpider.

Run the command below to verify your Go installation.

Terminal
go version

If you have Go installed, this command will return the version, as seen below.

Output
go version go1.23.4 windows/386

Next, navigate to a directory where you'd like to store your code and run the GoSpider installation command.

Terminal
GO111MODULE=on go install github.com/jaeles-project/gospider@latest

This command fetches the latest version of the tool from its GitHub repository and places the compiled executable in your $GOPATH/bin directory. Make sure that directory is on your PATH so the gospider command is available from your terminal.

That's it. You're all set up and ready to start crawling with GoSpider.

But before you dive in, it's important to familiarize yourself with the tool's functionality and options. You can use GoSpider's help menu to get started. To access this menu, run the following command.

Terminal
gospider -h

This provides an overview of GoSpider's options and their usage.

Output
Flags:
  -s, --site string               Site to crawl
  -S, --sites string              Site list to crawl
  -p, --proxy string              Proxy (Ex: http://127.0.0.1:8080)
  -o, --output string             Output folder
  -u, --user-agent string         User Agent to use
                                        web: random web user-agent
                                        mobi: random mobile user-agent
                                        or you can set your special user-agent (default "web")
      --cookie string             Cookie to use (testA=a; testB=b)
  -H, --header stringArray        Header to use (Use multiple flag to set multiple header)

# ... truncated for brevity ... #

Step 2: Access the Target Website

Let's start with a basic GoSpider command to access the target website and crawl all available links on the page.

With just a few parameters, you can instruct GoSpider to find links and save those results in a text file.

Here's what the command looks like (your "hello world" moment with GoSpider):

Terminal
gospider -q -s "https://www.scrapingcourse.com/ecommerce/" -o output

This command fires up GoSpider, which crawls the target page and lists all the links it finds. Here's what each flag means:

  • -q (quiet): suppresses verbose output so only the discovered URLs are printed.
  • -s (site): takes a string argument specifying the URL to crawl.
  • -o (output): tells the crawler to store the results in a folder named output. The text file is named after the target website's domain, www_scrapingcourse_com.

Once you run this command, GoSpider starts processing, and you'll see results similar to the one below.

Output
https://www.scrapingcourse.com/ecommerce/
https://www.scrapingcourse.com
https://www.scrapingcourse.com/ecommerce/feed/
https://www.scrapingcourse.com/ecommerce/comments/feed/
https://www.scrapingcourse.com/ecommerce/shop/feed/
https://www.scrapingcourse.com/ecommerce/cart/

# ... truncated for brevity ... #

Step 3: Crawl Specific Links

Now, let's scale our crawler to find and follow specific links. For this tutorial, we'll keep things simple and focus only on pagination links.

GoSpider provides various command flags that allow you to configure your crawler according to your needs. With the initial "basic" command, GoSpider only finds the links on the target page but doesn't follow them.

However, if you open the target page in a browser, you'll notice that not all pagination elements are immediately visible in the displayed HTML.

scrapingcourse ecommerce homepage inspect

Pages 5, 6, 7, 8, and 9 are missing, so our previous command couldn't find these links. You'll need to crawl across multiple levels to locate them, as they appear in subsequent pagination pages.

To achieve this, GoSpider offers the -d flag, which allows you to set a recursion depth for visited URLs. The default value is 1 (only collect links on the start page), while a depth of 0 enables infinite recursion, which can break your crawler.

Therefore, set a maximum depth of 3 to crawl until you find all the pagination links. This directs GoSpider to find and follow links up to three levels deep from the start URL, ensuring it covers the pagination chain.

Terminal
gospider -q -s "https://www.scrapingcourse.com/ecommerce/" -d 3 -o output

Your output file will look like this:

Output
[href] - https://www.scrapingcourse.com/ecommerce/page/2/
[href] - https://www.scrapingcourse.com/ecommerce/page/3/
[href] - https://www.scrapingcourse.com/ecommerce/page/4/

# ... truncated for brevity ... #

Ideally, you should be able to use GoSpider's --whitelist and --blacklist options to crawl only pagination links. However, they do not work at the time of writing.

As a workaround, you can use command-line tools, such as awk and grep, to filter the result and isolate pagination links.

To do this, you must first inspect the page to identify the pagination structure. Navigate to the target page in a browser, right-click on a pagination element, and select Inspect. This will open the Developer Tools window, as seen in the image below:

scrapingcourse ecommerce homepage devtools

Here, you'll notice that the pagination links all follow the format ecommerce/page/{number}/. Using this information, navigate to your output folder and write a command that isolates the pagination links.

You'll want to eliminate unwanted text columns, handle duplicates, and isolate the desired links.

Terminal
cat www_scrapingcourse_com | awk '{print $3}' | awk '!seen[$0]++' | grep -E "/ecommerce/page/[0-9]+/$" | tee pagination_links

This command filters the result and saves the pagination links in a new text file named pagination_links.

Here's an overview of what each command does:

  • awk '{print $3}': prints only the third column of each line. In this case, it strips the prefixes, such as [href] -, leaving only the URL column.
  • awk '!seen[$0]++': removes duplicate lines while preserving their order.
  • grep -E "/ecommerce/page/[0-9]+/$": keeps only links that end with /ecommerce/page/{number}/.
  • tee pagination_links: writes the filtered result to a new file named pagination_links (and also prints it to the terminal).

Your output should look like this:

Output
# ... omitted for brevity ... #
https://www.scrapingcourse.com/ecommerce/page/10/
https://www.scrapingcourse.com/ecommerce/page/11/
https://www.scrapingcourse.com/ecommerce/page/12/
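
If awk and grep aren't available on your system (for example, on a plain Windows setup), you can achieve the same filtering with a few lines of Go. The sketch below is an optional alternative: it assumes the GoSpider output file is at output/www_scrapingcourse_com and writes the matching URLs to pagination_links, so adjust the file names to match your setup.

filter.go
package main

import (
    "bufio"
    "log"
    "os"
    "regexp"
    "strings"
)

func main() {
    // open the GoSpider output file (adjust the path if yours differs)
    in, err := os.Open("output/www_scrapingcourse_com")
    if err != nil {
        log.Fatalf("Error opening input file: %v", err)
    }
    defer in.Close()

    // create the file that will hold the pagination links
    out, err := os.Create("pagination_links")
    if err != nil {
        log.Fatalf("Error creating output file: %v", err)
    }
    defer out.Close()

    // match links that end with /ecommerce/page/{number}/
    re := regexp.MustCompile(`/ecommerce/page/[0-9]+/$`)
    seen := make(map[string]bool)

    scanner := bufio.NewScanner(in)
    for scanner.Scan() {
        // each line looks like "[href] - https://...", so keep the last field
        fields := strings.Fields(scanner.Text())
        if len(fields) == 0 {
            continue
        }
        url := fields[len(fields)-1]
        // skip duplicates and non-pagination links
        if seen[url] || !re.MatchString(url) {
            continue
        }
        seen[url] = true
        out.WriteString(url + "\n")
    }
    if err := scanner.Err(); err != nil {
        log.Fatalf("Error reading input file: %v", err)
    }
}

Since this is a standalone program with its own main function, keep it in a separate folder from the crawler you'll build next, or delete it once pagination_links has been generated.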

Now that you've successfully isolated pagination links, the next step is to extract product information.

Step 4: Extract Data From Collected Links

While GoSpider allows you to quickly crawl all links starting from the seed URL, it doesn't offer built-in support for downloading and processing HTML pages. However, you can scrape data from the collected links using other Go frameworks, such as the powerful Colly.

Colly is a lightning-fast Go library that allows you to extract structured data from websites. Therefore, you can integrate GoSpider's output with Colly to extract product information.

We'll access each pagination link and extract the product name, price, and image URL. Below is a step-by-step guide.

Start by initializing a Go project using the following command.

Terminal
go mod init crawler

Then, install Colly.

Terminal
go get github.com/gocolly/colly/v2

Next, create a Go file (crawler.go) and prepare to write some code. Also, ensure that the file containing your pagination links and this Go file are in the same root directory.

In your crawler.go file, import the required libraries and open the pagination links file using Go's os.Open() method.

crawler.go
package main

// import the required libraries
import (
    "bufio"
    "fmt"
    "log"
    "os"
)

func main() {
    // open the file with pagination links
    file, err := os.Open("pagination_links")
    if err != nil {
        log.Fatalf("Error opening file: %v", err)
    }
    defer file.Close()

}

Then, initialize a new scanner instance to read the file line by line. As a starting point, print each line to verify that your code is working correctly.

crawler.go
// ...

func main() {
    // ...
    
    // initialize a new scanner instance to read the file line by line
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        url := scanner.Text()
        fmt.Println("Crawling:", url)
    }

}

If everything works correctly, this code will log each line of your pagination_links file.

Now, using Colly, access each pagination link and extract product data. To achieve this, import Colly and initialize a new Collector.

crawler.go
package main

// import the required libraries
import (
    // ...
    "github.com/gocolly/colly/v2"
)

func main() {
    // initialize a new collector 	
    c := colly.NewCollector()

    // ...
}

After that, inspect a product card to identify the right selectors for the desired data points (product name, price, and image URL).

scrapingcourse ecommerce homepage inspect first product li

You'll notice that each product is a list item with the class product. The following HTML elements within the list items represent each data point.

  • Product name: <h2> with class product-name.
  • Product price: <span> with class product-price.
  • Product image: <img> with class product-image.

Using this information, create an OnHTML() callback to find all product elements on the page and extract their product name, price, and image URL.

Colly's OnHTML is a core feature that allows you to identify and retrieve data from HTML elements using CSS selectors. The callback fires whenever your crawler encounters an HTML element that matches the defined selectors.

Additionally, we recommend organizing and managing the scraped data using structs in Go. To achieve this, define a global struct to store the product details.

crawler.go
// ... 

// define a struct to store product details
type Product struct {
    Name     string
    Price    string
    ImageURL string
}

// declare a slice to store the products
var products []Product

func main() {
    // ...
    
    // select product (list item with class product)
    c.OnHTML("li.product", func(e *colly.HTMLElement) {
        // retrieve product name, price, and images
        productName := e.ChildText(".product-name")
        productPrice := e.ChildText(".product-price")
        imageURL := e.ChildAttr(".product-image", "src")
       
        // create a new product instance and add it to the slice
        product := Product{
            Name:     productName,
            Price:    productPrice,
            ImageURL: imageURL,
        }
       
        products = append(products, product)
    })

    // ...
    
}

Lastly, modify the scanner to visit each link, line by line.

crawler.go
// ...


func main() {
    // ...

    // initialize scanner instance to read file line by line
    scanner := bufio.NewScanner(file) 
    // for each line visit the pagination link
    for scanner.Scan() { 
        url := scanner.Text()	
        fmt.Println("Crawling:", url)
        err := c.Visit(url) 		
        if err != nil { 			
            log.Printf("Error visiting %s: %v", url, err) 		
        } 
    
    }
}

That's it.

Now, combine all the steps to get the following complete code.

crawler.go
package main

// import the required libraries
import (
    "bufio"
    "fmt"
    "log"
    "os"

    "github.com/gocolly/colly/v2"
)

// define a struct to store product details
type Product struct {
    Name     string
    Price    string
    ImageURL string
}

// declare a slice to store the products
var products []Product

func main() {
    // initialize a new collector 	
    c := colly.NewCollector()

    // open the file with pagination links
    file, err := os.Open("pagination_links")
    if err != nil {
        log.Fatalf("Error opening file: %v", err)
    }
    defer file.Close()

    // select product (list item with class product)
    c.OnHTML("li.product", func(e *colly.HTMLElement) {
        // retrieve product name, price, and images
        productName := e.ChildText(".product-name")
        productPrice := e.ChildText(".product-price")
        imageURL := e.ChildAttr(".product-image", "src")
       
        // create a new product instance and add it to the slice
        product := Product{
            Name:     productName,
            Price:    productPrice,
            ImageURL: imageURL,
        }
       
        products = append(products, product)
        fmt.Printf("Product Name: %s\nProduct Price: %s\nImage URL: %s\n", productName, productPrice, imageURL)

    })

    // initialize scanner instance to read file line by line
    scanner := bufio.NewScanner(file) 
    // for each line visit the pagination link
    for scanner.Scan() { 
        url := scanner.Text()	
        fmt.Println("Crawling:", url)
        err := c.Visit(url) 		
        if err != nil { 			
            log.Printf("Error visiting %s: %v", url, err) 		
        } 
    }
}
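
Save the file and run the crawler from your project directory. If Go reports missing module dependencies, run go mod tidy first to pull in Colly.

Terminal
go mod tidy
go run crawler.go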

This extracts each product's name, price, and image URL on each page. Here's what your terminal would look like.

Output
Crawling: https://www.scrapingcourse.com/ecommerce/page/9/
Product Name: Pierce Gym Short
Product Price: $27.00
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/msh12-red_main.jpg 
Product Name: Portia Capri
Product Price: $49.00
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wp13-orange_main.jpg

// ... truncated for brevity ... //

Step 5: Export the Scraped Data to CSV

One way to turn data into actionable insights is to export it to CSV for further analysis. You can do this in Go using the standard encoding/csv package, which provides a csv.Writer for writing records to a CSV file.

Since we're using Colly, you can use its OnScraped() callback, which fires after each page has been fully scraped, to trigger the export.

In that case, we'll create a function to export the scraped data to CSV, then call that function within the OnScraped() callback. Because the callback runs after every page, the CSV file is simply rewritten with the full product list each time, so the final file contains everything collected.
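
For reference, here's how that callback will be wired up inside main(); this excerpt matches the complete code shown at the end of this step:

crawler.go
// ...

func main() {
    // ...

    // run the export after each page has been scraped
    c.OnScraped(func(r *colly.Response) {
        fmt.Println("Data extraction complete", r.Request.URL)
        // rewrite the CSV with everything collected so far
        exportToCSV("product_data.csv")
    })

    // ...
}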

Here's a step-by-step guide:

Import the encoding/csv package and declare a function that will export the scraped data to CSV.

crawler.go
package main

// import the required modules
import (
    // ... 
    
    "encoding/csv"
)

// function to export scraped data to CSV
func exportToCSV(filename string) {
    // define logic to export to csv
}

This function creates a CSV file, initializes a csv.Writer, writes the header row, and populates the remaining rows with the scraped data.

crawler.go
// function to export scraped data to CSV
func exportToCSV(filename string) {
    // open a CSV file
    file, err := os.Create(filename)
    if err != nil {
        fmt.Println("Error creating CSV file:", err)
        return
    }
    defer file.Close()
    // initialize a CSV writer
    writer := csv.NewWriter(file)
    defer writer.Flush()

    // write the header row
    writer.Write([]string{"Name", "Price", "Image URL"})

    // write the product details
    for _, product := range products {
        writer.Write([]string{product.Name, product.Price, product.ImageURL})
    }
    fmt.Println("Product details exported to", filename)
}

That's it.

To verify that everything works, combine the steps above, add the OnScraped() callback, and call the exportToCSV() function within this callback.

You'll get the following complete code:

crawler.go
package main

// import the required libraries
import (
    "bufio"
    "encoding/csv"
    "fmt"
    "log"
    "os"

    "github.com/gocolly/colly/v2"
)

// define a struct to store product details
type Product struct {
    Name     string
    Price    string
    ImageURL string
}

// declare a slice to store the products
var products []Product

func main() {
    // initialize a new collector 	
    c := colly.NewCollector()

    // open the file with pagination links
    file, err := os.Open("pagination_links")
    if err != nil {
        log.Fatalf("Error opening file: %v", err)
    }
    defer file.Close()

    // select product (list item with class product)
    c.OnHTML("li.product", func(e *colly.HTMLElement) {
        // retrieve product name, price, and images
        productName := e.ChildText(".product-name")
        productPrice := e.ChildText(".product-price")
        imageURL := e.ChildAttr(".product-image", "src")
       
        // create a new product instance and add it to the slice
        product := Product{
            Name:     productName,
            Price:    productPrice,
            ImageURL: imageURL,
        }
       
        products = append(products, product)
        fmt.Printf("Product Name: %s\nProduct Price: %s\nImage URL: %s\n", productName, productPrice, imageURL)

    })

    // add the OnScraped callback to run after each page is scraped
    c.OnScraped(func(r *colly.Response) {
        fmt.Println("Data extraction complete", r.Request.URL)
        // export the collected products to a CSV file after scraping.
        exportToCSV("product_data.csv")
    })
    // initialize scanner instance to read file line by line
    scanner := bufio.NewScanner(file) 
    // for each line visit the pagination link
    for scanner.Scan() { 
        url := scanner.Text()	
        fmt.Println("Crawling:", url)
        err := c.Visit(url) 		
        if err != nil { 			
            log.Printf("Error visiting %s: %v", url, err) 		
        } 
    
    }
}

// function to export scraped data to CSV
func exportToCSV(filename string) {
    // open a CSV file
    file, err := os.Create(filename)
    if err != nil {
        fmt.Println("Error creating CSV file:", err)
        return
    }
    defer file.Close()
    // initialize a CSV writer
    writer := csv.NewWriter(file)
    defer writer.Flush()

    // write the header row
    writer.Write([]string{"Name", "Price", "Image URL"})

    // write the product details
    for _, product := range products {
        writer.Write([]string{product.Name, product.Price, product.ImageURL})
    }
    fmt.Println("Product details exported to", filename)
}

This exports the scraped data to a CSV file named product_data.csv in your project's root directory.

Here's a sample screenshot for reference.

CSV Data Export

Congratulations! You now know how to use GoSpider for web crawling and also export scraped data to CSV.

Avoid Getting Blocked While Crawling With GoSpider

Getting blocked is a common challenge when web crawling. This is because web crawlers exhibit patterns that make it easy for anti-bot solutions to identify and block your requests.

Here's a GoSpider command attempting to crawl the Antibot Challenge page, a protected website.

Terminal
gospider -s "https://www.scrapingcourse.com/antibot-challenge" -o output

You'll get the following 403 error, indicating that the target server understood your requests but refused to fulfill them.

Output
[url] - [code-403] - https://www.scrapingcourse.com/antibot-challenge

This happens because GoSpider is unable to pass the anti-bot challenge and ultimately gets blocked.

Common recommendations for overcoming this challenge include rotating proxies and setting custom user agents. However, these measures do not work against advanced anti-bot solutions.
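
For completeness, GoSpider does expose the -p (proxy) and -u (user agent) flags shown in the help output earlier. For example, the following command routes requests through a local proxy at 127.0.0.1:8080 (swap in a proxy you actually control) and uses a random web user agent:

Terminal
gospider -s "https://www.scrapingcourse.com/antibot-challenge" -p "http://127.0.0.1:8080" -u "web" -o output

Even with these flags set, advanced anti-bot systems like the one above will still detect and block the crawler.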

To guarantee you can crawl any website without getting blocked, consider ZenRows' Universal Scraper API, the most reliable solution for scalable web crawling.

ZenRows is a complete web scraping toolkit that handles every anti-bot solution for you, allowing you to focus on extracting your desired data. Some of its features include advanced anti-bot bypass out of the box, geo-located requests, fingerprinting evasion, actual user spoofing, request header management, and more.

Here's ZenRows in action against the same anti-bot challenge where GoSpider failed.

To follow along in this example, sign up to get your free API key.

Completing your sign-up will take you to the Request Builder page, where you'll find your API key at the top right.

building a scraper with zenrows

Input your target URL and activate Premium Proxies and JS Rendering boost mode.

Next, select the Go language and choose the API option. ZenRows works with any language and provides ready-to-use snippets for the most popular ones.

Copy the generated code on the right to your editor for testing.

Your code should look like this:

crawler.go
package main

import (
    "io"
    "log"
    "net/http"
)

func main() {
    client := &http.Client{}
    req, err := http.NewRequest("GET", "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.scrapingcourse.com%2Fantibot-challenge&js_render=true&premium_proxy=true", nil)
    if err != nil {
        log.Fatalln(err)
    }
    resp, err := client.Do(req)
    if err != nil {
        log.Fatalln(err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatalln(err)
    }

    log.Println(string(body))
}

This code bypasses the anti-bot challenge and retrieves the HTML.

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>

Congratulations! You're well-equipped to crawl any website without getting blocked.  

Conclusion

You've learned how to crawl websites using GoSpider. From setting up your project to integrating with other Go tools, here's a quick recap of your progress.

You now know how to:

  • Crawl specific links.
  • Extract data from collected links.
  • Export scraped data to CSV.

Bear in mind that to put these crawling skills to work, you must first overcome anti-bot challenges. GoSpider is a useful crawling tool, but advanced anti-bot solutions will block your GoSpider crawler.

To crawl any website without getting blocked, consider ZenRows, an easy-to-implement and scalable solution.

Sign up now to try ZenRows for free.
