Web Crawler with Go: Step-by-Step Tutorial

Rubén del Campo
January 10, 2025 · 10 min read

Most large-scale web scraping projects in Go begin with discovering and organizing URLs using a Golang web crawler. This tool starts from an initial target URL, known as the "seed URL", and recursively visits the links it finds on each page to uncover more URLs.

This guide will teach you how to build and optimize a Golang web crawler using real-world examples. By the end of this tutorial, you'll have a web crawler capable of following all the links on a web page and extracting data from select links.

Before we dive in, here's some background information.

What Is Web Crawling?

Web crawling refers to the process of systematically browsing the web to discover specific information (usually URLs and page links) for various purposes, such as indexing for search engines.

Although the terms web crawling and web scraping are often used interchangeably, they actually refer to different processes with distinct applications.

Web scraping is the process of retrieving information from websites, while web crawling is all about discovering URLs and putting them to use.

Most large-scale data extraction projects require both. For instance, you might first crawl a target domain to discover URLs, then scrape those URLs to extract the desired data.

For more details on the differences between the two, check out our in-depth comparison guide on web crawling vs. web scraping.

Build Your First Golang Web Crawler

In this tutorial, we'll crawl the links from the ScrapingCourse e-commerce test website.

ScrapingCourse.com Ecommerce homepage

This website has many pages, including paginated products, carts, and checkout. After crawling all the links from the seed URL, we'll select some and extract valuable product data.

If you're new to data extraction in Go or want a quick refresher on the topic, check out our guide on web scraping in Golang.

In the meantime, follow the steps below to build your first Golang web crawler.

Step 1: Prerequisites for Building a Golang Web Crawler  

Having the right stack is critical to building a Golang web crawler. Here are the tools you'll need to follow along in this tutorial:

  • Go: Ensure your machine has the latest version of Go. You can download it from the official Go website and follow the installation prompts.
  • Your preferred IDE: In this tutorial, we'll use Visual Studio Code, but you can use any IDE you choose.
  • Colly: You'll require this library to fetch and parse HTML.

Follow the steps below to set up your project and ensure everything's in place.

Run the following command in your terminal to verify your Go installation.

Terminal
go version

If Go runs on your machine, this command will return its version, as in the example below.

Output
go version go1.23.4 windows/386

Next, navigate to a directory where you'd like to store your code and initialize a Go project using the following command.

Terminal
go mod init crawler

This command creates a new go.mod file, where you can add your project's dependencies.

Run the following command to install Colly and all its dependencies from its GitHub repository.

Terminal
go get github.com/gocolly/colly

That's it. You're all set up.

Now, create a Go file (crawler.go), open it in your preferred IDE, and prepare to write some code.

Step 2: Follow All the Links on the Web Page

Let's start with the most basic functionality of a Golang web crawler: making a GET request to the seed URL and retrieving its HTML content.

We'll create a crawl() function to visit and extract the HTML of the target page. This function will take the seed URL as an argument. Later, we'll extend it to find and follow all the links on the page.

crawler.go
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    //define the seed URL
    seedurl := "https://www.scrapingcourse.com/ecommerce/"

    // call the crawl function
    crawl(seedurl)
}

func crawl(currenturl string) {
    // create a new collector
    c := colly.NewCollector(
        colly.AllowedDomains("www.scrapingcourse.com"),
    )

    // extract and log the page title
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Page Title:", e.Text)
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Crawling", r.URL)
    })

    // handle request errors
    c.OnError(func(e *colly.Response, err error) {
        fmt.Println("Request URL:", e.Request.URL, "failed with response:", e, "\nError:", err)
    })

    // visit the seed URL
    err := c.Visit(currenturl)
    if err != nil {
        fmt.Println("Error visiting page:", err)
    }
}

This code restricts crawling to the target domain, preventing unintentional visits to external websites.

It then makes a GET request to the seed URL, retrieves its HTML, and logs the following output.

Output
Crawling https://www.scrapingcourse.com/ecommerce/
Page Title: Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com

Awesome!

But that's only scratching the surface. The next step is to modify your crawler to find and follow all links on the web page.

To do this, you'll need to track visited URLs to ensure you don't visit the same link multiple times. Then, set a depth limit and crawl each link recursively to discover more URLs.

Depth limits control how far your crawler goes from the seed URL. We'll use Colly's MaxDepth option for this: colly.MaxDepth(1) restricts the crawler to the seed URL itself, colly.MaxDepth(2) lets it follow the links found on the seed page but no further, and colly.MaxDepth(0), the value used in this tutorial's example, applies no limit at all.

Additionally, since Colly uses predefined callbacks, you can trigger a callback function whenever your crawler encounters a link. This eliminates the need to create a recursive function from scratch.

Here's a step-by-step guide.

In Go, map keys are unique: assigning to a key that already exists simply overwrites its value. That makes a map a convenient set for tracking visited URLs without duplicates.
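
To illustrate (this snippet is separate from the crawler), writing the same key twice still leaves a single entry, and the zero value doubles as a "not visited yet" check:

Example
package main

import "fmt"

func main() {
    visited := make(map[string]bool)

    visited["https://www.scrapingcourse.com/ecommerce/"] = true
    visited["https://www.scrapingcourse.com/ecommerce/"] = true // same key again: still one entry

    // the zero value (false) means the URL hasn't been seen yet
    if !visited["https://www.scrapingcourse.com/ecommerce/cart/"] {
        fmt.Println("cart page not visited yet")
    }

    fmt.Println("entries:", len(visited)) // prints: entries: 1
}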

Thus, start by initializing a global map to store visited URLs.

crawler.go
//...

// initialize a map to store visited URLs
var visitedurls = make(map[string]bool)

Next, modify your crawl function to take two arguments: the current URL and the max depth. Then, add the colly.MaxDepth option to the Collector object. This option ensures that Colly manages the depth of each request for you.

crawler.go
//...

func crawl(currenturl string, maxdepth int) {
    // instantiate a new collector
    c := colly.NewCollector(
        colly.AllowedDomains("www.scrapingcourse.com"),
        colly.MaxDepth(maxdepth),
    )

    // ...

}

Within the crawl function, and after logging the page title, add an OnHTML callback to find all links on the page and recursively visit each link.

Website links are often defined in anchor tags. Therefore, you can find every URL on the page by selecting the href attribute for all anchor tags.

Colly's OnHTML function lets you select HTML elements using CSS selectors (in this case, a[href]). Within this callback, check whether the current link has already been visited. If not, add it to visitedurls and visit it.

We recommend resolving the href attribute to an absolute URL so you don't try to crawl relative paths. The AbsoluteURL() method resolves relative paths against the current page's URL to form the complete URL.

crawler.go
func crawl(currenturl string, maxdepth int) {
    // ...

    // ----- find and visit all links ---- //
    // select the href attribute of all anchor tags
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        // get absolute URL
        link := e.Request.AbsoluteURL(e.Attr("href"))
        // check if current URL has already been visited
        if link != "" && !visitedurls[link] {
            // add current URL to visitedURLs
            visitedurls[link] = true
            fmt.Println("Found link:", link)
            // visit current URL
            e.Request.Visit(link)
        }
    })

}

That's it.

Now, combine all the steps above and modify the initial crawler.go file accordingly to get the following complete code.

crawler.go
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

// initialize a map to store visited URLs
var visitedurls = make(map[string]bool)

func main() {
    // define the seed URL
    seedurl := "https://www.scrapingcourse.com/ecommerce/"

    // call the crawl function (0 = no depth limit in Colly)
    crawl(seedurl, 0)
}

func crawl(currenturl string, maxdepth int) {
    // instantiate a new collector
    c := colly.NewCollector(
        colly.AllowedDomains("www.scrapingcourse.com"),
        colly.MaxDepth(maxdepth),
    )

    // extract and log the page title
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Page Title:", e.Text)
    })


    // ----- find and visit all links ---- //
    // select the href attribute of all anchor tags
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        // get absolute URL
        link := e.Request.AbsoluteURL(e.Attr("href"))
        // check if current URL has already been visited
        if link != "" && !visitedurls[link] {
            // add current URL to visitedURLs
            visitedurls[link] = true
            fmt.Println("Found link:", link)
            // visit current URL
            e.Request.Visit(link)
        }
    })

    // add an OnRequest callback to track progress
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Crawling", r.URL)
    })

    // handle request errors
    c.OnError(func(e *colly.Response, err error) {
        fmt.Println("Request URL:", e.Request.URL, "failed with response:", e, "\nError:", err)
    })

    // visit the seed URL
    err := c.Visit(currenturl)
    if err != nil {
        fmt.Println("Error visiting page:", err)
    }

}

This code finds and follows the target website's links (until the depth limit is reached, if you set one). Here's what your console would look like:

Output
Crawling https://www.scrapingcourse.com/ecommerce/
Page Title: Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com
Found link: https://www.scrapingcourse.com/ecommerce/
Found link: https://www.scrapingcourse.com/ecommerce/cart/

// ...  truncated for brevity ... //

Congratulations! You've built your first Golang web crawler.

However, most data extraction projects aim to crawl specific links rather than every URL on the page.

Let's see how to modify your code to crawl select URLs. In this example, we'll extract pagination links on the target web page.

To do that, you need to first inspect the page to identify the CSS selector for the pagination links. Right-click on the pagination and select Inspect.

ScrapingCourse Next Page Link Inspection

This will open the DevTools window, which shows the page's HTML structure, as shown in the image below.

scrapingcourse ecommerce homepage devtools

You'll find that there are 12 product pages, all of which share the same page-numbers class.

Using this information, modify the OnHTML callback that crawled all links so it only processes pagination links.

crawler.go
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

// initialize a map to store visited URLs
var visitedurls = make(map[string]bool)

func main() {
    // define the seed URL
    seedurl := "https://www.scrapingcourse.com/ecommerce/"

    // call the crawl function 
    crawl(seedurl, 0)
}

func crawl(currenturl string, maxdepth int) {
    // instantiate a new collector
    c := colly.NewCollector(
        colly.AllowedDomains("www.scrapingcourse.com"),
        colly.MaxDepth(maxdepth),
    )

    // extract and log the page title
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Page Title:", e.Text)
    })


    // ----- find and visit pagination links ---- //
    // select anchor tags with the page-numbers class
    c.OnHTML("a.page-numbers", func(e *colly.HTMLElement) {
        // get absolute URL
        link := e.Request.AbsoluteURL(e.Attr("href"))
        // check if current URL has already been visited
        if link != "" && !visitedurls[link] {
            // add current URL to visitedURLs
            visitedurls[link] = true
            fmt.Println("Found link:", link)
            // visit current URL
            e.Request.Visit(link)
        }
    })

    // add an OnRequest callback to track progress
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Crawling", r.URL)
    })

    // handle request errors
    c.OnError(func(e *colly.Response, err error) {
        fmt.Println("Request URL:", e.Request.URL, "failed with response:", e, "\nError:", err)
    })

    // visit the seed URL
    err := c.Visit(currenturl)
    if err != nil {
        fmt.Println("Error visiting page:", err)
    }

}

This ensures your crawler only finds and follows pagination links. Here's what your output could look like:

Output
// ...
Found link: https://www.scrapingcourse.com/ecommerce/page/10/
Crawling https://www.scrapingcourse.com/ecommerce/page/10/
Page Title: Ecommerce Test Site to Learn Web Scraping - Page 10 - ScrapingCourse.com
Found link: https://www.scrapingcourse.com/ecommerce/page/11/
Crawling https://www.scrapingcourse.com/ecommerce/page/11/
Page Title: Ecommerce Test Site to Learn Web Scraping - Page 11 - ScrapingCourse.com
Found link: https://www.scrapingcourse.com/ecommerce/page/12/
Crawling https://www.scrapingcourse.com/ecommerce/page/12/
Page Title: Ecommerce Test Site to Learn Web Scraping - Page 12 - ScrapingCourse.com

Awesome!

Step 3: Extract Data From Your Crawler

Let's extend your crawler to extract some valuable product information. Once the crawler navigates to each pagination link, we'll extract the following data points:

  • Product name.
  • Product price.
  • Product image.

As before, inspect the page to identify the right selectors for each data point.

scrapingcourse ecommerce homepage inspect first product li

You'll find that each product is a list item with the class product. The following HTML elements within the list items represent each data point.

  • Product name: <h2> with class product-name.
  • Product price: <span> with class product-price.
  • Product image: <img> with class product-image.

Using this information, add an OnHTML callback to find all product items on the current page and extract their name, price, and image URL.

We recommend grouping and managing the scraped data using structs. So, define a global struct to store product details.

crawler.go
// define a struct to store product details
type Product struct {
    Name     string
    Price    string
    ImageURL string
}

// declare a slice to store the products
var products []Product

Create the OnHTML callback function.

crawler.go
func crawl(currenturl string, maxdepth int) {
    // ...

    // ---- extract product details ---- //
    // select product (list item with class product)
    c.OnHTML("li.product", func(e *colly.HTMLElement) {
        // retrieve product name, price, and images
        productName := e.ChildText(".product-name")
        productPrice := e.ChildText(".product-price")
        imageURL := e.ChildAttr(".product-image", "src")
       
        // create a new product instance and add it to the slice
        product := Product{
            Name: productName,
            Price: productPrice,
            ImageURL: imageURL,
           
        }
       
        products = append(products, product)
    })

    // ...
}

Register the callback for extracting product details before the one that handles pagination links. Colly runs OnHTML callbacks in the order they're registered, so this ensures your crawler extracts a page's products before following the next pagination link.

Here's the full code:

crawler.go
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

// initialize a map to store visited URLs
var visitedurls = make(map[string]bool)

// define a struct to store product details
type Product struct {
    Name     string
    Price    string
    ImageURL string
}

// declare a slice to store the products
var products []Product

func main() {
    // define the seed URL
    seedurl := "https://www.scrapingcourse.com/ecommerce/"

    // call the crawl function 
    crawl(seedurl, 0)
}

func crawl(currenturl string, maxdepth int) {
    // instantiate a new collector
    c := colly.NewCollector(
        colly.AllowedDomains("www.scrapingcourse.com"),
        colly.MaxDepth(maxdepth),
    )

    // extract and log the page title
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Page Title:", e.Text)
    })

    // ---- extract product details ---- //
    // select product (list item with class product)
    c.OnHTML("li.product", func(e *colly.HTMLElement) {
        // retrieve product name, price, and images
        productName := e.ChildText(".product-name")
        productPrice := e.ChildText(".product-price")
        imageURL := e.ChildAttr(".product-image", "src")
       
        // create a new product instance and add it to the slice
        product := Product{
            Name: productName,
            Price: productPrice,
            ImageURL: imageURL,
           
        }
       
        products = append(products, product)
        // log product details
        fmt.Printf("Product Name: %s\nProduct Price: %s\nImage URL: %s\n", productName, productPrice, imageURL)     
    })

    // ----- find and visit pagination links ---- //
    // select anchor tags with the page-numbers class
    c.OnHTML("a.page-numbers", func(e *colly.HTMLElement) {
        // get absolute URL
        link := e.Request.AbsoluteURL(e.Attr("href"))
        // check if current URL has already been visited
        if link != "" && !visitedurls[link] {
            // add current URL to visitedURLs
            visitedurls[link] = true
            fmt.Println("Found link:", link)
            // visit current URL
            e.Request.Visit(link)
        }
    }) 

    // add an OnRequest callback to track progress
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Crawling", r.URL)
    })

    // handle request errors
    c.OnError(func(e *colly.Response, err error) {
        fmt.Println("Request URL:", e.Request.URL, "failed with response:", e, "\nError:", err)
    })

    // visit the seed URL
    err := c.Visit(currenturl)
    if err != nil {
        fmt.Println("Error visiting page:", err)
    }

}

This extracts the product name, price, and images whenever the crawler encounters a list item with the class product.

Here's what your terminal would look like:

Output
// ... other content omitted for brevity ... //
Found link: https://www.scrapingcourse.com/ecommerce/page/2/
Crawling https://www.scrapingcourse.com/ecommerce/page/2/
Page Title: Ecommerce Test Site to Learn Web Scraping - Page 2 - ScrapingCourse.com
Product Name: Atlas Fitness Tank
Product Price: $18.00
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mt11-blue_main.jpg 
// ... omitted for brevity ... //

Step 4: Export the Scraped Data to CSV

Storing data in a structured format is often essential for easy analysis. In Colly, the OnScraped callback runs after all other callbacks have finished for a given page, making it a convenient place to export the data collected so far.

Go's standard encoding/csv package lets you initialize a writer and write data to a CSV file.

To export to CSV, we recommend abstracting this functionality into a reusable function and calling it within the OnScraped callback. Because the callback fires after each crawled page, the file is simply rewritten with the full product list every time, so the final file contains all scraped products. This also makes for cleaner, modular code.

Here's a step-by-step guide:

Import the required modules

crawler.go
import (
    // ... 
    
    // import the required modules
    "encoding/csv"
    "os"
)

Define a function to export the scraped data to CSV. Within this function, create a .csv file, initialize a CSV writer, and write the header and data rows.

crawler.go
// function to export scraped data to CSV
func exportToCSV(filename string) {
    // open a CSV file
    file, err := os.Create(filename)
    if err != nil {
        fmt.Println("Error creating CSV file:", err)
        return
    }
    defer file.Close()
    // initialize a CSV writer
    writer := csv.NewWriter(file)
    defer writer.Flush()

    // write the header row
    writer.Write([]string{"Name", "Price", "Image URL"})

    // write the product details
    for _, product := range products {
        writer.Write([]string{product.Name, product.Price, product.ImageURL})
    }
    fmt.Println("Product details exported to", filename)
}

Now, add the OnScraped callback in the crawl() function and call the exportToCSV() function within the callback.

crawler.go
// ...
func crawl (currenturl string, maxdepth int) {
    // ...

    // add the OnScraped callback to define actions after extracting data.
    c.OnScraped(func(r *colly.Response) {
        fmt.Println("Data extraction complete", r.Request.URL)
        // export the collected products to a CSV file after scraping.
        exportToCSV("products.csv")
    })

    // ...

}

// ...

That's it.

Now, combine the steps above to get the following complete code.

crawler.go
package main

import (
    "fmt"
    "encoding/csv"
    "os"

    "github.com/gocolly/colly"
)

// initialize a map to store visited URLs
var visitedurls = make(map[string]bool)

// define a struct to store product details
type Product struct {
    Name     string
    Price    string
    ImageURL string
}

// declare a slice to store the products
var products []Product

func main() {
    // define the seed URL
    seedurl := "https://www.scrapingcourse.com/ecommerce/"

    // call the crawl function 
    crawl(seedurl, 0)
}

func crawl(currenturl string, maxdepth int) {
    // instantiate a new collector
    c := colly.NewCollector(
        colly.AllowedDomains("www.scrapingcourse.com"),
        colly.MaxDepth(maxdepth),
    )

    // extract and log the page title
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Page Title:", e.Text)
    })

    // ---- extract product details ---- //
    // select product (list item with class product)
    c.OnHTML("li.product", func(e *colly.HTMLElement) {
        // retrieve the product name, price, and image URL
        productName := e.ChildText(".product-name")
        productPrice := e.ChildText(".product-price")
        imageURL := e.ChildAttr(".product-image", "src")
       
        // create a new product instance and add it to the slice
        product := Product{
            Name: productName,
            Price: productPrice,
            ImageURL: imageURL,
           
        }
       
        products = append(products, product)
    })

    // ----- find and visit pagination links ---- //
    // select anchor tags with the page-numbers class
    c.OnHTML("a.page-numbers", func(e *colly.HTMLElement) {
        // get absolute URL
        link := e.Request.AbsoluteURL(e.Attr("href"))
        // check if current URL has already been visited
        if link != "" && !visitedurls[link] {
            // add current URL to visitedURLs
            visitedurls[link] = true
            fmt.Println("Found link:", link)
            // visit current URL
            e.Request.Visit(link)
        }
    }) 

    // add the OnScraped callback to define actions after extracting data.
    c.OnScraped(func(r *colly.Response) {
        fmt.Println("Data extraction complete", r.Request.URL)
        // Export the collected products to a CSV file after scraping
        exportToCSV("products.csv")
    })


    // add an OnRequest callback to track progress
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Crawling", r.URL)
    })

    // handle request errors
    c.OnError(func(e *colly.Response, err error) {
        fmt.Println("Request URL:", e.Request.URL, "failed with response:", e, "\nError:", err)
    })

    // visit the seed URL
    err := c.Visit(currenturl)
    if err != nil {
        fmt.Println("Error visiting page:", err)
    }

}

// function to export scraped data to CSV
func exportToCSV(filename string) {
    // open a CSV file
    file, err := os.Create(filename)
    if err != nil {
        fmt.Println("Error creating CSV file:", err)
        return
    }
    defer file.Close()
    // initialize a CSV writer
    writer := csv.NewWriter(file)
    defer writer.Flush()

    // write the header row
    writer.Write([]string{"Name", "Price", "Image URL"})

    // write the product details
    for _, product := range products {
        writer.Write([]string{product.Name, product.Price, product.ImageURL})
    }
    fmt.Println("Product details exported to", filename)
}

This creates a new CSV file in your project's root directory and exports the product details to it. Here's a sample screenshot of the CSV file for context.

CSV Data Export

Awesome! You now know how to crawl links and extract data from your crawler.

Optimize Your Web Crawler

Here are some key areas to consider when optimizing your Golang web crawler.

Avoid Crawling Duplicate URLs

Crawling duplicate links can lead to infinite loops, which waste resources and, even worse, can trigger anti-bot restrictions. Therefore, you should ensure that your crawler visits each link only once.

You can achieve this in different ways. In the current crawler, we used Go's map data structure, whose unique keys prevent duplicate entries, to store visited URLs. The crawler then checks whether a link has already been visited before crawling it.
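
One caveat: the global visitedurls map is safe here because this tutorial's crawler runs synchronously. If you later enable asynchronous crawling (see the best-practices section below), multiple goroutines will read and write the map concurrently, so it needs a lock. Here's a minimal, illustrative sketch of a mutex-protected visited set you could add to crawler.go; the visitedSet type and markIfNew method are hypothetical names, not part of the code above.

crawler.go
// ... (add "sync" to your imports)

// visitedSet wraps the visited-URL map with a mutex for safe concurrent access
type visitedSet struct {
    mu   sync.Mutex
    seen map[string]bool
}

// markIfNew reports whether the URL is new and records it as visited
func (v *visitedSet) markIfNew(url string) bool {
    v.mu.Lock()
    defer v.mu.Unlock()
    if v.seen[url] {
        return false
    }
    v.seen[url] = true
    return true
}

With this in place, the check in the OnHTML callback would become if link != "" && visited.markIfNew(link), where visited is an initialized *visitedSet.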

Prioritize Specific Pages

Prioritizing specific pages helps streamline your crawling process, letting you focus on the links that matter for your project. In the current crawler, we used CSS selectors to target only pagination links and extract valuable product information.

However, if you're interested in all links on the page and want to prioritize pagination, you can maintain separate queues and process pagination links first.

Here's how to modify the initial crawler to prioritize pagination links (product pages).

This code maintains separate lists for pagination links and other links. It then crawls pagination links first.

crawler.go
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

// ...

// create variables to separate pagination links from other links
var paginationURLs = []string{}
var otherURLs = []string{}

func main() {
    // ...
}

func crawl(currenturl string, maxdepth int) {
    // ...

    // ----- find and visit all links ---- //
    // select the href attribute of all anchor tags
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        // get absolute URL
        link := e.Request.AbsoluteURL(e.Attr("href"))
        // check if the current URL has already been visited
        if link != "" && !visitedurls[link] {
            // add current URL to visitedURLs
            visitedurls[link] = true
            if e.Attr("class") == "page-numbers" {
                paginationURLs = append(paginationURLs, link)
             } else {
                otherURLs = append(otherURLs, link)
             }
        }
    })


    // ...

    // process pagination links first
    for len(paginationURLs) > 0 {
        nextURL := paginationURLs[0]
        paginationURLs = paginationURLs[1:]
        visitedurls[nextURL] = true
        err := c.Visit(nextURL)
        if err != nil {
            fmt.Println("Error visiting page:", err)
        }
    }

    // process other links
    for len(otherURLs) > 0 {
        nextURL := otherURLs[0]
        otherURLs = otherURLs[1:]
        visitedurls[nextURL] = true
        err := c.Visit(nextURL)
        if err != nil {
            fmt.Println("Error visiting page:", err)
        }
    }

}

Maintain a Single Crawl Session

Maintaining a single crawl session lets your crawler reuse state, such as cookies, across multiple requests instead of starting from scratch with each one. This reduces connection overhead and improves your overall efficiency.

Moreover, this optimization technique is particularly useful against websites that use rate-limiting technologies to control traffic.

Colly's built-in extensions and SetCookieJar() method allow you to manage and maintain sessions across multiple requests.

Here's how to modify the current crawler to maintain a single session.

crawler.go
package main 

import ( 
    // ... 

    "net/http/cookiejar"

    // ...
)


func crawl(currenturl string, maxdepth int) {
    // ...

    // add an OnRequest callback
    c.OnRequest(func(r *colly.Request) {
        // set custom headers
        r.Headers.Set("User-Agent", "Mozilla/5.0 (compatible; Colly/2.1; +https://github.com/gocolly/colly)")
        fmt.Println("Crawling", r.URL)
    })

    // manage cookies
    cookiesJar, _ := cookiejar.New(nil)
    c.SetCookieJar(cookiesJar)


    // ...
}

However, you're not quite done yet.

Modern websites implement sophisticated anti-bot measures that can block your web crawler, so you must overcome these obstacles before you can benefit from your crawler's efficiency and performance.

Let's see how to handle these anti-bot measures in the next section.

Avoid Getting Blocked While Crawling With Go

Websites employ various techniques to distinguish bots from human traffic. Once your request fails these checks, you get served CAPTCHAs, 403 Forbidden errors, and other blocks.

Web crawlers are quickly detected by anti-bot solutions as they typically send multiple requests in a manner that's easily distinguishable from human behavior. A typical web crawler is designed to find and follow links rapidly, a recipe for triggering anti-bot restrictions.

That said, you can employ best practices, such as proxy rotation, User-Agent spoofing, and reducing request frequency, to make your crawler's requests look more human, as sketched below.
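
For illustration, here's a rough sketch of how these tweaks could look with Colly's proxy and extensions helper packages. The proxy URLs below are placeholders you'd replace with your own, and the pool size and delay values are assumptions, not recommendations.

Example
package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/extensions"
    "github.com/gocolly/colly/proxy"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("www.scrapingcourse.com"),
    )

    // rotate requests across a pool of proxies (placeholder URLs)
    rp, err := proxy.RoundRobinProxySwitcher(
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    )
    if err != nil {
        fmt.Println("Error configuring proxies:", err)
        return
    }
    c.SetProxyFunc(rp)

    // spoof a random User-Agent on every request
    extensions.RandomUserAgent(c)

    // reduce request frequency with a randomized delay
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        RandomDelay: 3 * time.Second,
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Crawling", r.URL)
    })

    c.Visit("https://www.scrapingcourse.com/ecommerce/")
}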

Just bear in mind that manual configurations are not reliable, especially against advanced and evolving anti-bot systems.

To avoid getting blocked while crawling with Go, you must completely emulate natural browsing behavior, which is challenging to achieve with basic evasion measures. Enter ZenRows.

The ZenRows Scraper API abstracts all the complexities of imitating a natural user, providing the easiest and most reliable solution for scalable web crawling.

With features such as advanced anti-bot bypass out of the box, geo-located requests, cookie support for session persistence, fingerprinting evasion, actual user spoofing, request header management, and more, ZenRows handles any anti-bot solution for you. This allows you to focus on extracting the necessary information rather than the intricacies of maintaining manual configurations.

Let's see ZenRows in action against an anti-bot solution.

To follow along in this example, sign up to get your free API key.

Completing your sign-up will take you to the Request Builder page.

Input your target URL and activate Premium Proxies and JS Rendering boost mode. For this example, we'll use the ScrapingCourse Antibot Challenge page as the target URL.

Next, select the Go language and choose the API option. ZenRows works with any language and provides ready-to-use snippets for the most popular ones.

building a scraper with zenrows

Copy the generated code on the right to your editor for testing.

Your code should look like this:

crawler.go
package main

import (
    "io"
    "log"
    "net/http"
)

func main() {
    client := &http.Client{}
    req, err := http.NewRequest("GET", "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.scrapingcourse.com%2Fantibot-challenge&js_render=true&premium_proxy=true", nil)
    if err != nil {
        log.Fatalln(err)
    }
    resp, err := client.Do(req)
    if err != nil {
        log.Fatalln(err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatalln(err)
    }

    log.Println(string(body))
}

This code bypasses the anti-bot challenge and prints the web page's HTML.

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>

Congratulations! You're now well-equipped to crawl any website without getting blocked.  

Web Crawling Tools for Go

The importance of choosing the right web crawling tools cannot be overstated. Here are some tools to consider when creating a Golang web crawler.

  • ZenRows: The all-in-one web crawling tool that empowers you to crawl any website at any scale without getting blocked.
  • Selenium: While it's best known for browser automation, Selenium's ability to render JavaScript like an actual browser (via its Go bindings) extends its application to web crawling.
  • Colly: A popular Go package for fetching and parsing HTML documents. It offers numerous features that can streamline your web crawling process, including callbacks and options like MaxDepth that eliminate the need to write recursive functions manually.

Golang Crawling Best Practices and Considerations

The following best practices can boost your overall efficiency and performance.

Parallel Crawling and Concurrency

Crawling multiple pages sequentially can be inefficient: your web crawler spends most of its time waiting for a response and processing it before moving on to the next request.

However, parallel crawling and Go's concurrency features can significantly reduce your overall crawl time.

At the same time, you must manage concurrency properly to avoid overwhelming the target server and triggering anti-bot restrictions.

Using Colly, you can easily implement parallel crawling by passing colly.Async(true) when initializing your Collector. Colly also provides built-in limit rules that let you cap the number of concurrent requests and introduce delays between them.

Here's an example using the current crawler.

crawler.go
// ...

func crawl(currenturl string, maxdepth int) {
    // instantiate a new collector with async mode enabled
    c := colly.NewCollector(
        colly.AllowedDomains("www.scrapingcourse.com"),
        colly.MaxDepth(maxdepth),
        colly.Async(true),
    )

    // set a concurrency limit and introduce delays between requests
    // (add "time" to your imports for the delay below)
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 5,
        Delay:       2 * time.Second,
    })

    // ...

    // wait for all goroutines to finish
    c.Wait()
}

Crawling JavaScript Rendered Pages in Go

While Colly is a great web crawling tool with numerous built-in features, it cannot crawl JavaScript-rendered (dynamic) content. It only fetches and parses a page's static HTML, and dynamically rendered content isn't present there.

However, you can pair it with a headless browser library such as chromedp to crawl dynamic content.
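
As a starting point, here's a minimal, illustrative sketch using chromedp (install it with go get github.com/chromedp/chromedp). It loads a page in a headless browser, waits for it to render, and captures the resulting HTML, which you could then parse with Colly or goquery. The tutorial's e-commerce URL is used only as a stand-in for a JavaScript-rendered page.

Example
package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/chromedp/chromedp"
)

func main() {
    // create a headless browser context
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    // set an overall timeout for the browser session
    ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
    defer cancel()

    var html string
    // navigate, wait for the body to be ready, then capture the rendered HTML
    err := chromedp.Run(ctx,
        chromedp.Navigate("https://www.scrapingcourse.com/ecommerce/"),
        chromedp.WaitReady("body"),
        chromedp.OuterHTML("html", &html),
    )
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(len(html), "bytes of rendered HTML")
}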

Distributed Web Crawling in Go

Distributed web crawling divides the work across multiple machines or instances to streamline the process and improve efficiency. This is particularly valuable in large-scale crawling projects where scalability is critical.
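
Colly's queue package is one possible building block: it maintains a queue of pending requests with a pluggable storage backend, so consumers on several machines could share a common backend such as Redis. Here's a minimal, single-process sketch with the in-memory storage; the shared-storage setup required for true distribution is not shown, and the seed URLs and thread count are arbitrary.

Example
package main

import (
    "fmt"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/queue"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("www.scrapingcourse.com"),
    )

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Crawling", r.URL)
    })

    // a queue with 4 consumer threads and an in-memory storage backend;
    // swapping the storage for a shared one is what lets multiple machines
    // consume the same queue
    q, err := queue.New(4, &queue.InMemoryQueueStorage{MaxSize: 10000})
    if err != nil {
        fmt.Println("Error creating queue:", err)
        return
    }

    // seed the queue with the URLs to divide among the consumers
    q.AddURL("https://www.scrapingcourse.com/ecommerce/")
    q.AddURL("https://www.scrapingcourse.com/ecommerce/page/2/")

    // run the queue; this blocks until all queued requests are processed
    q.Run(c)
}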

To learn how to build a distributed crawler architecture, check out our distributed web crawling guide.

Conclusion

You've learned how to build a Go web crawler, starting from the basics and moving on to more advanced topics. Remember that while building a web crawler to navigate web pages is a great starting point, you must overcome anti-bot measures to gain access to modern websites.

Rather than toiling with manual configurations that would most likely fail, consider ZenRows, the most reliable solution for bypassing any anti-bot system. Try ZenRows for free to get started today!
