Most large-scale web scraping projects in Go begin with discovering and organizing URLs using a Golang web crawler. This tool starts from an initial target page, known as the "seed URL", and recursively visits the links it finds there to uncover more links.
This guide will teach you how to build and optimize a Golang web crawler using real-world examples. By the end of this tutorial, you'll have a web crawler capable of following all the links on a web page and extracting data from select links.
Before we dive in, here's some background information.
What Is Web Crawling?
Web crawling refers to the process of systematically browsing the web to discover specific information (usually URLs and page links) for various purposes, such as indexing for search engines.
Although the terms web crawling and web scraping are often used interchangeably, they actually refer to different processes with distinct applications.
Web scraping is the process of retrieving information from websites, while web crawling is all about discovering URLs and putting them to use.
Most large-scale data extraction projects require both. For instance, you might first crawl a target domain to discover URLs, then scrape those URLs to extract the desired data.
For more details on the differences between the two, check out our in-depth comparison guide on web crawling vs. web scraping.
Build Your First Golang Web Crawler
In this tutorial, we'll crawl the links from the ScrapingCourse e-commerce test website.
This website has many pages, including paginated products, carts, and checkout. After crawling all the links from the seed URL, we'll select some and extract valuable product data.
If you're new to data extraction in Go or want a quick refresher on the topic, check out our guide on web scraping in Golang.
In the meantime, follow the steps below to build your first Golang web crawler.
Step 1: Prerequisites for Building a Golang Web Crawler
Having the right stack is critical to building a Golang web crawler. Here are the tools you'll need to follow along in this tutorial:
- Go: Ensure your machine has the latest version of Go. You can download it from the official Go website and follow the installation prompts.
- Your preferred IDE: In this tutorial, we'll use Visual Studio Code, but you can use any IDE you choose.
- Colly: You'll need this library to fetch and parse HTML.
Follow the steps below to set up your project and ensure everything's in place.
Run the following command in your terminal to verify your Go installation.
go version
If Go runs on your machine, this command will return its version, as in the example below.
go version go1.23.4 windows/386
Next, navigate to a directory where you'd like to store your code and initialize a Go project using the following command.
go mod init crawler
This command creates a new go.mod file, where you can add your project's dependencies.
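For reference, the generated file is minimal at this point. It should look roughly like the following, with the go directive reflecting whichever Go version you have installed:
module crawler

go 1.23.4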
Run the following command to install Colly and all its dependencies from its GitHub repository.
go get github.com/gocolly/colly
That's it. You're all set up.
Now, create a Go file (crawler.go), open it in your preferred IDE, and prepare to write some code.
Step 2: Follow All the Links on a Website
Let's start with the most basic functionality of a Golang web crawler: making a GET request to the seed URL and retrieving its HTML content.
We'll create a crawl() function to visit the target page and extract its HTML. This function will take the seed URL as an argument. Later, we'll extend it to find and follow all the links on the page.
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    // define the seed URL
    seedurl := "https://www.scrapingcourse.com/ecommerce/"
    // call the crawl function
    crawl(seedurl)
}

func crawl(currenturl string) {
    // create a new collector
    c := colly.NewCollector(
        colly.AllowedDomains("www.scrapingcourse.com"),
    )

    // extract and log the page title
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Page Title:", e.Text)
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Crawling", r.URL)
    })

    // handle request errors
    c.OnError(func(e *colly.Response, err error) {
        fmt.Println("Request URL:", e.Request.URL, "failed with response:", e, "\nError:", err)
    })

    // visit the seed URL
    err := c.Visit(currenturl)
    if err != nil {
        fmt.Println("Error visiting page:", err)
    }
}
This code restricts crawling to the target domain, preventing unintentional visits to external websites.
It then makes a GET request to the seed URL, retrieves its HTML, and logs the following output.
Crawling https://www.scrapingcourse.com/ecommerce/
Page Title: Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com
Awesome!
But that's only scratching the surface. The next step is to modify your crawler to find and follow all links on the web page.
To do this, you'll need to track visited URLs to ensure you don't visit the same link multiple times. Then, set a depth limit and crawl each link recursively to discover more URLs.
Depth limits control how deep your crawler goes from the seed URL. In Colly, a MaxDepth of 1 restricts crawling to the seed URL itself, a value of 2 lets the crawler follow the links found on the seed URL but no further, and 0 disables the limit entirely (which is why the examples below pass 0 as the maxdepth argument).
Additionally, since Colly uses predefined callbacks, you can trigger a callback function whenever your crawler encounters a link. This eliminates the need to create a recursive function from scratch.
Here's a step-by-step guide.
In Go, the map data structure handles duplicates for you: keys are unique, so adding an entry that already exists simply overwrites it.
Thus, start by initializing a global map to store visited URLs.
//...
// initialize a map to store visited URLs
var visitedurls = make(map[string]bool)
Next, modify your crawl function to take two arguments: the current URL and the max depth. Then, add the colly.MaxDepth option to the Collector object. This option ensures that Colly manages the depth of each request for you.
//...
func crawl(currenturl string, maxdepth int) {
    // instantiate a new collector
    c := colly.NewCollector(
        colly.AllowedDomains("www.scrapingcourse.com"),
        colly.MaxDepth(maxdepth),
    )
    // ...
}
Within the crawl function, after logging the page title, add an OnHTML callback to find all links on the page and recursively visit each one.
Website links are defined in anchor tags, so you can find every URL on the page by selecting the href attribute of all anchor tags.
Colly's OnHTML function lets you select HTML elements using CSS selectors (in this case, a[href]). Within this callback, check whether the link has already been visited. If not, add it to visitedurls and then visit it.
We recommend resolving the href attribute to an absolute URL to avoid trying to crawl relative paths. The AbsoluteURL() method resolves relative paths against the current page's URL to form the complete link.
func crawl(currenturl string, maxdepth int) {
    // ...
    // ----- find and visit all links ---- //
    // select the href attribute of all anchor tags
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        // get absolute URL
        link := e.Request.AbsoluteURL(e.Attr("href"))
        // check if current URL has already been visited
        if link != "" && !visitedurls[link] {
            // add current URL to visitedurls
            visitedurls[link] = true
            fmt.Println("Found link:", link)
            // visit current URL
            e.Request.Visit(link)
        }
    })
}
That's it.
Now, combine all the steps above and modify the initial crawler.go file accordingly to get the following complete code.
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

// initialize a map to store visited URLs
var visitedurls = make(map[string]bool)

func main() {
    // define the seed URL
    seedurl := "https://www.scrapingcourse.com/ecommerce/"
    // call the crawl function
    crawl(seedurl, 0)
}

func crawl(currenturl string, maxdepth int) {
    // instantiate a new collector
    c := colly.NewCollector(
        colly.AllowedDomains("www.scrapingcourse.com"),
        colly.MaxDepth(maxdepth),
    )

    // extract and log the page title
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Page Title:", e.Text)
    })

    // ----- find and visit all links ---- //
    // select the href attribute of all anchor tags
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        // get absolute URL
        link := e.Request.AbsoluteURL(e.Attr("href"))
        // check if current URL has already been visited
        if link != "" && !visitedurls[link] {
            // add current URL to visitedurls
            visitedurls[link] = true
            fmt.Println("Found link:", link)
            // visit current URL
            e.Request.Visit(link)
        }
    })

    // add an OnRequest callback to track progress
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Crawling", r.URL)
    })

    // handle request errors
    c.OnError(func(e *colly.Response, err error) {
        fmt.Println("Request URL:", e.Request.URL, "failed with response:", e, "\nError:", err)
    })

    // visit the seed URL
    err := c.Visit(currenturl)
    if err != nil {
        fmt.Println("Error visiting page:", err)
    }
}
This code finds and follows the target website's links (with no depth limit, since we passed 0 for maxdepth) until it runs out of new URLs within the allowed domain. Here's what your console would look like:
Crawling https://www.scrapingcourse.com/ecommerce/
Page Title: Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com
Found link: https://www.scrapingcourse.com/ecommerce/
Found link: https://www.scrapingcourse.com/ecommerce/cart/
// ... truncated for brevity ... //
Congratulations! You've built your first Golang web crawler.
However, most data extraction projects aim to crawl specific links rather than every URL on the page.
Let's see how to modify your code to crawl select URLs. In this example, we'll extract pagination links on the target web page.
To do that, you need to first inspect the page to identify the CSS selector for the pagination links. Right-click on the pagination and select Inspect.
This will open the DevTools window, which shows the page's HTML structure, as shown in the image below.
You'll find that there are 12 product pages, all of which share the same page-numbers class.
Using this information, modify the OnHTML callback that crawled all links so that it only processes pagination links.
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

// initialize a map to store visited URLs
var visitedurls = make(map[string]bool)

func main() {
    // define the seed URL
    seedurl := "https://www.scrapingcourse.com/ecommerce/"
    // call the crawl function
    crawl(seedurl, 0)
}

func crawl(currenturl string, maxdepth int) {
    // instantiate a new collector
    c := colly.NewCollector(
        colly.AllowedDomains("www.scrapingcourse.com"),
        colly.MaxDepth(maxdepth),
    )

    // extract and log the page title
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Page Title:", e.Text)
    })

    // ----- find and visit pagination links ---- //
    // select anchor tags with the page-numbers class
    c.OnHTML("a.page-numbers", func(e *colly.HTMLElement) {
        // get absolute URL
        link := e.Request.AbsoluteURL(e.Attr("href"))
        // check if current URL has already been visited
        if link != "" && !visitedurls[link] {
            // add current URL to visitedurls
            visitedurls[link] = true
            fmt.Println("Found link:", link)
            // visit current URL
            e.Request.Visit(link)
        }
    })

    // add an OnRequest callback to track progress
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Crawling", r.URL)
    })

    // handle request errors
    c.OnError(func(e *colly.Response, err error) {
        fmt.Println("Request URL:", e.Request.URL, "failed with response:", e, "\nError:", err)
    })

    // visit the seed URL
    err := c.Visit(currenturl)
    if err != nil {
        fmt.Println("Error visiting page:", err)
    }
}
This ensures your crawler only finds and follows pagination links. Here's what your output could look like:
// ...
Found link: https://www.scrapingcourse.com/ecommerce/page/10/
Crawling https://www.scrapingcourse.com/ecommerce/page/10/
Page Title: Ecommerce Test Site to Learn Web Scraping - Page 10 - ScrapingCourse.com
Found link: https://www.scrapingcourse.com/ecommerce/page/11/
Crawling https://www.scrapingcourse.com/ecommerce/page/11/
Page Title: Ecommerce Test Site to Learn Web Scraping - Page 11 - ScrapingCourse.com
Found link: https://www.scrapingcourse.com/ecommerce/page/12/
Crawling https://www.scrapingcourse.com/ecommerce/page/12/
Page Title: Ecommerce Test Site to Learn Web Scraping - Page 12 - ScrapingCourse.com
Awesome!
Step 3: Extract Data From Your Crawler
Let's extend your crawler to extract some valuable product information. Once the crawler navigates to each pagination link, we'll extract the following data points:
- Product name.
- Product price.
- Product image.
As before, inspect the page to identify the right selectors for each data point.
You'll find that each product is a list item with the class product. The following HTML elements within each list item represent the data points:
- Product name: an <h2> tag with the class product-name.
- Product price: a span element with the class product-price.
- Product image: an <img> tag with the class product-image.
Using this information, add an OnHTML callback to find all product items on the current page and extract their product name, price, and image URL.
We recommend grouping and managing the scraped data using structs. So, define a global struct to store product details.
// define a struct to store product details
type Product struct {
    Name     string
    Price    string
    ImageURL string
}

// declare a slice to store the products
var products []Product
Next, create the OnHTML callback function inside crawl().
func crawl(currenturl string, maxdepth int) {
    // ...
    // ---- extract product details ---- //
    // select product (list item with class product)
    c.OnHTML("li.product", func(e *colly.HTMLElement) {
        // retrieve product name, price, and image
        productName := e.ChildText(".product-name")
        productPrice := e.ChildText(".product-price")
        imageURL := e.ChildAttr(".product-image", "src")

        // create a new product instance and add it to the slice
        product := Product{
            Name:     productName,
            Price:    productPrice,
            ImageURL: imageURL,
        }
        products = append(products, product)
    })
    // ...
}
Add the callback for extracting product details before the one that handles pagination links. This ensures that your crawler extracts product details once it visits a page.
Here's the full code:
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

// initialize a map to store visited URLs
var visitedurls = make(map[string]bool)

// define a struct to store product details
type Product struct {
    Name     string
    Price    string
    ImageURL string
}

// declare a slice to store the products
var products []Product

func main() {
    // define the seed URL
    seedurl := "https://www.scrapingcourse.com/ecommerce/"
    // call the crawl function
    crawl(seedurl, 0)
}

func crawl(currenturl string, maxdepth int) {
    // instantiate a new collector
    c := colly.NewCollector(
        colly.AllowedDomains("www.scrapingcourse.com"),
        colly.MaxDepth(maxdepth),
    )

    // extract and log the page title
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Page Title:", e.Text)
    })

    // ---- extract product details ---- //
    // select product (list item with class product)
    c.OnHTML("li.product", func(e *colly.HTMLElement) {
        // retrieve product name, price, and image
        productName := e.ChildText(".product-name")
        productPrice := e.ChildText(".product-price")
        imageURL := e.ChildAttr(".product-image", "src")

        // create a new product instance and add it to the slice
        product := Product{
            Name:     productName,
            Price:    productPrice,
            ImageURL: imageURL,
        }
        products = append(products, product)

        // log product details
        fmt.Printf("Product Name: %s\nProduct Price: %s\nImage URL: %s\n", productName, productPrice, imageURL)
    })

    // ----- find and visit pagination links ---- //
    // select anchor tags with the page-numbers class
    c.OnHTML("a.page-numbers", func(e *colly.HTMLElement) {
        // get absolute URL
        link := e.Request.AbsoluteURL(e.Attr("href"))
        // check if current URL has already been visited
        if link != "" && !visitedurls[link] {
            // add current URL to visitedurls
            visitedurls[link] = true
            fmt.Println("Found link:", link)
            // visit current URL
            e.Request.Visit(link)
        }
    })

    // add an OnRequest callback to track progress
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Crawling", r.URL)
    })

    // handle request errors
    c.OnError(func(e *colly.Response, err error) {
        fmt.Println("Request URL:", e.Request.URL, "failed with response:", e, "\nError:", err)
    })

    // visit the seed URL
    err := c.Visit(currenturl)
    if err != nil {
        fmt.Println("Error visiting page:", err)
    }
}
This extracts the product name, price, and image URL whenever the crawler encounters a list item with the class product.
Here's what your terminal would look like:
// ... other content omitted for brevity ... //
Found link: https://www.scrapingcourse.com/ecommerce/page/2/
Crawling https://www.scrapingcourse.com/ecommerce/page/2/
Page Title: Ecommerce Test Site to Learn Web Scraping - Page 2 - ScrapingCourse.com
Product Name: Atlas Fitness Tank
Product Price: $18.00
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mt11-blue_main.jpg
// ... omitted for brevity ... //
Step 4: Export the Scraped Data to CSV
Storing data in a structured format is often essential for easy analysis. In Colly, the OnScraped callback runs after a page has been fully processed (that is, after all of its OnHTML callbacks), making it a convenient place to export the data collected so far.
Also, Go's encoding/csv package lets you create a CSV writer and write records to a file.
So, to export to CSV, we recommend abstracting this functionality into a reusable function and calling it within the OnScraped callback. Because OnScraped fires for every crawled page, the export simply rewrites products.csv with the full products slice each time, so the final file always contains everything collected. This also keeps the code clean and modular.
Here's a step-by-step guide:
Import the required modules
import (
    // ...
    // import the required modules
    "encoding/csv"
    "os"
)
Define a function to export the scraped data to CSV. Within this function, create a .csv file, initialize a CSV writer, and write the header and product rows.
// function to export scraped data to CSV
func exportToCSV(filename string) {
    // create the CSV file
    file, err := os.Create(filename)
    if err != nil {
        fmt.Println("Error creating CSV file:", err)
        return
    }
    defer file.Close()

    // initialize a CSV writer
    writer := csv.NewWriter(file)
    defer writer.Flush()

    // write the header row
    writer.Write([]string{"Name", "Price", "Image URL"})

    // write the product details
    for _, product := range products {
        writer.Write([]string{product.Name, product.Price, product.ImageURL})
    }

    fmt.Println("Product details exported to", filename)
}
Now, add the OnScraped callback in the crawl() function and call the exportToCSV() function within the callback.
// ...
func crawl(currenturl string, maxdepth int) {
    // ...
    // add the OnScraped callback to define actions after extracting data
    c.OnScraped(func(r *colly.Response) {
        fmt.Println("Data extraction complete", r.Request.URL)
        // export the collected products to a CSV file after scraping
        exportToCSV("products.csv")
    })
    // ...
}
// ...
That's it.
Now, combine the steps above to get the following complete code.
package main

import (
    "encoding/csv"
    "fmt"
    "os"

    "github.com/gocolly/colly"
)

// initialize a map to store visited URLs
var visitedurls = make(map[string]bool)

// define a struct to store product details
type Product struct {
    Name     string
    Price    string
    ImageURL string
}

// declare a slice to store the products
var products []Product

func main() {
    // define the seed URL
    seedurl := "https://www.scrapingcourse.com/ecommerce/"
    // call the crawl function
    crawl(seedurl, 0)
}

func crawl(currenturl string, maxdepth int) {
    // instantiate a new collector
    c := colly.NewCollector(
        colly.AllowedDomains("www.scrapingcourse.com"),
        colly.MaxDepth(maxdepth),
    )

    // extract and log the page title
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Page Title:", e.Text)
    })

    // ---- extract product details ---- //
    // select product (list item with class product)
    c.OnHTML("li.product", func(e *colly.HTMLElement) {
        // retrieve product name, price, and image
        productName := e.ChildText(".product-name")
        productPrice := e.ChildText(".product-price")
        imageURL := e.ChildAttr(".product-image", "src")

        // create a new product instance and add it to the slice
        product := Product{
            Name:     productName,
            Price:    productPrice,
            ImageURL: imageURL,
        }
        products = append(products, product)
    })

    // ----- find and visit pagination links ---- //
    // select anchor tags with the page-numbers class
    c.OnHTML("a.page-numbers", func(e *colly.HTMLElement) {
        // get absolute URL
        link := e.Request.AbsoluteURL(e.Attr("href"))
        // check if current URL has already been visited
        if link != "" && !visitedurls[link] {
            // add current URL to visitedurls
            visitedurls[link] = true
            fmt.Println("Found link:", link)
            // visit current URL
            e.Request.Visit(link)
        }
    })

    // add the OnScraped callback to define actions after extracting data
    c.OnScraped(func(r *colly.Response) {
        fmt.Println("Data extraction complete", r.Request.URL)
        // export the collected products to a CSV file after scraping
        exportToCSV("products.csv")
    })

    // add an OnRequest callback to track progress
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Crawling", r.URL)
    })

    // handle request errors
    c.OnError(func(e *colly.Response, err error) {
        fmt.Println("Request URL:", e.Request.URL, "failed with response:", e, "\nError:", err)
    })

    // visit the seed URL
    err := c.Visit(currenturl)
    if err != nil {
        fmt.Println("Error visiting page:", err)
    }
}

// function to export scraped data to CSV
func exportToCSV(filename string) {
    // create the CSV file
    file, err := os.Create(filename)
    if err != nil {
        fmt.Println("Error creating CSV file:", err)
        return
    }
    defer file.Close()

    // initialize a CSV writer
    writer := csv.NewWriter(file)
    defer writer.Flush()

    // write the header row
    writer.Write([]string{"Name", "Price", "Image URL"})

    // write the product details
    for _, product := range products {
        writer.Write([]string{product.Name, product.Price, product.ImageURL})
    }

    fmt.Println("Product details exported to", filename)
}
This creates a new CSV file in your project's root directory and exports the product details to it. Here's a sample screenshot of the CSV file for context.
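If you'd like a quick text preview instead, the rows in products.csv follow this shape (using the Atlas Fitness Tank product logged earlier as an example; the exact rows depend on the pages crawled):
Name,Price,Image URL
Atlas Fitness Tank,$18.00,https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mt11-blue_main.jpg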
Awesome! You now know how to crawl links and extract data from your crawler.
Optimize Your Web Crawler
Here are some key areas to consider when optimizing your Golang web crawler.
Avoid Duplicate Links
Crawling duplicate links can lead to an infinite loop, which wastes resources and, even worse, triggers anti-bot restrictions. Therefore, you should ensure that your crawler visits each link only once.
You can achieve this in different ways. The current crawler stores visited URLs in Go's map data structure, whose unique keys automatically handle duplicates. It also checks whether each discovered link has already been visited before proceeding to crawl it.
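Note that the plain map works here because the basic crawler runs synchronously. If you later enable Colly's async mode (covered below), multiple goroutines may update the map at once, so it's worth guarding it with a mutex. Here's a minimal sketch of that idea; the markVisited helper isn't part of the tutorial code, just an illustration:
import "sync"

// guard the visited-URLs map when callbacks may run concurrently
var (
    visitedurls = make(map[string]bool)
    visitedMu   sync.Mutex
)

// markVisited reports whether the link is new and records it atomically
func markVisited(link string) bool {
    visitedMu.Lock()
    defer visitedMu.Unlock()
    if visitedurls[link] {
        return false
    }
    visitedurls[link] = true
    return true
}
Inside the OnHTML callback, you'd then replace the !visitedurls[link] check with a call to markVisited(link).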
Prioritize Specific Pages
Prioritizing specific pages can help streamline your crawling process, allowing you to focus on crawling usable links. In the current crawler, we used CSS selectors to target only pagination links and extract valuable product information.
However, if you're interested in all links on the page and want to prioritize pagination, you can maintain separate queues and process pagination links first.
Here's how to modify the initial crawler to prioritize pagination links (product pages).
This code maintains separate lists for pagination links and other links. It then crawls pagination links first.
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

// ...

// create variables to separate pagination links from other links
var paginationURLs = []string{}
var otherURLs = []string{}

func main() {
    // ...
}

func crawl(currenturl string, maxdepth int) {
    // ...

    // ----- find and visit all links ---- //
    // select the href attribute of all anchor tags
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        // get absolute URL
        link := e.Request.AbsoluteURL(e.Attr("href"))
        // check if the current URL has already been visited
        if link != "" && !visitedurls[link] {
            // add current URL to visitedurls
            visitedurls[link] = true
            // queue pagination links separately from other links
            if e.Attr("class") == "page-numbers" {
                paginationURLs = append(paginationURLs, link)
            } else {
                otherURLs = append(otherURLs, link)
            }
        }
    })

    // ... (other callbacks and the initial c.Visit(currenturl) call go here, as before)

    // process pagination links first
    for len(paginationURLs) > 0 {
        nextURL := paginationURLs[0]
        paginationURLs = paginationURLs[1:]
        visitedurls[nextURL] = true
        err := c.Visit(nextURL)
        if err != nil {
            fmt.Println("Error visiting page:", err)
        }
    }

    // process other links
    for len(otherURLs) > 0 {
        nextURL := otherURLs[0]
        otherURLs = otherURLs[1:]
        visitedurls[nextURL] = true
        err := c.Visit(nextURL)
        if err != nil {
            fmt.Println("Error visiting page:", err)
        }
    }
}
Maintain a Single Crawl Session
Maintaining a single crawl session ensures that your web crawler carries state, such as cookies, across multiple requests instead of starting from scratch each time. This eliminates the need for frequent reconnections and significantly improves overall efficiency.
Moreover, this optimization technique is particularly useful against websites that use rate-limiting technologies to control traffic.
Colly's built-in extensions and its SetCookieJar() method allow you to manage and maintain sessions across multiple requests.
Here's how to modify the current crawler to maintain a single session.
package main

import (
    // ...
    "net/http/cookiejar"
    // ...
)

func crawl(currenturl string, maxdepth int) {
    // ...
    // add an OnRequest callback
    c.OnRequest(func(r *colly.Request) {
        // set custom headers
        r.Headers.Set("User-Agent", "Mozilla/5.0 (compatible; Colly/2.1; +https://github.com/gocolly/colly)")
        fmt.Println("Crawling", r.URL)
    })

    // manage cookies
    cookiesJar, _ := cookiejar.New(nil)
    c.SetCookieJar(cookiesJar)
    // ...
}
However, you're not quite done yet.
With modern websites implementing sophisticated anti-bot measures that can block your web crawler, you must overcome these obstacles to take advantage of your crawler's efficiency and performance.
Let's see how to handle these anti-bot measures in the next section.
Avoid Getting Blocked While Crawling With Go
Websites employ various techniques to distinguish bots from human traffic. Once your request fails these checks, you get served obstacles such as CAPTCHAs or 403 Forbidden errors.
Web crawlers are quickly detected by anti-bot solutions as they typically send multiple requests in a manner that's easily distinguishable from human behavior. A typical web crawler is designed to find and follow links rapidly, a recipe for triggering anti-bot restrictions.
That said, you can employ best practices, such as proxy rotation, user agent spoofing, and reducing request frequency, to tweak your crawler to make human-like requests.
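For illustration, here's a minimal sketch of what those tweaks can look like with Colly: the extensions package randomizes the User-Agent header, a LimitRule adds a random delay between requests, and a round-robin proxy switcher spreads requests across proxies. The newPoliteCollector name and the proxy URLs are placeholders, not part of the earlier tutorial code:
import (
    "time"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/extensions"
    "github.com/gocolly/colly/proxy"
)

// newPoliteCollector builds a Collector tuned to look less bot-like:
// randomized User-Agent, randomized delays, and rotating proxies
func newPoliteCollector() *colly.Collector {
    c := colly.NewCollector(
        colly.AllowedDomains("www.scrapingcourse.com"),
    )

    // spoof a different User-Agent (and set a Referer) on each request
    extensions.RandomUserAgent(c)
    extensions.Referer(c)

    // reduce request frequency with a random delay of up to 5 seconds
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        RandomDelay: 5 * time.Second,
    })

    // rotate requests across proxies (placeholder URLs)
    if rp, err := proxy.RoundRobinProxySwitcher(
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    ); err == nil {
        c.SetProxyFunc(rp)
    }

    return c
}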
Just bear in mind that manual configurations are not reliable, especially against advanced and evolving anti-bot systems.
To avoid getting blocked while crawling with Go, you must completely emulate natural browsing behavior, which is challenging to achieve with basic evasion measures. Enter ZenRows.
The ZenRows Scraper API abstracts all the complexities of imitating a natural user, providing the easiest and most reliable solution for scalable web crawling.
With features such as advanced anti-bot bypass out of the box, geo-located requests, cookie support for session persistence, fingerprinting evasion, actual user spoofing, request header management, and more, ZenRows handles any anti-bot solution for you. This allows you to focus on extracting the necessary information rather than the intricacies of maintaining manual configurations.
Let's see ZenRows in action against an anti-bot solution.
To follow along in this example, sign up to get your free API key.
Completing your sign-up will take you to the Request Builder page.
Input your target URL and activate Premium Proxies and JS Rendering boost mode. For this example, we'll use the ScrapingCourse Antibot Challenge page as the target URL.
Next, select the Go language and choose the API option. ZenRows works with any language and provides ready-to-use snippets for the most popular ones.
Copy the generated code on the right to your editor for testing.
Your code should look like this:
package main

import (
    "io"
    "log"
    "net/http"
)

func main() {
    client := &http.Client{}
    req, err := http.NewRequest("GET", "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.scrapingcourse.com%2Fantibot-challenge&js_render=true&premium_proxy=true", nil)
    resp, err := client.Do(req)
    if err != nil {
        log.Fatalln(err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatalln(err)
    }

    log.Println(string(body))
}
This code bypasses the anti-bot challenge and prints the web page's HTML.
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>You bypassed the Antibot challenge! :D</h2>
    <!-- other content omitted for brevity -->
</body>
</html>
Congratulations! You're now well-equipped to crawl any website without getting blocked.
Web Crawling Tools for Go
The importance of choosing the right web crawling tools cannot be overstated. Here are some tools to consider when creating a Golang web crawler.
- ZenRows: The all-in-one web crawling tool that empowers you to crawl any website at any scale without getting blocked.
- Selenium: Another valuable Go web crawling tool is Selenium. While it's popular for its browser automation capabilities, Selenium's ability to render JavaScript like an actual browser extends its application to web crawling.
- Colly: A popular Go package for fetching and parsing HTML documents. It offers numerous features that streamline web crawling, including callbacks and options like MaxDepth that eliminate the need to write recursive functions manually.
Golang Crawling Best Practices and Considerations
The following best practices can boost your overall efficiency and performance.
Parallel Crawling and Concurrency
Synchronously crawling multiple pages can be inefficient, as only one goroutine can actively process tasks at any given time. Your web crawler spends most of its time waiting for responses and processing the data before moving on to the next task.
However, parallel crawling and Go's concurrency features can significantly reduce your overall crawl time.
At the same time, you must manage concurrency properly to avoid overwhelming the target server and triggering anti-bot restrictions.
Using Colly, you can easily implement parallel crawling by passing the colly.Async(true) option when initializing your Collector. Colly also provides built-in limit rules that let you cap the number of concurrent requests and introduce delays between them.
Here's an example using the current crawler.
// ...
func crawl(currenturl string, maxdepth int) {
    // instantiate a new collector with async mode enabled
    c := colly.NewCollector(
        colly.AllowedDomains("www.scrapingcourse.com"),
        colly.MaxDepth(maxdepth),
        colly.Async(true),
    )

    // set a concurrency limit and introduce delays between requests
    // (add "time" to your imports for time.Second)
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 5,
        Delay:       2 * time.Second,
    })

    // ...

    // wait for all goroutines to finish
    c.Wait()
}
Crawling JavaScript Rendered Pages in Go
While Colly is a great web crawling tool with numerous built-in features, it cannot crawl JavaScript-rendered pages (dynamic content). It can only fetch and parse static HTML, and dynamic content isn't present in a website's static HTML.
However, you can integrate with JavaScript engines or headless browsers like Chromedp to crawl dynamic content.
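Here's a minimal chromedp sketch that loads a page in headless Chrome and hands the rendered HTML back to you. The target URL is just an example; in a real crawler, you'd feed the rendered HTML into your link-extraction logic:
package main

import (
    "context"
    "log"
    "time"

    "github.com/chromedp/chromedp"
)

func main() {
    // create a headless Chrome context with a timeout
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()
    ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
    defer cancel()

    // navigate to the page and capture the fully rendered HTML
    var html string
    err := chromedp.Run(ctx,
        chromedp.Navigate("https://www.scrapingcourse.com/javascript-rendering"),
        chromedp.OuterHTML("html", &html),
    )
    if err != nil {
        log.Fatalln(err)
    }

    log.Println(html)
}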
Distributed Web Crawling in Go
The distributed web crawling technique involves dividing the tasks across multiple machines or instances to streamline the process and improve efficiency. This is particularly valuable in large-scale web crawling tasks where scalability is critical.
To learn how to build a distributed crawler architecture, check out our distributed web crawling guide.
Conclusion
You learned how to build a Go web crawler, starting from the basics to more advanced topics. Remember that while building a web crawler to navigate web pages is a great starting point, you must overcome anti-bot measures to gain access to modern websites.
Rather than toiling with manual configurations that would most likely fail, consider ZenRows, the most reliable solution for bypassing any anti-bot system. Try ZenRows for free to get started today!