When web scraping with Go and Colly, you must emulate natural user behavior to avoid getting blocked by anti-bot systems. One way to do this is by changing your Colly User Agent, the header that controls how the target server perceives your request.
In this article, we'll walk you through customizing your User Agent to match that of a regular browser.
But before we dive into the steps, here's some background information.
What Is a Colly User Agent?
A Colly User Agent (UA) is one of the most critical HTTP headers sent with every Colly request.
When you make an HTTP request to a target website, it includes HTTP headers: metadata that provides additional information about the request. The server then uses this data to tailor its response. On some protected websites, however, that response is a 403 error or a similar status code indicating a failed request. Why is that?
Although all headers play a role, the User Agent influences this interaction the most. It's like a fingerprint with the requesting agent's details, such as browser, operating system, and device. Websites often rely on the User Agent to determine whether to fulfill your requests or not.
Here's what a regular browser's UA looks like:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36
In contrast, a non-browser client sends a generic or empty User Agent string, which makes it easy for websites to flag you as a bot.
If you make the following basic request to HTTPBin's User Agent endpoint, you can see Colly's default User Agent string.
package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly"
)

func main() {
    // create a new collector
    c := colly.NewCollector()

    // register the OnResponse callback to print the response body
    c.OnResponse(func(r *colly.Response) {
        fmt.Println(string(r.Body))
    })

    // handle request errors
    c.OnError(func(e *colly.Response, err error) {
        log.Println("Request URL:", e.Request.URL, "failed with response:", e, "\nError:", err)
    })

    // start scraping
    err := c.Visit("https://httpbin.io/user-agent")
    if err != nil {
        log.Fatal(err)
    }
}
Here's the result:
{
"user-agent": "colly - https://github.com/gocolly/colly"
}
This obvious difference makes it easy for web servers to identify your scraper as a bot and block your request.
But don't worry. The following section shows you how to customize your UA to emulate that of a regular browser.
How to Set Up a Custom User Agent in Colly
To configure a custom User Agent, you must first obtain the UA of the browser you want to emulate.
Here's the UA of a Chrome browser for this tutorial.
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36
You can grab it by loading any web page in a browser, opening DevTools, and checking the request headers of the main request in the Network tab.
Now, let's customize Colly to use this User Agent.
Colly provides the UserAgent() function, which takes a custom User Agent string as a parameter and applies it to every request the Collector makes. So, when creating your Collector, use UserAgent() to set the User Agent string to the one above.
func main() {
    // create a new collector
    c := colly.NewCollector(
        // set custom user agent
        colly.UserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"),
    )
}
That's it.
To verify it works, update the previous code with the snippet above.
package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly"
)

func main() {
    // create a new collector
    c := colly.NewCollector(
        // set custom user agent
        colly.UserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"),
    )

    // register the OnResponse callback to print the response body
    c.OnResponse(func(r *colly.Response) {
        fmt.Println(string(r.Body))
    })

    // handle request errors
    c.OnError(func(e *colly.Response, err error) {
        log.Println("Request URL:", e.Request.URL, "failed with response:", e, "\nError:", err)
    })

    // start scraping
    err := c.Visit("https://httpbin.io/user-agent")
    if err != nil {
        log.Fatal(err)
    }
}
If your configuration worked, your result will be the custom UA string.
{
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"
}
Congratulations! You've set up your first custom Colly User Agent.
However, while this is a great start, there's more to it. Websites can still track repeated activity from the same UA over time and eventually block your scraper.
Read on to find out how you can avoid this.
Rotate User Agents in Go Colly
Most modern websites use advanced detection techniques that track users' activities over time. Making multiple requests from the same UA string can raise suspicion and get you blocked.
However, rotating User Agents per request can further disguise your scraping activities, making it appear as if your requests originate from unique users. To rotate User Agents in Go Colly, follow the steps below.
Define a list of User Agent strings you'd like to rotate through. We've grabbed a few from this list of web scraping User Agents.
// define your user agent list
var userAgents = []string{
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.2420.81",
}
After that, create a function to randomly select a UA from the list.
import (
    // ...
    "math/rand"
    // ...
)

// ...

// create a function to select a UA at random
func randomUA() string {
    return userAgents[rand.Intn(len(userAgents))]
}
You need to import the math/rand package to use the rand.Intn function. Note that Go 1.20+ seeds the global random number generator automatically; on older Go versions, seed it yourself (for example, with rand.Seed(time.Now().UnixNano()) and the time import), or randomUA() will pick the same sequence of UAs on every run.
Lastly, in your collector, call the function above to set the User Agent to the randomly selected one.
//...

func main() {
    // create a new collector
    c := colly.NewCollector(
        // set user agent to random UA
        colly.UserAgent(randomUA()),
    )
}
To verify everything works, add these code snippets to the initial basic script to get the following complete code.
package main

import (
    "fmt"
    "log"
    "math/rand"

    "github.com/gocolly/colly"
)

// define your user agent list
var userAgents = []string{
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.2420.81",
}

// create a function to select a UA at random
func randomUA() string {
    return userAgents[rand.Intn(len(userAgents))]
}

func main() {
    // create a new collector
    c := colly.NewCollector(
        // set user agent to random UA
        colly.UserAgent(randomUA()),
    )

    // register the OnResponse callback to print the response body
    c.OnResponse(func(r *colly.Response) {
        fmt.Println(string(r.Body))
    })

    // handle request errors
    c.OnError(func(e *colly.Response, err error) {
        log.Println("Request URL:", e.Request.URL, "failed with response:", e, "\nError:", err)
    })

    // start scraping
    err := c.Visit("https://httpbin.io/user-agent")
    if err != nil {
        log.Fatal(err)
    }
}
Run it a few times, and you'll see a different UA on each run, since the collector selects one UA from the list when it's created. Here are the results of three runs.
// run 1
{
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}
// run 2
{
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.2420.81"
}
// run 3
{
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0"
}
Awesome!
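One caveat: with this setup, the collector picks a single UA when it's created, so every request within the same run shares it. To get a fresh User Agent on every request, you can set the header inside an OnRequest callback instead. Below is a minimal sketch of that approach; it reuses the userAgents list and randomUA() helper from above and enables AllowURLRevisit only so the demo can hit the same endpoint several times in one run.
package main

import (
    "fmt"
    "log"
    "math/rand"

    "github.com/gocolly/colly"
)

// same user agent list and helper as before
var userAgents = []string{
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.2420.81",
}

func randomUA() string {
    return userAgents[rand.Intn(len(userAgents))]
}

func main() {
    // allow revisiting the same URL so we can make several requests in one run
    c := colly.NewCollector(
        colly.AllowURLRevisit(),
    )

    // pick a new random UA right before each request goes out
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", randomUA())
    })

    // print each response body
    c.OnResponse(func(r *colly.Response) {
        fmt.Println(string(r.Body))
    })

    // make three requests; each one gets its own randomly selected UA
    for i := 0; i < 3; i++ {
        if err := c.Visit("https://httpbin.io/user-agent"); err != nil {
            log.Fatal(err)
        }
    }
}
Colly's extensions package (github.com/gocolly/colly/extensions) also provides a RandomUserAgent() helper along these lines, with its own built-in pool of UA strings, which can save you from maintaining a list yourself.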
Now that you've successfully rotated UAs, it's worth noting the importance of constructing a proper one.
Websites know what a real browser's User Agent string looks like, and even a small discrepancy can get your request flagged and blocked. Best practices to avoid this include using UAs from recent browser versions and ensuring that your User Agent string stays consistent with the rest of your request headers.
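For instance, a Windows Chrome UA should be accompanied by headers that tell the same story. The sketch below plugs into the collector c from the earlier examples; the specific values (Accept, Accept-Language, and the client-hint headers) are illustrative picks for a Chrome-on-Windows profile rather than a definitive set.
// keep companion headers consistent with the Windows Chrome UA set on the collector
c.OnRequest(func(r *colly.Request) {
    // example values roughly matching a Chrome-on-Windows profile
    r.Headers.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
    r.Headers.Set("Accept-Language", "en-US,en;q=0.9")
    r.Headers.Set("Sec-CH-UA-Platform", `"Windows"`)
    r.Headers.Set("Sec-CH-UA-Mobile", "?0")
})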
While rotating UAs this way is pretty straightforward, real-world use cases often require large UA lists, and maintaining such an array of properly crafted UAs takes considerable time and effort.
However, there are also easier solutions to rotate UAs and avoid blocks and bans. Read on.
Change User Agent at Scale and Avoid Getting Blocked
Rotating User Agents is usually not enough to bypass advanced protection systems, no matter how extensive your list is.
Anti-bot systems continuously evolve their anti-scraping measures, layering techniques that make life extremely challenging for web scrapers. Even if you combine UA rotation with recommended best practices such as proxies, your scraper might still get blocked.
The only solution that works in 100% of cases is a web scraping API, which combines numerous techniques to counter all anti-bot systems. An example of such an API is ZenRows, an all-in-one web scraping toolkit.
This solution automatically rotates a wide array of properly crafted web scraping User Agents, manages proxies, bypasses CAPTCHAs, and offers an advanced anti-bot bypass functionality that allows you to scrape any website.
What's more, ZenRows provides an intuitive API and can also serve as a Colly alternative for web scraping at scale.
Below is a step-by-step guide on how to avoid detection using ZenRows. For this example, we'll be scraping a G2 Reviews page.
Sign up, and you'll be directed to the Request Builder page.
Input the target URL and activate Premium Proxies and the JS Rendering mode.
Select the Go language option on the right and choose the API mode. ZenRows will generate your request code.
Copy the code and use your preferred HTTP client to make a request to the ZenRows API. The code below uses Go's net/http package.
package main

import (
    "io"
    "log"
    "net/http"
)

func main() {
    client := &http.Client{}

    // build the GET request to the ZenRows API (replace <YOUR_ZENROWS_API_KEY> with your key)
    req, err := http.NewRequest("GET", "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fasana%2Freviews&js_render=true&premium_proxy=true", nil)
    if err != nil {
        log.Fatalln(err)
    }

    // send the request
    resp, err := client.Do(req)
    if err != nil {
        log.Fatalln(err)
    }
    defer resp.Body.Close()

    // read and print the response body
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatalln(err)
    }
    log.Println(string(body))
}
Run it, and you'll get the page's HTML content.
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
<title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
<!-- ... -->
</head>
<body>
<!-- other content omitted for brevity -->
</body>
Well done!
Conclusion
Customizing a Colly User Agent to mimic an actual browser can reduce your chances of getting blocked. However, it's not always enough when trying to bypass advanced detection systems.
Use ZenRows to scrape any website, regardless of its anti-bot complexity. This web scraping API handles everything under the hood, allowing you to focus on extracting the necessary data. Try ZenRows for free now!