How to Parse HTML in Golang [2024 Tutorial]

May 31, 2024 · 5 min read

Do you need a Golang HTML parser to transform raw data from a web scraper into a structured, readable format, like CSV or a database? In this tutorial, you'll learn how to navigate through HTML documents and extract the information you need.

We'll use the recommended net/html library but also see some other options.

Prerequisites

We'll use the net/html package (import path golang.org/x/net/html). It's maintained by the Go team in the golang.org/x repositories rather than in the standard library, and it's one of the most popular Golang HTML parsers thanks to its efficiency and speed.

But before leveraging these capabilities, we must fetch the raw HTML we'll be parsing in this tutorial. For that, we'll build a basic scraper that makes an HTTP request to ScrapingCourse.com, a demo website with e-commerce features, and retrieves its HTML content as the response. The scraper uses net/http, Go's built-in library for making HTTP requests.

Scrapingcourse Ecommerce Store

Here's the scraper code. Run it using go run main.go, and you'll have your raw HTML of the page.

main.go
package main
 
import (
    "fmt"
    "io"
    "net/http"
)
 
func main() {
    // URL to make the HTTP request to
    url := "https://www.scrapingcourse.com/ecommerce/"
 
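    // Make the GET request and stop if it fails
    resp, err := http.Get(url)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    defer resp.Body.Close()
 
    // Read the response body
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
 
    // Print the body as a string
    fmt.Println("HTML:\n\n", string(body))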
}

Now, let's start parsing! Install the net/html package using the following command:

Terminal
go get -u golang.org/x/net

Before moving on, note that the net/html package offers two main APIs: the tokenizer API and the node parsing API. We'll explore both in this tutorial, but the node parsing API is often preferred for its high-level abstraction and ease of use.

Option 1: Parse HTML Using the Node Parsing API

The node parsing API is a higher-level abstraction built on top of the tokenizer. It represents the HTML document as a tree of nodes, where each node corresponds to an element, a piece of text, a comment, or the doctype. Attributes aren't separate nodes; they're stored on their element node.

Let's parse the product data from the scraped page using this approach. Start by parsing the response body from the request using the html.Parse() function.

main.go
//..    
    // Use the html package to parse the response body from the request
    doc, err := html.Parse(resp.Body)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

html.Parse() takes an io.Reader as its argument, which in this case is the response body obtained from the HTTP request.

Next, inspect the target web page https://www.scrapingcourse.com/ecommerce/ to identify the elements containing the data you want to extract.

Scrapingcourse Ecommerce Homepage Inspect First Page

All products are list items (<li>) inside an unordered list. Their names, prices, and images are nested in their own tags within each list item.

So, to extract those details, define a function that walks the nodes of the HTML document to find all list items. Within the function, process the name, price, and image of each product by calling a helper function that we'll define later.

Next, recurse into the child nodes to complete the function. Then, call the function on the document root in main() to start the traversal.

main.go
func main() {
    //.. 
 
    // find all <li> elements
    var processAllProduct func(*html.Node)
    processAllProduct = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "li" {
            // process the Product details within each <li> element
            processNode(n)
 
        }
        // traverse the child nodes
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            processAllProduct(c)
        }
    }
    // start the traversal at the document root
    processAllProduct(doc)
}

Now, define the node-processing function. It takes a pointer to the HTML node of the list item as its argument.

In the node-processing function, traverse the HTML structure within the list item. Match the tag name of the current node against the desired elements (h2, span, and img) using a switch statement, and extract the text content or, for img, the src attribute.

After that, traverse the child nodes to complete the function.

main.go
// process the details of the Product within the <li> element
func processNode(n *html.Node) {
    switch n.Data {
    case "h2":
        // check if FirstChild node of the h2 element is a text
        if n.FirstChild != nil && n.FirstChild.Type == html.TextNode {
            // if yes, retrieve FirstChild's data (name)
            name := n.FirstChild.Data
            // print name
            fmt.Println("Name:", name)
        }
 
    case "span":
        // check for the span with class "amount"
        for _, a := range n.Attr {
            if a.Key == "class" && strings.Contains(a.Val, "amount") {
                // retrieve the text content of the "amount" span
                for c := n.FirstChild; c != nil; c = c.NextSibling {
                    if c.Type == html.TextNode {
                        // print Product price
                        fmt.Println("Price:", c.Data)
                    }
                }
            }
        }
 
    case "img":
        // check for the src attribute in the img tag
        for _, a := range n.Attr {
            if a.Key == "src" {
                // retrieve src value
                imageURL := a.Val
                // print image URL
                fmt.Println("Image URL:", imageURL)
            }
        }
    }
 
    // Traverse child nodes
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        processNode(c)
    }
}

For h2, this code checks if it has a non-nil FirstChild (text node) and extracts the product name from the text node.

For span, the code checks whether the class attribute contains amount and, if so, prints the text content of that span. Notice that the node parsing API makes it easy to select an element by its class. This isn't the case with the tokenizer API (more on that later).

Similarly, if the current node is an <img> tag, it extracts and prints the product image URL from the src attribute.
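One caveat about the class check: strings.Contains() matches substrings, so it would also accept a class such as discount-amounts. If that matters for your page, a stricter comparison is possible. Below is a minimal sketch using only the standard library; the hasClass helper and the class values are hypothetical examples, not taken from the target page.

```go
package main

import (
	"fmt"
	"strings"
)

// hasClass (hypothetical helper) reports whether the space-separated
// class attribute value contains the exact class name want.
func hasClass(classAttr, want string) bool {
	for _, c := range strings.Fields(classAttr) {
		if c == want {
			return true
		}
	}
	return false
}

func main() {
	// the class values below are made up for illustration
	fmt.Println(hasClass("woocommerce-Price-amount amount", "amount")) // true
	fmt.Println(hasClass("discount-amounts", "amount"))                // false
	fmt.Println(strings.Contains("discount-amounts", "amount"))        // true
}
```

In processNode, you could call hasClass(a.Val, "amount") in place of strings.Contains(a.Val, "amount").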

Finally, put everything together to create your complete code.

main.go
package main
 
import (
    "fmt"
    "net/http"
    "strings"
 
    "golang.org/x/net/html"
)
 
func main() {
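    // URL to make the HTTP request to
    url := "https://www.scrapingcourse.com/ecommerce/"
 
    // Make the GET request and stop if it fails
    resp, err := http.Get(url)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    defer resp.Body.Close()
 
    // Use the html package to parse the response body from the request
    doc, err := html.Parse(resp.Body)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }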
 
    // find all <li> elements
    var processAllProduct func(*html.Node)
    processAllProduct = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "li" {
            // process the Product details within each <li> element
            processNode(n)
 
        }
        // traverse the child nodes
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            processAllProduct(c)
        }
    }
    // start the traversal at the document root
    processAllProduct(doc)
}
 
// process the details of the product within the <li> element
func processNode(n *html.Node) {
    switch n.Data {
    case "h2":
        // check if FirstChild node of the h2 element is a text
        if n.FirstChild != nil && n.FirstChild.Type == html.TextNode {
            // if yes, retrieve FirstChild's data (name)
            name := n.FirstChild.Data
            // print name
            fmt.Println("Name:", name)
        }
 
    case "span":
        // check for the span with class "amount"
        for _, a := range n.Attr {
            if a.Key == "class" && strings.Contains(a.Val, "amount") {
                // retrieve the text content of the "amount" span
                for c := n.FirstChild; c != nil; c = c.NextSibling {
                    if c.Type == html.TextNode {
                        // print product price
                        fmt.Println("Price:", c.Data)
                    }
                }
            }
        }
 
    case "img":
        // check for the src attribute in the img tag
        for _, a := range n.Attr {
            if a.Key == "src" {
                // retrieve src value
                imageURL := a.Val
                // print image URL
                fmt.Println("Image URL:", imageURL)
            }
        }
    }
 
    // Traverse child nodes
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        processNode(c)
    }
}

Run it using go run main.go, and you'll get output like the following for each product:

Output
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg
Name: Abominable Hoodie
Price: 69.00

Congrats! You've parsed your first HTML in Golang.

Option 2: Parse HTML Using the Tokenizer API

The tokenizer API is the second way to parse with net/html. It provides a low-level view of the HTML structure and is particularly useful when you need fine-grained control over the parsing process.

It breaks down an HTML document into a stream of tokens, each representing a tag, a piece of text, a comment, or a doctype. Here are the token types and what they represent:

Token Name            What it Represents
ErrorToken            An error occurred during tokenization, or the end of the input was reached.
TextToken             A text node.
StartTagToken         An opening tag, like <a>.
EndTagToken           A closing tag, like </a>.
SelfClosingTagToken   A tag written in self-closing form, like <img/>.
CommentToken          A comment, like <!--x-->.
DoctypeToken          A document type declaration, like <!DOCTYPE x>.

To get started, create a new tokenizer with html.NewTokenizer(), which takes the response body as its argument.

The tokenizer reads the content incrementally and provides a stream of tokens as it progresses through the HTML.

main.go
z := html.NewTokenizer(resp.Body)

Then, advance to the next token using the Next() method inside a loop, and switch on the returned token type to pick out the types you need.

main.go
//..
    for {
            tokenType := z.Next()
 
            switch tokenType {
            case html.StartTagToken, html.SelfClosingTagToken:
                token := z.Token()
 
                //..
            }

Now, recall that all products are list items in an unordered list, and that the names, prices, and images sit in their own tags within each list item. Therefore, we're specifically interested in start and self-closing tags. For each one encountered, retrieve its information using the Token() method.

Next, check for all the list elements and process their details to extract the names, prices, and images of each product.

main.go
        //..            
             // Check for all <li> element
            if token.Data == "li" {
                // Process the details of the product within this <li> element
                processProductDetails(z)
 
                // Exit the loop after processing the details
                //return
            }
        }

Notice that we called processProductDetails(), which isn't defined yet. It'll process the names, prices, and images. After that, check for ErrorToken, which signals an error or the end of the input, and break out of the loop to finish the main function.

main.go
    //..
        
         if tokenType == html.ErrorToken {
            break
        }

Now, let's define processProductDetails().

Loop through tokens and retrieve the relevant token types within the list element.

main.go
func processProductDetails(z *html.Tokenizer) {
    // Retrieve tokens for relevant data within the <li> element
    for {
        tokenType := z.Next()
 
        switch tokenType {
        case html.StartTagToken, html.SelfClosingTagToken:
            token := z.Token()
 
            //..
        }
    }
}

Lastly, parse the tokens for the relevant data, one after the other. Then match each token to the desired element and process it using the switch/case conditions.

For the name, fetch the next token in the <h2> tag and check if it's plain HTML text content.

If it's plain HTML text content, retrieve the text content of the current token, assign it to the variable name, and print the name in the console to verify it works.

main.go
        //..            
            // Evaluate Tokens for H2
            switch token.Data {
            case "h2":
                // Fetch next token within H2 and extract its text content.
                tokenType = z.Next()
                if tokenType == html.TextToken {
                    name := z.Token().Data
                    fmt.Println("Name:", name)
                }
            }

Similarly, for the image, evaluate the tokens within the list element to find the image tag and extract the image URL.

main.go
        //..
            case "img":
                // check for src attribute and retrieve its value
                for _, attr := range token.Attr {
                    if attr.Key == "src" {
                        imageURL := attr.Val
                        fmt.Println("Image URL:", imageURL)
                    }
                }

Parsing the price requires a slightly different approach because of the nested span elements.

Here, check for the span with class price. Then, fetch the next token and check if it's a span with class amount. If so, loop through and print the text content.

main.go
            case "span":
                // Check for the span with class "price"
                hasPriceClass := false
                for _, attr := range token.Attr {
                    if attr.Key == "class" && strings.Contains(attr.Val, "price") {
                        hasPriceClass = true
                        break
                    }
                }
 
                // ...
 
                if hasPriceClass {
                    tokenType = z.Next()
 
                    // Check if the next token is a span with class "amount"
                    if tokenType == html.StartTagToken || tokenType == html.SelfClosingTagToken {
                        nextToken := z.Token()
                        if nextToken.Data == "span" {
                            amountClass := false
                            for _, attr := range nextToken.Attr {
                                if attr.Key == "class" && strings.Contains(attr.Val, "amount") {
                                    amountClass = true
                                    break
                                }
                            }
 
                            // If the next span has class "amount," loop through and print its text content
                            if amountClass {
                                var amount strings.Builder
                                depth := 1 // nesting level of <span> tags, starting inside the "amount" span
 
                                for depth > 0 {
                                    tokenType = z.Next()
                                    if tokenType == html.ErrorToken {
                                        // stop at the end of the input or on a tokenizer error
                                        break
                                    }
 
                                    t := z.Token()
                                    switch {
                                    case tokenType == html.TextToken:
                                        amount.WriteString(t.Data)
                                    case tokenType == html.StartTagToken && t.Data == "span":
                                        depth++
                                    case tokenType == html.EndTagToken && t.Data == "span":
                                        depth--
                                    }
                                }
 
                                fmt.Println("Price:", strings.TrimSpace(amount.String()))
                            }
                        }
                    }
                }

Putting everything together, you should have the following complete code:

main.go
package main
 
import (
    "fmt"
    "net/http"
    "strings"
 
    "golang.org/x/net/html"
)
 
func main() {   
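    // URL to make the HTTP request to
    url := "https://www.scrapingcourse.com/ecommerce/"
 
    // Make the GET request and stop if it fails
    resp, err := http.Get(url)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    defer resp.Body.Close()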
 
    // Create an HTML tokenizer
    z := html.NewTokenizer(resp.Body)
 
    // Loop through HTML tokens
    for {
        tokenType := z.Next()
 
        switch tokenType {
        case html.StartTagToken, html.SelfClosingTagToken:
            token := z.Token()
 
            // Check for all <li> element
            if token.Data == "li" {
                // Process the details of the product within this <li> element
                processProductDetails(z)
 
                // Exit the loop after processing the details
                //return
            }
        }
 
        if tokenType == html.ErrorToken {
            break
        }
    }
}
 
func processProductDetails(z *html.Tokenizer) {
    // parse Tokens for Relevant Data within the <li> element
    for {
        tokenType := z.Next()
 
        switch tokenType {
        case html.StartTagToken, html.SelfClosingTagToken:
            token := z.Token()
 
            // parse Tokens for Relevant Data
            switch token.Data {
            case "h2":
                // Extracting product name
                tokenType = z.Next()
                if tokenType == html.TextToken {
                    name := z.Token().Data
                    fmt.Println("Name:", name)
                }
 
            case "span":
                // check for the span with class "price"
                hasPriceClass := false
                for _, attr := range token.Attr {
                    if attr.Key == "class" && strings.Contains(attr.Val, "price") {
                        hasPriceClass = true
                        break
                    }
                }
 
                // ...
 
                if hasPriceClass {
                    tokenType = z.Next()
 
                    // check if the next token is a span with class "amount"
                    if tokenType == html.StartTagToken || tokenType == html.SelfClosingTagToken {
                        nextToken := z.Token()
                        if nextToken.Data == "span" {
                            amountClass := false
                            for _, attr := range nextToken.Attr {
                                if attr.Key == "class" && strings.Contains(attr.Val, "amount") {
                                    amountClass = true
                                    break
                                }
                            }
 
                            // if the next span has class "amount," loop through and print its text content
                            if amountClass {
                                var amount strings.Builder
                                depth := 1 // nesting level of <span> tags, starting inside the "amount" span
 
                                for depth > 0 {
                                    tokenType = z.Next()
                                    if tokenType == html.ErrorToken {
                                        // stop at the end of the input or on a tokenizer error
                                        break
                                    }
 
                                    t := z.Token()
                                    switch {
                                    case tokenType == html.TextToken:
                                        amount.WriteString(t.Data)
                                    case tokenType == html.StartTagToken && t.Data == "span":
                                        depth++
                                    case tokenType == html.EndTagToken && t.Data == "span":
                                        depth--
                                    }
                                }
 
                                fmt.Println("Price:", strings.TrimSpace(amount.String()))
                            }
                        }
                    }
                }
 
            case "img":
                // check for the src attribute and retrieve its value
                for _, attr := range token.Attr {
                    if attr.Key == "src" {
                        imageURL := attr.Val
                        fmt.Println("Image URL:", imageURL)
                    }
                }
            }
        }
        if tokenType == html.ErrorToken {
            break
        }
    }
}

And here's the result:

Output
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg
Name: Abominable Hoodie
Price: $69.00

Awesome!

Now that you know how to extract data using both net/html APIs, let's discuss other possible tools for parsing HTML in Golang.

Alternatives to the net/html Library

While the net/html library offers a robust way to parse HTML documents, there are alternatives that may be better suited to specific project needs. Let's explore the most popular ones.

Goquery (Best)

Goquery is a popular HTML parsing library for Go that builds on the foundation of the net/html package but provides a more convenient and intuitive API inspired by jQuery. Its API allows you to perform HTML document traversals, selections, and manipulations using familiar jQuery-style syntax.

Also, it uses Cascadia, a CSS selector library that allows you to query HTML elements efficiently using CSS selectors. That means you can quickly extract specific information directly by class, ID, or tag name. Overall, it is a well-maintained and easy-to-use library, ideal for web scraping tasks.

For a step-by-step tutorial, check out our Goquery data parsing guide.

Gohtml

Gohtml is an HTML formatter for Go, not primarily a parsing library like net/html or Goquery. It takes the HTML source code and outputs formatted HTML and is a good choice for applications that need to render HTML templates quickly and efficiently.

However, it's not actively maintained, with no major update since October 2020.

Html2go

Html2go is a tool for converting HTML files into Go source code. This can be useful for generating code that can render HTML templates. However, it's no longer maintained.

Go-html-transform

The go-html-transform package is a Go library that allows you to scrape, parse, and transform HTML documents using a CSS selector-based approach. It is best suited for applications that need to modify the structure of HTML documents.

However, it may not offer the same level of performance as parsers like net/html and Goquery. Also, it's no longer maintained, with its last major update coming in 2016.

Conclusion

There are various libraries for parsing HTML in Golang. The net/html package is one of the most popular, and it offers two APIs for extracting data: the node parsing API and the tokenizer API.

While both options are robust solutions, the first one is preferable as it represents the HTML document as a tree of nodes, making it easier to use. However, this approach can also become cumbersome when dealing with complex structures.

If you get blocked while scraping web pages, consider ZenRows as an easy way to get the data to be parsed.
