Do you need a Golang HTML parser to transform the raw data from your Go web scraper into a structured, readable format like CSV or a database? In this tutorial, you'll learn how to navigate HTML documents and extract the information you want.
We'll use the recommended net/html library, but we'll also look at some other options.
Prerequisites
We'll use the net/html package, as it's one of the most popular Golang HTML parsers thanks to its efficiency and speed. Note that it isn't part of the standard library: it lives in the golang.org/x/net module, which is maintained by the Go team.
But before leveraging these capabilities, we must fetch the raw HTML data we'll be parsing throughout the tutorial. For that, we'll build a basic scraper that makes an HTTP request to ScrapingCourse.com, a demo website with e-commerce features, and retrieves its HTML content as the response. We'll use Go's built-in net/http library to make the request.
Here's the scraper code. Run it using go run main.go, and you'll have the page's raw HTML.
package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    // URL to make the HTTP request to
    url := "https://www.scrapingcourse.com/ecommerce/"

    // Make the GET request
    resp, err := http.Get(url)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    defer resp.Body.Close()

    // Read the response body
    bytes, err := io.ReadAll(resp.Body)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    // Print the body as a string
    fmt.Println("HTML:\n\n", string(bytes))
}
Now, let's start parsing! Install the net/html package using the following command (if your project doesn't have a go.mod file yet, create one with go mod init first):
go get golang.org/x/net/html
For the next steps, you should know that the net/html package offers two main APIs: the tokenizer API and the node parsing API. We'll explore both options in this tutorial, but the node parsing API is often preferred for its high-level abstraction and ease of use.
Option 1: Parse HTML with the Node Parsing API (Recommended)
The node parsing API is a higher-level abstraction of the tokenizer API. It represents the HTML document as a tree of nodes, where each node corresponds to an element, attribute, or text in the HTML.
Let's parse all the product data from the scraped page using this approach. To do that, start by parsing the response body from the request using the html.Parse() function.
//..
// Use the html package to parse the response body from the request
doc, err := html.Parse(resp.Body)
if err != nil {
    fmt.Println("Error:", err)
    return
}
html.Parse() takes an io.Reader as its argument; in this case, that's the response body obtained from the HTTP request.
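Because it accepts any io.Reader, html.Parse() is easy to experiment with: you can feed it a string instead of a live response. Here's a minimal, self-contained sketch (the HTML fragment is just an example) that parses a snippet and walks the resulting node tree:

package main

import (
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    // parse HTML from a string via strings.NewReader (any io.Reader works)
    doc, err := html.Parse(strings.NewReader("<p>Hello, <b>world</b>!</p>"))
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    // walk the node tree and print each element's tag name
    var walk func(*html.Node)
    walk = func(n *html.Node) {
        if n.Type == html.ElementNode {
            fmt.Println("element:", n.Data)
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            walk(c)
        }
    }
    walk(doc)
}

Running this prints html, head, body, p, and b, because html.Parse always builds a full document tree around whatever fragment it receives.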
Next, inspect the target web page https://www.scrapingcourse.com/ecommerce/ to identify the elements containing the data you want to extract.
All products are list elements inside an unordered list. Each list item holds the product's name, price, and image in its own tag (an h2, a span, and an img, respectively).
So, to extract those details, define a function that iterates through the nodes in the HTML document to find all list elements. Within the function, process each product's name, price, and image by calling a node-processing function that we'll define later. Next, traverse the child and sibling nodes to complete the function. Then, call the function on the parsed document in main to start the recursion.
func main() {
    //..

    // find all <li> elements
    var processAllProduct func(*html.Node)
    processAllProduct = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "li" {
            // process the product details within each <li> element
            processNode(n)
        }
        // traverse the child nodes
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            processAllProduct(c)
        }
    }

    // start the recursive traversal from the document root
    processAllProduct(doc)
}
Now, define the node-processing function. It takes a pointer to an HTML node as its argument; in this case, that's the list element's node.
In the node-processing function, traverse the HTML structure within the list element. Match the tag name of the current node to the desired elements (h2, span, and img) using the switch and case statements and extract their text content.
After that, traverse the child nodes to complete the function.
// process the details of the product within the <li> element
func processNode(n *html.Node) {
    switch n.Data {
    case "h2":
        // check if the FirstChild node of the h2 element is a text node
        if n.FirstChild != nil && n.FirstChild.Type == html.TextNode {
            // if yes, retrieve FirstChild's data (the name)
            name := n.FirstChild.Data
            // print the name
            fmt.Println("Name:", name)
        }
    case "span":
        // check for the span with class "amount"
        for _, a := range n.Attr {
            if a.Key == "class" && strings.Contains(a.Val, "amount") {
                // retrieve the text content of the "amount" span
                for c := n.FirstChild; c != nil; c = c.NextSibling {
                    if c.Type == html.TextNode {
                        // print the product price
                        fmt.Println("Price:", c.Data)
                    }
                }
            }
        }
    case "img":
        // check for the src attribute in the img tag
        for _, a := range n.Attr {
            if a.Key == "src" {
                // retrieve the src value
                imageURL := a.Val
                // print the image URL
                fmt.Println("Image URL:", imageURL)
            }
        }
    }
    // traverse child nodes
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        processNode(c)
    }
}
For h2, this code checks whether it has a non-nil FirstChild (a text node) and extracts the product name from that text node.

For span, if the tag has a class attribute containing amount, the code processes the amount span and extracts its text content. Notice that the node parsing API lets you easily select an element by its class. This isn't the case with the tokenizer API (more on that later).

Similarly, if the current node is an <img> tag, the code extracts and prints the product image URL from the src attribute.
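Extracting text node by node works here, but nested elements can split a value across several text nodes (for example, a currency symbol sitting in its own child span). If you hit that, a small recursive helper can gather all the text under a node. This is a hypothetical convenience function, not part of the tutorial code; it assumes the same strings import:

// textContent recursively concatenates all text nodes under n
// (hypothetical helper, not part of the tutorial code above)
func textContent(n *html.Node) string {
    // a text node carries its content directly
    if n.Type == html.TextNode {
        return n.Data
    }
    // otherwise, concatenate the text of all child nodes recursively
    var sb strings.Builder
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        sb.WriteString(textContent(c))
    }
    return sb.String()
}

For example, printing textContent(n) in the span case would also pick up a currency symbol if it lives in a nested child span.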
Finally, put everything together to create your complete code.
package main

import (
    "fmt"
    "net/http"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    //.. HTTP request and html.Parse (as shown above)

    // find all <li> elements
    var processAllProduct func(*html.Node)
    processAllProduct = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "li" {
            // process the product details within each <li> element
            processNode(n)
        }
        // traverse the child nodes
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            processAllProduct(c)
        }
    }

    // start the recursive traversal from the document root
    processAllProduct(doc)
}

// process the details of the product within the <li> element
func processNode(n *html.Node) {
    switch n.Data {
    case "h2":
        // check if the FirstChild node of the h2 element is a text node
        if n.FirstChild != nil && n.FirstChild.Type == html.TextNode {
            // if yes, retrieve FirstChild's data (the name)
            name := n.FirstChild.Data
            // print the name
            fmt.Println("Name:", name)
        }
    case "span":
        // check for the span with class "amount"
        for _, a := range n.Attr {
            if a.Key == "class" && strings.Contains(a.Val, "amount") {
                // retrieve the text content of the "amount" span
                for c := n.FirstChild; c != nil; c = c.NextSibling {
                    if c.Type == html.TextNode {
                        // print the product price
                        fmt.Println("Price:", c.Data)
                    }
                }
            }
        }
    case "img":
        // check for the src attribute in the img tag
        for _, a := range n.Attr {
            if a.Key == "src" {
                // retrieve the src value
                imageURL := a.Val
                // print the image URL
                fmt.Println("Image URL:", imageURL)
            }
        }
    }
    // traverse child nodes
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        processNode(c)
    }
}
Run it using go run main.go, and you'll get the following result.
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg
Name: Abominable Hoodie
Price: 69.00
Congrats! You've parsed your first HTML in Golang.
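The intro mentioned turning scraped data into a structured format like CSV. As a next step, here's a minimal sketch using the standard encoding/csv package. The example row reuses the values printed above; in a real scraper, you'd collect one row per product during traversal instead of printing:

package main

import (
    "encoding/csv"
    "fmt"
    "os"
)

func main() {
    // create the output file
    file, err := os.Create("products.csv")
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    defer file.Close()

    w := csv.NewWriter(file)

    // header row, then one example record reusing the values printed above
    w.Write([]string{"name", "price", "image_url"})
    w.Write([]string{"Abominable Hoodie", "69.00", "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg"})

    // flush buffered output and surface any write error
    w.Flush()
    if err := w.Error(); err != nil {
        fmt.Println("Error:", err)
    }
}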
Adding data from multiple pages will require crawling. Check out our Golang web crawling tutorial to learn more.
Option 2: Parse HTML Using the Tokenizer API
This alternative approach to parsing with net/html provides a low-level view of the HTML structure and is particularly useful when you need fine-grained control over the parsing process.
It breaks down an HTML document into tokens, each representing different elements, attributes, and text nodes. Here are the types of tokens and what they represent:
| Token Name | What It Represents |
|---|---|
| ErrorToken | An error occurred during tokenization. |
| TextToken | A text node. |
| EndTagToken | The closing tag of an HTML element, like </a>. |
| StartTagToken | The opening tag of an HTML element, like <a>. |
| SelfClosingTagToken | An HTML tag that doesn't need to be closed manually, such as <img/>. |
| CommentToken | A comment, which looks like <!--x-->. |
| DoctypeToken | A document type declaration, which looks like <!DOCTYPE x>. |
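To see these token types in action, here's a minimal sketch that tokenizes a small, made-up fragment and prints each token's type and data:

package main

import (
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    // a tiny fragment, purely for illustration
    z := html.NewTokenizer(strings.NewReader(`<a href="/x">link</a><img src="/y.png"/>`))
    for {
        tokenType := z.Next()
        // ErrorToken also signals the end of the input
        if tokenType == html.ErrorToken {
            break
        }
        fmt.Printf("%-16v %q\n", tokenType, z.Token().Data)
    }
}

This prints one line per token: StartTag, Text, EndTag, and SelfClosingTag, alongside each token's tag name or text.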
To get started, initialize a new HTML tokenizer, passing the response body as its argument. The tokenizer reads the content incrementally and provides a stream of tokens as it progresses through the HTML.
z := html.NewTokenizer(resp.Body)
Then, advance through the tokens using the Next() method and match on the returned token types to retrieve the ones you need.
//..
for {
    tokenType := z.Next()
    switch tokenType {
    case html.StartTagToken, html.SelfClosingTagToken:
        token := z.Token()
        //..
    }
Now, recall that all products are list elements in an unordered list, and the names, prices, and images sit in their respective tags within each list item. Therefore, we're specifically interested in start and self-closing tags. For each one encountered, retrieve its information using the Token() method.
Next, check for all the list elements and process their details to extract the names, prices, and images of each product.
//..
// Check for all <li> elements
if token.Data == "li" {
    // Process the details of the product within this <li> element
    processProductDetails(z)
    // Optionally, uncomment the return below to stop after the first product
    // return
}
}
Notice that we called the yet-undefined function processProductDetails(). It'll process the names, prices, and images.
After that, check for errors and handle them to close the main function.
//..
if tokenType == html.ErrorToken {
    break
}
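Note that ErrorToken covers both real parsing failures and the normal end of input. To tell them apart, the tokenizer's Err() method returns io.EOF when the input is simply exhausted. Here's a sketch of stricter handling, assuming you add io to the imports:

//..
if tokenType == html.ErrorToken {
    // z.Err() is io.EOF at the normal end of input; anything else is a real error
    if z.Err() != io.EOF {
        fmt.Println("tokenizer error:", z.Err())
    }
    break
}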
Now, let's define processProductDetails().
Loop through tokens and retrieve the relevant token types within the list element.
func processProductDetails(z *html.Tokenizer) {
    // Retrieve tokens for relevant data within the <li> element
    for {
        tokenType := z.Next()
        switch tokenType {
        case html.StartTagToken, html.SelfClosingTagToken:
            token := z.Token()
            //..
        }
    }
}
Lastly, parse the tokens for the relevant data, one after the other. Then match each token to the desired element and process it using the switch/case conditions.
For the name, fetch the next token in the <h2> tag and check if it's plain HTML text content. If it is, retrieve the text content of the current token, assign it to the name variable, and print it to the console to verify it works.
//..
// Evaluate tokens for h2
switch token.Data {
case "h2":
    // Fetch the next token within h2 and extract its text content
    tokenType = z.Next()
    if tokenType == html.TextToken {
        name := z.Token().Data
        fmt.Println("Name:", name)
    }
}
Similarly, for the image, evaluate the tokens within the list element to find the image tag and extract the image URL.
//..
case "img":
    // check for the src attribute and retrieve its value
    for _, attr := range token.Attr {
        if attr.Key == "src" {
            imageURL := attr.Val
            fmt.Println("Image URL:", imageURL)
        }
    }
Parsing the price requires a slightly different approach because of the nested span elements. Here, check for the span with class price. Then, fetch the next token and check if it's a span with class amount. If so, loop through and print its text content.
case "span":
// Check for the span with class "price"
hasPriceClass := false
for _, attr := range token.Attr {
if attr.Key == "class" && strings.Contains(attr.Val, "price") {
hasPriceClass = true
break
}
}
// ...
if hasPriceClass {
tokenType = z.Next()
// Check if the next token is a span with class "amount"
if tokenType == html.StartTagToken || tokenType == html.SelfClosingTagToken {
nextToken := z.Token()
if nextToken.Data == "span" {
amountClass := false
for _, attr := range nextToken.Attr {
if attr.Key == "class" && strings.Contains(attr.Val, "amount") {
amountClass = true
break
}
}
// If the next span has class "amount," loop through and print its text content
if amountClass {
var currencySymbol, priceValue string
for {
tokenType = z.Next()
if tokenType == html.TextToken {
currencySymbol = z.Token().Data
}
tokenType = z.Next()
if tokenType == html.TextToken {
priceValue = z.Token().Data
} else if tokenType == html.EndTagToken && z.Token().Data == "span" {
break
}
}
amount := currencySymbol + priceValue
fmt.Println("Price:", amount)
}
}
}
}
Putting everything together, you should have the following complete code:
package main

import (
    "fmt"
    "net/http"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    //.. HTTP request (as shown earlier)

    // Create an HTML tokenizer
    z := html.NewTokenizer(resp.Body)

    // Loop through HTML tokens
    for {
        tokenType := z.Next()
        switch tokenType {
        case html.StartTagToken, html.SelfClosingTagToken:
            token := z.Token()
            // Check for all <li> elements
            if token.Data == "li" {
                // Process the details of the product within this <li> element
                processProductDetails(z)
                // Optionally, uncomment the return below to stop after the first product
                // return
            }
        }
        if tokenType == html.ErrorToken {
            break
        }
    }
}

func processProductDetails(z *html.Tokenizer) {
    // Retrieve tokens for relevant data within the <li> element
    for {
        tokenType := z.Next()
        switch tokenType {
        case html.StartTagToken, html.SelfClosingTagToken:
            token := z.Token()
            // Match each token to the desired element
            switch token.Data {
            case "h2":
                // Extract the product name
                tokenType = z.Next()
                if tokenType == html.TextToken {
                    name := z.Token().Data
                    fmt.Println("Name:", name)
                }
            case "span":
                // Check for the span with class "price"
                hasPriceClass := false
                for _, attr := range token.Attr {
                    if attr.Key == "class" && strings.Contains(attr.Val, "price") {
                        hasPriceClass = true
                        break
                    }
                }
                // ...
                if hasPriceClass {
                    tokenType = z.Next()
                    // Check if the next token is a span with class "amount"
                    if tokenType == html.StartTagToken || tokenType == html.SelfClosingTagToken {
                        nextToken := z.Token()
                        if nextToken.Data == "span" {
                            amountClass := false
                            for _, attr := range nextToken.Attr {
                                if attr.Key == "class" && strings.Contains(attr.Val, "amount") {
                                    amountClass = true
                                    break
                                }
                            }
                            // If the next span has class "amount", loop through and print its text content
                            if amountClass {
                                var currencySymbol, priceValue string
                                for {
                                    tokenType = z.Next()
                                    if tokenType == html.ErrorToken {
                                        break // stop on malformed or truncated input
                                    }
                                    if tokenType == html.TextToken {
                                        currencySymbol = z.Token().Data
                                    }
                                    tokenType = z.Next()
                                    if tokenType == html.TextToken {
                                        priceValue = z.Token().Data
                                    } else if tokenType == html.EndTagToken && z.Token().Data == "span" {
                                        break
                                    }
                                }
                                amount := currencySymbol + priceValue
                                fmt.Println("Price:", amount)
                            }
                        }
                    }
                }
            case "img":
                // Check for the src attribute and retrieve its value
                for _, attr := range token.Attr {
                    if attr.Key == "src" {
                        imageURL := attr.Val
                        fmt.Println("Image URL:", imageURL)
                    }
                }
            }
        }
        if tokenType == html.ErrorToken {
            break
        }
    }
}
And here's the result:
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg
Name: Abominable Hoodie
Price: $69.00
Awesome!
Now that you know how to extract data using both net/html APIs, let's discuss other possible tools for parsing HTML in Golang.
Alternatives to the net/html Library
While the net/html library offers a robust way to parse HTML documents, there are alternatives that may be better suited to specific project needs. Let's explore the most popular ones.
Goquery (Best)
Goquery is a popular HTML parsing library for Go that builds on the foundation of the net/html package but provides a more convenient and intuitive API inspired by jQuery. Its API allows you to perform HTML document traversals, selections, and manipulations using familiar jQuery-style syntax.
Also, it uses Cascadia, a CSS selector library that allows you to query HTML elements efficiently using CSS selectors. That means you can quickly extract specific information directly by class, ID, or tag name. Overall, it is a well-maintained and easy-to-use library, ideal for web scraping tasks.
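As a quick taste of how much boilerplate the CSS-selector approach removes, here's a sketch of the same product extraction in Goquery. Note that the li.product, h2, and span.price selectors are assumptions about the demo page's markup; verify them in your browser's developer tools:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    resp, err := http.Get("https://www.scrapingcourse.com/ecommerce/")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // build a queryable document from the response body
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // select each product <li> by class and read the nested fields
    doc.Find("li.product").Each(func(i int, s *goquery.Selection) {
        name := s.Find("h2").Text()
        price := s.Find("span.price").Text()
        imageURL, _ := s.Find("img").Attr("src")
        fmt.Println("Name:", name, "| Price:", price, "| Image URL:", imageURL)
    })
}

Each Selection method mirrors its jQuery counterpart, which is why the code stays this compact compared to manual node traversal.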
For a step-by-step tutorial, check out our Goquery data parsing guide.
Gohtml
Gohtml is an HTML formatter for Go, not primarily a parsing library like net/html or Goquery. It takes HTML source code and outputs neatly formatted HTML, which makes it handy for applications that generate HTML and want readable output.
However, it's not actively maintained, with no major update since October 2020.
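If you still want to try it, usage is essentially a single call. A sketch, assuming the github.com/yosssi/gohtml import path and its Format function:

package main

import (
    "fmt"

    "github.com/yosssi/gohtml"
)

func main() {
    // Format indents raw HTML for readability; it doesn't build a parse tree you can query
    raw := "<html><head></head><body><p>Hello</p></body></html>"
    fmt.Println(gohtml.Format(raw))
}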
Html2go
Html2go is a tool for converting HTML files into Go source code. This can be useful for generating code that can render HTML templates. However, it's no longer maintained.
Go-html-transform
The go-html-transform package is a Go library that allows you to scrape, parse, and transform HTML documents using a CSS selector-based approach. It is best suited for applications that need to modify the structure of HTML documents.
However, it may not offer the same level of performance as parsers like net/html and Goquery. Also, it's no longer maintained, with its last major update coming in 2016.
Conclusion
There are various libraries for parsing HTML in Golang. The net/html package is one of the most popular and offers two APIs for extracting data: the node parsing API and the tokenizer API.
While both are robust solutions, the node parsing API is usually preferable because it represents the HTML document as a tree of nodes, which is easier to work with. Even so, manual node traversal can become cumbersome when dealing with complex structures.
If you get blocked while scraping web pages, consider ZenRows as an easy way to get the data to be parsed.