Nokogiri Tutorial: Best HTML Parser for Ruby

July 8, 2024 · 8 min read

You've done the hard work of accessing a website and fetching its raw HTML data. But how do you turn it into easy-to-access information?

When web scraping in Ruby, consider Nokogiri, the tried-and-tested HTML parser.

In this Nokogiri tutorial, you'll learn how to parse HTML in Ruby, starting with the basics and progressing to more complex scenarios using real-world examples.

What Is Nokogiri?

Nokogiri is a Ruby wrapper around the C libraries libxml2 and libgumbo. This design gives Nokogiri the performance of a C-based HTML/XML parser combined with Ruby's ease of use.

Nokogiri supports node manipulation, XPath, and CSS3 selectors, including jQuery-like extensions to the CSS syntax. These features allow you to easily read, write, modify, and query HTML and XML documents. All this makes Nokogiri a valuable tool for large-scale scraping projects.
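
To see those features in action, here's a minimal sketch that parses an inline HTML fragment, queries the same node with both a CSS selector and XPath, and then modifies it:

Example
require 'nokogiri'

# parse an inline HTML fragment
doc = Nokogiri::HTML('<div><h2 class="title">Hello</h2></div>')

# query the same node with a CSS selector and with XPath
puts doc.at_css('h2.title').text                # => Hello
puts doc.at_xpath('//h2[@class="title"]').text  # => Hello

# modify the node and serialize the document back to HTML
doc.at_css('h2.title').content = 'Goodbye'
puts doc.to_html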

How to Parse HTML With Nokogiri in Ruby

Parsing HTML with Nokogiri in Ruby is easy. Once you load the HTML document, you can query the DOM for the target elements and extract your desired data.

Below, you’ll find a step-by-step Nokogiri tutorial on parsing HTML in Ruby. As an exercise, you’ll extract data from the Scraping Course, a sample eCommerce website for testing web scrapers.

ScrapingCourse.com Ecommerce homepage

Step #1: Install Nokogiri

Nokogiri provides "native gems," which bundle precompiled libraries, eliminating the need to compile C extensions or configure system dependencies on supported platforms and making installation straightforward.

First, ensure you're running Ruby 3.0 or higher. Then, install Nokogiri using native gems:

Terminal
gem install nokogiri

Your installation should only take a few seconds, and your output should look like this:

Output
Fetching nokogiri-1.16.5-x64-mingw-ucrt.gem
Successfully installed nokogiri-1.16.5-x64-mingw-ucrt

Alternatively, if you're using Bundler, add Nokogiri to your Gemfile. The entry is a single line (the version constraint shown below is optional):
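
Gemfile
# pin a version with the pessimistic operator if you need reproducible builds
gem 'nokogiri', '~> 1.16'

Then, run the following command to install it: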

Terminal
bundle install

Once your installation is complete, create a .rb file in your project directory and require the necessary libraries: Nokogiri and open-uri (an HTTP client from the Ruby standard library for making GET requests and retrieving raw HTML).

scraper.rb
require 'nokogiri'
require 'open-uri'

All done! Now, let's move on to extracting HTML data.

Step #2: Extract HTML

Let's fetch the raw HTML that you'll later parse.

Make a GET request to the target URL (https://www.scrapingcourse.com/ecommerce/) using open-uri and retrieve the response.

scraper.rb
# import the necessary modules
require 'nokogiri'
require 'open-uri'
 
begin
  # make the GET request and retrieve the HTML content
  html_content = URI.open('https://www.scrapingcourse.com/ecommerce/').read
 
  # output the HTML content
  puts html_content
 
rescue OpenURI::HTTPError => e
  puts "An error occurred: #{e.message}"
end
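
By default, open-uri sends very minimal request headers. If the target server rejects the default client, URI.open also accepts request headers as a hash of options; the User-Agent string below is just an illustrative example:

scraper.rb
# optional: pass request headers to URI.open (the value here is an example)
html_content = URI.open(
  'https://www.scrapingcourse.com/ecommerce/',
  'User-Agent' => 'Mozilla/5.0 (compatible; RubyScraper/1.0)'
).read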

Step #3: Parse Your First Data

When parsing HTML with Nokogiri, you can search for nodes using XPath, CSS selectors, or a mix of both. The best option depends on your project's needs and use cases. If you're unsure which method to choose, check out our XPath vs CSS selector comparison guide.

In this tutorial, we'll use CSS selectors to extract the first product's title since they're more concise and readable.
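
For comparison, here's how the same query looks in both syntaxes once the document is loaded (which we'll do below); this tutorial sticks with the CSS form:

Example
# two equivalent ways to select the first h2 on the page
doc.at_css('h2')      # CSS selector
doc.at_xpath('//h2')  # XPath expression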

To start parsing your first data, inspect the target web page in a browser to identify the target HTML element. In this case, it's the first product's title (an h2 element).

This web page has a basic structure, so it's easy to see that the first product title sits in the page's first h2. For more complex websites, you may need to right-click the element in your browser's developer tools and select Copy > Copy selector to get its CSS selector.

CSS Selector

Return to your scraping script, and load the HTML document retrieved in Step 2.

scraper.rb
# import the necessary modules
require 'nokogiri'
require 'open-uri'
 
begin
  #...
 
  # load the HTML document
  doc = Nokogiri::HTML(html_content)

Then, select the target element using the CSS selector you obtained by inspecting the web page and print its text content.

scraper.rb
begin
  #...
  
  # select the first product title
  first_product_title = doc.at_css('h2')
  puts first_product_title.text
 
rescue OpenURI::HTTPError => e
  puts "An error occurred: #{e.message}"
end
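
Note that at_css returns nil when no node matches, so calling .text on a missed selector raises NoMethodError. A defensive variant using Ruby's safe navigation operator could look like this:

Example
# &. short-circuits to nil instead of raising when the node is missing
first_product_title = doc.at_css('h2')
puts first_product_title&.text || 'No product title found'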

Combine all the code snippets for the following final script:

scraper.rb
# import the necessary modules
require 'nokogiri'
require 'open-uri'
 
begin
  # make the GET request and retrieve the HTML content
  html_content = URI.open('https://www.scrapingcourse.com/ecommerce/').read
 
  # load HTML document
  doc = Nokogiri::HTML(html_content)
  
  # select the first product title
  first_product_title = doc.at_css('h2')
  puts first_product_title.text
 
rescue OpenURI::HTTPError => e
  puts "An error occurred: #{e.message}"
end

Run it, and you'll get the first product's name.

Output
Abominable Hoodie

Congratulations! You've just created your first Ruby HTML parser using Nokogiri.

Step #4: Extract More Data

Now that you're familiar with the basics, let's tackle a scenario closer to a real-world use case: extracting the title, image URL, and link of each product on the page.

But before diving in, let's differentiate between the two Nokogiri methods we'll use in this example: at_css and css.

The at_css method returns only the first element that matches the CSS selector, while css returns a NodeSet containing all the matching elements.
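
A small standalone sketch makes the difference concrete:

Example
require 'nokogiri'

doc = Nokogiri::HTML('<ul><li>one</li><li>two</li></ul>')

puts doc.at_css('li').text              # => one (first match only)
puts doc.css('li').length               # => 2 (a NodeSet of all matches)
puts doc.css('li').map(&:text).inspect  # => ["one", "two"]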

We're after each product's title, image URL, and link, which means a set of elements. Therefore, we'll select all products using css before iterating through each item to extract the individual elements.

Let's put this into practice.

Start by selecting all product elements using css. You may need to inspect the web page in a browser to identify your CSS selector (li.product).

scraper.rb
begin
  #...
  
  # select all product elements
  products = doc.css('li.product')

rescue OpenURI::HTTPError => e
  puts "An error occurred: #{e.message}"
end

Afterward, iterate over each product and use at_css to retrieve the first matched title, image URL, and link within each product element.

scraper.rb
begin
  #...
  
  products.each do |product|
    # select and print the product title
    product_title = product.at_css('h2')
    puts "Title: #{product_title.text}" 
    
    # select and print the product image URL
    product_image = product.at_css('img')
    puts "Image URL: #{product_image['src']}" 
    
    # select and print the product link
    product_link = product.at_css('a')
    puts "Product Link: #{product_link['href']}"
 
    # print a separator for readability
    puts "-" * 40
  end
 
rescue OpenURI::HTTPError => e
  puts "An error occurred: #{e.message}"
end

Put everything together. Your final code should look like this:

scraper.rb
# import the necessary modules
require 'nokogiri'
require 'open-uri'
 
begin
  # make the GET request and retrieve the HTML content
  html_content = URI.open('https://www.scrapingcourse.com/ecommerce/').read
 
  # load HTML document
  doc = Nokogiri::HTML(html_content)
  
  # select all product elements
  products = doc.css('li.product')
 
  # iterate through each product  
  products.each do |product|
    # select and print the product title
    product_title = product.at_css('h2')
    puts "Title: #{product_title.text}" 
    
    # select and print the product image URL
    product_image = product.at_css('img')
    puts "Image URL: #{product_image['src']}" 
    
    # select and print the product link
    product_link = product.at_css('a')
    puts "Product Link: #{product_link['href']}"
 
    # print a separator for readability
    puts "-" * 40
  end
 
rescue OpenURI::HTTPError => e
  puts "An error occurred: #{e.message}"
end

Run it, and it'll return each product's title, image URL, and link.

Output
Title: Abominable Hoodie
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg
Product Link: https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/
----------------------------------------
Title: Adrienne Trek Jacket
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_main-324x324.jpg
Product Link: https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/
----------------------------------------
 
#... truncated for brevity ...#

Good job!

Step #5: Export Data to CSV

Real-world web scraping scenarios usually involve extracting data in a structured format, such as CSV. Let's see how to do it with Nokogiri.

First, import Ruby's CSV library. After selecting all products as you did in Step 4, open the CSV file, specify the file path, and add headers.

scraper.rb
# import the necessary modules
require 'nokogiri'
require 'open-uri'
require 'csv'
 
begin
  #...
 
  # open the CSV file in write mode and add headers
  CSV.open('products.csv', 'w', headers: true) do |csv|
    # write headers to the CSV file
    csv << ['Title', 'Image URL', 'Product Link']

Next, iterate over each product, select the desired details, and write to CSV. Remember to handle errors.

scraper.rb
begin
  #...
 
    # iterate over each product 
    products.each do |product|
      # select product details
      product_title = product.at_css('h2').text
      product_image_url = product.at_css('img')['src']
      product_link = product.at_css('a')['href']
 
      # write product details to the CSV file
      csv << [product_title, product_image_url, product_link]
    end
  end
 
  puts "Data has been successfully exported to products.csv"
 
rescue OpenURI::HTTPError => e
  puts "An error occurred: #{e.message}"
rescue StandardError => e
  puts "An unexpected error occurred: #{e.message}"
end

Combine everything to get your final code.

scraper.rb
# import the necessary modules
require 'nokogiri'
require 'open-uri'
require 'csv'
 
begin
  # make the GET request and retrieve the HTML content
  html_content = URI.open('https://www.scrapingcourse.com/ecommerce/').read
 
  # load HTML content
  doc = Nokogiri::HTML(html_content)
  
  # select all product elements
  products = doc.css('li.product')
 
  # open the CSV file in write mode and add headers
  CSV.open('products.csv', 'w', headers: true) do |csv|
    # write headers to the CSV file
    csv << ['Title', 'Image URL', 'Product Link']
 
    # iterate over each product 
    products.each do |product|
      # select product details
      product_title = product.at_css('h2').text
      product_image_url = product.at_css('img')['src']
      product_link = product.at_css('a')['href']
 
      # write product details to the CSV file
      csv << [product_title, product_image_url, product_link]
    end
  end
 
  puts "Data has been successfully exported to #{csv_file_path}"
 
rescue OpenURI::HTTPError => e
  puts "An error occurred: #{e.message}"
rescue StandardError => e
  puts "An unexpected error occurred: #{e.message}"
end

Your result should be a products.csv file, like in the image below.

CSV File
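
To double-check the export, you can read the file back with the same CSV library, for example in a small helper script (call it verify.rb):

verify.rb
require 'csv'

# read the exported file, treating the first row as headers
CSV.foreach('products.csv', headers: true) do |row|
  puts "#{row['Title']} -> #{row['Product Link']}"
end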

Congratulations! You've now successfully extracted, parsed, and exported data with Ruby and Nokogiri.

Conclusion

Creating a Ruby HTML parser is a breeze with Nokogiri. All you need to do is load the HTML document and query it for your desired data.

Following this tutorial, you've gone from the basics to more complex scenarios. Still, it's just the tip of the iceberg when it comes to parsing HTML in Ruby. To learn how to parse multiple pages and scrape without getting blocked, check out this advanced guide on web crawling in Ruby.
