You've done the hard work of accessing a website and fetching its raw HTML data. But how do you turn it into easy-to-access information?
When web scraping in Ruby, consider Nokogiri, the tried-and-tested HTML parser.
In this Nokogiri tutorial, you'll learn how to parse HTML in Ruby, starting with the basics and progressing to more complex scenarios using real-world examples.
What Is Nokogiri?
Nokogiri is a Ruby wrapper around the C libraries libxml2 and libgumbo. This combination gives Nokogiri the high-grade performance of a C-based HTML/XML parser together with Ruby's ease of use.
Nokogiri supports node manipulation, XPath, and CSS3 selectors. Combined with its jQuery-like extensions, these features let you easily read, write, modify, and query HTML and XML documents. All this and more make Nokogiri a valuable tool for large-scale scraping projects.
How to Parse HTML With Nokogiri in Ruby
Parsing HTML with Nokogiri in Ruby is easy. Once you load the HTML document, you can query the DOM for the target elements and extract your desired data.
Below, you’ll find a step-by-step Nokogiri tutorial on parsing HTML in Ruby. As an exercise, you’ll extract data from the Scraping Course, a sample eCommerce website for testing web scrapers.
Step #1: Install Nokogiri
Nokogiri ships precompiled "native gems," which eliminate the need to compile C extensions or configure system dependencies, making installation straightforward.
First, ensure you're running Ruby 3.0 or higher. Then, install Nokogiri using native gems:
gem install nokogiri
Your installation should only take a few seconds, and your output should look like this:
Fetching nokogiri-1.16.5-x64-mingw-ucrt.gem
Successfully installed nokogiri-1.16.5-x64-mingw-ucrt
Alternatively, if you're using Bundler, add Nokogiri to your Gemfile and run the following command to install it:
bundle install
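For reference, here's what a minimal Gemfile entry looks like (a sketch; the source line is standard, and version pinning is optional):

```ruby
# Gemfile
source 'https://rubygems.org'

gem 'nokogiri'
```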
Once your installation is complete, create a .rb file in your project directory and require the necessary modules: nokogiri and open-uri (an HTTP client from the Ruby standard library for making GET requests and retrieving raw HTML).
require 'nokogiri'
require 'open-uri'
All done! Now, let's move on to extracting HTML data.
Step #2: Extract HTML
Let's extract the HTML source file that you’ll later parse.
Make a GET request to the target URL (https://www.scrapingcourse.com/ecommerce/) using OpenURI and retrieve the response.
# import the necessary modules
require 'nokogiri'
require 'open-uri'

begin
  # make the GET request and retrieve the HTML content
  html_content = URI.open('https://www.scrapingcourse.com/ecommerce/').read

  # output the HTML content
  puts html_content
rescue OpenURI::HTTPError => e
  puts "An error occurred: #{e.message}"
end
If the target website blocks your request, consider using a web scraping API, such as ZenRows, to bypass any anti-bot system. ZenRows supports Ruby and guarantees easy, uninterrupted scraping.
Step #3: Parse Your First Data
When parsing HTML with Nokogiri, you can search for nodes using XPath, CSS selectors, or a mix of both. The best option depends on your project's needs and use cases. If you're unsure which method to choose, check out our XPath vs CSS selector comparison guide.
In this tutorial, we'll use CSS selectors to extract the first product's title since they're shorter and more readable.
To start parsing your first data, inspect the target web page in a browser to identify the target HTML element. In this case, it's the first product's title (an h2 tag).

This web page has a basic structure, and it's easy to see that the first product title is located in the page's first h2. For more complex websites, you may need to right-click the element and select Copy > Copy selector to get its CSS selector.
Return to your scraping script and load the HTML document retrieved in Step 2.
# import the necessary modules
require 'nokogiri'
require 'open-uri'

begin
  #...

  # load the HTML document
  doc = Nokogiri::HTML(html_content)
Then, select the target element using the CSS selector you obtained by inspecting the web page and print its text content.
begin
  #...

  # select the first product title
  first_product_title = doc.at_css('h2')
  puts first_product_title.text
rescue OpenURI::HTTPError => e
  puts "An error occurred: #{e.message}"
end
Combine all the code snippets for the following final script:
# import the necessary modules
require 'nokogiri'
require 'open-uri'

begin
  # make the GET request and retrieve the HTML content
  html_content = URI.open('https://www.scrapingcourse.com/ecommerce/').read

  # load the HTML document
  doc = Nokogiri::HTML(html_content)

  # select the first product title
  first_product_title = doc.at_css('h2')
  puts first_product_title.text
rescue OpenURI::HTTPError => e
  puts "An error occurred: #{e.message}"
end
Run it, and you'll get the first product's name.
Abominable Hoodie
Congratulations! You've just created your first Ruby HTML parser using Nokogiri.
Step #4: Extract More Data
Now that you're familiar with the basics, let's tackle a scenario closer to a real-world use case: extracting the title, image URL, and link for each product on the page.
But before diving in, let's differentiate between the two Nokogiri methods we'll use in this example: at_css and css.

The at_css method returns only the first element that matches the CSS selector, while css returns a NodeSet of all the matched elements.

We're after each product's title, image URL, and link, which means a set of elements. Therefore, we'll select all products using css before iterating through each item to extract the individual elements.
Let's put this into practice.
Start by selecting all product elements using css. You may need to inspect the web page in a browser to identify the right CSS selector (li.product).
begin
  #...

  # select all product elements
  products = doc.css('li.product')
rescue OpenURI::HTTPError => e
  puts "An error occurred: #{e.message}"
end
Afterward, iterate over each product and use at_css to retrieve the first matched title, image URL, and link within each product element.
begin
  #...

  products.each do |product|
    # select and print the product title
    product_title = product.at_css('h2')
    puts "Title: #{product_title.text}"

    # select and print the product image URL
    product_image = product.at_css('img')
    puts "Image URL: #{product_image['src']}"

    # select and print the product link
    product_link = product.at_css('a')
    puts "Product Link: #{product_link['href']}"

    # print a separator for readability
    puts "-" * 40
  end
rescue OpenURI::HTTPError => e
  puts "An error occurred: #{e.message}"
end
Put everything together. Your final code should look like this:
# import the necessary modules
require 'nokogiri'
require 'open-uri'

begin
  # make the GET request and retrieve the HTML content
  html_content = URI.open('https://www.scrapingcourse.com/ecommerce/').read

  # load the HTML document
  doc = Nokogiri::HTML(html_content)

  # select all product elements
  products = doc.css('li.product')

  # iterate through each product
  products.each do |product|
    # select and print the product title
    product_title = product.at_css('h2')
    puts "Title: #{product_title.text}"

    # select and print the product image URL
    product_image = product.at_css('img')
    puts "Image URL: #{product_image['src']}"

    # select and print the product link
    product_link = product.at_css('a')
    puts "Product Link: #{product_link['href']}"

    # print a separator for readability
    puts "-" * 40
  end
rescue OpenURI::HTTPError => e
  puts "An error occurred: #{e.message}"
end
Run it, and it'll return each product's title, image URL, and link.
Title: Abominable Hoodie
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg
Product Link: https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/
----------------------------------------
Title: Adrienne Trek Jacket
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_main-324x324.jpg
Product Link: https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/
----------------------------------------
#... truncated for brevity ...#
Good job!
Step #5: Export Data to CSV
Real-world web scraping scenarios usually involve extracting data in a structured format, such as CSV. Let's see how to do it with Nokogiri.
First, import Ruby's CSV library. After selecting all products as you did in Step 4, open the CSV file, specify the file path, and add headers.
# import the necessary modules
require 'nokogiri'
require 'open-uri'
require 'csv'

begin
  #...

  # open the CSV file in write mode and add headers
  CSV.open('products.csv', 'w', headers: true) do |csv|
    # write headers to the CSV file
    csv << ['Title', 'Image URL', 'Product Link']
Next, iterate over each product, select the desired details, and write to CSV. Remember to handle errors.
begin
  #...

    # iterate over each product
    products.each do |product|
      # select product details
      product_title = product.at_css('h2').text
      product_image_url = product.at_css('img')['src']
      product_link = product.at_css('a')['href']

      # write product details to the CSV file
      csv << [product_title, product_image_url, product_link]
    end
  end

  puts "Data has been successfully exported to products.csv"
rescue OpenURI::HTTPError => e
  puts "An error occurred: #{e.message}"
rescue StandardError => e
  puts "An unexpected error occurred: #{e.message}"
end
Combine everything to get your final code.
# import the necessary modules
require 'nokogiri'
require 'open-uri'
require 'csv'

begin
  # make the GET request and retrieve the HTML content
  html_content = URI.open('https://www.scrapingcourse.com/ecommerce/').read

  # load the HTML document
  doc = Nokogiri::HTML(html_content)

  # select all product elements
  products = doc.css('li.product')

  # open the CSV file in write mode and add headers
  CSV.open('products.csv', 'w', headers: true) do |csv|
    # write headers to the CSV file
    csv << ['Title', 'Image URL', 'Product Link']

    # iterate over each product
    products.each do |product|
      # select product details
      product_title = product.at_css('h2').text
      product_image_url = product.at_css('img')['src']
      product_link = product.at_css('a')['href']

      # write product details to the CSV file
      csv << [product_title, product_image_url, product_link]
    end
  end

  puts "Data has been successfully exported to products.csv"
rescue OpenURI::HTTPError => e
  puts "An error occurred: #{e.message}"
rescue StandardError => e
  puts "An unexpected error occurred: #{e.message}"
end
Your result should be a products.csv file containing each product's title, image URL, and link.
Congratulations! You've now successfully extracted, parsed, and exported data with Ruby and Nokogiri.
Conclusion
Creating a Ruby HTML parser is a breeze with Nokogiri. All you need to do is load the HTML document using Nokogiri and query for your desired data.
Following this tutorial, you've gone from the basics to more complex scenarios. Still, it's just the tip of the iceberg when it comes to parsing HTML in Ruby. To learn how to parse multiple pages and scrape without getting blocked, check out this advanced guide on web crawling in Ruby.