Would you like to bring Playwright's capabilities to your Ruby web scraping projects? The playwright-ruby-client is a community-maintained Ruby library that allows you to do just that.
In this tutorial, you'll learn how to use Playwright in Ruby, from basic scraping to more advanced browser interactions.
Let's go!
How to Use Playwright in Ruby
To understand how to use Playwright in Ruby, you'll scrape the Infinite Scrolling demo page at https://www.scrapingcourse.com/infinite-scrolling.
This page keeps displaying new items as you scroll down. The content loads dynamically, making it a perfect target page to demonstrate headless browser capabilities.
Let's scrape it!
Step 1: Install Playwright in Ruby
To get started, create a directory for your project and install the playwright-ruby-client gem.
Playwright is primarily a Node.js library developed by Microsoft. While Microsoft has not released an official Ruby client for Playwright, the Ruby community has developed the playwright-ruby-client gem, which provides bindings to the Playwright API.
gem install playwright-ruby-client
By default, this gem doesn't include the Playwright CLI or the browser binaries. Therefore, you first need to install the Playwright Node.js package using the following command:
npm install playwright
Then run this command to install the browser binaries (Chromium, Firefox, and WebKit):
./node_modules/.bin/playwright install
For Windows machines, you need to run the following command to install the browser binaries:
node_modules\.bin\playwright install
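Alternatively, if you manage your project's dependencies with Bundler, you can declare the gem in a Gemfile instead of installing it globally. A minimal sketch:
# Gemfile
source 'https://rubygems.org'

gem 'playwright-ruby-client'
Then run bundle install to install it.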
Great! You now have everything you need to start scraping with Playwright in Ruby.
Step 2: Scrape With Playwright in Ruby
Let's set up a basic script that extracts the entire HTML of the target demo page.
Initialize the Playwright library and launch the Chromium browser in headless mode. Then, create a browser context and navigate to the target URL. Finally, print the HTML content.
Here's what your script should look like:
require 'playwright'
# initialize Playwright
Playwright.create(playwright_cli_executable_path: 'node_modules/.bin/playwright') do |playwright|
# launch a browser (Chromium in this case)
browser = playwright.chromium.launch(headless: true)
# open a new browser context
context = browser.new_context
# open a new page
page = context.new_page
# navigate to the desired URL
url = 'https://www.scrapingcourse.com/infinite-scrolling'
page.goto(url)
# display the HTML content
puts page.content
# close the browser
browser.close
end
After running this code, you'll get the following output:
<!DOCTYPE html>
<html lang="en">
<head>
<!-- ... -->
<title>Infinite Scroll Challenge to Learn Web Scraping - ScrapingCourse.com</title>
<!-- ... -->
</head>
<body>
<!-- ... -->
</body>
</html>
Awesome! That's the HTML of the target page.
Step 3: Extract the Data You Want
Playwright includes functionality for querying a page's DOM to extract specific details. For the sake of this tutorial, let's parse the name and price of each product. You need to:
- Locate the product elements by applying a DOM selector strategy.
- Extract the required data.
- Store the extracted data in a Ruby data structure (for example, array).
For simplicity, we'll use CSS selectors to locate the product elements.
Start with opening the Infinite Scroll demo page in a browser and inspecting the first product using DevTools.
You'll notice that each product item is within a div with the product-item class. Inside each item, a product-info element wraps the product name and price, which are enclosed within span tags with the classes product-name and product-price, respectively.
Let's use this information to locate the product items and extract the required data.
Locate all elements matching the .product-info selector using the query_selector_all method. Store the returned element handles in an array (in this case, products). Then, iterate through this array to extract the product names and prices using the query_selector method. Store the extracted data in an array of hashes (product_data).
Here's the modified code:
require 'playwright'
# initialize Playwright
Playwright.create(playwright_cli_executable_path: 'node_modules/.bin/playwright') do |playwright|
# launch a browser (Chromium in this case)
browser = playwright.chromium.launch(headless: true)
# open a new browser context
context = browser.new_context
# open a new page
page = context.new_page
# navigate to the desired URL
url = 'https://www.scrapingcourse.com/infinite-scrolling'
page.goto(url)
# extract product names and prices
products = page.query_selector_all('.product-info')
product_data = products.map do |product|
name = product.query_selector('.product-name').text_content.strip
price = product.query_selector('.product-price').text_content.strip
{ name: name, price: price }
end
puts product_data
# close the browser
browser.close
end
Run it, and it'll print the following output in the terminal:
{:name=>"Chaz Kangeroo Hoodie", :price=>"$52"}
{:name=>"Teton Pullover Hoodie", :price=>"$70"}
{:name=>"Bruno Compete Hoodie", :price=>"$63"}
{:name=>"Frankie Sweatshirt", :price=>"$60"}
{:name=>"Hollister Backyard Sweatshirt", :price=>"$52"}
{:name=>"Stark Fundamental Hoodie", :price=>"$42"}
{:name=>"Hero Hoodie", :price=>"$54"}
{:name=>"Oslo Trek Hoodie", :price=>"$42"}
{:name=>"Abominable Hoodie", :price=>"$69"}
{:name=>"Mach Street Sweatshirt", :price=>"$62"}
{:name=>"Grayson Crewneck Sweatshirt", :price=>"$64"}
{:name=>"Ajax Full-Zip Sweatshirt", :price=>"$69"}
Well done! Your parsing logic works like a charm!
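Note that query_selector returns nil when nothing matches, so calling text_content on the result would raise a NoMethodError if a product card were ever missing a name or price. Here's a more defensive variant of the mapping block, a sketch using Ruby's safe navigation operator:
product_data = products.map do |product|
  # &. returns nil instead of raising when the selector finds nothing
  name = product.query_selector('.product-name')&.text_content&.strip
  price = product.query_selector('.product-price')&.text_content&.strip
  { name: name, price: price }
end.reject { |item| item[:name].nil? } # drop incomplete entries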
Step 4: Export to CSV
Now, you need to export the extracted product data to a CSV file. Your script must require Ruby's standard csv library. Open a new CSV file in write mode and write the headers. For each product in the product_data array, write a new row containing the product's name and price.
Here's the updated code implementing the export to CSV functionality:
require 'playwright'
require 'csv'
# initialize Playwright
Playwright.create(playwright_cli_executable_path: 'node_modules/.bin/playwright') do |playwright|
# launch a browser (Chromium in this case)
browser = playwright.chromium.launch(headless: true)
# open a new browser context
context = browser.new_context
# open a new page
page = context.new_page
# navigate to the desired URL
url = 'https://www.scrapingcourse.com/infinite-scrolling'
page.goto(url)
# extract product names and prices
products = page.query_selector_all('.product-info')
product_data = products.map do |product|
name = product.query_selector('.product-name').text_content.strip
price = product.query_selector('.product-price').text_content.strip
{ name: name, price: price }
end
# close the browser
browser.close
# write data to a CSV file
CSV.open('products_data.csv', 'w', headers: true) do |csv|
csv << ['Product Name', 'Product Price']
product_data.each do |product|
csv << [product[:name], product[:price]]
end
end
end
After running this code, a products_data.csv file will be created in the root directory of your project.
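Based on the terminal output from the previous step, its first rows will look like this:
Product Name,Product Price
Chaz Kangeroo Hoodie,$52
Teton Pullover Hoodie,$70
Bruno Compete Hoodie,$63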
Great! You now know the basics of using Playwright with Ruby.
However, the output only displays 12 products, even though the Infinite Scroll demo page features many more. This is because the initial page loads just 12 results and relies on infinite scrolling to load additional ones.
In the next section, you'll learn how to scrape all products.
Interact With a Browser With Playwright and Ruby
The playwright-ruby-client library supports interactions such as mouse movements, scrolls, screenshots, waits, network interactions, etc. These actions enable your automated script to mimic human behavior, helping it bypass anti-bot measures and access dynamic content.
First, use the scrolling function to scrape all the product data from the infinite-scrolling demo page. Then, we'll explore some other interactions with Playwright and Ruby.
Scrolling
In the previous section, our script could only scrape the first 12 products. To scrape all of them, you must execute JavaScript directly on the page using the evaluate method.
You need to perform a few steps in a loop. Start with getting the current scroll height. Then, scroll to the bottom of the page and wait for a short period to allow new content to load. Get the new scroll height and compare it with the previous one. Break the loop if they are the same, indicating no more new content.
The following Ruby snippet implements the above scrolling logic:
require 'playwright'
require 'csv'
# initialize Playwright
Playwright.create(playwright_cli_executable_path: 'node_modules/.bin/playwright') do |playwright|
# launch a browser (Chromium in this case)
browser = playwright.chromium.launch(headless: true)
# open a new browser context
context = browser.new_context
# open a new page
page = context.new_page
# navigate to the desired URL
url = 'https://www.scrapingcourse.com/infinite-scrolling'
page.goto(url)
# scroll and wait for more products to load
loop do
previous_height = page.evaluate('document.body.scrollHeight')
page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
sleep 5 # adjust the sleep time as necessary
new_height = page.evaluate('document.body.scrollHeight')
break if new_height == previous_height
end
# extract product names and prices
products = page.query_selector_all('.product-info')
product_data = products.map do |product|
name = product.query_selector('.product-name').text_content.strip
price = product.query_selector('.product-price').text_content.strip
{ name: name, price: price }
end
# close the browser
browser.close
# write data to a CSV file
CSV.open('products_data.csv', 'w', headers: true) do |csv|
csv << ['Product Name', 'Product Price']
product_data.each do |product|
csv << [product[:name], product[:price]]
end
end
end
The generated products_data.csv file will now contain all the product data.
Congratulations! You scraped all the product details.
Let's explore other advanced interactions using Playwright and Ruby.
Wait for Element
Relying on hard waits is inefficient and unreliable. Fixed delays don't adapt to varying load times, which reduces the robustness of your web scraping scripts. Race conditions, where the script proceeds before the content has fully loaded, can lead to incomplete data extraction.
A better practice is implementing wait strategies that allow the script to proceed only when specific conditions are met. It makes the scraping process more efficient, reliable, and adaptable to dynamic content loading.
The wait_for_selector method waits for an element to be ready (to appear in or disappear from the DOM, or to become visible or hidden) before performing an action.
Let's use the JavaScript Rendering demo page this time. This page is more suitable for demonstrating wait strategies because it involves dynamic content that loads asynchronously based on JavaScript execution. This means the content may not be available immediately after the initial page load, making explicit waits necessary to ensure the elements are present before interaction.
Here's a snippet showing the wait_for_selector usage, waiting up to 10 seconds for the .product-info elements to be visible in the DOM:
# wait up to 10 seconds for the .product-info elements to be visible
page.wait_for_selector('.product-info', timeout: 10000)
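The method can also wait for other readiness states via its state argument. For instance, if the page showed a loading indicator, you could wait for it to disappear. A sketch, with a hypothetical .loading-spinner selector:
# wait up to 10 seconds for a (hypothetical) spinner to be detached or hidden
page.wait_for_selector('.loading-spinner', state: 'hidden', timeout: 10000)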
Your final code implementing the wait-for-element strategy will look like this:
require 'playwright'
# initialize Playwright
Playwright.create(playwright_cli_executable_path: 'node_modules/.bin/playwright') do |playwright|
# launch a browser (Chromium in this case)
browser = playwright.chromium.launch(headless: true)
# open a new browser context
context = browser.new_context
# open a new page
page = context.new_page
# navigate to the desired URL
url = 'https://www.scrapingcourse.com/javascript-rendering'
page.goto(url)
# wait up to 10 seconds for the .product-info elements to be visible
page.wait_for_selector('.product-info', timeout: 10000)
# extract product names and prices
products = page.query_selector_all('.product-info')
product_data = products.map do |product|
name = product.query_selector('.product-name').text_content.strip
price = product.query_selector('.product-price').text_content.strip
{ name: name, price: price }
end
# print the extracted product data
puts product_data
# close the browser
browser.close
end
After running this code, you'll get the following output:
{:name=>"Chaz Kangeroo Hoodie", :price=>"$52"}
{:name=>"Teton Pullover Hoodie", :price=>"$70"}
{:name=>"Bruno Compete Hoodie", :price=>"$63"}
{:name=>"Frankie Sweatshirt", :price=>"$60"}
{:name=>"Hollister Backyard Sweatshirt", :price=>"$52"}
{:name=>"Stark Fundamental Hoodie", :price=>"$42"}
{:name=>"Hero Hoodie", :price=>"$54"}
{:name=>"Oslo Trek Hoodie", :price=>"$42"}
{:name=>"Abominable Hoodie", :price=>"$69"}
{:name=>"Mach Street Sweatshirt", :price=>"$62"}
{:name=>"Grayson Crewneck Sweatshirt", :price=>"$64"}
{:name=>"Ajax Full-Zip Sweatshirt", :price=>"$69"}
Wait for Page to Load
You can implement the "wait for page to load" strategy using Playwright's wait_for_load_state method. This method ensures that your script waits until the page is fully loaded.
This is particularly useful in scenarios where you need to interact with elements on the page that might take some time to become available. Waiting for the page to load prevents errors related to elements not being found or actions being attempted before the page is ready.
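By default, wait_for_load_state waits for the load event. Assuming the Ruby client mirrors the Node.js API here, it also accepts a state argument for other milestones. A sketch:
# wait until the DOM is parsed, without waiting for stylesheets and images
page.wait_for_load_state(state: 'domcontentloaded')
# or wait until there has been no network activity for at least 500 ms
page.wait_for_load_state(state: 'networkidle')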
Here's the modified script implementing this strategy:
require 'playwright'
# initialize Playwright
Playwright.create(playwright_cli_executable_path: 'node_modules/.bin/playwright') do |playwright|
# launch a browser (Chromium in this case)
browser = playwright.chromium.launch(headless: true)
# open a new browser context
context = browser.new_context
# open a new page
page = context.new_page
# navigate to the desired URL
url = 'https://www.scrapingcourse.com/javascript-rendering'
page.goto(url)
# wait for the page to load completely
page.wait_for_load_state
# extract product names and prices
products = page.query_selector_all('.product-info')
product_data = products.map do |product|
name = product.query_selector('.product-name').text_content.strip
price = product.query_selector('.product-price').text_content.strip
{ name: name, price: price }
end
# print the extracted product data
puts product_data
# close the browser
browser.close
end
You'll get the following output:
{:name=>"Chaz Kangeroo Hoodie", :price=>"$52"}
{:name=>"Teton Pullover Hoodie", :price=>"$70"}
{:name=>"Bruno Compete Hoodie", :price=>"$63"}
{:name=>"Frankie Sweatshirt", :price=>"$60"}
{:name=>"Hollister Backyard Sweatshirt", :price=>"$52"}
{:name=>"Stark Fundamental Hoodie", :price=>"$42"}
{:name=>"Hero Hoodie", :price=>"$54"}
{:name=>"Oslo Trek Hoodie", :price=>"$42"}
{:name=>"Abominable Hoodie", :price=>"$69"}
{:name=>"Mach Street Sweatshirt", :price=>"$62"}
{:name=>"Grayson Crewneck Sweatshirt", :price=>"$64"}
{:name=>"Ajax Full-Zip Sweatshirt", :price=>"$69"}
Take a Screenshot
Apart from extracting data, Playwright provides a screenshot feature that allows you to capture and save an image of a web page's current state.
You can capture a screenshot of any web page (for example, the infinite-scrolling demo page) using the following code:
require 'playwright'
# initialize Playwright
Playwright.create(playwright_cli_executable_path: 'node_modules/.bin/playwright') do |playwright|
# launch a browser (Chromium in this case)
browser = playwright.chromium.launch(headless: true)
# open a new browser context
context = browser.new_context
# open a new page
page = context.new_page
# navigate to the desired URL
url = 'https://www.scrapingcourse.com/infinite-scrolling'
page.goto(url)
# capture a screenshot
page.screenshot(path: './screenshot_using_playwright_ruby.png')
# close the browser
browser.close
end
This code will save the screenshot to your project's root directory.
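By default, Playwright captures only the visible viewport. To capture the entire scrollable page, you can pass the fullPage option (a sketch, assuming the Ruby client follows the camelCase option naming it uses elsewhere, such as userAgent):
# capture the full scrollable page, not just the viewport
page.screenshot(path: './full_page_screenshot.png', fullPage: true)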
However, no matter how good your Playwright Ruby web scraper is, it may fail to do its job if it gets blocked by target websites. And the risk is pretty high.
In the next section, you'll learn how to avoid anti-bot systems to scrape uninterrupted.
Avoid Blocks When Scraping With Playwright in Ruby
One of the major challenges in web scraping is getting blocked by anti-bot solutions.
Randomizing your requests with proxies and setting custom User Agents is a frequently recommended solution, but in reality, websites with strong anti-bot measures can still detect and block you.
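To illustrate, such randomization usually amounts to picking a different proxy and User Agent for each new browser context. A sketch with placeholder values, assuming the context options mirror the Node.js API:
# ...
# placeholder pools; swap in working proxies and real User Agent strings
proxies = ['http://203.0.113.1:80', 'http://203.0.113.2:8080']
user_agents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36'
]
# pick a random combination for each new context
context = browser.new_context(
  proxy: { server: proxies.sample },
  userAgent: user_agents.sample
)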
Let's prove it by integrating a proxy and a custom User Agent with Playwright using Capybara in Ruby. In this section, you'll use G2 Reviews, a protected web page, as the target URL.
The playwright-ruby-client is a library designed specifically for browser automation, while Rails applications use Capybara for system testing. The capybara-playwright-driver gem integrates the Playwright library with Capybara, making it easy to use Playwright's powerful browser automation capabilities within Capybara's testing framework.
Install the capybara-playwright-driver gem using the following command:
gem install capybara-playwright-driver
In your script, require the capybara and capybara/playwright libraries.
require 'capybara'
require 'capybara/playwright'
Configure Capybara to use Playwright by registering a new driver with custom options, including the path to the Playwright CLI, the browser type (Chromium), headless mode, the proxy server address, and the custom User Agent string.
You can grab a free proxy from the Free Proxy List website and a User Agent from our list of top User Agents for web scraping.
# ...
# configure Capybara to use Playwright with custom options
Capybara.register_driver :playwright do |app|
Capybara::Playwright::Driver.new(app,
playwright_cli_executable_path: 'node_modules/.bin/playwright',
browser_type: :chromium,
headless: true,
proxy: { server: 'http://8.219.97.248:80' },
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36'
)
end
Set Capybara's default driver to the registered Playwright driver. Then, create a new Capybara session, navigate to the target URL (G2 Reviews), and print the HTML content of the page.
# ...
Capybara.default_driver = :playwright
# navigate and print the HTML content
session = Capybara::Session.new(:playwright)
session.visit('https://www.g2.com/products/airtable/reviews')
# print the HTML content
puts session.html
This setup ensures your browser interactions use the specified proxy and User Agent. Here's the final code integrating the above snippets:
require 'capybara'
require 'capybara/playwright'
# configure Capybara to use Playwright with custom options
Capybara.register_driver :playwright do |app|
Capybara::Playwright::Driver.new(app,
playwright_cli_executable_path: 'node_modules/.bin/playwright',
browser_type: :chromium,
headless: true,
proxy: { server: 'http://8.219.97.248:80' },
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36'
)
end
Capybara.default_driver = :playwright
# navigate and print the HTML content
session = Capybara::Session.new(:playwright)
session.visit('https://www.g2.com/products/airtable/reviews')
# print the HTML content
puts session.html
Free proxies are short-lived, so the above snippet is unlikely to work by the time you read this guide. Feel free to grab a new proxy from the Free Proxy List.
You'll get the following output:
<!DOCTYPE html>
<html lang="en-US" class="lang-en-us">
<head>
<title>Just a moment...</title>
<!-- ... -->
<meta http-equiv="refresh" content="390">
<script src="/cdn-cgi/challenge-platform/h/g/orchestrate/chl_page/v1?ray=892d27a56cd2855b"></script>
<script src="https://challenges.cloudflare.com/turnstile/v0/g/6aac8896f227/api.js?onload=OZxW4&render=explicit"
async="" defer="" crossorigin="anonymous"></script>
</head>
<body class="no-js">
<div class="main-wrapper" role="main">
<div class="main-content">
<h1 class="zone-name-title h1"><img src="/favicon.ico" class="heading-favicon"
alt="Icon for www.g2.com">www.g2.com</h1>
<h2 id="challenge-running" class="h2">Verifying you are human. This may take a few seconds.</h2>
<div id="challenge-stage"></div>
<!-- ... -->
<div id="challenge-body-text" class="core-msg spacer">www.g2.com needs to review the security of your
connection before proceeding.</div>
<!-- ... -->
<div id="challenge-error-title">
<div class="h2"><span id="challenge-error-text">Enable JavaScript and cookies to continue</span>
</div>
</div>
</noscript>
</div>
</div>
<!-- ... -->
<div class="footer" role="contentinfo">
<!-- ... -->
<div class="footer-inner">
<!-- ... -->
<div class="text-center" id="footer-text">Performance & security by Cloudflare</div>
</div>
</div>
</body>
</html>
This means that Cloudflare detected your script as a bot.
A good solution to this problem is using a web scraping API, such as ZenRows. It's an alternative to Playwright since it offers similar headless browser functionalities without the complex setup. Additionally, ZenRows includes features like rotating premium proxies, auto-rotating User Agents, anti-CAPTCHA tools, and more, making it a complete web-scraping toolkit that helps avoid all blocks and bans.
Let's see how to use ZenRows with Ruby.
First, sign up for free, and you'll get redirected to the Request Builder page. Paste the same G2 Reviews URL in the URL to Scrape box. Click on the Premium Proxies check box and enable JS Rendering. Select Ruby as your language and click on the API tab.
The generated code uses Ruby's Faraday as the HTTP client. Install the library using the following command:
gem install faraday
Here's the final code using the ZenRows API to access a protected website:
# gem install faraday
require 'faraday'
# make your request
url = URI.parse('https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fairtable%2Freviews&js_render=true&premium_proxy=true')
conn = Faraday.new()
conn.options.timeout = 180
res = conn.get(url, nil, nil)
# print the HTML content
print(res.body)
This code will access the protected website and print its HTML:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
<title>Airtable Reviews 2024: Details, Pricing, & Features | G2</title>
</head>
<body>
<!-- other content omitted for brevity -->
</body>
</html>
Awesome! You've just bypassed Cloudflare with ZenRows.
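As a side note, if you'd rather not hardcode the URL-encoded query string, Faraday can build it from a params hash. A sketch, with the same placeholder API key:
require 'faraday'

conn = Faraday.new('https://api.zenrows.com')
conn.options.timeout = 180
# Faraday URL-encodes the params hash into the query string for you
res = conn.get('/v1/', {
  apikey: '<YOUR_ZENROWS_API_KEY>',
  url: 'https://www.g2.com/products/airtable/reviews',
  js_render: 'true',
  premium_proxy: 'true'
})
puts res.body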
Conclusion
In this tutorial, you learned how to use Playwright in Ruby. Then, you explored more advanced browser interactions, including scrolling, waiting for an element, waiting for a page to load, and capturing screenshots. You also explored how to integrate custom User Agents and proxies with Playwright and Ruby.
However, even the most advanced web scraper still needs to avoid blocks and bans to be effective. The best solution is to use a web scraping API, such as ZenRows. ZenRows is a viable alternative to Playwright in Ruby, providing a more reliable, automated solution for web scraping and browser automation. Try ZenRows for free!