How to Use Playwright in Ruby [2024 Guide]

August 18, 2024 · 8 min read

Would you like to bring Playwright's capabilities to your web scraping in Ruby? The playwright-ruby-client is a community-maintained Ruby library that lets you do just that.

In this tutorial, you'll explore how to use Playwright in Ruby, from the basics to more advanced browser interactions. You'll learn how to:

  1. Install Playwright in Ruby.
  2. Scrape a dynamic page and export the data to CSV.
  3. Interact with the browser: scroll, wait for elements and page loads, and take screenshots.
  4. Avoid getting blocked by anti-bot systems.

Let's go!

How to Use Playwright in Ruby

To understand how to use Playwright in Ruby, you'll scrape the following Infinite Scrolling demo page:

ScrapingCourse infinite scroll demo

This page keeps displaying new items as you scroll down. The content loads dynamically, making it a perfect target page to demonstrate headless browser capabilities.

Let's scrape it!


Step 1: Install Playwright in Ruby

To get started, create a directory for your project and install the playwright-ruby-client gem.

Terminal
gem install playwright-ruby-client

By default, this gem doesn't bundle the Playwright Node.js driver or the browser binaries. Therefore, you first need Node.js on your machine and must install the Playwright package using the following command:

Terminal
npm install playwright

Then run this command to install the browser binaries (Chromium, Firefox, and WebKit):

Terminal
./node_modules/.bin/playwright install

For Windows machines, you need to run the following command to install the browser binaries:

Terminal
node_modules\.bin\playwright install

Great! You now have everything you need to start scraping with Playwright in Ruby.
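
If you'd like to verify the setup before moving on, here's a quick smoke test (a minimal sketch; it assumes you run it from the project root, where the node_modules directory lives):

Example
require 'playwright'

# launch Chromium and print its version to confirm the installation works
Playwright.create(playwright_cli_executable_path: 'node_modules/.bin/playwright') do |playwright|
  browser = playwright.chromium.launch(headless: true)
  puts browser.version
  browser.close
end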

Step 2: Scrape With Playwright in Ruby

Let's set up a basic script that extracts the entire HTML of the target demo page.

Initialize the Playwright library and launch the Chromium browser in headless mode. Then, create a browser context and navigate to the target URL. Finally, print the HTML content.

Here's what your script should look like:

Example
require 'playwright'

# initialize Playwright
Playwright.create(playwright_cli_executable_path: 'node_modules/.bin/playwright') do |playwright|
  # launch a browser (Chromium in this case)
  browser = playwright.chromium.launch(headless: true)

  # open a new browser context
  context = browser.new_context

  # open a new page
  page = context.new_page

  # navigate to the desired URL
  url = 'https://www.scrapingcourse.com/infinite-scrolling'
  page.goto(url)

  # display the HTML content
  puts page.content

  # close the browser
  browser.close
end

After running this code, you'll get the following output:

Output
<!DOCTYPE html>
<html lang="en">
  <head>

    <!-- ... -->

    <title>Infinite Scroll Challenge to Learn Web Scraping - ScrapingCourse.com</title>

    <!-- ... -->

  </head>

  <body>
    <!-- ... -->
  </body>
</html>

Awesome! That's the HTML of the target page.

Step 3: Extract the Data You Want

Playwright includes functionality for parsing a page's HTML content to extract specific details. For the sake of this tutorial, let's parse the name and price of each product. You need to:

  1. Locate the product elements by applying a DOM selector strategy.
  2. Extract the required data.
  3. Store the extracted data in a Ruby data structure (for example, array).

For simplicity, we'll use CSS selectors to locate the product elements.

Start by opening the Infinite Scroll demo page in a browser and inspecting the first product using DevTools.

Infinite Scrolling First Product

You'll notice that each product card is a div with the product-item class, and its details are wrapped in a nested div with the product-info class. The product names and prices are enclosed within span tags with the classes product-name and product-price, respectively.

Let's use this information to locate the product items and extract the required data.

Locate all the elements matching the .product-info selector using the query_selector_all method. Store the returned elements in an array (in this case, products). Then, iterate through this array, extracting each product's name and price with the query_selector method. Store the extracted data in an array of hashes (product_data).

Here's the modified code:

Example
require 'playwright'

# initialize Playwright
Playwright.create(playwright_cli_executable_path: 'node_modules/.bin/playwright') do |playwright|
  # launch a browser (Chromium in this case)
  browser = playwright.chromium.launch(headless: true)

  # open a new browser context
  context = browser.new_context

  # open a new page
  page = context.new_page

  # navigate to the desired URL
  url = 'https://www.scrapingcourse.com/infinite-scrolling'
  page.goto(url)

  # extract product names and prices
  products = page.query_selector_all('.product-info')
  product_data = products.map do |product|
    name = product.query_selector('.product-name').text_content.strip
    price = product.query_selector('.product-price').text_content.strip
    { name: name, price: price }
  end

  puts product_data

  # close the browser
  browser.close
end

Run it, and it'll print the following output in the terminal:

Output
{:name=>"Chaz Kangeroo Hoodie", :price=>"$52"}
{:name=>"Teton Pullover Hoodie", :price=>"$70"}
{:name=>"Bruno Compete Hoodie", :price=>"$63"}
{:name=>"Frankie  Sweatshirt", :price=>"$60"}
{:name=>"Hollister Backyard Sweatshirt", :price=>"$52"}
{:name=>"Stark Fundamental Hoodie", :price=>"$42"}
{:name=>"Hero Hoodie", :price=>"$54"}
{:name=>"Oslo Trek Hoodie", :price=>"$42"}
{:name=>"Abominable Hoodie", :price=>"$69"}
{:name=>"Mach Street Sweatshirt", :price=>"$62"}
{:name=>"Grayson Crewneck Sweatshirt", :price=>"$64"}
{:name=>"Ajax Full-Zip Sweatshirt", :price=>"$69"}

Well done! Your parsing logic works like a charm!

Step 4: Export to CSV

Now, you need to export the extracted product data to a CSV file. Your script must require Ruby's built-in csv library. Open a new CSV file in write mode and write the headers. Then, for each product in the product_data array, write a new row containing the product's name and price.

Here's the updated code implementing the export to CSV functionality:

Example
require 'playwright'
require 'csv'

# initialize Playwright
Playwright.create(playwright_cli_executable_path: 'node_modules/.bin/playwright') do |playwright|
  # launch a browser (Chromium in this case)
  browser = playwright.chromium.launch(headless: true)

  # open a new browser context
  context = browser.new_context

  # open a new page
  page = context.new_page

  # navigate to the desired URL
  url = 'https://www.scrapingcourse.com/infinite-scrolling'
  page.goto(url)

  # extract product names and prices
  products = page.query_selector_all('.product-info')
  product_data = products.map do |product|
    name = product.query_selector('.product-name').text_content.strip
    price = product.query_selector('.product-price').text_content.strip
    { name: name, price: price }
  end

  # close the browser
  browser.close

  # write data to a CSV file
  CSV.open('products_data.csv', 'w', headers: true) do |csv|
    csv << ['Product Name', 'Product Price']
    product_data.each do |product|
      csv << [product[:name], product[:price]]
    end
  end

end

After running this code, a products_data.csv file will be created in the root directory of your project. It'll contain the following data:

Infinite Scrolling Initial Products CSV

Great! You now know the basics of using Playwright with Ruby.

However, the output only displays 12 products, even though the Infinite Scroll demo page features many more. This is because the initial page loads just 12 results and relies on infinite scrolling to load additional ones. 

In the next section, you'll learn how to scrape all products.

Interact With a Browser With Playwright and Ruby

The playwright-ruby-client library supports interactions such as mouse movements, scrolls, screenshots, waits, and network interception. These actions enable your automated script to mimic human behavior, helping it bypass anti-bot measures and access dynamic content.
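
For instance, a basic mouse interaction looks like this (a minimal sketch; the coordinates are arbitrary and assume a page object is already open):

Example
# move the mouse to a point on the page and click there
page.mouse.move(200, 150)
page.mouse.click(200, 150)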

First, you'll use scrolling to scrape all the product data from the Infinite Scroll demo page. Then, you'll explore some other interactions with Playwright and Ruby.

Scrolling

In the previous section, our script could only scrape the first 12 products. To scrape all of them, you must execute JavaScript directly on the page using the evaluate method.

You need to perform a few steps in a loop. Start by getting the current scroll height. Then, scroll to the bottom of the page and wait a short period to let new content load. Get the new scroll height and compare it with the previous one. Break out of the loop if they're the same, which means no more new content is loading.

The following Ruby snippet implements the above scrolling logic:

Example
require 'playwright'
require 'csv'

# initialize Playwright
Playwright.create(playwright_cli_executable_path: 'node_modules/.bin/playwright') do |playwright|
  # launch a browser (Chromium in this case)
  browser = playwright.chromium.launch(headless: true)

  # open a new browser context
  context = browser.new_context

  # open a new page
  page = context.new_page

  # navigate to the desired URL
  url = 'https://www.scrapingcourse.com/infinite-scrolling'
  page.goto(url)

  # scroll and wait for more products to load
  loop do
    previous_height = page.evaluate('document.body.scrollHeight')
    page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
    sleep 5 # adjust the sleep time as necessary
    new_height = page.evaluate('document.body.scrollHeight')
    break if new_height == previous_height
  end

  # extract product names and prices
  products = page.query_selector_all('.product-info')
  product_data = products.map do |product|
    name = product.query_selector('.product-name').text_content.strip
    price = product.query_selector('.product-price').text_content.strip
    { name: name, price: price }
  end

  # close the browser
  browser.close

  # write data to a CSV file
  CSV.open('products_data.csv', 'w', headers: true) do |csv|
    csv << ['Product Name', 'Product Price']
    product_data.each do |product|
      csv << [product[:name], product[:price]]
    end
  end

end

The generated products_data.csv file will now contain all the product data:

Infinite Scrolling All Products CSV

Congratulations! You scraped all the product details.

Let's explore other advanced interactions using Playwright and Ruby. 

Wait for Element

Relying on hard waits is inefficient and unreliable. Fixed delays don't adapt to varying load times, which reduces the robustness of your web scraping scripts. Worse, race conditions where the content isn't fully loaded yet can lead to incomplete data extraction.

A better practice is implementing wait strategies that allow the script to proceed only when specific conditions are met. It makes the scraping process more efficient, reliable, and adaptable to dynamic content loading.

The wait_for_selector method waits for an element to reach a given state (attached to or detached from the DOM, visible, or hidden) before performing an action.

Let's use the JavaScript Rendering demo page this time. This page is more suitable for demonstrating wait strategies because it involves dynamic content that loads asynchronously based on JavaScript execution. This means the content may not be available immediately after the initial page load, making explicit waits necessary to ensure the elements are present before interaction.

Here's a snippet showing the wait_for_selector usage that waits up to 10 seconds for the .product-info elements to be visible in the DOM:

Example
# wait up to 10 seconds for the .product-info elements to be visible
page.wait_for_selector('.product-info', timeout: 10000)

Your final code implementing the wait-for-element strategy will look like this:

Example
require 'playwright'

# initialize Playwright
Playwright.create(playwright_cli_executable_path: 'node_modules/.bin/playwright') do |playwright|
  # launch a browser (Chromium in this case)
  browser = playwright.chromium.launch(headless: true)

  # open a new browser context
  context = browser.new_context

  # open a new page
  page = context.new_page

  # navigate to the desired URL
  url = 'https://www.scrapingcourse.com/javascript-rendering'
  page.goto(url)

  # wait up to 10 seconds for the .product-info elements to be visible
  page.wait_for_selector('.product-info', timeout: 10000)

  # extract product names and prices
  products = page.query_selector_all('.product-info')
  product_data = products.map do |product|
    name = product.query_selector('.product-name').text_content.strip
    price = product.query_selector('.product-price').text_content.strip
    { name: name, price: price }
  end

  # print the extracted product data
  puts product_data

  # close the browser
  browser.close
end

After running this code, you'll get the following output:

Output
{:name=>"Chaz Kangeroo Hoodie", :price=>"$52"}
{:name=>"Teton Pullover Hoodie", :price=>"$70"}
{:name=>"Bruno Compete Hoodie", :price=>"$63"}
{:name=>"Frankie  Sweatshirt", :price=>"$60"}
{:name=>"Hollister Backyard Sweatshirt", :price=>"$52"}
{:name=>"Stark Fundamental Hoodie", :price=>"$42"}
{:name=>"Hero Hoodie", :price=>"$54"}
{:name=>"Oslo Trek Hoodie", :price=>"$42"}
{:name=>"Abominable Hoodie", :price=>"$69"}
{:name=>"Mach Street Sweatshirt", :price=>"$62"}
{:name=>"Grayson Crewneck Sweatshirt", :price=>"$64"}
{:name=>"Ajax Full-Zip Sweatshirt", :price=>"$69"}

Wait for Page to Load

You can implement the "wait for page to load" strategy using Playwright's wait_for_load_state method. This method waits until the page reaches a given load state (the load event, by default).

This is particularly useful in scenarios where you need to interact with elements on the page that might take some time to become available. Waiting for the page to load prevents errors related to elements not being found or actions being attempted before the page is ready.
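
On JavaScript-heavy pages, you can also wait for network activity to settle instead of the default load event (a hedged sketch, assuming the state: keyword mirrors the official Playwright API):

Example
# wait until there have been no network connections for at least 500 ms
page.wait_for_load_state(state: 'networkidle')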

Here's the modified script implementing this strategy:

Example
require 'playwright'

# initialize Playwright
Playwright.create(playwright_cli_executable_path: 'node_modules/.bin/playwright') do |playwright|
  # launch a browser (Chromium in this case)
  browser = playwright.chromium.launch(headless: true)

  # open a new browser context
  context = browser.new_context

  # open a new page
  page = context.new_page

  # navigate to the desired URL
  url = 'https://www.scrapingcourse.com/javascript-rendering'
  page.goto(url)

  # wait for the page to load completely
  page.wait_for_load_state

  # extract product names and prices
  products = page.query_selector_all('.product-info')
  product_data = products.map do |product|
    name = product.query_selector('.product-name').text_content.strip
    price = product.query_selector('.product-price').text_content.strip
    { name: name, price: price }
  end  

  # print the extracted product data
  puts product_data

  # close the browser
  browser.close
end

You'll get the following output:

Output
{:name=>"Chaz Kangeroo Hoodie", :price=>"$52"}
{:name=>"Teton Pullover Hoodie", :price=>"$70"}
{:name=>"Bruno Compete Hoodie", :price=>"$63"}
{:name=>"Frankie  Sweatshirt", :price=>"$60"}
{:name=>"Hollister Backyard Sweatshirt", :price=>"$52"}
{:name=>"Stark Fundamental Hoodie", :price=>"$42"}
{:name=>"Hero Hoodie", :price=>"$54"}
{:name=>"Oslo Trek Hoodie", :price=>"$42"}
{:name=>"Abominable Hoodie", :price=>"$69"}
{:name=>"Mach Street Sweatshirt", :price=>"$62"}
{:name=>"Grayson Crewneck Sweatshirt", :price=>"$64"}
{:name=>"Ajax Full-Zip Sweatshirt", :price=>"$69"}

Take a Screenshot

Apart from extracting data, Playwright provides a screenshot feature that allows you to capture and save an image of a web page's current state.

You can capture a screenshot of any web page (for example, the infinite-scrolling demo page) using the following code:

Example
require 'playwright'

# initialize Playwright
Playwright.create(playwright_cli_executable_path: 'node_modules/.bin/playwright') do |playwright|
  # launch a browser (Chromium in this case)
  browser = playwright.chromium.launch(headless: true)

  # open a new browser context
  context = browser.new_context

  # open a new page
  page = context.new_page

  # navigate to the desired URL
  url = 'https://www.scrapingcourse.com/infinite-scrolling'
  page.goto(url)

  # capture a screenshot
  page.screenshot(path: './screenshot_using_playwright_ruby.png')

  # close the browser
  browser.close
end

This code will generate the following screenshot:

Scraping Using Playwright in Ruby
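
Note that the screenshot method captures only the current viewport by default. To capture the whole scrollable page, you can pass the fullPage option (a sketch; the camelCase keyword follows the same convention as the client's other options, such as userAgent):

Example
# capture the entire scrollable page instead of just the viewport
page.screenshot(path: './full_page_screenshot.png', fullPage: true)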

However, no matter how good your Playwright Ruby web scraper is, it may fail to do its job if it gets blocked by target websites. And the risk is pretty high.

In the next section, you'll learn how to avoid anti-bot systems to scrape uninterrupted.

Avoid Blocks When Scraping With Playwright in Ruby

One of the major challenges in web scraping is getting blocked by anti-bot solutions.

Randomizing your requests with proxies and setting custom User Agents is a frequently recommended solution, but in reality, websites with strong anti-bot measures can still detect and block you.

Let's prove it by integrating a proxy and a custom User Agent with Playwright using Capybara in Ruby. In this section, you'll use G2 Reviews, a protected web page, as the target URL.

The playwright-ruby-client is a library designed specifically for browser automation, while Rails applications use Capybara for system testing. The capybara-playwright-driver integrates the Playwright library with Capybara, making it easy to use Playwright's powerful browser automation capabilities within Capybara's testing framework.

Install the capybara-playwright-driver library using the following command:

Terminal
gem install capybara-playwright-driver

In your script, require the capybara and capybara/playwright libraries.

Example
require 'capybara'
require 'capybara/playwright'

Configure Capybara to use Playwright by registering a new driver with custom options, including the path to the Playwright CLI, the browser type (Chromium), headless mode, the proxy server address, and the custom User Agent string.

Example
# ...

# configure Capybara to use Playwright with custom options
Capybara.register_driver :playwright do |app|
  Capybara::Playwright::Driver.new(app,
    playwright_cli_executable_path: 'node_modules/.bin/playwright',
    browser_type: :chromium,
    headless: true,
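    # note: free proxies like the one below are short-lived; replace it with a fresh one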
    proxy: { server: 'http://8.219.97.248:80' },
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36'
  )
end

Set Capybara's default driver to the registered Playwright driver. Then, create a new Capybara session, navigate to the target URL (G2 Reviews), and print the HTML content of the page.

Example
# ...

Capybara.default_driver = :playwright

# navigate and print the HTML content
session = Capybara::Session.new(:playwright)
session.visit('https://www.g2.com/products/airtable/reviews')

# print the HTML content
puts session.html

This setup ensures your browser interactions use the specified proxy and User Agent. Here's the final code integrating the above snippets:

Example
require 'capybara'
require 'capybara/playwright'

# configure Capybara to use Playwright with custom options
Capybara.register_driver :playwright do |app|
  Capybara::Playwright::Driver.new(app,
    playwright_cli_executable_path: 'node_modules/.bin/playwright',
    browser_type: :chromium,
    headless: true,
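    # note: free proxies like the one below are short-lived; replace it with a fresh one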
    proxy: { server: 'http://8.219.97.248:80' },
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36'
  )
end

Capybara.default_driver = :playwright

# navigate and print the HTML content
session = Capybara::Session.new(:playwright)
session.visit('https://www.g2.com/products/airtable/reviews')

# print the HTML content
puts session.html

You'll get the following output:

Output
<!DOCTYPE html>
<html lang="en-US" class="lang-en-us">

<head>
    <title>Just a moment...</title>
    <!-- ... -->
    <meta http-equiv="refresh" content="390">
    <script src="/cdn-cgi/challenge-platform/h/g/orchestrate/chl_page/v1?ray=892d27a56cd2855b"></script>
    <script src="https://challenges.cloudflare.com/turnstile/v0/g/6aac8896f227/api.js?onload=OZxW4&amp;render=explicit"
        async="" defer="" crossorigin="anonymous"></script>
</head>

<body class="no-js">
    <div class="main-wrapper" role="main">
        <div class="main-content">
            <h1 class="zone-name-title h1"><img src="/favicon.ico" class="heading-favicon"
                    alt="Icon for www.g2.com">www.g2.com</h1>
            <h2 id="challenge-running" class="h2">Verifying you are human. This may take a few seconds.</h2>
            <div id="challenge-stage"></div>
            <!-- ... -->
            <div id="challenge-body-text" class="core-msg spacer">www.g2.com needs to review the security of your
                connection before proceeding.</div>
            <!-- ... -->
            <div id="challenge-error-title">
                <div class="h2"><span id="challenge-error-text">Enable JavaScript and cookies to continue</span>
                </div>
            </div>
            </noscript>
        </div>
    </div>

    <!-- ... -->

    <div class="footer" role="contentinfo">
        <!-- ... -->
        <div class="footer-inner">
            <!-- ... -->
            <div class="text-center" id="footer-text">Performance &amp; security by Cloudflare</div>
        </div>
    </div>
</body>

</html>

This means that Cloudflare detected your script as a bot.

A good solution to this problem is using a web scraping API, such as ZenRows. It's an alternative to Playwright since it offers similar headless browser functionalities without the complex setup. Additionally, ZenRows includes features like rotating premium proxies, auto-rotating User Agents, anti-CAPTCHA tools, and more, making it a complete web-scraping toolkit that helps avoid all blocks and bans.

Let's see how to use ZenRows with Ruby.

First, sign up for free, and you'll get redirected to the Request Builder page. Paste the same G2 Reviews URL in the URL to Scrape box. Tick the Premium Proxies checkbox and enable JS Rendering. Select Ruby as your language and click on the API tab.

building a scraper with zenrows

The generated code uses Ruby's Faraday as the HTTP client. Install the library using the following command:

Terminal
gem install faraday

Here's the final code using the ZenRows API to access a protected website:

Example
# gem install faraday
require 'faraday'

# make your request
url = URI.parse('https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fairtable%2Freviews&js_render=true&premium_proxy=true')
conn = Faraday.new
conn.options.timeout = 180
res = conn.get(url)

# print the HTML content
print(res.body)

This code will access the protected website and print its HTML:

Output
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Airtable Reviews 2024: Details, Pricing, & Features | G2</title>
</head>
<body>
    <!-- other content omitted for brevity -->
</body>

</html>

Awesome! You've just bypassed Cloudflare with ZenRows. 

Conclusion

In this tutorial, you learned how to use Playwright in Ruby. Then, you explored more advanced browser interactions, including scrolling, waiting for an element, waiting for a page to load, and capturing screenshots. You also explored how to integrate custom User Agents and proxies with Playwright and Ruby.

However, even the most advanced web scraper still needs to avoid blocks and bans to be effective. The best solution is to use a web scraping API, such as ZenRows. ZenRows is a viable alternative to Playwright in Ruby, providing a more reliable, automated solution for web scraping and browser automation. Try ZenRows for free!
