How to Use a Proxy With Capybara

August 15, 2024 · 7 min read

Capybara is a powerful Ruby library for automating web interactions. It integrates easily with different drivers to execute automation scripts. While effective for web scraping with Ruby, it's still at risk of being blocked by heavily protected websites.

In this article, you'll learn how to avoid blocks and bans by implementing proxies with Capybara.

Set up a Proxy With Capybara

To begin, install the Capybara and Selenium WebDriver gems:

Terminal
gem install capybara selenium-webdriver

Let's create a basic script that makes an HTTP request to https://httpbin.io/ip, a website that returns the client's IP address.

Include the following libraries at the beginning of your script:

scraper.rb
require 'capybara'
require 'capybara/dsl'
require 'selenium-webdriver'

Next, define a Scraper class and include the Capybara DSL module, which allows you to use Capybara methods within the class.

scraper.rb
# ...

class Scraper
  # include Capybara DSL to use Capybara methods
  include Capybara::DSL

end

Inside the Scraper class, define an initialize method that configures a Selenium driver with headless Chrome:

scraper.rb
# ...

class Scraper
  # ...

  def initialize
    # set Capybara to use the Selenium driver with headless Chrome
    Capybara.register_driver :selenium do |app|
      options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
      Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
    end
    Capybara.default_driver = :selenium
  end

end

Then, define a scrape_data method within the same class. This method navigates to HTTPBin, prints the HTML response, handles errors, and ensures that the Capybara session is reset and the browser is closed.

scraper.rb
# ...

class Scraper
  # ...

  def scrape_data
    begin
      # navigate to the target URL
      visit('https://httpbin.io/ip')

      # print the response HTML
      puts page.html
    rescue => e
      # print error
      puts "An error occurred: #{e.message}"
    ensure
      # ensure the session is reset and close the browser
      Capybara.reset!
    end
  end

end

Finally, create an instance of the Scraper class and call the scrape_data method to execute the scraping operation:

scraper.rb
# ...

# create an instance of the Scraper class and fetch the data
request = Scraper.new
request.scrape_data

Merge the code snippets. Your complete code should look like this:

scraper.rb
require 'capybara'
require 'capybara/dsl'
require 'selenium-webdriver'

class Scraper
  # include Capybara DSL to use Capybara methods
  include Capybara::DSL

  def initialize
    # set Capybara to use the Selenium driver with headless Chrome
    Capybara.register_driver :selenium do |app|
      options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
      Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
    end
    Capybara.default_driver = :selenium
  end

  def scrape_data
    begin
      # navigate to the target URL
      visit('https://httpbin.io/ip')

      # print the response HTML
      puts page.html
    rescue => e
      # print error
      puts "An error occurred: #{e.message}"
    ensure
      # ensure the session is reset and close the browser
      Capybara.reset!
    end
  end

end

# create an instance of the Scraper class and fetch the data
request = Scraper.new
request.scrape_data

This code will print the HTML response containing your IP address.

Output
<html><head><meta name="color-scheme" content="light dark"><meta charset="utf-8"></head><body><pre>{
  "origin": "94.198.220.17:8443"
}
</pre><div class="json-formatter-container"></div></body></html>

Good job! You've built a basic script to make HTTP requests using Capybara in Ruby.


Now, let's set up a proxy to mask our requests!

To follow this tutorial, grab a free HTTPS proxy from the Free Proxy List website. We'll use an HTTPS proxy that works with both HTTP and HTTPS websites.
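Before wiring a proxy into Capybara, it can save time to sanity-check it first. The snippet below is a minimal sketch using Ruby's standard Net::HTTP library; the address is the example proxy used later in this article and may well be offline by the time you read this, so swap in the one you grabbed. No connection is made until you issue a request, so this only verifies the configuration:

```ruby
require 'net/http'

# Example free proxy from this article (may no longer be online);
# replace with the proxy you grabbed from the Free Proxy List website.
proxy_address = '8.219.97.248'
proxy_port = 80

# Net::HTTP.new accepts proxy details as extra arguments;
# calling http.get('/ip') would route the request through the proxy.
http = Net::HTTP.new('httpbin.io', 80, proxy_address, proxy_port)
http.open_timeout = 5
http.read_timeout = 5

puts "Proxy configured: #{http.proxy?}"
```

If `http.get('/ip')` raises a timeout or connection error, the proxy is dead and you should pick another one before running the full Capybara script.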

Define this proxy in your code. Configure the Selenium driver using Chrome Options to route your requests through the proxy. Additionally, instead of printing the whole HTML response, just print the content of the pre tag containing the client's IP address.

Here's the modified code integrating a proxy with Capybara:

scraper.rb
require 'capybara'
require 'capybara/dsl'
require 'selenium-webdriver'

class Scraper
  # include Capybara DSL to use Capybara methods
  include Capybara::DSL

  def initialize
    # define proxy address and port
    proxy_address = '8.219.97.248'
    proxy_port = '80'

    # set Capybara to use the Selenium driver with headless Chrome and proxy settings
    Capybara.register_driver :selenium do |app|
      options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
      options.add_argument("--proxy-server=#{proxy_address}:#{proxy_port}")
      Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
    end

    Capybara.default_driver = :selenium
  end

  def scrape_data
    begin
      # navigate to the target URL
      visit('https://httpbin.io/ip')

      # print the IP address
      response = find('pre').text
      puts response
    rescue => e
      # print error
      puts "An error occurred: #{e.message}"
    ensure
      # ensure the session is reset and close the browser
      Capybara.reset!
    end
  end

end

# create an instance of the Scraper class and fetch the data
request = Scraper.new
request.scrape_data

Running this code will print the IP address from which the request was made.

Output
{
  "origin": "8.219.97.248:80"
}

The output indicates that the request was routed through the proxy's IP address rather than your own. Congrats!

You've learned how to set up a proxy with Capybara. However, a single proxy is still easy to detect, block, or even blacklist, especially if you're targeting heavily protected websites.

To avoid this, you can use rotating proxies or premium proxies. In the next section, we'll see how this works.

Add Rotating and Premium Proxies to Capybara

Making several requests from the same IP address in short intervals may cause websites to recognize your activity as suspicious and block your scraper.

If you rotate a pool of proxies to distribute your requests across multiple IP addresses, your scraper will be more reliable and less prone to blocks. Let's learn how to do it!

First, define a rotate_proxy method that returns a randomly selected proxy from the list of proxies. For this exercise, you can grab some more free proxies from the same Free Proxy List website.

scraper.rb
# ...

class Scraper
  # ...

  # method to rotate proxies
  def rotate_proxy
    proxies = [
      '8.219.97.248:80',
      '20.235.159.154:80',
      '18.188.32.159:3128'
      # add more proxies as needed
    ]
    proxies.sample
  end

  # ...
end

# ...
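Note that `Array#sample` can return the same proxy several times in a row. If you'd rather guarantee an even distribution across your pool, a round-robin rotation built on Ruby's `Enumerator#cycle` is a drop-in alternative (a sketch, not part of the original tutorial; proxies are the same examples as above):

```ruby
# Round-robin proxy rotation: each call returns the next proxy in order,
# wrapping around to the first one after the last.
PROXY_CYCLE = [
  '8.219.97.248:80',
  '20.235.159.154:80',
  '18.188.32.159:3128'
  # add more proxies as needed
].cycle

def next_proxy
  PROXY_CYCLE.next
end
```

Calling next_proxy three times yields each proxy exactly once before the cycle repeats, which spreads requests more evenly than random sampling over a small pool.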

In the initialize method, call rotate_proxy to get a random proxy and split it into proxy_address and proxy_port. This selected proxy is then integrated into the Selenium driver configuration.

scraper.rb
# ...

class Scraper
  # ...

  def initialize
    # rotate and assign a proxy
    proxy = rotate_proxy
    proxy_address, proxy_port = proxy.split(':')
    # ...

  # ... 
end

# ...

Here's what your final Capybara script that integrates rotating proxies should look like:

scraper.rb
require 'capybara'
require 'capybara/dsl'
require 'selenium-webdriver'

class Scraper
  # include Capybara DSL to use Capybara methods
  include Capybara::DSL

  def initialize
    # rotate and assign a proxy
    proxy = rotate_proxy
    proxy_address, proxy_port = proxy.split(':')

    # set Capybara to use the Selenium driver with headless Chrome and proxy settings
    Capybara.register_driver :selenium do |app|
      options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
      options.add_argument("--proxy-server=#{proxy_address}:#{proxy_port}")
      Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
    end
    Capybara.default_driver = :selenium
  end

  # method to rotate proxies
  def rotate_proxy
    proxies = [
      '8.219.97.248:80',
      '20.235.159.154:80',
      '18.188.32.159:3128'
      # add more proxies as needed
    ]
    proxies.sample
  end

  def scrape_data
    begin
      # navigate to the target URL
      visit('https://httpbin.io/ip')

      # print the IP address
      response = find('pre').text
      puts response
    rescue => e
      # print error
      puts "An error occurred: #{e.message}"
    ensure
      # ensure the session is reset and close the browser
      Capybara.reset!
    end
  end
end

# create an instance of the Scraper class and fetch the data
scraper = Scraper.new
scraper.scrape_data

You should get a different IP address each time you run it. Here are the results for three runs:

Output
# request 1
{
  "origin": "18.188.32.159:3128"
}

# request 2
{
  "origin": "20.235.159.154:80"
}

# request 3
{
  "origin": "8.219.97.248:80"
}

Great! You successfully implemented the rotating proxies functionality.

However, as mentioned before, free proxies aren't useful in production. They're short-lived and don't stand a chance against advanced anti-bot systems. Additionally, they require manual setup, so they're impractical for large-scale web scraping.

Let's try to access a well-protected website using the methods presented above.

Replace the target URL with G2 Reviews, a Cloudflare-protected website, and get back the HTML response.

scraper.rb
require 'capybara'
require 'capybara/dsl'
require 'selenium-webdriver'

class Scraper
  # include Capybara DSL to use Capybara methods
  include Capybara::DSL

  def initialize
    # rotate and assign a proxy
    proxy = rotate_proxy
    proxy_address, proxy_port = proxy.split(':')

    # set Capybara to use the Selenium driver with headless Chrome and proxy settings
    Capybara.register_driver :selenium do |app|
      options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
      options.add_argument("--proxy-server=#{proxy_address}:#{proxy_port}")
      Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
    end
    Capybara.default_driver = :selenium
  end
  
  # method to rotate proxies
  def rotate_proxy
    proxies = [
      '8.219.97.248:80',
      '20.235.159.154:80',
      '18.188.32.159:3128'
      # add more proxies as needed
    ]
    proxies.sample
  end

  def scrape_data
    begin
      # navigate to the target URL
      visit('https://www.g2.com/products/asana/reviews')

      # print the response HTML
      puts page.html
    rescue => e
      # print error
      puts "An error occurred: #{e.message}"
    ensure
      # ensure the session is reset and close the browser
      Capybara.reset!
    end
  end
end

# create an instance of the Scraper class and fetch the data
scraper = Scraper.new
scraper.scrape_data

The target website will block your request, and you'll get the following response:

Output
<html class="no-js" lang="en-US">
<head>
    <title>Attention Required! | Cloudflare</title>
    
    <!-- ... -->

          <div class="cf-wrapper cf-header cf-error-overview">
            <h1 data-translate="block_headline">Sorry, you have been blocked</h1>
            <h2 class="cf-subheadline"><span data-translate="unable_to_access">You are unable to access</span> g2.com</h2>
          </div>

    <!-- ... -->
              
</html>

G2 detected your script as a bot and declined your request. This is common when relying on free proxies for web scraping tasks.

Fortunately, there's an easy and efficient solution to this problem: premium proxies.

With premium proxy services, you can automate the process of rotating IP addresses and managing connections, bypass most anti-bot systems, and build a faster, more reliable scraper.

Let's learn how to implement premium proxies in Capybara using the example of ZenRows, one of the most reliable premium proxy providers on the market. We'll use the same G2 Reviews page that got us blocked in the previous step.

Sign up for free, and you'll get redirected to the Request Builder page. Paste the G2 Reviews URL in the URL to Scrape box. Activate Premium Proxies and toggle the JS Rendering Boost mode. Select Ruby as your preferred language, and click on the API tab. Finally, copy the generated API endpoint.

building a scraper with zenrows

Take the initial script you wrote for basic HTTP requests and replace the HTTPBin target URL with the API endpoint copied from the ZenRows Request Builder.
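If you'd rather assemble the endpoint in code than paste the full string, Ruby's standard URI module can build the query for you. This sketch uses the same parameter names that appear in the generated endpoint; the API key is a placeholder:

```ruby
require 'uri'

# Build the ZenRows API endpoint from its parts.
# <YOUR_ZENROWS_API_KEY> is a placeholder for your real key.
params = {
  apikey: '<YOUR_ZENROWS_API_KEY>',
  url: 'https://www.g2.com/products/asana/reviews',
  js_render: 'true',
  premium_proxy: 'true'
}

# URI.encode_www_form percent-encodes each value, including the target URL.
endpoint = "https://api.zenrows.com/v1/?#{URI.encode_www_form(params)}"
puts endpoint
```

This keeps the target URL readable in your source code while producing the same percent-encoded query string as the Request Builder.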

Your final script should look like this:

scraper.rb
require 'capybara'
require 'capybara/dsl'
require 'selenium-webdriver'

class Scraper
  # include Capybara DSL to use Capybara methods
  include Capybara::DSL

  def initialize
    # set Capybara to use the Selenium driver with headless Chrome
    Capybara.register_driver :selenium do |app|
      options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
      Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
    end
    Capybara.default_driver = :selenium
  end

  def scrape_data
    begin
      # navigate to the target URL using ZenRows API
      visit('https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fasana%2Freviews&js_render=true&premium_proxy=true')

      # print the response HTML
      puts page.html
    rescue => e
      # print error
      puts "An error occurred: #{e.message}"
    ensure
      # ensure the session is reset and close the browser
      Capybara.reset!
    end
  end

end

# create an instance of the Scraper class and fetch the data
request = Scraper.new
request.scrape_data

This code will access the protected website and print its HTML:

Output
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
</head>
<body>
    <!-- other content omitted for brevity -->
</body>
</html>

Awesome! You've bypassed Cloudflare with ZenRows.

Conclusion

In this step-by-step tutorial, you created a basic script to make HTTP requests using Capybara in Ruby. Then, you integrated basic, rotating, and premium proxies into your code.

To avoid the hassle of finding and configuring proxies on your own, try ZenRows. ZenRows offers a fully managed proxy solution that automatically rotates IP addresses, bypasses anti-bot protections, and ensures all your requests are successful.
