Capybara is a powerful Ruby gem for automating web interactions. It integrates easily with different drivers, such as Selenium, to execute automation scripts. While effective for web scraping in Ruby, it's still at risk of being blocked by heavily protected websites.
In this article, you'll learn how to avoid blocks and bans by implementing proxies with Capybara.
Set up a Proxy With Capybara
To begin, install the Capybara and Selenium WebDriver gems:
gem install capybara selenium-webdriver
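If you manage dependencies with Bundler instead, the equivalent Gemfile would look like this (run bundle install afterward):
# Gemfile
source 'https://rubygems.org'

gem 'capybara'
gem 'selenium-webdriver'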
Let's create a basic script that makes an HTTP request to https://httpbin.io/ip, a website that returns the client's IP address.
Include the following libraries at the beginning of your script:
require 'capybara'
require 'capybara/dsl'
require 'selenium-webdriver'
Next, define a Scraper class and include the Capybara DSL module, which allows you to use Capybara methods within the class.
# ...
class Scraper
  # include Capybara DSL to use Capybara methods
  include Capybara::DSL
end
Define an initialize method inside the Scraper class that configures a Selenium driver with headless Chrome:
# ...
class Scraper
  # ...
  def initialize
    # set Capybara to use the Selenium driver with headless Chrome
    Capybara.register_driver :selenium do |app|
      options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
      Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
    end
    Capybara.default_driver = :selenium
  end
end
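Optionally, since you're automating an external website rather than testing a local Rack app, you can add two standard Capybara settings at the end of initialize (the values shown are illustrative):
# optional: placed after Capybara.default_driver = :selenium
Capybara.run_server = false          # don't boot a local Rack app
Capybara.default_max_wait_time = 10  # seconds Capybara waits when finding elements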
Then, define a scrape_data method within the same class. This method navigates to HTTPBin, prints the HTML response, handles errors, and ensures that the Capybara session is reset and the browser is closed.
# ...
class Scraper
  # ...
  def scrape_data
    begin
      # navigate to the target URL
      visit('https://httpbin.io/ip')
      # print the response HTML
      puts page.html
    rescue => e
      # print error
      puts "An error occurred: #{e.message}"
    ensure
      # ensure the session is reset and close the browser
      Capybara.reset!
    end
  end
end
Finally, create an instance of the Scraper class and call the scrape_data method to execute the scraping operation:
# ...
# create an instance of the Scraper class and fetch the data
request = Scraper.new
request.scrape_data
Merge the code snippets. Your complete code should look like this:
require 'capybara'
require 'capybara/dsl'
require 'selenium-webdriver'

class Scraper
  # include Capybara DSL to use Capybara methods
  include Capybara::DSL

  def initialize
    # set Capybara to use the Selenium driver with headless Chrome
    Capybara.register_driver :selenium do |app|
      options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
      Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
    end
    Capybara.default_driver = :selenium
  end

  def scrape_data
    begin
      # navigate to the target URL
      visit('https://httpbin.io/ip')
      # print the response HTML
      puts page.html
    rescue => e
      # print error
      puts "An error occurred: #{e.message}"
    ensure
      # ensure the session is reset and close the browser
      Capybara.reset!
    end
  end
end

# create an instance of the Scraper class and fetch the data
request = Scraper.new
request.scrape_data
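Save the script to a file, for example scraper.rb (any filename works), and run it:
ruby scraper.rb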
This code will print the HTML response containing your IP address.
<html><head><meta name="color-scheme" content="light dark"><meta charset="utf-8"></head><body><pre>{
"origin": "94.198.220.17:8443"
}
</pre><div class="json-formatter-container"></div></body></html>
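If you only need the IP value rather than the full HTML, you could also parse the JSON inside the pre element. A minimal sketch using Ruby's standard json library:
require 'json'

# inside scrape_data, after visiting the page:
ip_data = JSON.parse(find('pre').text)
puts ip_data['origin'] # e.g. "94.198.220.17:8443"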
Good job! You've built a basic script to make HTTP requests using Capybara in Ruby.
Now, let's set up a proxy to mask our requests!
To follow this tutorial, grab a free HTTPS proxy from the Free Proxy List website; HTTPS proxies work with both HTTP and HTTPS websites.
Define this proxy in your code. Configure the Selenium driver using Chrome Options to route your requests through the proxy. Additionally, instead of printing the whole HTML response, just print the content of the pre tag containing the client's IP address.
Here's the modified code integrating a proxy with Capybara:
require 'capybara'
require 'capybara/dsl'
require 'selenium-webdriver'

class Scraper
  # include Capybara DSL to use Capybara methods
  include Capybara::DSL

  def initialize
    # define proxy address and port
    proxy_address = '8.219.97.248'
    proxy_port = '80'

    # set Capybara to use the Selenium driver with headless Chrome and proxy settings
    Capybara.register_driver :selenium do |app|
      options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
      options.add_argument("--proxy-server=#{proxy_address}:#{proxy_port}")
      Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
    end
    Capybara.default_driver = :selenium
  end

  def scrape_data
    begin
      # navigate to the target URL
      visit('https://httpbin.io/ip')
      # print the IP address
      response = find('pre').text
      puts response
    rescue => e
      # print error
      puts "An error occurred: #{e.message}"
    ensure
      # ensure the session is reset and close the browser
      Capybara.reset!
    end
  end
end

# create an instance of the Scraper class and fetch the data
request = Scraper.new
request.scrape_data
Running this code will print the IP address from which the request was made.
{
"origin": "8.219.97.248:80"
}
The output indicates that the request was routed through the proxy's IP address rather than your own. Congrats!
The free proxy used in this example may not work by the time you're reading this. Free proxies are often unreliable and may be down or blocked by certain websites, which is why they're best suited for learning and testing purposes. To follow along with the tutorial, feel free to grab a fresh proxy from the Free Proxy List website.
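Before plugging a free proxy into Capybara, it can save time to check that it still responds. Here's a minimal sketch using Ruby's standard net/http library; the proxy values are placeholders for whichever proxy you grab:
require 'net/http'
require 'uri'

# returns true if the proxy can fetch the test URL within the timeout
def proxy_alive?(proxy_address, proxy_port, timeout: 5)
  uri = URI('http://httpbin.io/ip')
  Net::HTTP.start(uri.host, uri.port, proxy_address, proxy_port,
                  open_timeout: timeout, read_timeout: timeout) do |http|
    http.get(uri.request_uri).is_a?(Net::HTTPSuccess)
  end
rescue StandardError
  false
end

puts proxy_alive?('8.219.97.248', 80) # => true or false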
You've learned how to set up a proxy with Capybara. However, a single proxy is still easy to detect, block, or even blacklist, especially if you're targeting heavily protected websites.
To avoid this, you can use rotating proxies or premium proxies. In the next section, we'll see how this works.
Add Rotating and Premium Proxies to Capybara
Making several requests from the same IP address at short intervals may cause websites to flag your activity as suspicious and block your scraper.
If you rotate a pool of proxies to distribute your requests across multiple IP addresses, your scraper will be more reliable and less prone to blocks. Let's learn how to do it!
First, define a rotate_proxy method that returns a randomly selected proxy from a list. For this exercise, you can grab a few more free proxies from the same Free Proxy List website.
# ...
class Scraper
  # ...
  # method to rotate proxies
  def rotate_proxy
    proxies = [
      '8.219.97.248:80',
      '20.235.159.154:80',
      '18.188.32.159:3128'
      # add more proxies as needed
    ]
    proxies.sample
  end
  # ...
end
# ...
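Note that Array#sample picks a proxy uniformly at random, so the same proxy may be selected twice in a row. If you'd rather cycle through the list in order, one possible variation (not part of the original script) keeps a shared enumerator; within a single process, each new Scraper instance then gets the next proxy in the list:
# ...
class Scraper
  # cycle through the proxies in order instead of sampling at random
  PROXIES = [
    '8.219.97.248:80',
    '20.235.159.154:80',
    '18.188.32.159:3128'
  ].cycle

  def rotate_proxy
    PROXIES.next
  end
  # ...
end
# ...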
In the initialize method, call rotate_proxy to get a random proxy and split it into proxy_address and proxy_port. The selected proxy is then integrated into the Selenium driver configuration.
# ...
class Scraper
  # ...
  def initialize
    # rotate and assign a proxy
    proxy = rotate_proxy
    proxy_address, proxy_port = proxy.split(':')
    # ...
  end
  # ...
end
# ...
Here's what your final Capybara script that integrates rotating proxies should look like:
require 'capybara'
require 'capybara/dsl'
require 'selenium-webdriver'

class Scraper
  # include Capybara DSL to use Capybara methods
  include Capybara::DSL

  def initialize
    # rotate and assign a proxy
    proxy = rotate_proxy
    proxy_address, proxy_port = proxy.split(':')

    # set Capybara to use the Selenium driver with headless Chrome and proxy settings
    Capybara.register_driver :selenium do |app|
      options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
      options.add_argument("--proxy-server=#{proxy_address}:#{proxy_port}")
      Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
    end
    Capybara.default_driver = :selenium
  end

  # method to rotate proxies
  def rotate_proxy
    proxies = [
      '8.219.97.248:80',
      '20.235.159.154:80',
      '18.188.32.159:3128'
      # add more proxies as needed
    ]
    proxies.sample
  end

  def scrape_data
    begin
      # navigate to the target URL
      visit('https://httpbin.io/ip')
      # print the IP address
      response = find('pre').text
      puts response
    rescue => e
      # print error
      puts "An error occurred: #{e.message}"
    ensure
      # ensure the session is reset and close the browser
      Capybara.reset!
    end
  end
end

# create an instance of the Scraper class and fetch the data
scraper = Scraper.new
scraper.scrape_data
You should get a different IP address each time you run it. Here are the results for three runs:
# request 1
{
"origin": "18.188.32.159:3128"
}
# request 2
{
"origin": "20.235.159.154:80"
}
# request 3
{
"origin": "8.219.97.248:80"
}
Great! You successfully implemented the rotating proxies functionality.
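Since any single free proxy can die at any moment, you can also make the rotation more resilient by retrying with a freshly sampled proxy whenever a request fails. A rough sketch, assuming scrape_data is tweaked to return true on success and false from its rescue branch:
# ...
max_attempts = 3
attempt = 0
success = false
until success || attempt == max_attempts
  attempt += 1
  scraper = Scraper.new # samples a fresh proxy in initialize
  success = scraper.scrape_data
  puts "Attempt #{attempt} failed, rotating to a new proxy..." unless success
end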
However, as mentioned before, free proxies aren't useful in production. They're short-lived and don't stand a chance against advanced anti-bot systems. Additionally, they require manual setup, making them ineffective for large-scale web scraping.
Let's try to access a well-protected website using the methods presented above.
Replace the target URL with G2 Reviews, a Cloudflare-protected website, and get back the HTML response.
require 'capybara'
require 'capybara/dsl'
require 'selenium-webdriver'

class Scraper
  # include Capybara DSL to use Capybara methods
  include Capybara::DSL

  def initialize
    # rotate and assign a proxy
    proxy = rotate_proxy
    proxy_address, proxy_port = proxy.split(':')

    # set Capybara to use the Selenium driver with headless Chrome and proxy settings
    Capybara.register_driver :selenium do |app|
      options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
      options.add_argument("--proxy-server=#{proxy_address}:#{proxy_port}")
      Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
    end
    Capybara.default_driver = :selenium
  end

  # method to rotate proxies
  def rotate_proxy
    proxies = [
      '8.219.97.248:80',
      '20.235.159.154:80',
      '18.188.32.159:3128'
      # add more proxies as needed
    ]
    proxies.sample
  end

  def scrape_data
    begin
      # navigate to the target URL
      visit('https://www.g2.com/products/asana/reviews')
      # print the response HTML
      puts page.html
    rescue => e
      # print error
      puts "An error occurred: #{e.message}"
    ensure
      # ensure the session is reset and close the browser
      Capybara.reset!
    end
  end
end

# create an instance of the Scraper class and fetch the data
scraper = Scraper.new
scraper.scrape_data
The target website will block your request, and you'll get the following response:
<html class="no-js" lang="en-US">
  <head>
    <title>Attention Required! | Cloudflare</title>
    <!-- ... -->
    <div class="cf-wrapper cf-header cf-error-overview">
      <h1 data-translate="block_headline">Sorry, you have been blocked</h1>
      <h2 class="cf-subheadline"><span data-translate="unable_to_access">You are unable to access</span> g2.com</h2>
    </div>
    <!-- ... -->
</html>
G2 detected your script as a bot and declined your request. This is a common outcome when relying on free proxies for web scraping tasks.
Fortunately, there's an easy and efficient solution to this problem: premium proxies.
With a premium proxy service, you can automate IP rotation and connection management, bypass anti-bot systems, and build a faster, more reliable scraper.
Let's learn how to implement premium proxies in Capybara using the example of ZenRows, one of the most reliable premium proxy providers on the market. We'll use the same G2 Reviews page that got us blocked in the previous step.
Sign up for free, and you'll get redirected to the Request Builder page. Paste the G2 Reviews URL in the URL to Scrape box. Activate Premium Proxies and toggle the JS Rendering Boost mode. Select Ruby as your preferred language, and click on the API tab. Finally, copy the generated API endpoint.
Take the initial script you wrote for basic HTTP requests and replace the HTTPBin target URL with the API endpoint copied from the ZenRows Request Builder.
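If you prefer building the endpoint in code rather than pasting the full string, you could URL-encode the target with Ruby's standard cgi library (the API key placeholder is yours to fill in):
require 'cgi'

api_key = '<YOUR_ZENROWS_API_KEY>'
target = CGI.escape('https://www.g2.com/products/asana/reviews')
zenrows_url = "https://api.zenrows.com/v1/?apikey=#{api_key}&url=#{target}&js_render=true&premium_proxy=true"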
Your final script should look like this:
require 'capybara'
require 'capybara/dsl'
require 'selenium-webdriver'

class Scraper
  # include Capybara DSL to use Capybara methods
  include Capybara::DSL

  def initialize
    # set Capybara to use the Selenium driver with headless Chrome
    Capybara.register_driver :selenium do |app|
      options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
      Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
    end
    Capybara.default_driver = :selenium
  end

  def scrape_data
    begin
      # navigate to the target URL using the ZenRows API
      visit('https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fasana%2Freviews&js_render=true&premium_proxy=true')
      # print the response HTML
      puts page.html
    rescue => e
      # print error
      puts "An error occurred: #{e.message}"
    ensure
      # ensure the session is reset and close the browser
      Capybara.reset!
    end
  end
end

# create an instance of the Scraper class and fetch the data
request = Scraper.new
request.scrape_data
This code will access the protected website and print its HTML:
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
  </head>
  <body>
    <!-- other content omitted for brevity -->
  </body>
</html>
Awesome! You've bypassed Cloudflare with ZenRows.
Conclusion
In this step-by-step tutorial, you created a basic script to make HTTP requests using Capybara in Ruby. Then, you integrated basic, rotating, and premium proxies into your code.
To avoid the hassle of finding and configuring proxies on your own, try ZenRows. It offers a fully managed proxy solution that automatically rotates IP addresses, bypasses anti-bot protections, and helps ensure your requests succeed.