Watir (Web Application Testing in Ruby) is a Selenium-powered, open-source family of Ruby libraries for automating web browsers.
While effective for web scraping in Ruby, it can still get blocked by websites with anti-bot measures.
In this tutorial, you'll learn how to set up Watir proxies to avoid detection and bans and scrape the web uninterrupted. Let's go!
Set up a Proxy With Watir
To get started, install the Watir gem:
gem install watir
Next, import the gem in your script. Initialize a new Chrome browser instance in headless mode and navigate to HTTPBin, a website that returns the client's IP address. Finally, retrieve the page content and close the browser:
require 'watir'
# initialize the browser
browser = Watir::Browser.new :chrome, headless: true
# navigate to the URL
url = 'https://httpbin.io/ip'
browser.goto(url)
# get the page content
page_content = browser.text
puts page_content
# close the browser
browser.close
The above code will print your machine's IP address:
{
"origin": "210.212.39.138:80"
}
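If you only need the IP for logging or comparison, you can parse the JSON body with Ruby's standard library. Here's a minimal sketch, assuming page_content holds the string printed above:

```ruby
require 'json'

# the text HTTPBin returned, as captured in page_content above
page_content = '{ "origin": "210.212.39.138:80" }'

# parse the JSON body and extract the reported address
origin = JSON.parse(page_content)['origin']
puts origin
```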
Making requests with this script exposes your machine's IP address, which is bad practice for web scraping. Since most websites monitor traffic, their anti-bot systems may detect and block your IP.
To mask your request, let's integrate proxies into the code.
You can grab a free proxy from the Free Proxy List website. Make sure to pick an HTTPS proxy, which works with both HTTP and HTTPS websites.
Define the proxy settings by replacing 8.219.97.248:80 with your actual proxy server address and port. Here, the same proxy is used for both HTTP and SSL connections.
proxy = {
  http: '8.219.97.248:80',
  ssl: '8.219.97.248:80'
}
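Hardcoding the address makes it tedious to swap out dead proxies. One option (a convenience suggestion, not something Watir requires) is to read the address from an environment variable; the SCRAPER_PROXY variable name below is hypothetical:

```ruby
# read the proxy address from an environment variable,
# falling back to the free proxy used in this tutorial
proxy_address = ENV.fetch('SCRAPER_PROXY', '8.219.97.248:80')

# use the same address for both HTTP and SSL connections
proxy = {
  http: proxy_address,
  ssl: proxy_address
}
```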
Now, initialize the Chrome browser instance in headless mode, but this time with the specified proxy settings.
# ...
browser = Watir::Browser.new :chrome, headless: true, proxy: proxy
# ...
After merging the snippets, your complete code should look like this:
require 'watir'
# define proxy
proxy = {
  http: '8.219.97.248:80',
  ssl: '8.219.97.248:80'
}
# initialize the browser
browser = Watir::Browser.new :chrome, headless: true, proxy: proxy
# navigate to the URL
url = 'https://httpbin.io/ip'
browser.goto(url)
# get the page content
page_content = browser.text
puts page_content
# close the browser
browser.close
This script will print the IP address of the proxy server to the console:
{
"origin": "8.219.97.248:80"
}
Congrats! The response matches the proxy server IP.
You now know the basics of using a proxy with Watir. Let's dive into more advanced concepts!
The IP address used above belongs to a free proxy, which may no longer work by the time you read this article. Free proxies have limited uptime, which means they should never be used in production. To follow the tutorial, grab a fresh proxy from the Free Proxy List.
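Since free proxies die quickly, it can also help to check that a proxy responds before handing it to Watir. Below is a minimal sketch using Ruby's standard Net::HTTP; the timeout values and the HTTPBin test URL are arbitrary choices:

```ruby
require 'net/http'
require 'uri'

# split a "host:port" string into its parts
def split_proxy(address)
  host, port = address.split(':')
  [host, port.to_i]
end

# return true if the proxy can fetch the test URL within the timeouts
def proxy_alive?(address, test_url: 'http://httpbin.io/ip')
  host, port = split_proxy(address)
  uri = URI(test_url)
  Net::HTTP.start(uri.host, uri.port, host, port,
                  open_timeout: 5, read_timeout: 5) do |http|
    http.get(uri.request_uri).is_a?(Net::HTTPSuccess)
  end
rescue StandardError
  false
end
```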
Add Rotating and Premium Proxies to Watir
If you make several requests from a specific IP address, your activity becomes easy to detect. Rotating proxies can help you distribute your requests across multiple IP addresses, making it harder for websites to block your scraper.
Let's integrate rotating proxies into your Watir script. You'll build a simple rotator that randomly selects a proxy from a predefined list for each browsing session.
First, grab some free proxies from the Free Proxy List website. Then, configure the Selenium WebDriver logger to log important messages and ignore unnecessary ones to reduce log noise.
require 'watir'
require 'logger'
# list of proxies
proxies = [
  { http: '8.219.97.248:80', ssl: '8.219.97.248:80' },
  { http: '20.235.159.154:80', ssl: '20.235.159.154:80' },
  { http: '18.188.32.159:3128', ssl: '18.188.32.159:3128' },
  # ...
]
# configure Selenium WebDriver logger
logger = Selenium::WebDriver.logger
logger.ignore(:jwp_caps, :logger_info)
Define a function that randomly selects a proxy from the proxies list and returns it.
# ...
# function to rotate proxies
def get_rotating_proxy(proxies)
  proxies.sample
end
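Note that Array#sample may pick the same proxy on consecutive runs. If you'd rather rotate strictly in order within a single process, a cycling enumerator is a simple alternative (a sketch, not part of the tutorial's main code):

```ruby
# build an enumerator that cycles through the proxies endlessly
proxies = [
  { http: '8.219.97.248:80', ssl: '8.219.97.248:80' },
  { http: '20.235.159.154:80', ssl: '20.235.159.154:80' }
]
proxy_cycle = proxies.cycle

# each call returns the next proxy, wrapping around at the end
first  = proxy_cycle.next
second = proxy_cycle.next
third  = proxy_cycle.next # wraps back to the first proxy
```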
Use the get_rotating_proxy() function to randomly select a proxy. Then, log the selected proxy for reference. As before, initialize a headless Chrome browser with the chosen proxy. Finally, navigate to the target website to retrieve the page content.
# ...
begin
  # initialize the browser with a proxy
  proxy = get_rotating_proxy(proxies)
  logger.info("Using proxy: #{proxy}")
  browser = Watir::Browser.new :chrome, headless: true, proxy: proxy
  # navigate to the URL
  url = 'https://httpbin.io/ip'
  browser.goto(url)
  # get the page content
  page_content = browser.text
  puts page_content
rescue => e
  # handle error
  logger.error("An error occurred: #{e.message}")
ensure
  # close the browser (if it was created)
  browser.close if browser
end
If any error occurs during execution, the rescue block will catch and log it. The ensure block guarantees that the browser is closed properly, regardless of whether an error occurred. This structure (begin, rescue, ensure, and end) is crucial to ensure your script is robust, handles errors, and performs necessary cleanup operations.
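You can extend this structure with Ruby's retry keyword to fall back to another proxy when one fails. The sketch below simulates the failure with a hypothetical fetch_with_proxy stub so the control flow is visible without launching a browser:

```ruby
proxies = [
  { http: '8.219.97.248:80', ssl: '8.219.97.248:80' },
  { http: '20.235.159.154:80', ssl: '20.235.159.154:80' },
  { http: '18.188.32.159:3128', ssl: '18.188.32.159:3128' }
]

# hypothetical stand-in for the Watir call: fails for the first proxy
def fetch_with_proxy(proxy)
  raise 'proxy timed out' if proxy[:http].start_with?('8.219')
  "fetched via #{proxy[:http]}"
end

attempts = 0
begin
  attempts += 1
  proxy = proxies[attempts - 1]
  result = fetch_with_proxy(proxy)
rescue => e
  # move on to the next proxy until the list is exhausted
  retry if attempts < proxies.size
  raise e
end

puts result # => "fetched via 20.235.159.154:80"
```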
Here's the complete code after merging all the above snippets:
require 'watir'
require 'logger'
# list of proxies
proxies = [
  { http: '8.219.97.248:80', ssl: '8.219.97.248:80' },
  { http: '20.235.159.154:80', ssl: '20.235.159.154:80' },
  { http: '18.188.32.159:3128', ssl: '18.188.32.159:3128' },
  # ...
]
# configure Selenium WebDriver logger
logger = Selenium::WebDriver.logger
logger.ignore(:jwp_caps, :logger_info)
# function to rotate proxies
def get_rotating_proxy(proxies)
  proxies.sample
end
begin
  # initialize the browser with a proxy
  proxy = get_rotating_proxy(proxies)
  logger.info("Using proxy: #{proxy}")
  browser = Watir::Browser.new :chrome, headless: true, proxy: proxy
  # navigate to the URL
  url = 'https://httpbin.io/ip'
  browser.goto(url)
  # get the page content
  page_content = browser.text
  puts page_content
rescue => e
  # handle error
  logger.error("An error occurred: #{e.message}")
ensure
  # close the browser (if it was created)
  browser.close if browser
end
You'll get a randomly selected proxy as output every time you run this code.
Here's the output after running the code three times:
# request 1
2024-05-21 20:23:56 INFO Selenium Using proxy: {:http=>"8.219.97.248:80", :ssl=>"8.219.97.248:80"}
{
"origin": "8.219.97.248:80"
}
# request 2
2024-05-21 20:24:08 INFO Selenium Using proxy: {:http=>"18.188.32.159:3128", :ssl=>"18.188.32.159:3128"}
{
"origin": "18.188.32.159:3128"
}
# request 3
2024-05-21 20:25:45 INFO Selenium Using proxy: {:http=>"20.235.159.154:80", :ssl=>"20.235.159.154:80"}
{
"origin": "20.235.159.154:80"
}
Fantastic! You successfully implemented the rotating proxies approach.
As mentioned before, the free proxies may not be consistently reliable. They have a short lifespan and tend to be slow.
Another issue is that free proxies fail against advanced anti-bot measures. Test the proxy rotator logic against the G2 Reviews page, which is protected by anti-bot technologies:
require 'watir'
require 'logger'
# list of proxies
proxies = [
  { http: '8.219.97.248:80', ssl: '8.219.97.248:80' },
  { http: '20.235.159.154:80', ssl: '20.235.159.154:80' },
  { http: '18.188.32.159:3128', ssl: '18.188.32.159:3128' },
  # ...
]
# configure Selenium WebDriver logger
logger = Selenium::WebDriver.logger
logger.ignore(:jwp_caps, :logger_info)
# function to rotate proxies
def get_rotating_proxy(proxies)
  proxies.sample
end
begin
  # initialize the browser with a proxy
  proxy = get_rotating_proxy(proxies)
  logger.info("Using proxy: #{proxy}")
  browser = Watir::Browser.new :chrome, headless: true, proxy: proxy
  # navigate to the URL
  url = 'https://www.g2.com/products/asana/reviews'
  browser.goto(url)
  # get the page content
  page_content = browser.text
  puts page_content
rescue => e
  # handle error
  logger.error("An error occurred: #{e.message}")
ensure
  # close the browser (if it was created)
  browser.close if browser
end
Here's the output:
2024-05-21 20:45:12 INFO Selenium Using proxy: {:http=>"8.219.97.248:80", :ssl=>"8.219.97.248:80"}
Sorry, you have been blocked
You are unable to access g2.com
Why have I been blocked?
This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.
What can I do to resolve this?
You can email the site owner to let them know you were blocked. Please include what you were doing when this page came up and the Cloudflare Ray ID found at the bottom of this page.
Cloudflare Ray ID: jk34g5k3g5523 โข Your IP: Click to reveal โข Performance & security by Cloudflare
Your request got blocked by Cloudflare!
To stay undetected by more advanced protection systems, you need premium proxies. They're consistently reliable, save you from manually composing proxy lists, and can bypass anti-bot systems and IP bans. If you're unsure where to get started, check out our list of the best premium proxy providers.
Let's learn how to use premium proxies with ZenRows, the most reliable premium proxy provider.
Sign up for free, and you'll get redirected to the Request Builder page.
Paste the same G2 Reviews URL in the URL to Scrape box. Enable JS Rendering and click on the Premium Proxies check box. Select Ruby as your language and click on the API tab to copy the API endpoint.
Let's jump into the code!
Initialize the browser in headless mode. Then, navigate to the G2 Reviews page through the ZenRows API endpoint, extract and print the page HTML, and close the browser to conclude the scraping process.
Here's the final Watir script integrating the ZenRows premium proxies:
require 'watir'
# initialize the browser in headless mode
browser = Watir::Browser.new :chrome, headless: true
# connect to the target page
url = 'https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fasana%2Freviews&js_render=true&premium_proxy=true'
browser.goto(url)
# get the page content
page_content = browser.html
puts page_content
# close the browser
browser.close
The code accesses the protected website and extracts its HTML:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
<title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
</head>
<body>
<!-- other content omitted for brevity -->
</body>
</html>
Great! You've just bypassed a protected website with ZenRows.
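As a side note, the target URL inside the ZenRows endpoint must be percent-encoded, as in the https%3A%2F%2F... string above. Rather than encoding it by hand, you can build the endpoint with Ruby's standard CGI.escape:

```ruby
require 'cgi'

api_key = '<YOUR_ZENROWS_API_KEY>'
target = 'https://www.g2.com/products/asana/reviews'

# percent-encode the target URL so it survives as a query parameter
url = 'https://api.zenrows.com/v1/' \
      "?apikey=#{api_key}" \
      "&url=#{CGI.escape(target)}" \
      '&js_render=true&premium_proxy=true'
puts url
```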
Conclusion
This tutorial walked you through the whole process of configuring proxies in Watir. Now, you know how to:
- Use proxies with Watir.
- Set up a rotating proxy.
- Use premium proxies.
Premium proxies boost your scraper's reliability, save you the hassle of finding and configuring proxies manually, and provide a foolproof anti-bot bypass that works even against the most powerful protection systems. Try them out with ZenRows, a complete web scraping toolkit. On top of premium proxies, ZenRows offers a headless browser, User Agent rotator, and anything else you need to extract data from the web.