RSelenium provides R bindings for Selenium WebDriver, allowing you to automate browsers directly from R. You can reap the benefits of Selenium without switching to another programming language.
However, any Selenium-based scraper is still vulnerable to anti-bot systems, especially when sending multiple requests. The solution to this problem is an RSelenium proxy.
Keep reading to learn how to set up a proxy in RSelenium. You'll also learn a few additional tweaks to fortify your scraper further.
1. Set up a Proxy With RSelenium
Setting up a proxy in RSelenium involves configuring the WebDriver to route requests through a proxy server before creating a browser instance. To do this, pass your proxy details as a Chrome argument through the extraCapabilities parameter when you initialize the RSelenium driver.
Before we put this into practice, below is a quick RSelenium intro:
# import the required libraries
library(RSelenium)
library(wdman)
library(netstat)
# launch a Selenium server using wdman
selenium()
eCaps <- list(
chromeOptions = list(
args = list("--headless") # optional for headless mode
)
)
# start a Chrome browser using the rsDriver function
remote_driver <- rsDriver(browser = "chrome",
chromever = "latest",
verbose = F,
extraCapabilities = eCaps,
port = free_port())
# create a client object
remDr <- remote_driver$client
# navigate to target website
remDr$navigate("https://httpbin.io/ip")
# find the body element and get its text content
html <- remDr$findElement(using = "tag name", value = "body")$getElementText()
print(html)
# close the Selenium client and server
remDr$close()
remote_driver$server$stop()
This code launches a Selenium server, starts a Chrome browser in headless mode, navigates to a target website (https://httpbin.io/ip), and prints its text content (your IP address).
The --headless flag, passed through rsDriver, tells Chrome to run in headless mode. To load the browser's GUI instead, remove the --headless flag from the args list.
[1] "{\n \"origin\": \"98.97.79.238:40268\"\n}"
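If you only need the IP address rather than the raw JSON text, you can pull it out with base R. The snippet below is a minimal sketch: extract_origin() is a hypothetical helper (not part of RSelenium), and the sample string is copied from the output above. With a live session, you would pass html[[1]] instead of the hardcoded text.

```r
# example response text, copied from the output shown above
response_text <- "{\n \"origin\": \"98.97.79.238:40268\"\n}"

# extract_origin() is a hypothetical helper, not an RSelenium function
extract_origin <- function(text) {
  # capture whatever sits between the quotes after "origin":
  pattern <- "\"origin\"[[:space:]]*:[[:space:]]*\"([^\"]+)\""
  match <- regmatches(text, regexec(pattern, text))[[1]]
  # no match means the text did not have the expected shape
  if (length(match) < 2) {
    return(NA_character_)
  }
  match[2]
}

origin <- extract_origin(response_text)

# drop the trailing :PORT to keep only the IP address
ip <- sub(":[0-9]+$", "", origin)
print(ip)
```

A regular expression keeps the sketch dependency-free. For anything beyond this single field, a JSON parser is likely the more robust choice.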
While the rsDriver() function lets you start a Selenium server and browser instance simultaneously, the code above uses the wdman package for more control and direct server management.
Also, the netstat package offers the free_port() function, which helps you find a free port on your machine to run the Selenium server.
Now, let's add an RSelenium proxy.
In the eCaps object above, define your proxy settings as an argument using the following format:
--proxy-server=SCHEMA://HOST:PORT
eCaps <- list(
chromeOptions = list(
args = list("--headless",
# define your proxy settings
"--proxy-server=http://189.240.60.166:9090"
)
)
)
Setting this argument ensures that your browser instance, managed by the Selenium Webdriver, routes every request through your specified proxy server.
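To avoid typos in the flag, you could assemble it from its parts. The sketch below assumes a hypothetical helper named build_proxy_arg(); the list of accepted schemes is also an assumption for illustration, so adjust it to whatever your proxies use.

```r
# build_proxy_arg() is a hypothetical helper, not part of RSelenium
build_proxy_arg <- function(host, port, scheme = "http") {
  # fail early on values that cannot form a valid proxy address
  stopifnot(
    is.character(host), nzchar(host),
    scheme %in% c("http", "https", "socks4", "socks5"), # assumed scheme list
    port >= 1, port <= 65535
  )
  # sprintf() adds no separator, so the flag stays free of spaces
  sprintf("--proxy-server=%s://%s:%d", scheme, host, as.integer(port))
}

# same structure as the eCaps object above, with the flag built by the helper
eCaps <- list(
  chromeOptions = list(
    args = list("--headless", build_proxy_arg("189.240.60.166", 9090))
  )
)
print(eCaps$chromeOptions$args[[2]])
```

The resulting eCaps object can be passed to rsDriver() through extraCapabilities exactly as before.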
To follow along in this tutorial, grab a free proxy from Free Proxy List. Choose HTTPS proxies to avoid errors since they work with HTTP and HTTPS websites.
Here's what the current R file looks like:
# import the required libraries
library(RSelenium)
library(wdman)
library(netstat)
# launch a Selenium server using wdman
selenium()
eCaps <- list(
chromeOptions = list(
args = list("--headless",
# define your proxy settings
"--proxy-server=http://189.240.60.166:9090"
)
)
)
# start a Chrome browser using the rsDriver function
remote_driver <- rsDriver(browser = "chrome",
chromever = "latest",
verbose = F,
extraCapabilities = eCaps,
port = free_port())
# create a client object
remDr <- remote_driver$client
# navigate to target website
remDr$navigate("https://httpbin.io/ip")
# find the body element and get its text content
html <- remDr$findElement(using = "tag name", value = "body")$getElementText()
print(html)
# close the Selenium client and server
remDr$close()
remote_driver$server$stop()
Run this code, and you'll get the proxy server address as a response.
[1] "{\n \"origin\": \"189.240.60.168:17770\"\n}"
Hooray! You've configured your first RSelenium proxy.
However, the proxy used in the example has a short lifespan and will most likely not work at the time of reading. That's because free proxies are generally unreliable and only suitable for learning purposes.
Read on to discover a reliable alternative.
2. Add Rotating and Premium Proxies to RSelenium
Websites often flag multiple requests from the same IP address as suspicious activity and can block your web scraper. Thus, rotating between multiple proxies is essential to avoid detection.
To rotate proxies in RSelenium, start by defining your proxy list. You can choose a few more from Free Proxy List.
# define your proxy list
proxies <- list(
"http://189.240.60.166:9090",
"http://72.10.160.171:10095",
"http://38.54.71.67:80"
)
Next, randomly select a proxy from the list.
# randomly select a proxy from the list
selected_proxy <- sample(proxies, 1)
sample() is a base R function that randomly selects one or more items from a vector or list, in this case, proxies.
Here, it takes two arguments (the collection to draw from and the number of items to select) and returns a random sample of its elements.
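To see what sample() hands back, here is a small standalone sketch using the same proxy list. The seed is only there to make the draw repeatable, and the variable names picked and rotation_order are illustrative.

```r
proxies <- list(
  "http://189.240.60.166:9090",
  "http://72.10.160.171:10095",
  "http://38.54.71.67:80"
)

# fixing the seed makes the draw repeatable; drop it for real randomness
set.seed(42)

# sample() on a list returns a list of length 1
picked <- sample(proxies, 1)
print(class(picked))

# [[1]] unwraps it into a plain character string
selected_proxy <- picked[[1]]
print(class(selected_proxy))

# drawing all items without replacement gives a shuffled rotation order
rotation_order <- unlist(sample(proxies, length(proxies)))
print(length(rotation_order))
```

Unwrapping with [[1]] is optional when you only paste the value into a string, but it makes the type explicit.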
After that, concatenate the string "--proxy-server=" with the selected_proxy object, with no space in between, to match the format --proxy-server=SCHEMA://HOST:PORT. Then, pass the concatenated string as an argument using chromeOptions, like in the previous example.
eCaps <- list(
chromeOptions = list(
args = list("--headless",
# pass concatenated string as an argument
paste0("--proxy-server=", selected_proxy)
)
)
)
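One detail worth checking when you build the flag: paste() separates its arguments with a space by default, while paste0() does not. A stray space after the equals sign may keep Chrome from reading the proxy address, so the short sketch below compares the two. The variable names with_space and no_space are illustrative.

```r
selected_proxy <- "http://189.240.60.166:9090"

# paste() separates its arguments with a space by default
with_space <- paste("--proxy-server=", selected_proxy)
print(with_space)

# paste0() joins them with nothing in between
no_space <- paste0("--proxy-server=", selected_proxy)
print(no_space)

# a quick guard before handing the string to chromeOptions
stopifnot(!grepl(" ", no_space))
```

Calling paste() with sep = "" produces the same result as paste0().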
That's it!
Here's the complete code:
# import the required libraries
library(RSelenium)
library(wdman)
library(netstat)
# launch a Selenium server using wdman
selenium()
# define your proxy list
proxies <- list(
"http://189.240.60.166:9090",
"http://72.10.160.171:10095",
"http://38.54.71.67:80"
)
# randomly select a proxy from the list
selected_proxy <- sample(proxies, 1)
eCaps <- list(
chromeOptions = list(
args = list("--headless",
# pass concatenated string as an argument
paste0("--proxy-server=", selected_proxy)
)
)
)
# start a Chrome browser using the rsDriver function
remote_driver <- rsDriver(browser = "chrome",
chromever = "latest",
verbose = F,
extraCapabilities = eCaps,
port = free_port())
# create a client object
remDr <- remote_driver$client
# navigate to target website
remDr$navigate("https://httpbin.io/ip")
# find the body element and get its text content
html <- remDr$findElement(using = "tag name", value = "body")$getElementText()
print(html)
# close the Selenium client and server
remDr$close()
remote_driver$server$stop()
To verify it works, run it multiple times. On each run, RSelenium routes your request through a proxy picked at random from the list, so the same proxy can occasionally appear twice in a row.
Here's the result from three runs:
[1] "{\n \"origin\": \"189.240.60.168:17770\"\n}"
[1] "{\n \"origin\": \"72.10.160.171:55517\"\n}"
[1] "{\n \"origin\": \"38.54.71.67:75610\"\n}"
Well done!
However, you're not quite there yet.
We used free proxies to show you the basic configuration. However, they're generally unreliable and unsuitable for real-world use cases: they're usually slow and short-lived, and websites can easily detect and block them.
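If you still want to experiment with free proxies, one way to cope with dead ones is to fall back to the next proxy when a request fails. The following is a rough sketch under stated assumptions: with_proxy_fallback() and fetch_fn are hypothetical names, and fake_fetch() stands in for a real RSelenium session so the example runs without a browser.

```r
# with_proxy_fallback() and fetch_fn are hypothetical names for illustration
with_proxy_fallback <- function(proxies, fetch_fn) {
  # try the proxies in random order
  for (proxy in sample(proxies)) {
    # turn an error from this proxy into NULL instead of aborting
    result <- tryCatch(fetch_fn(proxy), error = function(e) NULL)
    if (!is.null(result)) {
      return(list(proxy = proxy, result = result))
    }
    message("Proxy failed, trying the next one: ", proxy)
  }
  stop("All proxies failed")
}

# stand-in for a real browser session: only the last proxy "works"
fake_fetch <- function(proxy) {
  if (proxy != "http://38.54.71.67:80") stop("connection refused")
  "page text"
}

proxies <- c(
  "http://189.240.60.166:9090",
  "http://72.10.160.171:10095",
  "http://38.54.71.67:80"
)

outcome <- with_proxy_fallback(proxies, fake_fetch)
print(outcome$proxy)
```

In a real scraper, fetch_fn would start a browser with the given proxy, load the page, and return its text, raising an error on timeouts or connection failures.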
For the best results, use premium proxies. They improve reliability, anonymity, and security, and they can automate the proxy rotation process. Check out this list of the best premium proxy providers to get started!
In the meantime, below is a step-by-step guide on how to use ZenRows' premium residential proxies.
Sign up to access your dashboard. Select Residential Proxies in the left menu section and create a new proxy user. You'll be directed to the Proxy Generator page.
Configure your proxy settings. ZenRows also allows you to choose between the auto-rotate option and sticky sessions. For this case, we'll go with the auto-rotate option.
Finally, copy the generated proxy server URL.
Let's go back to the basic proxy setup using RSelenium and update the --proxy-server argument with the copied proxy server URL.
Your new code now looks like this:
# import the required libraries
library(RSelenium)
library(wdman)
library(netstat)
# launch a Selenium server using wdman
selenium()
eCaps <- list(
chromeOptions = list(
args = list("--headless",
# define your proxy settings
"--proxy-server=http://<PROXY_USERNAME>:<PROXY_PASSWORD>@<PROXY_DOMAIN>:<PROXY_PORT>"
)
)
)
# start a Chrome browser using the rsDriver function
remote_driver <- rsDriver(browser = "chrome",
chromever = "latest",
verbose = F,
extraCapabilities = eCaps,
port = free_port())
# create a client object
remDr <- remote_driver$client
# navigate to target website
remDr$navigate("https://httpbin.io/ip")
# find the body element and get its text content
html <- remDr$findElement(using = "tag name", value = "body")$getElementText()
print(html)
# close the Selenium client and server
remDr$close()
remote_driver$server$stop()
Run it, and you'll get the page's text content, which shows the IP address your request came from.
[1] "{\n \"origin\": \"20.235.159.154:80\"\n}"
Congratulations! Since ZenRows automatically rotates proxies under the hood, you'll get a different IP address for each request.
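Rather than hardcoding credentials in the script, you may prefer to read them from environment variables and assemble the URL at runtime. This is a sketch only: build_auth_proxy_url() is a hypothetical helper, the variable names PROXY_USERNAME, PROXY_PASSWORD, PROXY_DOMAIN, and PROXY_PORT are assumptions, and the fallback values are placeholders so the snippet runs on its own.

```r
# build_auth_proxy_url() is a hypothetical helper, not part of RSelenium
build_auth_proxy_url <- function(username, password, domain, port) {
  stopifnot(nzchar(username), nzchar(password), nzchar(domain), nzchar(port))
  # percent-encode credentials so characters like @ or : do not break the URL
  user <- utils::URLencode(username, reserved = TRUE)
  pass <- utils::URLencode(password, reserved = TRUE)
  sprintf("http://%s:%s@%s:%s", user, pass, domain, port)
}

# assumed environment variable names; the "unset" values are placeholders
proxy_url <- build_auth_proxy_url(
  username = Sys.getenv("PROXY_USERNAME", unset = "demo_user"),
  password = Sys.getenv("PROXY_PASSWORD", unset = "demo_password"),
  domain   = Sys.getenv("PROXY_DOMAIN", unset = "proxy.example.com"),
  port     = Sys.getenv("PROXY_PORT", unset = "8080")
)

# the flag in the same format used throughout this tutorial
proxy_arg <- paste0("--proxy-server=", proxy_url)
print(startsWith(proxy_arg, "--proxy-server=http://"))
```

Keeping secrets out of the source file also makes it easier to share the script or commit it to version control.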
Conclusion
Configuring an RSelenium proxy can help you avoid detection and get around IP-based restrictions such as IP bans and rate limiting. However, for the best results, you must use premium proxies. It's also important to rotate them to avoid being flagged by anti-bot systems.
For automatic premium proxy rotation and guaranteed anti-bot bypass, try ZenRows now for free.