Web Scraping in R: The Complete Guide 2025

Yuvraj Chandra
Updated: February 18, 2025 · 8 min read

Do you want to build a robust web scraper in R? You've come to the right place!

In this step-by-step tutorial, you'll learn how to scrape websites in R using libraries like rvest and RSelenium. Let's go!

Is R Good for Web Scraping?

Yes, it is! R is a powerful programming language for data science, with many data-oriented libraries to support your web scraping goals.

Before we dive into the tutorial, let's first review the tools you need for web scraping with R.

Prerequisites

Let's first go through the requirements for this R scraping tutorial.

Set up the Environment

You'll need the following tools:

  • R 4+: Any version of R greater than or equal to 4 will work. This web scraping tutorial uses R version 4.3.3 on a Windows operating system. Ensure you add R to your system's PATH variable to make it executable from the command line.
  • An R IDE: This tutorial uses the Visual Studio Code IDE with the REditorSupport extension. PyCharm, with the R Language for IntelliJ plugin installed and enabled, is also an excellent option for building an R web scraper.

Use the links above to download and install these tools if you haven't done so already.

Set Up an R Project in VS Code

After setting up the environment, create a new project folder in your chosen location. Add a new scraper.R file to that folder. Open VS Code and go to File > Open Folder. Select your project directory:

VS Code folder creation menu

You've got the basics out of the way. You can now start building a data scraper in R.

How to Scrape a Website in R

In this R data scraping tutorial, you'll extract product data from the E-commerce Challenge page.

ScrapingCourse.com Ecommerce homepage

The site uses pagination to break products into 12 pages. You'll use your R scraper to collect product names, prices, and image URLs from the website.

Step 1: Install rvest

rvest is an R library that helps you scrape data from web pages. It lets you download an HTML document, parse it, select HTML elements, and extract data from them.

To install rvest, open the R terminal, or type R into your command line and press Enter to launch an interactive session. Then, run the command below:

Terminal
install.packages("rvest")

Once installed, load it into your scraper.R file:

scraper.R
# install.packages("rvest")
library(rvest)

It's time to grab some data!

Step 2: Retrieve the HTML Page

Download the HTML document from its URL with a single line of rvest code. The read_html() function fetches the page at the URL passed as a parameter, parses it, and assigns the resulting data structure to the document variable:

scraper.R
# retrieve the target web page 
document <- read_html("https://scrapingcourse.com/ecommerce/")

Go ahead and select the target elements from the page.

Step 3: Identify and Select the Most Important HTML Elements

This tutorial on web scraping with R aims to extract all product data from the current page. So, the product HTML nodes are the most essential elements. To select them, right-click on a product HTML element and choose the "Inspect" option. This action will launch the following DevTools pop-up:

Inspecting the first product li element in DevTools

The target elements are inside individual list tags (li.product). Each product node contains:

  • An a tag whose href attribute stores the product URL.
  • An img.product-image that contains the product image.
  • An h2.product-name that contains the product name.
  • A span.product-price that stores the product price.

Select all product HTML elements on the target page with rvest:

scraper.R
# select the list of product HTML elements 
html_products <- document %>% html_elements("li.product")

This pipes the document into the html_elements() function via R's %>% operator. html_elements() returns the list of HTML elements matching a given CSS or XPath selector.
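
For example, the same product nodes can also be matched with an XPath expression instead of a CSS selector. Here's a brief optional sketch; the XPath string is simply an illustrative equivalent of the li.product selector above:

example.R
# equivalent selection using an XPath expression instead of a CSS selector
html_products_xpath <- document %>%
    html_elements(xpath = "//li[contains(@class, 'product')]")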

Given a single HTML product, select the target HTML nodes with:

example.R
# select the "a" HTML element storing the product URL
a_element <- html_product %>% html_element("product-url")
# select the "img" HTML element storing the product image
img_element <- html_product %>% html_element("product-image")
# select the "h2" HTML element storing the product name
h2_element <- html_product %>% html_element("product-name")
# select the "span" HTML element storing the product price
span_element <- html_product %>% html_element("product-price")

Step 4: Extract the Data from the HTML Elements

R is inefficient at appending elements to a list, so avoid iterating over each HTML product. Instead, use rvest's vectorized functions to extract each product's data directly from the list of parent HTML elements.

Extract the data from each product node. The html_attr() function returns the string stored in an attribute, while html_text2() returns an element's text as it appears in a browser:

scraper.R
# ...

# select the list of product HTML elements
html_products <- document %>% html_elements("li.product")

# extract the required data
product_urls <- html_products %>%
    html_element("a") %>%
    html_attr("href")
product_images <- html_products %>%
    html_element(".product-image") %>%
    html_attr("src")
product_names <- html_products %>%
    html_element(".product-name") %>%
    html_text2()
product_prices <- html_products %>%
    html_element(".product-price") %>%
    html_text2()

Use R's data.frame() function to aggregate all the scraped data into a products variable:

scraper.R
# ...

# convert the lists containing the scraped data into a dataframe
products <- data.frame(
    url = unlist(product_urls),
    image = unlist(product_images),
    name = unlist(product_names),
    price = unlist(product_prices)
)

Here's what the full code looks like at this point:

scraper.R
# install.packages("rvest")
library(rvest)

# retrieve the target web page
document <- read_html("https://scrapingcourse.com/ecommerce/")

# select the list of product HTML elements
html_products <- document %>% html_elements("li.product")

# extract the required data
product_urls <- html_products %>%
    html_element("a") %>%
    html_attr("href")
product_images <- html_products %>%
    html_element(".product-image") %>%
    html_attr("src")
product_names <- html_products %>%
    html_element(".product-name") %>%
    html_text2()
product_prices <- html_products %>%
    html_element(".product-price") %>%
    html_text2()

# convert the lists containing the scraped data into a dataframe
products <- data.frame(
    url = unlist(product_urls),
    image = unlist(product_images),
    name = unlist(product_names),
    price = unlist(product_prices)
)

Bravo! You've built an R web scraper using rvest. Follow the steps below to analyze the data frame and export it as a CSV file.


Step 5: Analyze and Store the Scraped Data

Data storage and analysis are crucial for gaining specific insights, referencing collected information, and more. In this section, you'll perform a simple sorting analysis and store the data as a CSV.

First, sort the product data in descending order of price:

scraper.R
# ...

# sort by descending order of price
products <- products[order(-as.numeric(gsub("[^0-9.]", "", products$price))), ]

Now, export the result to a CSV file using the write.csv() function:

scraper.R
# ...

# write the data to a CSV file
write.csv(products, "products.csv", row.names = FALSE)

Combining the snippets gives the following final code:

scraper.R
# install.packages("rvest")
library(rvest)

# retrieve the target page
document <- read_html("https://www.scrapingcourse.com/ecommerce/page/1/")

# select the list of product HTML elements
html_products <- document %>% html_elements("li.product")

# extract the required data
product_urls <- html_products %>%
    html_element("a") %>%
    html_attr("href")
product_images <- html_products %>%
    html_element("img") %>%
    html_attr("src")
product_names <- html_products %>%
    html_element(".product-name") %>%
    html_text2()
product_prices <- html_products %>%
    html_element(".product-price") %>%
    html_text2()

# convert the lists containing the scraped data into data.frame
products <- data.frame(
    url = unlist(product_urls),
    image = unlist(product_images),
    name = unlist(product_names),
    price = unlist(product_prices)
)

# sort by descending order of price
products <- products[order(-as.numeric(gsub("[^0-9.]", "", products$price))), ]

# write the data to a CSV file
write.csv(products, "products.csv", row.names = FALSE)

Open a terminal in your project's root folder and run the R script with the following command:

Terminal
Rscript scraper.R

The code generates a products.csv file in your project's root folder with the following data:

Product data CSV sorted by price

Well done!

Advanced Techniques in Web Scraping in R

You just learned the basics of web scraping in R. It's time to dig into more advanced techniques.

Web Crawling in R

The target website has several pages, but the current scraper only retrieves content from the first page. To scrape the entire website and retrieve all data, you need to extract the list of all pagination links and visit each page to extract its data. That's basically what web crawling is about.

To start, right-click the next-page HTML element and select "Inspect":

Inspecting the next-page element on the ecommerce homepage

Your browser will open the following DevTools window:

DevTools view of the pagination elements

You'll see that all the pagination elements share the same page-numbers class. So, retrieve all pagination links using the html_elements() method:

scraper.R
pagination_links <- document %>%
    html_elements("a.page-numbers") %>%
    html_attr("href")

Although the goal is to visit all web pages, you may want your R crawler to stop programmatically. The best way to do this is to introduce a crawl limit. 

The R web scraping script below crawls the target website and fills the crawling queue with newly discovered pagination links. The target website has 12 pages, so set a limit of 12 to scrape the entire site. At the end of the while loop, pages_discovered will store all 12 pagination URLs.

scraper.R
# install.packages("rvest")
library(rvest)

# initialize the lists that will store the scraped data
product_urls <- list()
product_images <- list()
product_names <- list()
product_prices <- list()

# initialize the list of pages to scrape with the first pagination links
pages_to_scrape <- list("https://www.scrapingcourse.com/ecommerce/")

# initialize the list of pages discovered
pages_discovered <- pages_to_scrape

# current iteration
i <- 1
# max pages to scrape
limit <- 12

# until there is still a page to scrape
while (length(pages_to_scrape) != 0 && i <= limit) {
    # get the current page to scrape
    page_to_scrape <- pages_to_scrape[[1]]

    # remove the page to scrape from the list
    pages_to_scrape <- pages_to_scrape[-1]

    # retrieve the current page to scrape
    document <- read_html(page_to_scrape)

    # extract the list of pagination links
    new_pagination_links <- document %>%
        html_elements("a.page-numbers") %>%
        html_attr("href")

    # iterate over the list of pagination links
    for (new_pagination_link in new_pagination_links) {
        # if the web page discovered is new and should be scraped
        if (!(new_pagination_link %in% pages_discovered) && !(new_pagination_link %in% pages_to_scrape)) {
            pages_to_scrape <- append(pages_to_scrape, new_pagination_link)
        }

        # discover new pages
        pages_discovered <- append(pages_discovered, new_pagination_link)
    }

    # remove duplicates from pages_discovered
    pages_discovered <- unique(pages_discovered)

    # increment the iteration counter
    i <- i + 1

    # select the list of product html elements
    html_products <- document %>% html_elements("li.product")

    # extract the required data
    product_urls <- append(product_urls, list(html_products %>% html_element("a") %>% html_attr("href")))
    product_images <- append(product_images, list(html_products %>% html_element("img") %>% html_attr("src")))
    product_names <- append(product_names, list(html_products %>% html_element(".product-name") %>% html_text2()))
    product_prices <- append(product_prices, list(html_products %>% html_element(".product-price") %>% html_text2()))
}

# convert the lists containing the scraped data into data.frame
products <- data.frame(
    url = unlist(product_urls),
    image = unlist(product_images),
    name = unlist(product_names),
    price = unlist(product_prices)
)

# sort by descending order of price
products <- products[order(-as.numeric(gsub("[^0-9.]", "", products$price))), ]

# write the data to a CSV file
write.csv(products, "products.csv", row.names = FALSE)

The code will produce a products.csv file containing products from all 12 pages.
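
If you want a quick sanity check, you can read the exported file back into R and confirm the number of rows. This is a small optional sketch using base R's read.csv():

example.R
# quick sanity check on the exported CSV
products_check <- read.csv("products.csv")
cat("rows scraped:", nrow(products_check), "\n")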

Awesome! You just crawled all the product data on the target site using R!

Parallel Web Scraping in R

Parallel scraping involves scraping many pages simultaneously. While this technique requires some extra steps, it speeds up data extraction. 

Parallel scraping is particularly handy if your target site has many pages or its server responds slowly. 

R supports parallelization out of the box! You'll see how it works in the next steps.

First, import the parallel package into your R script. Your scraper will use the utility functions exposed by this library to perform parallel computation. Define a scrape_page() function to handle the scraping logic. Then, initialize an R cluster with makeCluster() to create parallel copies of R instances.

Use parLapply() to apply the scraper function over the list of web pages. Note that parLapply() is the parallel version of lapply():

scraper.R
library(parallel)

pages_to_scrape <- list(
    "https://scrapingcourse.com/ecommerce/page/1",
    "https://scrapingcourse.com/ecommerce/page/2",
    "https://scrapingcourse.com/ecommerce/page/3",
    # ...,
    "https://scrapingcourse.com/ecommerce/page/12"
)


# define a scraper function
scrape_page <- function(page_url) {
    # load the rvest library
    library(rvest)

    # retrieve the current page to scrape
    document <- read_html(page_url)

    html_products <- document %>% html_elements("li.product")

    product_urls <- html_products %>%
        html_element("a") %>%
        html_attr("href")

    product_images <- html_products %>%
        html_element("img") %>%
        html_attr("src")
    product_names <-
        html_products %>%
        html_element("h2") %>%
        html_text2()

    product_prices <-
        html_products %>%
        html_element("span") %>%
        html_text2()

    products <- data.frame(
        unlist(product_urls),
        unlist(product_images),
        unlist(product_names),
        unlist(product_prices)
    )

    names(products) <- c("url", "image", "name", "price")

    return(products)
}

# automatically detect the number of cores
num_cores <- detectCores()

# create a parallel cluster
cluster <- makeCluster(num_cores)

# execute scrape_page() on each element of pages_to_scrape
# in parallel
scraped_data_list <- parLapply(cluster, pages_to_scrape, scrape_page)

# merge the list of dataframes into
# a single dataframe
products <- do.call("rbind", scraped_data_list)
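
As with the sequential scraper, you can then release the cluster workers and export the combined data frame. The following is a short follow-up sketch reusing the sorting and export logic from earlier:

scraper.R
# stop the cluster workers to free system resources
stopCluster(cluster)

# sort by descending order of price and export, as before
products <- products[order(-as.numeric(gsub("[^0-9.]", "", products$price))), ]
write.csv(products, "products.csv", row.names = FALSE)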

Perfect! You significantly increased your R scraper's performance with parallelization logic. But the tutorial isn't complete without learning how to handle dynamic pages!

Let's now see how to use a headless browser for web scraping in R.

Web Scraping with a Headless Browser in R

Many websites rely on JavaScript and API calls to render content or retrieve data asynchronously. The easiest way to scrape such data is to render the target web page in a headless browser.

A headless browser lets you load a web page in a browser with no GUI and automate user interactions like clicking, hovering, scrolling, and more. The most popular headless browser library in R is RSelenium, and we'll use it in this tutorial.

Install RSelenium by running the command below in your R console:

Terminal
install.packages("RSelenium")

Once installed, run the code below to extract the data from the target page.

Note that the findElements() method allows you to select HTML elements with RSelenium. Then, call findChildElement() to select a child element of an RSelenium HTML element. Finally, use getElementAttribute() and getElementText() to extract data from the selected HTML elements:

scraper.R
# install.packages("RSelenium")
library(RSelenium)

# load the Chrome driver  
driver <- rsDriver( 
	browser = c("chrome"), 
	chromever = "108.0.5359.22", 
	verbose = F, 
	# enabling the Chrome --headless mode 
	extraCapabilities = list("chromeOptions" = list(args = list("--headless"))) 
) 
web_driver <- driver[["client"]] 
 
# navigate to the target web page in the headless Chrome instance 
web_driver$navigate("https://scrapingcourse.com/ecommerce/") 
 
# initializing the lists that will contain the scraped data 
product_urls <- list() 
product_images <- list() 
product_names <- list() 
product_prices <- list() 
 
# retrieve all product HTML elements 
html_products <- web_driver$findElements(using = "css selector", value = "li.product") 
 
# iterating over the product list 
for (html_product in html_products) { 
	# scrape the data
	product_urls <- append( 
		product_urls, 
		html_product$findChildElement(using = "css selector", value = "a")$getElementAttribute("href")[[1]] 
	) 
 
	product_images <- append( 
		product_images, 
		html_product$findChildElement(using = "css selector", value = "img")$getElementAttribute("src")[[1]] 
	) 
 
	product_names <- append( 
		product_names, 
		html_product$findChildElement(using = "css selector", value = "h2")$getElementText()[[1]] 
	) 
 
	product_prices <- append( 
		product_prices, 
		html_product$findChildElement(using = "css selector", value = "span")$getElementText()[[1]] 
	) 
} 
 
# convert the lists containing the scraped data into data.frame 
products <- data.frame( 
	unlist(product_urls), 
	unlist(product_images), 
	unlist(product_names), 
	unlist(product_prices) 
) 
 
# changing the column names of the data frame before exporting it into CSV 
names(products) <- c("url", "image", "name", "price") 
 
# export the data frame containing the scraped data to a CSV file 
write.csv(products, file = "./products.csv", fileEncoding = "UTF-8", row.names = FALSE)

The rvest and RSelenium approaches to web scraping in R differ in how they handle content. RSelenium controls a real browser, allowing interaction with JavaScript-driven pages. In contrast, rvest is an HTTP client that fetches and parses static HTML but doesn't execute JavaScript.

For example, if you want to scrape data from a new page with rvest, you have to request that page explicitly. RSelenium, on the other hand, lets you navigate to a new page directly by clicking the next-page button:

scraper.R
# ...

pagination_element <- web_driver$findElement(using = "css selector", value = "a.page-numbers")

# navigate to a new web page by clicking the next-page link
pagination_element$clickElement()

print(web_driver$getTitle())
# prints "Ecommerce Test Site to Learn Web Scraping - Page 2 - ScrapingCourse.com"
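
When you're done driving the browser, it's good practice to close the session and stop the Selenium server so no orphaned processes remain. A brief cleanup sketch using RSelenium's standard calls:

scraper.R
# ...

# close the headless browser window
web_driver$close()

# stop the Selenium server started by rsDriver()
driver[["server"]]$stop()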

RSelenium can simulate human behavior to increase your chances of scraping a web page without getting blocked. However, it has limitations, such as bot-like fingerprints and incomplete request headers, among others. Due to these shortcomings, RSelenium can't overcome advanced anti-bots.

Avoid Getting Blocked While Scraping in R

Many websites you'll scrape in real life implement anti-scraping techniques to block your scraper. You need to find a way to bypass these blocks to scrape without limitations.

You can reduce the chances of anti-bot detection with custom scraping headers like the User Agent.

As explained in the official documentation, rvest relies on httr behind the scenes, so you can set a global User Agent as follows:

scraper.R
httr::set_config(httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36"))

You can also implement other solutions, such as routing your requests through a proxy (see the sketch below). However, these measures can be insufficient against advanced anti-bot systems.
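
For instance, here's a minimal sketch of routing a request through a proxy with httr's use_proxy() helper and then parsing the response with rvest. The proxy host and port are placeholders you'd replace with a real proxy:

example.R
library(httr)
library(rvest)

# hypothetical proxy endpoint: replace with your own proxy host and port
response <- GET(
    "https://www.scrapingcourse.com/ecommerce/",
    use_proxy("http://proxy.example.com", 8080)
)

# parse the returned HTML with rvest
document <- read_html(content(response, "text"))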

The easiest way to scrape any website in R without getting blocked is to use a web scraping solution, such as ZenRows' Universal Scraper API. It provides the complete toolkit required to bypass any anti-bot measure at scale.

With a single API call, ZenRows fortifies your scraper with premium proxy rotation, anti-bot auto-bypass, JavaScript rendering support, advanced fingerprinting evasion techniques, and more. 

Let's see ZenRows in action by scraping this Anti-bot Challenge page.

Sign up on ZenRows to open the Request Builder. Paste your target URL in the link box and activate Premium Proxies and JS Rendering.

Building a scraper with the ZenRows Request Builder

Select cURL as the language option and choose the API connection mode. Copy the generated request URL and call it from R using the httr HTTP client:

scraper.R
# install.packages("httr")
# load the required package
library(httr)


# make the GET request with query parameters
response <- GET("https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.scrapingcourse.com%2Fantibot-challenge&js_render=true&premium_proxy=true")


# check the status of the request
if (status_code(response) == 200) {
    # print the response content
    print(content(response))
} else {
    cat("Failed to fetch the page. Status code:", status_code(response), "\n")
}

The above code outputs the protected website's full-page HTML, proving you bypassed the anti-bot challenge. You can load this HTML into rvest for parsing and further data extraction:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>
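
For example, assuming the response object from the previous snippet, you could hand the returned HTML to rvest and extract the confirmation heading. A quick sketch:

scraper.R
# ...

# load rvest to parse the returned HTML
library(rvest)

# parse the API response body
html <- read_html(content(response, "text"))

# extract the challenge confirmation heading
confirmation <- html %>% html_element("h2") %>% html_text2()
print(confirmation)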

Congratulations! 🎉 You just bypassed an anti-bot measure in R using ZenRows.

Other Web Scraping Libraries for R

Other useful libraries for web scraping in R are:

  • ZenRows: A web scraping API that bypasses anti-bot and anti-scraping systems for you, offering rotating proxies, headless browsers, CAPTCHA bypass, and more.
  • RCrawler: An R package for web crawling and scraping. It offers many features to extract structured data from a web page.
  • xmlTreeParse: A function from the XML package for parsing XML/HTML files or strings. It generates an R structure representing the XML/HTML tree and allows you to select elements from it.

Conclusion

In this step-by-step tutorial, you've learned the basic to advanced concepts of R web scraping.

Here's a recap of what you've learned:

  • How to perform basic data scraping in R with rvest.
  • How to implement crawling logic to scrape data from an entire website.
  • Advanced techniques to optimize your R scraper.

Web scraping with R can get stressful due to the anti-scraping systems that many websites implement, and most libraries struggle to bypass them. One way to bypass any anti-bot mechanism at scale is to use a web scraping API like ZenRows. It's an all-in-one scraping solution that gives you everything you need with a single API call.

Try ZenRows for free today!

Frequent Questions

How Do You Scrape Data from a Website in R?

To scrape data in R, you need a web scraping library like rvest. You'll then use it to connect to your target website, extract HTML elements, and retrieve the data.
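
As a quick illustration, here's a minimal sketch that fetches this tutorial's demo page and extracts the product names (the CSS selector assumes the page structure used earlier in this guide):

example.R
library(rvest)

# fetch and parse the target page
page <- read_html("https://www.scrapingcourse.com/ecommerce/")

# select the product name elements and extract their text
product_names <- page %>%
    html_elements(".product-name") %>%
    html_text2()

print(head(product_names))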

What Is the rvest Package in R?

rvest is one of the most popular web scraping R libraries, offering several functions to make R web crawling easier. rvest wraps the xml2 and httr packages, allowing you to download HTML documents and extract data from them.
