Are you looking for a comprehensive tutorial on using Selenium with NodeJS? You're in the right place!
Selenium is one of the most popular browser automation tools for web scraping and testing. Its official Node.js package, selenium-webdriver, allows you to control web browsers programmatically.
In this tutorial, you'll learn everything you need to get started with using Selenium in NodeJS:
- How to use Selenium in NodeJS.
- How to interact with web pages in a browser with selenium-webdriver.
- How to avoid getting blocked when scraping with Selenium.
How to Use Selenium in NodeJS
You'll use the Selenium WebDriver's headless capabilities to scrape data from an infinite-scrolling webpage. You'll also parse the DOM and export the scraped data to a CSV file.
Let's get started!
Step 1: Install Selenium in NodeJS
Before installing Selenium, ensure you've installed NodeJS on your system. Run the following command in the terminal to display the installed NodeJS version (e.g., v20.10.0):
node -v
If the terminal displays an error message, install NodeJS from the official website first.
Once you're ready, create a selenium-nodejs-project directory and a JavaScript file (scraper.js) for your example project.
mkdir selenium-nodejs-project
cd selenium-nodejs-project
touch scraper.js
Next, initialize your NodeJS project using the npm init command:
npm init
The selenium-webdriver library is the Selenium WebDriver implementation for NodeJS. You can install it using the following command:
npm install selenium-webdriver
Now, you can start writing your NodeJS script to scrape data using the Selenium WebDriver.
Open the project directory in your favorite IDE. In the scraper.js file, add the following line of code to import the library:
const { Builder } = require('selenium-webdriver');
Awesome! It's time to start working on the code.
Step 2: Run Browser With Selenium in NodeJS
Selenium is widely known for its powerful browser automation capabilities. It supports most major browsers, including Chrome, Firefox, Edge, Opera, Safari, and Internet Explorer.
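Switching browsers is mostly a matter of importing the matching options class and telling the Builder which browser to drive. As a minimal sketch (assuming Firefox and geckodriver are installed; the -headless flag is the one current Firefox versions accept), the setup might look like this:
const { Builder } = require('selenium-webdriver');
const firefox = require('selenium-webdriver/firefox');

// run Firefox without a GUI
const options = new firefox.Options().addArguments('-headless');
// the rest of the tutorial works the same way, just with a different driver
const driver = new Builder().forBrowser('firefox').setFirefoxOptions(options).build();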
Since Chrome is the most popular and robust among these, you'll use it in this tutorial. Import the Chrome WebDriver in the scraper.js file:
const chrome = require('selenium-webdriver/chrome');
Create an async function to enclose the scraper logic since you'll be dealing with asynchronous operations.
// import statements

async function scraper() {
    // write your scraping logic here
}

scraper();
Next, initialize the headless Chrome browser. Headless browsers are web browsers without a graphical user interface (GUI). They allow you to interact with web pages and perform tasks programmatically.
async function scraper() {
    // set the browser options
    const options = new chrome.Options().addArguments('--headless');
    // initialize the webdriver
    const driver = new Builder().forBrowser('chrome').setChromeOptions(options).build();
}
Now, you're ready to navigate to any webpage. In our case, we'll be scraping the product data from the ScrapingCourse infinite scrolling challenge page.
It's good practice to use the try…catch…finally statement for efficient error handling. We'll use it to enclose our main scraping logic and handle errors and exceptions.
Navigate to the target webpage and get its complete HTML code.
try {
    // navigate to the target webpage
    await driver.get('https://www.scrapingcourse.com/infinite-scrolling');
    // extract HTML of target webpage
    const html = await driver.getPageSource();
    console.log(html);
} catch (error) {
    // handle error
    console.error('An error occurred:', error);
} finally {
    // quit browser session
    await driver.quit();
}
Here's how our scraper.js file looks right now:
// import statements
const { Builder } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

async function scraper() {
    // set the browser options
    const options = new chrome.Options().addArguments('--headless');
    // initialize the webdriver
    const driver = new Builder().forBrowser('chrome').setChromeOptions(options).build();
    try {
        // navigate to the target webpage
        await driver.get('https://www.scrapingcourse.com/infinite-scrolling');
        // extract HTML of the target webpage
        const html = await driver.getPageSource();
        console.log(html);
    } catch (error) {
        // handle error
        console.error('An error occurred:', error);
    } finally {
        // quit browser session
        await driver.quit();
    }
}

scraper();
The above script will print the following HTML in the terminal:
<html lang="en"><head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Infinite Scroll Challenge to Learn Web Scraping - ScrapingCourse.com</title>
<!-- Omitted for brevity... -->
</body></html>
Great! You've just extracted the complete HTML of the target webpage.
Step 3: Extract Data From a Page
Once you have the complete HTML of the webpage, you can proceed with extracting the required data. In this case, let's parse the name and price of all the products on the page.
To accomplish the task, you must follow these steps:
- Analyze the DOM of the webpage using DevTools.
- Implement an effective node selection strategy to locate the products.
- Extract the required data and store them in JavaScript arrays/objects.
DevTools is an invaluable tool in web scraping. It helps you inspect the currently loaded HTML, CSS, and JavaScript. You can also get information about the assets the page has requested and their corresponding loading time.
CSS selectors and XPath expressions are the most reliable node selection strategies. You can use either of them to locate the elements, but in this tutorial, we'll use CSS selectors for simplicity.
DevTools allows you to copy the CSS selector or XPath expression of a selected HTML node. Although that gives you an idea of the element's structure and position, auto-generated selectors aren't always robust: they often rely on exact positions and break when the page layout changes.
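For instance, both locators below target the same product name nodes; the short, class-based CSS selector usually holds up better than a long copied XPath chain. The product-name class used here comes from the DOM inspection in the next paragraphs:
const { By } = require('selenium-webdriver');

// class-based CSS selector: short and resilient to layout changes
const nameByCss = By.css('.product-name');

// equivalent XPath expression matching the same nodes
const nameByXpath = By.xpath('//span[contains(@class, "product-name")]');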
Let's use DevTools to define the CSS selectors. Open the target webpage in your browser, and right-click > Inspect on the product element to open DevTools.
You can observe that the individual product details are inside a div tag with the class name product-info. The product name is enclosed within the first span tag with the class name product-name, and the product price is within the second span tag with the class name product-price.
Use the above information to define the CSS selectors and locate the products using the findElements() and findElement() methods. Further, use the getText() method to extract the inner text of the HTML nodes and finally store the extracted names and prices in arrays.
const { Builder, By } = require('selenium-webdriver');
// ...

// ...
// locate the parent elements
let parentElements = await driver.findElements(By.css('.product-info'));
const namesArray = [];
const pricesArray = [];
for (let parentElement of parentElements) {
    // find child elements within the parent element
    let names = await parentElement.findElement(By.css('.product-name'));
    let prices = await parentElement.findElement(By.css('.product-price'));
    namesArray.push(await names.getText());
    pricesArray.push(await prices.getText());
}
console.log(namesArray);
console.log(pricesArray);
// ...
Here's how your scraper.js file should look right now:
const { Builder, By } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

async function scraper() {
    // set the browser options
    const options = new chrome.Options().addArguments('--headless');
    // initialize the webdriver
    const driver = new Builder().forBrowser('chrome').setChromeOptions(options).build();
    try {
        // navigate to the target webpage
        await driver.get('https://www.scrapingcourse.com/infinite-scrolling');
        // locate the parent elements
        let parentElements = await driver.findElements(By.css('.product-info'));
        const namesArray = [];
        const pricesArray = [];
        for (let parentElement of parentElements) {
            // find child elements within the parent element
            let names = await parentElement.findElement(By.css('.product-name'));
            let prices = await parentElement.findElement(By.css('.product-price'));
            namesArray.push(await names.getText());
            pricesArray.push(await prices.getText());
        }
        console.log(namesArray);
        console.log(pricesArray);
    } catch (error) {
        // handle error
        console.error('An error occurred:', error);
    } finally {
        // quit browser session
        await driver.quit();
    }
}

scraper();
Run the above code, and you'll get the following output in the terminal:
[
'Chaz Kangeroo Hoodie',
'Teton Pullover Hoodie',
'Bruno Compete Hoodie',
'Frankie Sweatshirt',
'Hollister Backyard Sweatshirt',
'Stark Fundamental Hoodie',
'Hero Hoodie',
'Oslo Trek Hoodie',
'Abominable Hoodie',
'Mach Street Sweatshirt',
'Grayson Crewneck Sweatshirt',
'Ajax Full-Zip Sweatshirt'
]
[
'$52', '$70', '$63',
'$60', '$52', '$42',
'$54', '$42', '$69',
'$62', '$64', '$69'
]
Voila! You just successfully scraped the product details using Selenium and NodeJS.
Step 4: Export Data to CSV
You're now ready to export the scraped data to a CSV file.
Import the built-in Node.js fs module, which provides functions for working with the file system.
const fs = require('fs');
Then, initialize a string variable called productsData with the header line containing column names ("name,price\n").
let productsData = "name,price\n";
Next, loop through the two arrays (namesArray and pricesArray) containing product names and prices. For each element in the arrays, append a line to productsData with the name and price separated by a comma.
for (let i = 0; i < namesArray.length; i++) {
    productsData += `${namesArray[i]},${pricesArray[i]}\n`;
}
Using the fs.writeFile() function, write the productsData string to a file named ProductDetails.csv. This function takes three arguments: the file name, the data to write, and a callback function that handles any errors encountered during the writing process.
fs.writeFile("ProductDetails.csv", productsData, err => {
    if (err) {
        console.error("Error:", err);
    } else {
        console.log("Success!");
    }
});
Your final web scraping project code should look like the following:
const { Builder, By } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');
const fs = require('fs');

async function scraper() {
    // set the browser options
    const options = new chrome.Options().addArguments('--headless');
    // initialize the webdriver
    const driver = new Builder().forBrowser('chrome').setChromeOptions(options).build();
    try {
        // navigate to the target webpage
        await driver.get('https://www.scrapingcourse.com/infinite-scrolling');
        // locate the parent elements
        let parentElements = await driver.findElements(By.css('.product-info'));
        let namesArray = [];
        let pricesArray = [];
        for (let parentElement of parentElements) {
            // find child elements within the parent element
            let names = await parentElement.findElement(By.css('.product-name'));
            let prices = await parentElement.findElement(By.css('.product-price'));
            namesArray.push(await names.getText());
            pricesArray.push(await prices.getText());
        }
        console.log(namesArray);
        console.log(pricesArray);
        // export to csv file
        let productsData = "name,price\n";
        for (let i = 0; i < namesArray.length; i++) {
            productsData += `${namesArray[i]},${pricesArray[i]}\n`;
        }
        fs.writeFile("ProductDetails.csv", productsData, err => {
            if (err) {
                console.error("Error:", err);
            } else {
                console.log("Success!");
            }
        });
    } catch (error) {
        // handle error
        console.error('An error occurred:', error);
    } finally {
        // quit browser session
        await driver.quit();
    }
}

scraper();
Running the command node scraper.js in the terminal will create the ProductDetails.csv file containing the scraped product names and prices.
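Based on the products scraped in the previous step, its first few rows should look like this:
name,price
Chaz Kangeroo Hoodie,$52
Teton Pullover Hoodie,$70
Bruno Compete Hoodie,$63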
Amazing! You now have the fundamental knowledge required to use Selenium in NodeJS.
The target page has several products, but the current output displays only a few. This is because the page initially loads only the first 12 products and implements infinite scrolling to load the rest.
You'll learn how to scrape all the products in the next section.
Interacting With Web Pages in a Browser With selenium-webdriver
When dealing with dynamic websites, you must interact with them like an average user would in a regular browser. Interactions on dynamic websites may include scrolling, clicking a button, filling out a form, moving the mouse, etc.
The selenium-webdriver Node.js library provides various browser interactions for automated testing and web scraping. Here are some of the key interactions supported by the library:
- Click elements
- Input text
- Navigate to URLs
- Navigate back and forward
- Scrolling
- Mouse actions
- Keyboard actions
- Wait for elements
- Alert handling
- Window handling
- Frame and iFrame handling
- Cookies handling
- Configuring browser behavior
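As a quick illustration, here's a minimal sketch combining a few of the interactions above. The URL and selectors are placeholders, not part of the tutorial's target page:
const { Builder, By, until } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

async function interactionsDemo() {
    const options = new chrome.Options().addArguments('--headless');
    const driver = new Builder().forBrowser('chrome').setChromeOptions(options).build();
    try {
        // navigate to a URL (placeholder page)
        await driver.get('https://example.com/login');
        // input text into form fields (placeholder selectors)
        await driver.findElement(By.css('#username')).sendKeys('my-user');
        await driver.findElement(By.css('#password')).sendKeys('my-pass');
        // click the submit button
        await driver.findElement(By.css('button[type="submit"]')).click();
        // wait for an element that appears on the next page
        await driver.wait(until.elementLocated(By.css('.dashboard')), 5000);
        // navigate back and forward in the browser history
        await driver.navigate().back();
        await driver.navigate().forward();
    } finally {
        await driver.quit();
    }
}

interactionsDemo();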
In addition to the built-in methods provided by the library to perform interactions, you can also use the executeScript() method. This method lets you execute a JavaScript code snippet directly on the page.
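For instance, executeScript() can both run arbitrary JavaScript and return a value to your Node.js code. The fragment below assumes driver and By are already in scope, as in the scripts above:
// read a value computed in the page context
const pageTitle = await driver.executeScript('return document.title');
console.log(pageTitle);

// pass WebElements as arguments, e.g. to scroll a product card into view
const firstProduct = await driver.findElement(By.css('.product-info'));
await driver.executeScript('arguments[0].scrollIntoView(true)', firstProduct);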
Let's finish our Node.js Selenium scraping project by extracting all the product data from the webpage. Then, we'll see some other interactions.
Scrolling
Since our target webpage implements infinite scrolling, you need to scroll to the bottom of the page until no new elements are loaded.
The following code repeatedly scrolls to the bottom of the page, waits for 3 seconds for content to load, and then checks if the page height has changed. If the height remains the same for two consecutive iterations, it assumes no more content is being loaded and breaks out of the loop.
// loop to keep scrolling until no more content is loaded
let lastHeight = 0;
while (true) {
    // scroll to the end of the page
    await driver.executeScript('window.scrollTo(0, document.body.scrollHeight)');
    // wait for 3 seconds
    await driver.sleep(3000);
    // get the current height of the page
    const currentHeight = await driver.executeScript('return document.body.scrollHeight');
    // break the loop if no more content is loaded
    if (currentHeight === lastHeight) {
        break;
    }
    lastHeight = currentHeight;
}
Integrate the above code snippet with the previous scraping script. Here's your new complete code:
const { Builder, By } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');
const fs = require('fs');

async function scraper() {
    // set the browser options
    const options = new chrome.Options().addArguments('--headless');
    // initialize the webdriver
    const driver = new Builder().forBrowser('chrome').setChromeOptions(options).build();
    try {
        // navigate to the target webpage
        await driver.get('https://www.scrapingcourse.com/infinite-scrolling');
        // loop to keep scrolling until no more content is loaded
        let lastHeight = 0;
        while (true) {
            // scroll to the end of the page
            await driver.executeScript('window.scrollTo(0, document.body.scrollHeight)');
            // wait for 3 seconds
            await driver.sleep(3000);
            // get the current height of the page
            const currentHeight = await driver.executeScript('return document.body.scrollHeight');
            // break the loop if no more content is loaded
            if (currentHeight === lastHeight) {
                break;
            }
            lastHeight = currentHeight;
        }
        // locate the parent elements
        let parentElements = await driver.findElements(By.css('.product-info'));
        let namesArray = [];
        let pricesArray = [];
        for (let parentElement of parentElements) {
            // find child elements within the parent element
            let names = await parentElement.findElement(By.css('.product-name'));
            let prices = await parentElement.findElement(By.css('.product-price'));
            namesArray.push(await names.getText());
            pricesArray.push(await prices.getText());
        }
        console.log(namesArray);
        console.log(pricesArray);
        // export to csv file
        let productsData = "name,price\n";
        for (let i = 0; i < namesArray.length; i++) {
            productsData += `${namesArray[i]},${pricesArray[i]}\n`;
        }
        fs.writeFile("ProductDetails.csv", productsData, err => {
            if (err) {
                console.error("Error:", err);
            } else {
                console.log("Success!");
            }
        });
    } catch (error) {
        // handle error
        console.error('An error occurred:', error);
    } finally {
        // quit browser session
        await driver.quit();
    }
}

scraper();
After executing this script, you'll have a ProductDetails.csv file containing details of all 187 items.
Congratulations! You successfully scraped all the required data from the target webpage.
Wait for Element
In some cases, such as a network or browser slowdown, your script might fail or show inconsistent results.
Rather than waiting for a fixed time interval, prefer smart waits, like waiting for a specific node to be present or visible on the page. This ensures that the web elements are loaded properly before interacting with them, reducing the chances of element not found or element not interactable errors.
The following code snippet implements an explicit wait strategy. The until.elementsLocated method defines the condition for waiting, which ensures that the WebDriver waits until the specified elements are located or until the maximum timeout of 5000 milliseconds (5 seconds) is reached.
const { Builder, By, until } = require('selenium-webdriver');
// ...
// ...
let parentElements = await driver.wait(until.elementsLocated(By.css('.your-css-selector')), 5000);
// ..
You can learn more about Selenium's Explicit Waits from the official documentation.
Wait for the Page to Load
Dynamic websites often have elements that load asynchronously or are added to the DOM after the initial page load. Your page load strategy should account for these dynamic elements to ensure you capture all the data you need while remaining efficient.
Selenium WebDriver allows you to set the page load strategy to control how WebDriver waits for page loads to complete. There are three possible strategies:
- normal: WebDriver waits for the full page to load (including all its resources such as images, scripts, etc.) before considering the page load to be complete.
- eager: WebDriver waits for the DOM access to be ready while other resources like images may still be loading.
- none: WebDriver does not wait for the page to load at all. It's up to you to handle waiting for elements or other conditions manually.
You can check out the official Selenium documentation for more information about the page load strategy.
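As a sketch, you can set the strategy when configuring the browser options. In Selenium 4, chrome.Options inherits setPageLoadStrategy() from Capabilities; if your version doesn't expose it, consult the documentation linked above:
const { Builder } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

// return control once the DOM is ready, without waiting for images and other resources
const options = new chrome.Options()
    .addArguments('--headless')
    .setPageLoadStrategy('eager');

const driver = new Builder().forBrowser('chrome').setChromeOptions(options).build();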
Avoid Getting Blocked When Scraping With Selenium
One of the biggest challenges to web scraping is getting blocked by websites implementing anti-bot measures. To avoid this, you need to imitate a real browser and normal user behavior.
Let's see what happens when we try to scrape data from G2 Reviews (a website protected by Cloudflare) using our script that extracts the complete HTML of the page.
// import statements
const { Builder } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

async function scraper() {
    // set the browser options
    const options = new chrome.Options().addArguments('--headless');
    // initialize the webdriver
    const driver = new Builder().forBrowser('chrome').setChromeOptions(options).build();
    try {
        // navigate to the target webpage
        await driver.get('https://www.g2.com/products/airtable/reviews');
        // extract HTML of target webpage
        const html = await driver.getPageSource();
        console.log(html);
    } catch (error) {
        // handle error
        console.error('An error occurred:', error);
    } finally {
        // quit browser session
        await driver.quit();
    }
}

scraper();
Running this script will produce the following output:
<html class="no-js" lang="en-US">
<title>Attention Required! | Cloudflare</title>
<!-- omitted for brevity -->
<h1 data-translate="block_headline">Sorry, you have been blocked</h1>
<h2 class="cf-subheadline"><span data-translate="unable_to_access">You are unable to access</span> g2.com</h2>
<!-- omitted for brevity -->
<div class="cf-column">
<h2 data-translate="blocked_why_headline">Why have I been blocked?</h2>
<p data-translate="blocked_why_detail">This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.</p>
</div>
<!-- omitted for brevity -->
</html>
You got blocked! The website detected your scraping bot.
To avoid bot detection in Selenium, you can take measures like rotating IPs, premium proxies, rotating User-Agents, etc. Note that these approaches are just baby steps to bypass anti-bot solutions. Advanced anti-bot protection systems like Cloudflare would still be able to detect your bot.
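For example, overriding the default headless Chrome User-Agent is one of those small steps. Here's a minimal sketch; the UA string is just an example, and on its own it won't get past advanced protections:
const chrome = require('selenium-webdriver/chrome');

// example desktop User-Agent string; rotate several in practice
const customUserAgent =
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';

const options = new chrome.Options()
    .addArguments('--headless')
    .addArguments(`--user-agent=${customUserAgent}`);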
So what can you do? Use ZenRows!
ZenRows is a popular alternative to Selenium in NodeJS. This advanced web scraping API offers all the functionalities of Selenium and provides additional features such as rotating premium proxies, auto-rotating UAs, anti-CAPTCHA, and other tools to help you avoid getting blocked.
To get started with ZenRows, sign up on the platform and get your 1,000 free API credits. After signing up, you'll get redirected to the Request Builder page.
Let's scrape data from the protected G2 Reviews page that you saw earlier.
Paste the target URL (https://www.g2.com/products/airtable/reviews) in the 'URL to Scrape' input field. Make sure the Premium Proxies checkbox is checked and JS rendering is enabled.
Click on the Try it button, and you'll get the following output:
<!DOCTYPE html>
<head>
<meta charset="utf-8" />
<link href="https://www.g2.com/assets/favicon-fdacc4208a68e8ae57a80bf869d155829f2400fa7dd128b9c9e60f07795c4915.ico" rel="shortcut icon" type="image/x-icon" />
<title>Airtable Reviews 2024: Details, Pricing, & Features | G2</title>
<meta content="78D210F3223F3CF585EB2436D17C6943" name="msvalidate.01" />
<!-- Omitted for Brevity -->
Congrats! You successfully scraped the HTML source of the protected G2 Reviews page.
As you saw, ZenRows web scraping API efficiently handles anti-bot measures.
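If you'd rather call the API from your Node.js code than from the Request Builder, the request is a plain HTTP GET. Here's a rough sketch using Node's built-in fetch (Node 18+); the js_render and premium_proxy parameter names are assumptions based on the options mentioned above, so copy the exact call from your Request Builder:
const apiKey = 'YOUR_ZENROWS_API_KEY'; // placeholder
const targetUrl = 'https://www.g2.com/products/airtable/reviews';

async function zenrowsExample() {
    const params = new URLSearchParams({
        apikey: apiKey,
        url: targetUrl,
        js_render: 'true', // assumed name for the JS rendering option
        premium_proxy: 'true', // assumed name for the Premium Proxies option
    });
    const response = await fetch(`https://api.zenrows.com/v1/?${params}`);
    const html = await response.text();
    console.log(html);
}

zenrowsExample();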
It's even capable of replacing Selenium's full functionality. Selenium might seem like a free tool, but there are several hidden expenses to consider when using it for professional purposes. Learning time, troubleshooting complexity, scaling expenses, etc., make Selenium's net cost significantly higher than that of ZenRows.
Conclusion
In this Selenium NodeJS tutorial, you learned how to control headless Chrome for scraping and automation. You started with the basics and then moved on to the advanced concepts of web scraping using Selenium.
Now you know:
- How to set up a NodeJS Selenium WebDriver project.
- How to use it to scrape data from a dynamic website.
- How to interact with dynamic content using Selenium.
- The challenges of web scraping and how to deal with them.
No matter how good your scraping script is, anti-bot measures will still be able to block it. Avoid them all using the advanced ZenRows API. Try ZenRows for free!