Do you want to explore all there is to know about web scraping with Selenium in Python, from the basics to advanced concepts, including avoiding blocks? We've got you covered!
Selenium is a popular open-source library for browser automation and scraping. It uses the WebDriver protocol to control browsers such as Chrome, Firefox, and Safari. Unlike many traditional scraping tools, Selenium renders pages in a real browser, so it can collect data from websites that rely on JavaScript.
The library also provides several methods to interact with a page like a human user would, meaning you gain extra functionality and are more prepared to avoid being blocked.
Here's what you're about to learn:
- Setting up a basic Selenium scraper.
- Interacting with web pages.
- Avoiding anti-bot blocks.
- Saving resources while web scraping.
Let's go!
Getting Started
Before we begin, we'll go through the steps to have everything ready to follow this Selenium web scraping tutorial and run a headless browser.
This tutorial uses Python 3.12.1. If you haven't already, install the latest Python version. Although you can use any code editor, this tutorial uses VS Code.
Once Python is ready, install Selenium using pip:
pip install selenium
The above command installs the latest Selenium version on your machine.
Selenium supports several browsers, but we'll use Google Chrome because it's the most popular. You'll need to install the following:
- Google Chrome: The latest version of the browser will do.
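- ChromeDriver: Download the one that matches your Chrome version, extract the zipped folder, and paste the executable file (chromedriver.exe) into your project root folder. Selenium 4.6 and later include Selenium Manager, which can fetch a matching driver automatically, so this manual step is often optional.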
You are now ready to start controlling Chrome via Selenium. Let's get started!
Step #1: Build a Basic Scraper
As a web scraper, you might often find yourself scraping e-commerce sites like Amazon with Selenium.
For this article, we'll use an e-commerce demo website, ScrapingCourse, as the target. Let's start with a basic web scraper that extracts the full-page HTML. Here's what the target page looks like:
Create a scraper.py file in your project directory and write the following Python code. The code instantiates a Chrome WebDriver object and uses it to visit the target site before quitting the driver instance:
# import the required library
from selenium import webdriver
# initialize an instance of the chrome driver (browser)
driver = webdriver.Chrome()
# visit your target site
driver.get("https://www.scrapingcourse.com/ecommerce/")
# output the full-page HTML
print(driver.page_source)
# release the resources allocated by Selenium and shut down the browser
driver.quit()
Verify that your scraper.py Selenium script works by running the following command via your terminal:
python scraper.py
The code spins up a browser window with a "Chrome is being controlled by automated test software" message, an extra alert bar informing you that Selenium is controlling the Chrome instance:
Before that browser instance closes, the script prints the website's full-page HTML, as shown:
<!DOCTYPE html>
<html lang="en-US">
<head>
<!--- ... --->
<title>Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com</title>
<!--- ... --->
</head>
<body class="home archive ...">
<p class="woocommerce-result-count" id="result-count">Showing 1-16 of 188 results</p>
<ul class="products columns-4" id="product-list">
<!--- ... --->
</ul>
</body>
</html>
Great! Your Python script works as expected. But do you really have to open a Chrome window?
Let's dig into this in the next section.
Step #2: Set up Headless Mode
Selenium is well known for its headless browser capabilities. A headless browser lacks a graphical user interface (GUI) but has all the other functionalities of a real browser.
Enable the headless mode for Chrome in Selenium by defining a ChromeOptions object and passing it to the WebDriver Chrome constructor. Starting from Chrome 109, pass the --headless=new argument to activate the new headless mode:
# import the required library
from selenium import webdriver
# instantiate a Chrome options object
options = webdriver.ChromeOptions()
# set the options to use Chrome in headless mode
options.add_argument("--headless=new")
# initialize an instance of the Chrome driver (browser) in headless mode
driver = webdriver.Chrome(options=options)
# ...
Selenium will now launch a headless Chrome instance, and you'll no longer see a Chrome window if you rerun the script. That's the ideal setting for production when running the scraping script on a server, as you don't want to waste resources on the GUI.
At the same time, seeing what happens in Chrome windows is useful while testing your scraper because it allows you to observe the effects of your script directly in the browser.
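Since you'll likely want a visible window while testing and a headless one in production, one option is to toggle the mode from an environment variable. The sketch below is only an illustration: the `SCRAPER_HEADLESS` variable and the `chrome_arguments` and `make_driver` helpers are hypothetical names, not part of Selenium.

```python
import os


def chrome_arguments(env=None):
    """Return the Chrome flags to use, based on a hypothetical SCRAPER_HEADLESS variable."""
    env = os.environ if env is None else env
    # run headless unless the variable is explicitly set to "0" or "false"
    headless = env.get("SCRAPER_HEADLESS", "1").strip().lower() not in ("0", "false")
    return ["--headless=new"] if headless else []


def make_driver(env=None):
    """Build a Chrome driver with the flags chosen above."""
    # imported here so the flag logic can be reused without Selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in chrome_arguments(env):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

With this in place, setting `SCRAPER_HEADLESS=0` before running the script would open a visible Chrome window, while leaving it unset keeps the browser headless.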
The setup part is complete. Now it's time to get your hands dirty with some web scraping logic and data extraction!
Step #3: Extract Specific Data From the Page
In this step, you'll scrape the product names, prices, image sources, and URLs from the target page.
There are two methods of locating elements on a web page:
- find_element: It finds only one element. If multiple elements match the selector, it returns the first matching HTML element.
- find_elements: It returns all elements that match the search condition in a list.
These two methods support 8 locator strategies. We can group them into three categories:
- CSS selectors: By.ID, By.CLASS_NAME, and By.CSS_SELECTOR.
- XPath: By.XPATH.
- Direct selectors: By.NAME, By.LINK_TEXT, By.PARTIAL_LINK_TEXT, and By.TAG_NAME.
Let's describe each with examples in the table below:
Strategy | Description | HTML Sample Code | Selenium Examples |
---|---|---|---|
By.ID | Selects HTML elements based on their id attribute | <div id="s-437">...</div> | find_element(By.ID, "s-437") |
By.CLASS_NAME | Selects HTML elements based on their class attribute | <div class="welcome-text">Welcome!</div> | find_element(By.CLASS_NAME, "welcome-text"); find_elements(By.CLASS_NAME, "text-center") |
By.CSS_SELECTOR | Selects HTML elements that match a CSS selector | <div class="product-card"><span class="price">$140</span></div> | find_element(By.CSS_SELECTOR, ".product-card .price"); find_elements(By.CSS_SELECTOR, ".product-card .price") |
By.XPATH | Selects HTML elements that match an XPath expression | <h1>My <strong>Fantastic</strong> Blog</h1> | find_element(By.XPATH, "//h1/strong"); find_elements(By.XPATH, "//h1/strong") |
By.NAME | Selects HTML elements based on their name attribute | <input name="email" /> | find_element(By.NAME, "email"); find_elements(By.NAME, "email") |
By.LINK_TEXT | Selects anchor (<a>) elements matching a specific link text | <a href="/">Home</a> | find_element(By.LINK_TEXT, "Home"); find_elements(By.LINK_TEXT, "Home") |
By.PARTIAL_LINK_TEXT | Selects anchor (<a>) elements whose link text contains a given substring | <a href="/">Click here now</a> | find_element(By.PARTIAL_LINK_TEXT, "now"); find_elements(By.PARTIAL_LINK_TEXT, "now") |
By.TAG_NAME | Selects HTML elements based on their tag name | <span>...</span> | find_element(By.TAG_NAME, "span"); find_elements(By.TAG_NAME, "span") |
CSS selectors are recommended over XPath because they're more beginner-friendly and easier to maintain at scale. However, XPath can be more suitable when you're dealing with a complex HTML structure and need more specificity when selecting element nodes.
You can also get XPath expressions and CSS selectors directly from the browser's developer tools. Right-click on an element in the Elements panel, open the "Copy" menu, and choose "Copy selector" or "Copy XPath" to get the related selector for the selected element. Let's scrape some data to see how they work.
Scrape a Single Element With Selenium
First, let's inspect the target website to see its element structure. Open the target website in a web browser like Chrome. Then, right-click the first product and select Inspect to open the developer tools.
You'll see that each product's information is inside a list item tag (li).
The products share the same CSS selectors. To extract the first product, let's use the find_element method, which returns the first matching element.
First, import By, the Selenium class containing all the built-in locator strategies:
# ...
from selenium.webdriver.common.by import By
Extract each target element using its CSS selector (class name). Collect the product details directly into a dictionary and output it. Replace the print line in the previous code with the following scraping logic:
# ...
# extract the elements into a dictionary using the CSS selector
product_data = {
    "Url": driver.find_element(
        By.CSS_SELECTOR, ".woocommerce-LoopProduct-link"
    ).get_attribute("href"),
    "Image": driver.find_element(By.CSS_SELECTOR, ".product-image").get_attribute(
        "src"
    ),
    "Name": driver.find_element(By.CSS_SELECTOR, ".product-name").text,
    "Price": driver.find_element(By.CSS_SELECTOR, ".price").text,
}
# print the extracted data
print(product_data)
Combine the above snippet with the previous basic scraper, and you'll get this complete code:
# import the required libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
# instantiate a Chrome options object
options = webdriver.ChromeOptions()
# set the options to use Chrome in headless mode
options.add_argument("--headless=new")
# initialize an instance of the chrome driver (browser) in headless mode
driver = webdriver.Chrome(
options=options,
)
# visit your target site
driver.get("https://www.scrapingcourse.com/ecommerce")
# extract the elements into a dictionary using the CSS selector
product_data = {
    "Url": driver.find_element(
        By.CSS_SELECTOR, ".woocommerce-LoopProduct-link"
    ).get_attribute("href"),
    "Image": driver.find_element(By.CSS_SELECTOR, ".product-image").get_attribute(
        "src"
    ),
    "Name": driver.find_element(By.CSS_SELECTOR, ".product-name").text,
    "Price": driver.find_element(By.CSS_SELECTOR, ".price").text,
}
# print the extracted data
print(product_data)
# release the resources allocated by Selenium and shut down the browser
driver.quit()
Now, rerun your scraper.py file, and it outputs the first product's details:
{
    'Url': 'https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/',
    'Image': 'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main.jpg',
    'Name': 'Abominable Hoodie',
    'Price': '$69.00',
}
You've extracted your first product with Selenium! But there are more elements on that page, so let's extract them.
Scrape Multiple Elements With Selenium
Remember that each product is inside a separate container (li tag), so there are many containers in the DOM. Scroll through the HTML in the Elements panel, and you'll see that all the containers share the same class name (.product):
This time, extract all the element containers (the li tags) with the find_elements method, which returns a list of all elements matching a given selector:
# ...
# extract all the product containers
products = driver.find_elements(By.CSS_SELECTOR, ".product")
Create an empty list to collect the final data. Then, loop through the product containers to extract the desired data from each one and append it to the extracted_products list. Modify the previous scraping logic like so:
# ...
# declare an empty list to collect the extracted data
extracted_products = []
# loop through the product containers
for product in products:
    # extract the elements into a dictionary using the CSS selector
    product_data = {
        "Url": product.find_element(
            By.CSS_SELECTOR, ".woocommerce-LoopProduct-link"
        ).get_attribute("href"),
        "Image": product.find_element(By.CSS_SELECTOR, ".product-image").get_attribute(
            "src"
        ),
        "Name": product.find_element(By.CSS_SELECTOR, ".product-name").text,
        "Price": product.find_element(By.CSS_SELECTOR, ".price").text,
    }
    # append the extracted data to the extracted_products list
    extracted_products.append(product_data)
Finally, print extracted_products to see the extracted data:
# ...
# print the extracted data
print(extracted_products)
Modify the previous scraper with the new changes. Here's what your new scraper file looks like:
# import the required libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
# instantiate a Chrome options object
options = webdriver.ChromeOptions()
# set the options to use Chrome in headless mode
options.add_argument("--headless=new")
# initialize an instance of the chrome driver (browser) in headless mode
driver = webdriver.Chrome(
options=options,
)
# visit your target site
driver.get("https://www.scrapingcourse.com/ecommerce")
# extract all the product containers
products = driver.find_elements(By.CSS_SELECTOR, ".product")
# declare an empty list to collect the extracted data
extracted_products = []
# loop through the product containers
for product in products:
    # extract the elements into a dictionary using the CSS selector
    product_data = {
        "Url": product.find_element(
            By.CSS_SELECTOR, ".woocommerce-LoopProduct-link"
        ).get_attribute("href"),
        "Image": product.find_element(By.CSS_SELECTOR, ".product-image").get_attribute(
            "src"
        ),
        "Name": product.find_element(By.CSS_SELECTOR, ".product-name").text,
        "Price": product.find_element(By.CSS_SELECTOR, ".price").text,
    }
    # append the extracted data to the extracted_products list
    extracted_products.append(product_data)
# print the extracted data
print(extracted_products)
# release the resources allocated by Selenium and shut down the browser
driver.quit()
Execute the scraper.py file, and you'll get the following output containing all the products on the first page:
[
{
'Url': 'https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/',
'Image': 'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main.jpg',
'Name': 'Abominable Hoodie',
'Price': '$69.00'
},
# ... other products omitted for brevity
{
'Url': 'https://www.scrapingcourse.com/ecommerce/product/artemis-running-short/',
'Image': 'https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wsh04-black_main.jpg',
'Name': 'Artemis Running Short',
'Price': '$45.00'
}
]
Good job! You now know how to extract multiple elements using Selenium.
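One thing to keep in mind with the loop above: find_element raises a NoSuchElementException when a container lacks the requested element, which stops the whole run. A possible workaround, sketched below, is to call find_elements and check whether the returned list is empty. The `text_or_none` helper is a hypothetical name, and the "css selector" string is the value behind By.CSS_SELECTOR.

```python
def text_or_none(container, css_selector):
    """Return the text of the first match inside container, or None if nothing matches."""
    # find_elements returns an empty list instead of raising when nothing matches;
    # "css selector" is the string value of By.CSS_SELECTOR
    matches = container.find_elements("css selector", css_selector)
    return matches[0].text if matches else None
```

Inside the loop, you could then write `"Price": text_or_none(product, ".price")` so that a product without a price yields None instead of an exception.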
Step #4: Export Data to CSV
Your scraping task is only complete when you store your data. Saving the extracted data in a CSV prepares it for processing or sharing with others.
To start, import Python's built-in csv module, which lets you write dictionaries as individual CSV rows. Specify the CSV file name and open the file in "write" mode. Write the header row using the dictionary keys (writer.writeheader). Then, write each dictionary to a new row (writer.writerows):
# ...
import csv
# ...
# specify the CSV file name
csv_file = "products.csv"
# write the extracted data to the CSV file
with open(csv_file, mode="w", newline="", encoding="utf-8") as file:
    # write the headers
    writer = csv.DictWriter(file, fieldnames=["Url", "Image", "Name", "Price"])
    writer.writeheader()
    # write the rows
    writer.writerows(extracted_products)
Finally, print a success message to confirm that the data has been written to a CSV:
# ...
# confirm that the data has been written to the CSV file
print(f"Data has been written to {csv_file}")
Combine the above snippets with the previous code. Here's the final code:
# import the required libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
import csv
# instantiate a Chrome options object
options = webdriver.ChromeOptions()
# set the options to use Chrome in headless mode
options.add_argument("--headless=new")
# initialize an instance of the chrome driver (browser) in headless mode
driver = webdriver.Chrome(
options=options,
)
# visit your target site
driver.get("https://www.scrapingcourse.com/ecommerce")
# extract all the product containers
products = driver.find_elements(By.CSS_SELECTOR, ".product")
# declare an empty list to collect the extracted data
extracted_products = []
# loop through the product containers
for product in products:
    # extract the elements into a dictionary using the CSS selector
    product_data = {
        "Url": product.find_element(
            By.CSS_SELECTOR, ".woocommerce-LoopProduct-link"
        ).get_attribute("href"),
        "Image": product.find_element(By.CSS_SELECTOR, ".product-image").get_attribute(
            "src"
        ),
        "Name": product.find_element(By.CSS_SELECTOR, ".product-name").text,
        "Price": product.find_element(By.CSS_SELECTOR, ".price").text,
    }
    # append the extracted data to the extracted_products list
    extracted_products.append(product_data)
print(extracted_products)
# specify the CSV file name
csv_file = "products.csv"
# write the extracted data to the CSV file
with open(csv_file, mode="w", newline="", encoding="utf-8") as file:
    # write the headers
    writer = csv.DictWriter(file, fieldnames=["Url", "Image", "Name", "Price"])
    writer.writeheader()
    # write the rows
    writer.writerows(extracted_products)
# confirm that the data has been written to the CSV file
print(f"Data has been written to {csv_file}")
# release the resources allocated by Selenium and shut down the browser
driver.quit()
The above code creates a products.csv file containing the extracted data. You'll see that file inside your project root directory:
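If you plan to process the export further, a quick sanity check is to read the file back. The sketch below assumes the same four columns used above and converts price strings such as '$69.00' into Decimal values. The `load_products` helper is a hypothetical name, and the demo writes a throwaway sample file so it doesn't depend on a real scraping run.

```python
import csv
import tempfile
from decimal import Decimal
from pathlib import Path


def load_products(csv_path):
    """Read the exported rows back and convert 'Price' strings such as '$69.00' to Decimal."""
    with open(csv_path, newline="", encoding="utf-8") as file:
        rows = list(csv.DictReader(file))
    for row in rows:
        # drop the currency symbol and thousands separators before converting
        row["Price"] = Decimal(row["Price"].replace("$", "").replace(",", ""))
    return rows


# demo with a throwaway file that mimics the layout of products.csv
with tempfile.TemporaryDirectory() as folder:
    sample_path = Path(folder) / "products.csv"
    with open(sample_path, mode="w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=["Url", "Image", "Name", "Price"])
        writer.writeheader()
        writer.writerow(
            {
                "Url": "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/",
                "Image": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main.jpg",
                "Name": "Abominable Hoodie",
                "Price": "$69.00",
            }
        )
    sample_rows = load_products(sample_path)

print(sample_rows[0]["Name"], sample_rows[0]["Price"])
```

Decimal is used rather than float here so that prices keep their exact two-decimal representation.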
That works! You've just learned the basic steps of scraping data with Selenium in Python. Now, there are many other things you can do with Selenium. Keep reading!
How to Interact With a Web Page as in a Browser
Selenium lets you execute JavaScript and simulate many web interactions, including scrolling, clicking, hovering, filling out a form, dragging and dropping, logging in, and more. This feature allows your scraper to interact with dynamic pages like a human user and can help you scrape without triggering anti-bot measures.
This section will teach you the most useful browser interactions you'll often encounter while scraping with Selenium.
Scrolling
Scrolling is handy when scraping a website that renders content dynamically with infinite scrolling.
Websites using infinite scrolling use AJAX to load more content as you scroll down the page. You can simulate continuous scrolling with Selenium and extract all the content on that page.
We'll use the infinite scrolling demo page. See what the page looks like:
To extract content from a web page that uses infinite scrolling, you'll instruct the browser to scroll continuously down the page. A nifty way to achieve that is to run the scrolling instruction using JavaScript inside Selenium's execute_script method.
The logic behind implementing infinite scrolling in Selenium is to get the initial page height and initiate the scrolling action inside a while loop. Since content doesn't load immediately after scrolling, use time.sleep to pause until it loads before scrolling again.
After each scroll, compare the new page height with the previous one and update the stored value. When the height stops increasing, break the loop and execute your scraping logic.
Here's a sample code that scrolls until no new content loads and then extracts all the content from the target page:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
# instantiate a Chrome options object
options = webdriver.ChromeOptions()
# set the options to use Chrome in headless mode
options.add_argument("--headless=new")
# initialize an instance of the Chrome driver (browser) in headless mode
driver = webdriver.Chrome(
options=options,
)
# visit your target site
driver.get("https://www.scrapingcourse.com/infinite-scrolling")
# get the initial scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # wait for more elements to load after scrolling
    time.sleep(5)
    # get the new scroll height after scrolling
    new_height = driver.execute_script("return document.body.scrollHeight")
    # check if new content has loaded
    if new_height == last_height:
        # if no new content is loaded, break the loop
        break
    # update the last height
    last_height = new_height
# extract all product containers
products = driver.find_elements(By.CSS_SELECTOR, ".product-item")
# declare an empty list to collect the extracted data
extracted_products = []
# loop through each product container to extract details
for product in products:
    product_data = {
        "Name": product.find_element(By.CSS_SELECTOR, ".product-name").text,
        "Price": product.find_element(By.CSS_SELECTOR, ".product-price").text,
    }
    extracted_products.append(product_data)
# output the data
print(extracted_products)
# release the resources allocated by Selenium and shut down the browser
driver.quit()
You can run the above scraper in non-headless mode to view the scrolling action in the GUI.
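On a feed that never ends, the while True loop above would keep scrolling indefinitely. One hedged variation is to cap the number of scrolls. The sketch below separates the loop logic from the browser by taking two callables; `scroll_until_stable`, `max_rounds`, and the default values are hypothetical choices for illustration.

```python
import time


def scroll_until_stable(get_height, scroll_down, pause=5.0, max_rounds=50):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls that loaded new content.
    """
    last_height = get_height()
    loaded_rounds = 0
    for _ in range(max_rounds):
        scroll_down()
        # give the page time to fetch and render the next batch
        time.sleep(pause)
        new_height = get_height()
        if new_height == last_height:
            # the height did not change, so assume no more content is coming
            break
        last_height = new_height
        loaded_rounds += 1
    return loaded_rounds


# with Selenium, the two callables could be:
# scroll_until_stable(
#     lambda: driver.execute_script("return document.body.scrollHeight"),
#     lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"),
# )
```

Passing the height check and the scroll action as callables also makes the loop easy to try out without launching a browser.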
Wait for an Element
Many sites rely on API calls to get the data they need. After the first load, they perform many asynchronous XHR requests to load content via AJAX in JavaScript. Such content isn't immediately available in the DOM, so trying to scrape it too early may raise a NoSuchElementException.
For example, look at the "Network" tab of the DevTools window on the infinite scroll demo page below. In the "Fetch/XHR" section, you can see the AJAX requests performed by the page as you scroll down:
Always use the developer tools to understand what a target page does and how JavaScript manipulates its DOM. Remember that a website can rely on JavaScript to render part or all of its pages.
With JS-rendered pages, you can't immediately start scraping data. That's because the DOM will only be ready after some time. In other words, you have to wait until JavaScript loads the element you want to scrape.
You have three ways to scrape data from such pages using Selenium:
- time.sleep(): Built-in Python function that pauses the script for a fixed number of seconds before selecting elements. It's a blanket pause that can be suitable when you're unsure which elements to wait for. For instance, it can be handy during infinite scrolling. However, fixed delays are inefficient: waiting too long slows down your scraper, while waiting too little may not give the required element time to load, resulting in errors.
- implicitly_wait: It sets a global timeout that applies to every element lookup, so the driver keeps polling the DOM until the element is present or the timeout expires. The implicitly_wait method is helpful if you want to apply the same wait time to all elements. However, it offers less control, as it doesn't target specific elements and only waits for their presence without considering their visibility or availability for interaction.
- WebDriverWait: Also called an explicit wait, this approach pauses until a specific condition is met before proceeding with further actions. You should opt for the Selenium explicit wait (WebDriverWait) approach because it's more flexible and allows you to wait only as long as required. It's helpful when you want to wait for specific elements under a given condition.
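To make the difference between a fixed sleep and an explicit wait more concrete, here is a simplified stand-in for the polling idea behind WebDriverWait. It isn't Selenium's implementation; `wait_until` is a hypothetical helper, and the 0.5-second interval simply mirrors WebDriverWait's default poll frequency.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # stop as soon as the condition holds instead of sleeping for a fixed time
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

Unlike time.sleep(5), this returns as soon as the condition becomes true, so a page that loads in one second costs roughly one second of waiting rather than five.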
Let's see the explicit wait technique using the JS Rendering demo page, a good example of a page that relies on AJAX calls.
Assume you want to wait for all the products to load before extracting product names and prices:
To extract the target data, add WebDriverWait and expected_conditions to your imports. Then, wait up to 5 seconds for all the products to become visible before scraping their content:
# import the required libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# instantiate a Chrome options object
options = webdriver.ChromeOptions()
# set the options to use Chrome in headless mode
options.add_argument("--headless=new")
# initialize an instance of the Chrome driver (browser) in headless mode
driver = webdriver.Chrome(options=options)
# visit your target site
driver.get("https://www.scrapingcourse.com/javascript-rendering")
# wait up to 5 seconds until all the product items are visible
element = WebDriverWait(driver, 5).until(
EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".product-item"))
)
# you are now sure that the product grid has loaded
# and can scrape it
products = driver.find_elements(By.CSS_SELECTOR, ".product-item")
extracted_products = []
for product in products:
    product_data = {
        "name": product.find_element(By.CSS_SELECTOR, ".product-name").text,
        "price": product.find_element(By.CSS_SELECTOR, ".product-price").text,
    }
    extracted_products.append(product_data)
print(extracted_products)
# release the resources allocated by Selenium and shut down the browser
driver.quit()
The code extracts the specified data, as shown:
[
{'name': 'Chaz Kangeroo Hoodie', 'price': '$52'},
#... other products omitted for brevity
{'name': 'Ajax Full-Zip Sweatshirt', 'price': '$69'}
]
You can wait for several expected_conditions. These are some popular ones:
- title_contains: Until the page title contains a specific string.
- presence_of_element_located: Until an HTML element is present in the DOM.
- visibility_of_element_located: Until an element already in the DOM becomes visible.
- text_to_be_present_in_element: Until the element contains a particular text.
- element_to_be_clickable: Until an HTML element is clickable.
- alert_is_present: Until a JavaScript native alert shows up.
- visibility_of_all_elements_located: Until all elements matching the locator are visible.
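When none of the built-in conditions fits, WebDriverWait's until method also accepts any callable that takes the driver and returns a truthy value once the wait should end. As a rough sketch, the hypothetical `at_least_n_elements` factory below builds such a condition; the name and the threshold idea are assumptions for illustration.

```python
def at_least_n_elements(css_selector, minimum):
    """Build a condition that passes once at least `minimum` elements match the selector."""

    def condition(driver):
        # "css selector" is the string value of By.CSS_SELECTOR
        elements = driver.find_elements("css selector", css_selector)
        # returning False tells WebDriverWait to keep polling
        return elements if len(elements) >= minimum else False

    return condition
```

You could then write something like `WebDriverWait(driver, 5).until(at_least_n_elements(".product-item", 12))`, where 12 is an arbitrary example count, and until would return the matching elements.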
You can also wait for an entire page to load in Selenium. In the next section, we'll see how that works.
Wait for a Page to Load
Waiting for a page to load in Selenium is related to waiting for elements to load. But this time, it involves waiting for the page to load all its resources, including CSS, JavaScript, and images. This technique is useful when scraping a website that dynamically loads all or part of its content.
This approach can also help solve anti-bot JavaScript challenges, which may require waiting for specific conditions before accessing the page content. You can also leverage it to solve CAPTCHAs. For instance, it can ensure that a CAPTCHA box loads before taking its screenshot and sending it to a CAPTCHA-solving service.
You can wait for a page to load in two ways:
- time.sleep(): After opening the URL, pause code execution to wait for all resources to load. The drawback of this method is that it's hard to know in advance how long the page takes to load.
- document.readyState: Execute a script that checks whether document.readyState equals "complete" before interacting further with the DOM. This method uses an explicit wait (WebDriverWait) with a custom condition to check whether the page document and all its resources have finished loading. Again, this method is better since it offers more flexibility.
Let's use the document.readyState condition with WebDriverWait. The code below passes an anonymous function that runs JavaScript to read document.readyState and waits up to 10 seconds for it to return complete:
# ...
# visit your target site
driver.get("https://www.scrapingcourse.com/javascript-rendering")
# wait up to 10 seconds until the document is fully ready
WebDriverWait(driver, 10).until(
lambda driver: driver.execute_script("return document.readyState") == "complete"
)
# ... your scraping logic
The above code raises a TimeoutException if that condition isn't met within the specified time limit (10 seconds).
Take Screenshots
Besides scraping text data, Selenium allows you to take screenshots of the visible part of the page, a specific element, or the entire page. This feature is helpful for debugging, backing up your findings with visual evidence, or reviewing UI choices. For example, you can take screenshots to check how competitors present products on their sites.
The code below shows how to screenshot the visible part of a page in Selenium. You can add the previous explicit wait method to wait for content to load before taking your screenshot:
# ...
# open the target website
driver.get("https://www.scrapingcourse.com/javascript-rendering")
# ... wait for dynamic content to load
# screenshot the visible part of the page
driver.save_screenshot("screenshot.png")
Here's the result:
You can also screenshot a specific element. The code below grabs the description section of this demo product page. Note that we've used a CSS ID selector (#tab-description) this time:
# ...
from selenium.webdriver.common.by import By
# ...
# open the target website
driver.get("https://www.scrapingcourse.com/ecommerce/product/chaz-kangeroo-hoodie/")
# get the element using a CSS ID selector
summary_element = driver.find_element(By.CSS_SELECTOR, "#tab-description")
# screenshot the selected element
summary_element.screenshot("specific_screenshot.png")
See the output below:
To learn more, read our detailed tutorial on how to take screenshots in Selenium.
Click on a Link
Clicking is another helpful feature in Selenium. You can click a button to perform specific actions, like submitting a form or navigating between pages.
Assume you want to click the brand name below to navigate to the site's homepage:
The code below clicks the highlighted brand name and takes a screenshot of the result page:
# ...
from selenium.webdriver.common.by import By
# ...
# open the target website
driver.get("https://www.scrapingcourse.com/javascript-rendering")
# get the brand name element
brand_name = driver.find_element(By.CSS_SELECTOR, ".brand-name")
# perform a click action
brand_name.click()
# screenshot the result page
driver.save_screenshot("homepage-screenshot.png")
The code navigates to the homepage and returns its screenshot:
Fill Out a Form
Selenium's form-filling feature helps automate actions, such as signing up, logging in, filling out a contact form, or launching a search.
Below is how you can use Selenium to fill out the form on the login demo page:
# ...
from selenium.webdriver.common.by import By
# ...
# open the target website
driver.get("https://www.scrapingcourse.com/login")
# retrieve the form elements
email_input = driver.find_element(By.ID, "email")
password_input = driver.find_element(By.ID, "password")
submit_button = driver.find_element(By.ID, "submit-button")
# fill out the form elements
email_input.send_keys("user@example.com")  # placeholder address; replace with your login email
password_input.send_keys("password")
# submit the form and log in
submit_button.click()
This feature is also handy for scraping data hidden behind a login.
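When the login is a real account rather than a demo, you may prefer not to hard-code credentials in the script. One possible approach, sketched below, reads them from environment variables. The `SCRAPER_EMAIL` and `SCRAPER_PASSWORD` names and the `load_credentials` helper are hypothetical.

```python
import os


def load_credentials(env=None):
    """Read login details from hypothetical SCRAPER_EMAIL / SCRAPER_PASSWORD variables."""
    env = os.environ if env is None else env
    missing = [name for name in ("SCRAPER_EMAIL", "SCRAPER_PASSWORD") if not env.get(name)]
    if missing:
        # fail early with a clear message instead of submitting an empty form
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return env["SCRAPER_EMAIL"], env["SCRAPER_PASSWORD"]
```

The returned pair could then be passed to send_keys, for example `email, password = load_credentials()` followed by `email_input.send_keys(email)`.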
Execute JavaScript Directly Within the Browser
Selenium provides access to all browser functionalities, including launching JavaScript instructions.
The execute_script() method enables you to execute JavaScript instructions synchronously. That's particularly helpful when the features provided by Selenium aren't enough to achieve your goal.
Assume you want to screenshot the description box on this demo product page. If it isn't in the viewport when the screenshot is taken, the result will be a blank image.
To avoid that, use the window.scrollBy() JavaScript function to scroll to the element's position before taking the screenshot.
Here's how to do that:
# ...
# open the target website
driver.get("https://www.scrapingcourse.com/ecommerce/product/chaz-kangeroo-hoodie/")
# select the desired element
card = driver.find_element(By.ID, "tab-description")
# retrieve the y position of the selected element on the page
card_y_location = card.location["y"]
# subtract 100 pixels to leave some extra space above the element
# and ensure the screenshot is taken correctly
scroll_script = f"window.scrollBy(0, {card_y_location} - 100);"
# execute JavaScript
driver.execute_script(scroll_script)
driver.save_screenshot("scrolled-element-screenshot.png")
And here's the output:
Thanks to JavaScript's return keyword, you can also use execute_script() to pass values from the browser back to your Python script. See how it works:
title = driver.execute_script("return document.title")
print(title) # Chaz Kangeroo Hoodie - Ecommerce Test Site to Learn Web Scraping
The code above passes the value of the page's title from JavaScript to Python's title variable.
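Values can also travel the other way: any extra positional arguments given to execute_script() are exposed to the JavaScript code through the `arguments` array. As a small sketch, the hypothetical `scroll_into_view` helper below hands a located element to the standard scrollIntoView DOM method instead of computing pixel offsets by hand.

```python
def scroll_into_view(driver, element):
    """Scroll the page so that `element` sits in the middle of the viewport."""
    # extra positional arguments to execute_script are exposed to JavaScript as `arguments`
    driver.execute_script(
        "arguments[0].scrollIntoView({block: 'center'});", element
    )
```

For the description box from the earlier example, this could be called as `scroll_into_view(driver, card)` before taking the screenshot.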
Customize Window Size
Modern sites are responsive and adapt their layout to the user's screen or browser window size. Depending on the available space, they may show or hide elements, often using JavaScript. Selenium allows you to change the browser window's initial size, enabling you to reveal content that might be hidden in the initial viewport.
You can achieve this in two ways:
- options.add_argument("--window-size=<width>,<height>").
- set_window_size(<width>, <height>).
See them in action in the example snippets below.
Using options.add_argument("--window-size=<width>,<height>"):
# method 1: using options.add_argument("--window-size=<width>,<height>")
options = webdriver.ChromeOptions()
# set the initial window size
options.add_argument("--window-size=800,600")
driver = webdriver.Chrome(options=options)
# print the window size
print(driver.get_window_size()) # {"width": 800, "height": 600}
Using set_window_size(<width>, <height>):
# method 2: using set_window_size(<width>, <height>)
driver = webdriver.Chrome(options=options)
# set the window size
driver.set_window_size(1920, 1200)
# print the window size
print(driver.get_window_size()) # {'width': 1920, 'height': 1200}
We used get_window_size() to check the width and height of the current window. That comes in handy in several scenarios, like ensuring the browser window has the correct size before taking a screenshot.
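To see how a responsive page changes across viewports, you could combine both methods in a small loop. The sketch below is only an illustration: the helper name screenshot_at_sizes and the file naming scheme are assumptions, and driver is assumed to be a WebDriver that has already opened the target page:

```python
def screenshot_at_sizes(driver, sizes, prefix="screenshot"):
    """Resize the window to each (width, height) pair and save one screenshot per size."""
    saved_files = []
    for width, height in sizes:
        driver.set_window_size(width, height)
        # read back the size the browser actually applied
        actual = driver.get_window_size()
        filename = f"{prefix}-{actual['width']}x{actual['height']}.png"
        driver.save_screenshot(filename)
        saved_files.append(filename)
    return saved_files


# usage, assuming `driver` has already loaded a page:
# screenshot_at_sizes(driver, [(800, 600), (1920, 1200)])
```

Naming each file after the size reported by get_window_size() makes it easier to spot cases where the browser didn't apply the requested dimensions.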
Get Around Anti-Scraping Protections With Selenium in Python
You now know how to do web scraping using Selenium in Python. Yet, retrieving data from the web is a challenge, as some sites adopt anti-bot technologies that might detect your scraper as a bot and block it.
Try to scrape data from a Cloudflare-protected website like G2 Reviews using Selenium:
# import the required libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# run Chrome in headless mode
options = Options()
options.add_argument("--headless=new")
# start a driver instance
driver = webdriver.Chrome(options=options)
# open the target website
driver.get("https://www.g2.com/products/asana/reviews")
# save a screenshot to see what happens
driver.save_screenshot("g2-reviews-screenshot.png")
# release the resources allocated by Selenium and shut down the browser
driver.quit()
The above code gets blocked, as shown:
Getting blocked by anti-bots is a huge setback to your data retrieval process. You may build the best Selenium Python web scraping script possible, but it'll be a pointless effort if it keeps getting detected and blocked!
For effective web scraping without getting blocked, consider adopting ZenRows, an all-in-one web scraping API, which will save you stress and allow you to easily bypass all anti-bot protections.
Still, let's go through a few manual techniques to make your scraper less prone to blocks.
Change IP Using a Proxy
A proxy service sends requests on your behalf and increases your chances of bypassing IP bans due to rate limiting and geo-restrictions.
To see how proxy implementation works in Selenium, grab a free proxy from the Free Proxy List and add it to your scraper, as shown in the code below. In this example, you'll scrape https://httpbin.io/ip, a test website that returns your current IP address:
# import the required libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
# create a Chrome options instance
options = Options()
# set the proxy address
proxy_server_ip = "http://34.143.221.240:8103"
# add the address to Chrome options
options.add_argument(f"--proxy-server={proxy_server_ip}")
# set the options to use Chrome in headless mode
options.add_argument("--headless")
# start a driver instance
driver = webdriver.Chrome(options=options)
# open the target website
driver.get("https://httpbin.io/ip")
# print your current IP address
print(driver.find_element(By.TAG_NAME, "body").text)
# release the resources allocated by Selenium and shut down the browser
driver.quit()
As expected, the above code outputs the proxy's IP address:
{
"origin": "34.143.221.240:43094"
}
The proxy used in the above example is a free one from the Free Proxy List website, and it may not work at the time of reading. Feel free to grab a new one.
Your best bet is to use premium web scraping proxies for commercial web scraping projects. Premium residential proxies are more reliable because their IPs come from network providers and belong to everyday internet users.
Your ideal proxy choice should provide advanced features like IP auto-rotation and geo-targeting and only charge you for successful requests. An example of such a service is ZenRows. In addition to residential IP auto-rotation and flexible geo-targeting, it offers extra valuable features, including anti-bot and CAPTCHA auto-bypass under the same plan.
However, Selenium's proxy support is limited because it doesn't allow proxy authentication, which you'll need while using premium proxies. You can work around that limitation with an alternative like selenium-wire.
Check our full tutorial on setting up a proxy with Selenium to learn more.
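As a rough sketch of the selenium-wire approach, you can pass the proxy credentials through its seleniumwire_options dictionary, which accepts a "proxy" entry with http and https URLs. The helper names build_proxy_options and create_driver are hypothetical, and the host, port, username, and password are placeholders you'd replace with the details from your proxy provider:

```python
def build_proxy_options(host, port, username, password):
    """Build a selenium-wire options dict for an authenticated proxy."""
    # placeholder credentials are embedded in the proxy URL
    proxy_url = f"http://{username}:{password}@{host}:{port}"
    return {
        "proxy": {
            "http": proxy_url,
            "https": proxy_url,
            # don't route local traffic through the proxy
            "no_proxy": "localhost,127.0.0.1",
        }
    }


def create_driver(seleniumwire_options):
    """Start headless Chrome through selenium-wire with the given options."""
    # imported here so build_proxy_options() works without selenium-wire installed
    from selenium.webdriver.chrome.options import Options
    from seleniumwire import webdriver

    options = Options()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options, seleniumwire_options=seleniumwire_options)


# usage with placeholder values:
# driver = create_driver(build_proxy_options("proxy.example.com", 8080, "user", "pass"))
# driver.get("https://httpbin.io/ip")
```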
Add Real Headers
The request headers tell the server about your scraper, and it's one of the parameters that anti-bots use to detect bots.
Automation software, including Selenium, sends bot-like default headers, such as the `HeadlessChrome` flag in the User Agent string. Such headers signal to the anti-bot system that your request is suspicious and may get you blocked.
You can replace Selenium's request headers with those of an actual browser to boost the chances of success.
Since the User Agent is among the most essential request headers for web scraping, let's see how to set that with Selenium using the following code. The code requests https://httpbin.io/user-agent to show your current User Agent header:
# import the required libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
# create a Chrome options instance
options = Options()
# define a custom User Agent
custom_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
# add the User Agent to Chrome options
options.add_argument(f"user-agent={custom_user_agent}")
# set the options to use Chrome in headless mode
options.add_argument("--headless=new")
# start a driver instance
driver = webdriver.Chrome(options=options)
# open the target website
driver.get("https://httpbin.io/user-agent")
# print the current User Agent
print(driver.find_element(By.TAG_NAME, "body").text)
# release the resources allocated by Selenium and shut down the browser
driver.quit()
Here's the output:
{
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
}
Changing the User Agent might be counterproductive if you forget to adjust other headers. For instance, the sec-ch-ua header also sends your browser's version, so it must match the User Agent version.
See what sec-ch-ua looks like on Google Chrome version 126:
"\"Not/A)Brand\";v=\"8\", \"Chromium\";v=\"126\", \"Google Chrome\";v=\"126\""
At the same time, older versions don't send that header, so adding it in older browser versions might be suspicious.
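Before sending custom headers, it can help to verify that they agree with each other. The following sketch checks that the Chromium and Google Chrome brands in a sec-ch-ua value carry the same major version as the User Agent. Both function names are hypothetical, and the check assumes a Google Chrome User Agent:

```python
import re


def chrome_major_version(user_agent):
    """Return the Chrome major version found in a User Agent string, or None."""
    match = re.search(r"Chrome/(\d+)\.", user_agent)
    return int(match.group(1)) if match else None


def sec_ch_ua_matches(user_agent, sec_ch_ua):
    """Check that the Chromium and Google Chrome brands match the User Agent version."""
    major = chrome_major_version(user_agent)
    if major is None:
        return False
    # each brand looks like: "Brand name";v="126"
    brands = dict(re.findall(r'"([^"]+)";v="(\d+)"', sec_ch_ua))
    checked = [brands.get("Chromium"), brands.get("Google Chrome")]
    # a missing brand counts as a mismatch
    return all(version == str(major) for version in checked)


user_agent = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
)
sec_ch_ua = '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"'
print(sec_ch_ua_matches(user_agent, sec_ch_ua))
```

A check like this is only a sanity test for your own configuration; it doesn't guarantee that an anti-bot system will accept the headers.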
Although you've set a custom User Agent in the snippet above, you can't do the same for the complete set of request headers because Selenium doesn't support full request header customization.
That's where a third-party Python library like selenium-wire comes into play. It extends Selenium to give you access to the underlying processes made by the browser and let you intercept requests, update the headers, or add new ones.
Install it using pip:
pip install blinker==1.7.0 selenium-wire
Selenium-wire depends on blinker==1.7.0. To ensure you can run selenium-wire smoothly, install it with the pinned blinker dependency, as done above.
Using selenium-wire, let's set the user-agent, sec-ch-ua, and referer headers and visit https://httpbin.io/headers to view your current request headers.
Below is the code to achieve that. It defines a request interceptor that deletes the existing request headers to avoid duplicates and sets new ones:
# import the required libraries
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from seleniumwire import webdriver
# specify your headers
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
sec_ch_ua = '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"'
referer = "https://www.google.com"
def interceptor(request):
    # delete the target headers first
    del request.headers["user-agent"]
    del request.headers["sec-ch-ua"]
    del request.headers["referer"]
    # set new headers
    request.headers["user-agent"] = user_agent
    request.headers["sec-ch-ua"] = sec_ch_ua
    request.headers["referer"] = referer
# run Chrome in headless mode
options = Options()
# set the options to use Chrome in headless mode
options.add_argument("--headless=new")
# start a driver instance
driver = webdriver.Chrome(options=options)
# set the selenium-wire interceptor
driver.request_interceptor = interceptor
# open the target website
driver.get("https://httpbin.io/headers")
print(driver.find_element(By.TAG_NAME, "body").text)
# release the resources allocated by Selenium and shut down the browser
driver.quit()
See the output below. The customized headers are Referer, Sec-Ch-Ua, and User-Agent:
{
"headers": {
"Accept": [
"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"
],
"Accept-Encoding": [
"gzip, deflate, br, zstd"
],
"Connection": [
"keep-alive"
],
"Host": [
"httpbin.io"
],
"Referer": [
"https://www.google.com"
],
"Sec-Ch-Ua": [
"\"Not/A)Brand\";v=\"8\", \"Chromium\";v=\"126\", \"Google Chrome\";v=\"126\""
],
"Sec-Ch-Ua-Mobile": [
"?0"
],
"Sec-Ch-Ua-Platform": [
"\"Windows\""
],
"Sec-Fetch-Dest": [
"document"
],
"Sec-Fetch-Mode": [
"navigate"
],
"Sec-Fetch-Site": [
"none"
],
"Sec-Fetch-User": [
"?1"
],
"Upgrade-Insecure-Requests": [
"1"
],
"User-Agent": [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
]
}
}
Read our guide on setting custom request headers with Selenium for a detailed tutorial.
Hidden Elements to Detect Scrapers
Some websites rely on honeypot traps, elements that are only visible to bots but not real users, to detect and block scrapers.
Assume your target site contains an invisible honeypot link such as the one below:
<a href="https://your-target-site/honeypot-page" style="display: none">Click here</a>
A common scraping task is retrieving and following links (<a> tags) to extract their URLs for crawling purposes. However, you might scrape and follow the honeypot link together with the other real links, causing the anti-bot system to detect and block your Selenium web scraper.
The web element exposes an is_displayed() method that allows you to verify whether the HTML element is visible to the user. This method returns True if an element is visible and False if not.
In the scenario above, you could use it to filter out non-visible links this way:
a_elements = driver.find_elements(By.TAG_NAME, "a")
# filter out non-visible links
visible_a_elements = [e for e in a_elements if e.is_displayed()]
The above code ensures that you only scrape the visible links on the target website.
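If you're collecting URLs to crawl, you could extend the same filter to return the href values directly. This is a minimal sketch: the helper name visible_link_urls is hypothetical, and link_elements is assumed to be the list returned by find_elements():

```python
def visible_link_urls(link_elements):
    """Return the href of every link that is visible and has a URL."""
    urls = []
    for element in link_elements:
        # skip links the user can't see, such as those hidden with display: none
        if not element.is_displayed():
            continue
        href = element.get_attribute("href")
        # skip anchors without a destination
        if href:
            urls.append(href)
    return urls


# usage, assuming `driver` has already loaded a page:
# a_elements = driver.find_elements(By.TAG_NAME, "a")
# print(visible_link_urls(a_elements))
```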
Save Resources While Web Scraping With Selenium
Selenium gives you access to standard browser capabilities that help you take your scraping process to the next level. However, you may not need all those features all the time.
For instance, if you don't need to take screenshots, loading images will only cost you extra network resources. Besides, images make up a significant percentage of the page's total weight!
Thankfully, Selenium offers a solution to this and similar issues. Blocking specific resources speeds up page loads, saves bandwidth, and reduces the tracking scripts your browser runs. This technique is especially beneficial when scaling your Selenium Python scraping operations.
Let's look at some real-world examples.
Block Images
To prevent the browser from loading images in Selenium, turn off image loading through a setting of Blink, Chrome's rendering engine.
Let's see how to achieve that by screenshotting the e-commerce demo site using the code below:
# import the required libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# run Chrome in headless mode
options = Options()
# disable image loading through Blink settings
options.add_argument("--blink-settings=imagesEnabled=false")
# set the options to use Chrome in headless mode
options.add_argument("--headless=new")
# start a driver instance
driver = webdriver.Chrome(options=options)
# open the target website
driver.get("https://www.scrapingcourse.com/ecommerce")
# take a screenshot of the page
driver.save_screenshot("screenshot_with-no-image.png")
# release the resources allocated by Selenium and shut down the browser
driver.quit()
The above code outputs the following screenshot, showing unloaded image placeholders:
Block JavaScript
Disallowing JavaScript execution can help prevent anti-bot tactics, like client-side JavaScript challenges, from running in the background. However, use this technique cautiously, as it can prevent dynamic websites from rendering correctly.
You can stop the browser controlled by Selenium from running JavaScript by adding the following to your Chrome Options:
# ...
# disallow JavaScript execution
options.add_experimental_option(
    "prefs",
    {"profile.managed_default_content_settings.javascript": 2},
)
Your browser instance will now load pages without executing their JavaScript.
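The same preferences dictionary can carry more than one content setting. The sketch below builds it from two flags. It assumes that the images key behaves like the JavaScript key used above, where the value 2 means "block", and the helper name build_content_prefs is hypothetical:

```python
def build_content_prefs(block_images=False, block_javascript=False):
    """Build a Chrome "prefs" dict that blocks the selected content types."""
    prefs = {}
    if block_images:
        # assumption: 2 means "block", as with the JavaScript setting
        prefs["profile.managed_default_content_settings.images"] = 2
    if block_javascript:
        prefs["profile.managed_default_content_settings.javascript"] = 2
    return prefs


# usage, assuming `options` is your Chrome Options instance:
# options.add_experimental_option(
#     "prefs", build_content_prefs(block_images=True, block_javascript=True)
# )
print(build_content_prefs(block_images=True, block_javascript=True))
```

Keeping both switches in one place makes it easier to toggle them per target site, since some pages won't render correctly without JavaScript.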
Intercept Requests
Thanks to selenium-wire, you can programmatically intercept and alter the course of requests. This functionality can be helpful when you want to block specific image formats based on their file extension.
Let's see how with the following code that blocks png and gif image formats:
# import the required libraries
from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver
def interceptor(request):
    # block only PNG and GIF images
    if request.path.endswith((".png", ".gif")):
        request.abort()
# run Chrome in headless mode
options = Options()
# set the options to use Chrome in headless mode
options.add_argument("--headless")
# start a driver instance
driver = webdriver.Chrome(options=options)
# set the selenium-wire interceptor
driver.request_interceptor = interceptor
# open the target website
driver.get("https://www.scrapingcourse.com/ecommerce/")
# ... your scraping logic
# release the resources allocated by Selenium and shut down the browser
driver.quit()
You can also block specific domains using the exclude_hosts option or allow only specific requests based on URLs matching a regex with driver.scopes.
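If you prefer to keep all blocking rules in the interceptor itself, you could move the decision into a small function and test it separately. In this sketch, the extension list, the host ads.example.com, and the function name should_block are illustrative assumptions:

```python
from urllib.parse import urlsplit

# example rules; adjust them to your target site
BLOCKED_EXTENSIONS = (".png", ".gif", ".woff2")
BLOCKED_HOSTS = {"ads.example.com"}  # hypothetical host


def should_block(url):
    """Return True if the URL points to a blocked host or file type."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in BLOCKED_HOSTS:
        return True
    # compare only the path, so query strings don't hide the extension
    return parts.path.lower().endswith(BLOCKED_EXTENSIONS)


def interceptor(request):
    # abort() ends the request without forwarding it to the server
    if should_block(request.url):
        request.abort()


# usage with a selenium-wire driver:
# driver.request_interceptor = interceptor
```

Separating the rule from the interceptor lets you check your patterns against sample URLs without launching a browser.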
These were just a few examples, but there's much more to learn. Follow our guide to learn how to intercept XHR requests in web scraping.
Conclusion
This step-by-step tutorial covered the most essential knowledge on web scraping in Python using Selenium. You now know how to:
- Set up Selenium in Python.
- Locate elements on a web page.
- Use Selenium to interact with web elements on a page in a browser.
- Run JavaScript code in the browser.
- Avoid anti-bot measures in Selenium.
As you've seen, data extraction involves many challenges, primarily due to websites' anti-bot technologies. Bypassing them is complex and resource-intensive, but you can forget about them with a full-fledged web scraping API like ZenRows. Use it to handle everyday data extraction with a single API request and avoid anti-bot protections.
Frequently Asked Questions
What Is Selenium in Web Scraping?
Selenium is a popular solution for web scraping that allows you to create scripts that interact with web pages through a real browser. Its headless browser capabilities help render JavaScript and avoid getting blocked.
Can Selenium Be Used for Web Scraping?
Even though Selenium is an automation testing tool, you can use it for web scraping. Its ability to interact with web pages and simulate human behavior makes it a popular tool for data extraction.
How to Use Selenium in Python for Web Scraping?
Using Selenium in Python for web scraping involves the following steps:
- Install the Selenium binding for Python with pip install selenium, and download the web driver compatible with your browser.
- Import the Selenium library in your Python code and create a new WebDriver instance.
- Use the driver instance to navigate to the target page.
- Implement the scraping logic and extract data from it.
Is Selenium Good for Web Scraping?
Selenium is an excellent option for web scraping, especially for websites that rely on JavaScript to render the whole page or have dynamic content. However, when not configured correctly, it can be slower and more resource-intensive than other scraping solutions.
How to Scrape a Web Page Using Selenium?
To scrape web pages using Selenium, you'll need to create a WebDriver instance and use it to navigate to your target. Next, employ the library's methods to interact with the page and extract the desired information.