Do you want to extract product data from Amazon using your Selenium-based web scraper? We've got you covered!
In this tutorial, you'll learn step-by-step how to scrape Amazon with Selenium in Python, including best practices to avoid getting blocked.
Let's get right to it!
Build an Amazon Product Scraper With Selenium
In this Python Selenium scraping tutorial, we'll scrape the following Amazon product page.
You'll start with a basic scraper to access the page before scraping the following product information:
- Product name.
- Price.
- Description.
- Images.
- Rating.
Let's start with the prerequisites.
Step #1: Prerequisites
This tutorial assumes you've installed Python on your machine. Otherwise, install the latest version from the Python download page.
In this tutorial, we'll automate the Chrome browser with Selenium. So, in addition to Selenium, you'll need the WebDriverManager to manage the ChromeDriver installation automatically.
Open your command line in your project directory and install Selenium and the WebDriverManager using pip:
pip3 install selenium webdriver-manager
You can follow this tutorial with any suitable IDE. We'll use VS Code for this tutorial.
Did you get everything ready? You're now prepared to scrape some data from Amazon!
Step #2: Access the Amazon Page
We'll first build a basic scraper to access the target product page. This step is essential to check if your Selenium setup works correctly.
Setting up a basic Selenium scraper is simple. Use the ChromeDriverManager to install ChromeDriver and pass it to Selenium's WebDriver Service. The download only happens the first time you run the code; subsequent executions reuse the cached driver:
# import the required libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# install ChromeDriver and set up the driver instance
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
Open the target website and close the driver instance:
# ...
# specify the target URL
target_url = (
"https://www.amazon.com/Logitech-G502-Performance-Gaming-Mouse/dp/B07GBZ4Q68/"
)
# visit the target URL
driver.get(target_url)
# quit the driver instance
driver.quit()
Here's a combination of both snippets:
# import the required libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# install ChromeDriver and set up the driver instance
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# specify the target URL
target_url = (
"https://www.amazon.com/Logitech-G502-Performance-Gaming-Mouse/dp/B07GBZ4Q68/"
)
# visit the target URL
driver.get(target_url)
# quit the driver instance
driver.quit()
The above code will open the target web page in a browser interface (non-headless mode), showing that your Selenium setup works.
However, running a visible browser increases memory overhead and isn't recommended for real-life scraping. For this tutorial, we'll run Selenium in headless mode.
To change the above to headless mode, introduce the ChromeOptions and add the headless option. Then, include that option as an argument in the driver instance:
# ...
# set up Chrome options
options = webdriver.ChromeOptions()
# run Chrome in headless mode
options.add_argument("--headless=new")
# install ChromeDriver and set up the driver instance
driver = webdriver.Chrome(
options=options, service=Service(ChromeDriverManager().install())
)
Modify the previous scraper with these changes, and here's your new basic Selenium scraper:
# import the required libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# set up Chrome options
options = webdriver.ChromeOptions()
# run Chrome in headless mode
options.add_argument("--headless=new")
# install ChromeDriver and set up the driver instance
driver = webdriver.Chrome(
options=options, service=Service(ChromeDriverManager().install())
)
# specify the target URL
target_url = (
"https://www.amazon.com/Logitech-G502-Performance-Gaming-Mouse/dp/B07GBZ4Q68/"
)
# visit the target URL
driver.get(target_url)
# quit the driver instance
driver.quit()
The above scraper now runs the Chrome browser without a user interface. Let's build on it to extract specific content.
Step #3: Scrape Amazon Product Details
To scrape specific product details, you'll need to select HTML elements from the target web page. Selenium lets you locate elements using CSS selectors and other strategies through the By class.
Before going ahead, add the By class to your imports:
# import the required libraries
# ...
from selenium.webdriver.common.by import By
Amazon's selectors often change due to regular DOM structure updates. When following this tutorial, double-check the selectors and update them if necessary.
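Because selectors can break at any time, you may also want to wrap lookups in a small helper that returns None instead of crashing when an element disappears. Here's a minimal sketch (the safe_find helper is our own addition, not part of the tutorial's final code):
# a minimal sketch: avoid crashing when a selector breaks
from selenium.common.exceptions import NoSuchElementException

def safe_find(driver, by, value):
    # try to locate the element; return None if the selector no longer matches
    try:
        return driver.find_element(by, value)
    except NoSuchElementException:
        return None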
Locate and Scrape Product Name
Before scraping the product name, inspect its web element to reveal its CSS selectors. Open the target web page, right-click the product's name, and select Inspect.
The product name is a span tag inside an h2 element with an ID of title:
Now, create a dictionary to collect the extracted data. Extract the product's name into this dictionary using the By.ID locator with the find_element method:
# ...
# extract the product name
product_name = driver.find_element(By.ID, "title").text
# create a dictionary to store scraped product data
data = {
"Name": product_name,
}
# print the extracted data
print(data)
The code outputs the product's name as shown:
{
'Name': 'Logitech G502 HERO High Performance Wired Gaming Mouse, HERO 25K Sensor, 25,600 DPI, RGB, Adjustable Weights, 11 Programmable Buttons, On-Board Memory, PC / Mac'
}
You've just scraped your first Amazon product information! Let's move to the product's price.
Locate and Scrape Product Price
Similarly, let's inspect the price element to view its CSS selector.
Since the target is the actual listing price, not the discounted one, right-click on the product listing price and select Inspect to open its element in the browser console.
The price element is inside a span tag with the class name a-offscreen, as shown:
Using the find_element method to search for the listing price element by the class name a-offscreen returns an empty string. That's because this element is buried inside multiple nodes, and a plain class-name search matches several similar elements on the page.
We'll use JavaScript's querySelector via Selenium's execute_script method to select the price element more precisely. First, query the element using its immediate parent node. Then, extract the listing price text from the query:
# ...
# find the price element with JavaScript's querySelector
price_element = driver.execute_script(
'return document.querySelector(".a-price.a-text-price span.a-offscreen")'
)
# get the text of the listing price
price = driver.execute_script("return arguments[0].textContent", price_element)
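If you'd rather stay within Selenium's own API, an equivalent approach (assuming the same .a-price.a-text-price span.a-offscreen selector still matches) is to locate the element with By.CSS_SELECTOR and read its textContent attribute. The element's .text property returns an empty string here because the span is visually hidden, while textContent is available regardless of visibility:
# ...
# alternative: locate the hidden span directly with Selenium
price_element = driver.find_element(
    By.CSS_SELECTOR, ".a-price.a-text-price span.a-offscreen"
)
# .text is empty for off-screen elements, so read textContent instead
price = price_element.get_attribute("textContent")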
Insert the extracted price text into the data dictionary:
# ...
# create a dictionary to store scraped product data
data = {
# ...,
"Price": price,
}
The code updates the result with the extracted price:
{
# ...
'Price': '$79.99'
}
You now know how to use Selenium's execute_script method to interact with the HTML directly. Keep going!
Locate and Scrape Product Description
Right-click the product's description (the "About this item" section) and click Inspect. You'll see that each description is a list item (li) inside an unordered list (ul):
Extract the unordered list using its CSS selector and collect all its list items (li tags):
# ...
# extract the description list
description_list = driver.find_element(
By.CSS_SELECTOR, "ul.a-unordered-list.a-vertical.a-spacing-mini"
)
# find all list items within the description list
description_items = description_list.find_elements(By.TAG_NAME, "li")
Create an empty description_data list to collect each description as a separate item. Loop through each list item to extract its text content and append it to the empty list:
# ...
# create an empty list to collect the descriptions
description_data = []
# collect and store all product description texts
for item in description_items:
# get the text content of the span within the li
description_text = item.find_element(By.TAG_NAME, "span").text.strip()
description_data.append(description_text)
Add the description_data list to the data dictionary:
# ...
# create a dictionary to store scraped product data
data = {
# ...,
"Description": description_data,
}
The code adds the description data to the output as shown:
{
# ...,
'Description': [
'Hero 25K sensor through a software update from G HUB, this...',
# ... omitted for brevity,
'Microprocessor: 32-bit ARM. Use Logitech G HUB to save your...'
],
}
Locate and Scrape Product Rating
Let's inspect the product rating to expose its elements and CSS selectors. Right-click the rating score below the product name and select Inspect.
The rating score is a span under a parent node with the ID acrPopover:
You can easily extract the review score from this ID:
# ...
# extract the rating score
ratings = driver.find_element(By.ID, "acrPopover").text
Update the data dictionary with this extracted rating score:
# ...
# create a dictionary to store scraped product data
data = {
# ...,
"Rating": ratings,
}
The code now adds the rating score to the extracted product data:
{
# ...,
'Rating': '4.7',
}
There's one more piece of information left to scrape. Keep going!
Locate and Scrape Product Image
Let's collect the product's featured image. Right-click the product's main image and select Inspect.
The featured image tag (img) is inside a div with the ID imgTagWrapperId:
Select the parent element containing the featured image, scrape the image tag from it, and extract the image's src attribute to get its URL:
# ...
# select the div element containing the featured image
image_element = driver.find_element(By.ID, "imgTagWrapperId")
# scrape the image tag from its parent div
product_image = image_element.find_element(By.TAG_NAME, "img")
# get the image src attribute
product_image_url = product_image.get_attribute("src")
Finally, insert the extracted image URL into the data dictionary:
# ...
# create a dictionary to store scraped product data
data = {
# ...,
"Featured Image": product_image_url,
}
The code outputs the featured image URL, as shown:
{
# ...,
'Featured Image': 'https://m.media-amazon.com/images/I/61mpMH5TzkL._AC_SY355_.jpg',
}
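If you also want to save the image file itself, here's a short optional sketch using the requests library (an extra dependency not used in the rest of this scraper; the product_image.jpg filename is arbitrary):
# optional: download the featured image with requests
import requests

response = requests.get(product_image_url, timeout=10)
with open("product_image.jpg", "wb") as f:
    f.write(response.content)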
You've now extracted the target product details. Let's combine all the code snippets to get the following complete code:
# import the required libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
# set up Chrome options
options = webdriver.ChromeOptions()
# run Chrome in headless mode
options.add_argument("--headless=new")
# install ChromeDriver and set up the driver instance
driver = webdriver.Chrome(
options=options, service=Service(ChromeDriverManager().install())
)
# specify the target URL
target_url = (
"https://www.amazon.com/Logitech-G502-Performance-Gaming-Mouse/dp/B07GBZ4Q68/"
)
# visit the target URL
driver.get(target_url)
# extract the product name
product_name = driver.find_element(By.ID, "title").text
# find the price element with JavaScript's querySelector
price_element = driver.execute_script(
'return document.querySelector(".a-price.a-text-price span.a-offscreen")'
)
# get the text of the listing price
price = driver.execute_script("return arguments[0].textContent", price_element)
# extract the description list
description_list = driver.find_element(
By.CSS_SELECTOR, "ul.a-unordered-list.a-vertical.a-spacing-mini"
)
# find all list items within the description list
description_items = description_list.find_elements(By.TAG_NAME, "li")
# create an empty list to collect the descriptions
description_data = []
# collect and store all product description texts
for item in description_items:
# get the text content of the span within the li
description_text = item.find_element(By.TAG_NAME, "span").text.strip()
description_data.append(description_text)
# extract the rating score
ratings = driver.find_element(By.ID, "acrPopover").text
# select the div element containing the featured image
image_element = driver.find_element(By.ID, "imgTagWrapperId")
# scrape the image tag from its parent div
product_image = image_element.find_element(By.TAG_NAME, "img")
# get the image src attribute
product_image_url = product_image.get_attribute("src")
# create a dictionary to store scraped product data
data = {
"Name": product_name,
"Price": price,
"Description": description_data,
"Rating": ratings,
"Featured Image": product_image_url,
}
See the complete output below:
{
'Name': 'Logitech G502 HERO High Performance Wired Gaming Mouse, HERO 25K Sensor, 25,600 DPI, RGB, Adjustable Weights, 11 Programmable Buttons, On-Board Memory, PC / Mac',
'Price': '$79.99',
'Description': [
'Hero 25K sensor through a software update from G HUB, this...',
#... omitted for brevity
'Microprocessor: 32-bit ARM. Use Logitech G HUB to save your...'
],
'Rating': '4.7',
'Featured Image': 'https://m.media-amazon.com/images/I/61mpMH5TzkL._AC_SY355_.jpg',
}
Great job! We'll collect this product data into a CSV file in the next section.
Step #4: Export Data to CSV
The last step is to write the extracted data into a CSV file, allowing you to store the product information for further analysis.
Let's update the previous code to reflect these changes.
First, import Python's built-in csv module. Specify a CSV file name, open a new CSV file in write mode, and insert the extracted data:
# import the required libraries
# ...
import csv
# ...
# define the CSV file name for storing scraped data
csv_file = "product.csv"
# ...
# open the CSV file in write mode with proper encoding
with open(csv_file, mode="w", newline="", encoding="utf-8") as file:
# create a CSV writer object
writer = csv.writer(file)
# write the header row to the CSV file
writer.writerow(data.keys())
# write the data row to the CSV file
writer.writerow(data.values())
# print a confirmation message after successful data extraction and storage
print("Scraping completed and data written to CSV")
Merge the above snippet with the previous scraper. Here's the final code:
# import the required libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import csv
# set up Chrome options
options = webdriver.ChromeOptions()
# run Chrome in headless mode
options.add_argument("--headless=new")
# install ChromeDriver and set up the driver instance
driver = webdriver.Chrome(
options=options, service=Service(ChromeDriverManager().install())
)
# specify the target URL
target_url = (
"https://www.amazon.com/Logitech-G502-Performance-Gaming-Mouse/dp/B07GBZ4Q68/"
)
# visit the target URL
driver.get(target_url)
# extract the product name
product_name = driver.find_element(By.ID, "title").text
# find the price element with JavaScript's querySelector
price_element = driver.execute_script(
'return document.querySelector(".a-price.a-text-price span.a-offscreen")'
)
# get the text of the listing price
price = driver.execute_script("return arguments[0].textContent", price_element)
# extract the description list
description_list = driver.find_element(
By.CSS_SELECTOR, "ul.a-unordered-list.a-vertical.a-spacing-mini"
)
# find all list items within the description list
description_items = description_list.find_elements(By.TAG_NAME, "li")
# create an empty list to collect the descriptions
description_data = []
# collect and store all product description texts
for item in description_items:
# get the text content of the span within the li
description_text = item.find_element(By.TAG_NAME, "span").text.strip()
description_data.append(description_text)
# extract the rating score
ratings = driver.find_element(By.ID, "acrPopover").text
# select the div element containing the featured image
image_element = driver.find_element(By.ID, "imgTagWrapperId")
# scrape the image tag from its parent div
product_image = image_element.find_element(By.TAG_NAME, "img")
# get the image src attribute
product_image_url = product_image.get_attribute("src")
# create a dictionary to store scraped product data
data = {
"Name": product_name,
"Price": price,
"Description": description_data,
"Rating": ratings,
"Featured Image": product_image_url,
}
# define the CSV file name for storing scraped data
csv_file = "product.csv"
# open the CSV file in write mode with proper encoding
with open(csv_file, mode="w", newline="", encoding="utf-8") as file:
# create a CSV writer object
writer = csv.writer(file)
# write the header row to the CSV file
writer.writerow(data.keys())
# write the data row to the CSV file
writer.writerow(data.values())
# print a confirmation message after successful data extraction and storage
print("Scraping completed and data written to CSV")
# quit the driver instance
driver.quit()
The final code creates a product.csv file with the extracted product data in your project root directory. See the CSV file below:
Look closely at the CSV file. The "Description" field stores the data as a list. If you prefer, you can write each list item to a separate row programmatically, as shown below.
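Here's one way to do that, as a sketch building on the final code above (the product_descriptions.csv filename is our own choice):
# ...
# write each description item to its own row in a separate CSV file
with open("product_descriptions.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    # header row
    writer.writerow(["Description"])
    # one row per description item
    for item in description_data:
        writer.writerow([item])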
Awesome! You now know how to extract data from Amazon using Selenium with Python. However, scraping Amazon with Selenium comes with a few challenges you should know.
Challenges of Web Scraping Amazon With Selenium
Although Selenium is a great scraping tool, it may be insufficient to scrape Amazon, especially if you're extracting data from the e-commerce store at scale. Let's look at a few weaknesses of the Selenium-based scraper, along with the solutions for them.
Blocks and Bans
Amazon is well protected and often challenging to scrape at scale. It employs security measures, such as invisible JavaScript challenges, rate-limited IP bans, and CAPTCHAs, to restrict automated programs from accessing its pages.
Unfortunately, bypassing these restrictions with a bare headless browser like Selenium can be difficult. That's because headless browsers expose bot-like properties, such as the WebDriver flag, and lack the evasion strategies required to avoid advanced anti-bot detection.
Even if you follow best practices, such as changing the User Agent and rotating proxies, Amazon's defense mechanisms are often powerful enough to detect and block your requests. The best way to scrape Amazon without getting blocked is to use a web scraping API like ZenRows. We'll explain more later.
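For reference, here's what changing the User Agent looks like in Selenium (a sketch; the UA string below is only an example, and as noted, this alone is usually not enough for Amazon):
# ...
# example: override Chrome's default User Agent before creating the driver
options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
)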
Inefficient Performance
Headless browsers like Selenium consume significant system memory, resulting in poor performance. Wait times, browser instance management, JavaScript execution, and resource loading all reduce Selenium's performance further.
Here are a few ways to speed up Selenium:
- Run the browser in headless mode.
- Block extra resources, such as images and CSS (see the sketch after this list).
- Use optimized selectors such as CSS selectors.
- Run multiple Selenium instances in parallel using cloud grids.
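As an example of the resource-blocking tip above, here's a minimal sketch that disables image loading through Chrome preferences (a common approach; double-check the preference key against your Chrome version):
# ...
# block image loading to cut bandwidth and speed up page loads
options.add_experimental_option(
    "prefs", {"profile.managed_default_content_settings.images": 2}
)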
There are more ways to optimize Selenium's speed. Read our detailed guide on speeding up Selenium to learn more.
Changes in Page Layout
Big websites like Amazon often change the DOM layout, including element structure and attributes, causing your previous selectors to fail.
One way to mitigate this challenge is to monitor the site's HTML layout and update your selectors regularly to reflect layout changes. Another good practice is to isolate element selectors from your scraping logic using the page object model (POM). This approach allows you to locate and fix outdated selectors quickly.
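As a minimal illustration of the POM idea (the ProductPage class and its structure below are our own sketch, not part of the tutorial code), you could keep all Amazon selectors in one place:
# keep all selectors in one class so a layout change only requires edits here
from selenium.webdriver.common.by import By

class ProductPage:
    TITLE = (By.ID, "title")
    RATING = (By.ID, "acrPopover")
    IMAGE_WRAPPER = (By.ID, "imgTagWrapperId")

    def __init__(self, driver):
        self.driver = driver

    def product_name(self):
        # scraping logic stays here; selectors live in the class attributes
        return self.driver.find_element(*self.TITLE).text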
Selenium Alternative: Scrape Amazon With a Web Scraping API
A web scraping API is the best solution to scrape any website at scale without getting blocked. One of its advantages is that it's compatible with any programming language and easy to implement. Additionally, a web scraping API works every time despite the frequent security updates of anti-bot systems.
ZenRows is the most popular web scraping API. It features a dedicated Amazon scraper, a ready-made solution for extracting the correct data from Amazon at scale without stress.
ZenRows' dedicated Amazon scraper helps you:
- Optimize your requests and auto-bypass CAPTCHAs and other anti-bot mechanisms.
- Automatically extract accurate data in JSON format.
- Auto-parse data from various Amazon pages, including products, listings, search results, best sellers, questions and answers, and more.
- Auto-rotate premium proxies to avoid Amazon's rate-limited IP bans.
- Access localized products in 185+ countries around the world.
You only need a single API call, and ZenRows handles the scraping task under the hood. Let's try ZenRows with the previous product page to see how it works.
Sign up to load the Request Builder.
Paste the product URL in the link box and activate Premium Proxies and JS Rendering.
Select Python as your programming language and choose the API connection mode. Copy and paste the generated code into your Python file:
The generated code should look like this:
# pip install requests
import requests
url = "https://www.amazon.com/Logitech-G502-Performance-Gaming-Mouse/dp/B07GBZ4Q68/"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
"url": url,
"apikey": apikey,
"js_render": "true",
"premium_proxy": "true",
"autoparse": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)
The above code parses the product page and returns a JSON format of the extracted product details:
{
"answers": "Search this page",
"availability": "In Stock",
"avg_rating": "4.7 out of 5 stars",
"category": "Video Games > PC > Accessories > Gaming Mice",
"description": "Logitech updated its iconic G502 gaming mouse...",
"out_of_stock": false,
"price": "$44.90",
"price_without_discount": "$79.99",
"title": "Logitech G502 HERO High Performance Wired Gaming Mouse, HERO 25K Sensor, 25,600 DPI, RGB, Adjustable Weights, 11 Programmable Buttons, On-Board Memory, PC / Mac",
"features": [
{"Brand": "Logitech G"},
{"Series": "Logitech G502 HERO High Performance Gaming Mouse"},
{"Item model number": "910-005469"},
# ... omitted for brevity,
{"Manufacturer": "Logitech"},
{"ASIN": "B07GBZ4Q68"},
{"Is Discontinued By Manufacturer": "No"},
{"Date First Available": "August 24, 2018"},
{
"Best Sellers Rank": "#15 in Video Games (See Top 100 in Video Games) #1 in PC Gaming Mice"
},
],
"images": ["https://m.media-amazon.com/images/I/61mpMH5TzkL._AC_SL1500_.jpg"],
}
Congratulations! You just parsed an Amazon product page automatically with ZenRows.
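Since the response body is JSON, you can also parse it directly instead of printing the raw text. For example (a small extension of the generated code; the field names match the sample output above):
# ...
# parse the JSON response and access individual fields
product = response.json()
print(product["title"])
print(product["price"])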
Conclusion
You've seen how to scrape Amazon with Selenium in Python, including tips for optimizing your scraper. Here's a recap of what you've learned:
- Build a basic Amazon scraper to access the product page.
- Extract specific product data from an Amazon product page.
- Write the scraped data to a CSV file.
- Beat the challenges of scraping data from Amazon.
As mentioned, scraping Amazon at scale is challenging, as your scraper faces potential IP bans and anti-bot measures. The easiest way to get your desired Amazon product data is to use ZenRows' Amazon scraper.
Try ZenRows for free now without a credit card!