Scrapy is a robust Python framework for extracting data from web pages at scale. Its ability to run multiple requests simultaneously and its built-in mechanisms for handling pagination make Scrapy a great choice for scraping Amazon.
In this tutorial, we'll provide a step-by-step guide to everything you need to scrape Amazon, from setting up Scrapy to retrieving data from multiple pages. You'll learn how to:
- Scrape Amazon product data with Scrapy.
- Export scraped Amazon data to CSV.
- Scrape multiple pages using Scrapy.
- Use the easiest solution to scrape Amazon.
Step 1: Prerequisites
Before we dive into the nitty-gritty of scraping Amazon, ensure you meet the following prerequisites:
- Python.
- Scrapy.
Follow the steps below to set up Scrapy.
Using a virtual environment is often recommended, as it lets you manage dependencies separately for each project.
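For example, on macOS or Linux (the environment name venv is arbitrary; on Windows, activate with venv\Scripts\activate instead):

python3 -m venv venv
source venv/bin/activate

With the environment active, install Scrapy: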
pip3 install scrapy
Once the installation is complete, create a new Scrapy project using this command.
scrapy startproject amazon_scraper
This creates an amazon_scraper directory containing the essential Scrapy files.
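The generated project has the following structure:

amazon_scraper/
    scrapy.cfg
    amazon_scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py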
That's it. You're all set up.
Next, create your spider, where you'll define the instructions for scraping Amazon. To do that, navigate to the new directory and enter Scrapy's genspider command.
cd amazon_scraper
scrapy genspider scraper amazon.com
Open the newly generated scraper spider (spiders/scraper.py) in your code editor and get ready to write some code.
Step 2: Scrape Amazon Product Data With Scrapy
For demonstration purposes, we'll use the following sample Amazon product page as our target website.
After retrieving the full HTML from this product page, we'll extract the following data points.
- Product name.
- Price.
- Images.
- Description.
- Reviews.
By the end of this tutorial, you'll have a Scrapy Amazon spider capable of retrieving and storing data in a usable format.
So, without further ado, let's dive in!
Below is a basic Scrapy spider for fetching the full HTML of the target web page.
import scrapy


class ScraperSpider(scrapy.Spider):
    name = "scraper"
    start_urls = ["https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/dp/B098LG3N6R/"]

    def parse(self, response):
        self.log(response.text)
Run this code using Scrapy's crawl command.
scrapy crawl scraper
This will scrape the target Amazon page and log its HTML, as shown below. We've truncated the result for simplicity.
<!doctype html>
<html lang="en-us" class="a-no-js" data-19ax5a9jf="dingo">
<head>
    <!-- ... -->
    <!-- DNS Prefetch to improve loading speed of images -->
    <link rel="dns-prefetch" href="https://images-na.ssl-images-amazon.com">
    <!-- ... -->
    <title>Amazon.com: MageGee Portable 60% Mechanical Gaming Keyboard, MK-Box LED Backlit Compact 68 Keys Mini Wired Office Keyboard with Red Switch for Windows Laptop PC Mac - Black/Grey : Video Games</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    Omitted for brevity
</body>
</html>
However, it's worth noting that Amazon pages are mostly protected by anti-bot systems. Thus, you may encounter restrictions or blocks when trying to extract data.
If you're getting blocked, some proven techniques that could get you over the hump include using proxies with Scrapy and specifying a custom Scrapy User Agent.
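For instance, you can set a custom User-Agent globally in settings.py and route individual requests through a proxy. Below is a minimal sketch; the User-Agent string and proxy URL are placeholders you'd replace with your own:

# settings.py: override Scrapy's default User-Agent (placeholder string)
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

# in your spider: route requests through a proxy (placeholder URL)
def start_requests(self):
    yield scrapy.Request(
        "https://www.amazon.com/",
        meta={"proxy": "http://user:pass@proxy.example.com:8080"},
    )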
That said, in a later section, we'll discuss a more reliable, foolproof option for avoiding detection while web scraping.
Find and Scrape the Product Name
Let's proceed with the product name, one of the most straightforward data points to scrape.
Start by inspecting the page to identify the correct selector for the product name. It's often located within an <h1> tag or a span with an easy-to-identify ID.
To verify, open the Amazon product page in a browser, right-click on the product name, and select Inspect to open the DevTools window.
You'll find that the product name is within a span with a productTitle ID.
Using this information, select the target element and extract its text content.
#...
class ScraperSpider(scrapy.Spider):
    #...
    def parse(self, response):
        # select the product name element and extract its text content
        product_name = response.css('#productTitle::text').get().strip()
        yield {'Product Name': product_name}
This code logs the following product name to your console.
{
    'Product Name': 'MageGee Portable 60% Mechanical Gaming Keyboard, MK-Box LED Backlit Compact 68 Keys Mini Wired Office Keyboard with Red Switch for Windows Laptop PC Mac - Black/Grey'
}
Amazon's selectors often change due to regular DOM structure updates. When following this tutorial, ensure you double-check and update them accordingly.
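To keep the spider from crashing when a selector breaks, you can pass a default value to get() instead of letting it return None:

# fall back to an empty string if the selector no longer matches
product_name = response.css('#productTitle::text').get(default='').strip()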
Locate and Get the Price
Extracting the product price follows a similar process: locate the specific element, identify its selector, and retrieve its text content.
Using the same inspection techniques as in the previous section, you'll find that the price is divided into different span nodes with the classes a-price-symbol, a-price-whole, a-price-decimal, and a-price-fraction.
However, you'll also find the complete price within a span with class aok-offscreen. Amazon uses this class to visually hide the price text from view for accessibility purposes, but it remains readily accessible in the HTML.
Using this information, select the price element as in the previous code and extract its text content.
#...
class ScraperSpider(scrapy.Spider):
    #...
    def parse(self, response):
        # ...

        # select the price element and extract its text content
        price = response.css('.aok-offscreen::text').get().strip()
        yield {'Price': price}
This code outputs the product price, as seen below.
{'Price': '$29.99'}
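If you need the price as a number (for sorting or comparisons), you can parse it before yielding. Here's a minimal sketch, assuming the simple $xx.xx format shown above:

import re

price_text = '$29.99'
# extract the numeric part of the string and convert it to a float
match = re.search(r'[\d.]+', price_text)
price_value = float(match.group()) if match else None
print(price_value)  # 29.99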
Locate and Scrape Product Images
Amazon displays the product images in a carousel format; any image you click or hover over appears as the primary image. Therefore, you can scrape all product images from the carousel container.
As before, use your browser's DevTools to locate the image elements and identify their selectors.
These images are in img tags within list items nested in a div with an altImages ID.
To extract this data, locate the div element, select all the img tags, and extract their src attributes. Scrapy allows you to do all this with a single line of code using its getall() method.
#...
class ScraperSpider(scrapy.Spider):
    #...
    def parse(self, response):
        # ...

        # extract the product images
        images = response.css('#altImages img::attr(src)').getall()
        yield {'Images': images}
This method retrieves all the matches specified by the selector and returns a list, as in the result below.
{
    'Images': [
        'https://m.media-amazon.com/images/I/41S5pwGovuL._AC_US40_.jpg',
        'https://m.media-amazon.com/images/I/41HOY7Hp4zL._AC_US40_.jpg',
        'https://m.media-amazon.com/images/I/41TgP6epL9L._AC_US40_.jpg',
        'https://m.media-amazon.com/images/I/41ddUUIQQQL._AC_US40_.jpg',
        # ... omitted for brevity
    ]
}
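Note that these URLs point to small thumbnails (the _AC_US40_ size modifier in the file name). In practice, stripping that modifier usually returns the full-size image, though this URL pattern is an observation rather than a documented Amazon convention:

import re

# derive full-size URLs from the scraped images list by removing
# the size modifier (e.g., '._AC_US40_')
full_size_images = [re.sub(r'\._[^.]+_\.', '.', url) for url in images]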
Scrape Product Descriptions
As in previous cases, the first step is inspecting the page to locate the element containing the product description. You'll find it within li tags with class a-spacing-mini. The text content is contained in span elements with class a-list-item.
Using this class attribute, select all list items and extract their text content.
#...
class ScraperSpider(scrapy.Spider):
    #...
    def parse(self, response):
        # ...

        # select the product description container and extract it
        description_items = response.css('.a-spacing-mini .a-list-item::text').getall()
        # clean and remove empty spaces
        description = [item.strip() for item in description_items if item.strip()]
        yield {'Description': description}
This code retrieves the product description, strips surrounding whitespace, and outputs the following result.
{
    'Description': [
        'Mini portable 60% compact layout: MK-Box is a 68 keys mechanical keyboard have cute small size, separate arrow keys and all your F-keys you need, can use it for gaming or work while saving space.',
        # ... omitted for brevity
    ]
}
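If you'd rather store the description as a single text field, which flattens more cleanly into CSV, you can join the cleaned items:

# merge the cleaned list items into one description string
description_text = ' '.join(description)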
Locate and Scrape Product Reviews
Amazon product reviews are typically structured as tiles, including individual ratings, review headings, and review bodies.
To extract each review on the page, select all review tiles and extract the ratings, review headings, and body content.
Start by inspecting the page to locate the target elements. The reviews are in multiple div tags with class review, nested in a parent div with the ID cm-cr-dp-review-list.
Using this CSS selector, select all review tiles, loop through, and extract each tile's rating, heading, and body.
- Ratings are span elements with class a-icon-alt.
- Review headings are the second span elements in anchor tags with class review-title.
- The review body is a div tag with the class review-text-content.
#...
class ScraperSpider(scrapy.Spider):
    #...
    def parse(self, response):
        # ...

        # select each review tile
        reviews = response.css('#cm-cr-dp-review-list .review')
        # create an empty list to store review data
        review_data = []
        # for each review tile, extract rating, heading, and body
        for review in reviews:
            # extract rating
            rating = review.css('.a-icon-alt::text').get().strip()
            # extract review heading
            heading = review.css('.review-title span:nth-of-type(2)::text').get().strip()
            # extract body
            body = review.css('.review-text-content span::text').get().strip()
            # append rating, heading, and body to review_data
            review_data.append({
                'Rating': rating,
                'Heading': heading,
                'Body': body
            })
        print(review_data)
This code creates an empty list to store the review data and then appends each review's rating, heading, and body to the list.
We've truncated the result for simplicity.
[
    {
        'Rating': '5.0 out of 5 stars',
        'Heading': 'Maybe just not for me?',
        'Body': 'Keyboard seems high quality. ...'
    },
    {
        'Rating': '5.0 out of 5 stars',
        'Heading': 'Solid Performance in a Compact Package',
        'Body': 'I recently purchased the MageGee Portable 60% Mechanical Gaming Keyboard...'
    },
    {
        'Rating': '4.0 out of 5 stars',
        'Heading': 'good for price',
        'Body': 'you get what you pay for with this keyboard...'
    },
    # ... omitted for brevity
]
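As a quick sanity check, you can compute the average rating from these strings. A minimal sketch, assuming every rating in review_data follows the 'X.X out of 5 stars' format shown above:

# parse the leading number from each rating string and average them
ratings = [float(r['Rating'].split()[0]) for r in review_data]
average_rating = sum(ratings) / len(ratings) if ratings else None
print(average_rating)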
Step 3: Export Scraped Amazon Data to CSV
Scrapy offers two straightforward options for exporting data to CSV:
- Directly from the command line.
- Using the FEEDS setting.
To save to CSV directly from your command line, add the -o flag to the scrapy crawl command to specify the file path, and the -t flag to define the file format.
Below is an example:
scrapy crawl scraper -o output.csv -t csv
This will create an output.csv file in your project directory.
To use the FEEDS setting, open your settings.py file and enter the following:
FEEDS = {
    'output.csv': {'format': 'csv'}
}
This will automatically export your scraped data to CSV every time you run the spider.
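Note that Scrapy appends to an existing output file by default. If you'd rather start fresh on each run, FEEDS accepts an overwrite option (available since Scrapy 2.4):

FEEDS = {
    'output.csv': {'format': 'csv', 'overwrite': True}
}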
Now, put all the steps together to get the following complete code.
import scrapy


class ScraperSpider(scrapy.Spider):
    name = 'scraper'
    start_urls = ['https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/dp/B098LG3N6R/']

    def parse(self, response):
        # ... find and scrape product name ... #
        # select the product name element and extract its text content
        product_name = response.css('#productTitle::text').get().strip()

        # ... locate and get the price ... #
        # select the price element and extract its text content
        price = response.css('.aok-offscreen::text').get().strip()

        # ... locate and scrape product images ... #
        # select images container, search for all img tags, and extract their src
        images = response.css('#altImages img::attr(src)').getall()

        # ... scrape product description ... #
        # select the product description container and extract it
        description_items = response.css('.a-spacing-mini .a-list-item::text').getall()
        # clean and remove empty spaces
        description = [item.strip() for item in description_items if item.strip()]

        # ... locate and scrape product reviews ... #
        # select each review tile
        reviews = response.css('#cm-cr-dp-review-list .review')
        # create an empty list to store review data
        review_data = []
        # for each review tile, extract rating, heading, and body
        for review in reviews:
            # extract rating
            rating = review.css('.a-icon-alt::text').get().strip()
            # extract review heading
            heading = review.css('.review-title span:nth-of-type(2)::text').get().strip()
            # extract body
            body = review.css('.review-text-content span::text').get().strip()
            # append rating, heading, and body to review_data
            review_data.append({
                'Rating': rating,
                'Heading': heading,
                'Body': body
            })

        # yield all the extracted data
        yield {
            'Product Name': product_name,
            'Price': price,
            'Images': images,
            'Description': description,
            'Reviews': review_data
        }
Run it using the CSV command:
scrapy crawl scraper -o output.csv -t csv
You'll have an output.csv file populated with the scraped data.
Awesome! You now have a Scrapy spider for scraping Amazon and storing data in CSV format.
Step 4: Scraping Multiple Pages Using Scrapy
Now that you've scraped all the information from a single product page, it's time to scale up! Many products have reviews spanning multiple pages, and you may also want to scrape Amazon search results. Either way, if you're targeting several products, scraping multiple pages is critical.
Scrapy makes it easy to handle Amazon pagination, particularly search pages. With a few lines of code, you can configure your spider to follow the next page link and retrieve the necessary information.
For this tutorial, we'll use an Amazon search page as the target URL.
Let's start by extracting product data from the first result page. Similar to the previous examples, select all search result items on the page, loop through, and extract the necessary data.
For simplicity, we'll focus only on the product name.
import scrapy


class ScraperSpider(scrapy.Spider):
    name = 'scraper'
    start_urls = ['https://www.amazon.com/s?k=computer+keyboard']

    def parse(self, response):
        # select search result list items
        search_result = response.css('.s-result-item')
        # loop through and extract the product name of each product
        for product in search_result:
            product_name = product.css('h2.a-spacing-none span::text').get()
            # check if product_name is None
            if product_name:
                product_name = product_name.strip()
                yield {'Product Name': product_name}
This code snippet selects the search result list, loops through each search result, and extracts the product name.
Also, some of the search results may not contain a product name. Therefore, we ensured that product_name is not None before calling the strip() method. This is important to avoid errors.
To scrape subsequent pages, identify the next page link. In this case, the pagination is at the bottom of the search result page.
Inspect the "next button" in a browser to identify its selector. You'll find it's an anchor tag with class s-pagination-next.
Select its href attribute and queue it using the response.follow() method. This method instructs Scrapy to load the next page URL and automatically handle the request queue.
Remember to include the parse callback function to use the same scraping logic for each page.
# ...
class ScraperSpider(scrapy.Spider):
    # ...
    def parse(self, response):
        # ...

        # find and follow the next page link
        next_page = response.css('a.s-pagination-next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
Put everything together to get the following complete code:
import scrapy


class ScraperSpider(scrapy.Spider):
    name = 'scraper'
    start_urls = ['https://www.amazon.com/s?k=computer+keyboard']

    def parse(self, response):
        # select search result list items
        search_result = response.css('.s-result-item')
        # loop through and extract the product name of each product
        for product in search_result:
            product_name = product.css('h2.a-spacing-none span::text').get()
            # check if product_name is None
            if product_name:
                product_name = product_name.strip()
                yield {'Product Name': product_name}

        # find and follow the next page link
        next_page = response.css('a.s-pagination-next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
This code scrapes all the search result pages and returns the name of each product.
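While testing, you may not want to crawl every result page. Scrapy's built-in CLOSESPIDER_PAGECOUNT setting stops the crawl after a given number of responses, for example:

scrapy crawl scraper -s CLOSESPIDER_PAGECOUNT=5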
To export to CSV, run this code using the following command, like in the previous example.
scrapy crawl scraper -o output.csv -t csv
Your output.csv file will contain the product names from each result page.
Congratulations! You've taken your Scrapy Amazon spider a step further.
Easiest Solution to Scrape Amazon
A surefire way to avoid detection when scraping Amazon with Scrapy is by integrating with ZenRows. This solution offers everything you need to scrape without getting blocked, including features like premium proxies, JavaScript rendering, advanced anti-bot bypass, and more.
These features allow you to focus on extracting your desired data rather than the intricacies of circumventing anti-bot solutions.
ZenRows offers the scrapy-zenrows middleware for seamless and straightforward integration with Scrapy.
To use this tool, install the middleware using the following command:
pip3 install scrapy-zenrows
You'll need an API key. Sign up to get yours.
After that, add the ZenRows Scraper API middleware to your DOWNLOADER_MIDDLEWARES setting and specify your ZenRows API key:
# ...
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zenrows.ZenRowsMiddleware": 543,
}

# ZenRows API Key
ZENROWS_API_KEY = "<YOUR_ZENROWS_API_KEY>"
Lastly, set the premium proxy and JS rendering parameters to True in your settings.py file to use them globally.
# ...
USE_ZENROWS_PREMIUM_PROXY = True
USE_ZENROWS_JS_RENDER = True
With these settings, you can run any spider as usual, and ZenRows will handle all anti-bot solutions you may encounter.
Alternatively, if you want to apply the middleware to a specific spider, you can override the global settings using ZenRowsRequest in start_requests.
Below is an example.
Assuming Amazon blocked your request when trying to follow along in this tutorial, here's how you can bypass those restrictions using ZenRows.
# import the required modules
import scrapy
from scrapy_zenrows import ZenRowsRequest


class ScraperSpider(scrapy.Spider):
    name = 'scraper'

    def start_requests(self):
        url = 'https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/dp/B098LG3N6R/'
        yield ZenRowsRequest(
            url=url,
            callback=self.parse,
            params={
                'js_render': 'true',  # enable JavaScript rendering
                'premium_proxy': 'true',  # use premium proxy
                'custom_headers': 'true',  # activate custom headers
                'js_instructions': '[{"wait": 500}]',  # wait 500ms after page load
            },
            # add custom referer header
            headers={
                'Referer': 'https://www.google.com/',
            },
        )

    def parse(self, response):
        # log the response body
        self.logger.info(response.text)
Here's the result:
<!doctype html>
<html lang="en-us" class="a-no-js" data-19ax5a9jf="dingo">
<head>
    <!-- ... -->
    <!-- DNS Prefetch to improve loading speed of images -->
    <link rel="dns-prefetch" href="https://images-na.ssl-images-amazon.com">
    <!-- ... -->
    <title>Amazon.com: MageGee Portable 60% Mechanical Gaming Keyboard, MK-Box LED Backlit Compact 68 Keys Mini Wired Office Keyboard with Red Switch for Windows Laptop PC Mac - Black/Grey : Video Games</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    Omitted for brevity
</body>
</html>
Congratulations! You now have a Scrapy Amazon spider capable of avoiding detection.
ZenRows also offers numerous other features you can leverage by simply defining its parameters. Check the scrapy-zenrows documentation for more details.
Conclusion
Scrapy is a great choice for scraping Amazon due to its architecture, flexibility, and efficiency in handling multiple requests and pagination. However, vanilla Scrapy will get blocked by anti-bot solutions and website restrictions.
Luckily, you can integrate with ZenRows to avoid detection. For hassle-free Amazon scraping, try ZenRows now.