Scraping Amazon With Scrapy: Step-by-Step Tutorial [2024]

Sergio Nonide
November 15, 2024 · 7 min read

Scrapy is a robust Python framework for extracting data from web pages at scale. Its ability to run multiple requests simultaneously and its built-in mechanisms for handling pagination make Scrapy a great choice for scraping Amazon.

In this tutorial, we'll provide a step-by-step guide to everything you need to scrape Amazon, from setting up Scrapy to retrieving data from multiple pages.

Step 1: Prerequisites

Before we dive into the nitty-gritty of scraping Amazon, make sure you have Python 3 and pip installed on your machine.

Follow the steps below to set up Scrapy.

Using a virtual environment is recommended so you can manage dependencies separately per project.
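For example, on macOS or Linux, you can create and activate a virtual environment as follows (on Windows, activate with venv\Scripts\activate instead):

Terminal
python3 -m venv venv
source venv/bin/activate

With the environment active, install Scrapy using the following command: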

Terminal
pip3 install scrapy

Once the installation is complete, create a new Scrapy project using this command.

Terminal
scrapy startproject amazon_scraper

This creates an amazon_scraper directory structure containing the essential Scrapy files.
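
The generated project layout looks roughly like this (file names can vary slightly between Scrapy versions):

Output
amazon_scraper/
├── scrapy.cfg              # deploy configuration
└── amazon_scraper/         # the project's Python package
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/            # your spiders live here
        └── __init__.py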

That's it. You're all set up.

Next, create your spider, where you'll define the instructions for scraping Amazon. To do that, navigate to the new directory and enter Scrapy's genspider command.

Terminal
cd amazon_scraper
scrapy genspider scraper amazon.com

Open this newly generated scraper spider in your code editor and get ready to write some code.
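
The generated spider should look roughly like this (the exact boilerplate varies slightly between Scrapy versions):

scraper.py
import scrapy


class ScraperSpider(scrapy.Spider):
    name = "scraper"
    allowed_domains = ["amazon.com"]
    start_urls = ["https://amazon.com"]

    def parse(self, response):
        pass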

Step 2: Scrape Amazon Product Data With Scrapy

For demonstration purposes, we'll use the following sample Amazon product page as our target website.

Amazon Product Page

After retrieving the full HTML from this product page, we'll extract the following data points.

  • Product name.
  • Price.
  • Images.
  • Description.
  • Reviews.

By the end of this tutorial, you'll have a Scrapy Amazon spider capable of retrieving and storing data in a usable format.

So, without further ado, let's dive in!

Below is a basic Scrapy spider for fetching the full HTML of the target web page.

scraper.py
import scrapy

class ScraperSpider(scrapy.Spider):
    name = "scraper"
    start_urls = ["https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/dp/B098LG3N6R/"]

    def parse(self, response):
        self.log(response.text)

Run this code using Scrapy's crawl command.

Terminal
scrapy crawl scraper

This will scrape the target Amazon page and log its HTML, as shown below. We've truncated the result for simplicity.

Output
<!doctype html>
<html lang="en-us" class="a-no-js" data-19ax5a9jf="dingo">
<head>
    <!-- ... -->
    <!-- DNS Prefetch to improve loading speed of images -->
    <link rel="dns-prefetch" href="https://images-na.ssl-images-amazon.com">
   
    <!-- ... -->

    <title>Amazon.com: MageGee Portable 60% Mechanical Gaming Keyboard, MK-Box LED Backlit Compact 68 Keys Mini Wired Office Keyboard with Red Switch for Windows Laptop PC Mac - Black/Grey : Video Games</title>
 
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    Omitted for brevity 
</body>
</html>

However, it's worth noting that most Amazon pages are protected by anti-bot systems, so you may encounter restrictions or blocks when trying to extract data.

If you're getting blocked, some proven techniques that could get you over the hump include using proxies with Scrapy and specifying a custom Scrapy User Agent.
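
As a quick illustration of the latter, here's a minimal sketch using Scrapy's built-in USER_AGENT setting, which applies a custom User Agent to every request (the string below is just an example; in practice, you'd rotate several):

settings.py
# ...

# example User Agent string; rotate values in production to reduce blocks
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"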

That said, in a later section, we'll discuss a more reliable option for avoiding detection while web scraping.

Find and Scrape the Product Name

Let's proceed with the product name, one of the most straightforward data points to scrape.

Start by inspecting the page to identify the correct selector for the product name. It's often located within an <h1> tag or a span with an easy-to-identify ID.

To verify, open the Amazon product page in a browser, right-click on the product name, and select Inspect to open the DevTools window.

Amazon Product Page Title Inspection

You'll find that the product name is within a span with a productTitle ID.

Using this information, select the target element and extract its text content.

scraper.py
#...

class ScraperSpider(scrapy.Spider):
    #...
   

    def parse(self, response):
        # select the product name element and extract its text content
        product_name = response.css('#productTitle::text').get().strip()
        yield {'Product Name': product_name}

This code logs the following product name to your console.

Output
{'Product Name': 'MageGee Portable 60% Mechanical Gaming Keyboard, MK-Box LED Backlit Compact 68 Keys Mini Wired Office Keyboard with Red Switch for Windows Laptop PC Mac - Black/Grey'}

Locate and Get the Price

Extracting the product price follows a similar process: locate the specific element, identify its selector, and retrieve its text content.

Using the same inspection technique as in the previous section, you'll find that the price is split across several span nodes with the classes a-price-symbol, a-price-whole, a-price-decimal, and a-price-fraction.

However, you'll also find the complete price within a span with the class aok-offscreen. Amazon uses this class to visually hide the price text for accessibility purposes, but it remains readily accessible in the HTML.
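
If you want to verify a selector before editing your spider, Scrapy's interactive shell is handy (assuming the request isn't blocked):

Terminal
scrapy shell "https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/dp/B098LG3N6R/"
>>> response.css('.aok-offscreen::text').get()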

Using this information, select the price element as in the previous code and extract its text content.

scraper.py
#...

class ScraperSpider(scrapy.Spider):
    #...
   

    def parse(self, response):
        # ...
        # select the price element and extract its text content
        price = response.css('.aok-offscreen::text').get().strip()
        yield {'Price': price}

This code outputs the product price, as seen below.

Output
{'Price': '$29.99'}

Locate and Scrape Product Images

Amazon displays the product images in a carousel format; any image you click or hover over appears as the primary image. Therefore, you can scrape all product images from the carousel container.

As before, use your browser's DevTools to locate the image elements and identify their selectors.

These images are in img tags within list items nested in a div with an altImages ID.

To extract this data, locate the div element, select all the img tags, and extract their src attributes.

Scrapy allows you to do all this with a single line of code using its getall() method.

scraper.py
#...

class ScraperSpider(scrapy.Spider):
    #...
   

    def parse(self, response):
        # ...
        # extract the product images
        images = response.css('#altImages img::attr(src)').getall()
        yield {'Images': images}

This method retrieves all the matches specified by the selector and returns a list, as in the result below.

Output
{
  'Images': 
    [
      'https://m.media-amazon.com/images/I/41S5pwGovuL._AC_US40_.jpg',
      'https://m.media-amazon.com/images/I/41HOY7Hp4zL._AC_US40_.jpg',
      'https://m.media-amazon.com/images/I/41TgP6epL9L._AC_US40_.jpg',
      'https://m.media-amazon.com/images/I/41ddUUIQQQL._AC_US40_.jpg', 

    # ... omitted for brevity
    ]
}

Scrape Product Descriptions

As in previous cases, the first step is inspecting the page to locate the element containing the product description. You'll find it within the li tags with class a-spacing-mini. The text content is contained in span elements with class a-list-item.

Using this class attribute, select all list items and extract their text content.

scraper.py
#...

class ScraperSpider(scrapy.Spider):
    #...
   

    def parse(self, response):
        # ...
        # select the product description container and extract it
        description_items = response.css('.a-spacing-mini .a-list-item::text').getall()
        # clean and remove empty spaces
        description = [item.strip() for item in description_items if item.strip()]
        yield {'Description': description}

This code retrieves the product description, strips whitespace and empty strings, and outputs the following result.

Output
{
  'Description': 
  [
      'Mini portable 60% compact layout: MK-Box is a 68 keys mechanical keyboard have cute small size, separate arrow keys and all your F-keys you need, can use it for gaming or work while saving space.', 

    # ... omitted for brevity

  ]
}

Locate and Scrape Product Reviews

Amazon product reviews are typically structured as tiles, including individual ratings, review headings, and review bodies.

To extract each review on the page, select all review tiles and extract the ratings, review headings, and body content.

Start by inspecting the page to locate the target elements. The reviews are in multiple div tags with class review, nested in a parent div with ID cm-cr-dp-review-list.

Using this CSS selector, select all review tiles, loop through, and extract each tile's rating, heading, and body.

  • Ratings are span elements with class a-icon-alt.
  • Review headings are the second span elements in anchor tags with class review-title.
  • The body is a div tag with the class review-text-content.

scraper.py
#...

class ScraperSpider(scrapy.Spider):
    #...
   

    def parse(self, response):
        # ...
        # select each review tile
        reviews = response.css('#cm-cr-dp-review-list .review')
        # create an empty list to store review data
        review_data = []
        # for each review tile extract rating, heading, and body
        for review in reviews:
            # extract rating
            rating = review.css('.a-icon-alt::text').get().strip()
            # extract review heading
            heading = review.css('.review-title span:nth-of-type(2)::text').get().strip()
            # extract body
            body = review.css('.review-text-content span::text').get().strip()

            # append rating, heading, and body to review_data
            review_data.append({
                'Rating': rating,
                'Heading': heading,
                'Body': body
            })

        print(review_data)

This code creates an empty list to store the review data and then appends each review's rating, heading, and body to the list.

We've truncated the result for simplicity.

Output
[
  {
    'Rating': '5.0 out of 5 stars',
    'Heading': 'Maybe just not for me?',
    'Body': 'Keyboard seems high quality. ...'
  }, 
  {
    'Rating': '5.0 out of 5 stars',
    'Heading': 'Solid Performance in a Compact Package',
    'Body': 'I recently purchased the MageGee Portable 60% Mechanical Gaming Keyboard...'
  }, 
  {
    'Rating': '4.0 out of 5 stars',
    'Heading': 'good for price',
    'Body': 'you get what you pay for with this keyboard...'
  }, 
  # ... omitted for brevity
]

Step 3: Export Scraped Amazon Data to CSV

Scrapy offers two straightforward options for exporting data to CSV:

  • Directly from the command line.
  • Using the FEEDS setting.

To save to CSV directly from your command line, add the -o flag to the scrapy crawl command to specify the file path and, optionally, the -t flag to define the file format (Scrapy can also infer the format from the file extension).

Below is an example:

Terminal
scrapy crawl scraper -o output.csv -t csv

This will create an output.csv file in your project directory.

To use the FEEDS settings, open your settings.py file and enter the following settings.

settings.py
FEEDS = {
   'output.csv': {'format': 'csv'}
}

This will automatically export your scraped data to CSV every time you run the spider.
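
FEEDS also accepts per-feed options. For instance, this variant (assuming Scrapy 2.4 or later) overwrites the file on each run instead of appending to it:

settings.py
FEEDS = {
    'output.csv': {
        'format': 'csv',
        # replace the file on every run instead of appending (Scrapy 2.4+)
        'overwrite': True,
    }
}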

Now, put all the steps together to get the following complete code.

scraper.py
import scrapy


class ScraperSpider(scrapy.Spider):
    name = 'scraper'
    start_urls = ['https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/dp/B098LG3N6R/']

    def parse(self, response):
        # ... find and scrape product name ... #
        # select the product name element and extract its text content
        product_name = response.css('#productTitle::text').get().strip()

        # ... locate and get the price ... #
        # select the price element and extract its text content
        price = response.css('.aok-offscreen::text').get().strip()

        # ... locate and scrape product images ... #
        # select images container, search for all img tags, and extract their src 
        images = response.css('#altImages img::attr(src)').getall()

        # ... scrape product description ... #
        # select the product description container and extract it
        description_items = response.css('.a-spacing-mini .a-list-item::text').getall()
        # clean and remove empty spaces
        description = [item.strip() for item in description_items if item.strip()]

        # ... locate and scrape product reviews ... #
        # select each review tile
        reviews = response.css('#cm-cr-dp-review-list .review')
        # create an empty list to store review data
        review_data = []
        # for each review tile extract rating, heading, and body
        for review in reviews:
            # extract rating
            rating = review.css('.a-icon-alt::text').get().strip()
            # extract review heading
            heading = review.css('.review-title span:nth-of-type(2)::text').get().strip()
            # extract body
            body = review.css('.review-text-content span::text').get().strip()

            # append rating, heading, and body to review_data
            review_data.append({
                'Rating': rating,
                'Heading': heading,
                'Body': body
            })


        # log all the extracted data.
        yield {
            'Product Name': product_name,
            'Price': price,
            'Images': images,
            'Description': description,
            'Reviews': review_data
        }

Run it using the CSV command:

Terminal
scrapy crawl scraper -o output.csv -t csv

You'll have an output.csv file with content similar to the one below:

Amazon Data CSV Output

Awesome! You now have a Scrapy spider for scraping Amazon and storing data in CSV format.

Step 4: Scraping Multiple Pages Using Scrapy

Now that you've scraped all the information from a single product page, it's time to scale up! Many products have reviews spanning multiple pages. Similarly, you may be interested in Amazon search results. Either way, if you're targeting several products, scraping multiple pages is critical.

Scrapy makes it easy to handle Amazon pagination, particularly search pages. With a few lines of code, you can configure your spider to follow the next page link and retrieve the necessary information.

For this tutorial, we'll use an Amazon search page as the target URL.

Amazon Multiple Pages Scraping Search Result

Let's start by extracting product data from the first result page. Similar to the previous examples, select all search result items on the page, loop through, and extract the necessary data.

For simplicity, we'll focus only on the product name.

scraper.py
import scrapy

class ScraperSpider(scrapy.Spider):
    name = 'scraper'
    start_urls = ['https://www.amazon.com/s?k=computer+keyboard']

    def parse(self, response):
        # select search result list items
        search_result = response.css('.s-result-item')
        # loop through and extract the product name of each product
        for product in search_result:
            product_name = product.css('h2.a-spacing-none span::text').get()
            # check if product_name is none
            if product_name:  
                product_name = product_name.strip()
                yield {'Product Name': product_name}

This code snippet selects the search result list, loops through each search result, and extracts the product name.

Also, some of the search result nodes don't contain a product name, so we check that product_name is not None before calling the strip() method. This is important to avoid an AttributeError.
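
As a more compact alternative, you can pass a default value to get() so it never returns None. Here's a minimal sketch using the same selector:

scraper.py
# default='' makes get() return an empty string instead of None
product_name = product.css('h2.a-spacing-none span::text').get(default='').strip()
if product_name:
    yield {'Product Name': product_name}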

To scrape subsequent pages, identify the next page link. In this case, the pagination is at the bottom of the search result page.

Pagination

Inspect the "Next" button in a browser to identify its selector. You'll find it's an anchor tag with the class s-pagination-next.

Select its href attribute and queue it using the response.follow() method. This method instructs Scrapy to load the next page URL and automatically handle the queue.

Remember to include the parse callback function to use the same scraping logic for each page.

scraper.py
# ...

class ScraperSpider(scrapy.Spider):
   # ...

    def parse(self, response):
        # ...
            
        # find and follow the next page link
        next_page = response.css('a.s-pagination-next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Put everything together to get the following complete code:

scraper.py
import scrapy

class ScraperSpider(scrapy.Spider):
    name = 'scraper'
    start_urls = ['https://www.amazon.com/s?k=computer+keyboard']

    def parse(self, response):
        # select search result list items
        search_result = response.css('.s-result-item')
        # loop through and extract the product name of each product
        for product in search_result:
            product_name = product.css('h2.a-spacing-none span::text').get()
            # check if product_name is none
            if product_name:  
                product_name = product_name.strip()
                yield {'Product Name': product_name}

        # find and follow the next page link
        next_page = response.css('a.s-pagination-next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

This code scrapes all the search result pages and returns the name of each product.

To export the results to CSV, run this code using the same command as in the previous example.

Terminal
scrapy crawl scraper -o output.csv -t csv

Your output.csv file should look like the example below.

CSV Output for Multiple Pages

Congratulations! You've taken your Scrapy Amazon spider a step further.

Easiest Solution to Scrape Amazon

A surefire way to avoid detection when scraping Amazon with Scrapy is by integrating with ZenRows. This solution offers everything you need to scrape without getting blocked, including features like premium proxies, JavaScript rendering, advanced anti-bot bypass, and more.

These features allow you to focus on extracting your desired data rather than the intricacies of circumventing anti-bot solutions.

ZenRows offers the scrapy-zenrows middleware for seamless and straightforward integration with Scrapy.

To use this tool, install the middleware using the following command:

Terminal
pip3 install scrapy-zenrows

You'll need an API key. Sign up to get yours.

Building a Scraper With ZenRows

After that, add the ZenRows Scraper API middleware to your DOWNLOADER_MIDDLEWARE settings and specify your ZenRows API Key:

settings.py
# ...

DOWNLOADER_MIDDLEWARES = {
    "scrapy_zenrows.ZenRowsMiddleware": 543,
}

# ZenRows API Key
ZENROWS_API_KEY = "<YOUR_ZENROWS_API_KEY>"

Lastly, set the premium proxy and JS rendering parameters to True in your settings.py file to use them globally.

settings.py
# ...

USE_ZENROWS_PREMIUM_PROXY = True 
USE_ZENROWS_JS_RENDER = True

With these settings, you can run any spider as usual, and ZenRows will handle all anti-bot solutions you may encounter.

Alternatively, if you want to apply the middleware to a specific spider, you can override the global settings using ZenRowsRequest in start_requests.

For example, if Amazon blocked your requests while you were following along with this tutorial, here's how you can bypass those restrictions using ZenRowsRequest.

scraper.py
# import the required modules
import scrapy
from scrapy_zenrows import ZenRowsRequest


class ScraperSpider(scrapy.Spider):
    name = 'scraper'

    def start_requests(self):
        url = 'https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/dp/B098LG3N6R/'
        yield ZenRowsRequest(
            url=url,
            callback=self.parse,
            params={
                'js_render': 'true',  # enable JavaScript rendering
                'premium_proxy': 'true',  # use premium proxy
                'custom_headers': 'true',  # activate custom headers
                'js_instructions': '[{"wait": 500}]',  # wait 500ms after page load
            },
            # add custom referer header
            headers={
                'Referer': 'https://www.google.com/',
            },
        )

    def parse(self, response):
        # log the response body
        self.logger.info(response.text)

Here's the result:

Output
<!doctype html>
<html lang="en-us" class="a-no-js" data-19ax5a9jf="dingo">
<head>
    <!-- ... -->
    <!-- DNS Prefetch to improve loading speed of images -->
    <link rel="dns-prefetch" href="https://images-na.ssl-images-amazon.com">
   
    <!-- ... -->

    <title>Amazon.com: MageGee Portable 60% Mechanical Gaming Keyboard, MK-Box LED Backlit Compact 68 Keys Mini Wired Office Keyboard with Red Switch for Windows Laptop PC Mac - Black/Grey : Video Games</title>
 
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    Omitted for brevity 
</body>
</html>

Congratulations! You now have a Scrapy Amazon spider capable of avoiding detection.

ZenRows also offers numerous other features you can leverage by simply defining its parameters. Check the scrapy-zenrows documentation for more details.

Conclusion

Scrapy is a great choice for scraping Amazon due to its architecture, flexibility, and efficiency in handling multiple requests and pagination. However, vanilla Scrapy will get blocked by anti-bot solutions and website restrictions.

Luckily, you can integrate with ZenRows to avoid detection. For hassle-free Amazon scraping, try ZenRows now.
