How to Scrape Amazon With Python: Step-by-Step Tutorial

Updated: July 29, 2024 · 10 min read

Do you want to scrape product pages, listings, or any other information from Amazon? We've got you covered!

This tutorial will show you how to scrape Amazon with Python, from full-page HTML to specific product details and crawling multiple product listing pages.

Let's go!

Building an Amazon Product Scraper With Python

In this Amazon Python scraping tutorial, we'll scrape an Amazon product page using Python's Requests as the HTTP client and BeautifulSoup as the HTML parser. You'll start by extracting the full-page HTML and then proceed to scrape the following product details:

  • Product name.
  • Price.
  • Rating count.
  • Image.
  • Product description.

See a sample demo of the target product page below:

Amazon Product Page

Before you begin, let's see the prerequisites.

Step #1: Prerequisites

You'll need to run a few installations to follow this tutorial smoothly. Let's go through them quickly.

Python

This tutorial uses Python version 3.12.1. If you haven't installed Python yet, download the latest version from Python's official website.
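
You can confirm the installed version by running the following command (on some systems the command is python rather than python3):

Terminal
python3 --version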

Requests and BeautifulSoup

You'll use Python's Requests library as the HTTP client. Then, you'll parse the returned HTML with BeautifulSoup. Install them using pip:

Terminal
pip3 install beautifulsoup4 requests

A suitable IDE

Although we'll use VS Code on a Windows OS, you can follow this tutorial with any IDE you choose.

Once everything is up and running, you're ready to scrape Amazon!

You can also use a headless browser like Selenium to scrape Amazon, so feel free to check our guide.
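
For reference, here's a minimal headless-browser sketch that loads the same product page with Selenium. It assumes Selenium 4+ (which downloads the browser driver automatically) and a local Chrome installation, and it isn't part of the main tutorial flow:

Example
# pip3 install selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# run Chrome in headless mode (no visible window)
options = Options()
options.add_argument("--headless=new")

driver = webdriver.Chrome(options=options)

# load the target product page and print its rendered HTML
driver.get(
    "https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/dp/B098LG3N6R/"
)
print(driver.page_source)

# close the browser
driver.quit()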

Step #2: Retrieve the Page HTML

Let's start with a basic query to extract the full HTML of the target product page using the Requests library. This step ensures that your HTTP client retrieves the website's content as expected. 

Create a new scraper.py file in your project root folder and insert the following code:

scraper.py
# import the requests library
import requests

# specify the target URL
target_url = ( "https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/dp/B098LG3N6R/"
)

# send a get request to the target url
response = requests.get(target_url)

# check if the response status code is not 200 (ok)
if response.status_code != 200:
    # print an error message with the status code
    print(f"An error occurred with status {response.status_code}")
else:
    # get the page html content
    html_content = response.text
    # print the html content
    print(html_content)

The above code outputs the target web page HTML, as shown below. We've omitted some content for brevity:

Output
<!doctype html>
<html lang="en-us" class="a-no-js" data-19ax5a9jf="dingo">
<head>
    <!-- ... -->
    <!-- DNS Prefetch to improve loading speed of images -->
    <link rel="dns-prefetch" href="https://images-na.ssl-images-amazon.com">
   
    <!-- ... -->

    <title>Amazon.com: MageGee Portable 60% Mechanical Gaming Keyboard, MK-Box LED Backlit Compact 68 Keys Mini Wired Office Keyboard with Red Switch for Windows Laptop PC Mac - Black/Grey : Video Games</title>
 
    <!-- ... -->
</head>
<body>
    <!-- ... -->
</body>
</html>

However, due to potential IP bans and CAPTCHA protection, the Requests library may not work with complex websites like Amazon. If you get blocked, you can add a User Agent to Python's Requests library to mimic an actual browser and reduce the chances of detection.

To add a custom User Agent, make the following changes to the previous script:

scraper.py
# ...

# specify your custom User Agent
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
}

# specify the target URL
target_url = (
    "https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/dp/B098LG3N6R/"
)

# send a get request to the target url with a custom User Agent
response = requests.get(target_url, headers=custom_headers)

While the above measure may secure your scraper against Amazon's CAPTCHA, it's still prone to IP bans, especially if sending multiple requests. One way to prevent potential IP restrictions is to add a proxy to your request to mimic a user from another location.

Below is an example of adding a free proxy to the previous scraper script. Keep in mind that these are free proxies from the Free Proxy List. They may not work at the time of reading due to their short lifespan and unreliability:

scraper.py
# ...

# set a proxy for both http and https connection types
proxies = {
    "http": "http://47.90.205.231:33333",
    "https": "http://47.90.205.231:33333",
}

# send a get request to the target url with a custom User Agent and a proxy
response = requests.get(target_url, headers=custom_headers, proxies=proxies)

Read our article on setting up a proxy with Python's Requests library to learn more.

Your complete code should look like this after adding a proxy and a User Agent:

scraper.py
# import the requests library
import requests

# specify your custom User Agent
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
}

# specify the target URL
target_url = (
    "https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/dp/B098LG3N6R/"
)

# set a proxy for both http and https connection types
proxies = {
    "http": "http://47.90.205.231:33333",
    "https": "http://47.90.205.231:33333",
}

# send a get request to the target url with a custom User Agent and a proxy
response = requests.get(target_url, headers=custom_headers, proxies=proxies)

# check if the response status code is not 200 (ok)
if response.status_code != 200:
    # print an error message with the status code
    print(f"An error occurred with status {response.status_code}")
else:
    # get the page html content
    html_content = response.text
    # print the html content
    print(html_content)

Ready? Now, let's scrape some real Amazon product data!

Step #3: Scrape Amazon Product Details

The next step is to parse the website's HTML using BeautifulSoup and extract specific product data using CSS selectors, which locate elements precisely via their IDs or class names.

Let's modify the previous code to parse the page's HTML. Add BeautifulSoup to your imports and use it to parse the HTML like so:

scraper.py
# import the required libraries

# ...
from bs4 import BeautifulSoup

if response.status_code != 200:

    # ...
else:

    # ...

    # parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, "html.parser")

    # ... scraping logic

Now, we'll go through the extraction of each piece of target data step by step, starting with the product name.

Locate and Scrape Product Name

First, inspect the product name element. Open the product page in a browser like Chrome, right-click the product name, and select Inspect to open the DevTools window. You'll see a highlighted <h1> tag with a title ID, containing a span node with a productTitle ID. You can expand that span to view the title text:

Amazon Product Page Title Inspection

Scrape that product title into a data dictionary inside the else statement. Use BeautifulSoup's find method since you expect only one matching element. The strip method removes leading and trailing whitespace from the scraped text. Print the dictionary to view the extracted data:

scraper.py
# ...
else:

    # ...

    # create a dictionary to store scraped product data
    data = {
        "Name": soup.find(id="title").text.strip(),
    }

    # print the extracted data
    print(data)

The above outputs the product page's title, as shown:

Output
{
    'Name': 'MageGee Portable 60% Mechanical Gaming Keyboard, MK-Box LED Backlit Compact 68 Keys Mini Wired Office Keyboard with Red Switch for Windows Laptop PC Mac - Black/Grey'
}

Locate and Scrape Product Price

Let's also inspect the price element to get an idea of its selector layout before scraping it. Since the bolder price is a Prime deal price, not the regular one, you want to scrape the actual listing price instead.

Right-click on the listing price element and select Inspect. The listing price element is in a span tag with the class name a-offscreen:

Amazon Product Page Price Inspection

Extract the listing price by pointing BeautifulSoup to its class. Modify the data dictionary, as shown below:

scraper.py
# ...
else:

    # ...

    # create a dictionary to store scraped product data
    data = {
        # ...,
        "Price": soup.find(class_="a-offscreen").text.strip(),
    }

The above code adds the listing price to the output as expected. You might wonder why the extracted value differs from the one on the website. That's because the scraped price data includes the percentage discount:

Output
{
    # ...,
    'Price': '$24.94'
}
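
If you plan to analyze prices numerically later, you can optionally convert the scraped price text into a number. This small helper is just an illustration and isn't required for the rest of the tutorial:

Example
# convert a price string like "$24.94" into a float (illustrative helper)
def parse_price(price_text: str) -> float:
    return float(price_text.replace("$", "").replace(",", "").strip())

print(parse_price("$24.94"))  # 24.94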

You've now scraped the product's listing price. Next on our list is the rating count.

Locate and Scrape the Rating Count

To extract the number of people who have rated the target product, let's first inspect the target element as we did earlier (right-click the rating count and select Inspect).

You'll see that it's in a span tag with an ID of acrCustomerReviewText:

Amazon Product Page Rating Count Inspection

That's easy to scrape since it uses an ID. Add the following line to your data dictionary to extract the rating count:

scraper.py
# ...
else:

    # ...

    # create a dictionary to store scraped product data
    data = {
        # ...,
        "Rating count": soup.find(id="acrCustomerReviewText").text.strip(),
    }

The above code updates the extracted data with the rating count:

Output
{
    # ...,
    'Rating count': '7,380 ratings'
}

Next, you'll extract the product's image URLs and description.

Scrape the Product Images

This task has two parts. First, you'll extract the featured image URL. Then, you'll scrape the supporting images in the vertical grid.

Let's start with the featured image. Again, inspect the image element (right-click the featured image and select Inspect). The image is inside a div tag with an ID of imgTagWrapperId:

Amazon Product Page Featured Image Inspection

Extend the data dictionary with logic that extracts the src attribute from the image inside that div:

scraper.py
else:

    # ...

    # create a dictionary to store scraped product data
    data = {
        # ...,
        "Featured image": soup.find(id="imgTagWrapperId").find("img").get("src"),
    }

The code adds the featured image URL to the extracted data. See the result below:

Output
{
    # ...,
    'Featured image': 'https://m.media-amazon.com/images/I/618zZ7u3sUL.__AC_SX300_SY300_QL70_ML2_.jpg'
}

The second task is to extract the URLs of the alternative images in the vertical grid. 

Right-click the vertical image grid and select Inspect. You'll see that the image grid is inside a div tag with the class name altImages. This element contains an unordered list of image elements:

Amazon Product Page Alternative Images Inspection

To scrape the images, extract all the image tags (img) from the parent element (div) and loop through them to get their src attributes into a separate list. Add this logic before the data dictionary:

scraper.py
# ...

else:

    # ...

    # extract the parent image element
    images = soup.find(id="altImages").find_all("img")

    # create an empty list to collect the smaller images
    image_data = []

    # loop through the image container to extract its image URLs
    for image in images:
        image_data.append(image.get("src"))

Add the list of the scraped image URLs to the data dictionary like so:

scraper.py
    # create a dictionary to store scraped product data
    data = {
        # ...,
        "Alternative images": image_data,
    }

The above code updates the extracted data with a list of the alternative images:

Output
{
    # ...,

    'Alternative images': [
        'https://m.media-amazon.com/images/I/41S5pwGovuL._AC_US40_.jpg',
        'https://m.media-amazon.com/images/I/41HOY7Hp4zL._AC_US40_.jpg',

        # ... omitted for brevity, 
       
        'https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V192234675_.gif'
    ]
}

You've just scraped product images from Amazon! Let's extract one last piece of information before writing the data to a CSV file. 

Locate and Scrape Product Description

Place your cursor on the product description (the "About this item" section), then right-click and select Inspect.

Each description is a list item (li) under an unordered list tag (ul). Expand the span tags inside one of the list items, and you'll see the wrapped description text.

Amazon Product Page Description Inspection

To extract all the description texts, get the unordered list element using its class names. In this case, we've used all the class names to avoid conflicts with similar elements. Next, define an empty list to collect each text. Then, loop through the list items (li) inside the parent ul and extract each description text into that list:

scraper.py
# ...

else:

    # ...

    # find the element containing product descriptions
    descriptions = soup.find(class_="a-unordered-list a-vertical a-spacing-mini")

    # create an empty list to collect the descriptions
    description_data = []

    # collect and store all product description texts
    for description in descriptions.find_all("li"):
        description_data.append(description.text.strip())

Add the extracted description text list to the data dictionary like so:

scraper.py
# ...

else:

    # ...

    # create a dictionary to store scraped product data
    data = {
        # ...,
        "Description": description_data,
    }

The code extracts the description texts as shown below:

Output
{
    # ...,
    'Description': [
        'Mini portable 60% compact layout: MK-Box is ...',

        # ... omitted for brevity,

        'Extensive compatibility: MageGee MK-Box mechanical...'
    ]
}

That's it! You've completed the initial tasks. 

Let's combine all the snippets to see what the complete code looks like:

scraper.py
# import the required libraries
import requests
from bs4 import BeautifulSoup

# specify your custom User Agent
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
}

# specify the target URL
target_url = (
    "https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/dp/B098LG3N6R/"
)

# send a get request to the target url with a custom User Agent
response = requests.get(target_url, headers=custom_headers)

# check if the response status code is not 200 (ok)
if response.status_code != 200:
    # print an error message with the status code
    print(f"An error occured with status {response.status_code}")
else:
    # get the page html content
    html_content = response.text

    # parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, "html.parser")

    # extract the parent image element
    images = soup.find(id="altImages").find_all("img")

    # create an empty list to collect the smaller images
    image_data = []

    # loop through the image container to extract its image URLs
    for image in images:
        image_data.append(image.get("src"))

    # find the element containing product descriptions
    descriptions = soup.find(class_="a-unordered-list a-vertical a-spacing-mini")

    # create an empty list to collect the descriptions
    description_data = []

    # collect and store all product description texts
    for description in descriptions.find_all("li"):
        description_data.append(description.text.strip())

    # create a dictionary to store scraped product data
    data = {
        "Name": soup.find(id="productTitle").text.strip(),
        "Price": soup.find(class_="a-offscreen").text.strip(),
        "Rating count": soup.find(id="acrCustomerReviewText").text.strip(),
        "Featured image": soup.find(id="imgTagWrapperId").find("img").get("src"),
        "Alternative images": image_data,
        "Description": description_data,
    }

    # print the extracted data
    print(data)

Run the above code, and you'll get the following combined output:

Output
{
    'Name': 'MageGee Portable 60% Mechanical Gaming ... for Windows Laptop PC Mac - Black/Grey',
    'Price': '$24.94',
    'Rating count': '7,380 ratings',
    'Featured image': 'https://m.media-amazon.com/images/I/618zZ7u3sUL.__AC_SX300_SY300_QL70_ML2_.jpg',
    'Alternative images': [
        'https://m.media-amazon.com/images/I/41S5pwGovuL._AC_US40_.jpg',
        'https://m.media-amazon.com/images/I/41HOY7Hp4zL._AC_US40_.jpg',

        # ... omitted for brevity,  
        'https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V192234675_.gif'
    ],

    'Description': [
            'Mini portable 60% compact layout: MK-Box is ...',
   
            # ... omitted for brevity,
   
            'Extensive compatibility: MageGee MK-Box mechanical...'
    ]
}

You've now scraped product information from the target Amazon page. Let's store the data in a CSV file.

Step #4: Export to CSV

In this step, you'll export the extracted data to a CSV file, ensuring you can access it later for further analysis.

Add Python's built-in csv package to your imports. Then, update the previous code to write the data to a products.csv file:

scraper.py
# ...
import csv

# ...
 
else:
    # ...
   
    # define the CSV file name for storing scraped data
    csv_file = "products.csv"

    # open the CSV file in write mode with proper encoding
    with open(csv_file, mode="w", newline="", encoding="utf-8") as file:
        # create a CSV writer object
        writer = csv.writer(file)

        # write the header row to the CSV file
        writer.writerow(data.keys())

        # write the data row to the CSV file
        writer.writerow(data.values())

    # print a confirmation message after successful data extraction and storage
    print("Scraping completed and data written to CSV")

After updating the previous scraper code with the one above, you'll get the following final code:

scraper.py
# import the required libraries
import requests
from bs4 import BeautifulSoup
import csv

# specify your custom User Agent
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
}

# specify the target URL
target_url = (
    "https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/dp/B098LG3N6R/"
)

# send a get request to the target url with a custom User Agent
response = requests.get(target_url, headers=custom_headers)

# check if the response status code is not 200 (ok)
if response.status_code != 200:
    # print an error message with the status code
    print(f"An error occured with status {response.status_code}")
else:
    # get the page html content
    html_content = response.text

    # parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, "html.parser")

    # extract the parent image element
    images = soup.find(id="altImages").find_all("img")

    # create an empty list to collect the smaller images
    image_data = []

    # loop through the image container to extract its image URLs
    for image in images:
        image_data.append(image.get("src"))

    # find the element containing product descriptions
    descriptions = soup.find(class_="a-unordered-list a-vertical a-spacing-mini")

    # create an empty list to collect the descriptions
    description_data = []

    # collect and store all product description texts
    for description in descriptions.find_all("li"):
        description_data.append(description.text.strip())

    # create a dictionary to store scraped product data
    data = {
        "Name": soup.find(id="productTitle").text.strip(),
        "Price": soup.find(class_="a-offscreen").text.strip(),
        "Rating count": soup.find(id="acrCustomerReviewText").text.strip(),
        "Featured image": soup.find(id="imgTagWrapperId").find("img").get("src"),
        "Alternative images": image_data,
        "Description": description_data,
    }

    # define the CSV file name for storing scraped data
    csv_file = "products.csv"

    # open the CSV file in write mode with proper encoding
    with open(csv_file, mode="w", newline="", encoding="utf-8") as file:
        # create a CSV writer object
        writer = csv.writer(file)

        # write the header row to the CSV file
        writer.writerow(data.keys())

        # write the data row to the CSV file
        writer.writerow(data.values())

    # print a confirmation message after successful data extraction and storage
    print("Scraping completed and data written to CSV")

The above code exports the extracted data to a products.csv file. You'll find this file in your project root folder:

Amazon Product CSV Output

Congrats! You've just built an Amazon scraper that exports data to a CSV. 

You can scale this further by scraping product listings from a search result page and even crawling paginated pages. Let's do that in the next section.

Scrape Product Listings

So far, you've only scraped a single keyboard product page. In most cases, though, you'll want to scrape many similar products. To see all available listings for a specific product, you have to search for it via Amazon's search bar.

For instance, searching for "keyboards" returns many available keyboard brands, as shown below:

Amazon Product Listing Page

Look at your browser's address bar; the "keyboards" query keyword appears in the URL. Here's a formatted version of the URL:

Example
https://www.amazon.com/s?k=keyboards

In this section, you'll scrape the above Amazon keyboard product listing (search result page). Since the products break into multiple pages, you'll also see how to handle pagination to crawl several pages.
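
If you later want to target a different search term, you can build the search URL programmatically instead of hardcoding it. Here's a quick sketch using Python's standard library (the keyword is just an example):

Example
from urllib.parse import urlencode

# build an Amazon search URL for any keyword
keyword = "mechanical keyboards"
search_url = "https://www.amazon.com/s?" + urlencode({"k": keyword})
print(search_url)  # https://www.amazon.com/s?k=mechanical+keyboards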

Scrape Search Pages

Scraping Amazon's search result page is similar to how you scraped a product page. In this case, the only difference is that many different products share identical selectors.

When you open the target listing page in your browser, you'll see that each product's title links to its product page. It means we can extract each product's URL from its title. 

Let's inspect the first product to view each product's element structure. Right-click the first product's title and click Inspect. Each product's URL (a tag) is a node of its title text (h2 tag).

Amazon First Product Listing Inspection

You'll loop through the h2 tags to scrape the product links. Look closely at the link attached to the a tag, and you'll see that it's a relative URL that doesn't include Amazon's base URL. You'll need to concatenate Amazon's base URL with the extracted links to get complete URLs.

Let's see how to achieve that.

Amazon's listing pages tend to block requests that originate outside of a browser or without a trusted source. To increase the chances of success, add a Google referrer header to the previous request headers:

Example
# ...

# specify your custom request headers
custom_headers = {
    # ...,
    "Referer": "https://www.google.com/",
}

Remember that you'll need to concatenate each extracted URL with the website's base URL. Specify Amazon's base URL followed by the listing page URL. Then, define an empty list to collect the scraped links:

Example
# ...

# specify the base URL
base_url = "https://www.amazon.com"

# specify the target URL
target_url = "https://www.amazon.com/s?k=keyboards"

# define an empty list to collect extracted links
listing_data = []

All the other setup remains the same. However, you'll modify the else statement with new scraping logic. Extract all the h2 tags:

Example
# ...

else:

    # ...

    # extract all h2s with the product link
    listings = soup.find_all(
        "h2", class_="a-size-mini a-spacing-none a-color-base s-line-clamp-2"
    )

Loop through the extracted listings and collect the attached product links. Use a condition to check whether each extracted link already starts with the protocol (https); if not, concatenate it with the base URL to form a complete product link. Finally, append each link to the list and print it:

Example
# ...

else:

    # ...

    # loop through the listings (h2s) to extract product URLs
    for link in listings:
        # find the href attribute in each h2
        data = link.find("a", href=True).get("href")

        # concatenate relative links with the base URL to form a complete URL
        if not data.startswith("https"):
            data = base_url + data

        # add the complete product link to the list
        listing_data.append(data)

    # output the links
    print(listing_data)

Merge all the snippets, and you'll get the following final code:

Example
# import the required libraries
import requests
from bs4 import BeautifulSoup

# specify your custom request headers
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Referer": "https://www.google.com/",
}

# specify the base URL
base_url = "https://www.amazon.com"

# specify the target URL
target_url = "https://www.amazon.com/s?k=keyboards"

# define an empty list to collect extracted links
listing_data = []

# send a get request to the target url with a custom User Agent
response = requests.get(target_url, headers=custom_headers)

# check if the response status code is not 200 (ok)
if response.status_code != 200:
    # print an error message with the status code
    print(f"An error occured with status {response.status_code}")
else:
    # get the page html content
    html_content = response.text

    # parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, "html.parser")

    # extract all h2s with the product link
    listings = soup.find_all(
        "h2", class_="a-size-mini a-spacing-none a-color-base s-line-clamp-2"
    )

    # loop through the listings (h2s) to extract product URLs
    for link in listings:
        # find the href attribute in each h2
        data = link.find("a").get("href")

        # concatenate relative links with the base URL to form a complete URL
        if not data.startswith("https"):
            data = base_url + data

        # add the complete product link to the list
        listing_data.append(data)
    # output the links
    print(listing_data)

The above code scrapes all the product links on the first Amazon search result page, as shown:

Output
[
    'https://www.amazon.com/XVX-Mechanical-Swappable-Pre-lubed-Stabilizer/dp/B0C9ZJHQHM/ref=sr_1_1?dib=eyJ2IjoiMSJ9.dHakaxTjnRzg31pEFYNyuCQOrOuf0dF5Z3wzHd8FIltiED2FP5xDcJ0rZ9-gGbs3tFwy-pOukbdWpOulziCJBihwSpbhcVckZsRYV4b_3-cF9EFNBLrt7oqUEUq7cbxGN_CsFUiKqkwZQ5gEutgWo29iKYdIU_oGVPVPGbVlrBvEbIOHNxx-hgrqwJHNp-ByLYeCdpX0hU3G9UQ9mymx68pJtpskSALL1ZGD5jzoAA8.f-tBF7mVmELtg06oXl52cfAHVRGwU47xV4xQv79hyn0&dib_tag=se&keywords=keyboards&qid=1721252373&sr=8-1',

    # ... other links omitted for brevity,

    'https://www.amazon.com/sspa/click?ie=UTF8&spc=MTo4MjAxNTA2OTAzMDgwMjU2OjE3MjEyNTIzNzM6c3BfYnRmOjMwMDEwMzcyNjI2NDQwMjo6MDo6&url=%2FASUS-II-Switch-Dampening-Hot-Swappable-PBT%2Fdp%2FB0C7KFZ5TL%2Fref%3Dsr_1_22_sspa%3Fdib%3DeyJ2IjoiMSJ9.dHakaxTjnRzg31pEFYNyuCQOrOuf0dF5Z3wzHd8FIltiED2FP5xDcJ0rZ9-gGbs3tFwy-pOukbdWpOulziCJBihwSpbhcVckZsRYV4b_3-cF9EFNBLrt7oqUEUq7cbxGN_CsFUiKqkwZQ5gEutgWo29iKYdIU_oGVPVPGbVlrBvEbIOHNxx-hgrqwJHNp-ByLYeCdpX0hU3G9UQ9mymx68pJtpskSALL1ZGD5jzoAA8.f-tBF7mVmELtg06oXl52cfAHVRGwU47xV4xQv79hyn0%26dib_tag%3Dse%26keywords%3Dkeyboards%26qid%3D1721252373%26sr%3D8-22-spons%26sp_csd%3Dd2lkZ2V0TmFtZT1zcF9idGY%26psc%3D1'
]

You've just extracted product links from the first Amazon search result page. Great job! Let's modify this scraper to follow more pages through pagination.

Handle Pagination

The previous scraper only extracts the product links from the first listing page. However, Amazon breaks the listings into several pages. In this part of the tutorial, you'll follow each page to scrape more product links.

As usual, let's inspect the next button element in the navigation bar. Scroll down the listing page, right-click the next button on the navigation bar, and then click Inspect. Although the next page link has many class names, we'll use s-pagination-next since it's more descriptive:

Amazon Listing Page Navigation Bar Inspection

Navigate to the last page (20) and observe the next page element in the inspection tab: it no longer contains a link. This means there are no more pages to crawl. You'll use this behavior to terminate crawling once your scraper reaches the last page.

To extract the product links from all pages, implement logic to iteratively check for the presence of the next page link in the DOM and terminate crawling once it's gone. 

First, insert all the previous logic into a while loop. Then, instead of the previous else statement, use a break to stop the loop if the request fails. Here's the modification:

Example
# ...

while True:
    # ...

    # check if the response status code is not 200 (ok)
    if response.status_code != 200:
        # ...
        break

Add Python's time module to your imports. Find the next page link and check if it exists. Like the previously extracted links, the next page URL doesn't include the base URL, so concatenate it with Amazon's base URL. Then, implement a 3-second pause to reduce the request frequency and the chances of getting blocked:

Example
# import the required libraries
# ... 
import time

# ...

while True:
    # ...

    # find the next page link
    next_page = soup.find("a", class_="s-pagination-next")

    # check if next page exists and follow its URL if so
    if next_page:
        next_link = next_page.get("href")

        # concatenate the next link with the base URL if it's relative
        if not next_link.startswith("https"):
            next_link = base_url + next_link

        # set the next page as the new target URL
        target_url = next_link

        # pause for 3 seconds before making the next request
        time.sleep(3)

Break the while loop once the next page link disappears from the DOM:

Example
# ...

while True:
    # ...
   
    else:
        print("No more next page")
        # break the loop after following the pages
        break

Finally, import Python's csv library. Then, export the extracted links to a product_links.csv file:

Example
# import the required libraries
# ...
import csv

# ...

# define the CSV file name for storing scraped data
csv_file = "product_links.csv"

# write the collected links to a CSV file
with open(csv_file, "w", newline="") as csvfile:
    csvwriter = csv.writer(csvfile)
    # write the header
    csvwriter.writerow(["Product URL"])
    # write the data
    for link in listing_data:
        csvwriter.writerow([link])
       
# print a confirmation message after successful data extraction and storage
print("Data written to product_links.csv")

Now, combine the snippets. Here's your final code:

Example
# import the required libraries
import requests
from bs4 import BeautifulSoup
import time
import csv

# specify your custom request headers
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Referer": "https://www.google.com/",
}

# specify the base URL
base_url = "https://www.amazon.com"

# specify the target URL
target_url = "https://www.amazon.com/s?k=keyboards"


# define an empty list to collect extracted links
listing_data = []

while True:
    # send a get request to the target url with a custom User Agent
    response = requests.get(target_url, headers=custom_headers)

    # check if the response status code is not 200 (ok)
    if response.status_code != 200:
        # print an error message with the status code
        print(f"An error occurred with status {response.status_code}")
        break

    # get the page html content
    html_content = response.text

    # parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, "html.parser")

    # extract all h2s with the product link
    listings = soup.find_all(
        "h2", class_="a-size-mini a-spacing-none a-color-base s-line-clamp-2"
    )

    # loop through the listings (h2s) to extract product URLs
    for link in listings:
        # find the href attribute in each h2
        data = link.find("a").get("href")

        # concatenate relative links with the base URL to form a complete URL
        if not data.startswith("https"):
            data = base_url + data

        # add the complete product link to the list
        listing_data.append(data)

    # find the next page link
    next_page = soup.find("a", class_="s-pagination-next")

    # check if next page exists and follow its URL if so
    if next_page:
        next_link = next_page.get("href")

        # concatenate the next link with the base URL if it's relative
        if not next_link.startswith("https"):
            next_link = base_url + next_link

        # set the next page as the new target URL
        target_url = next_link

        # pause for 3 seconds before making the next request
        time.sleep(3)
    else:
        print("No more next page")
        # break the loop after following the pages
        break

# define the CSV file name for storing scraped data
csv_file = "product_links.csv"

# write the collected links to a CSV file
with open(csv_file, "w", newline="") as csvfile:
    csvwriter = csv.writer(csvfile)
    # write the header
    csvwriter.writerow(["Product URL"])
    # write the data
    for link in listing_data:
        csvwriter.writerow([link])

# print a confirmation message after successful data extraction and storage
print("Data written to product_links.csv")

After crawling all pages, the code writes the extracted links to a CSV file. You'll find this file in your project root folder:

Amazon Extracted Link CSV Output

Great job! You've just crawled and extracted product link data from 19 Amazon listing pages using Python's Requests and BeautifulSoup. 

Still, you must be aware of a few challenges while scraping Amazon. We'll discuss them in the next section.

Challenges and Solutions for Amazon Web Scraping

Extracting data from Amazon is not an easy task. Let's take a look at the challenges you're most likely to encounter.

Blocks and Bans

Amazon is well-protected, considering it's an e-commerce website with many people looking to get product data. Some of its anti-bot mechanisms include CAPTCHAs, invisible bot detection challenges, behavioral analysis, and more. 
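
One practical tip: when Amazon blocks a request, it often serves a robot-check or CAPTCHA page instead of the product HTML. A rough, illustrative way to detect that in your scraper is to look for tell-tale markers in the response (the markers and status codes below are heuristics, not an official list):

Example
# illustrative check for a blocked or robot-check response
def looks_blocked(response) -> bool:
    markers = ["captcha", "robot check", "automated access"]
    body = response.text.lower()
    return response.status_code in (403, 503) or any(marker in body for marker in markers)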

It can be difficult to bypass these security measures, especially if you're running multiple Amazon scraping instances. The most reliable approach to avoid getting blocked is to use a web scraping tool like ZenRows.

ZenRows bypasses all anti-bot mechanisms under the hood and allows you to focus on your scraping logic. We'll explain more below.

Changes in Page Layout

Amazon frequently modifies its DOM structure, including CSS selectors. Such changes often break your previous parsers, causing your scraper to fail. 

As a remedy, ensure you monitor the web page frequently for changes in DOM layout or HTML attributes and update your code regularly. To make your job easier, consider separating your CSS selectors from your scraping logic to make them easily editable. 
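
As a rough illustration of that idea, the sketch below keeps all selectors in a single mapping so a layout change only requires editing the mapping, not the scraping logic. The structure and names are just an example:

Example
from bs4 import BeautifulSoup

# keep all selectors in one place, away from the scraping logic
SELECTORS = {
    "Name": {"id": "productTitle"},
    "Price": {"class_": "a-offscreen"},
    "Rating count": {"id": "acrCustomerReviewText"},
}

def extract_product_data(soup: BeautifulSoup) -> dict:
    # look up each field with its configured selector
    data = {}
    for field, selector in SELECTORS.items():
        element = soup.find(**selector)
        data[field] = element.text.strip() if element else None
    return data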

You can also scrape Amazon with Scrapy, a powerful Python framework, so feel free to check our guide.

A Surefire Way to Scrape Amazon With Python

A web scraping API is the best solution for scraping any website without getting blocked. It's compatible with any programming language, and you can implement it within a few minutes. Another advantage of using a web scraping API is that it always works despite changes in anti-bot measures.

ZenRows is one of the most popular web scraping APIs that reliably bypass any blocks. It features an Amazon web scraper explicitly designed to extract the correct data from Amazon without hassle.

To try it out with the previous product page, sign up to load the ZenRows Request Builder. Paste the product URL in the link box and activate Premium Proxies and JS Rendering. Choose Python as your programming language and select the API connection mode. Then, copy and paste the generated code into your Python file:

building a scraper with zenrows

The generated code should look like this:

Example
# pip install requests
import requests

url = (
    "https://www.amazon.com/Portable-Mechanical-Keyboard-MageGee-Backlit/dp/B098LG3N6R/"
)
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "premium_proxy": "true",
    "autoparse": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)

The above code outputs the following JSON data:

Output
{
    "answers": "Search this page",
    "availability": "In Stock",
    "avg_rating": "4.4 out of 5 stars",
    "category": "Video Games > PC > Accessories > Gaming Keyboards",
    "discount": "-35%-19%",
    "out_of_stock": false,
    "price": "$24.94",
    "title": "MageGee Portable 60% Mechanical Gaming Keyboard, MK-Box LED Backlit Compact 68 Keys Mini Wired Office Keyboard with Red Switch for Windows Laptop PC Mac - Black/Grey",
    "features": [
        {"Product Dimensions": "12.13 x 3.98 x 1.54 inches"},
        {"Item Weight": "1.5 pounds"},
        {"Manufacturer": "MageGee"},

        # ... omitted for brevity,
    ],
    "ratings": [
        {"5 star": "67%"},
        {"4 star": "17%"},

        # ... omitted for brevity,
    ],
    "images": ["https://m.media-amazon.com/images/I/618zZ7u3sUL._AC_SL1500_.jpg"],
}

Congratulations! You've just scraped data from Amazon with ease using ZenRows. Your scraper will now bypass potential and active anti-bot measures.

Conclusion

In this tutorial, you've seen how to scrape Amazon product pages and listings using Requests and BeautifulSoup in Python. Here's a summary of what you've learned:

  • Get the full-page HTML of an Amazon product page.
  • Scrape specific product details from an Amazon product page.
  • Extract data from Amazon's search result pages.
  • Crawl several Amazon listings by implementing pagination with the Requests library.
  • Export the extracted data to a CSV file.
  • Understand the challenges of scraping data from Amazon.

All things considered, web scraping Amazon can be challenging. We recommend using the ZenRows Amazon scraper, which gets you all the data you need without stress.

Try ZenRows for free now without a credit card!

Frequent Questions

How Does Amazon Detect Scraping?

Amazon detects scraping activities by checking your IP address, browser parameters, User Agents, and referrer header, among other details. Once it flags you as a bot, the website will throw a CAPTCHA. If your scraper can't solve the CAPTCHA puzzle, Amazon may block your IP address.

Does Amazon Allow Web Scraping?

Definitely! But there's a caveat: Amazon uses rate-limiting and can block your IP address if you overburden the website. It also checks HTTP headers and blocks you if your activity seems suspicious.

If you try to crawl through multiple pages simultaneously without using rotating proxies, you can get blocked. Amazon's web pages also have different structures, and even different product pages have different HTML structures. Building a robust web crawling application can take a lot of work.
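
As a rough sketch of the rotating-proxy idea, you could pick a different proxy for each request. The addresses below are placeholders that won't work as-is; in practice, you'd plug in a pool of reliable (ideally residential) proxies:

Example
import random
import requests

# placeholder proxy addresses; replace them with working proxies
proxy_pool = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
]

def get_with_random_proxy(url, headers=None):
    # choose a proxy at random for each request
    proxy = random.choice(proxy_pool)
    proxies = {"http": proxy, "https": proxy}
    return requests.get(url, headers=headers, proxies=proxies, timeout=30)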

That said, scraping publicly available data, such as product prices, reviews, and listings, is generally considered legal.
