Web Scraping Amazon: step-by-step tutorial with Python

October 25, 2022 · 9 min read

Web scraping Amazon is valuable for e-commerce businesses that need up-to-date information about their competitors and the latest market trends.

What is Amazon Scraping?

Web scraping is the automated collection of data from websites, and Amazon is the world's biggest online shopping platform. So how can you utilize web scraping to gain an advantage over your competitors?

This practical tutorial will show how to scrape product information from Amazon using Python!

Does Amazon allow Web Scraping?

Definitely! But there's a caveat: Amazon uses rate limiting and can block your IP address if you overload the website. It also checks HTTP headers and will block you if your activity seems suspicious.

If you crawl multiple pages at the same time without proxy rotation, you can get blocked. On top of that, Amazon's pages don't share a single layout: even individual product pages can have different HTML structures, which makes building a robust web crawling application difficult.

Yet scraping product prices, reviews and listings is legal.

Why Web Scraping from Amazon?

Especially in e-commerce, it's essential to have as much information as possible. If you automate the extraction process, you can concentrate more on analyzing your competitors and growing your business.

These are some of the benefits:

Monitoring Products

You can look at the top-selling products in your category, which gives you the opportunity to identify current market trends and make better-informed product decisions. You'll also be able to spot products that are losing their top sales positions.

Monitoring Competitors

The best way to stay ahead is to monitor your competitors' moves. For example, scraping Amazon product prices helps you spot pricing trends and define a better pricing strategy, so you can track changes over time and conduct competitor product analysis.

How to Scrape Data From Amazon?

The easiest way to scrape Amazon is:
  1. Create a free ZenRows account.
  2. Open the Request Builder.
  3. Enter the target URL.
  4. Get the data in HTML or JSON.

ZenRows API handles automatic premium proxy rotation. It also has autoparse support, which automatically extracts everything available on the page! It supports Amazon, Instagram, YouTube, Indeed, and more. Learn more about available platforms!

We can use its Amazon scraper to automatically extract the data from the products page. Let's start by installing the ZenRows SDK:

pip install zenrows

It's possible to add extra configuration to make the customized API calls. You can learn more from the API documentation. Now, let's automatically scrape data from products!

from zenrows import ZenRowsClient 
 
client = ZenRowsClient(API_KEY) 
url = "https://www.amazon.com/Crockpot-Electric-Portable-20-Ounce-Licorice/dp/B09BDGFSWS" 
# It's possible to specify javascript rendering, premium proxies and geolocation. 
# You can check the API documentation for further customization. 
params = {"premium_proxy":"true","proxy_country":"us","autoparse":"true"} 
 
response = client.get(url, params=params) 
 
print(response.json())

Results:

{ 
	"avg_rating": "4.7 out of 5 stars", 
	"category": "Home & Kitchen › Kitchen & Dining › Storage & Organization › Travel & To-Go Food Containers › Lunch Boxes", 
	"description": "Take your favorite meals with you wherever you go! The Crockpot Lunch Crock Food Warmer is a convenient, easy-to-carry, electric lunch box. Plus, with its modern-vintage aesthetic and elegant Black Licorice color, it's stylish, too. It is perfectly sized for one person, and is ideal for carrying and warming meals while you’re on the go. With its 20-ounce capacity, this heated lunch box is perfect whether you're in the office, working from home, or on a road trip. Take your leftovers, soup, oatmeal, and more with you, then enjoy it at the perfect temperature! This portable food warmer features a tight-closing outer lid to help reduce spills, as well as an easy-carry handle, soft-touch coating, and detachable cord. The container is removable for effortless filling, carrying, and storage. Cleanup is easy, too: the inner container and lid are dishwasher-safe.", 
	"out_of_stock": false, 
	"price": "$29.99", 
	"price_without_discount": "$44.99", 
	"review_count": "14,963 global ratings", 
	"title": "Crockpot Electric Lunch Box, Portable Food Warmer for On-the-Go, 20-Ounce, Black Licorice", 
	"features": [ 
		{ 
			"Package Dimensions": "13.11 x 10.71 x 8.58 inches" 
		}, 
		{ 
			"Item Weight": "2.03 pounds" 
		}, 
		{ 
			"Manufacturer": "Crockpot" 
		}, 
		{ 
			"ASIN": "B09BDGFSWS" 
		}, 
		{ 
			"Country of Origin": "China" 
		}, 
		{ 
			"Item model number": "2143869" 
		}, 
		{ 
			"Best Sellers Rank": "#17 in Kitchen & Dining (See Top 100 in Kitchen & Dining)	 #1 in Lunch Boxes" 
		}, 
		{ 
			"Date First Available": "November 24, 2021" 
		} 
	], 
	"ratings": [ 
		{ 
			"5 star": "82%" 
		}, 
		{ 
			"4 star": "10%" 
		}, 
		{ 
			"3 star": "4%" 
		}, 
		{ 
			"2 star": "1%" 
		}, 
		{ 
			"1 star": "2%" 
		} 
	] 
}
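The autoparse response is plain JSON, so consuming it from Python is straightforward. Here's a minimal sketch that pulls a few fields out of a response shaped like the one above; the dictionary below is a hard-coded subset, for illustration only:

```python
# A hard-coded subset of the autoparse output above, for illustration.
data = {
    "title": "Crockpot Electric Lunch Box, Portable Food Warmer for On-the-Go, 20-Ounce, Black Licorice",
    "price": "$29.99",
    "price_without_discount": "$44.99",
    "out_of_stock": False,
}

# Fields arrive as display strings, so numeric work needs a small conversion.
price = float(data["price"].lstrip("$"))
full_price = float(data["price_without_discount"].lstrip("$"))
discount_pct = round(100 * (1 - price / full_price))

print(f"{data['title'][:27]}... is {discount_pct}% off")
```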

As you can see, it's really easy to use ZenRows API!

But let's also see how to scrape the traditional way:

Web Scraping with BeautifulSoup

This article doesn't require advanced Python skills, but if you have just started learning web scraping, check out our Web Scraping with Python guide!

There are a few things worth noting about web scraping in Python:

It's possible to build request-based scripts, or to use browser automation tools such as Selenium for web scraping Amazon. But Amazon detects headless browsers rather easily, especially when you don't customize them.

Using frameworks like Selenium is also inefficient, as they tend to overload the system when you scale up your application.

Thus, we'll use the requests and BeautifulSoup modules in Python to scrape the Amazon products' data.

Let's start by installing the necessary packages with pip:

pip install requests beautifulsoup4

Avoid Getting Blocked While Web Scraping

As we mentioned before, Amazon does allow web scraping, but it blocks scrapers that send too many requests. It checks HTTP headers and applies rate limiting to block malicious bots.

It's rather easy to tell from the HTTP headers whether the client is a bot, so browser automation tools such as Selenium won't work unless you customize their default browser settings.

We'll use a real browser's User-Agent, and since we won't send requests too frequently, we won't get rate limited. You should use a proxy server if you want to scrape data at scale.
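To make this concrete, here's a small sketch of a session factory with browser-like headers and an optional proxy. The proxy URL here is a hypothetical placeholder; substitute your provider's actual address:

```python
from requests import Session

def make_browser_session(proxy_url=None):
    """Build a Session with minimal browser-like headers.

    proxy_url is a placeholder such as
    "http://user:pass@proxy.example.com:8080" (not a real proxy).
    """
    sess = Session()
    sess.headers.update({
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/106.0.0.0 Safari/537.36",
    })
    if proxy_url:
        # Route both plain and TLS traffic through the same proxy.
        sess.proxies = {"http": proxy_url, "https": proxy_url}
    return sess
```

The returned Session can then be used in place of a bare requests.get call, so every request carries the same headers and proxy configuration.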

Scraping Product Data From Amazon

In this article, we'll scrape data from this lunch box product:

Amazon Product
We'll extract the following:
  • Product title.
  • Rating.
  • Price.
  • Discount (if there's any).
  • Full price (if there's a discount, price will be the discounted price).
  • Stock status (out of stock or not).
  • Product description.
  • Related Items.

As mentioned, Amazon checks for default client settings, so sending HTTP requests with the standard configuration won't work. You can verify this by sending a simple request with Python:

import requests 
 
response = requests.get("https://www.amazon.com/Sabrent-4-Port-Individual-Switches-HB-UM43/dp/B00JX1ZS5O") 
print(response.status_code) # prints 503

The response has the status code 503 Service Unavailable. There's also this message in the response's HTML:

To discuss automated access to Amazon data please contact [email protected] 
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.

So, what can we do to bypass Amazon's protection system?

In any web scraping project, it's critical to imitate a real user. Thus, we can copy a real browser's HTTP headers and remove the unnecessary ones:

Browser request headers

We just need the accept, accept-encoding, accept-language, and user-agent headers. Amazon checks headers such as User-Agent to block suspicious clients.

Amazon also uses rate limiting, so your IP can get blocked if you send requests too frequently. You can learn more about rate limiting from our rate limit bypassing guide!
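A simple way to stay under the rate limit is to add a randomized delay before each request. This is a minimal sketch; the delay bounds are arbitrary, and the session can be any object with a get method:

```python
import random
import time

def polite_get(sess, url, min_delay=1.0, max_delay=3.0):
    """Sleep a random interval before fetching, to avoid bursty traffic."""
    time.sleep(random.uniform(min_delay, max_delay))
    return sess.get(url)
```

Randomizing the interval (instead of sleeping a fixed amount) makes the request pattern look less mechanical.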

Frustrated that your web scrapers get blocked again and again? ZenRows API handles rotating proxies and headless browsers for you.
Try for FREE

First, we'll define a class for our scraper:

from bs4 import BeautifulSoup 
from requests import Session 
 
class Amazon: 
	def __init__(self): 
		self.sess = Session() 
		self.headers = { 
			"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 
			"accept-encoding": "gzip, deflate, br", 
			"accept-language": "en", 
			"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36", 
		} 
		self.sess.headers = self.headers

The Amazon class creates a Session object, which keeps track of our connection state across requests. The custom user-agent and other headers are set as the Session's default headers. Since we'll need to fetch the webpage, let's define a get method:

class Amazon: 
	#.... 
	def get(self, url): 
		response = self.sess.get(url) 
 
		assert response.status_code == 200, f"Response status code: {response.status_code}" 
 
		splitted_url = url.split("/") 
		self.id = splitted_url[-1].split("?")[0] 
		self.product_page = BeautifulSoup(response.text, "html.parser") 
 
		return response

This method connects to the website using the Session above. If the returned status code is not 200, that means there's an error, so we raise an AssertionError.

If the request is successful, then we split the request URL and extract the product id. The BeautifulSoup object is defined for the product page. It'll let us search HTML elements and extract data from them.
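The ID-extraction step can also be written as a small standalone helper. This sketch uses urllib.parse instead of the raw string split above; extract_asin is a hypothetical name, not part of the tutorial's class:

```python
from urllib.parse import urlparse

def extract_asin(url):
    """Return the last path segment of a product URL, i.e. the ASIN.

    urlparse separates the query string for us, so there's
    no need to split on "?" manually.
    """
    path = urlparse(url).path
    return path.rstrip("/").split("/")[-1]

print(extract_asin(
    "https://www.amazon.com/Crockpot-Electric-Portable-20-Ounce-Licorice/dp/B09BDGFSWS?th=1"
))  # B09BDGFSWS
```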

Now, we'll define the data_from_product_page method. This will extract the relevant information from the product's page. We can use the find method and specify the element attributes to search for an HTML element.

How do we know what to search for, though?

By using the browser's developer tools! They let us find out which elements and attributes we need.

We can right-click and inspect the element:

Product title selector

As we can see, the product title is in a span element with productTitle as its id attribute. So let's get it!

# As the search result returns another BeautifulSoup object, we use "text" to 
# extract data inside of the element. 
# we will also use strip to remove whitespace at the start and end of the words 
title = self.product_page.find("span", attrs={"id": "productTitle"}).text.strip()

To extract the ratings, we can use the same method! When you inspect the average rating part, you'll see an i element at first. Instead, we'll use the span element a little above it (indicated by the green rectangle), because its title attribute already contains the average rating text and it has a unique class value.

We'll search for span elements that have the reviewCountTextLinkedHistogram class. Instead of extracting the text inside the element, let's use the title attribute:

Product rating selector

Let's add the code to our script:

# The element's attributes are read as a dictionary, so we can get the title by passing its key 
# of course, we will also use strip here 
rating = self.product_page.find("span", attrs={"class": "reviewCountTextLinkedHistogram"})["title"].strip()
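Both find calls above raise an exception if the element is missing (calling .text on None, or a KeyError on a missing attribute). A small defensive helper, shown here on a toy HTML snippet, returns None instead; safe_find is a hypothetical name, not a BeautifulSoup method:

```python
from bs4 import BeautifulSoup

def safe_find(soup, name, attrs, attribute=None):
    """Return the stripped text (or a given attribute) of the first match, or None."""
    el = soup.find(name, attrs=attrs)
    if el is None:
        return None
    if attribute is not None:
        return el.get(attribute)
    return el.text.strip()

# A toy page, just to demonstrate the helper.
soup = BeautifulSoup('<span id="productTitle"> Demo Product </span>', "html.parser")
print(safe_find(soup, "span", {"id": "productTitle"}))  # Demo Product
print(safe_find(soup, "span", {"id": "missing"}))       # None
```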

For the discount, there are lots of classes!

Product discount selector

The reinventPriceSavingsPercentageMargin and savingsPercentage classes are unique to the discount element, so we'll use those to search for the discount. When there's a discount, the original list price is also shown below the discounted price:

Product discount list price

I couldn't find any other element with the a-price and a-text-price classes, so let's use them to find the actual listing price. Of course, since it's uncertain whether there's a discount, we extract that price only if there is one:

# get the element, "find" returns None if the element could not be found 
discount = self.product_page.find("span", attrs={"class": "reinventPriceSavingsPercentageMargin savingsPercentage"}) 
 
# if the discount is found, extract the total price 
# else, just set the total price as the price found above and set discount = False 
if discount: 
	discount = discount.text.strip() 
	price_without_discount = self.product_page.find("span", attrs={"class": "a-price a-text-price"}).text.strip() 
else: 
	price_without_discount = price 
	discount = False
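The prices extracted above are display strings like "$29.99". If you plan to compare or chart them, a small parsing helper (hypothetical, not part of the tutorial's class) converts them to numbers:

```python
def parse_price(text):
    """Convert a price string like '$1,299.99' to a float, or None if it can't be parsed."""
    if not text:
        return None
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

print(parse_price("$44.99"))     # 44.99
print(parse_price("$1,299.00"))  # 1299.0
```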

To check whether the product is out of stock, we can see whether the highlighted section exists on the page.

Out of stock product

Let's inspect the box:

Out of stock selector

We need to check if a div element with id=outOfStock exists. Let's add this snippet to our script:

# simply check if the item is out of stock or not 
out_of_stock = self.product_page.find("div", {"id": "outOfStock"}) 
if out_of_stock: 
	out_of_stock=True 
else: 
	out_of_stock=False

All right! We made it this far; only the product description and related items are left, and then we're done!

The product description is stored in the div element, which has id="productDescription":

Product description selector

We can easily extract the description:

description = self.product_page.find("div", {"id": "productDescription"}).text

Nice!

Most of the time, you'll also want to crawl related items: a proper analysis needs data about other products too. You can store the related items' links and extract information from them later on.

As you can see below, the related items are stored in a carousel slider div, which we can locate by its a-carousel-viewport class. We'll get its li elements and collect their ASIN values. An ASIN is Amazon's standard identification number; we can use it to construct product URLs.

Product related items selector

Now, add the code to our script:

# extract the related items carousel's first page 
carousel = self.product_page.find("div", {"class": "a-carousel-viewport"}) 
related_items = carousel.find_all("li") 
 
related_item_asins = [item.find("div")["data-asin"] for item in related_items] 
# of course, we need to get item links 
related_item_links = [] 
for asin in related_item_asins: 
	link = "www.amazon.com/dp/" + asin 
	related_item_links.append(link)

What a ride! Now let's put together the full script:

from bs4 import BeautifulSoup 
from requests import Session 
 
class Amazon: 
	def __init__(self): 
		self.sess = Session() 
		self.headers = { 
			"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 
			"accept-encoding": "gzip, deflate, br", 
			"accept-language": "en", 
			"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36", 
		} 
		self.sess.headers = self.headers 
 
	def get(self, url): 
		response = self.sess.get(url) 
		 
		assert response.status_code == 200, f"Response status code: {response.status_code}" 
 
		splitted_url = url.split("/") 
		self.product_page = BeautifulSoup(response.text, "html.parser") 
		self.id = splitted_url[-1].split("?")[0] 
		 
		return response 
 
	def data_from_product_page(self): 
		 
		# As the search result returns another BeautifulSoup object, we use "text" to 
		# extract data inside of the element. 
		# we will also use strip to remove whitespace at the start and end of the words 
		title = self.product_page.find("span", attrs={"id": "productTitle"}).text.strip() 
 
		# The element's attributes are read as a dictionary, so we can get the title by passing its key 
		# of course, we will also use strip here 
		rating = self.product_page.find("span", attrs={"class": "reviewCountTextLinkedHistogram"})["title"].strip() 
 
		# First, find the element by specifying a selector with two options in this case 
		price_span = self.product_page.select_one("span.a-price.reinventPricePriceToPayMargin.priceToPay, span.a-price.apexPriceToPay") 
		# Then, extract the pricing from the span inside of it 
		price = price_span.find("span", {"class": "a-offscreen"}).text.strip() 
 
		# Get the element, "find" returns None if the element could not be found 
		discount = self.product_page.find("span", attrs={"class": "reinventPriceSavingsPercentageMargin savingsPercentage"}) 
		# If the discount is found, extract the total price 
		# Else, just set the total price as the price found above and set discount = False 
		if discount: 
			discount = discount.text.strip() 
			price_without_discount = self.product_page.find("span", attrs={"class": "a-price a-text-price"}).text.strip() 
		else: 
			price_without_discount = price 
			discount = False 
 
		# Simply check if the item is out of stock or not 
		out_of_stock = self.product_page.find("div", {"id": "outOfStock"}) 
		if out_of_stock: 
			out_of_stock=True 
		else: 
			out_of_stock=False 
		 
		# Get the description 
		description = self.product_page.find("div", {"id": "productDescription"}).text 
 
		# Extract the related items carousel's first page 
		carousel = self.product_page.find("div", {"class": "a-carousel-viewport"}) 
		related_items = carousel.find_all("li") 
 
		related_item_asins = [item.find("div")["data-asin"] for item in related_items] 
		# Of course, we need to return the product URLs 
		# So let's construct them! 
		related_item_links = [] 
		for asin in related_item_asins: 
			link = "www.amazon.com/dp/" + asin 
			related_item_links.append(link) 
 
		extracted_data = { 
			"title": title, 
			"rating": rating, 
			"price": price, 
			"discount": discount, 
			"price without discount": price_without_discount, 
			"out of stock": out_of_stock, 
			"description": description, 
			"related items": related_item_links 
		} 
		 
		return extracted_data

Let's try it using the product's link:

scraper = Amazon() 
scraper.get("https://www.amazon.com/Crockpot-Electric-Portable-20-Ounce-Licorice/dp/B09BDGFSWS") 
data = scraper.data_from_product_page() 
 
for k,v in data.items(): 
	print(f"{k}:{v}")

And the final output:

title:Crockpot Electric Lunch Box, Portable Food Warmer for On-the-Go, 20-Ounce, Black Licorice 
rating:4.7 out of 5 stars 
price:$29.99 
discount:False 
price without discount:$29.99 
out of stock:False 
description: 
 Take your favorite meals with you wherever you go! The Crockpot Lunch Crock Food Warmer is a convenient, easy-to-carry, electric lunch box. Plus, with its modern-vintage aesthetic and elegant Black Licorice color, it's stylish, too. It is perfectly sized for one person, and is ideal for carrying and warming meals while you’re on the go. With its 20-ounce capacity, this heated lunch box is perfect whether you're in the office, working from home, or on a road trip. Take your leftovers, soup, oatmeal, and more with you, then enjoy it at the perfect temperature! This portable food warmer features a tight-closing outer lid to help reduce spills, as well as an easy-carry handle, soft-touch coating, and detachable cord. The container is removable for effortless filling, carrying, and storage. Cleanup is easy, too: the inner container and lid are dishwasher-safe. 
related items:['www.amazon.com/dp/B09YXSNCTM', 'www.amazon.com/dp/B099PNWYRH', 'www.amazon.com/dp/B09B14821T', 'www.amazon.com/dp/B074TZKCCV', 'www.amazon.com/dp/B0937KMYT8', 'www.amazon.com/dp/B07T7F5GHX', 'www.amazon.com/dp/B07YJFB8GY']

Of course, Amazon's page structure is quite complex, so different product pages may have different structures. You'll need to repeat this inspection process for each layout to build a solid web scraper.
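One way to make the scraper more tolerant of layout differences is to try several extractors in order and fall back gracefully. This is a generic sketch; the helper name and the example selectors are illustrative, not from the tutorial:

```python
def first_successful(page, extractors):
    """Run each extractor in order; ignore failures and return the first non-None result."""
    for extract in extractors:
        try:
            value = extract(page)
        except (AttributeError, KeyError, TypeError):
            continue
        if value is not None:
            return value
    return None

# With a dict standing in for a parsed page, the first extractor
# raises KeyError and the second one supplies the value.
page = {"new_price_id": "$29.99"}
extractors = [
    lambda p: p["old_price_id"],   # fails on this "page"
    lambda p: p["new_price_id"],
]
print(first_successful(page, extractors))  # $29.99
```

In a real scraper, each extractor would be a different BeautifulSoup lookup for one layout variant of the same field.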

Congratulations! If you followed along, you've built your own Amazon product scraper!

That was quite a hassle, though. How about using a service designed specifically for web scraping?

Conclusion

Amazon's webpage structures are complex, and you may come across differently structured pages. You'll need to identify the differences and update your application accordingly.

You'll also need to randomize HTTP headers and use premium proxy servers while scaling up. Otherwise, you'd get detected easily.

In this article, you've learned:
  1. The importance of data scraping from Amazon.
  2. The methods Amazon uses to block web scrapers.
  3. Scraping product prices, descriptions, and more data from Amazon.
  4. How ZenRows can help you with its autoparse feature.

In the end, building and scaling efficient web scrapers is difficult. Even the script we've prepared will fail when the page structure changes. You can use ZenRows' web scraping API with autoparse support, which will automatically collect the most valuable parts of the data.

Did you find the content helpful? Spread the word and share it on Twitter, LinkedIn, or Facebook.


Want to keep learning?

We'll be sharing all the insights we've learned over the years in upcoming blog posts. If you don't want to miss a piece and keep learning, we'd be thrilled to have you in our newsletter.

No spam guaranteed. You can unsubscribe at any time.