How to Scrape JavaScript Rendered Web Pages with Python

November 3, 2022 · 8 min read

Ever tried scraping JavaScript rendered web pages with Python and hit a wall?

Well, that's understandable.

Scraping JavaScript rendered web pages can be difficult because the data on the web page loads dynamically.

There are also loads of web applications out there using frameworks like React.js, Angular and Vue.js, so there is a high chance that your request-based scraper will break while scraping JS rendered pages.

If you are looking to scrape JavaScript-generated content from these web pages, then the regular libraries and methods aren't enough.

In this tutorial, we'll be discussing how to scrape JavaScript rendered web pages with Python.

Let's dive right in!

Why Is Scraping JavaScript Rendered Web Pages Difficult?

When you send a request to a web page, the client downloads whatever content the server returns. With JavaScript rendered websites, that content is mostly an empty shell plus scripts. If the client supports JS, it runs the JavaScript code to populate the rendered HTML content.

That being said…

JavaScript rendered web pages don't really produce valuable static HTML content. Because of that, plain HTTP requests won't be enough, as the requested content must be populated first.

This means that you have to write code specifically for each website that you want to scrape which makes scraping JavaScript generated content difficult.

Of course, this isn't always the case. There are different ways of rendering the webpage:
  • Server rendering: it generates the full HTML for a page on the server in response to navigation. This avoids extra data fetching on the client since it's handled before the browser gets a response.

    Just like with a static website, we can extract information by sending simple HTTP requests, as the full content is returned from the server.
  • Static rendering: this happens at build-time and offers a fast experience for the users. The main downside is that individual HTML files must be generated for every possible URL.

    As we know, it's pretty easy to scrape data from static websites.
  • Client-side rendering: pages are rendered directly in the browser using JavaScript. The logic, data fetching, and routing are handled on the client rather than the server.

Nowadays, many modern web applications combine server and client-side rendering. This is often referred to as Universal Rendering.

Universal Rendering tries to combine client-side and server rendering to smooth over their disadvantages. It's also supported by popular frameworks such as React and Angular. For example, React can take the HTML rendered on the server and attach its event handlers to it in the browser, after which it updates the page dynamically. This is called hydration.
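To get a feel for the difference, one rough check is to look at how much visible text the raw HTML contains before any JavaScript runs. The following is a minimal sketch using only Python's standard library. The two sample documents, the `VisibleTextParser` class and the `visible_text` helper are made up for illustration and don't come from any real website:

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring the contents of script and style tags."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the text a plain HTTP client would see, without running JS."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical responses: a client-side rendered shell and a server-rendered page.
client_rendered = (
    '<html><body><div id="root"></div>'
    '<script src="/static/app.js"></script></body></html>'
)
server_rendered = "<html><body><ul><li>Sourdough</li><li>Rye</li></ul></body></html>"

print(repr(visible_text(client_rendered)))  # the shell has no visible text
print(repr(visible_text(server_rendered)))  # the product names are already there
```

If the raw response of your target page looks like the first sample, a request-based scraper has nothing to extract and the page needs one of the approaches described next.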

How to Scrape JavaScript Generated Content

There are different methods available to scrape JavaScript generated content from web pages, some of which include:
  • Using backend queries.
  • Using hidden data in the HTML script tag.

Using Backend Queries to scrape JavaScript rendered web pages

Sometimes frameworks such as React populate the web page by using backend queries. It's possible to make use of these API calls in your application to get the data directly from the server.

As it's not a guaranteed method, you'll need to check the requests made by your browser to find out if there's an available API backend. If there's one, then you can use the same settings with your custom queries to grab the data from the server.
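As a rough illustration, replaying such a backend query from Python could look like the sketch below. The endpoint `https://example.com/api/products`, the query parameters and the header values are all hypothetical placeholders. In practice you'd copy the real ones from the Network tab of your browser's DevTools:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint: replace it with the one your browser actually calls.
API_URL = "https://example.com/api/products"


def build_api_request(params, user_agent="Mozilla/5.0"):
    """Build a GET request that mimics the query the browser sends."""
    query = urlencode(params)
    headers = {"Accept": "application/json", "User-Agent": user_agent}
    return Request(f"{API_URL}?{query}", headers=headers)


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body returned by the server."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_api_request({"q": "bread", "page": 1})
print(request.full_url)
# data = fetch_json(request)  # only works against a real, reachable endpoint
```

Keeping the request building separate from the network call makes it easier to compare what you send against what the browser sends.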

Using Script Tags

It's possible to scrape JS rendered pages by using data hidden in a script tag, often in the form of JSON. However, this method might require a deep search, since you'll be checking the HTML tags in the loaded web page.

The JavaScript code and data for a dynamic web page can be found in the script tags and extracted using the BeautifulSoup Python package.
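Here's a minimal sketch of the idea. To keep it dependency-free, it uses the standard library's `html.parser` instead of BeautifulSoup, and it assumes the data sits in a `script` tag with `type="application/json"`. The sample HTML, the `__DATA__` id and the helper names are invented for illustration, and real pages may embed their data differently:

```python
import json
from html.parser import HTMLParser


class JsonScriptExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._chunks = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            try:
                self.payloads.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                # not valid JSON after all, so skip this script tag
                pass


def extract_json_payloads(html):
    parser = JsonScriptExtractor()
    parser.feed(html)
    parser.close()
    return parser.payloads


# Hypothetical page source with the data hidden in a script tag.
sample_html = """<html><body><div id="root"></div>
<script id="__DATA__" type="application/json">{"products": [{"name": "Sourdough", "price": "$4.99"}]}</script>
</body></html>"""

payloads = extract_json_payloads(sample_html)
print(payloads[0]["products"][0]["name"])
```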

Web applications usually protect API endpoints using different authentication methods, so it may be difficult to make use of an API for scraping JavaScript rendered web pages.

If there's encoded hidden data present in the static content, you may not be able to decode it. In this case, you need a tool that can render JavaScript for scraping.

You can use browser-based automation tools like Selenium, Playwright, and Puppeteer.

In this guide, we'll be making use of Selenium in Python. Selenium also has bindings for other languages, including JavaScript (Node.js).

How to Build a Web Scraper with Selenium

Selenium is a browser automation tool primarily used for web testing. Its ability to work like an actual browser makes it one of the best options for web scraping purposes. And since it supports JavaScript, scraping JavaScript rendered web pages with Selenium shouldn't be a problem.

We won't dive deep or use complex methods here, but you can check our complete Selenium guide to learn more!

In this article, we'll scrape Sprouts' breads from Instacart.

At first, Instacart renders a template page on the server, then it gets populated by JavaScript on the client's side.

Here's what the loading screen template looks like:

Loading template

And after populating the HTML content, we get something like this:

Instacart

Now that we have the basics…

Let's get down to scraping JavaScript rendered web pages with Selenium on Python!

Installing the Requirements

Selenium is used to control a web driver instance, so we'll need a web driver for our browser.

We are going to use WebDriver Manager for this task, which will automatically download the required WebDriver. The data will be stored in a CSV format by using the Pandas module.

First of all, let's install the packages by using pip:

pip install webdriver-manager selenium pandas

Alright!

Now we can start scraping some JavaScript generated content from the website.

Start by importing the necessary modules:

import time 
 
import pandas as pd 
from selenium import webdriver 
from selenium.webdriver import Chrome 
from selenium.webdriver.chrome.service import Service 
from selenium.webdriver.common.by import By 
from webdriver_manager.chrome import ChromeDriverManager

Now, let's initialize the headless chrome web driver:

# start by defining the options 
options = webdriver.ChromeOptions() 
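# it's more scalable to work in headless mode 
# (the 'options.headless' attribute is deprecated in Selenium 4.8+, so pass the flag instead) 
options.add_argument("--headless=new") 
# normally, selenium waits for all resources to download 
# we don't need that, as the page is populated afterwards by the running javascript code 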
options.page_load_strategy = 'none' 
# this returns the path web driver downloaded 
chrome_path = ChromeDriverManager().install() 
chrome_service = Service(chrome_path) 
# pass the defined options and service objects to initialize the web driver 
driver = Chrome(options=options, service=chrome_service) 
driver.implicitly_wait(5)

After the initialization is done, let's connect to the website:

url = "https://www.instacart.com/store/sprouts/collections/bread?guest=True" 
 
driver.get(url) 
time.sleep(10)

You'll notice we added a 10-second delay after connecting to the website. This is done to let the web driver load the website completely.

Before extracting data from individual listings, we need to find out where the products are stored.

The products are stored as li elements inside a ul, which is itself inside a div element:

Instacart content in DevTools

We can filter out the div elements by filtering their classes by substrings.

Check if the element's class attribute has the ItemsGridWithPostAtcRecommendations text.

It's possible to use CSS selectors for this, like so:

content = driver.find_element(By.CSS_SELECTOR, "div[class*='ItemsGridWithPostAtcRecommendations']")

We can use *= to check if a specific substring is in the attribute.

As there aren't any li elements outside of the ul parent, let's extract the li elements from content:

breads = content.find_elements(By.TAG_NAME, "li")

Moving on, we'll scrape the JavaScript generated data from every single li element individually:

Instacart bread in DevTools

Let's start by extracting the product image.

There's only one img element in the li. We can also see the image URLs in the srcset attribute:

Instacart bread image

We need to process the extracted data.

After a bit of digging, you can see the image is stored in a CloudFront CDN. So we can extract the URL from there.

The srcset value holds several entries separated by a comma followed by a space. Split it on ', ' (take note of the space after the comma) and process the first entry.

Then split that URL on / and join the parts starting from the CloudFront hostname:

def parse_img_url(url): 
	# get the first url 
	url = url.split(', ')[0] 
	# split it by '/' 
	splitted_url = url.split('/') 
	# loop over the elements to find where 'cloudfront' url begins 
	for idx, part in enumerate(splitted_url): 
		if 'cloudfront' in part: 
			# add the HTTP scheme and concatenate the rest of the URL 
			# then return the processed url 
			return 'https://' + '/'.join(splitted_url[idx:]) 
 
	# as we don't know if that's the only measurement to take, 
	# return None if the cloudfront couldn't be found 
	return None

Now we can extract the URL by using the parse_img_url function:

img = element.find_element(By.TAG_NAME, "img").get_attribute("srcset") 
img = parse_img_url(img)
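One thing to keep in mind is that `get_attribute` returns None when the attribute is missing, and srcset entries often end with a descriptor such as `1x` or `197w`. Below is a hedged sketch of a slightly more defensive variant. The function name `first_cloudfront_url` and the sample srcset value are made up for illustration, so the real attribute on the page may look different:

```python
def first_cloudfront_url(srcset):
    """Return the CloudFront URL of the first srcset entry, or None."""
    if not srcset:
        # the attribute is missing or empty
        return None
    # entries are separated by ', ' (the URLs themselves may contain commas)
    first_entry = srcset.split(", ")[0].strip()
    if not first_entry:
        return None
    # drop a trailing descriptor such as '1x' or '197w', if there is one
    url = first_entry.split()[0]
    parts = url.split("/")
    for idx, part in enumerate(parts):
        if "cloudfront" in part:
            return "https://" + "/".join(parts[idx:])
    return None


# Hypothetical srcset value, only loosely modelled on an image proxy URL.
sample_srcset = (
    "https://www.example.com/image-server/197x197/filters:fill(FFFFFF,true)"
    "/abc123.cloudfront.net/products/bread.jpg 1x, "
    "https://www.example.com/image-server/394x394/filters:fill(FFFFFF,true)"
    "/abc123.cloudfront.net/products/bread.jpg 2x"
)

print(first_cloudfront_url(sample_srcset))
```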

There are also dietary attributes of the products. But as you can see from the green rectangle, not all of the products have them:

Instacart bread dietary attributes

We can also make use of the CSS selectors to get the div element first, then we could extract the spans inside of it.

As we'll use the find_elements method in Selenium, it'll return an empty list if there aren't any span elements:

# A>B means the B elements where A is the parent element. 
dietary_attrs = element.find_elements(By.CSS_SELECTOR, "div[class*='DietaryAttributes']>span") 
# if there aren't any, 'dietary_attrs' will be an empty list and the 'if' block is skipped 
# but if there are any dietary attributes, extract the text from them 
if dietary_attrs: 
	dietary_attrs = [attr.text for attr in dietary_attrs] 
else: 
	# set the variable to None if there aren't any dietary attributes found. 
	dietary_attrs = None

Let's move to the prices…

They're stored in a div element with the ItemBCardDefault substring in the class attribute.

But it's not the only one, so we'll directly get the span element inside of it by using CSS selectors:

Instacart bread price

It's always a good idea to check if the element is loaded while scraping the prices on the web page.

A simple approach would be the find_elements method.

When nothing matches, it returns an empty list instead of raising an exception, which can be helpful while building an API for data extraction:

# get the span elements where the parent is a 'div' element that 
# has 'ItemBCardDefault' substring in the 'class' attribute 
price = element.find_elements(By.CSS_SELECTOR, "div[class*='ItemBCardDefault']>span") 
# extract the price text if we could find the price span 
if price: 
	price = price[0].text 
else: 
	price = None
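Since the same "take the text if it's there, otherwise None" pattern shows up for both the dietary attributes and the price, it could be pulled into small helpers. The sketch below is one way to do it. The helper names and the `FakeElement` stand-in are assumptions added for illustration so that the code runs without a browser, and the locator strategy is written as the plain string that `By.CSS_SELECTOR` holds:

```python
CSS_SELECTOR = "css selector"  # same string value as selenium's By.CSS_SELECTOR


def first_text_or_none(element, css_selector):
    """Return the text of the first match, or None when nothing matches."""
    matches = element.find_elements(CSS_SELECTOR, css_selector)
    return matches[0].text if matches else None


def all_texts_or_none(element, css_selector):
    """Return the texts of all matches, or None when nothing matches."""
    matches = element.find_elements(CSS_SELECTOR, css_selector)
    return [match.text for match in matches] or None


class FakeElement:
    """Stand-in for a Selenium WebElement, used only to try the helpers out."""

    def __init__(self, text="", children=None):
        self.text = text
        self._children = children or {}

    def find_elements(self, by, value):
        # like Selenium, return an empty list when nothing matches
        return self._children.get(value, [])


card = FakeElement(
    children={"div[class*='ItemBCardDefault']>span": [FakeElement("$4.99")]}
)
price = first_text_or_none(card, "div[class*='ItemBCardDefault']>span")
attrs = all_texts_or_none(card, "div[class*='DietaryAttributes']>span")
print(price, attrs)
```

With a real WebElement, the same helpers would be called on each li element in place of the `card` object.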

To wrap things up, let's extract the name and the size of the product.

The name is stored in the only h2 element. And we can extract the size by using a CSS selector since it's in a div which has the Size substring:

Instacart bread name and size

With that done, let's add the code as shown:

name = element.find_element(By.TAG_NAME, "h2").text 
size = element.find_element(By.CSS_SELECTOR, "div[class*='Size']").text

Finally, we can wrap all of these within an extract_data function:

def extract_data(element): 
	img = element.find_element(By.TAG_NAME, "img").get_attribute("srcset") 
	img = parse_img_url(img) 
 
	# A>B means the B elements where A is the parent element. 
	dietary_attrs = element.find_elements(By.CSS_SELECTOR, "div[class*='DietaryAttributes']>span") 
	# if there aren't any, 'dietary_attrs' will be an empty list and the 'if' block is skipped 
	# but if there are any dietary attributes, extract the text from them 
	if dietary_attrs: 
		dietary_attrs = [attr.text for attr in dietary_attrs] 
	else: 
		# set the variable to None if there aren't any dietary attributes found. 
		dietary_attrs = None 
 
	# get the span elements where the parent is a 'div' element that 
	# has 'ItemBCardDefault' substring in the 'class' attribute 
	price = element.find_elements(By.CSS_SELECTOR, "div[class*='ItemBCardDefault']>span") 
	# extract the price text if we could find the price span 
	if price: 
		price = price[0].text 
	else: 
		price = None 
 
	name = element.find_element(By.TAG_NAME, "h2").text 
	size = element.find_element(By.CSS_SELECTOR, "div[class*='Size']").text 
 
	return { 
		"price": price, 
		"name": name, 
		"size": size, 
		"attrs": dietary_attrs, 
		"img": img 
	}

Let's use the function to process all li elements found in the main content div.

It's possible to store the results in a list and convert them to a DataFrame by using Pandas!

data = [] 
 
for bread in breads: 
	extracted_data = extract_data(bread) 
	data.append(extracted_data) 
 
df = pd.DataFrame(data) 
df.to_csv("result.csv", index=False)
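Note that the "attrs" field holds a Python list, so in the CSV it ends up as the list's text representation. If you'd rather store it as a plain delimited string, one option is to flatten the lists before writing. The sketch below shows the idea with the standard library's `csv` module instead of Pandas. The `rows_to_csv` helper, the `"; "` separator and the sample rows are assumptions made up for illustration:

```python
import csv
import io


def rows_to_csv(rows, fieldnames, list_separator="; "):
    """Serialize dict rows to CSV text, joining list values into one string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        flat = {
            key: list_separator.join(value) if isinstance(value, list) else value
            for key, value in row.items()
        }
        # None values are written as empty cells
        writer.writerow(flat)
    return buffer.getvalue()


# Hypothetical rows shaped like the dictionaries returned by extract_data.
sample_rows = [
    {"price": "$4.99", "name": "Sourdough", "size": "24 oz",
     "attrs": ["Organic", "Vegan"], "img": None},
    {"price": None, "name": "Rye", "size": "16 oz", "attrs": None, "img": None},
]

csv_text = rows_to_csv(sample_rows, ["price", "name", "size", "attrs", "img"])
print(csv_text)
```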

And there you have it!

A Selenium scraper that is capable of scraping data from JavaScript rendered websites!

Now, if you followed this tutorial step by step, here's what your final result should look like:

Instacart scraping result

That's the data scraped from a JavaScript-rendered web page using Python.

The full version of the code used in this guide is available in this GitHub gist.

The Disadvantages of Using Selenium

Since we're running web driver instances, it's difficult to scale up the application.

More instances will need more resources, which can overload the production environment.

Also, using a web driver is more time-consuming compared to request-based solutions. Therefore, it's generally advised to use browser-automation tools such as Selenium as a last resort.

Conclusion

The purpose of this guide is to show you how to scrape JavaScript generated content from dynamically loaded pages.

We covered how JavaScript rendered websites work. We used Selenium to build a tool to extract data from dynamically loaded elements.

Quick recap:
  1. Install Selenium and WebDriver Manager.
  2. Connect to the target URL.
  3. Scrape the relevant data by using CSS selectors or another method that Selenium supports.
  4. Save and export the data as a CSV file for later use.

Of course, you can always write your own code and build your own web scraper. But there are many precautions that websites take to block bots.

On a bigger scale, scraping dozens of products is difficult and time-consuming. The best option is to make use of ZenRows, which will let you scrape data with simple API calls. It also handles the anti-bot measures automatically.
