How to Scrape JavaScript Rendered Web Pages with Python
Ever tried scraping JavaScript rendered web pages with Python and hit a wall?
Well, that's understandable.
Scraping JavaScript rendered web pages can be difficult because the data on the web page loads dynamically.
There are also loads of web applications out there using frameworks like React.js, Angular, and Vue.js, so there's a high chance your request-based scraper will break while scraping JS rendered pages.
If you are looking to scrape JavaScript-generated content from these web pages, then the regular libraries and methods aren't enough.
In this tutorial, we'll be discussing how to scrape JavaScript rendered web pages with Python.
Let's dive right in!
Why Is Scraping JavaScript Rendered Web Pages Difficult?
When you send a request to a webpage, the client downloads the website content. JavaScript rendered websites behave differently: if the client supports JS, it runs the JavaScript code to populate the rendered HTML content.
That being said…
JavaScript rendered web pages don't really produce valuable static HTML content, so plain HTTP requests won't be enough: the requested content must be populated by JavaScript first.
This means that you have to write code specifically for each website that you want to scrape, which makes scraping JavaScript generated content difficult.
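To see what that means in practice, here's a minimal sketch; the URL is just a placeholder for a client-rendered app, not a real endpoint:
import requests
# hypothetical client-rendered page; swap in a real URL to try it
response = requests.get("https://example.com/spa-products")
# on a client-rendered app this usually prints a short HTML shell such as
# <div id="root"></div> with no product data, because the content is
# injected by JavaScript only after the page runs in a browser
print(response.text)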
Before going further, let's take a quick look at the main rendering approaches:
- Server rendering: the full HTML for a page is generated on the server in response to navigation. This avoids extra data fetching on the client, since it's handled before the browser gets a response. Just like with a static website, we can extract the information by sending simple HTTP requests, as the full content is returned by the server.
- Static rendering: this happens at build time and offers a fast experience for the users. The main downside is that an individual HTML file must be generated for every possible URL. As we know, it's pretty easy to scrape data from static websites.
- Client-side rendering: pages are rendered directly in the browser using JavaScript. The logic, data fetching, and routing are handled on the client rather than the server.
Nowadays, many modern web applications combine client-side and server rendering to smooth over the disadvantages of each. This is often referred to as Universal Rendering, and it's supported by popular frameworks such as React and Angular. For example, React parses the server-rendered HTML and attaches the logic needed to update the page dynamically on the client; this is called hydration.
How to Scrape JavaScript Generated Content
- Using backend queries.
- Using hidden data in the HTML script tag.
Using Backend Queries to Scrape JavaScript Rendered Web Pages
Sometimes frameworks such as React populate the webpage by using backend queries. It's possible to make use of these API calls in your application to get the data directly from the server.
This isn't guaranteed to work, so you'll need to inspect the requests made by your browser (for example, in the Network tab of the developer tools) to find out if there's an available backend API. If there is one, you can replicate the same requests with your own queries to grab the data directly from the server.
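As a rough illustration, replicating such a call could look like the sketch below; the endpoint, parameters, and headers are made up, so adapt them to whatever you actually see in the Network tab:
import requests
# hypothetical endpoint spotted in the browser's Network tab
api_url = "https://www.example.com/api/v3/products"
params = {"category": "bread", "page": 1}
headers = {"User-Agent": "Mozilla/5.0", "Accept": "application/json"}
response = requests.get(api_url, params=params, headers=headers)
# if the backend really returns JSON, this gives us the data without rendering anything
data = response.json()
print(data)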
Using Script Tags
It's possible to scrape JS rendered pages using hidden data stored in a script tag in the form of JSON. This method might require some digging, though, since you'll be checking the HTML tags in the loaded web page.
The JavaScript code (and the data it embeds) for a dynamic web page can be found in the script tags and extracted using the BeautifulSoup Python package.
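Here's a minimal sketch of that idea; the URL, the script tag id, and the JSON structure are assumptions, so adapt them to whatever the target page actually embeds:
import json
import requests
from bs4 import BeautifulSoup
# placeholder URL for a page that embeds its initial state as JSON
html = requests.get("https://www.example.com/products").text
soup = BeautifulSoup(html, "html.parser")
# many apps ship their data in a tag like:
# <script id="__NEXT_DATA__" type="application/json">{...}</script>
script_tag = soup.find("script", {"id": "__NEXT_DATA__"})
if script_tag:
    data = json.loads(script_tag.string)
    print(list(data.keys()))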
Web applications usually protect API endpoints using different authentication methods, so it may be difficult to make use of an API for scraping JavaScript rendered web pages.
If there's encoded hidden data present in the static content, you may not be able to decode it. In this case, you need a tool that can render JavaScript for scraping.
You can use browser-based automation tools like Selenium, Playwright, and Puppeteer.
In this guide, we'll be making use of Selenium in Python, though it also has bindings for JavaScript (Node.js).
How to Build a Web Scraper with Selenium
Selenium is a browser automation tool primarily used for web testing. Its ability to work like an actual browser makes it one of the best options for web scraping purposes. And since it supports JavaScript, scraping JavaScript rendered web pages with Selenium shouldn't be a problem.
We won't dive too deep or use complex methods, but you can check our complete Selenium guide to learn more!
In this article, we'll scrape Sprouts' breads from Instacart.
At first, Instacart renders a template page on the server; then it gets populated by JavaScript on the client side.
Here's what the loading screen template looks like:

And after populating the HTML content, we get something like this:

Now that we have the basics…
Let's get down to scraping JavaScript rendered web pages with Selenium on Python!
Installing the Requirements
Selenium is used to control a web driver instance, so we'll need a browser's web driver.
We are going to use WebDriver Manager for this task, which will automatically download the required WebDriver. The data will be stored in CSV format by using the Pandas module.
First of all, let's install the packages by using pip:
pip install webdriver-manager selenium pandas
Alright!
Now we can start scraping some JavaScript generated content from the website.
Start by importing the necessary modules:
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
Now, let's initialize the headless Chrome web driver:
# start by defining the options
options = webdriver.ChromeOptions()
options.add_argument("--headless=new") # it's more scalable to work in headless mode
# normally, selenium waits for all resources to download
# we don't need that, as the page also gets populated by the running javascript code
options.page_load_strategy = 'none'
# this installs the matching chromedriver and returns the path it was downloaded to
chrome_path = ChromeDriverManager().install()
chrome_service = Service(chrome_path)
# pass the defined options and service objects to initialize the web driver
driver = Chrome(options=options, service=chrome_service)
driver.implicitly_wait(5)
After the initialization is done, let's connect to the website:
url = "https://www.instacart.com/store/sprouts/collections/bread?guest=True"
driver.get(url)
time.sleep(10)
You'll notice we added a 10-second delay after connecting to the website. This is done to let the web driver load the website completely.
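A fixed sleep works, but as an alternative you could wait for a specific element to appear. Here's a hedged sketch using Selenium's explicit waits with the same product-grid selector we rely on later in this guide:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# wait up to 20 seconds for the product grid to show up instead of sleeping blindly
wait = WebDriverWait(driver, 20)
wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, "div[class*='ItemsGridWithPostAtcRecommendations']")
))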
Before extracting data from individual listings, we need to find out where the products are stored.
The products are stored as li elements inside of a ul, which is in turn inside a div element:

We can filter out the div elements by matching their classes against substrings. In this case, we check whether the element's class attribute contains the ItemsGridWithPostAtcRecommendations text. It's possible to use CSS selectors for this, like so:
content = driver.find_element(By.CSS_SELECTOR, "div[class*='ItemsGridWithPostAtcRecommendations']")
We can use *= to check if a specific substring is in the attribute.
As there aren't any li elements outside of the ul parent, let's extract the li elements from content:
breads = content.find_elements(By.TAG_NAME, "li")
Moving on, we'll scrape the JavaScript generated data from every single li element individually:

Let's start by extracting the product image.
There's only one img element in the li. We can also see the image URLs in the srcset attribute:

We need to process the extracted data. After a bit of digging, you can see the image is stored in Cloudfront's CDN, so we can extract the URL from there.
We split the whole attribute value by ', ' (note the space after the comma) and take the first element, then split that URL by '/' and concatenate the parts starting from the Cloudfront domain:
def parse_img_url(url):
    # get the first url
    url = url.split(', ')[0]
    # split it by '/'
    splitted_url = url.split('/')
    # loop over the parts to find where the 'cloudfront' url begins
    for idx, part in enumerate(splitted_url):
        if 'cloudfront' in part:
            # add the HTTPS scheme and concatenate the rest of the URL,
            # then return the processed url
            return 'https://' + '/'.join(splitted_url[idx:])
    # as we don't know if that's the only format used,
    # return None if cloudfront couldn't be found
    return None
Now we can extract the URL by using the parse_img_url function:
img = element.find_element(By.TAG_NAME, "img").get_attribute("srcset")
img = parse_img_url(img)
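For example, a made-up, shortened srcset-style value like the one below would be reduced to the first Cloudfront URL:
# a made-up srcset-style value, shortened for illustration
sample = "https://example123.cloudfront.net/images/bread.jpg, https://example123.cloudfront.net/images/bread-large.jpg"
print(parse_img_url(sample))
# prints: https://example123.cloudfront.net/images/bread.jpg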
There are also dietary attributes of the products. But as you can see from the green rectangle, not all of the products have them:

We can also make use of CSS selectors to get the div element first and then extract the spans inside it. Since we'll use the find_elements method in Selenium, it'll return an empty list if there aren't any span elements:
# A>B means the B elements whose parent is the A element.
dietary_attrs = element.find_elements(By.CSS_SELECTOR, "div[class*='DietaryAttributes']>span")
# if there aren't any, 'dietary_attrs' will be an empty list and the 'if' block won't run
# but if there are any dietary attributes, extract the text from them
if dietary_attrs:
    dietary_attrs = [attr.text for attr in dietary_attrs]
else:
    # set the variable to None if there aren't any dietary attributes found
    dietary_attrs = None
Let's move on to the prices.
They're stored in a div element with the ItemBCardDefault substring in the class attribute. But it's not the only one, so we'll directly get the span element inside of it by using CSS selectors:

It's always a good idea to check whether the element is loaded while scraping the prices on the web page. A simple approach is the find_elements method: it returns an empty list when nothing matches, which is also helpful while building an API for data extraction:
# get the span elements where the parent is a 'div' element that
# has 'ItemBCardDefault' substring in the 'class' attribute
price = element.find_elements(By.CSS_SELECTOR, "div[class*='ItemBCardDefault']>span")
# extract the price text if we could find the price span
if price:
    price = price[0].text
else:
    price = None
To wrap things up, let's extract the name and the size of the product.
The name is stored in the only h2 element, and we can extract the size by using a CSS selector since it's in a div which has the Size substring:

With that done, let's add the code as shown:
name = element.find_element(By.TAG_NAME, "h2").text
size = element.find_element(By.CSS_SELECTOR, "div[class*='Size']").text
Finally, we can wrap all of these within an extract_data function:
def extract_data(element):
    img = element.find_element(By.TAG_NAME, "img").get_attribute("srcset")
    img = parse_img_url(img)
    # A>B means the B elements whose parent is the A element.
    dietary_attrs = element.find_elements(By.CSS_SELECTOR, "div[class*='DietaryAttributes']>span")
    # if there aren't any, 'dietary_attrs' will be an empty list and the 'if' block won't run
    # but if there are any dietary attributes, extract the text from them
    if dietary_attrs:
        dietary_attrs = [attr.text for attr in dietary_attrs]
    else:
        # set the variable to None if there aren't any dietary attributes found
        dietary_attrs = None
    # get the span elements where the parent is a 'div' element that
    # has 'ItemBCardDefault' substring in the 'class' attribute
    price = element.find_elements(By.CSS_SELECTOR, "div[class*='ItemBCardDefault']>span")
    # extract the price text if we could find the price span
    if price:
        price = price[0].text
    else:
        price = None
    name = element.find_element(By.TAG_NAME, "h2").text
    size = element.find_element(By.CSS_SELECTOR, "div[class*='Size']").text
    return {
        "price": price,
        "name": name,
        "size": size,
        "attrs": dietary_attrs,
        "img": img
    }
Let's use the function to process all li elements found in the main content div.
It's possible to store the results in a list and convert them to a DataFrame by using Pandas!
data = []
for bread in breads:
    extracted_data = extract_data(bread)
    data.append(extracted_data)

df = pd.DataFrame(data)
df.to_csv("result.csv", index=False)
And there you have it!
A Selenium scraper that is capable of scraping data from JavaScript rendered websites!
Now, if you followed this tutorial step by step, here's what your final result should look like:

Scraped data from a JavaScript-rendered web page using Python.
In this GitHub gist is the full version of the code used in this guide.
The Disadvantage of Using Selenium
Since we're running web driver instances, it's difficult to scale up the application.
More instances will need more resources, which will generally overload the production environment.
Also, using a web driver is more time-consuming compared to request-based solutions. Therefore, it's generally advised to use browser-automation tools such as Selenium as a last resort.
Conclusion
The purpose of this guide is to show you how to scrape JavaScript generated content from dynamically loaded pages.
We covered how JavaScript rendered websites work and used Selenium to build a tool that extracts data from dynamically loaded elements. Here's a quick recap of the steps:
- Install Selenium and WebDriver Manager.
- Connect to the target URL.
- Scrape the relevant data by using CSS selectors or another method that Selenium supports.
- Save and export the data as a CSV file for later use.
Of course, you can always write your own code and build your own web scraper. But there are many precautions that websites take to block bots.
On a bigger scale, scraping dozens of products is difficult and time-consuming. The best option is to make use of ZenRows, which will let you scrape data with simple API calls. It also handles the anti-bot measures automatically.
Did you find the content helpful? Spread the word and share it on Twitter, LinkedIn, or Facebook.