
How to Scrape JavaScript-Rendered Web Pages with Python

November 3, 2022 · 8 min read

Ever hit a wall while scraping JavaScript-rendered web pages with Python?

Well, that's understandable.

It can certainly prove difficult because of their dynamically loaded data. Not to mention there are loads of web apps built with frameworks like React.js or Angular, so there's a high chance a request-based scraper will break when it tries to extract their content.

By now, you've probably realized that the standard libraries and methods aren't enough to scrape JS-generated content. Don't worry! In this tutorial, you'll get just the right tips to get the job done.

Are you ready to learn how to scrape JavaScript-rendered web pages with Python? Let's go:

Why is Scraping JavaScript-Rendered Web Pages Difficult?

When you send a request to a web page, the client downloads the content, and this is where JS websites differ: if the client supports JavaScript, it runs the code to populate the rendered HTML content.

That being said...

These pages don't really produce valuable static HTML content. As a result, plain HTTP requests won't be enough as the requested content must be populated first.

This means you have to write code specifically for each target website. And that's what makes scraping JavaScript content so difficult.

Of course, there are other options. Let's look at the different ways of rendering the web page:

  • Static rendering: this happens at build-time and offers a fast user experience. The main downside is that individual HTML files must be generated for every possible URL.

As we know, it's pretty easy to scrape data from static websites.

  • Server rendering: it generates the complete HTML for a page on the server in response to navigation. This avoids extra data fetching on the client's side since it's handled before the browser gets a response.

Just like a static website, we can extract information by sending simple HTTP requests as the entire content returns from the server.

  • Client-side rendering: pages are rendered directly in the browser using JavaScript. The logic, data fetching, and routing are handled on the client's side rather than the server.

Nowadays, many modern apps combine the last two approaches in an attempt to smooth over their disadvantages. This is often referred to as Universal Rendering.

It's also supported by popular frameworks such as React.js and Angular. For example, React parses the HTML and updates the rendered page dynamically, a process called hydration.

How to Scrape JavaScript-Generated Content

There are various methods available. Let's explore two of them:

Using Backend Queries

Sometimes frameworks such as React populate the page by using backend queries. It's possible to use these API calls in your application, getting data directly from the server.

However, this isn't guaranteed. This means you'll need to check your browser requests to find out if there's an available API backend in the first place. If there is one, then you can use the same settings with your custom queries to grab the data.
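For example, if the browser's Network tab shows the page fetching its products from a JSON endpoint, you can replicate that request with plain requests. Here's a minimal sketch; the URL, headers, and response shape are hypothetical and should be copied from the actual request your browser makes:

import requests

# hypothetical endpoint spotted in the browser's Network tab
api_url = "https://example.com/api/v1/products?category=bread"
headers = {
	"User-Agent": "Mozilla/5.0",  # mimic the browser's request headers
	"Accept": "application/json",
}

response = requests.get(api_url, headers=headers)
response.raise_for_status()
data = response.json()  # structured data, no HTML parsing needed

# the 'products' key and its fields are assumptions about the response shape
for item in data.get("products", []):
	print(item.get("name"), item.get("price"))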

Using Script Tags

So, what's another method you can use to scrape JavaScript-generated content from web pages?

You can look for hidden data embedded as JSON in a script tag. However, this might require some digging since you'll be checking the HTML tags of the loaded page. The script tags can be extracted using the BeautifulSoup Python package.
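Here's a minimal sketch of that idea, assuming the target embeds its data as JSON in a script tag. The URL and the tag id (__NEXT_DATA__, common in Next.js apps) are assumptions; inspect your target's page source to find the right tag:

import json

import requests
from bs4 import BeautifulSoup

# hypothetical target URL
html = requests.get("https://example.com/store/bread").text
soup = BeautifulSoup(html, "html.parser")

# many frameworks embed their state in a script tag; the id used here is an assumption
script_tag = soup.find("script", id="__NEXT_DATA__")
if script_tag:
	payload = json.loads(script_tag.string)  # parse the embedded JSON
	print(payload.keys())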

Web applications usually protect API endpoints using different authentication methods, so employing APIs for scraping JS-rendered pages may be challenging.

If there's encoded hidden data present in the static content, you may not be able to decode it. In this case, you need software that can render JavaScript for scraping.

You can try browser-based automation tools, e.g., Selenium, Playwright, or Puppeteer. In this guide, we'll test how Selenium in Python works (note that it's also available for JavaScript and Node.js).

How to Build a Web Scraper with Selenium

Selenium is primarily used for web testing. Its ability to work like an actual browser places it among the best options for web scraping as well.

Since it also supports JS, scraping JavaScript-rendered web pages with Selenium shouldn't be an issue.

In this tutorial, we won't be exploring all the complex methods you can use. Check out our thorough Selenium guide to learn all about that and more. Here, we'll focus on the essentials:

Let's try scraping Sprouts' breads from Instacart.

At first, the website renders a template page on the server; then, it gets populated by JavaScript on the client's side.

Here's what the loading screen looks like:

scrape-instacart-zenrows

After populating the HTML content, we get something like this:

instacart-scrape-javascript

Now that we've covered the basics, let's get down to scraping JavaScript-rendered web pages with Selenium on Python!

Installing the Requirements

Selenium is used to control a web driver instance, so you'll need a browser's driver. For that, we'll use WebDriver Manager, which automatically downloads the required driver. The scraped data will be stored in CSV format using the Pandas module.

We must then install the packages by using pip:

pip install webdriver-manager selenium pandas

Alright! Finally, we can get to scraping.

We'll start by importing the necessary modules:

import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

Now, let's initialize the headless Chrome web driver:

# start by defining the options 
options = webdriver.ChromeOptions() 
options.headless = True # it's more scalable to work in headless mode 
# normally, selenium waits for all resources to download 
# we don't need that here, as the page is also populated by running JavaScript code
options.page_load_strategy = 'none' 
# this returns the path of the downloaded web driver
chrome_path = ChromeDriverManager().install() 
chrome_service = Service(chrome_path) 
# pass the defined options and service objects to initialize the web driver 
driver = Chrome(options=options, service=chrome_service) 
driver.implicitly_wait(5)

After that, we'll connect to the website:

url = "https://www.instacart.com/store/sprouts/collections/bread?guest=True" 
 
 
driver.get(url) 
time.sleep(10)

You'll notice that we added a 10-second delay. That's done to let the driver load the website completely.
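As a more robust alternative to a fixed sleep (a sketch, not part of the original flow), you can use Selenium's explicit waits to block until the product grid appears. The class substring below is the same one we target later in this tutorial:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 20 seconds for the product grid to be present in the DOM
wait = WebDriverWait(driver, 20)
wait.until(EC.presence_of_element_located(
	(By.CSS_SELECTOR, "div[class*='ItemsGridWithPostAtcRecommendations']")
))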

Before we move on to extracting data from individual listings, we must find out where the products are stored.

Each product is saved as a li element inside the ul, which, in turn, sits inside a div:

scrape-javascript-rendered-pages

We can single out the right div element by filtering its classes by substring.

In other words, we check whether the class attribute contains the ItemsGridWithPostAtcRecommendations text. A CSS selector works well for this:

content = driver.find_element(By.CSS_SELECTOR, "div[class*='ItemsGridWithPostAtcRecommendations']")

The *= operator matches elements whose attribute contains the given substring. As there aren't any li elements outside of the ul parent, we'll extract them from content:

breads = content.find_elements(By.TAG_NAME, "li")

Next up, we'll scrape the JS-generated data from every single li element individually:

scrape-javascript

Let's start by extracting the product image. You'll notice two things: there's only one img element in the li, and the image URLs are visible in the srcset attribute:

instacart-scrape

We now need to process the extracted data.

After a bit of digging, you can see the image is stored in the CloudFront CDN, so that's where we'll extract the URL from.

Split the srcset value by ", " (take note of the space after the comma) and process the first part. Then break that URL with "/" and join the parts back together, starting from the CloudFront one:

def parse_img_url(url): 
	# get the first url 
	url = url.split(', ')[0] 
	# split it by '/' 
	splitted_url = url.split('/') 
	# loop over the elements to find where 'cloudfront' url begins 
	for idx, part in enumerate(splitted_url): 
		if 'cloudfront' in part: 
			# add the HTTP scheme and concatenate the rest of the URL 
			# then return the processed url 
			return 'https://' + '/'.join(splitted_url[idx:]) 
 
	# as a safeguard, return None
	# if 'cloudfront' couldn't be found in the URL
	return None

Time to extract the URL by using the parse_img_url function:

img = element.find_element(By.TAG_NAME, "img").get_attribute("srcset") 
img = parse_img_url(img)

As you can see, there are dietary attributes only on some of the products.

instacart-attributes-scraping

Time to employ CSS selectors to extract the spans inside the div element, using Selenium's find_elements method. If it returns an empty list, there aren't any span elements:

# A>B means the B elements where A is the parent element. 
dietary_attrs = element.find_elements(By.CSS_SELECTOR, "div[class*='DietaryAttributes']>span") 
# if there aren't any, 'dietary_attrs' will be an empty list and the 'if' block won't run
# but if there are any dietary attributes, extract the text from them 
if dietary_attrs: 
	dietary_attrs = [attr.text for attr in dietary_attrs] 
else: 
	# set the variable to None if there aren't any dietary attributes found. 
	dietary_attrs = None

Moving on to prices...

Prices are stored in a div element with the ItemBCardDefault substring in the class attribute. Since it's not the only one, we'll directly get the span element by using CSS selectors:

build-a-scraper-instacart

It's always a good idea to check whether the element has loaded while scraping the prices on the web page.

A simple approach is the find_elements method again: it returns an empty list when nothing matches, so we can handle a missing price gracefully:

# get the span elements where the parent is a 'div' element that 
# has 'ItemBCardDefault' substring in the 'class' attribute 
price = element.find_elements(By.CSS_SELECTOR, "div[class*='ItemBCardDefault']>span") 
# extract the price text if we could find the price span 
if price: 
	price = price[0].text 
else: 
	price = None

Finally, we'll retrieve the name and size of the product.

The name is stored in the only h2 element. As for the size, we'll once again rely on a CSS selector for the job:

instacart-products-attributes-scraping

Now that that's all taken care of, we have to add the following code:

name = element.find_element(By.TAG_NAME, "h2").text 
size = element.find_element(By.CSS_SELECTOR, "div[class*='Size']").text

Finally, we can wrap all this within an extract_data function:

def extract_data(element): 
	img = element.find_element(By.TAG_NAME, "img").get_attribute("srcset") 
	img = parse_img_url(img) 
 
	# A>B means the B elements where A is the parent element. 
	dietary_attrs = element.find_elements(By.CSS_SELECTOR, "div[class*='DietaryAttributes']>span") 
	# if there aren't any, 'dietary_attrs' will be an empty list and the 'if' block won't run
	# but if there are any dietary attributes, extract the text from them 
	if dietary_attrs: 
		dietary_attrs = [attr.text for attr in dietary_attrs] 
	else: 
		# set the variable to None if there aren't any dietary attributes found. 
		dietary_attrs = None 
 
	# get the span elements where the parent is a 'div' element that 
	# has 'ItemBCardDefault' substring in the 'class' attribute 
	price = element.find_elements(By.CSS_SELECTOR, "div[class*='ItemBCardDefault']>span") 
	# extract the price text if we could find the price span 
	if price: 
		price = price[0].text 
	else: 
		price = None 
 
	name = element.find_element(By.TAG_NAME, "h2").text 
	size = element.find_element(By.CSS_SELECTOR, "div[class*='Size']").text 
 
	return { 
		"price": price, 
		"name": name, 
		"size": size, 
		"attrs": dietary_attrs, 
		"img": img 
	}

Let's use that to process all li elements found in the main content div. It's possible to store the results in a list and convert them to a DataFrame by using Pandas!

data = [] 
 
for bread in breads: 
	extracted_data = extract_data(bread) 
	data.append(extracted_data) 
 
df = pd.DataFrame(data) 
df.to_csv("result.csv", index=False)

And there you have it! A Selenium scraper that extracts data from JavaScript-rendered websites! Easy, right?

Here's what your final result should look like if you followed this step-by-step tutorial meticulously:

scrape-javascript-pages

Data scraped from a JavaScript-rendered web page using Python.

You can find the complete code used in the guide in this GitHub gist.

The Disadvantages of Using Selenium

Since we're running web driver instances, it's difficult to scale up the application: more instances require more resources, which can overload the production environment.

You should also keep in mind that using a web driver is more time-consuming compared to request-based solutions. Therefore, it's generally advised to use browser-automation tools such as Selenium only as a last resort.

Conclusion

Today you learned how to scrape JavaScript-generated content from dynamically loaded web pages.

We covered how JS-rendered websites work. We used Selenium to build a dynamic data extraction tool.

Let's do a quick recap:

  1. Install Selenium and WebDriver Manager.

  2. Connect to the target URL.

  3. Scrape the relevant data using CSS selectors or another method Selenium supports.

  4. Save and export the data as a CSV file for later use.

Of course, you can always build your own web scraper. However, that decision comes with many difficulties. Just think of the numerous anti-bot protections websites employ. Not to mention that scraping dozens of products or websites is extremely difficult and time-consuming.

So, why don't you let professionals take this load off your hands? ZenRows lets you scrape data with simple API calls. It also handles the anti-bot measures automatically. Try it out for free today!

Did you find the content helpful? Spread the word and share it on Twitter or LinkedIn.
