Dynamic Web Pages Scraping with Python: Guide to Scrape All Content

November 4, 2022 · 6 min read

Have you ever tried scraping dynamic content from a web page, only to find the results unimpressive? That's because regular scrapers struggle with dynamic content: the JavaScript that generates it runs in the browser, so a plain HTTP request never sees its output.

When it comes to scraping dynamic web pages, it's necessary to render the entire webpage in a browser and extract the required information.

In this tutorial, we'll take you through the steps of dynamic web scraping with Python.

Let's dive right in!

What is a dynamic website?

A dynamic website is a website that doesn't have all its content directly in its static HTML. It uses server-side or client-side scripting to display content, sometimes based on user actions like clicking, scrolling and so on.

To put that in simpler terms, a dynamic website displays different content or layout with every request to the web server. This helps the page load faster as there's no need to reload the same layout each time you want to view "new" content.

One way to identify a dynamic website is by disabling JavaScript from your browser's DevTools command palette. If the site is dynamic, most of its content will disappear.

Let's use Saleor React Storefront as an example. Here's what the front page looks like:

React Storefront

You'll notice the titles, images, and artists' names.

Now, let's disable JavaScript using the steps below:
  1. Inspect the page.
  2. Navigate to the command palette (CTRL or CMD + SHIFT + P).
  3. Search for "JavaScript".
  4. Click on Disable JavaScript.
  5. Hit refresh.

And then you get something like this:

React Storefront with JavaScript disabled

As previously discussed, disabling JavaScript removes all dynamic web content from a dynamic website.

Alternatives for dynamic web scraping with Python

Since Python libraries like BeautifulSoup and Requests don't execute JavaScript, they can't fetch dynamic content on their own. That leaves us with two options for scraping dynamic websites with Python.

We can either feed the content to a regular library or execute the page's internal JavaScript while scraping.

However, not all dynamic websites are the same. Some render content through JavaScript APIs, which we can access by inspecting the Network tab, while others embed the JavaScript-rendered content as JSON somewhere in the DOM.

In both cases, we can parse the JSON string to extract the necessary data.

For websites where neither case applies, we have to use a headless browser to render the page and extract the data we need. To sum up, the alternatives for crawling dynamic web pages with Python are:
  • Manually locating the data and parsing the JSON string (sketched right below).
  • Using a headless browser to execute the page's internal JavaScript, for example Selenium or Pyppeteer (an unofficial Python port of Puppeteer).
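
Here's a minimal sketch of the first option. It assumes a hypothetical page that embeds its data in a <script id="__NEXT_DATA__"> tag, a pattern common in Next.js-based sites; the URL and the tag id are placeholders to adapt to your target.

import json

import requests
from bs4 import BeautifulSoup

# hypothetical target: swap in your URL and the actual <script> tag id
url = 'https://example.com/products'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

# many SPAs ship their initial state as JSON inside a <script> tag
script = soup.find('script', id='__NEXT_DATA__')
if script:
	data = json.loads(script.string)
	# explore the structure to locate the fields you need
	print(data.keys())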

What is the easiest way to scrape a dynamic website in Python?

Headless browsers can be slow and performance-intensive, but they place no restrictions on web scraping other than triggering anti-bot detection. And that's not a problem, since we have an article on how to bypass bot detection.

Manually locating data and parsing JSON string presumes that accessing the JSON version of the dynamically rendered data is possible.

This is not the case for many websites, particularly high-level single-page applications (SPAs). Also, mimicking an API request doesn't scale well: such requests often require cookies and authentication, alongside other restrictions that can lock you out.

The easiest way to scrape dynamic web pages in Python depends on your goals and the resources available to you. If you have access to a website's JSON and your goal is extracting data from a single web page, you may not need a headless browser.

That said, the best and easiest way to scrape a dynamic website in Python is to combine BeautifulSoup and Selenium.


Prerequisites

We'll be using the following tools for this tutorial:
  • Python 3.
  • Selenium.
  • Webdriver Manager for Python.

The last item keeps the browser and driver versions in sync, so there's no need to download the webdriver manually. Install the packages with pip:

pip install selenium webdriver-manager

Option 1: Dynamic Web Scraping with Python using BeautifulSoup

BeautifulSoup is arguably one of the most used Python libraries for extracting data from HTML. It works by parsing an HTML string into a BeautifulSoup Python object.

To extract data using this library, we need the HTML string of the page we want to scrape.

However, dynamic content is not directly present in a website's static HTML, so BeautifulSoup can't access JavaScript-generated data on its own.
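
One workaround, following the "feed the content to a regular library" option from earlier, is to let a headless browser render the page and hand the rendered HTML to BeautifulSoup. Here's a minimal sketch, assuming beautifulsoup4 is installed (pip install beautifulsoup4) and using the Selenium setup from the prerequisites (Selenium itself is covered in depth in option 2):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=ChromeService(
	ChromeDriverManager().install()))
driver.get('https://angular.io/')

# page_source holds the JavaScript-rendered HTML, which BeautifulSoup can parse
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

print(soup.title.text)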

Alternatively, if the website loads content through AJAX calls, you can extract data from those XHR requests directly: open the Network tab, find the request that carries the data, and query it yourself.
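
Here's a hedged sketch of that approach; the endpoint is a placeholder for whatever XHR URL you find in the Network tab. If the response is JSON, you don't even need BeautifulSoup; if it's an HTML fragment, you can parse it as usual.

import requests
from bs4 import BeautifulSoup

# placeholder endpoint: replace with the XHR URL from the Network tab
ajax_url = 'https://example.com/api/products?page=1'
response = requests.get(ajax_url)

if 'application/json' in response.headers.get('Content-Type', ''):
	# JSON responses can be consumed directly
	print(response.json())
else:
	# HTML fragments can be parsed with BeautifulSoup as usual
	soup = BeautifulSoup(response.text, 'html.parser')
	print(soup.get_text(strip=True))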

Option 2: Scraping Dynamic Web Pages in Python using Selenium

To better understand how Selenium helps with scraping dynamic websites, let's first look at how a regular library like Requests interacts with them.

For this tutorial, we'll use Angular as our target website:

Angular.io homepage

Let's scrape it using Requests to see what we can get. First, install the Requests library with pip:

pip install requests

Here's what our code looks like:

import requests

url = 'https://angular.io/'
response = requests.get(url)
html = response.text
print(html)

From the results, you'd see that only the following HTML was extracted:

<noscript> 
	<div class="background-sky hero"></div> 
	<section id="intro" style="text-shadow: 1px 1px #1976d2;"> 
		<div class="hero-logo"></div> 
		<div class="homepage-container"> 
			<div class="hero-headline">The modern web<br>developer's platform</div> 
		</div> 
	</section> 
	<h2 style="color: red; margin-top: 40px; position: relative; text-align: center; text-shadow: 1px 1px #fafafa; border-top: none;"> 
		<b><i>This website requires JavaScript.</i></b> 
	</h2> 
</noscript>

And inspecting the website shows more content than what was extracted.

This is what happened when we disabled JavaScript on the page:

Angular.io with JavaScript disabled

That's exactly what Requests was able to return.

From Requests' perspective, everything is correct: it parsed the data from the website's static HTML, which is exactly what it's supposed to do. But we want what the browser displays, and that's impossible with a plain HTTP request since this is a dynamic web page.

We must render the JavaScript to access the full content and extract the required data.

Now that our first attempt at dynamic web scraping with a regular library has failed, let's make things right with Selenium.

Let's use the following script to quickly crawl the entire content of our target website:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

url = 'https://angular.io/'

driver = webdriver.Chrome(service=ChromeService(
	ChromeDriverManager().install()))
driver.get(url)

print(driver.page_source)

Your result would be the complete HTML of our target page, including the dynamic web content.

Congratulations champion… you've just scraped your first dynamic website!

Selecting Elements in Selenium

There are different ways to access elements in Selenium, which we discussed in the web scraping with Selenium in Python article.
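
As a quick refresher, here's a sketch of the most common locator strategies, using the same Angular homepage as the target (the selector values are illustrative):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=ChromeService(
	ChromeDriverManager().install()))
driver.get('https://angular.io/')

# the same element can usually be reached via several strategies
by_class = driver.find_element(By.CLASS_NAME, 'text-container')
by_css = driver.find_element(By.CSS_SELECTOR, 'div.text-container')
by_xpath = driver.find_element(By.XPATH, "//div[@class='text-container']")
by_tag = by_class.find_element(By.TAG_NAME, 'h2')  # search within an element

driver.quit()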

Moving on, let's select only the H2s on our target website as an example:

Angular.io H2

First, let's inspect our target website to identify where the elements we want to extract are located and how we can get to them.

For the H2s on this website, the class "text-container" is common to their parent elements. We can use that class to locate those elements with the Chrome driver and then grab the H2s inside them.

Angular.io H2 inspect

Here's what our code to get the H2s looks like:

from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.chrome.service import Service as ChromeService 
from webdriver_manager.chrome import ChromeDriverManager 
 
# instantiate options 
options = webdriver.ChromeOptions() 
 
# run browser in headless mode 
options.headless = True 
 
# instantiate driver 
driver = webdriver.Chrome(service=ChromeService( 
	ChromeDriverManager().install()), options=options) 
 
# load website 
url = 'https://angular.io/' 
 
# get the entire website content 
driver.get(url) 
 
# select elements by class name 
elements = driver.find_elements(By.CLASS_NAME, 'text-container') 
for title in elements: 
	# select H2s, within element, by tag name 
	heading = title.find_element(By.TAG_NAME, 'h2').text 
	# print H2s 
	print(heading)

And here's the result:

"DEVELOP ACROSS ALL PLATFORMS" 
"SPEED & PERFORMANCE" 
"INCREDIBLE TOOLING" 
"LOVED BY MILLIONS"

As you can see, scraping dynamic sites with Selenium is pretty nice and neat.

How to scrape Infinite Scroll Web Pages with Selenium

Some dynamic web pages load more content as users scroll toward the bottom of the page; these are known as infinite scroll websites.

To scrape them, we need to instruct our crawler to scroll to the bottom of the page, wait for new content to load, and then extract the dynamic content we need.

Let's use Scraping Club's infinite scroll sample page.

Scraping club homepage

This script will scroll through the first 20 results and extract their titles:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
import time

options = webdriver.ChromeOptions()
options.headless = True
driver = webdriver.Chrome(service=ChromeService(
	ChromeDriverManager().install()), options=options)

# load target website
url = 'https://scrapingclub.com/exercise/list_infinite_scroll/'

# get website content
driver.get(url)

# instantiate items
items = []

# instantiate height of webpage
last_height = driver.execute_script('return document.body.scrollHeight')

# set target count
itemTargetCount = 20

# scroll to bottom of webpage until enough items are collected
while itemTargetCount > len(items):
	driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

	# wait for content to load
	time.sleep(1)

	new_height = driver.execute_script('return document.body.scrollHeight')

	# stop when scrolling no longer adds new content
	if new_height == last_height:
		break

	last_height = new_height

	# select elements by XPath
	elements = driver.find_elements(By.XPATH, "//div[@class='card-body']/h4/a")

	# the XPath matches every title loaded so far, so rebuild the list
	# instead of extending it (extending would collect duplicates)
	items = [element.text for element in elements]

# print titles
print(items)

Remark: it's important to set a target count for infinite scroll pages so your script can stop at some point. Otherwise, your code will keep running if the website's scroll is endless.
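
Also, the fixed time.sleep(1) is a blunt instrument: too short and you miss content, too long and you waste time. As a sketch of a more robust alternative, Selenium's WebDriverWait can poll until new cards actually appear. The snippet below is a drop-in replacement for the scroll-and-sleep step inside the loop above (card-body is the same class used in the XPath):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

# count the cards currently loaded, scroll, then wait (up to 10 seconds)
# until more cards appear; a TimeoutException here can double as an
# end-of-list signal
previous_count = len(driver.find_elements(By.CLASS_NAME, 'card-body'))
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
WebDriverWait(driver, 10).until(
	lambda d: len(d.find_elements(By.CLASS_NAME, 'card-body')) > previous_count
)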

For the previous example, we used yet another selector: By.XPATH. It locates elements based on an XPath expression instead of the classes and IDs seen before. To get one, inspect the page, right-click a <div> containing the elements you want to scrape, and select Copy > Copy XPath.

Your result should look like this:

['Short Dress', 'Patterned Slacks', 'Short Chiffon Dress', 'Off-the-shoulder Dress', ...]

And there you have it, the H4s of the first 20 products!
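
By the way, if you're more comfortable with CSS selectors, a near-equivalent of that XPath looks like this (CSS class selectors also match elements that carry additional classes):

# near-equivalent of the XPath "//div[@class='card-body']/h4/a"
elements = driver.find_elements(By.CSS_SELECTOR, 'div.card-body > h4 > a')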

Remark: using Selenium for dynamic web scraping can get tricky with continuous Selenium updates. Be sure to review the latest changes when scraping dynamic web pages with Python and Selenium.
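
For instance, two changes that have tripped up older tutorials: Selenium 4 deprecated the executable_path argument in favor of a Service object (which this tutorial already uses), and Selenium 4.3 removed the find_element_by_* helpers in favor of find_element with a By locator. A quick sketch of the newer spellings (the driver path is a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService

# old (deprecated): webdriver.Chrome(executable_path='/path/to/chromedriver')
# new: pass a Service object instead
driver = webdriver.Chrome(service=ChromeService('/path/to/chromedriver'))

# old (removed in Selenium 4.3): driver.find_element_by_class_name('text-container')
# new: use find_element with a By locator
element = driver.find_element(By.CLASS_NAME, 'text-container')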

Conclusion

Dynamic web pages are everywhere today, and there's a high chance you'll encounter a few in any data extraction project. You should explore these websites to identify the best approach for extracting the information you need.

Here's a quick recap of the steps we took to scrape dynamic content in Python:
  1. Import web driver.
  2. Define driver path.
  3. Access website.
  4. Select HTML elements.
  5. Print or store data in a file.

That said, Selenium is performance-intensive and can be slow when it comes to scraping large datasets. This is one of the reasons we created ZenRows.

With ZenRows, you can scrape dynamic sites using a simple API call. Try it for free and watch your data crawling task become easy.


