Are you getting incomplete results while scraping dynamic web page content? It's not just you. Crawling dynamic data is a challenging undertaking for standard scrapers because the content is generated by JavaScript after the initial HTTP response, so a plain HTTP request never sees it.
Scraping dynamic websites requires rendering the entire page in a browser and extracting the target information.
Join us in this step-by-step tutorial to learn all you need about dynamic web scraping with Python—the dos and don'ts, the challenges and solutions, and everything in between.
Let's dive right in!
What Is a Dynamic Website?
A dynamic website is one that doesn't have all its content directly in its static HTML. It displays data using server-side or client-side rendering or based on the user's actions (e.g., clicking, scrolling, etc.).
Put simply, these websites generate content on the fly based on user interactions or server requests. This approach improves user experience by providing relevant content without needing to reload the entire page.
How can you identify dynamic websites? One way is to disable JavaScript from the DevTools command palette. If the website is dynamic, the content will disappear.
Let's use the Scraping Course JS Rendering demo page as an example. Here's what it looks like:
Take note of the product names, prices, and images.
Now, let's disable JavaScript using the steps below:
- Inspect the page: Right-click and select "Inspect" to open the DevTools window.
- Navigate to the command palette: CTRL/CMD + SHIFT + P.
- Search for "JavaScript."
- Click on Disable JavaScript.
- Hit refresh.
What's the result? See below:
As you can see, disabling JavaScript removes all dynamic web content.
Alternatives to Dynamic Web Scraping With Python
Since libraries such as Beautiful Soup or Requests don't automatically fetch dynamic content, you're left with two options to complete the task:
- Feed the content to a standard library.
- Execute the page's internal JavaScript while scraping.
However, not all dynamic pages are the same. Some render content through JS APIs that can be accessed by inspecting the "Network" tab. Others store the JS-rendered content as JSON somewhere in the DOM (Document Object Model).
The good news is we can parse the JSON string to extract the necessary data in both cases.
Keep in mind that there are situations in which these solutions are inapplicable. For such websites, you can use headless browsers to render the page and extract the data you need.
The alternatives to crawling dynamic web pages with Python are:
- Manually locating the data and parsing the JSON string.
- Using headless browsers to execute the page's internal JavaScript (e.g., Selenium or Pyppeteer, an unofficial Python port of Puppeteer).
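The first alternative often comes down to finding a JSON payload embedded in a `<script>` tag and parsing it. Here's a minimal sketch using only the standard library; the HTML snippet and its `product-data` script tag are made-up stand-ins for what a real page might embed:

```python
import json
import re

# made-up HTML standing in for a page that stores JS-rendered data as JSON in the DOM
html = """
<html><body>
<script id="product-data" type="application/json">
{"products": [{"name": "Chaz Kangeroo Hoodie", "price": "$52"}]}
</script>
</body></html>
"""

# locate the JSON payload inside the script tag
match = re.search(
    r'<script id="product-data" type="application/json">(.*?)</script>',
    html,
    re.DOTALL,
)

# parse the JSON string and extract the target fields
data = json.loads(match.group(1))
for product in data["products"]:
    print(product["name"], product["price"])
```

On a real site, you'd find the right script tag by searching the page source for a distinctive value (e.g., a product name) and adjusting the pattern accordingly.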
What Is the Easiest Way to Scrape a Dynamic Website in Python?
Headless browsers can be slow and resource-intensive. However, they remove most of the obstacles of dynamic web scraping, apart from getting blocked by websites' protection systems. If you're searching for solutions to that particular problem, check out our guide on bypassing bot detection.
Manually locating data and parsing JSON strings presumes that accessing the JSON version of the dynamic data is possible. Unfortunately, that's not always the case, especially when it comes to complex single-page applications (SPAs).
Not to mention that mimicking API requests doesn't scale well: endpoints often require cookies and authentication, alongside other restrictions that can easily get you blocked.
The best way to scrape dynamic web pages in Python depends on your goals and resources. If you have access to the website's JSON and are looking to extract data from a single page, you may not need a headless browser.
However, apart from these few use cases, Beautiful Soup and Selenium are usually the best and easiest options.
It's time to get our hands dirty! Get ready to write some code and see precisely how to scrape a dynamic website in Python!
Prerequisites
To follow this tutorial, you'll need to meet some requirements. We'll use the following tools:
- Python 3: The latest version of Python will work best.
- Selenium and webdriver-manager: Run the command pip install selenium webdriver-manager to install both libraries.
Method #1: Dynamic Web Scraping With Python Using Beautiful Soup
Beautiful Soup is arguably the most popular Python library for parsing HTML data.
To extract information with it, we need our target page's HTML string. However, dynamic content is not directly present in a website's static HTML. This means that Beautiful Soup can't access JavaScript-generated data.
The workaround is to intercept the data at its source: if the website loads content via AJAX, you can extract it directly from the XHR requests visible in the "Network" tab.
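When the "Network" tab reveals an XHR endpoint that returns JSON, you can request it directly and skip HTML parsing entirely. Below is a sketch with a canned response so it stays self-contained; the commented-out endpoint URL and the payload shape are assumptions, not the demo site's real API:

```python
import json

# in practice, you'd fetch the payload with Requests, e.g.:
# response = requests.get('https://example.com/api/products')  # hypothetical endpoint
# payload = response.json()
# here, a canned string stands in for the XHR response body
raw = '{"products": [{"name": "Hero Hoodie", "price": 54}, {"name": "Oslo Trek Hoodie", "price": 42}]}'

# parse the JSON string and pull out the fields you need
payload = json.loads(raw)
names = [p["name"] for p in payload["products"]]
print(names)
```

Because you're consuming structured JSON instead of HTML, there's nothing left for Beautiful Soup to do; the parsing step is just a dictionary lookup.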
Method #2: Scraping Dynamic Web Pages in Python Using Selenium
To understand how Selenium helps you scrape dynamic websites, we need to inspect how regular libraries, such as Requests, interact with them.
We'll use the same JS Rendering demo page as our target website:
Let's try scraping it with Requests and see the result. Before that, we have to install the Requests library using pip:
pip install requests
Here's what our code looks like:
import requests
url = 'https://www.scrapingcourse.com/javascript-rendering'
response = requests.get(url)
html = response.text
print(html)
As you can see, only the following HTML was extracted:
<!DOCTYPE html>
<html lang="en">
<!-- omitted for brevity -->
<body>
<!-- omitted for brevity -->
<div class="product-item flex flex-col items-center rounded-lg" data-testid="product-item"
data-content="product-item" itemscope itemtype="http://schema.org/Product">
<a href="" class="product-link" data-testid="product-link" data-content="product-link">
<img class="rounded-lg border shadow-md object-cover" width="200" height="240" decoding="async"
fetchpriority="high" src="" alt="" class="mb-2 rounded-md w-full object-cover"
data-testid="product-image" data-content="product-image" itemprop="image">
<div class="product-info self-start text-left w-full" data-testid="product-info"
data-content="product-info">
<span data-testid="product-name" data-content="product-name" itemprop="name"></span>
<br>
<span class="product-price text-slate-600" data-testid="product-price" data-content="product-price"
itemprop="priceCurrency" content="USD" itemprop="price"></span>
</div>
</a>
<!-- omitted for brevity -->
</body>
</html>
However, inspecting the website shows more content than what was retrieved. The product details, including product names, prices, and image URLs, are missing.
This is what happened when we disabled JavaScript on the page:
This is what Requests was able to return. The library reports no errors: it parses data from the website's static HTML, which is exactly what it was created to do.
To access the entire content and extract our target data, we must render the JavaScript.
It's time to make it right with Selenium dynamic web scraping.
We'll use the following script to quickly scrape our target website:
# pip install selenium webdriver-manager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
# URL of the web page to scrape
url = 'https://www.scrapingcourse.com/javascript-rendering'
# set up Chrome WebDriver using ChromeDriverManager
driver = webdriver.Chrome(service=ChromeService(
ChromeDriverManager().install()))
# open the specified URL in the browser
driver.get(url)
# print the HTML content of the page
print(driver.page_source)
# close the browser
driver.quit()
You'll get the following output on running this code:
<html lang="en">
<!-- omitted for brevity -->
<body>
<!-- omitted for brevity -->
<div class="product-item flex flex-col items-center rounded-lg" data-testid="product-item"
data-content="product-item" itemscope="" itemtype="http://schema.org/Product">
<a href="https://scrapingcourse.com/ecommerce/product/chaz-kangeroo-hoodie" class="product-link"
data-testid="product-link" data-content="product-link">
<img class="product-image rounded-lg border shadow-md object-cover" width="200" height="240"
decoding="async" fetchpriority="high"
src="https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh01-gray_main.jpg"
alt="Chaz Kangeroo Hoodie" data-testid="product-image" data-content="product-image" itemprop="image" />
<div class="product-info self-start text-left w-full" data-testid="product-info"
data-content="product-info">
<span class="product-name" data-testid="product-name" data-content="product-name" itemprop="name">Chaz
Kangeroo Hoodie</span>
<br />
<span class="product-price text-slate-600" data-testid="product-price" data-content="product-price"
itemprop="priceCurrency" content="USD">$52</span>
</div>
</a>
<!-- omitted for brevity -->
</body>
</html>
There you have it! The page's complete HTML, including the dynamic web content.
Congratulations! You've just scraped your first dynamic website.
Selecting Elements in Selenium
There are different ways to access elements in Selenium. We discuss this matter in depth in our web scraping with Selenium in Python guide.
Still, let's see a simple example and select only the product names on our target website.
Before we get to that, we need to inspect the website and identify the location of the elements we want to extract.
We can see that the class product-name is common to all the product names. We'll use this information to extract them using Chrome WebDriver.
The following code extracts all the product names from the target webpage:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
# instantiate options for Chrome
options = webdriver.ChromeOptions()
# run browser in headless mode
options.add_argument('--headless=new')
# instantiate Chrome WebDriver with options
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)
# URL of the web page to scrape
url = 'https://www.scrapingcourse.com/javascript-rendering'
# open the specified URL in the browser
driver.get(url)
# find elements by class name 'product-name'
product_names = driver.find_elements(By.CLASS_NAME, 'product-name')
# iterate over found elements and print their text content
for product_name in product_names:
print(product_name.text)
# close the browser
driver.quit()
You'll get the following output on running this code:
Chaz Kangeroo Hoodie
Teton Pullover Hoodie
Bruno Compete Hoodie
Frankie Sweatshirt
Hollister Backyard Sweatshirt
Stark Fundamental Hoodie
Hero Hoodie
Oslo Trek Hoodie
Abominable Hoodie
Mach Street Sweatshirt
Grayson Crewneck Sweatshirt
Ajax Full-Zip Sweatshirt
Nice and easy! You can now scrape dynamic sites with Selenium effortlessly.
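Note that once Selenium has rendered the page, you can also hand driver.page_source to any HTML parser instead of selecting elements through the driver. Here's a minimal stdlib sketch of that hand-off, using html.parser on a made-up snippet that stands in for the rendered page:

```python
from html.parser import HTMLParser

# made-up snippet standing in for driver.page_source after JS rendering
html = """
<div>
  <span class="product-name">Chaz Kangeroo Hoodie</span>
  <span class="product-price">$52</span>
  <span class="product-name">Teton Pullover Hoodie</span>
</div>
"""

class ProductNameParser(HTMLParser):
    """Collects the text of every element with the 'product-name' class."""
    def __init__(self):
        super().__init__()
        self.in_name = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "product-name" in classes:
            self.in_name = True

    def handle_endtag(self, tag):
        self.in_name = False

    def handle_data(self, data):
        if self.in_name and data.strip():
            self.names.append(data.strip())

parser = ProductNameParser()
parser.feed(html)
print(parser.names)
```

In a real script, you'd replace the hardcoded snippet with driver.page_source (or use Beautiful Soup instead of the low-level HTMLParser) and quit the driver as soon as the HTML is captured, freeing the browser early.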
How to Scrape Infinite Scroll Web Pages With Selenium
Some dynamic pages load new content as users scroll down to the bottom of the page. These are known as "infinite scroll websites." Scraping them is a bit more challenging, since we need to instruct our scraper to scroll to the bottom and wait for all new content to load before it begins scraping.
Understand this with an example. Let's use an Infinite Scrolling demo page.
This script scrolls to the bottom of the page three times and prints the product names it finds.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
import time
# set up Chrome options for headless mode
options = webdriver.ChromeOptions()
options.add_argument('--headless=new')
# instantiate Chrome WebDriver with options
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)
# load target website
url = 'https://www.scrapingcourse.com/infinite-scrolling'
driver.get(url)
# scroll to the bottom of the page three times
scroll_times = 3
for _ in range(scroll_times):
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
time.sleep(1) # wait for content to load
# select elements by XPath
product_names = driver.find_elements(By.XPATH, '//*[@class="product-name"]')
product_name_texts = [product_name.text for product_name in product_names]
# print the product names
print(product_name_texts)
# close the browser
driver.quit()
In the previous example, we used yet another selector: By.XPATH. It locates elements based on an XPath expression instead of classes and IDs, as seen before. To get an element's XPath, inspect the page, right-click on the <div> containing the elements you want to scrape, and select Copy XPath.
Your result should look like this:
['Chaz Kangeroo Hoodie', 'Teton Pullover Hoodie', 'Bruno Compete Hoodie', ...]
And there you have it, the names of the first 24 products!
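If you want to sanity-check an XPath expression without launching a browser, Python's standard-library ElementTree supports a limited XPath subset that covers class-based predicates like the one above. A quick offline sketch, assuming well-formed markup (the snippet below is a made-up stand-in for the rendered page):

```python
import xml.etree.ElementTree as ET

# made-up, well-formed snippet standing in for the rendered page
snippet = """
<div>
  <span class="product-name">Hero Hoodie</span>
  <span class="product-name">Abominable Hoodie</span>
</div>
"""

root = ET.fromstring(snippet)
# same predicate idea as //*[@class="product-name"] in the Selenium script
names = [el.text for el in root.findall(".//*[@class='product-name']")]
print(names)
```

Keep in mind that ElementTree requires valid XML, so this trick works for testing expressions on small snippets, not for parsing arbitrary real-world HTML.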
Remark: Using Selenium for dynamic web scraping can get tricky with continuous Selenium updates. It's important to stay informed about the latest changes.
Conclusion
Dynamic web pages are everywhere. Thus, there's a high enough chance you'll encounter them in your data extraction efforts. Remember that familiarizing yourself with their structure will help you identify the best approach for retrieving your target information.
All the methods we explored in this article come with their own set of disadvantages, the biggest one being their inability to deal with websites' anti-bot systems. To scrape uninterrupted, try out ZenRows, a web scraping API that allows you to scrape dynamic websites with a simple API call.