Are you getting incomplete results while scraping dynamic web page content with Python? It's not just you. Dynamic web scraping with Python can be particularly challenging because standard scrapers can't see JavaScript-generated content in a plain HTTP response.
Scraping dynamic websites requires techniques like intercepting background API requests or automating a real browser.
In this step-by-step tutorial, you'll learn all you need to know about dynamic web scraping with Python, including dealing with cases like infinite scrolling and client-side dynamic rendering.
Let's begin!
What Is a Dynamic Website?
A dynamic website doesn't have all its content directly in static HTML. It displays data on the fly using server-side or client-side rendering or based on the user's actions, such as showing more content as the user clicks a button or scrolls down the page (infinite scrolling).
This approach improves user experience by providing relevant content without reloading the entire page. One way to determine whether a website is dynamic is to disable JavaScript from your browser DevTools' command palette. If the website is dynamic, most or all of its content will disappear.
Let's use the JS Rendering demo page as an example. Here's what it looks like with JavaScript enabled:
And here's the website with JavaScript disabled:
See it for yourself! Disabling JavaScript removes all dynamic web content.
What Are the Options for Scraping Dynamic Websites?
Not all dynamic pages are the same. Some fetch content from application programming interfaces (APIs), which you can observe through the DevTools' Network tab. Others use client-side rendering to load content directly from JSON objects stored within the DOM (Document Object Model).
The best way to scrape dynamic web pages in Python depends on your goals and resources. However, these are the two available options:
Intercept XHR/Fetch Requests
Intercepting XHR/Fetch requests during scraping involves inspecting the browser's Network tab to identify the API endpoint supplying the dynamic content. Once the endpoint is identified, you can directly request data from that API using an HTTP client, such as the Requests library. Once you get the API's HTML response, you can parse it using BeautifulSoup.
The shortcoming of this approach is that it doesn't work well for client-side dynamic rendering, where the data is stored as JSON within the DOM. Not to mention that mimicking API requests is hard to scale due to issues like API changes, rate limiting, authentication requirements, etc.
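As a side note, if the JSON actually ships inside the page's initial HTML (for example, embedded in a script tag), you can sometimes still avoid a browser by pulling it out of the raw markup; the approach truly breaks down when scripts must run before the data exists. Here's a minimal, hypothetical sketch of that happy path. The URL and the script tag's ID below are illustrative assumptions, not taken from the target site:

# hypothetical sketch: extract JSON embedded in the initial HTML
import json
import requests
from bs4 import BeautifulSoup

# illustrative URL; assume the page ships its data in a <script id="product-data"> tag
response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")
script_tag = soup.find("script", id="product-data")
if script_tag and script_tag.string:
    products = json.loads(script_tag.string)
    print(products)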
Use a Headless Browser
Headless browsers like Selenium and Playwright lift most of the restrictions of dynamic web scraping, allowing for greater flexibility and access to content. However, they can be slow and performance-intensive, and there's still the risk of getting blocked by websites' protection systems.
If you're searching for solutions to getting blocked, check out our guide on bypassing bot detection.
For instance, you can automate infinite scrolling with Selenium to extract content as you scroll down a page. Although scraping with browser automation libraries can be inefficient due to memory overhead from the browser instance, it works whether the site loads content from an API or uses client-side rendering.
In the next sections, we'll scrape product names and prices from this infinite scroll challenge page to teach you how to use each method.
The first method involves intercepting network requests using Python's Requests and parsing the content with BeautifulSoup, while the second uses Selenium to automate the scrolling action.
Here's a demo showing how the page renders content via infinite scrolling:
Let's first go through the prerequisites before jumping into the tutorials.
Prerequisites
To follow this tutorial, you'll need to meet some requirements. We'll use the following tools:
- Python 3: The latest version of Python works best.
- Requests: A Python HTTP client library to directly request the source API.
- BeautifulSoup: A Python library to parse HTML and XML content.
- Selenium: An automation library with a headless browsing feature to render JavaScript.
Install the libraries using pip:
pip3 install requests beautifulsoup4 selenium
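To confirm the installation before proceeding, you can print the library versions from a Python shell (a quick sanity check; your version numbers will differ):

# quick sanity check: print the installed library versions
import requests
import bs4
import selenium

print(requests.__version__, bs4.__version__, selenium.__version__)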
You're good to go once everything is installed!
Method #1: Dynamic Web Scraping With Python Using BeautifulSoup
Scraping JavaScript-rendered pages with Requests and BeautifulSoup involves intercepting the network Fetch/XHR requests. We'll show you how to apply this technique to the Infinite Scrolling Challenge page.
Let's start with the target web page inspection.
Step 1: Inspect the Target Elements
Right-click the first element and go to Inspect to view it in the browser console. Each product is inside a div element with the class name product-item. The product name has the class name product-name, while the price has the class name product-price:
Keep these class names handy because you'll use them during the main scraping steps.
Step 2: Intercept Network Fetch/XHR Requests
HTTP clients like Requests don't support JavaScript rendering. When you use Requests and BeautifulSoup to scrape a dynamic site like the infinite scroll challenge page, you only get the products in the initial HTML, not those loaded later as you scroll.
The only way to scrape dynamic data with Requests is to intercept the Fetch/XHR requests via the Network tab.
To do that, open the target website via a browser like Chrome. Right-click anywhere to open the DevTools. Then, go to the Network tab.
Once in the Network tab, reload the page. You'll see several requests in the tab, including image, CSS, JavaScript, and content requests.
Scroll through the page, and new requests will appear in the Network tab, showing that each scroll triggers a request for fresh content.
For some websites, the API call usually appears under a document icon (this can differ on other websites). In this case, it's a request named products?offset=0.
Keep scrolling, and you'll see that the value of offset increases by 10 per scroll. That value reaches 150 by the time you hit the bottom of the page, indicating the API offsets 10 products 15 times.
Observe the Request URL in the Headers tab. As shown in the above image, it has the following pattern:
https://www.scrapingcourse.com/ajax/products?offset=<OFFSET_NUMBER>
This URL has three essential parts (the short sketch after the list makes the pattern concrete):
- https://www.scrapingcourse.com: the API's base URL.
- ajax/products: the API endpoint that the infinite-scroll page calls for its content.
- offset: a query parameter that determines which batch of products the API returns per scroll.
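To confirm the pattern before writing the full scraper, here's a quick sketch that simply builds and prints the first few offset URLs:

# quick sketch: build the first few offset URLs to confirm the pattern
base_url = "https://www.scrapingcourse.com/ajax/products"
for offset in range(0, 30, 10):
    print(f"{base_url}?offset={offset}")

# https://www.scrapingcourse.com/ajax/products?offset=0
# https://www.scrapingcourse.com/ajax/products?offset=10
# https://www.scrapingcourse.com/ajax/products?offset=20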
We'll send requests directly to that API to extract the desired product data. Let's write our scraping logic based on this information.
Step 3: Scrape the First Batch of Data
First, import the required libraries and specify the target API's URL (without the offset parameter):
# pip3 install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
# specify the API URL without the offset
target_url = "https://www.scrapingcourse.com/ajax/products"
Define a scraper function that accepts a url parameter.
This function requests the target URL before parsing the returned HTML with BeautifulSoup. It then extracts all the parent product containers and loops through them to scrape individual product names and prices using their class selectors:
# ...
def scraper(url):
    # request the target website
    response = requests.get(url)
    # verify the response status
    if response.status_code != 200:
        return f"status failed with {response.status_code}"
    else:
        # parse the HTML content
        soup = BeautifulSoup(response.text, "html.parser")
        # empty list to collect data
        scraped_data = []
        # get the product containers
        products = soup.find_all("div", class_="product-item")
        # iterate through the product containers and extract the product content
        for product in products:
            name_tag = product.find(class_="product-name")
            price_tag = product.find(class_="product-price")
            data = {
                "name": name_tag.text if name_tag else "",
                "price": price_tag.text if price_tag else "",
            }
            # append the data to the list
            scraped_data.append(data)
        # return the scraped data
        return scraped_data
If you run the scraper at this point, it will return only the first batch of products. You can try it by combining the current snippets and adding a line to execute the scraper function:
# pip3 install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

# specify the API URL without the offset
target_url = "https://www.scrapingcourse.com/ajax/products"

def scraper(url):
    # request the target website
    response = requests.get(url)
    # verify the response status
    if response.status_code != 200:
        return f"status failed with {response.status_code}"
    else:
        # parse the HTML content
        soup = BeautifulSoup(response.text, "html.parser")
        # empty list to collect data
        scraped_data = []
        # get the product containers
        products = soup.find_all("div", class_="product-item")
        # iterate through the product containers and extract the product content
        for product in products:
            name_tag = product.find(class_="product-name")
            price_tag = product.find(class_="product-price")
            data = {
                "name": name_tag.text if name_tag else "",
                "price": price_tag.text if price_tag else "",
            }
            # append the data to the list
            scraped_data.append(data)
        # return the scraped data
        return scraped_data

# execute the scraper function and print the scraped data
print(scraper(target_url))
The code outputs only the first few products:
[
{"name": "Chaz Kangeroo Hoodie", "price": "$52"},
{"name": "Teton Pullover Hoodie", "price": "$70"},
# ... other products omitted for brevity
{"name": "Grayson Crewneck Sweatshirt", "price": "$64"},
{"name": "Ajax Full-Zip Sweatshirt", "price": "$69"},
]
That's only some of the data on that page. Many more products are inaccessible because they only load via infinite scrolling.
So, how can you get all the data? You need to simulate the API's offset parameter as it appears in the Network tab. We'll do that in the next step.
Step 4: Simulate the Network API Call
The next step is to simulate the offset logic. Specify an offset count (offset_count) and set its initial value to zero. Then, define an empty list to collect the data returned by the scraper function:
# ...
# set the initial offset count
offset_count = 0

# list to collect scraped data
product_data = []
Remember that the API offsets 10 items per request across 15 scroll heights.
So, create a 15-count iteration using a for loop and increment offset_count by 10 per iteration. Append the increasing offset_count to the API URL via the offset parameter.
Execute the scraper function with the new URL and extend the product_data list with the extracted data:
# ...
# scrape infinite scroll by intercepting the API request
# observed in the Network tab (10 products per request x 15 requests)
for page in range(0, 15):
    # simulate the full URL format
    requested_page_url = f"{target_url}?offset={offset_count}"
    # execute the scraper function
    collected_data = scraper(requested_page_url)
    # extend the list with the scraped data
    product_data.extend(collected_data)
    # increment the offset count
    offset_count += 10

print(product_data)
Combine all the snippets. Here's the final code:
# pip3 install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

# specify the target URL
target_url = "https://www.scrapingcourse.com/ajax/products"

def scraper(url):
    # request the target website
    response = requests.get(url)
    # verify the response status
    if response.status_code != 200:
        return f"status failed with {response.status_code}"
    else:
        # parse the HTML content
        soup = BeautifulSoup(response.text, "html.parser")
        # empty list to collect data
        scraped_data = []
        # get the product containers
        products = soup.find_all("div", class_="product-item")
        # iterate through the product containers and extract the product content
        for product in products:
            name_tag = product.find(class_="product-name")
            price_tag = product.find(class_="product-price")
            data = {
                "name": name_tag.text if name_tag else "",
                "price": price_tag.text if price_tag else "",
            }
            # append the data to the list
            scraped_data.append(data)
        # return the scraped data
        return scraped_data

# set the initial offset count
offset_count = 0

# list to collect scraped data
product_data = []

# scrape infinite scroll by intercepting the API request
# observed in the Network tab (10 products per request x 15 requests)
for page in range(0, 15):
    # simulate the full URL format
    requested_page_url = f"{target_url}?offset={offset_count}"
    # execute the scraper function
    collected_data = scraper(requested_page_url)
    # extend the list with the scraped data
    product_data.extend(collected_data)
    # increment the offset count
    offset_count += 10

print(product_data)
The above code intercepts the dynamic page network requests and outputs all its product names and prices:
[
{"name": "Chaz Kangeroo Hoodie", "price": "$52"},
{"name": "Teton Pullover Hoodie", "price": "$70"},
# ... other products omitted for brevity
{"name": "Antonia Racer Tank", "price": "$34"},
{"name": "Breathe-Easy Tank", "price": "$34"},
]
That's it! You just used Requests and BeautifulSoup to scrape a dynamic page that renders content with infinite scrolling.
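If you'd like to keep the results, here's a minimal sketch that writes product_data to a CSV file using Python's built-in csv module (the output filename is an arbitrary choice):

# minimal sketch: save the scraped products to a CSV file
import csv

with open("products.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(product_data)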
For more information, check out our article on handling pagination and infinite scrolling with Requests.
Let's see how to achieve the same result with Selenium.
Method #2: Scraping Dynamic Web Pages in Python Using Selenium
Since Selenium lets you automate the browser, it provides full-scale support for JavaScript rendering.
For instance, while the Requests library won't return any content when used to scrape a dynamic page like this JS Rendering challenge, Selenium will. That page uses client-side rendering and only loads content after some time, making a browser automation tool like Selenium more suitable for scraping.
Here's what the page looks like before loading content:
To demonstrate Selenium's browser capability, let's compare how it handles that website with how an HTTP client like the Requests library does.
Let's first access it with Requests:
# pip3 install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

# specify the target URL
target_url = "https://www.scrapingcourse.com/javascript-rendering"

def scraper(url):
    # request the target website
    response = requests.get(url)
    # verify the response status
    if response.status_code != 200:
        return f"status failed with {response.status_code}"
    else:
        # parse the HTML content
        soup = BeautifulSoup(response.text, "html.parser")
        print(soup.prettify())

# execute the function
scraper(target_url)
The HTML returned by the Requests library doesn't contain any product content, showing it can't handle client-side JavaScript rendering:
<!DOCTYPE html>
<html lang="en">
<head>
<!-- ... -->
<title>
JS Rendering Challenge to Learn Web Scraping - ScrapingCourse.com
</title>
<!-- ... -->
</head>
<body data-gr-ext-installed="" data-...">
<!-- ... -->
<div>
<p class="challenge-description flex" ...">
Enable JavaScript to see products
</p>
</div>
<div class="product-grid ...">
<div class="product-item ...">
<a class="product-link" ...>
<img alt="" class="mb-2 rounded-md ... src="" width="200"/>
<!-- ... -->
<span data-content="product-name" data-testid="product-name" itemprop="name">
</span>
<br/>
<span class="product-price text-slate-600" content="USD" ...>
</span>
</a>
</div>
<!-- ... omitted for brevity -->
</div>
<!-- ... -->
</body>
</html>
However, Selenium renders that page successfully and extracts its content. Try it with the following Selenium scraper:
# pip3 install selenium
from selenium import webdriver
# instantiate options for Chrome
options = webdriver.ChromeOptions()
# run the browser in headless mode
options.add_argument("--headless=new")
# instantiate Chrome WebDriver with options
driver = webdriver.Chrome(options=options)
# URL of the web page to scrape
url = "https://www.scrapingcourse.com/javascript-rendering"
# open the specified URL in the browser
driver.get(url)
# print the page source
print(driver.page_source)
# close the browser
driver.quit()
The Selenium scraper outputs the page HTML with its content, as shown:
<!DOCTYPE html>
<html lang="en">
<head>
<!-- ... -->
<title>
JS Rendering Challenge to Learn Web Scraping - ScrapingCourse.com
</title>
<!-- ... -->
</head>
<body data-gr-ext-installed="" data-...">
<!-- ... -->
<div>
<p class="challenge-description flex" ...">
Enable JavaScript to see products
</p>
</div>
<div class="product-grid ...">
<div class="product-item ...">
<a href="https://scrapingcourse.com/ecommerce/product/chaz-kangeroo-hoodie" class="product-link" ...>
<img alt="" class="product-image ... src="https://scrapingcourse.com/ecommerce/...main.jpg" width="200"/>
<!-- ... -->
<span data-content="product-name" data-testid="product-name" itemprop="name">
Chaz Kangeroo Hoodie
</span>
<br/>
<span class="product-price text-slate-600" content="USD" ...>
$52
</span>
</a>
</div>
<!-- ... omitted for brevity -->
</div>
<!-- ... -->
</body>
</html>
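Because the page renders its content after a short delay, it's safer not to read driver.page_source the instant the page opens. Here's a hedged sketch that waits for the first product's price text to be filled in before grabbing the rendered HTML and handing it to BeautifulSoup. It assumes the empty product skeleton is present before JavaScript fills in the text (which matches the Requests output above), and the 10-second timeout is an arbitrary choice:

# hedged sketch: wait for the dynamic text before reading the page source
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
driver.get("https://www.scrapingcourse.com/javascript-rendering")

# wait up to 10 seconds until the first product price is no longer empty
WebDriverWait(driver, 10).until(
    lambda d: d.find_element(By.CLASS_NAME, "product-price").text.strip() != ""
)

# hand the rendered HTML to BeautifulSoup and print the first product's text
soup = BeautifulSoup(driver.page_source, "html.parser")
first_product = soup.find(class_="product-item")
print(first_product.get_text(strip=True) if first_product else "no products found")

driver.quit()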
In addition to built-in JavaScript rendering support, Selenium lets you execute user interactions within a browser environment, such as clicking, scrolling, typing, hovering, and more.
Selecting Elements in Selenium
There are different ways to access elements in Selenium. We've discussed this topic in depth in our web scraping with Selenium in Python guide. You can select elements in Selenium using CSS selectors, tag names, XPath, class, or ID.
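For example, here are three ways to locate the first product name on the infinite scroll page (a brief sketch; the CSS selector and XPath expressions are written from the class names inspected earlier rather than copied from the page source):

# brief sketch: different locator strategies for the same element
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
driver.get("https://www.scrapingcourse.com/infinite-scrolling")

# by class name
print(driver.find_element(By.CLASS_NAME, "product-name").text)
# by CSS selector
print(driver.find_element(By.CSS_SELECTOR, ".product-item .product-name").text)
# by XPath (tag-agnostic, matching on the class attribute)
print(driver.find_element(By.XPATH, "//*[contains(@class, 'product-name')]").text)

driver.quit()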
Let's build a Selenium scraper to collect product data from the previous infinite scroll challenge page. Again, let's quickly inspect the target page elements before extracting its content.
Right-click the first product and select Inspect. We'll still scrape the previous target elements (product names and prices).
The following code loops through the product containers (class name: product-item) using find_elements and extracts the product name (class name: product-name) and price (class name: product-price) from each container using the find_element method:
# pip3 install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

def scraper(driver):
    # find the product containers by class name 'product-item'
    products = driver.find_elements(By.CLASS_NAME, "product-item")
    scraped_data = []
    # iterate over the product containers and extract their content
    for product in products:
        product_name = product.find_element(By.CLASS_NAME, "product-name")
        product_price = product.find_element(By.CLASS_NAME, "product-price")
        data = {
            "name": product_name.text,
            "price": product_price.text,
        }
        # append the data to the list
        scraped_data.append(data)
    # return the scraped data
    return scraped_data

# instantiate options for Chrome
options = webdriver.ChromeOptions()
# run the browser in headless mode
options.add_argument("--headless=new")
# instantiate Chrome WebDriver with options
driver = webdriver.Chrome(options=options)

# open the specified URL in the browser
driver.get("https://www.scrapingcourse.com/infinite-scrolling")

# execute the scraper function and print the scraped data
print(scraper(driver))

# close the browser
driver.quit()
Run the above code, and you'll get the following output:
[
{"name": "Chaz Kangeroo Hoodie", "price": "$52"},
{"name": "Teton Pullover Hoodie", "price": "$70"},
# ... other products omitted for brevity
{"name": "Grayson Crewneck Sweatshirt", "price": "$64"},
{"name": "Ajax Full-Zip Sweatshirt", "price": "$69"},
]
The above output contains fewer products than the infinite scroll page actually has. The remaining products sit below the scroll bar, and you can only access them by scrolling.
How can you automate infinite scrolling with Selenium? Find out in the next section.
How to Scrape Infinite Scroll Web Pages With Selenium
Scraping all products from the previous infinite scroll challenge page with Selenium requires automating the scroll action through JavaScript execution.
Let's extend the previous scraper to see how it works.
To automate infinite scrolling with Selenium, obtain the initial page height using the execute_script method. Start a while loop that scrolls down to the bottom, waits for new content to load, and then compares the new page height with the previous one, updating the previous height each time. Only execute the scraper function once the last and current height values are equal (i.e., once you've reached the bottom of the page):
# ...
# get the initial page height
last_height = driver.execute_script("return document.body.scrollHeight")

# list to collect scraped data
product_data = []

while True:
    # scroll down to the bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # wait for the page to load
    time.sleep(2)
    # get the new height and compare it with the last height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        # extract data once all content has loaded
        product_data.extend(scraper(driver))
        break
    last_height = new_height

# print the scraped data after scrolling
print(product_data)
Combine the above snippet with the previous one, and you'll get the following complete code:
# pip3 install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

def scraper(driver):
    # find the product containers by class name 'product-item'
    products = driver.find_elements(By.CLASS_NAME, "product-item")
    scraped_data = []
    # iterate over the product containers and extract their content
    for product in products:
        product_name = product.find_element(By.CLASS_NAME, "product-name")
        product_price = product.find_element(By.CLASS_NAME, "product-price")
        data = {
            "name": product_name.text,
            "price": product_price.text,
        }
        # append the data to the list
        scraped_data.append(data)
    # return the scraped data
    return scraped_data

# instantiate options for Chrome
options = webdriver.ChromeOptions()
# run the browser in headless mode
options.add_argument("--headless=new")
# instantiate Chrome WebDriver with options
driver = webdriver.Chrome(options=options)

# open the specified URL in the browser
driver.get("https://www.scrapingcourse.com/infinite-scrolling")

# get the initial page height
last_height = driver.execute_script("return document.body.scrollHeight")

# list to collect scraped data
product_data = []

while True:
    # scroll down to the bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # wait for the page to load
    time.sleep(2)
    # get the new height and compare it with the last height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        # extract data once all content has loaded
        product_data.extend(scraper(driver))
        break
    last_height = new_height

# print the scraped data after scrolling
print(product_data)

# close the browser
driver.quit()
The above Selenium scraper scrolls the entire page and extracts all the desired product data:
[
{"name": "Chaz Kangeroo Hoodie", "price": "$52"},
{"name": "Teton Pullover Hoodie", "price": "$70"},
# ... other products omitted for brevity
{"name": "Antonia Racer Tank", "price": "$34"},
{"name": "Breathe-Easy Tank", "price": "$34"},
]
There you go 🎉! You've used Selenium to automate infinite scrolling and scrape a dynamic web page without scrolling limitations.
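One more note: the fixed time.sleep(2) works, but it either wastes time or, on a slow connection, may exit the loop too early. A hedged alternative is to wait until the number of loaded product items actually grows after each scroll. The sketch below is a drop-in idea for the loop body (it assumes the driver and By import from the previous snippet, and the 10-second timeout is an arbitrary choice); adapting the loop's exit condition accordingly is left as an exercise:

# hedged alternative to time.sleep: wait until more products have loaded
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait

previous_count = len(driver.find_elements(By.CLASS_NAME, "product-item"))
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
try:
    # wait up to 10 seconds for the product count to grow
    WebDriverWait(driver, 10).until(
        lambda d: len(d.find_elements(By.CLASS_NAME, "product-item")) > previous_count
    )
except TimeoutException:
    # no new products appeared within the timeout; we've likely reached the bottom
    pass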
Using Selenium for dynamic web scraping can get tricky as the library evolves with frequent updates, so it's important to stay informed about the latest changes.
Conclusion
You've learned to scrape dynamic pages with Requests, BeautifulSoup, and a browser automation library like Selenium. Dynamic web pages are everywhere, and there's a high chance you'll encounter them in your data extraction efforts. Remember that familiarizing yourself with a page's rendering style will help you identify the best approach for retrieving your target information.
However, each method we explored in this article has its disadvantages, the biggest being the inability to deal with websites' anti-bot systems.
We recommend trying ZenRows, an all-in-one web scraping API, to extract data from any dynamic page without getting blocked. With a single API call, ZenRows manages proxy rotation, handles JavaScript rendering, and auto-bypasses the most advanced anti-bot measures.
Try ZenRows for free now without a credit card!