When it comes to scraping dynamic web pages, it's necessary to render the entire webpage in a browser and extract the required information.
In this tutorial, we'll walk you through dynamic web scraping with Python, step by step.
Let's dive right in!
What is a dynamic website?
A dynamic website is a website that doesn't have all its content directly in its static HTML. It uses server-side or client-side scripting to display content, sometimes based on user actions like clicking, scrolling and so on.
To put that in simpler terms, a dynamic website can serve different content or layouts for the same URL. This also helps pages load faster, since the browser only fetches the data that changed instead of reloading the whole layout each time you want to view "new" content.
Let's use Saleor React Storefront as an example. Here's what the front page looks like:
You'll notice the titles, images and artists' names.
- Inspect the page (right-click → Inspect, or F12).
- Open the command palette (CTRL or CMD + SHIFT + P), type "Disable JavaScript" and select it.
- Refresh the page.
And then you get something like this:
Alternatives for dynamic web scraping with Python
There are two common alternatives:

- Using a headless browser to render the page.
- Manually locating the data (for example, by mimicking the underlying API request) and parsing the JSON string.

In both cases, we can then parse the response to extract the necessary data.
What is the easiest way to scrape a dynamic website in Python?
Headless browsers can be slow and resource-intensive, but they face no web scraping restrictions other than anti-bot detection. And that's not a problem, since we have an article on how to bypass bot detection.
Manually locating data and parsing JSON string presumes that accessing the JSON version of the dynamically rendered data is possible.
This is not the case for many websites, particularly high-level single-page applications (SPAs). Also, mimicking an API request doesn't scale well: such requests often require cookies and authentication, alongside other restrictions that can lock you out.
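To make the "manually locating data" approach concrete, here's a minimal sketch. Many sites embed their initial state as a JSON blob inside a `<script>` tag, which you can cut out and parse. The HTML snippet, script `id` and key names below are illustrative, not taken from any real site:

```python
import json
import re

# A static page snippet standing in for a real response; the script id
# and the JSON structure are hypothetical.
html = '<script id="__DATA__">{"products": [{"title": "Short Dress"}, {"title": "Patterned Slacks"}]}</script>'

# Locate the embedded JSON blob and parse it.
match = re.search(r'<script id="__DATA__">(.*?)</script>', html, re.S)
data = json.loads(match.group(1))

titles = [product['title'] for product in data['products']]
print(titles)  # ['Short Dress', 'Patterned Slacks']
```

In a real project, you'd find the actual endpoint or script tag by watching the Network tab in your browser's developer tools.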
The easiest way to scrape dynamic web pages in Python depends on your goals and the resources available to you. If you have access to a website's JSON and your goal is extracting data from a single web page, you may not need a headless browser.
That said, the best and easiest way to scrape a dynamic website in Python is by using BeautifulSoup and Selenium.
Install both Selenium and webdriver-manager; the latter matches the browser and driver versions automatically, so you don't need to download the WebDriver manually:

pip install selenium webdriver-manager
Option 1: Dynamic Web Scraping with Python using BeautifulSoup
BeautifulSoup is arguably one of the most used Python libraries for crawling data from HTML. It works by parsing an HTML string into a BeautifulSoup Python object.
To extract data using this library, we need the HTML string of the page we want to scrape.
However, if the website loads its content through AJAX, you can fetch those XHR requests directly and pass the returned HTML to BeautifulSoup.
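For instance, once you have the body of an XHR response in hand, BeautifulSoup parses it like any other HTML. The fragment below is a static stand-in for such a response, not content from a real site:

```python
from bs4 import BeautifulSoup

# A static HTML fragment standing in for the body of an XHR response.
html_fragment = """
<div class="text-container"><h2>DEVELOP ACROSS ALL PLATFORMS</h2></div>
<div class="text-container"><h2>SPEED &amp; PERFORMANCE</h2></div>
"""

# Parse the string into a BeautifulSoup object and pull out the H2 texts.
soup = BeautifulSoup(html_fragment, 'html.parser')
headings = [h2.get_text(strip=True) for h2 in soup.find_all('h2')]
print(headings)  # ['DEVELOP ACROSS ALL PLATFORMS', 'SPEED & PERFORMANCE']
```

The same pattern works whether the HTML string comes from `requests`, from Selenium's `driver.page_source`, or from a saved file.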
Option 2: Scraping Dynamic Web Pages in Python using Selenium
To better understand how Selenium works for scraping dynamic websites, let's take a look at how regular libraries (Requests) interact with these websites.
For this tutorial, we'll use Angular as our target website:
Let's scrape it using Requests to see what we can get. First, install the library:

pip install requests
Here's what our code looks like:
import requests

url = 'https://angular.io/'
response = requests.get(url)
html = response.text
print(html)
From the results, you'd see that only the following HTML was extracted:
And inspecting the website in the browser shows far more content than what Requests was able to return.
From Requests' perspective, everything worked correctly: it returned the website's static HTML, which is exactly what it's designed to do. But we want the same result as what's displayed in the browser, and that's impossible with Requests alone because the page renders its content with JavaScript.
Now that we've failed our first attempt at dynamic web scraping with Python using regular libraries, let's make things right with Selenium.
Let's use the following script to quickly crawl the entire content of our target website:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

url = 'https://angular.io/'
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
driver.get(url)
print(driver.page_source)
Your result would be the complete HTML of our target page, including the dynamic web content.
Congratulations champion… you've just scraped your first dynamic website!
Selecting Elements in Selenium
There are several ways to access elements in Selenium, as discussed in our web scraping with Selenium in Python article.
Moving on, let's select only the H2s on our target website as an example:
First of all… Let's inspect our target website to identify the location of the elements we want to extract and how we can get them.
For the H2s on this website, the class="text-container" attribute is common to all of them. We can select elements by that class and then grab the H2 inside each one using ChromeDriver.
Here's what our code to get the H2s looks like:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

# instantiate options
options = webdriver.ChromeOptions()
# run browser in headless mode (the argument syntax used by current Selenium versions)
options.add_argument('--headless=new')
# instantiate driver
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)
# load website
url = 'https://angular.io/'
# get the entire website content
driver.get(url)
# select elements by class name
elements = driver.find_elements(By.CLASS_NAME, 'text-container')
for title in elements:
    # select H2s, within each element, by tag name
    heading = title.find_element(By.TAG_NAME, 'h2').text
    # print H2s
    print(heading)
And here's the result:
"DEVELOP ACROSS ALL PLATFORMS" "SPEED & PERFORMANCE" "INCREDIBLE TOOLING" "LOVED BY MILLIONS"
As you can see, scraping dynamic sites with Selenium is pretty nice and neat.
How to scrape Infinite Scroll Web Pages with Selenium
Some dynamic web pages load more content as users scroll toward the bottom of the page. These are known as infinite scroll websites.
To scrape them, we need to instruct our crawler to scroll to the bottom of the page, wait for new content to load and scrape the dynamic web content we need.
Let's use Scraping Club's infinite scroll sample page as our target.
This script will scroll through the first 20 results and extract their title:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
import time

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)

# load target website
url = 'https://scrapingclub.com/exercise/list_infinite_scroll/'
# get website content
driver.get(url)

# instantiate list of items
items = []
# instantiate height of webpage
last_height = driver.execute_script('return document.body.scrollHeight')
# set target count
itemTargetCount = 20

# scroll to bottom of webpage until we have enough items
while itemTargetCount > len(items):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    # wait for content to load
    time.sleep(1)
    # select elements by XPath and collect all titles loaded so far
    elements = driver.find_elements(By.XPATH, "//div[@class='card-body']/h4/a")
    items = [element.text for element in elements]
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        # no new content loaded; stop scrolling
        break
    last_height = new_height

# print titles
print(items)
Remark: it's important to set a target count for infinite scroll pages so you can end your script at some point. Otherwise, your code will keep running if the website's scroll is endless.
In the previous example, we used yet another selector: By.XPATH. It locates elements based on an XPath expression instead of the classes and IDs we've seen before. To get an element's XPath, inspect the page, right-click the <div> containing the elements you want to scrape, and select Copy → Copy XPath.
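To see what that XPath actually matches, here's a standalone sketch that evaluates the same expression with Python's built-in xml.etree.ElementTree on a static stand-in for the sample page's product cards:

```python
import xml.etree.ElementTree as ET

# Static markup standing in for the sample page's product cards.
markup = """
<root>
  <div class="card-body"><h4><a>Short Dress</a></h4></div>
  <div class="card-body"><h4><a>Patterned Slacks</a></h4></div>
</root>
"""

root = ET.fromstring(markup)
# Same path as the Selenium selector: card-body div -> h4 -> a.
titles = [a.text for a in root.findall(".//div[@class='card-body']/h4/a")]
print(titles)  # ['Short Dress', 'Patterned Slacks']
```

ElementTree only supports a limited XPath subset, but it's enough here to show how the predicate `[@class='card-body']` filters divs before the path descends into the `h4` and `a` children.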
Your result should look like this:
['Short Dress', 'Patterned Slacks', 'Short Chiffon Dress', 'Off-the-shoulder Dress', ...]
And there you have it: the titles of the first 20 products!
Remark: using Selenium for dynamic web scraping can get tricky with continuous Selenium updates. Be sure to review the latest changes when scraping dynamic web pages with Python and Selenium.
Dynamic web pages are everywhere today, and there's a high chance you'll encounter a few in any data extraction project. You should explore these websites to identify the best approach for extracting the needed information. To recap, scraping them with Selenium boils down to these steps:
- Import web driver.
- Define driver path.
- Access website.
- Select HTML elements.
- Print or store data in a file.
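The last step, storing data in a file, can be as simple as writing the scraped headings to a CSV with Python's built-in csv module. The headings are hard-coded below so the snippet runs standalone; in practice they'd come from the Selenium code above:

```python
import csv

# Headings as scraped earlier, hard-coded so the snippet runs on its own.
headings = ['DEVELOP ACROSS ALL PLATFORMS', 'SPEED & PERFORMANCE',
            'INCREDIBLE TOOLING', 'LOVED BY MILLIONS']

# Write one heading per row, with a header row on top.
with open('headings.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['heading'])
    writer.writerows([h] for h in headings)
```

The `newline=''` argument is the csv module's recommended way to avoid blank lines between rows on Windows.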
That said, Selenium is performance-intensive and can be slow when it comes to scraping large datasets. This is one of the reasons we created ZenRows.
With ZenRows, you can scrape dynamic sites using a simple API call. Try it for free and watch your data crawling task become easy.