Ever hit a wall while scraping JavaScript-rendered web pages with Python?
It can certainly prove difficult because the data is loaded dynamically. On top of that, plenty of web apps are built with frameworks like React.js or Angular, so a plain request-based scraper has a good chance of coming back with little or none of the content you need.
We'll explore different methods for scraping JavaScript-rendered content, show you how to build a web scraper using Selenium, and provide solutions to common problems like avoiding blocks and scaling up your scraping efforts.
Why Is Scraping JavaScript-Rendered Web Pages Difficult?
Scraping JavaScript-rendered web pages is challenging because the content isn't readily available in the initial HTML response. Instead, it's dynamically generated or modified by JavaScript after the page loads. As a result, plain HTTP requests won't be enough, since the content you're after only exists once the page's JavaScript has run.
Moreover, modern web applications often use complex frameworks and asynchronous loading, making it hard to determine when the page is fully rendered. To effectively scrape such pages, you need tools that can execute JavaScript, render the page as a browser would, and interact with dynamic elements. This requires more sophisticated approaches than simple HTTP requests.
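To see the problem concretely, here's a quick sketch (using the ScrapingCourse demo page we'll scrape later in this tutorial) showing that a plain HTTP request comes back without the JavaScript-rendered product data:
# quick check with plain requests: no JavaScript is executed
import requests

html = requests.get("https://www.scrapingcourse.com/javascript-rendering").text

# the product names are injected by JavaScript after load,
# so this will most likely print False
print("Chaz Kangeroo Hoodie" in html)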
In the following sections, we'll explore various methods and tools designed to overcome these challenges and allow you to extract data successfully from JavaScript-heavy websites.
How to Scrape JavaScript-Rendered Content
Let's explore three practical approaches to scraping JavaScript-rendered content: backend queries, script tags, and browser automation tools. We'll compare these approaches to help you choose the best solution for your project.
Use Backend Queries
Sometimes, frameworks such as React populate the page with data fetched from backend API endpoints. You can call those same endpoints from your scraper to get the data straight from the server.
However, this isn't a foolproof method. You'll need to inspect your browser's network requests to find out whether an API backend is available. If there is one, you can replicate the same request (URL, headers, and parameters) with your own queries to grab the data.
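As a rough illustration, calling such a backend endpoint directly might look like the sketch below. The URL, headers, and parameters are hypothetical placeholders; in practice, you'd copy them from the request you spotted in your browser's Network tab:
# hypothetical sketch: query a backend API discovered in the browser's Network tab
import requests

api_url = "https://example.com/api/products"  # placeholder endpoint
headers = {
    "User-Agent": "Mozilla/5.0",  # mirror what the browser sends
    "Accept": "application/json",
}
params = {"page": 1, "limit": 20}  # placeholder query parameters

response = requests.get(api_url, headers=headers, params=params)
response.raise_for_status()

# the server returns structured JSON, so there's no HTML to parse
print(response.json())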
Use Script Tags
This method involves searching for hidden data within script tags on the loaded page. Some web applications store data as JSON objects within these tags, which can be extracted using tools like BeautifulSoup in Python. By parsing the HTML and locating the relevant script tags, you can potentially access pre-rendered data without executing JavaScript.
However, this approach has limitations. Many websites don't store their data this way, and when they do, the data might be encoded or obfuscated. Also, as websites evolve, the location and format of this data can change, making your scraper prone to breaking. For these reasons, while occasionally useful, relying on script tags is often not a robust solution for scraping JavaScript-rendered content.
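As an example of the idea, many Next.js sites embed their page data in a script tag with the id __NEXT_DATA__. Here's a minimal sketch (the URL is a placeholder, and the tag id will vary by site) of pulling that JSON out with BeautifulSoup:
# sketch: extract JSON embedded in a script tag (URL and tag id are illustrative)
import json

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/products").text
soup = BeautifulSoup(html, "html.parser")

# Next.js apps commonly store pre-rendered page data under this id
script = soup.find("script", id="__NEXT_DATA__")
if script and script.string:
    data = json.loads(script.string)
    print(data)
else:
    print("No embedded JSON found; this site may not expose data this way.")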
Use a Tool That Executes JavaScript
Browser-based automation tools are often the most effective solution for scraping JavaScript-rendered content. These tools can load web pages, execute JavaScript, and interact with dynamic content just like a real browser.
Let's explore some popular options that can handle even the most complex JavaScript-heavy websites:
Selenium
Selenium is a powerful open-source tool primarily used for web browser automation and testing. Its ability to interact with web pages as a real user makes it an excellent choice for scraping JavaScript-rendered content. Due to its widespread adoption and extensive community support, we'll use Selenium in this tutorial to demonstrate scraping JavaScript-rendered pages.
For a comprehensive guide on web scraping with Selenium, check out our detailed article.
Playwright
Playwright is a modern automation framework developed by Microsoft that supports multiple browser engines, including Chromium, Firefox, and WebKit. It offers powerful features like automatic wait and built-in async/await functionality, making it particularly effective for scraping complex, JavaScript-heavy websites. Playwright's ability to handle multiple browser contexts and its comprehensive API makes it a robust choice for advanced scraping projects.
Explore our guide on web scraping with Playwright to learn more about its capabilities.
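To give you a feel for it, here's a minimal Playwright sketch (using its synchronous API) that loads the demo page used later in this tutorial and prints the fully rendered HTML. Treat it as a starting point rather than a production scraper:
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.scrapingcourse.com/javascript-rendering")
    # wait until the JavaScript-rendered product items appear
    page.wait_for_selector("#product-grid .product-item")
    print(page.content())
    browser.close()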
Scrapy
Scrapy is a high-performance, open-source web scraping framework for Python. While it doesn't handle JavaScript rendering out of the box, it can be combined with tools like Splash or Selenium to scrape JavaScript-rendered content. Scrapy's extensibility, built-in features for handling common scraping tasks, and ability to handle large-scale scraping projects make it a popular choice among developers.
Read our guide on web scraping with Scrapy to learn how to leverage it for JavaScript-heavy websites.
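For a sense of Scrapy's style, here's a minimal spider sketch. On its own it only fetches the raw HTML, so for a JavaScript-rendered page you'd still need to plug in Splash (via scrapy-splash) or a Selenium middleware to populate the content before these selectors match:
# pip install scrapy
# run with: scrapy runspider product_spider.py -O products.json
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://www.scrapingcourse.com/javascript-rendering"]

    def parse(self, response):
        # without a JS-rendering backend, these selectors may return nothing
        for item in response.css("#product-grid .product-item"):
            yield {
                "name": item.css(".product-name::text").get(),
                "price": item.css(".product-price::text").get(),
            }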
Now that we've explored various tools for scraping JavaScript-rendered content, let's dive deeper into one of the most popular solutions: Selenium. In the next section, we'll walk through the process of building a web scraper using Selenium to extract data from a JavaScript-rendered web page.
How to Scrape JavaScript-Rendered Web Pages with Selenium
Let's create a web scraper using Selenium to extract data from JavaScript-rendered pages. We'll scrape the JS Rendering demo page from the ScrapingCourse website.
At first, the website renders a template page on the server; then, JavaScript populates it on the client side.
Here's what the loading screen looks like:
After populating the HTML content, we get something like this:
1. Prerequisites
Before we start building our scraper, let's set up the necessary tools. We'll be using Selenium with Python, so make sure you have Python installed on your system.
To install Selenium, run the following command in your terminal:
pip3 install selenium
Selenium 4.6 and later ship with Selenium Manager, which automatically downloads and manages the matching browser driver, simplifying setup. If you're using an older version, update it:
pip3 install --upgrade selenium
With these prerequisites, you're ready to start building your scraper.
2. Extract HTML
Let's extract the complete HTML of the target webpage.
First, import the necessary tools to control the browser, wait for elements, and handle the extracted data.
# import necessary tools from the selenium library
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Next, initialize the headless Chrome driver, which runs in the background without opening a visible window. It's faster and more resource-efficient for web scraping tasks.
# ...
# set up chrome driver
service = Service()
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(service=service, options=options)
Now, navigate to the page and wait for the product grid to load. The WebDriverWait call ensures that the JavaScript has finished rendering the content before we attempt to extract it. Finally, print the complete HTML of the page, which now includes the dynamically loaded content.
# ...
# navigate to the target webpage
driver.get("https://www.scrapingcourse.com/javascript-rendering")
# wait for the product grid to load
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#product-grid .product-item"))
)
# print the complete HTML after JavaScript execution
print(driver.page_source)
# close the browser
driver.quit()
Here's the complete code:
# import necessary tools from the selenium library
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# set up chrome driver
service = Service()
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(service=service, options=options)
# navigate to the target webpage
driver.get("https://www.scrapingcourse.com/javascript-rendering")
# wait for the product grid to load
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, "#product-grid .product-item")
    )
)
# print the complete HTML after JavaScript execution
print(driver.page_source)
# close the browser
driver.quit()
You'll get the following output on running this code:
<html lang="en">
<head>
    <!-- ... -->
    <title>JS Rendering Challenge to Learn Web Scraping - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <div id="product-grid">
        <div class="product-item" ...>
            <!-- ... -->
            <div class="product-info" ...>
                <span class="product-name" ...>
                    Chaz Kangeroo Hoodie
                </span>
                <span class="product-price" ...>
                    $52
                </span>
            </div>
            <!-- ... -->
        </div>
        <!-- ... -->
    </div>
</body>
</html>
Awesome! You've successfully extracted the complete HTML from the JavaScript-rendered web page.
3. Parse Data
To parse the required elements from the extracted HTML, you first need to define your CSS selector strategy.
Open the target URL in your browser and inspect the page to open DevTools. You'll notice that all the products are enclosed within a product-grid div, and the product details (product-name, product-price, and product-image) sit inside individual product-item divs.
We'll use this information in the next steps.
Using these CSS selectors, locate all the product-item elements, then extract each product's name, price, and image URL and append them to the products list.
Here's what the modified script looks like:
# import necessary tools from the selenium library
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# set up chrome driver
service = Service()
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(service=service, options=options)
# navigate to the target webpage
driver.get("https://www.scrapingcourse.com/javascript-rendering")
# wait for the product grid to load
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, "#product-grid .product-item")
    )
)
# extract product details
products = []
items = driver.find_elements(By.CSS_SELECTOR, "#product-grid .product-item")
for item in items:
    name = item.find_element(By.CSS_SELECTOR, ".product-name").text.strip()
    price = item.find_element(By.CSS_SELECTOR, ".product-price").text.strip()
    image_url = item.find_element(By.CSS_SELECTOR, ".product-image").get_attribute(
        "src"
    )
    products.append({"name": name, "price": price, "imageUrl": image_url})
# print the parsed data
print(products)
# close the browser
driver.quit()
The above code will print a list containing the parsed product data:
[
    {
        "name": "Chaz Kangeroo Hoodie",
        "price": "$52",
        "imageUrl": "https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh01-gray_main.jpg"
    },
    {
        "name": "Teton Pullover Hoodie",
        "price": "$70",
        "imageUrl": "https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh02-black_main.jpg"
    },
    // other content omitted for brevity
]
4. Save to CSV
As the final step, let's save the extracted data to a CSV file. Define the CSV headers, open a new file named "products.csv", and use Python's csv module to write the data. The DictWriter class lets you write the product dictionaries directly to the CSV file.
Here's the complete code combining all the steps:
# import necessary libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv
# set up chrome driver
service = Service()
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(service=service, options=options)
# navigate to the target webpage
driver.get("https://www.scrapingcourse.com/javascript-rendering")
# wait for the product grid to load
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, "#product-grid .product-item")
    )
)
# extract product details
products = []
items = driver.find_elements(By.CSS_SELECTOR, "#product-grid .product-item")
for item in items:
    name = item.find_element(By.CSS_SELECTOR, ".product-name").text.strip()
    price = item.find_element(By.CSS_SELECTOR, ".product-price").text.strip()
    image_url = item.find_element(By.CSS_SELECTOR, ".product-image").get_attribute(
        "src"
    )
    products.append({"name": name, "price": price, "imageUrl": image_url})
# specify the CSV headers
headers = ["name", "price", "imageUrl"]
# write the extracted data to a CSV file
with open("products.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=headers)
    writer.writeheader()
    writer.writerows(products)
print("CSV file written successfully.")
# close the driver
driver.quit()
By running this script, you'll generate a CSV file containing all the product data.
Congratulations! You've successfully built a Selenium-based web scraper capable of extracting data from JavaScript-rendered pages.
While Selenium effectively scrapes JavaScript-rendered content, it has drawbacks. It's difficult to scale, time-consuming, resource-intensive, and prone to blocks from anti-bot measures. These limitations can hinder large-scale scraping projects.
In the next section, we'll introduce an alternative solution that addresses these challenges and offers a more efficient approach.
How to Avoid Getting Blocked While Scraping With Python?
Real-world websites are often protected by anti-bot systems. Trying to scrape a heavily guarded page such as G2 Reviews with the previous method would likely result in a block or a CAPTCHA challenge.
To overcome these limitations, consider using a web scraping API like ZenRows. ZenRows offers a powerful toolkit that includes JavaScript rendering capabilities. It can effectively replace Selenium while providing additional benefits such as auto-rotating premium proxies, CAPTCHA and anti-bot auto-bypass, and more.
Let's walk through using ZenRows to scrape the G2 Reviews page.
First, sign up for ZenRows and open the Request Builder. Paste the G2 Reviews URL into the link box, then activate Premium Proxies and the JS Rendering boost mode. Select Python as your language and choose the API connection mode.
Copy the generated code, which would look similar to this:
# pip install requests
import requests
url = "https://www.g2.com/products/asana/reviews"
apikey = "<YOUR_ZENROWS_API_KEY>"
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)
Run this code, and you'll see the full HTML output of the page, successfully bypassing the site's protections.
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
</head>
<body>
    <!-- other content omitted for brevity -->
</body>
</html>
This demonstrates how ZenRows simplifies the process of scraping JavaScript-rendered content from even heavily protected websites. It offers a more robust and efficient approach compared to traditional methods.
Conclusion
We've explored various methods for scraping JavaScript-rendered web pages, from basic techniques to more advanced solutions. While tools like Selenium offer powerful capabilities for interacting with dynamic content, they come with challenges in scalability, efficiency, and dealing with anti-bot measures.
The Selenium-based approach can be effective for simple scraping tasks or when working with less protected websites. However, as your scraping needs grow or when facing more sophisticated websites, you'll need a more robust solution.
This is where ZenRows comes into play.
As a specialized web scraping API, ZenRows offers a scalable, easy-to-use, and block-resistant solution for all your web scraping needs. It handles the complexities of JavaScript rendering, proxy rotation, and anti-bot bypassing, allowing you to focus on extracting the data you need rather than battling with website protections.