Web Scraping AJAX Pages: Tutorial 2024

September 22, 2023 · 8 min read

Many websites use AJAX to dynamically display content, meaning web pages add or update data without requiring a full page refresh. While this leads to better user experiences, it poses unique web scraping challenges. 

In this tutorial, you'll learn everything you need to know for web scraping AJAX pages.

What is AJAX in Web Scraping?

AJAX stands for Asynchronous JavaScript and XML. It's a set of web development techniques where the website makes asynchronous requests to the server (via API) using the XMLHttpRequest object rather than a standard HTTP request.

The data retrieved from the server can be used to update portions of the web page based on user input and state changes without page refresh. A classic example is a social media feed: when you scroll down (user input), new posts automatically load, courtesy of AJAX. 

The challenge for you is that the dynamic content is not present in the website's static HTML. So, if you try web scraping AJAX pages, you won't get the new content. That makes extracting data from an AJAX website different.

The solution? You must intercept the AJAX request. Let's learn how next!

How to Scrape AJAX Pages

We'll go through the following steps to ensure you retrieve your desired data from any AJAX-driven website:

  1. Get the tools you need.
  2. Create your initial scraper.
  3. Identify AJAX requests.
  4. Make AJAX requests.
  5. Parse the response.
  6. Export Data to CSV.

The process is the same for other programming languages, but we'll use Python for AJAX web scraping pages. If you prefer, feel free to check out our quick guides for scraping dynamic content in Java, Ruby, PHP, and cURL.

As a target URL, we'll use an infinite scroll demo page, which loads more ecommerce products as the user scrolls down.

Frustrated that your web scrapers are blocked once and again?
ZenRows API handles rotating proxies and headless browsers for you.
Try for FREE

Step 1. Get the Tools You Need

To create your AJAX scraper, you need tools that'll enable you to reproduce an AJAX call and parse your response. Here are the modules that should make up your Python stack:

If you need a refresher on these tools, check out our step-by-step tutorial on Python web scraping.

Step 2. Create Your Initial Scraper

To set up your scraper, install the above-mentioned tools (Requests and BeautifulSoup) using this command:

Terminal
pip install requests beautifulSoup4

Then, import the installed libraries, define your target URL, send an HTTP GET request, fetch the HTML content, and parse it using BeautifulSoup. 

program.py
import requests
from bs4 import BeautifulSoup
 
# Define your target URL
url = 'https://scrapingclub.com/exercise/list_infinite_scroll/'
 
# Send an HTTP GET request and fetch the HTML content
response = requests.get(url)
 
# Fetch the HTML content
html_content = response.text
    
# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(html_content, 'html.parser')
    
# Print formatted HTML content
print(soup.prettify())

When you run that script, you'll get the website's static HTML. It doesn't include the images and other JavaScript-rendered content but a bunch of script tags.

Output
<!DOCTYPE html>
<html class="h-full">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="Learn to scrape infinite scrolling pages" name="description"/>
  <title>
   Scraping Infinite Scrolling Pages (Ajax) | ScrapingClub
  </title>
 
#...
 
....#
 
<script src="/static/js/runtime.733406a7.437561bb6191.js" type="text/javascript">
  </script>
  <script src="/static/js/844.c955a86a.8ae7a88d9ece.js" type="text/javascript">
  </script>
  <script src="/static/js/app.9eae399d.fbbfa467cd03.js" type="text/javascript">
  </script>
  <script src="https://unpkg.com/infinite-scroll@4/dist/infinite-scroll.pkgd.js">
  </script>
  <script>
   let elem = document.querySelector('.container .grid');
    let infScroll = new InfiniteScroll(elem, {
      // options
      path: 'a[rel="next"]',
      append: '.post',
      history: false,
    });
  </script>

Clearly, you can't retrieve the data you're after from this static HTML because it is dynamically loaded through AJAX. To access such data, you need to identify the AJAX request responsible for the content, replicate it using Python Requests, and parse the AJAX response with BeautifulSoup.

Step 3. Identify AJAX requests for Python Web Scraping

To identify AJAX requests for web scraping, visit the website using your Chrome browser, right-click anywhere on the web page, and select Inspect to open the Developer Tools. You can also use this shortcut: Ctrl+Shift+I (or Cmd+Option+I on Mac).

dev tools inspection screenshot
Click to open the image in full screen

Next, navigate to the Network tab and click the `Fetch/XHR` button. That'll filter requests to show only XMLHttpRequests, which are AJAX requests.

filter to show XMLHttpRequests only
Click to open the image in full screen

Then, refresh the page and scroll to the bottom of the page. You'll see new requests appear for every new content uploaded to the page. Click on these requests to inspect their content (You'll see: request URL, headers, cookies, etc.).

dev tools inspection screenshot
Click to open the image in full screen

Now, identify the specific request that fetches the data you're interested in. You can preview each request by copying the Request URL and loading it in a new tab. For example, if you're interested in the content on the second window, it's the first request that appears. By inspecting that request, you'll find a GET request made to the following Request URL: https://scrapingclub.com/exercise/list_infinite_scroll/?page=2.

If you keep scrolling, you'll notice that there are 6 pages in total, and each AJAX request URL has the format [base_url] + [page_number], where [base_url] is https://scrapingclub.com/exercise/list_infinite_scroll/.

Step 4. Make AJAX requests

After identifying the AJAX requests that display the data you're after for web scraping, the next step is to reproduce it in your scraper. Since they're all GET requests, you can make the same GET requests from your scraper to retrieve the AJAX-rendered content. Here's how:

Define the base URL, and initialize the page_number variable and total_pages. Then, using a while loop, make GET requests to the six request URLs.

The code below retrieves the HTML content directly from each AJAX page and stops the loop at the last page. 

scraper.py
import requests
from bs4 import BeautifulSoup
 
base_url = 'https://scrapingclub.com/exercise/list_infinite_scroll/'
page_number = 1  # Start with the base URL
total_pages = 6 
 
while True:
    # Construct ajax request URL
    url = f'{base_url}?page={page_number}'
    
    # Make GET request
    response = requests.get(url)
    
    # Retrieve the response content
    html_content = response.text
        
    # Move to the next page
    page_number += 1
 
    if page_number > total_pages:
        break  # Stop the loop

Let's now parse the content we're after!

Step 5. Parse the response

To parse the response and grab the product names and prices, select the elements you want using CSS selectors.

You can find them by inspecting the page elements in the Chrome DevTools. Focus on the Elements tab this time. By selecting products on the page to inspect, you can see that product names are located in div.p-4>h4>a and prices are in div.p-4>h5

parsed elements
Click to open the image in full screen

Create a BeautifulSoup object that takes the html_content variable as an argument and uses those selectors.

scraper.py
# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(html_content, 'html.parser')
 
# Extract product names and prices
products = soup.select('div.p-4 h4 > a')
prices = soup.select('div.p-4 h5')

Next, loop through products and prices to extract the text content of each element using the get_text() method. 

scraper.py
for product, price in zip(products, prices):
    product_name = product.get_text(strip=True)
    product_price = price.get_text(strip=True)

Putting everything together and adding a printing instruction, this is your full AJAX web scraping code: 

scraper.py
import requests
from bs4 import BeautifulSoup
 
base_url = 'https://scrapingclub.com/exercise/list_infinite_scroll/'
page_number = 1  # Start with the base URL
total_pages = 6 
 
while True:
    # Construct ajax request URL
    url = f'{base_url}?page={page_number}'
    
    # Make GET request
    response = requests.get(url)
    
    # Retrieve the response content
    html_content = response.text
 
    # Move to the next page
    page_number += 1
 
    if page_number > total_pages:
        break  # Stop the loop
 
    # Parse the HTML content using Beautiful Soup
    soup = BeautifulSoup(html_content, 'html.parser')
 
    # Extract product names and prices
    products = soup.select('div.p-4 h4 > a')
    prices = soup.select('div.p-4 h5')
 
    for product, price in zip(products, prices):
        product_name = product.get_text(strip=True)
        product_price = price.get_text(strip=True)
        print(f'Product: {product_name} | Price: {product_price}')
 
    print('-' * 50)
 
    # Move to the next page
    page_number += 1
 
    if page_number > total_pages:
        break  # Stop the loop

Run it, and you should have all the data you targeted.   

Output
Product: Short Dress | Price: $24.99
Product: Patterned Slacks | Price: $29.99
Product: Short Chiffon Dress | Price: $49.99
Product: Off-the-shoulder Dress | Price: $59.99
Product: V-neck Top | Price: $24.99
Product: Short Chiffon Dress | Price: $49.99
Product: V-neck Top | Price: $24.99
Product: V-neck Top | Price: $24.99
Product: Short Lace Dress | Price: $59.99
Product: Fitted Dress | Price: $34.99
--------------------------------------------------
Product: V-neck Jumpsuit | Price: $69.99
Product: Chiffon Dress | Price: $54.99
Product: Skinny High Waist Jeans | Price: $39.99
Product: Super Skinny High Jeans | Price: $19.99
Product: Oversized Denim Jacket | Price: $19.99
 
#..... 
    # truncated for brevity's sake
#.....

Remark: This result is partial for brevity's sake.

Bingo! You've scraped your first AJAX website.

Step 6. Export Data to CSV

If you need, you can also export the extracted data from AJAX web scraping to CSV. 

For that, import the csv module, initialize an empty product_data array, and append the product data to it.

scraper.py
import requests
from bs4 import BeautifulSoup
import csv
 
base_url = 'https://scrapingclub.com/exercise/list_infinite_scroll/'
page_number = 1  # Start with the base URL
total_pages = 6 
product_data = []
 
while page_number <= total_pages:
    # Construct ajax request URL
    url = f'{base_url}?page={page_number}'
    
    # Make GET request
    response = requests.get(url)
    
    # Retrieve the response content
    html_content = response.text
        
    # Parse the HTML content using Beautiful Soup
    soup = BeautifulSoup(html_content, 'html.parser')
        
    # Extract product names and prices
    products = soup.select('div.p-4 h4 > a')
    prices = soup.select('div.p-4 h5')
        
    # Collect product information
    for product, price in zip(products, prices):
        product_name = product.get_text(strip=True)
        product_price = price.get_text(strip=True)
        
        #add data to product_data []
        product_data.append({'name': product_name, 'price': product_price})   
        
    # Move to the next page
    page_number += 1

Then, create a function named write_to_csv that writes this data to a CSV file. It'll open the file, define the field names, and use the csv.DictWriter class to write the data to the CSV. 

scraper.py
def write_to_csv(product_data):
    # Open the CSV file for writing
    with open('products.csv', 'w', newline='', encoding='utf-8') as csvfile:
        # Define the field names for the CSV header
        fieldnames = ['Product Name', 'Price']
         # Create a DictWriter object
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

The csv.DictWriter class expects that the first row you write to the CSV file is the header row, containing the field names. So, still, within the function, instruct the csv.DictWriter class to automatically write the headers using the writer.writeheader() method. 

Next,  loop through the product_data list and write a new row for each dictionary list using the writer.writerow() method.

scraper.py
#....
    #...
      
        # Write CSV header
        writer.writeheader()
        
        # Loop through the product data list
        for product in product_data:
            # Write a new row with product name and price
            writer.writerow({'Product Name': product['name'], 'Price': product['price']})

Lastly, call the write_to_csv function to export the data to CSV.

Putting it all together, here's the final code:

scraper.py
import requests
from bs4 import BeautifulSoup
import csv
 
def write_to_csv(product_data):
    # Open the CSV file for writing
    with open('products.csv', 'w', newline='', encoding='utf-8') as csvfile:
        # Define the field names for the CSV header
        fieldnames = ['Product Name', 'Price']
         # Create a DictWriter object
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
      
        # Write CSV header
        writer.writeheader()
        
        # Loop through the product data list
        for product in product_data:
            # Write a new row with product name and price
            writer.writerow({'Product Name': product['name'], 'Price': product['price']})
 
base_url = 'https://scrapingclub.com/exercise/list_infinite_scroll/'
page_number = 1  # Start with the base URL
total_pages = 6 
product_data = []
 
while page_number <= total_pages:
    # Construct ajax request URL
    url = f'{base_url}?page={page_number}'
    
    # Make GET request
    response = requests.get(url)
    
    # Retrieve the response content
    html_content = response.text
        
    # Parse the HTML content using Beautiful Soup
    soup = BeautifulSoup(html_content, 'html.parser')
        
    # Extract product names and prices
    products = soup.select('div.p-4 h4 > a')
    prices = soup.select('div.p-4 h5')
        
    # Collect product information
    for product, price in zip(products, prices):
        product_name = product.get_text(strip=True)
        product_price = price.get_text(strip=True)
        
        #add data to product_data []
        product_data.append({'name': product_name, 'price': product_price})   
        
    # Move to the next page
    page_number += 1
 
# Call the function to write data to CSV
write_to_csv(product_data)

This creates a CSV file named products.csv in your project file. 

csv project file
Click to open the image in full screen

Congrats, you've accomplished AJAX web scraping!

The Challenge of Getting Blocked Easily

Scraping numerous pages can easily lead to getting blocked. Websites often employ anti-scraping measures like rate-limiting and IP banning to regulate traffic.

A recommended quick solution is using a web scraping proxy. Moreover, rotating between proxies allows you to route your requests through multiple IP addresses, making the requests look like they're coming from different users. This is recommended for better results, so you might be interested in our tutorial on implementing proxies with Python Requests

Conclusion

While AJAX enhances user experience by enabling websites to render content dynamically, it also introduces complexities for web scraping. Dynamic contents are not present in a website's static HTML. Therefore, you must identify the AJAX requests responsible for the data you're after and replicate them. 

It's also important to be aware of the increasing risk of getting blocked, as web servers can flag your activity as suspicious and deny you access.

Frequent Questions

What Is AJAX Used for?

AJAX is used for developing dynamic websites where web pages can interact with servers and update content based on user input and state changes. For example, a load more button or an infinite scroll page where more content loads as you scroll to the bottom of a page.  

What Is the Difference between HTTP and AJAX?

The difference between HTTP and AJAX is that HTTP is a protocol for communication between a browser and a web server. In contrast, AJAX is not a protocol but a set of web development techniques that allow websites to dynamically update portions of a web page by making asynchronous requests to a server.

Ready to get started?

Up to 1,000 URLs for free are waiting for you