Scrapy Pagination: How to Scrape Multiple Pages

May 24, 2024 · 6 min read

Scrapy allows you to scrape paginated websites easily. Wondering how to implement that?

This article shows you how to use Python's Scrapy to scrape paginated websites, whatever pagination style they use.

When You Have a Navigation Page Bar

Page navigation bars are the most common and simplest forms of pagination. 

There are two standard methods for scraping websites with navigation bars in Scrapy: you can use the next page link method or change the page number in the URL.

Let's see how each works with examples where Scrapy retrieves the names and prices of products across all pages on ScrapingCourse.com, an e-commerce demo website with a page navigation bar.

But first, let's briefly examine the website you want to scrape.

ScrapingCourse.com Ecommerce homepage

You're about to scrape 12 product pages using Scrapy! Let's get started with each method.
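
If you're following along, the examples below assume Scrapy is installed and that each snippet lives in a standalone scraper.py file. You can install Scrapy with pip:

Terminal
pip install scrapy

You can then run any of the spiders in this article with scrapy runspider scraper.py.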

Use the Next Page Link

The scraping logic behind the next page link is to have Scrapy follow and scrape the next page if it exists. We'll do this using Scrapy's standard response.follow method.

Let's set up our spider request with a callback:

scraper.py
# import the required modules
from scrapy.spiders import Spider
from scrapy import Request
 
class MySpider(Spider):
    # specify the spider name
    name = 'product_scraper'
    start_urls = ['https://www.scrapingcourse.com/ecommerce/']
  
    def start_requests(self):
        # start with the initial page
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)
#...

All the products are inside an unordered list (ul) element. The products variable selects each product list item (li) from that parent ul element.

The following parse function loops through the product containers and retrieves each product's name and price. It then appends the results to the empty data array.

scraper.py
class MySpider(Spider):
 
#...
 
    # parse HTML page as response
    def parse(self, response):
        # extract text content from the ul element
        products = response.css('ul.products li.product')
 
        # declare an empty array to collect data
        data = []
 
        for product in products:
            # get the product name and price from each list item
            product_name = product.css('h2.woocommerce-loop-product__title::text').get()
            price = product.css('bdi::text').get()
 
            # append the scraped data into the empty data array
            data.append({
                'product_name': product_name,
                'price': price,
            })
 
        self.log(data)
#...

Running the spider logs the content of the first page only. This means our code isn't following the next pages yet:

Output
[
    {'product_name': 'Abominable Hoodie', 'price': '69.00'}, 
    {'product_name': 'Adrienne Trek Jacket', 'price': '57.00'}, 
...

]

The next step is to write the logic for retrieving content from all the pages. Let's quickly inspect the pagination bar element to see its structure.

So, the navigation bar is an unordered list (ul) of page numbers. 

Here's what the element looks like in the inspection tab:

ScrapingCourse next page inspection

The code below gets successive pages by extracting the href attribute of the a.next element inside that ul.

If the next href exists, Scrapy visits its page using response.follow and scrapes the target content. Otherwise, it terminates the crawl.

Extend the previous code with the following:

scraper.py
def parse(self, response):

    #...

    # follow the next page link
    next_page = response.css('ul.page-numbers li a.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)

Let's get the code together:

scraper.py
# import the required modules
from scrapy.spiders import Spider
from scrapy import Request
 
class MySpider(Spider):
    # specify the spider name
    name = 'product_scraper'
    start_urls = ['https://www.scrapingcourse.com/ecommerce/']
 
    def start_requests(self):
        # start with the initial page
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)
 
    # parse HTML page as response
    def parse(self, response):
        # extract text content from the ul element
        products = response.css('ul.products li.product')
 
        # declare an empty array to collect data
        data = []
 
        for product in products:
            # get the product name and price from each list item
            product_name = product.css('h2.woocommerce-loop-product__title::text').get()
            price = product.css('bdi::text').get()
 
            # append the scraped data into the empty data array
            data.append({
                'product_name': product_name,
                'price': price,
            })
 
        # follow the next page link
        next_page = response.css('ul.page-numbers li a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
 
        self.log(data)

This crawls all available pages and outputs the product names and prices:

Output
[
    {'product_name': 'Abominable Hoodie', 'price': '69.00'}, 
    {'product_name': 'Adrienne Trek Jacket', 'price': '57.00'},
 
    #... other products omitted for brevity

    {'product_name': 'Zoe Tank', 'price': '29.00'}, 
    {'product_name': 'Zoltan Gym Tee', 'price': '29.00'}
]

Congratulations! You just scraped content from every page on a paginated website.
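
Note that the spider above only logs the data. If you'd rather save the results to a file, a minimal variation is to yield each scraped dictionary from parse and let Scrapy's feed exports write the output. Here's a sketch of that approach:

scraper.py
    # parse HTML page as response
    def parse(self, response):
        for product in response.css('ul.products li.product'):
            # yield each item so Scrapy's feed exports can collect it
            yield {
                'product_name': product.css('h2.woocommerce-loop-product__title::text').get(),
                'price': product.css('bdi::text').get(),
            }

        # keep following the next page link
        next_page = response.css('ul.page-numbers li a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Terminal
scrapy runspider scraper.py -o products.json

The -o flag writes all the yielded items to products.json once the crawl finishes.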

You can also try a more hands-on approach: changing the page number in the URL.

Change the Page Number in the URL

Paginated websites with navigation bars often include the current page number in the URL path.

For example, the third page URL on ScrapingCourse is https://www.scrapingcourse.com/ecommerce/page/3/. The same pattern applies to every page. See a demo below:

ScrapingCourse page number demo

The idea is to increase the page numbers in the URL, have Scrapy visit, and scrape each page. 

Let's see how that works, starting with the request setup below. page_count sets the initial page number to one.

By default, Scrapy filters out 404 responses before they reach your callback. Setting handle_httpstatus_list lets the spider receive them, so it can stop the crawl once it runs past the available page numbers.

scraper.py
# import the required modules
from scrapy.spiders import Spider
from scrapy import Request
 
class MySpider(Spider):
    # specify the spider name
    name = 'product_scraper'
    
    # specify the target URL
    start_urls = ['https://www.scrapingcourse.com/ecommerce/']
 
    # handle HTTP 404 response
    handle_httpstatus_list = [404]
 
    # set the initial page count to 1
    page_count = 1
 
    def start_requests(self):
        # start with the initial page
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

The code below scrapes each page and then requests the next one by incrementing the page number in the URL.

On each request, the code adds one to page_count and appends it to the base URL (https://www.scrapingcourse.com/ecommerce/), which it takes from the first index of start_urls.

Thus, the next page URL takes the format https://www.scrapingcourse.com/ecommerce/page/<PAGE_NUMBER>/, and Scrapy visits each incremented page number in order until the last page.

The 404 check terminates the crawl once Scrapy runs past the available page numbers and receives a 404 response.

scraper.py
class MySpider(Spider):
 
#...
 
    # parse HTML page as response
    def parse(self, response):
 
        # get response status
        status = response.status
 
        # terminate the crawl when you exceed the available page numbers
        if status == 404:
            self.log(f'Ignoring 404 response for URL: {response.url}')
            return
        # extract text content from the ul element
        products = response.css('ul.products li.product')
 
        # declare an empty array to collect data
        data = []
 
        for product in products:
            # get the product name and price from each list item
            product_name = product.css('h2.woocommerce-loop-product__title::text').get()
            price = product.css('bdi::text').get()
 
            # append the scraped data into the empty data array
            data.append({
                'product_name': product_name,
                'price': price,
            })
 
        self.page_count += 1 
 
        next_page = f'{self.start_urls[0]}page/{self.page_count}/'
 
        yield Request(url=next_page, callback=self.parse)
 
        self.log(data)

Let's put it all together:

scraper.py
# import the required modules
from scrapy.spiders import Spider
from scrapy import Request
 
class MySpider(Spider):
    # specify the spider name
    name = 'product_scraper'
    
    # specify the target URL
    start_urls = ['https://www.scrapingcourse.com/ecommerce/']
 
    # handle HTTP 404 response
    handle_httpstatus_list = [404]
 
    # set the initial page count to 1
    page_count = 1
 
    def start_requests(self):
        # start with the initial page
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)
 
    # parse HTML page as response
    def parse(self, response):
 
        # get response status
        status = response.status
 
        # terminate the crawl when you exceed the available page numbers
        if status == 404:
            self.log(f'Ignoring 404 response for URL: {response.url}')
            return
        # extract text content from the ul element
        products = response.css('ul.products li.product')
 
        # declare an empty array to collect data
        data = []
 
        for product in products:
            # get the product name and price from each list item
            product_name = product.css('h2.woocommerce-loop-product__title::text').get()
            price = product.css('bdi::text').get()
 
            # append the scraped data into the empty data array
            data.append({
                'product_name': product_name,
                'price': price,
            })
 
        self.page_count += 1 
 
        next_page = f'{self.start_urls[0]}page/{self.page_count}/'
 
        yield Request(url=next_page, callback=self.parse)
 
        self.log(data)

This outputs the product names and prices for all available pages, as shown:

Output
[
    {'product_name': 'Abominable Hoodie', 'price': '69.00'}, 
    {'product_name': 'Adrienne Trek Jacket', 'price': '57.00'},
 
    #... other products omitted for brevity

    {'product_name': 'Zoe Tank', 'price': '29.00'}, 
    {'product_name': 'Zoltan Gym Tee', 'price': '29.00'}
]

Nice! Your custom Scrapy code for scraping paginated content works.

However, many modern websites use JavaScript to load content dynamically as you scroll. Let's handle that in the next section.

When JavaScript-Based Pagination Is Required

Websites that use JavaScript for pagination may rely on infinite scroll or require clicking a button to load more content. In those cases, you'll need a headless browser to render JavaScript alongside Scrapy.

Let's consider each scenario, using Scrapy Splash in both cases. We'll start by scraping product images, names, prices, and links from ScrapingClub, a demo website that uses infinite scrolling.

The demo website loads content dynamically as you scroll down the page like so:

Infinite Scroll Demo

First, install scrapy-splash using pip:

Terminal
pip install scrapy-splash
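
scrapy-splash also needs a running Splash instance and a few project settings before SplashRequest will work. A typical setup, assuming you run Splash locally with Docker on the default port 8050 (see the scrapy-splash README for details):

Terminal
docker run -p 8050:8050 scrapinghub/splash

settings.py
# point scrapy-splash at your local Splash instance
SPLASH_URL = 'http://localhost:8050'

# enable the scrapy-splash downloader and spider middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# make duplicate filtering aware of Splash arguments
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

If you're running a standalone spider file instead of a full project, you can put the same settings in the spider's custom_settings dictionary.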

Now, let’s scrape this website!

Infinite Scroll to Load More Content

Infinite scrolling is common on social media and e-commerce websites. Using Splash to render JavaScript in Scrapy is a reliable way to handle it.

The following code demonstrates how to use Scrapy Splash to access and scrape data rendered by infinite scrolling.

The lua_script details how Splash should interact with the web page. The script specifies the number of times to scroll the page and implements a pause for more items to load when scrolling.

The SplashRequest accepts a URL, a callback, the execute endpoint (which runs Lua scripts on the Splash server), and an args dictionary that passes the script as lua_source.

scraper.py
# import the required libraries
import scrapy
from scrapy_splash import SplashRequest
 
lua_script = """
function main(splash, args)
    splash:go(args.url)
 
    local num_scrolls = 8
    local wait_after_scroll = 1.0
 
    local scroll_to = splash:jsfunc('window.scrollTo')
    local get_body_height = splash:jsfunc(
        'function() { return document.body.scrollHeight; }'
    )
 
    -- scroll to the end for 'num_scrolls' time
    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(wait_after_scroll)
    end       
        
    return splash:html()
end
"""
 
class ScrapingClubSpider(scrapy.Spider):
    name = 'scraping_club'
    start_urls = ['https://scrapingclub.com/exercise/list_infinite_scroll/']
 
    def start_requests(self):
        for url in self.start_urls:
            # run the Lua script through Splash's execute endpoint
            yield SplashRequest(url, callback=self.parse, endpoint='execute', args={'lua_source': lua_script})
 
 
    def parse(self, response):
        # iterate over the product elements
        for product in response.css('.post'):
            url = product.css('a').attrib['href']
            image = product.css('.card-img-top').attrib['src']
            name = product.css('h4 a::text').get()
            price = product.css('h5::text').get()
        
            # yield the scraped product data
            yield {
                'url': url,
                'image': image,
                'name': name,
                'price': price
            }

This scrapes the desired content successfully, as shown:

Output
{'url': '/exercise/list_basic_detail/90008-E/', 'image': '/static/img/90008-E.jpg', 'name': 'Short Dress', 'price': '$24.99'}
{'url': '/exercise/list_basic_detail/96436-A/', 'image': '/static/img/96436-A.jpg', 'name': 'Patterned Slacks', 'price': '$29.99'}
{'url': '/exercise/list_basic_detail/93926-B/', 'image': '/static/img/93926-B.jpg', 'name': 'Short Chiffon Dress', 'price': '$49.99'}
{'url': '/exercise/list_basic_detail/90882-B/', 'image': '/static/img/90882-B.jpg', 'name': 'Off-the-shoulder Dress', 'price': '$59.99'}
#... other products omitted for brevity

The code works!

But what if the pagination style requires a user to click a "Load More" button to view more content? Let's see how to tackle that in the following section.  


Click on a Button to Load More Content

A "Load More" button is another pagination style you might encounter while scraping paginated websites with Scrapy. Scrapy Splash requires interaction with the web page to get more content. 

In this example, we'll use Splash with Scrapy to retrieve product information from Skechers.

This website uses the "Load More" button to show content dynamically, as shown:

Load More Infinite Demo

The code below obtains the product names and prices from the target website. The bulk of the work happens in the lua_script, which implements the scrolling and clicking strategy for Splash.

num_scrolls in the lua_script determines how many times Splash scrolls the page. On each pass, the script checks whether the "Load More" button is present and visible, clicks it in the page context if so, and then waits for the new content to load.

Paste the lua_script in your spider file, as shown:

scraper.py
# importing necessary libraries
import scrapy
from scrapy_splash import SplashRequest
 
lua_script = """
function main(splash, args)
    splash:go(args.url)
 
    local num_scrolls = 10
    local wait_after_scroll = 1.0
    local wait_after_click = 5.0
 
    local scroll_to = splash:jsfunc('window.scrollTo')
    local get_body_height = splash:jsfunc(
        'function() { return document.body.scrollHeight; }'
    )
 
    -- scroll to the end and click 'Load More' for 'num_scrolls' times
    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(wait_after_scroll)
        -- check whether the 'Load More' button exists and is visible
        local is_button_visible = splash:evaljs([[
            var button = document.querySelectorAll('button.btn.btn-primary')[1];
            button && button.offsetHeight > 0;
        ]])

        -- click the 'Load More' button in the page context
        if is_button_visible then
            splash:runjs("document.querySelectorAll('button.btn.btn-primary')[1].click();")
            splash:wait(wait_after_click)
        end
    end
 
    return splash:html()
end
"""
#...

Next, write your spider class and pass the lua_script to the SplashRequest call.

scraper.py
#...
 
class SkechersSpider(scrapy.Spider):
    name = 'skechers_scraper'
    start_urls = ['https://www.skechers.com/men/shoes/']
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        }
 
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, callback=self.parse, endpoint='execute', args={'lua_source': lua_script})
 
    def parse(self, response):
        products = response.css('div.product-grid div.col-6')
 
        for product in products:
            item = {
                'name': product.css('a.link.c-product-tile__title::text').get(),
                'price': product.css('span.value::text').get()
            }
 
            yield item

Here's the code combined:

scraper.py
# importing necessary libraries
import scrapy
from scrapy_splash import SplashRequest
 
lua_script = """
function main(splash, args)
    splash:go(args.url)
 
    local num_scrolls = 10
    local wait_after_scroll = 1.0
    local wait_after_click = 5.0
 
    local scroll_to = splash:jsfunc('window.scrollTo')
    local get_body_height = splash:jsfunc(
        'function() { return document.body.scrollHeight; }'
    )
 
    -- Scroll to the end and click 'Load More' for 'num_scrolls' times
    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(wait_after_scroll)
        -- check whether the 'Load More' button exists and is visible
        local is_button_visible = splash:evaljs([[
            var button = document.querySelectorAll('button.btn.btn-primary')[1];
            button && button.offsetHeight > 0;
        ]])

        -- click the 'Load More' button in the page context
        if is_button_visible then
            splash:runjs("document.querySelectorAll('button.btn.btn-primary')[1].click();")
            splash:wait(wait_after_click)
        end
    end
 
    return splash:html()
end
"""
 
class SkechersSpider(scrapy.Spider):
    name = 'skechers_scraper'
    start_urls = ['https://www.skechers.com/men/shoes/']
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        }
    
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, callback=self.parse, endpoint='execute', args={'lua_source': lua_script})
 
    def parse(self, response):
        products = response.css('div.product-grid div.col-6')
 
        for product in products:
            item = {
                'name': product.css('a.link.c-product-tile__title::text').get(),
                'price': product.css('span.value::text').get()
            }
 
            yield item

The code outputs the product names and prices of the items on that page:

Output
{'name': 'Skechers Slip-ins: Snoop Flex - Velvet', 'price': '$100.00'}
{'name': 'Skechers Slip-ins: Snoop One - Boss Life Velvet', 'price': '$115.00'}
{'name': 'SKX RESAGRIP', 'price': '$150.00'}
{'name': 'SKX FLOAT', 'price': '$150.00'}
{'name': 'Skechers Slip-ins: Max Cushioning AF - Fortuitous', 'price': '$120.00'}
{'name': 'Skechers Slip-ins: Max Cushioning AF - Game', 'price': '$120.00'}
#...other content omitted for brevity

That's it! Your code works and is now scraping content dynamically after clicking a "Load More" button on a paginated website.

But the problem is only half solved. Most websites use anti-bot measures to block scrapers.

How can you avoid this while scraping with Scrapy?

Getting Blocked when Scraping Multiple Pages with Scrapy

Your scraper can get blocked if a website uses anti-bot measures, which can flag you as a bot when you request too much content too quickly, for example. You then have to handle CAPTCHAs, rotate proxies, and more.

Thankfully, a solution like ZenRows makes your scraping job much easier and integrates with Scrapy to handle all those complexities. It equips you with premium proxies, JavaScript interactions, and everything you need to avoid getting blocked.
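
As a rough sketch of what that integration can look like, one common pattern is to wrap each target URL in a call to the ZenRows API so the request goes through ZenRows instead of hitting the site directly. The endpoint and parameter names below (apikey, url, js_render, premium_proxy) are based on the ZenRows docs at the time of writing, so double-check the current documentation and replace the placeholder API key with your own:

scraper.py
# a minimal sketch of routing Scrapy requests through the ZenRows API
from urllib.parse import urlencode

import scrapy


class ZenRowsSpider(scrapy.Spider):
    name = 'zenrows_example'
    # placeholder: replace with your own ZenRows API key
    api_key = 'YOUR_ZENROWS_API_KEY'

    def start_requests(self):
        target_url = 'https://www.scrapingcourse.com/ecommerce/'
        # wrap the target URL in a ZenRows API call with JS rendering and premium proxies
        params = {
            'apikey': self.api_key,
            'url': target_url,
            'js_render': 'true',
            'premium_proxy': 'true',
        }
        yield scrapy.Request(f'https://api.zenrows.com/v1/?{urlencode(params)}', callback=self.parse)

    def parse(self, response):
        # the response body is the rendered HTML of the target page
        self.log(response.css('title::text').get())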

Try ZenRows with Scrapy for free and scrape any website.

Conclusion

This article showed you how to use Scrapy for multi-page scraping, covering both traditional and JavaScript-based pagination.

You now know how to:

  • Scrape paginated pages using both the navigation bar and URL-based techniques.
  • Apply dynamic scraping methods for infinite scrolling and "Load More" buttons.
  • Deal with common web scraping barriers, such as getting blocked.

The power of Scrapy is clear, yet barriers like anti-bot measures can present challenges. ZenRows integrates seamlessly with Scrapy, providing an easy solution to scrape any website. Try ZenRows for free!
