How to Take a Screenshot With Scrapy Splash

August 13, 2024 · 6 min read

Are you looking for an easy solution to take screenshots with Scrapy-Splash while web scraping in Python? You're in the right place!

In this guide, we'll explore three techniques for capturing screenshots with Scrapy-Splash:

  • Option 1: Take a screenshot of the visible part of the screen.
  • Option 2: Grab a full-page screenshot.
  • Option 3: Create a screenshot of a specific element.

Let's dive in!

How to Take a Screenshot With Splash?

The Scrapy-Splash library integrates Splash into Scrapy. This integration allows you to use Splash's JavaScript rendering capabilities within the Scrapy framework. You can capture screenshots of webpages using the library's built-in features or Lua scripts.

Let's set up our project to get started!

This tutorial assumes you've already created a Scrapy project. If you haven't, check out our general tutorial on Scrapy Splash.

Splash handles the rendering of JavaScript content. To enable communication with Scrapy, you need to run Splash as a separate service using Docker.

If you haven't installed Docker, download Docker Desktop from the official website.

Open a terminal and pull the Splash image with Docker:

Terminal
docker pull scrapinghub/splash

After running this command successfully, you should see the scrapinghub/splash image in Docker Desktop's Images tab.
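
If you're not using Docker Desktop, you can also verify the image from the terminal with the standard Docker CLI:

Terminal
docker images scrapinghub/splash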


Next, run the container using the following command:

Terminal
docker run -it -p 8050:8050 --rm scrapinghub/splash

To confirm that Splash is running as expected, open http://localhost:8050/ in your browser. You should see the following page:

Splash
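
You can also run a quick health check from the terminal; Splash exposes a _ping endpoint that should return a small JSON status payload:

Terminal
curl http://localhost:8050/_ping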

Now, open your Scrapy project and run the following command:

Terminal
pip install scrapy-splash

This command installs the Scrapy-Splash package, which includes the middleware and classes needed to send requests to Splash and handle the responses.

Configure the settings.py file to communicate with the Splash service:

Example
# define the name of the Scrapy bot
# use the same name as used for initializing the Scrapy project
BOT_NAME = "splash_scraper"

# define the modules where Scrapy will look for spiders
SPIDER_MODULES = ["splash_scraper.spiders"]
NEWSPIDER_MODULE = "splash_scraper.spiders"

# specify the URL where the Splash service is running
SPLASH_URL = "http://localhost:8050"

# disable obeying robots.txt rules
ROBOTSTXT_OBEY = False

# enable Splash deduplication middleware for handling requests with different Splash arguments
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# enable middleware to handle cookies in Splash requests
# enable middleware to handle Splash requests
# enable middleware to handle HTTP compression
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# use Splash-aware duplicate filter to handle duplicate requests with different Splash arguments
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
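
# (optional) if you also enable Scrapy's HTTP cache, the scrapy-splash docs
# recommend a Splash-aware cache storage so cached entries account for Splash arguments
# HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"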

You're now ready to start scraping using Scrapy-Splash.

In this tutorial, you'll use a product page of the ScrapingCourse demo website as the target URL.

Initialize a Scrapy spider for the target page by running the following command in the terminal:

Terminal
scrapy genspider screenshot_spider https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/

It will create a new screenshot_spider.py file in the spiders directory. Replace its code with the following script that prints the HTML content of the target webpage.

Example
import scrapy
from scrapy_splash import SplashRequest

class ScreenshotSpider(scrapy.Spider):
    # define the name of the spider
    name = "screenshot_spider"

    def start_requests(self):
        # specify the URL to scrape
        url = "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"
        # set up Splash arguments to render the HTML with a wait time of 2 seconds
        splash_args = {
            "html": 1,
            "wait": 2
        }

        # make a request using Splash and pass the response to the parse method
        yield SplashRequest(url, self.parse, endpoint="render.html", args=splash_args)

    def parse(self, response):
        # decode the response body to get the HTML content
        html_content = response.body.decode("utf-8")

        # print the HTML content to the console
        print(html_content)

Run the spider using the following command:

Terminal
scrapy crawl screenshot_spider

The target page HTML response will be printed. Here’s the output showing the title of the page, with some parts omitted for brevity:

Output
<!DOCTYPE html>
<html lang="en-US">
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1">
        <link rel="profile" href="https://gmpg.org/xfn/11">
        <link rel="pingback" href="https://www.scrapingcourse.com/ecommerce/xmlrpc.php">
        <!-- omitted for brevity -->
        <title>Abominable Hoodie - Ecommerce Test Site to Learn Web Scraping</title>
    
    <!-- omitted for brevity -->
    
    </body>
</html>

Fantastic! You're now ready to grab screenshots using this base Scrapy script.


Option 1: Take a Screenshot of the Visible Part of the Screen

The viewport, i.e., the visible part of the screen, is the portion of the web page you can see without scrolling.

You'll capture the following viewport screenshot of the target product page:

ScrapingCourse product page visible part screenshot

Let's modify our code to capture the screenshot of the visible part of the screen.

Import the base64 package to handle the screenshot image encoding/decoding.

Example
import base64
# ...

In the start_requests method, add an argument to request a PNG screenshot. Also, change the endpoint to render.json to receive a JSON response that includes the PNG data.

Example
# ...


class ScreenshotSpider(scrapy.Spider):
    # ...

    def start_requests(self):
        # specify the URL to scrape
        url = "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"
        # set up Splash arguments to render HTML and PNG with a wait time of 2 seconds
        splash_args = {
            "html": 1,
            "png": 1,
            "wait": 2,
            # optionally specify viewport size if you want a specific visible part size
            # 'viewport': '1024x768',
        }

        # make a request using Splash and pass the response to the parse method
        yield SplashRequest(url, self.parse, endpoint="render.json", args=splash_args)


    # ...

Finally, decode the PNG data from the base64-encoded string in the JSON response and write it to an image file.

Example
# ...


class ScreenshotSpider(scrapy.Spider):
    # ... 

    def parse(self, response):
        # decode the PNG data from the response
        png_bytes = base64.b64decode(response.data["png"])

        # save the decoded PNG data to the file
        with open("viewport_screenshot.png", "wb") as f:
            f.write(png_bytes)

Combine all the snippets above. The complete code should look like this:

Example
import base64
import scrapy
from scrapy_splash import SplashRequest

class ScreenshotSpider(scrapy.Spider):
    # define the name of the spider
    name = "screenshot_spider"

    def start_requests(self):
        # specify the URL to scrape
        url = "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"
        # set up Splash arguments to render HTML and PNG with a wait time of 2 seconds
        splash_args = {
            "html": 1,
            "png": 1,
            "wait": 2,
            # optionally specify viewport size if you want a specific visible part size
            # 'viewport': '1024x768',
        }

        # make a request using Splash and pass the response to the parse method
        yield SplashRequest(url, self.parse, endpoint="render.json", args=splash_args)

    def parse(self, response):
        # decode the PNG data from the response
        png_bytes = base64.b64decode(response.data["png"])

        # save the decoded PNG data to the file
        with open("viewport_screenshot.png", "wb") as f:
            f.write(png_bytes)

You can run the script using the following command:

Terminal
scrapy crawl screenshot_spider

Great! You've just taken a viewport screenshot of a webpage using Scrapy-Splash in Python!
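
A note on endpoints: if you only need the image and not the HTML, Splash also provides a render.png endpoint that returns raw PNG bytes directly, so there's no base64 decoding step. Here's a minimal sketch of that variant (same spider, different endpoint):

Example
# ...

class ScreenshotSpider(scrapy.Spider):
    # ...

    def start_requests(self):
        # specify the URL to scrape
        url = "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"
        # render.png returns the PNG image itself rather than a JSON payload
        yield SplashRequest(url, self.parse, endpoint="render.png", args={"wait": 2})

    def parse(self, response):
        # the response body already holds raw PNG bytes, so no decoding is needed
        with open("viewport_screenshot.png", "wb") as f:
            f.write(response.body)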

You can achieve the same results within your Scrapy spider using a Lua script. Lua gives you more control over the scraping process and can effectively handle dynamic websites.

Let's define the Lua script we will integrate into the Scrapy code above.

First, set the viewport size to your desired dimensions, e.g., 1024x768. Then, navigate to the specified URL and wait a few seconds for the page to render fully. Finally, capture and return the screenshot data.

Example
# define the Lua script for Splash to execute
lua_script = """
function main(splash)
    -- set viewport size to a specific width and height
    splash:set_viewport_size(1024, 768)

    -- go to the URL
    assert(splash:go(splash.args.url))

    -- wait for the page to fully render
    splash:wait(2)

    -- capture a screenshot
    local screenshot = splash:png()

    return {
        png = screenshot
    }
end
"""

In your Scrapy code, pass the Lua script as an argument to SplashRequest. Once you receive the response, decode the PNG data from its base64-encoded string and save it as a file. Here's what your final code integrating the Lua script should look like:

Example
import base64
import scrapy
from scrapy_splash import SplashRequest

class ScreenshotSpider(scrapy.Spider):
    # define the name of the spider
    name = "screenshot_spider"

    def start_requests(self):
        # specify the URL to scrape
        url = "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"
        # define the Lua script for Splash to execute
        lua_script = """
        function main(splash)
            -- set viewport size to a specific width and height
            splash:set_viewport_size(1024, 768)

            -- go to the URL
            assert(splash:go(splash.args.url))

            -- wait for the page to fully render
            splash:wait(2)

            -- capture a screenshot
            local screenshot = splash:png()

            return {
                png = screenshot
            }
        end
        """

        # make a request using Splash and pass the response to the parse method
        yield SplashRequest(url, self.parse, endpoint="execute", args={"lua_source": lua_script})

    def parse(self, response):
        # decode the PNG data from the response
        png_bytes = base64.b64decode(response.data["png"])

        # save the decoded PNG data to the file
        with open("viewport_screenshot_using_lua.png", "wb") as f:
            f.write(png_bytes)

Run the script using the following command:

Terminal
scrapy crawl screenshot_spider

Excellent! You've just successfully taken a viewport screenshot with a Lua script.
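
As a refinement, the Splash documentation notes that splash:png accepts optional width and height arguments for scaling the captured image. For instance, to scale the screenshot down to a fixed width, you could change the capture line in the Lua script as follows:

Example
-- scale the screenshot to 640 pixels wide (height is computed to keep the aspect ratio)
local screenshot = splash:png{width=640}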

Option 2: Grab a Full-Page Screenshot

A full-page screenshot encompasses the entire webpage, including the parts you need to scroll down to.

Let's capture the following full-page screenshot of the target product page:

Full Page Screenshot

The render_all argument in the Splash request configuration is used to take a full-page screenshot. When it's set to 1, Splash renders the entire web page, not just the visible viewport. Modify the code from the previous section by adding this argument to splash_args:

Example
import base64
import scrapy
from scrapy_splash import SplashRequest

class ScreenshotSpider(scrapy.Spider):
    # define the name of the spider
    name = "screenshot_spider"

    def start_requests(self):
        # specify the URL to scrape
        url = "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"
        # set up Splash arguments to render HTML and PNG, take a full-page screenshot, and wait for 2 seconds
        splash_args = {
            "html": 1,
            "png": 1,
            "render_all": 1,
            "wait": 2,
        }

        # make a request using Splash and pass the response to the parse method
        yield SplashRequest(url, self.parse, endpoint="render.json", args=splash_args)

    def parse(self, response):
        # decode the PNG data from the response
        png_bytes = base64.b64decode(response.data["png"])

        # save the decoded PNG data to the file
        with open("full_page_screenshot.png", "wb") as f:
            f.write(png_bytes)

Execute this script using the following command:

Terminal
scrapy crawl screenshot_spider

Great! You now know how to take full-page screenshots using Scrapy-Splash in Python.
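
If you prefer the Lua route here as well, Splash's Lua API provides splash:set_viewport_full(), which expands the viewport to cover the whole page before capturing. Here's a hedged sketch of the script, a drop-in replacement for the earlier lua_script:

Example
# define the Lua script for Splash to execute
lua_script = """
function main(splash)
    -- go to the URL
    assert(splash:go(splash.args.url))

    -- wait for the page to fully render
    splash:wait(2)

    -- resize the viewport to fit the entire page
    splash:set_viewport_full()

    -- capture and return a full-page screenshot
    return {
        png = splash:png()
    }
end
"""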

Option 3: Create a Screenshot of a Specific Element

A specific element screenshot refers to capturing an image of a particular part or element of a webpage. For example, here's a product summary element from the demo product page:

Specific Element Screenshot

While Scrapy and Splash can handle basic scraping and rendering tasks, you need to use Lua for advanced web interactions, like grabbing a screenshot of a specific element.

Let's capture the above screenshot using Lua.

Modify the previous code by integrating a Lua script that targets a specific element on the page using the .entry-summary class selector. Here's how your final code should look:

Example
import base64
import scrapy
from scrapy_splash import SplashRequest

class ScreenshotSpider(scrapy.Spider):
    # define the name of the spider
    name = "screenshot_spider"

    def start_requests(self):
        # specify the URL to scrape
        url = "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"
        # define the Lua script for Splash to execute
        lua_script = """
        function main(splash)
            -- set viewport size to a specific width and height
            splash:set_viewport_size(1024, 768)

            -- go to the URL
            assert(splash:go(splash.args.url))

            -- wait for the page to fully render
            splash:wait(2)

            -- select the element by class name
            local element = splash:select(".entry-summary")

            -- capture a screenshot of the specific element
            local screenshot = element:png()

            return {
                png = screenshot
            }
        end
        """

        # make a request using Splash and pass the response to the parse method
        yield SplashRequest(url, self.parse, endpoint="execute", args={"lua_source": lua_script})

    def parse(self, response):
        # decode the PNG data from the response
        png_bytes = base64.b64decode(response.data["png"])

        # save the decoded PNG data to the file
        with open("specific_element_screenshot.png", "wb") as f:
            f.write(png_bytes)

Run this code using the following command:

Terminal
scrapy crawl screenshot_spider

Good job! You now know how to capture all types of screenshots.
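
One caveat worth knowing: splash:select returns nil when no element matches the selector, so the element:png() call would then fail with an unhelpful error. You can guard against that by asserting on the result inside the Lua script:

Example
-- select the element by class name; fail with a clear message if it's missing
local element = splash:select(".entry-summary")
assert(element, "element .entry-summary not found on the page")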

Avoid Blocks and Bans While Taking Screenshots With Splash

Unfortunately, taking screenshots of your desired target pages is not always this easy. Websites with anti-bot measures can prevent you from taking screenshots and block your scraper.

For instance, let's try to capture a viewport screenshot of a G2 Reviews webpage, a site protected by anti-bot solutions.

Replace the target URL with the G2 URL to check if it works:

Example
import base64
import scrapy
from scrapy_splash import SplashRequest

class ScreenshotSpider(scrapy.Spider):
    # define the name of the spider
    name = "screenshot_spider"

    def start_requests(self):
        # specify the URL to scrape
        url = "https://www.g2.com/products/asana/reviews"
        # set up Splash arguments to render HTML and PNG with a wait time of 2 seconds
        splash_args = {
            "html": 1,
            "png": 1,
            "wait": 2,
        }

        # make a request using Splash and pass the response to the parse method
        yield SplashRequest(url, self.parse, endpoint="render.json", args=splash_args)

    def parse(self, response):
        # decode the PNG data from the response
        png_bytes = base64.b64decode(response.data["png"])

        # save the decoded PNG data to the file
        with open("g2_blocked_screenshot.png", "wb") as f:
            f.write(png_bytes)

Running the code produces the following screenshot:

G2 Access Blocked

This result means that your script got blocked by Cloudflare.

Luckily, there are ways to bypass anti-bot systems and scrape without getting blocked. The most effective way to achieve this is by using a web scraping API like ZenRows.

ZenRows offers an efficient solution for taking all types of screenshots, including viewport, full-page, and specific-element captures. It acts as a headless browser and adds auto-rotating premium proxies, optimized headers, CAPTCHA bypass, and other advanced features, giving you a complete web scraping toolkit that avoids blocks and bans automatically and retrieves any data you want.

Let's use ZenRows instead of Scrapy and Splash to capture a screenshot of the same G2 Reviews page that got you blocked in the last step.

Sign up for ZenRows, and you'll get redirected to the Request Builder. Paste the target URL in the link box, select Premium Proxies, and activate JS Rendering Boost mode. Then, choose Python as your language. Finally, click on the API tab.

building a scraper with zenrows

The generated code uses Python's Requests as the HTTP client. Install the library using the following command:

Terminal
pip install requests

Modify the generated code by adding "screenshot": "true" to the request parameters. Save the response (the screenshot data) to an image file.

Here’s the final code using ZenRows to capture a screenshot:

Example
# pip install requests
import requests

url = "https://www.g2.com/products/asana/reviews"
apikey = "<YOUR_ZENROWS_API_KEY>"

# define your request parameters
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "premium_proxy": "true",
    "screenshot": "true",
}

# get the response and take the screenshot
response = requests.get("https://api.zenrows.com/v1/", params=params)
with open("screenshot_using_zenrows.png", "wb") as f:
    f.write(response.content)
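
If the API call fails, the response body contains an error message rather than image bytes, so it's worth checking the status before writing the file. Here's a small optional guard (our addition, not part of the generated snippet):

Example
# write the file only when the request succeeded;
# otherwise the body holds an error message, not PNG data
if response.ok:
    with open("screenshot_using_zenrows.png", "wb") as f:
        f.write(response.content)
else:
    print(f"request failed: {response.status_code} - {response.text}")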

Running the code will get you the following screenshot:

G2 Reviews Asana Screenshot

Congratulations! You've just taken a screenshot of a Cloudflare-protected webpage with ZenRows.

Conclusion

In this tutorial, you learned how to capture screenshots in Scrapy with Splash. You now know how to:

  • Take a viewport screenshot of a webpage.
  • Capture a full webpage, including the parts that require scrolling.
  • Get a screenshot of a specific target element.
  • Access a protected website and grab its screenshot.

Even though Scrapy and Splash are handy for taking screenshots, they can't stand up to advanced anti-bot solutions that may block your scraper. A web scraping API, such as ZenRows, helps you avoid all blocks and bans and take screenshots uninterrupted. Try ZenRows for free!
