Are you looking for an easy solution to take screenshots with Scrapy-Splash while web scraping in Python? You're in the right place!
In this guide, we'll explore three techniques for capturing screenshots with Scrapy-Splash:
- Take a screenshot of the visible part of the screen.
- Grab a full-page screenshot.
- Create a screenshot of a specific element.
Let's dive in!
How to Take a Screenshot With Splash?
The Scrapy-Splash library integrates Splash into Scrapy. This integration allows you to use Splash's JavaScript rendering capabilities within the Scrapy framework. You can capture screenshots of webpages using the library's built-in features or Lua scripts.
Let's set up our project to get started!
This tutorial assumes you've already created a Scrapy project. If you haven't, check out our general tutorial on Scrapy Splash.
Splash handles the rendering of JavaScript content. To enable communication with Scrapy, you need to run Splash as a separate service using Docker.
If you haven't installed Docker, download Docker Desktop from the official website.
Open a terminal and pull the Splash image with Docker:
docker pull scrapinghub/splash
After successfully running this command, you should see the scrapinghub/splash image in Docker Desktop's Images tab.
Next, run the container using the following command:
docker run -it -p 8050:8050 --rm scrapinghub/splash
To confirm that Splash is running as expected, open http://localhost:8050/ in your browser. You should see the following page:
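You can also sanity-check the service straight from Python before wiring it into Scrapy. Here's a minimal script that hits Splash's render.html HTTP endpoint directly; the example.com URL and the one-second wait are arbitrary test values:

# pip install requests
import requests

# ask Splash to render a simple test page and return its HTML
response = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://example.com", "wait": 1},
)

# a 200 status code means Splash is up and rendering pages
print(response.status_code)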
Now, open your Scrapy project and run the following command:
pip install scrapy-splash
This command installs the Scrapy-Splash package, which includes the middleware and classes needed to send requests to Splash and handle the responses.
Configure the settings.py file to communicate with the Splash service:
# define the name of the Scrapy bot
# use the same name as used for initializing the Scrapy project
BOT_NAME = "splash_scraper"
# define the modules where Scrapy will look for spiders
SPIDER_MODULES = ["splash_scraper.spiders"]
NEWSPIDER_MODULE = "splash_scraper.spiders"
# specify the URL where the Splash service is running
SPLASH_URL = "http://localhost:8050"
# disable obeying robots.txt rules
ROBOTSTXT_OBEY = False
# enable Splash deduplication middleware for handling requests with different Splash arguments
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
# enable downloader middlewares to handle cookies in Splash requests,
# route requests through Splash, and handle HTTP compression
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
# use Splash-aware duplicate filter to handle duplicate requests with different Splash arguments
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
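Optionally, if you later turn on Scrapy's HTTP cache (HTTPCACHE_ENABLED), the scrapy-splash documentation also recommends a Splash-aware cache storage so cached entries take Splash arguments into account:

# optional: Splash-aware cache storage, only needed if Scrapy's HTTP cache is enabled
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"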
You're now ready to start scraping using Scrapy-Splash.
In this tutorial, you'll use a product page of the ScrapingCourse demo website as the target URL.
Initialize a Scrapy spider for the target page by running the following command in the terminal:
scrapy genspider screenshot_spider https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/
It will create a new screenshot_spider.py file in the spiders directory. Replace its code with the following script that prints the HTML content of the target webpage.
import scrapy
from scrapy_splash import SplashRequest


class ScreenshotSpider(scrapy.Spider):
    # define the name of the spider
    name = "screenshot_spider"

    def start_requests(self):
        # specify the URL to scrape
        url = "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"

        # set up Splash arguments to render the HTML with a wait time of 2 seconds
        splash_args = {
            "html": 1,
            "wait": 2,
        }

        # make a request using Splash and pass the response to the parse method
        yield SplashRequest(url, self.parse, endpoint="render.html", args=splash_args)

    def parse(self, response):
        # decode the response body to get the HTML content
        html_content = response.body.decode("utf-8")

        # print the HTML content to the console
        print(html_content)
Run the spider using the following command:
scrapy crawl screenshot_spider
The target page HTML response will be printed. Here’s the output showing the title of the page, with some parts omitted for brevity:
<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="profile" href="https://gmpg.org/xfn/11">
<link rel="pingback" href="https://www.scrapingcourse.com/ecommerce/xmlrpc.php">
<!-- omitted for brevity -->
<title>Abominable Hoodie - Ecommerce Test Site to Learn Web Scraping</title>
<!-- omitted for brevity -->
</body>
</html>
Fantastic! You're now ready to grab screenshots using this base Scrapy script.
Option 1: Take a Screenshot of the Visible Part of the Screen
The viewport, or simply the visible part of the screen, is the portion of the web page you can see without scrolling.
You'll capture the following viewport screenshot of the target product page:
Let's modify our code to capture the screenshot of the visible part of the screen.
Import the base64 module to decode the screenshot data.
import base64
# ...
In the start_requests method, add an argument to request a PNG screenshot. Also, change the endpoint to render.json to receive a JSON response that includes the PNG data.
# ...

class ScreenshotSpider(scrapy.Spider):
    # ...

    def start_requests(self):
        # specify the URL to scrape
        url = "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"

        # set up Splash arguments to render HTML and PNG with a wait time of 2 seconds
        splash_args = {
            "html": 1,
            "png": 1,
            "wait": 2,
            # optionally specify viewport size if you want a specific visible part size
            # "viewport": "1024x768",
        }

        # make a request using Splash and pass the response to the parse method
        yield SplashRequest(url, self.parse, endpoint="render.json", args=splash_args)

# ...
Finally, decode the PNG data from the base64-encoded string in the JSON response and write it to an image file.
# ...

class ScreenshotSpider(scrapy.Spider):
    # ...

    def parse(self, response):
        # decode the PNG data from the response
        png_bytes = base64.b64decode(response.data["png"])

        # save the decoded PNG data to the file
        with open("viewport_screenshot.png", "wb") as f:
            f.write(png_bytes)
Combine all the snippets above. The complete code should look like this:
import base64
import scrapy
from scrapy_splash import SplashRequest


class ScreenshotSpider(scrapy.Spider):
    # define the name of the spider
    name = "screenshot_spider"

    def start_requests(self):
        # specify the URL to scrape
        url = "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"

        # set up Splash arguments to render HTML and PNG with a wait time of 2 seconds
        splash_args = {
            "html": 1,
            "png": 1,
            "wait": 2,
            # optionally specify viewport size if you want a specific visible part size
            # "viewport": "1024x768",
        }

        # make a request using Splash and pass the response to the parse method
        yield SplashRequest(url, self.parse, endpoint="render.json", args=splash_args)

    def parse(self, response):
        # decode the PNG data from the response
        png_bytes = base64.b64decode(response.data["png"])

        # save the decoded PNG data to the file
        with open("viewport_screenshot.png", "wb") as f:
            f.write(png_bytes)
You can run the script using the following command:
scrapy crawl screenshot_spider
Great! You've just taken a viewport screenshot of a webpage using Scrapy-Splash in Python!
You can achieve the same results within your Scrapy spider using a Lua script. Lua gives you more control over the scraping process and can effectively handle dynamic websites.
Let's define the Lua script we will integrate into the Scrapy code above.
First, set the viewport size to your desired dimensions, e.g., 1024x768. Then, navigate to the specified URL and wait a few seconds for the page to render fully. Finally, capture and return the screenshot data.
# define the Lua script for Splash to execute
lua_script = """
function main(splash)
    -- set viewport size to a specific width and height
    splash:set_viewport_size(1024, 768)

    -- go to the URL
    assert(splash:go(splash.args.url))

    -- wait for the page to fully render
    splash:wait(2)

    -- capture a screenshot
    local screenshot = splash:png()

    return {
        png = screenshot
    }
end
"""
In your Scrapy code, pass the Lua script as an argument to SplashRequest. Once you receive the response, decode the PNG data from its base64-encoded string and save it as a file. Here's what your final code integrating the Lua script should look like:
import base64
import scrapy
from scrapy_splash import SplashRequest


class ScreenshotSpider(scrapy.Spider):
    # define the name of the spider
    name = "screenshot_spider"

    def start_requests(self):
        # specify the URL to scrape
        url = "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"

        # define the Lua script for Splash to execute
        lua_script = """
        function main(splash)
            -- set viewport size to a specific width and height
            splash:set_viewport_size(1024, 768)
            -- go to the URL
            assert(splash:go(splash.args.url))
            -- wait for the page to fully render
            splash:wait(2)
            -- capture a screenshot
            local screenshot = splash:png()
            return {
                png = screenshot
            }
        end
        """

        # make a request using Splash and pass the response to the parse method
        yield SplashRequest(url, self.parse, endpoint="execute", args={"lua_source": lua_script})

    def parse(self, response):
        # decode the PNG data from the response
        png_bytes = base64.b64decode(response.data["png"])

        # save the decoded PNG data to the file
        with open("viewport_screenshot_using_lua.png", "wb") as f:
            f.write(png_bytes)
Run the script using the following command:
scrapy crawl screenshot_spider
Excellent! You've just successfully taken a viewport screenshot with a Lua script.
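A handy property of the execute endpoint is that the Lua script can return several fields at once. As a sketch, the script above could also hand back the rendered HTML alongside the PNG; the parse method would then find it in response.data["html"]:

# variation of the same Lua script returning both the screenshot and the HTML
lua_script = """
function main(splash)
    splash:set_viewport_size(1024, 768)
    assert(splash:go(splash.args.url))
    splash:wait(2)
    -- each returned field becomes a key in the JSON response
    return {
        png = splash:png(),
        html = splash:html(),
    }
end
"""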
Option 2: Grab a Full-Page Screenshot
A full-page screenshot encompasses the entire webpage, including the parts you need to scroll down to.
Let's capture the following full-page screenshot of the target product page:
The render_all argument in the Splash request configuration is used to take a full-page screenshot. When it's set to 1, Splash renders the entire web page, not just the visible viewport. Modify the code from the previous section (note the added render_all argument):
import base64
import scrapy
from scrapy_splash import SplashRequest


class ScreenshotSpider(scrapy.Spider):
    # define the name of the spider
    name = "screenshot_spider"

    def start_requests(self):
        # specify the URL to scrape
        url = "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"

        # set up Splash arguments to render HTML and PNG, take a full-page screenshot, and wait for 2 seconds
        splash_args = {
            "html": 1,
            "png": 1,
            "render_all": 1,
            "wait": 2,
        }

        # make a request using Splash and pass the response to the parse method
        yield SplashRequest(url, self.parse, endpoint="render.json", args=splash_args)

    def parse(self, response):
        # decode the PNG data from the response
        png_bytes = base64.b64decode(response.data["png"])

        # save the decoded PNG data to the file
        with open("full_page_screenshot.png", "wb") as f:
            f.write(png_bytes)
Execute this script using the following command:
scrapy crawl screenshot_spider
Great! You now know how to take full-page screenshots using Scrapy-Splash in Python.
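If you'd rather stick with the execute endpoint, Splash's Lua API exposes splash:set_viewport_full(), which expands the viewport to cover the whole rendered page before the capture. Here's a minimal sketch reusing the structure from Option 1:

# Lua equivalent of render_all: grow the viewport to the full page before capturing
lua_script = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(2)
    -- resize the viewport to fit the entire page, then take the shot
    splash:set_viewport_full()
    return { png = splash:png() }
end
"""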
Option 3: Create a Screenshot of a Specific Element
A specific-element screenshot captures an image of one particular part of a webpage. For example, here's a product summary element from the demo product page:
While Scrapy and Splash can handle basic scraping and rendering tasks, you need to use Lua for advanced web interactions, like grabbing a screenshot of a specific element.
Let's capture the above screenshot using Lua.
Modify the previous code by integrating a Lua script. Target a specific element on the page by its class name, .entry-summary. Here's how your final code would look:
import base64
import scrapy
from scrapy_splash import SplashRequest


class ScreenshotSpider(scrapy.Spider):
    # define the name of the spider
    name = "screenshot_spider"

    def start_requests(self):
        # specify the URL to scrape
        url = "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"

        # define the Lua script for Splash to execute
        lua_script = """
        function main(splash)
            -- set viewport size to a specific width and height
            splash:set_viewport_size(1024, 768)
            -- go to the URL
            assert(splash:go(splash.args.url))
            -- wait for the page to fully render
            splash:wait(2)
            -- select the element by class name
            local element = splash:select(".entry-summary")
            -- capture a screenshot of the specific element
            local screenshot = element:png()
            return {
                png = screenshot
            }
        end
        """

        # make a request using Splash and pass the response to the parse method
        yield SplashRequest(url, self.parse, endpoint="execute", args={"lua_source": lua_script})

    def parse(self, response):
        # decode the PNG data from the response
        png_bytes = base64.b64decode(response.data["png"])

        # save the decoded PNG data to the file
        with open("specific_element_screenshot.png", "wb") as f:
            f.write(png_bytes)
Run this code using the following command:
scrapy crawl screenshot_spider
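One caveat: splash:select() returns nil when the selector matches nothing, and calling :png() on nil aborts the script with a cryptic error. A guarded version of the same Lua script (the error message below is just an example) fails with a readable message instead:

# same Lua script with a guard against a missing element
lua_script = """
function main(splash)
    splash:set_viewport_size(1024, 768)
    assert(splash:go(splash.args.url))
    splash:wait(2)
    -- select the element and fail loudly if nothing matches
    local element = splash:select(".entry-summary")
    assert(element, "no element matched .entry-summary")
    return { png = element:png() }
end
"""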
Good job! You now know how to capture all types of screenshots.
Avoid Blocks and Bans While Taking Screenshots With Splash
Unfortunately, taking screenshots of your desired target pages is not always this easy. Websites with anti-bot measures can prevent you from taking screenshots and block your scraper.
For instance, let's try to capture a viewport screenshot of a G2 Reviews webpage, a site protected by anti-bot solutions.
Replace the target URL with the G2 URL to check if it works:
import base64
import scrapy
from scrapy_splash import SplashRequest


class ScreenshotSpider(scrapy.Spider):
    # define the name of the spider
    name = "screenshot_spider"

    def start_requests(self):
        # specify the URL to scrape
        url = "https://www.g2.com/products/asana/reviews"

        # set up Splash arguments to render HTML and PNG with a wait time of 2 seconds
        splash_args = {
            "html": 1,
            "png": 1,
            "wait": 2,
        }

        # make a request using Splash and pass the response to the parse method
        yield SplashRequest(url, self.parse, endpoint="render.json", args=splash_args)

    def parse(self, response):
        # decode the PNG data from the response
        png_bytes = base64.b64decode(response.data["png"])

        # save the decoded PNG data to the file
        with open("g2_blocked_screenshot.png", "wb") as f:
            f.write(png_bytes)
Running the code will fetch you the following screenshot:
This result means that your script got blocked by Cloudflare.
Luckily, there are ways to bypass anti-bot systems and scrape without getting blocked. The most effective way to achieve this is by using a web scraping API like ZenRows.
ZenRows offers an efficient solution for taking all types of screenshots: viewport, full-page, and specific elements. It also acts as a headless browser and integrates auto-rotating premium proxies, optimized headers, CAPTCHA bypass, and other advanced features to help you avoid blocks. In short, it's a complete web scraping toolkit that lets you dodge blocks and bans automatically and retrieve any data you want.
Let's use ZenRows instead of Scrapy and Splash to capture a screenshot of the same G2 Reviews page that got you blocked in the last step.
Sign up for ZenRows, and you'll get redirected to the Request Builder. Paste the target URL in the link box, select Premium proxies, and activate the JS Rendering Boost mode. Then, choose Python as your language. Finally, click on the API tab.
The generated code uses Python's Requests as the HTTP client. Install the library using the following command:
pip install requests
Modify the generated code by adding "screenshot": "true" to the request parameters. Then, save the response (the screenshot data) to an image file.
Here’s the final code using ZenRows to capture a screenshot:
# pip install requests
import requests

url = "https://www.g2.com/products/asana/reviews"
apikey = "<YOUR_ZENROWS_API_KEY>"

# define your request parameters
params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",
    "premium_proxy": "true",
    "screenshot": "true",
}

# get the response and take the screenshot
response = requests.get("https://api.zenrows.com/v1/", params=params)
with open("screenshot_using_zenrows.png", "wb") as f:
    f.write(response.content)
Running the code produces the following screenshot:
Congratulations! You've just taken a screenshot of a Cloudflare-protected webpage with ZenRows.
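One defensive tweak worth considering: if the API call fails (an invalid key, an exhausted quota, an unreachable target), the response body holds an error payload rather than PNG bytes, and writing it blindly produces a corrupt image file. Here's a small guard around the file write, assuming the same response object as above:

# write the image only on success; otherwise surface the error for debugging
if response.ok:
    with open("screenshot_using_zenrows.png", "wb") as f:
        f.write(response.content)
else:
    print(response.status_code, response.text)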
Conclusion
In this tutorial, you learned how to capture screenshots in Scrapy with Splash. You now know how to:
- Take a viewport screenshot of a webpage.
- Capture a full webpage, including the parts that require scrolling.
- Get a screenshot of a specific target element.
- Access a protected website and grab its screenshot.
Even though Scrapy and Splash are handy for taking screenshots, they can't stand up to advanced anti-bot solutions that may block your scraper. A web scraping API, such as ZenRows, will help you avoid all blocks and bans and take screenshots uninterrupted. Try ZenRows for free!