Do you want to mask your requests with a proxy to avoid detection and IP bans while scraping with Splash?
This tutorial covers the three main methods of configuring a proxy in Splash, whether you're using Splash independently or pairing it with Scrapy:
- Option 1: Set a Splash request argument.
- Option 2: Set a proxy using a Lua script with Splash.
- Option 3: Use proxy profiles.
How to Set Your Proxy With Splash
Splash is a JavaScript rendering service with a dedicated HTTP API for executing Lua scripts during web scraping. You can call it from any programming language via an HTTP request and run JavaScript from inside Lua to extract dynamic content.
Proxy setup in Splash depends on the use case and can be divided into the following categories:
- For Scrapy integration.
- For independent use with Lua (two methods).
In this section, you'll learn three ways of setting up a proxy in Splash. In each case, you'll request https://httpbin.io/ip, a website that returns your current IP address.
You'll use free proxies from the Free Proxy List. These free proxies are only suitable for learning and may not work at the time of reading due to their short lifespan. Feel free to exchange them for new ones from the list.
Prerequisites: If You Don't Have a Running Splash Server
If you don't have a running Splash server, set one up before proceeding to the proxy setup.
Ensure you've installed the latest version of Docker on your machine. Then, pull the Splash image with the following command:
docker pull scrapinghub/splash
Include the sudo command if you're on Linux:
sudo docker pull scrapinghub/splash
Once the image is pulled, run the Docker image on a specific port:
docker run -it -p 8050:8050 --rm scrapinghub/splash
If you're on Linux:
sudo docker run -it -p 8050:8050 --rm scrapinghub/splash
The command above will start the Splash server at `http://localhost:8050`. You're now ready to set up your proxy.
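Optionally, you can confirm the server responds before configuring a proxy. Here's a minimal check using Python's Requests library; it assumes the default port 8050 and uses Splash's render.html endpoint:
# pip install requests
import requests

# ask the local Splash server to render a simple page
response = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://httpbin.io/ip", "wait": 0.5},
)

# a 200 status code means Splash is up and rendering pages
print(response.status_code)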
Option 1: Set a Splash Request Argument
The Splash request argument is the best option if you're using Scrapy with Splash. It involves passing the proxy address as an argument to the SplashRequest instance.
First, ensure you install Scrapy Splash using pip:
pip install scrapy-splash
Initialize a Scrapy project if you've not done so already:
scrapy startproject scraper
Then, configure your Scrapy project to use the Splash server if you haven't already. To do that, paste the following code into your Scrapy settings file:
# set the Splash local server endpoint
SPLASH_URL = "http://localhost:8050"

# enable the Splash downloader middleware and
# give it a higher priority than HttpCompressionMiddleware
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# enable the Splash deduplication argument filter to
# make Scrapy Splash save disk space on cached requests
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# set the Splash deduplication class
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
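Optionally, if your project also enables Scrapy's HTTP cache, scrapy-splash provides a Splash-aware cache storage backend you can add to the same settings file:
# use a Splash-aware cache storage backend with Scrapy's HTTP cache (optional)
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"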
Next, import Scrapy Splash into your spider file and point the scraper class to the target URL.
# import the required libraries
import scrapy
from scrapy_splash import SplashRequest
class Scraper(scrapy.Spider):
    name = "scraper"

    # point to the target URL
    allowed_domains = ["httpbin.io"]
    start_urls = ["https://httpbin.io/ip"]
Extend that class with a Lua script inside a multi-line string. The script visits the target page and returns its HTML:
class Scraper(scrapy.Spider):
    # ...

    # add a Lua script to access the website and print its HTML
    lua_script = """
    function main(splash, args)
        assert(splash:go(args.url))
        assert(splash:wait(0.5))
        return {
            html = splash:html()
        }
    end
    """
Now, initiate a request to the target URL with the SplashRequest object. Point to the Lua script and include the proxy address in the args dictionary:
class Scraper(scrapy.Spider):
    # ...

    def start_requests(self):
        # launch a Splash request and specify the proxy address inside the args
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint="execute",
                args={
                    "wait": 0.5,
                    "lua_source": self.lua_script,
                    "proxy": "http://189.240.60.171:9090"
                },
                cache_args=["lua_source"]
            )
Finally, decode and log the HTML result from the Lua script inside the parse method:
class Scraper(scrapy.Spider):
    # ...

    def parse(self, response):
        # get the HTML result from Lua
        splash_result = response.body
        # log the HTML result to view the current IP address
        self.logger.info("Splash Result: %s", splash_result.decode("utf-8"))
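If you'd rather log just the IP address than the whole body, you could also pull it out with a regular expression. This is an optional sketch, not part of the original spider; the regex simply matches the origin field that https://httpbin.io/ip embeds in the rendered page:
import re  # add this import at the top of the spider file

class Scraper(scrapy.Spider):
    # ...

    def parse(self, response):
        splash_result = response.body.decode("utf-8")
        # look for the "origin" field returned by https://httpbin.io/ip
        match = re.search(r'"origin":\s*"([^"]+)"', splash_result)
        if match:
            self.logger.info("Proxy IP: %s", match.group(1))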
Here's what your full code looks like after combining the snippets:
# import the required libraries
import scrapy
from scrapy_splash import SplashRequest

class Scraper(scrapy.Spider):
    name = "scraper"

    # point to the target URL
    allowed_domains = ["httpbin.io"]
    start_urls = ["https://httpbin.io/ip"]

    # add a Lua script to access the website and print its HTML
    lua_script = """
    function main(splash, args)
        assert(splash:go(args.url))
        assert(splash:wait(0.5))
        return {
            html = splash:html()
        }
    end
    """

    def start_requests(self):
        # launch a Splash request and specify the proxy address inside the args
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint="execute",
                args={
                    "wait": 0.5,
                    "lua_source": self.lua_script,
                    "proxy": "http://189.240.60.171:9090"
                },
                cache_args=["lua_source"]
            )

    def parse(self, response):
        # get the HTML result from Lua
        splash_result = response.body
        # log the HTML result to view the current IP address
        self.logger.info("Splash Result: %s", splash_result.decode("utf-8"))
Run your spider with the crawl command:
scrapy crawl scraper
Running the code twice outputs similar IP addresses (with different ports) from the specified proxy:
{
"origin": "189.240.60.168:9882"
}
{
"origin": "189.240.60.168:3718"
}
You've just implemented a proxy using Splash in your Scrapy web scraper. High five!
Let's go through the other proxy setup options.
Option 2: Set a Proxy Using a Lua Script With Splash
Setting up a proxy inside the Lua script is a good method if you're using Splash independently and want to integrate its API into other programming languages like Python and JavaScript.
Splash operates a dedicated server that can execute the Lua script for advanced proxy configuration and JavaScript rendering tasks.
For this tutorial, you'll run the Splash server locally on your machine and use it to execute Lua with Python's Requests library.
First, install the Requests library if you haven't already:
pip install requests
Next, import the Requests library and write your Lua script inside a multi-line string. The script contains a function that first sets up the proxy address, then visits the target URL (https://httpbin.io/ip) and returns its HTML content:
# import the required library
import requests

# develop your Lua script
lua_script = """
function main(splash, args)
    -- set up proxy
    splash:on_request(function(request)
        request:set_proxy{
            type = "HTTP",
            host = "189.240.60.171",
            port = 9090,
        }
    end)

    -- visit the target URL
    assert(splash:go(args.url))
    assert(splash:wait(0.5))

    -- print the HTML content
    return {
        html = splash:html(),
    }
end
"""
Send a POST request to the Splash execute endpoint, specifying the target URL and the Lua script variable in the request body. Then, print the JSON response to view the website's HTML:
# ...
response = requests.post(
    # specify the Splash server API endpoint
    "http://localhost:8050/execute",
    # define the request body
    json={
        "lua_source": lua_script,
        "url": "https://httpbin.io/ip",
        "timeout": 60,
    }
)

# get the response
print(response.json())
Combine both snippets. Your final code should look like this:
# import the required library
import requests

# develop your Lua script
lua_script = """
function main(splash, args)
    -- set up proxy
    splash:on_request(function(request)
        request:set_proxy{
            type = "HTTP",
            host = "189.240.60.171",
            port = 9090,
        }
    end)

    -- visit the target URL
    assert(splash:go(args.url))
    assert(splash:wait(0.5))

    -- print the HTML content
    return {
        html = splash:html(),
    }
end
"""

response = requests.post(
    # specify the Splash server API endpoint
    "http://localhost:8050/execute",
    # define the request body
    json={
        "lua_source": lua_script,
        "url": "https://httpbin.io/ip",
        "timeout": 60,
    }
)

# get the response
print(response.json())
The specified proxy returns the following IP addresses (different ports) for two manual requests:
"origin": "189.240.60.168:24900"
"origin": "189.240.60.168:14962"
You now know how to add a proxy directly in Splash using the Lua script. But there's one more scalable way to achieve this. Let's take a look.
Option 3: Use Proxy Profiles
The proxy profiles option works best if you need to share one proxy between several scraping scripts or projects. To use it, you'll have to expose the folder containing your proxy profiles to the Splash API.
First, create a new folder inside your project directory and give it a descriptive name (let's name it "proxy-profile"). Create a profile.ini file inside this folder and configure it with your proxy details, as shown:
[proxy]
host=189.240.60.171
port=9090
type=HTTP
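Proxy profiles can also carry credentials. If your proxy requires authentication, add username and password keys to the same [proxy] section (the values below are placeholders):
[proxy]
host=189.240.60.171
port=9090
type=HTTP
username=<YOUR_PROXY_USERNAME>
password=<YOUR_PROXY_PASSWORD>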
The next step is to mount this local proxy profile folder into the default directory the Splash server reads proxy profiles from:
/etc/splash/proxy-profiles
Stop the running Splash container in Docker. Then, restart it with the following command, replacing <path_to_your_proxy_profile> with the full path to your proxy profile folder:
docker run -p 8050:8050 -v <path_to_your_proxy_profile>:/etc/splash/proxy-profiles scrapinghub/splash
For example, assume you've saved your profile.ini file inside D:/scraper/proxy-profile. Include that path in your Docker run command like this:
docker run -p 8050:8050 -v D:/scraper/proxy-profile:/etc/splash/proxy-profiles scrapinghub/splash
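On Linux or macOS, use a Unix-style path instead (the path below is just an example), and add sudo on Linux if needed:
sudo docker run -p 8050:8050 -v /home/user/scraper/proxy-profile:/etc/splash/proxy-profiles scrapinghub/splash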
You've now started the Splash server with your proxy profile mounted. Awesome!
Now, let's create the Lua script to test this integration. Open your Python file and write Lua code to visit the target website (https://httpbin.io/ip) and get its HTML content:
# import the required libraries
import requests

# develop your Lua script
lua_script = """
function main(splash, args)
    -- visit the target URL
    assert(splash:go(args.url))
    assert(splash:wait(1.0))

    -- print the HTML content
    return {
        html = splash:html(),
    }
end
"""
Send a POST request to the Splash server API and reference the proxy profile name inside the request body. Ensure you use profile without the .ini extension. Then, print the JSON result to show the extracted HTML content:
# ...
response = requests.post(
    # specify the Splash server API endpoint
    "http://localhost:8050/execute",
    # define the request body and specify the proxy profile
    json={
        "lua_source": lua_script,
        "url": "https://httpbin.io/ip",
        "timeout": 60,
        "proxy": "profile",
    }
)

# get the response
print(response.json())
Your full code should look like this after combining the snippets:
# import the required libraries
import requests

# develop your Lua script
lua_script = """
function main(splash, args)
    -- visit the target URL
    assert(splash:go(args.url))
    assert(splash:wait(1.0))

    -- print the HTML content
    return {
        html = splash:html(),
    }
end
"""

response = requests.post(
    # specify the Splash server API endpoint
    "http://localhost:8050/execute",
    # define the request body and specify the proxy profile
    json={
        "lua_source": lua_script,
        "url": "https://httpbin.io/ip",
        "timeout": 60,
        "proxy": "profile",
    }
)

# get the response
print(response.json())
Running the code twice returns the following IP addresses from the proxy:
"origin": "189.240.60.168:4769"
"origin": "189.240.60.168:17136"
That's it! Your Splash web scraper now uses the specified proxy profiles.
In the examples above, you've used free proxies. But for real projects, you'll need a premium web scraping proxy with authentication credentials.
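Premium proxies typically require credentials, and Splash's proxy argument accepts them in the standard user:password@host:port format. As a rough sketch, the SplashRequest args from Option 1 would then point to something like this (username, password, host, and port are placeholders):
# an authenticated premium proxy passed through the Splash "proxy" argument
# (username, password, host, and port are placeholders)
proxy_url = "http://<YOUR_PROXY_USERNAME>:<YOUR_PROXY_PASSWORD>@<PROXY_HOST>:<PROXY_PORT>"
# in the SplashRequest args from Option 1: "proxy": proxy_url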
Get the Best Premium Proxies to Scrape
Free proxies have a low success rate due to frequent downtime. What's more, most anti-bot measures can detect them during web scraping, which means you'll get blocked even after setting up a proxy with Splash.
For example, a protected website like the G2 Reviews page will block your request even if you set up a proxy with Scrapy Splash. Try accessing it by replacing the target URL with G2 in your spider file, as shown below:
# import the required libraries
import scrapy
from scrapy_splash import SplashRequest

class Scraper(scrapy.Spider):
    name = "scraper"

    # point to the target URL
    allowed_domains = ["g2.com"]
    start_urls = ["https://www.g2.com/products/asana/reviews"]

    # add a Lua script to access the website and print its HTML
    lua_script = """
    function main(splash, args)
        assert(splash:go(args.url))
        assert(splash:wait(0.5))
        return {
            html = splash:html()
        }
    end
    """

    def start_requests(self):
        # launch a Splash request and specify the proxy address inside the args
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint="execute",
                args={
                    "wait": 0.5,
                    "lua_source": self.lua_script,
                    "proxy": "http://189.240.60.171:9090"
                },
                cache_args=["lua_source"])

    def parse(self, response):
        # get the HTML result from Lua
        splash_result = response.body
        # log the HTML result to view the current IP address
        self.logger.info("Splash Result: %s", splash_result.decode("utf-8"))
The spider returns a Scrapy 403 error, indicating that an anti-bot has blocked it:
Crawled (403) <GET https://www.g2.com/products/asana/reviews
Try opening the website in a regular browser, and you'll see that it uses Cloudflare Turnstile to prevent bot activity.
You can use premium residential proxies to avoid getting blocked. They can help you avoid basic detection mechanisms such as IP bans, but they're unlikely to bypass advanced anti-bot systems like Akamai, Cloudflare, or DataDome, which use sophisticated detection mechanisms beyond IP checks.
The best solution is to use a web scraping API like ZenRows, which helps you auto-rotate premium proxies, optimize your request headers, and bypass CAPTCHAs and other advanced anti-bot systems. ZenRows also works as a headless browser featuring JavaScript instructions to extract dynamically loaded content.
Let's try to use ZenRows and the Requests library to access the same G2 Reviews page.
Sign up to open the Request Builder. Paste the target URL in the link box, toggle the Boost mode to JS Rendering, and activate Premium proxies. Select Python as your chosen language and set the request type as API. Then, copy and paste the generated code into your scraper file.
The code should look like this in your Python file:
# pip install requests
import requests

params = {
    "url": "https://www.g2.com/products/asana/reviews",
    "apikey": "<YOUR_ZENROWS_API_KEY>",
    "js_render": "true",
    "premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)
The code above bypasses the anti-bot and extracts the page's HTML content:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
<title>Asana Reviews 2024</title>
</head>
<body>
<!-- other content omitted for brevity -->
</body>
</html>
You just scraped a protected website successfully with ZenRows. Congratulations!
Conclusion
In this article, you've learned how to configure your Splash web scraper to use a proxy in three ways:
- Setting up a proxy with Scrapy Splash using the Splash request method.
- Adding a proxy directly to the Lua script in Splash.
- Creating a proxy profile and running the Splash server image with the profile.
Proxies offer a fair level of anti-bot bypass capability. Still, they may prove too weak against advanced detection mechanisms. The only foolproof solution is going for a web scraping API such as ZenRows. This way, you’ll bypass any bot detection system and save yourself the trouble of finding and configuring proxies.