How to Use a Proxy With Splash in 2024

April 30, 2024 · 8 min read

Do you want to mask your request with a proxy to avoid detection and IP bans while scraping with Splash?

This tutorial shows you the three main methods of configuring a proxy in Splash, whether you're using Splash independently or pairing it with Scrapy.

How to Set Your Proxy With Splash

Splash is a JavaScript rendering service with an HTTP API that you can script in Lua. You can submit a Lua script from any programming language via an HTTP request and execute JavaScript from inside that script to extract dynamic content.

Proxy setup in Splash depends on the use case and can be divided into the following categories:

  • For Scrapy integration.
  • For independent use with Lua (two methods).

In this section, you'll learn three ways of setting up a proxy in Splash. In each case, you'll request https://httpbin.io/ip, a website that returns your current IP address.


Prerequisites: If You Don't Have a Running Splash Server

If you don't have a running Splash server, set one up before proceeding to the proxy setup.

Ensure you've installed the latest version of Docker on your machine. Then, pull the Splash image with the following command:

Terminal
docker pull scrapinghub/splash

On Linux, prefix the command with sudo:

Terminal
sudo docker pull scrapinghub/splash

Once the image is pulled, start a container from it and publish port 8050:

Terminal
docker run -it -p 8050:8050 --rm scrapinghub/splash

If you're on Linux:

Terminal
sudo docker run -it -p 8050:8050 --rm scrapinghub/splash

The command above will start the Splash server at `http://localhost:8050`. You're now ready to set up your proxy.
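Before configuring a proxy, it can help to confirm that the server is reachable. Splash exposes a `_ping` health-check endpoint, and the sketch below queries it using only Python's standard library. The function names `ping_url` and `is_splash_up` are illustrative and not part of Splash, and the default address assumes the local setup described above.

```python
import json
import urllib.error
import urllib.request


def ping_url(base_url: str) -> str:
    """Return the health-check URL for a Splash instance."""
    return base_url.rstrip("/") + "/_ping"


def is_splash_up(base_url: str = "http://localhost:8050", timeout: float = 3.0) -> bool:
    """Return True if Splash answers its _ping endpoint with status "ok"."""
    try:
        with urllib.request.urlopen(ping_url(base_url), timeout=timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # connection refused, timeout, or a non-JSON reply all count as "not up"
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"
```

Calling `is_splash_up()` right after starting the container should return True once the server has finished booting.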

Option 1: Set a Splash Request Argument

The Splash request argument is the best option if you're using Scrapy and Splash. It involves adding the proxy address as an argument inside the Splash request instance.

First, ensure you install Scrapy Splash using pip:

Terminal
pip install scrapy-splash

Initialize a Scrapy project if you've not done so already:

Terminal
scrapy startproject scraper

Then, configure your Scrapy project to use the Splash server if you haven't already. To do that, paste the following code into your Scrapy settings file:

settings.py
# set the Splash local server endpoint
SPLASH_URL = "http://localhost:8050"

# enable the Splash downloader middleware and 
# give it a higher priority than HttpCompressionMiddleware
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# enable the Splash deduplication argument filter so
# Scrapy Splash saves disk space on cached requests
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# set the Splash deduplication class
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

Next, import Scrapy Splash into your spider file and point the scraper class to the target URL.

spider.py
# import the required libraries
import scrapy
from scrapy_splash import SplashRequest

class Scraper(scrapy.Spider):
    name = "scraper"

    # point to the target URL
    allowed_domains = ["httpbin.io"]
    start_urls = ["https://httpbin.io/ip"]

Extend that class with a Lua script stored in a multi-line string. The script visits the target website and returns its HTML:

https://gist.github.com/idowupremz/c2a264466e5944e4471e56ad17cd8b55

Now, initiate a request to the target URL with the Splash Request object. Point to the Lua script and include the proxy address in the args dictionary:

spider.py
class Scraper(scrapy.Spider):
    
    # ...

    def start_requests(self):
        
        # launch a Splash request and specify the proxy address inside the args
        for url in self.start_urls:
            yield SplashRequest(
                url, 
                self.parse,
                endpoint="execute",
                args={
                    "wait": 0.5, 
                    "lua_source": self.lua_script, 
                    "proxy": "http://189.240.60.171:9090"
                },
                cache_args=["lua_source"]
            )

Finally, decode the result returned by the Lua script and log it inside the parse method:

spider.py
class Scraper(scrapy.Spider):

    # ... 
    
    def parse(self, response):

        # get the HTML result from Lua
        splash_result = response.body

        # log the HTML result to view the current IP address
        self.logger.info("Splash Result: %s", splash_result.decode("utf-8"))

Here's what your full code looks like after combining the snippets:

spider.py
# import the required libraries
import scrapy
from scrapy_splash import SplashRequest

class Scraper(scrapy.Spider):
    name = "scraper"

    # point to the target URL
    allowed_domains = ["httpbin.io"]
    start_urls = ["https://httpbin.io/ip"]

    # add a Lua script to access the website and print its HTML
    lua_script = """
        function main(splash, args)
            assert(splash:go(args.url))
            assert(splash:wait(0.5))
            return {
                html = splash:html()
            }
        end
    """

    def start_requests(self):

        # launch a Splash request and specify the proxy address inside the args
        for url in self.start_urls:
            yield SplashRequest(
                url, 
                self.parse,
                endpoint="execute",
                args={
                    "wait": 0.5, 
                    "lua_source": self.lua_script, 
                    "proxy": "http://189.240.60.171:9090"
                },
                cache_args=["lua_source"]
            )

    def parse(self, response):

        # get the HTML result from Lua
        splash_result = response.body

        # log the HTML result to view the current IP address
        self.logger.info("Splash Result: %s", splash_result.decode("utf-8"))

Run your spider with the crawl command:

Terminal
scrapy crawl scraper

Running the code twice outputs similar IP addresses (with different ports) from the specified proxy:

Output
{
  "origin": "189.240.60.168:9882"
}

{
  "origin": "189.240.60.168:3718"
}

You've just implemented a proxy using Splash in your Scrapy web scraper. High five!
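To check the result programmatically rather than by reading the log, you could parse the IP out of the Splash response. The sketch below assumes the execute endpoint returns a JSON object with an `html` key (as the Lua script in this tutorial does) and that the rendered page still contains the raw `"origin"` field from httpbin. The helper names `extract_origin` and `origin_ip` are hypothetical, and the port split assumes an IPv4 address.

```python
import html
import json
import re
from typing import Optional

# matches the "origin" field that httpbin.io/ip prints in the page body
_ORIGIN_RE = re.compile(r'"origin"\s*:\s*"([^"]+)"')


def extract_origin(splash_body: bytes) -> Optional[str]:
    """Pull the "origin" value out of a Splash execute response body."""
    payload = json.loads(splash_body.decode("utf-8"))
    # unescape in case the rendered HTML encodes quotes as entities
    page = html.unescape(payload.get("html", ""))
    match = _ORIGIN_RE.search(page)
    return match.group(1) if match else None


def origin_ip(origin: str) -> str:
    """Drop a trailing ":port" from an IPv4 origin string, if present."""
    host, sep, port = origin.rpartition(":")
    return host if sep and port.isdigit() else origin
```

Inside the spider's parse method, `extract_origin(response.body)` would give you the address the target site saw, which you can compare against your machine's real IP to confirm the proxy is in effect.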

Let's go through the other proxy setup options.

Option 2: Set a Proxy Using a Lua Script With Splash

Setting up a proxy inside the Lua script is a good method if you're using Splash independently and want to integrate its API into other programming languages like Python and JavaScript.

Splash operates a dedicated server that can execute the Lua script for advanced proxy configuration and JavaScript rendering tasks.

For this tutorial, you'll run the Splash server locally on your machine and use it to execute Lua with Python's Requests library.

First, ensure you install the Requests library if you've not done so already:

Terminal
pip install requests

Next, import the Requests library and write your Lua script inside a multi-line string. The script contains a function that starts with the proxy address setup. It visits the target URL (https://httpbin.io/ip) and returns its HTML content:

scraper.py
# import the required library
import requests

# develop your Lua script
lua_script = """
    function main(splash, args)
        
        -- set up proxy
        splash:on_request(function(request)
        
            request:set_proxy{
                type = "HTTP",
                host = "189.240.60.171",
                port = 9090,
            }
        end)
        
        -- visit the target URL
        assert(splash:go(args.url))
        assert(splash:wait(0.5))
        
        -- print the HTML content
        return {
            html = splash:html(),
        }
        
    end
"""

Send a POST request to the Splash execute endpoint, specifying the target URL and the Lua script variable in the request body. Then, print the JSON response to view the website's HTML:

scraper.py
# ...

response = requests.post(
    # specify the Splash server API endpoint
    "http://localhost:8050/execute", 

    # define the request body
    json={
        "lua_source": lua_script,
        "url": "https://httpbin.io/ip",
        "timeout": 60,
    }
)

# get the response
print(response.json())

Combine both snippets. Your final code should look like this:

scraper.py
# import the required library
import requests

# develop your Lua script
lua_script = """
    function main(splash, args)
        
        -- set up proxy
        splash:on_request(function(request)
        
            request:set_proxy{
                type = "HTTP",
                host = "189.240.60.171",
                port = 9090,
            }
        end)
        
        -- visit the target URL
        assert(splash:go(args.url))
        assert(splash:wait(0.5))
        
        -- print the HTML content
        return {
            html = splash:html(),
        }
        
    end
"""

response = requests.post(
    # specify the Splash server API endpoint
    "http://localhost:8050/execute", 

    # define the request body
    json={
        "lua_source": lua_script,
        "url": "https://httpbin.io/ip",
        "timeout": 60,
    }
)

# get the response
print(response.json())

The specified proxy returns the following IP addresses (different ports) for two manual requests:

Output
"origin": "189.240.60.168:24900"

"origin": "189.240.60.168:14962"

You now know how to add a proxy directly in Splash using the Lua script. But there's one more scalable way to achieve this. Let's take a look.
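Before moving on, one variation on this option is worth sketching. Hardcoding the proxy in Lua means editing the script whenever the proxy changes. Since Splash makes extra fields from the JSON request body available to the script through args, the proxy details can travel with the request instead. The argument names `proxy_host`, `proxy_port`, and `proxy_type` and the helper `build_execute_payload` are made up for this example.

```python
# Lua script that reads the proxy details from the request arguments
LUA_SCRIPT = """
function main(splash, args)
    -- route every outgoing request through the proxy passed in args
    splash:on_request(function(request)
        request:set_proxy{
            host = args.proxy_host,
            port = args.proxy_port,
            type = args.proxy_type,
        }
    end)

    -- visit the target URL
    assert(splash:go(args.url))
    assert(splash:wait(0.5))

    return {
        html = splash:html(),
    }
end
"""


def build_execute_payload(url, proxy_host, proxy_port, proxy_type="HTTP", timeout=60):
    """Build the JSON body for a POST to the Splash execute endpoint."""
    return {
        "lua_source": LUA_SCRIPT,
        "url": url,
        "timeout": timeout,
        # custom fields (hypothetical names) that the Lua script reads from args
        "proxy_host": proxy_host,
        "proxy_port": int(proxy_port),
        "proxy_type": proxy_type,
    }
```

You would then send it the same way as before, for example `requests.post("http://localhost:8050/execute", json=build_execute_payload("https://httpbin.io/ip", "189.240.60.171", 9090))`.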

Option 3: Use Proxy Profiles

The proxy profiles option works best if you need to share one proxy between several scraping scripts or projects. To use it, you'll have to expose the folder containing your proxy profiles to the Splash API.

First, create a new folder inside your project directory and give it a descriptive name (let’s name it "proxy-profile"). Make a profile.ini file inside this folder and configure it with your proxy details, as shown:

profile.ini
[proxy]

host=189.240.60.171
port=9090
type=HTTP
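If you manage several profiles, you may prefer to generate these files from code. The following is a minimal sketch using Python's configparser. The helper names `render_proxy_profile` and `write_proxy_profile` are hypothetical, and the optional username and password keys are only written when you supply them.

```python
import configparser
import io
from pathlib import Path


def render_proxy_profile(host, port, proxy_type="HTTP", username=None, password=None):
    """Return the text of a Splash proxy profile with a single [proxy] section."""
    config = configparser.ConfigParser(interpolation=None)
    config["proxy"] = {"host": host, "port": str(port), "type": proxy_type}
    if username is not None:
        config["proxy"]["username"] = username
        config["proxy"]["password"] = password or ""
    buffer = io.StringIO()
    # write key=value without spaces, matching the hand-written file
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


def write_proxy_profile(folder, name, host, port, **kwargs):
    """Write <folder>/<name>.ini and return the profile name (without .ini)."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / f"{name}.ini"
    path.write_text(render_proxy_profile(host, port, **kwargs), encoding="utf-8")
    return path.stem
```

For instance, `write_proxy_profile("proxy-profile", "profile", "189.240.60.171", 9090)` would recreate the file shown above and return "profile", the name you later pass to Splash.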

The next step is to connect this local proxy profile folder with the default directory recognized by the Splash server. The Splash server reads the proxy detail from the following default directory:

Example
/etc/splash/proxy-profiles

Stop the running Splash container. Then, restart it with the following command, replacing <path_to_your_proxy_profile> with the full path to your proxy profile folder:

Terminal
docker run -p 8050:8050 -v <path_to_your_proxy_profile>:/etc/splash/proxy-profiles scrapinghub/splash

For example, assume you've saved your profile.ini file inside D:/scraper/proxy-profile. Include that path in your docker run command like this:

Terminal
docker run -p 8050:8050 -v D:/scraper/proxy-profile:/etc/splash/proxy-profiles scrapinghub/splash
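The volume mapping needs an absolute host path, which is easy to get wrong when typing it by hand. As a rough sketch, you could assemble the same command in Python from a relative folder name. The helper `splash_run_command` is an illustrative name, and the snippet only builds the argument list without starting Docker.

```python
from pathlib import Path

# directory inside the container where Splash looks for proxy profiles
SPLASH_PROFILES_DIR = "/etc/splash/proxy-profiles"


def splash_run_command(profile_dir, port=8050, image="scrapinghub/splash"):
    """Build the docker run arguments that mount a local proxy profile folder."""
    # resolve to an absolute path and use forward slashes, as in the example above
    host_dir = Path(profile_dir).resolve().as_posix()
    return [
        "docker", "run",
        "-p", f"{port}:8050",
        "-v", f"{host_dir}:{SPLASH_PROFILES_DIR}",
        image,
    ]
```

Passing the returned list to `subprocess.run(..., check=True)` would launch the container in the same way as the terminal command.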

You've now started the Splash container with your proxy profiles. Awesome!

Now, let's create the Lua script to test this integration. Open your Python file and write Lua code that visits the target website (https://httpbin.io/ip) and returns the HTML content:

scraper.py
# import the required libraries
import requests

# develop your Lua script
lua_script = """
    function main(splash, args)
        
        -- visit the target URL
        assert(splash:go(args.url))
        assert(splash:wait(1.0))
        -- print the HTML content
        return {
            html = splash:html(),
        }  
    end
"""

Send a POST request to the Splash server API and point to the file containing your proxy profile inside the request body. Ensure you use profile without the .ini extension. Then, print the JSON result to show the extracted HTML content:

scraper.py
# ...

response = requests.post(
    # specify the Splash server API endpoint
    "http://localhost:8050/execute", 
    
    # define the request body and specify the proxy profile
    json={
        "lua_source": lua_script,
        "url": "https://httpbin.io/ip",
        "timeout": 60,
        "proxy":"profile"
    }
)

# get the response
print(response.json())

Your full code should look like this after combining the snippets:

scraper.py
# import the required libraries
import requests

# develop your Lua script
lua_script = """
    function main(splash, args)
        
        -- visit the target URL
        assert(splash:go(args.url))
        assert(splash:wait(1.0))
        -- print the HTML content
        return {
            html = splash:html(),
        }  
    end
"""

response = requests.post(
    # specify the Splash server API endpoint
    "http://localhost:8050/execute", 
    
    # define the request body and specify the proxy profile
    json={
        "lua_source": lua_script,
        "url": "https://httpbin.io/ip",
        "timeout": 60,
        "proxy":"profile"
    }
)

# get the response
print(response.json())

Running the code twice returns the following IP addresses from the proxy:

Output
"origin": "189.240.60.168:4769"

"origin": "189.240.60.168:17136"

That's it! Your Splash web scraper now uses the specified proxy profiles.

In the examples above, you've used free proxies. But for real projects, you'll need a premium web scraping proxy with authentication credentials.
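With credentials, the proxy value passed to Splash takes the form protocol://user:password@host:port. The sketch below builds such a URL. The helper name `proxy_url` and the example host are placeholders, and because this tutorial doesn't cover how Splash treats percent-encoded credentials, the function simply rejects usernames and passwords containing characters that would make the URL ambiguous.

```python
from typing import Optional

# characters that would make the user:password@host part ambiguous
_RESERVED = set(":@/")


def proxy_url(host: str, port: int, username: Optional[str] = None,
              password: Optional[str] = None, scheme: str = "http") -> str:
    """Build a proxy URL, embedding credentials when a username is given."""
    if username is None:
        return f"{scheme}://{host}:{port}"
    password = password or ""
    if _RESERVED & set(username) or _RESERVED & set(password):
        raise ValueError("credentials must not contain ':', '@' or '/'")
    return f"{scheme}://{username}:{password}@{host}:{port}"
```

The resulting string can replace the plain address in the "proxy" argument from Option 1. For the Lua approach, request:set_proxy also accepts username and password fields, and a proxy profile can carry username and password keys in its [proxy] section.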

Get the Best Premium Proxies to Scrape

Free proxies have a low success rate due to frequent downtime. Besides, most anti-bot measures can detect them during web scraping, which means you can still get blocked even after setting up a proxy with Splash.

For example, a protected website like the G2 Reviews page will block your request even if you set up a proxy with Scrapy Splash. Try accessing it by replacing the target URL with G2 in your spider file, as shown below:

spider.py
# import the required libraries
import scrapy
from scrapy_splash import SplashRequest

class Scraper(scrapy.Spider):
    name = "scraper"

    # point to the target URL
    allowed_domains = ["g2.com"]
    start_urls = ["https://www.g2.com/products/asana/reviews"]

    # add a Lua script to access the website and print its HTML
    lua_script = """
        function main(splash, args)
            assert(splash:go(args.url))
            assert(splash:wait(0.5))
            return {
                html = splash:html()
            }
        end
    """

    def start_requests(self):

        # launch a Splash request and specify the proxy address inside the args
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint="execute",
                args={
                    "wait": 0.5, 
                    "lua_source": self.lua_script, 
                    "proxy": "http://189.240.60.171:9090"
                },
                cache_args=["lua_source"])

    def parse(self, response):

        # get the HTML result from Lua
        splash_result = response.body

        # log the HTML result to view the current IP address
        self.logger.info("Splash Result: %s", splash_result.decode("utf-8"))

The spider returns a Scrapy 403 error, indicating that an anti-bot has blocked it:

Output
 Crawled (403) <GET https://www.g2.com/products/asana/reviews

Try opening the website via a regular browser, and you'll see that it uses Cloudflare Turnstile to prevent bot activities:

Cloudflare Turnstile Block

You can use premium residential proxies to avoid getting blocked. They can help you avoid basic detection mechanisms such as IP bans, but they're unlikely to bypass advanced anti-bot systems like Akamai, Cloudflare, or DataDome, which use sophisticated detection mechanisms beyond IP checks.

The best solution is to use a web scraping API like ZenRows, which helps you auto-rotate premium proxies, optimize your request headers, and bypass CAPTCHAs and other advanced anti-bot systems. ZenRows also works as a headless browser featuring JavaScript instructions to extract dynamically loaded content.

Let's try to use ZenRows and the Requests library to access the same G2 Reviews page.

Sign up to open the Request Builder. Paste the target URL in the link box, toggle the Boost mode to JS Rendering, and activate Premium proxies. Select Python as your chosen language and set the request type as API. Then, copy and paste the generated code into your scraper file.

ZenRows Request Builder Page

The code should look like this in your Python file:

scraper.py
# pip install requests
import requests

params = {
	"url": "https://www.g2.com/products/asana/reviews",
	"apikey": "<YOUR_ZENROWS_API_KEY>",
	"js_render": "true",
	"premium_proxy": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)

The code above bypasses the anti-bot and extracts the page's HTML content:

Output
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Asana Reviews 2024</title>
</head>
<body>
    <!-- other content omitted for brevity -->
</body>
</html>

You just scraped a protected website successfully with ZenRows. Congratulations!

Conclusion

In this article, you've learned how to configure your Splash web scraper to use a proxy in three ways:

  • Setting up a proxy with Scrapy Splash using the Splash request method.
  • Adding a proxy directly to the Lua script in Splash.
  • Creating a proxy profile and running the Splash server image with the profile.

Proxies offer a fair level of anti-bot bypass capability. Still, they may prove too weak against advanced detection mechanisms. The only foolproof solution is going for a web scraping API such as ZenRows. This way, you’ll bypass any bot detection system and save yourself the trouble of finding and configuring proxies.
