How to Use a Proxy With HtmlUnit in 2024

May 31, 2024 · 8 min read

Are you looking for a solution to avoid blocks when web scraping in Java?

You're in the right place. In this short guide, you'll learn how to avoid detection using an HtmlUnit proxy.

Let's go!

How to Set a Proxy With HtmlUnit

Proxies act as intermediaries between your scraper and the target server. Routing your requests through them masks your real IP address and lets you distribute traffic across multiple IP addresses.

To set your proxy in HtmlUnit, you need to configure the Web Client instance to route requests through the desired proxy server.

There are two ways to achieve this:

  1. Create a ProxyConfig instance with your proxy settings and configure the Web Client to use it.
  2. Provide your proxy settings as parameters when creating the Web Client instance.

The ProxyConfig class helps reduce clutter by centralizing proxy configuration in one object. So, if you're dealing with multiple WebClient instances, consider the first approach for easier maintenance. This tutorial will use the second approach since it's more straightforward and requires less code.
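For reference, here's a minimal sketch of the first approach, assuming HtmlUnit 2.x, where ProxyConfig takes a host and port and is attached through the Web Client's options (the host and port below are example values, the same free proxy used later in this tutorial):

```java
package com.example;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.ProxyConfig;
import com.gargoylesoftware.htmlunit.WebClient;

public class Main {
    public static void main(String[] args) {
        try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // centralize the proxy settings in a ProxyConfig instance
            final ProxyConfig proxyConfig = new ProxyConfig("129.80.134.71", 3128);
            // attach the configuration to the Web Client's options
            webClient.getOptions().setProxyConfig(proxyConfig);
            // ... use webClient as usual
        }
    }
}
```

Because the configuration lives in one object, you can reuse or swap it across several WebClient instances without touching each constructor call.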

Before diving into the step-by-step tutorial, here's a basic HtmlUnit script to which you can add proxy configurations.

program.java

     
package com.example;
 
// import the required classes
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.Page;
 
public class Main {
    public static void main(String[] args) { 
        // create Chrome Web Client instance
        try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            
            // navigate to target web page
            Page page = webClient.getPage("https://httpbin.io/ip");
 
            // extract the content as string
            String pageContent = page.getWebResponse().getContentAsString();
 
            // print the content
            System.out.println(pageContent);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

This code creates a Chrome Web Client instance, navigates to httpbin.io (a test page that returns the client's IP address), and prints the page content.


Step 1: Add a Proxy in HtmlUnit

Start by defining your proxy settings.

program.java
public class Main {
    public static void main(String[] args) {
        // define proxy settings
        String PROXY_HOST = "129.80.134.71";
        int PROXY_PORT = 3128;
    }
}

Create your Web Client instance using the specified settings.

program.java
public class Main {
    public static void main(String[] args) {
        //...
 
        // create Chrome Web Client instance using specified proxy settings.
        try (final WebClient webClient = new WebClient(BrowserVersion.CHROME, PROXY_HOST, PROXY_PORT)) {
           //...
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

That's it. You've configured your first HtmlUnit proxy.

To verify that it works, update the basic script created earlier with this proxy configuration. You should end up with the following complete code:

program.java
package com.example;
 
// import the required classes
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.Page;
 
public class Main {
    public static void main(String[] args) { 
        // define proxy settings
        String PROXY_HOST = "129.80.134.71";
        int PROXY_PORT = 3128;
 
        // create Chrome Web Client instance using specified proxy settings.
        try (final WebClient webClient = new WebClient(BrowserVersion.CHROME, PROXY_HOST, PROXY_PORT))  {
            
            // navigate to target web page
            Page page = webClient.getPage("https://httpbin.io/ip");
 
            // extract the content as string
            String pageContent = page.getWebResponse().getContentAsString();
 
            // print the content
            System.out.println(pageContent);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Run it, and you'll get your proxy's IP address.

Output
{
  "origin": "129.80.134.71:43268"
}

Nice job!

However, you should know that free proxies are only suitable for testing since they're unreliable and prone to expiration. In real-world use cases, you'll need premium web scraping proxies. These proxies often require additional configuration because you need to provide the necessary credentials.

If your proxy server requires credentials, such as a username and password, you can authenticate using the Web Client's DefaultCredentialsProvider. Check out the example below:

program.java
package com.example;
 
import com.gargoylesoftware.htmlunit.DefaultCredentialsProvider;
 
public class Main {
    public static void main(String[] args) {
            //...
 
            // set proxy username and password 
            final DefaultCredentialsProvider credentialsProvider = (DefaultCredentialsProvider) webClient.getCredentialsProvider();
            credentialsProvider.addCredentials("username", "password");
    }
}

Your new complete code would look like this:

program.java
package com.example;
 
// import the required classes
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.DefaultCredentialsProvider;
import com.gargoylesoftware.htmlunit.Page;
 
public class Main {
    public static void main(String[] args) { 
        // define proxy settings
        String PROXY_HOST = "67.43.228.252";
        int PROXY_PORT = 8013;
 
        // create Chrome Web Client instance using specified proxy settings.
        try (final WebClient webClient = new WebClient(BrowserVersion.CHROME, PROXY_HOST, PROXY_PORT))  {
 
            //set proxy username and password 
            final DefaultCredentialsProvider credentialsProvider = (DefaultCredentialsProvider) webClient.getCredentialsProvider();
            credentialsProvider.addCredentials("username", "password");
            
            // navigate to target web page
            Page page = webClient.getPage("https://httpbin.io/ip");
 
            // extract the content as string
            String pageContent = page.getWebResponse().getContentAsString();
 
            // print the content
            System.out.println(pageContent);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

On top of that, you must rotate between proxies to avoid getting blocked by websites employing IP-based blocking or rate-limiting measures. Let's see how to implement a rotating HtmlUnit proxy scraper.

Step 2: Rotate Proxies With HtmlUnit

Rotating proxies helps avoid detection, especially when you make numerous requests to a target server. Websites often flag excessive requests as suspicious activity and block accordingly. Switching between proxies makes you harder to detect since you distribute traffic across multiple IP addresses.

To build an HtmlUnit proxy rotator, you need a proxy list to switch between proxies on each request. Start by importing the required classes and creating a list of the proxy servers you want to rotate through. For this exercise, grab a few from the Free Proxy List.

program.java
package com.example;
 
// import the required classes
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
 
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
 
public class Main {
    public static void main(String[] args) {
        // define your proxy list
        List<String> proxyList = new ArrayList<>();
        proxyList.add("129.80.134.71:3128");
        proxyList.add("185.49.170.20:43626");
        proxyList.add("14.177.236.212:55443");
 
        //...
   
    }
}

Next, select a random proxy from your list and extract the host and port. This will allow you to pass the proxy settings as parameters when creating your Web Client instance.

Generate an index to select a random proxy from the list.

program.java
public class Main {
    public static void main(String[] args) {
        //...
 
        // create a random number generator
        Random random = new Random();
        // generate a random index to select a proxy from the list
        int randomIndex = random.nextInt(proxyList.size());
        // select random proxy
        String randomProxy = proxyList.get(randomIndex);
 
        // extract proxy host and port based on the random index
        String PROXY_HOST = randomProxy.split(":")[0];
        int PROXY_PORT = Integer.parseInt(randomProxy.split(":")[1]);
   
    }
}
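As an aside, the `split(":")` call above assumes each entry contains exactly one colon. A slightly more defensive sketch (standard library only; the `ProxyParser` helper name is made up for illustration) splits on the last colon and fails loudly on malformed entries:

```java
public class ProxyParser {
    // split "host:port" on the last colon so a malformed entry fails loudly
    static String[] parseProxy(String proxy) {
        int sep = proxy.lastIndexOf(':');
        if (sep < 0) {
            throw new IllegalArgumentException("missing port in proxy entry: " + proxy);
        }
        String host = proxy.substring(0, sep);
        // Integer.parseInt throws NumberFormatException on a non-numeric port
        int port = Integer.parseInt(proxy.substring(sep + 1));
        return new String[] { host, String.valueOf(port) };
    }

    public static void main(String[] args) {
        String[] parts = parseProxy("129.80.134.71:3128");
        System.out.println(parts[0] + " -> " + parts[1]); // 129.80.134.71 -> 3128
    }
}
```

This way, a typo in your proxy list surfaces as a clear exception instead of a confusing ArrayIndexOutOfBoundsException at request time.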

All that's left is to pass the random proxy settings as parameters in your Chrome Web Client instance. Then, navigate to the target website and print its text content, just like in the previous examples.

program.java
public class Main {
    public static void main(String[] args) { 
        //...
 
        // create Chrome Web Client instance using specified proxy settings.
        try (final WebClient webClient = new WebClient(BrowserVersion.CHROME, PROXY_HOST, PROXY_PORT))  {
 
            // navigate to target web page
            Page page = webClient.getPage("https://httpbin.io/ip");
 
            // extract the content as string
            String pageContent = page.getWebResponse().getContentAsString();
 
            // print the content
            System.out.println(pageContent);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Combine everything. Your final code should look like this:

program.java
package com.example;
 
// import the required classes
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
 
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
 
public class Main {
    public static void main(String[] args) {
        // define your proxy list
        List<String> proxyList = new ArrayList<>();
        proxyList.add("129.80.134.71:3128");
        proxyList.add("185.49.170.20:43626");
        proxyList.add("14.177.236.212:55443");
 
        // create a random number generator
        Random random = new Random();
        // generate a random index to select a proxy from the list
        int randomIndex = random.nextInt(proxyList.size());
        // select random proxy
        String randomProxy = proxyList.get(randomIndex);
 
        // extract proxy host and port based on the random index
        String PROXY_HOST = randomProxy.split(":")[0];
        int PROXY_PORT = Integer.parseInt(randomProxy.split(":")[1]);
 
        // create Chrome Web Client instance using specified proxy settings.
        try (final WebClient webClient = new WebClient(BrowserVersion.CHROME, PROXY_HOST, PROXY_PORT))  {
 
            // navigate to target web page
            Page page = webClient.getPage("https://httpbin.io/ip");
 
            // extract the content as string
            String pageContent = page.getWebResponse().getContentAsString();
 
            // print the content
            System.out.println(pageContent);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

To see if it works, make multiple requests. Since the proxy is picked at random, the IP address should change across runs. Here are the results for two requests:

Output
{
  "origin": "129.158.41.190:43268"
}
 
{
  "origin": "185.49.170.20:8989"
}

Congratulations on creating your first HtmlUnit proxy rotator!

Choose the Best Premium Proxies to Scrape

Using free proxies with HtmlUnit may seem convenient, but for real-life use cases, you’ll most likely get blocked.

See for yourself. Try to scrape an Amazon product page using the previous HtmlUnit proxy script:

program.java
package com.example;
 
// import the required classes
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.Page;
 
public class Main {
    public static void main(String[] args) { 
        // define proxy settings
        String PROXY_HOST = "129.80.134.71";
        int PROXY_PORT = 3128;
 
        // create Chrome Web Client instance using specified proxy settings.
        try (final WebClient webClient = new WebClient(BrowserVersion.CHROME, PROXY_HOST, PROXY_PORT))  {
            
            // navigate to target web page
            Page page = webClient.getPage("https://www.amazon.com/Lumineux-Teeth-Whitening-Strips-Treatments-Enamel-Safe/dp/B082TPDTM2/");
 
            // extract the content as string
            String pageContent = page.getWebResponse().getContentAsString();
 
            // print the content
            System.out.println(pageContent);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

You'll end up with:

Output
<!DOCTYPE html>
<body>
    <h4>Enter the characters you see below</h4>
    <p class="a-last">
        Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.
    </p>
    <!--
    -->
</body>

This result shows that the free HtmlUnit proxy failed: the request was met with an anti-bot challenge asking you to prove you're not a robot.

One solution to this problem is using premium proxies, but even they aren't foolproof. Many advanced anti-bot systems can still detect and block automated traffic.

Your best bet is opting for a web scraping API like ZenRows. This tool provides a complete toolkit to scrape without getting blocked, including auto-rotating premium proxies, optimized headers, anti-CAPTCHAs, and more.

Like HtmlUnit, ZenRows offers headless browser functionality, but it's much easier to use and scale.

Let's see how ZenRows performs with the same webpage we tried to scrape earlier.

To get started, sign up to ZenRows for free, and you'll be directed to the Request Builder page.

Paste your target URL, select the JavaScript Rendering mode, and check the box for Premium Proxies to rotate proxies automatically. Select Java as the language, and it'll generate your request code on the right:

ZenRows Request Builder Page

Although this code uses the Apache HttpClient, you can use any Java HTTP client. You only need to make your requests to the ZenRows API.

Copy the generated code to your favorite editor. Your new script should look like this:

program.java
import org.apache.hc.client5.http.fluent.Request;
 
public class APIRequest {
    public static void main(final String... args) throws Exception {
        String apiUrl = "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.amazon.com%2FLumineux-Teeth-Whitening-Strips-Treatments-Enamel-Safe%2Fdp%2FB082TPDTM2%2F&js_render=true";
        String response = Request.get(apiUrl)
                .execute().returnContent().asString();
 
        System.out.println(response);
    }
}

Run it, and you'll get the page's HTML content.

Output
<!DOCTYPE html>
<title>
    Amazon.com: Lumineux Teeth Whitening Strips 21 Treatments...
</title>

Bingo! That's how easy it is to scrape with ZenRows.
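As mentioned, any Java HTTP client works with the ZenRows API. Here's a rough equivalent sketch using only the built-in java.net.http client (available since Java 11), with the API key as a placeholder. Note that the target URL is encoded with URLEncoder rather than pasted by hand, which avoids encoding slips:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class APIRequest {
    public static void main(String[] args) throws Exception {
        String apiKey = "YOUR_ZENROWS_API_KEY"; // placeholder: replace with your key
        String targetUrl = "https://www.amazon.com/Lumineux-Teeth-Whitening-Strips-Treatments-Enamel-Safe/dp/B082TPDTM2/";

        // encode the target URL so it survives as a query parameter
        String apiUrl = "https://api.zenrows.com/v1/?apikey=" + apiKey
                + "&url=" + URLEncoder.encode(targetUrl, StandardCharsets.UTF_8)
                + "&js_render=true";

        // send the GET request and print the response body
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(apiUrl)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

This drops the Apache dependency entirely, which can be handy if you want to keep your scraper's build lean.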

Conclusion

Setting an HtmlUnit proxy in Java lets you route your requests through a different IP address. The two common ways to configure one are:

  • Creating a ProxyConfig instance and assigning it to the Web Client.
  • Passing proxy settings as Web Client constructor parameters.

You also learned to rotate between proxies to avoid IP bans and rate limiting.

Still, these solutions aren't foolproof. Even premium proxies are still at risk of getting banned by more advanced anti-bot systems. To make sure you can bypass any block, consider using a web scraping API, such as ZenRows.

Ready to get started?

Up to 1,000 URLs for free are waiting for you