Are you looking for a solution to avoid blocks when web scraping in Java?
You're in the right place. In this short guide, you'll learn how to avoid detection using an HtmlUnit proxy.
Let's go!
How to Set a Proxy With HtmlUnit
Proxies act as intermediaries between your scraper and the target server. Routing your requests through them hides your real IP address and lets you distribute traffic across multiple IP addresses.
To set your proxy in HtmlUnit, you need to configure the Web Client instance to route requests through the desired proxy server.
There are two ways to achieve this:
- Create a ProxyConfig instance with your proxy settings and configure the Web Client to use it.
- Provide your proxy settings as parameters when creating the Web Client instance.

The ProxyConfig class helps reduce clutter in the WebClient class by centralizing proxy configurations. So, if you're dealing with multiple WebClient instances, consider the first approach for easier maintenance; a short sketch of it follows below. This tutorial will use the second approach since it's more straightforward and requires less code.
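For reference, here's a minimal sketch of the first approach. It assumes HtmlUnit 2.x (the com.gargoylesoftware packages used throughout this tutorial), where ProxyConfig offers a two-argument ProxyConfig(host, port) constructor and the Web Client's options accept it via setProxyConfig():
package com.example;
// import the required classes
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.ProxyConfig;
import com.gargoylesoftware.htmlunit.WebClient;
public class Main {
    public static void main(String[] args) {
        try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // create a ProxyConfig instance with your proxy settings
            ProxyConfig proxyConfig = new ProxyConfig("129.80.134.71", 3128);
            // configure the Web Client to route requests through the proxy
            webClient.getOptions().setProxyConfig(proxyConfig);
            // ... make your requests as usual
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The rest of the tutorial sticks to the constructor-parameter approach.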
Before diving into the step-by-step tutorial, here's a basic HtmlUnit script to which you can add proxy configurations.
package com.example;
// import the required classes
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.Page;
public class Main {
public static void main(String[] args) {
// create Chrome Web Client instance
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
// navigate to target web page
Page page = webClient.getPage("https://httpbin.io/ip");
// extract the content as string
String pageContent = page.getWebResponse().getContentAsString();
// print the content
System.out.println(pageContent);
} catch (Exception e) {
e.printStackTrace();
}
}
}
This code creates a Chrome Web Client instance, navigates to HTTPbin (a target page that returns the client's IP address), and prints the page content.
Using the HtmlPage class returns the following error when dealing with non-HTML responses:
class com.gargoylesoftware.htmlunit.UnexpectedPage cannot be cast to class com.gargoylesoftware.htmlunit.html.HtmlPage
The sample code above uses the Page class, which allows you to handle any type of web content, including JSON.
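If your target does return an HTML document, you can work with HtmlPage and its DOM helpers instead. Here's a minimal sketch; example.com is just a placeholder target:
package com.example;
// import the required classes
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class HtmlExample {
    public static void main(String[] args) {
        try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // HtmlPage works here because the response is an HTML document
            HtmlPage htmlPage = webClient.getPage("https://example.com");
            // use the DOM helpers that HtmlPage exposes, e.g. the page title
            System.out.println(htmlPage.getTitleText());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}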
Step 1: Add a Proxy in HtmlUnit
This tutorial uses a free proxy from the Free Proxy List. It may no longer work by the time you read this, so feel free to switch to a fresh one. Use HTTPS proxies since they work for both HTTP and HTTPS websites.
Start by defining your proxy settings.
public class Main {
public static void main(String[] args) {
// define proxy settings
String PROXY_HOST = "129.80.134.71";
int PROXY_PORT = 3128;
}
}
Create your Web Client instance using the specified settings.
public class Main {
public static void main(String[] args) {
//...
// create Chrome Web Client instance using specified proxy settings.
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME, PROXY_HOST, PROXY_PORT)) {
//...
} catch (Exception e) {
e.printStackTrace();
}
}
}
That's it. You've configured your first HtmlUnit proxy.
To verify that it works, update the basic script you created earlier with this proxy configuration. You should end up with the following complete code:
package com.example;
// import the required classes
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.Page;
public class Main {
public static void main(String[] args) {
// define proxy settings
String PROXY_HOST = "129.80.134.71";
int PROXY_PORT = 3128;
// create Chrome Web Client instance using specified proxy settings.
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME, PROXY_HOST, PROXY_PORT)) {
// navigate to target web page
Page page = webClient.getPage("https://httpbin.io/ip");
// extract the content as string
String pageContent = page.getWebResponse().getContentAsString();
// print the content
System.out.println(pageContent);
} catch (Exception e) {
e.printStackTrace();
}
}
}
Run it, and you'll get your proxy's IP address.
{
"origin": "129.80.134.71:43268"
}
Nice job!
However, you should know that free proxies are only suitable for testing since they're unreliable and prone to expiration. In real-world use cases, you'll need premium web scraping proxies. These proxies often require additional configuration because you need to provide the necessary credentials.
If your proxy server requires credentials, such as a username and password, you can authenticate using the DefaultCredentialsProvider from the Web Client. Check out the example below:
package com.example;
import com.gargoylesoftware.htmlunit.DefaultCredentialsProvider;
public class Main {
public static void main(String[] args) {
//...
// set proxy username and password
final DefaultCredentialsProvider credentialsProvider = (DefaultCredentialsProvider) webClient.getCredentialsProvider();
credentialsProvider.addCredentials("<YOUR_USERNAME>", "<YOUR_PASSWORD>");
}
}
Your new complete code would look like this:
package com.example;
// import the required classes
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.DefaultCredentialsProvider;
import com.gargoylesoftware.htmlunit.Page;
public class Main {
public static void main(String[] args) {
// define proxy settings
String PROXY_HOST = "67.43.228.252";
int PROXY_PORT = 8013;
// create Chrome Web Client instance using specified proxy settings.
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME, PROXY_HOST, PROXY_PORT)) {
//set proxy username and password
final DefaultCredentialsProvider credentialsProvider = (DefaultCredentialsProvider) webClient.getCredentialsProvider();
credentialsProvider.addCredentials("<YOUR_USERNAME>", "<YOUR_PASSWORD>");
// navigate to target web page
Page page = webClient.getPage("https://httpbin.io/ip");
// extract the content as string
String pageContent = page.getWebResponse().getContentAsString();
// print the content
System.out.println(pageContent);
} catch (Exception e) {
e.printStackTrace();
}
}
}
On top of that, you must rotate between proxies to avoid getting blocked by websites employing IP-based blocking or rate-limiting measures. Let's see how to implement a rotating HtmlUnit proxy scraper.
Step 2: Rotate Proxies With HtmlUnit
Rotating proxies helps avoid detection, especially when you make numerous requests to a target server. Websites often flag excessive requests as suspicious activity and block accordingly. Switching between proxies makes you harder to detect since you distribute traffic across multiple IP addresses.
To build an HtmlUnit proxy rotator, you need a proxy list so you can switch between proxies for each request. Start by importing the required classes and creating a list of proxy servers you want to rotate through. For this exercise, grab a few from the Free Proxy List.
package com.example;
// import the required classes
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
public class Main {
public static void main(String[] args) {
// define your proxy list
List<String> proxyList = new ArrayList<>();
proxyList.add("129.80.134.71:3128");
proxyList.add("185.49.170.20:43626");
proxyList.add("14.177.236.212:55443");
//...
}
}
Next, select a random proxy from your list and extract the host and port. This will allow you to pass the proxy settings as parameters when creating your Web Client instance.
Generate an index to select a random proxy from the list.
public class Main {
public static void main(String[] args) {
//...
// create a random number generator
Random random = new Random();
// generate a random index to select a proxy from the list
int randomIndex = random.nextInt(proxyList.size());
// select random proxy
String randomProxy = proxyList.get(randomIndex);
// extract proxy host and port based on the random index
String PROXY_HOST = randomProxy.split(":")[0];
int PROXY_PORT = Integer.parseInt(randomProxy.split(":")[1]);
}
}
All that's left is to pass the random proxy settings as parameters in your Chrome Web Client instance. Then, navigate to the target website and print its text content, just like in the previous examples.
public class Main {
public static void main(String[] args) {
//...
// create Chrome Web Client instance using specified proxy settings.
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME, PROXY_HOST, PROXY_PORT)) {
// navigate to target web page
Page page = webClient.getPage("https://httpbin.io/ip");
// extract the content as string
String pageContent = page.getWebResponse().getContentAsString();
// print the content
System.out.println(pageContent);
} catch (Exception e) {
e.printStackTrace();
}
}
}
Combine everything. Your final code should look like this:
package com.example;
// import the required classes
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
public class Main {
public static void main(String[] args) {
// define your proxy list
List<String> proxyList = new ArrayList<>();
proxyList.add("129.80.134.71:3128");
proxyList.add("185.49.170.20:43626");
proxyList.add("14.177.236.212:55443");
// create a random number generator
Random random = new Random();
// generate a random index to select a proxy from the list
int randomIndex = random.nextInt(proxyList.size());
// select random proxy
String randomProxy = proxyList.get(randomIndex);
// extract proxy host and port based on the random index
String PROXY_HOST = randomProxy.split(":")[0];
int PROXY_PORT = Integer.parseInt(randomProxy.split(":")[1]);
// create Chrome Web Client instance using specified proxy settings.
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME, PROXY_HOST, PROXY_PORT)) {
// navigate to target web page
Page page = webClient.getPage("https://httpbin.io/ip");
// extract the content as string
String pageContent = page.getWebResponse().getContentAsString();
// print the content
System.out.println(pageContent);
} catch (Exception e) {
e.printStackTrace();
}
}
}
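Note that this script picks a single random proxy per run. If you want to switch proxies on every request within the same run, you can wrap the request logic in a loop. Here's a minimal sketch that reuses the proxyList and random generator defined above:
public class Main {
    public static void main(String[] args) {
        //...
        // make three requests, each through a freshly selected proxy
        for (int i = 0; i < 3; i++) {
            // pick a random proxy and split it into host and port
            String proxy = proxyList.get(random.nextInt(proxyList.size()));
            String host = proxy.split(":")[0];
            int port = Integer.parseInt(proxy.split(":")[1]);
            // create a new Web Client per request so it uses the new proxy
            try (final WebClient webClient = new WebClient(BrowserVersion.CHROME, host, port)) {
                Page page = webClient.getPage("https://httpbin.io/ip");
                System.out.println(page.getWebResponse().getContentAsString());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}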
To see if it works, make multiple requests. You should get a different IP address each time. Here are the results for two requests:
{
"origin": "129.158.41.190:43268"
}
{
"origin": "185.49.170.20:8989"
}
Congratulations on creating your first HtmlUnit proxy rotator!
Choose the Best Premium Proxies to Scrape
Using free proxies with HtmlUnit may seem convenient, but for real-life use cases, you’ll most likely get blocked.
See for yourself. Try to scrape an Amazon product page using the previous HtmlUnit proxy script:
package com.example;
// import the required classes
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.Page;
public class Main {
public static void main(String[] args) {
// define proxy settings
String PROXY_HOST = "129.80.134.71";
int PROXY_PORT = 3128;
// create Chrome Web Client instance using specified proxy settings.
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME, PROXY_HOST, PROXY_PORT)) {
// navigate to target web page
Page page = webClient.getPage("https://www.amazon.com/Lumineux-Teeth-Whitening-Strips-Treatments-Enamel-Safe/dp/B082TPDTM2/");
// extract the content as string
String pageContent = page.getWebResponse().getContentAsString();
// print the content
System.out.println(pageContent);
} catch (Exception e) {
e.printStackTrace();
}
}
}
You'll end up with:
<!DOCTYPE html>
<body>
<h4>Enter the characters you see below</h4>
<p class="a-last">
Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.
</p>
<!--
-->
</body>
This result shows that the HtmlUnit proxy failed. It was met with an anti-bot challenge asking you to prove you're not a robot.
One solution to this problem is using premium proxies, but they aren't foolproof: many advanced anti-bot systems can still detect your scraper's other automation properties.
Your best bet is opting for a web scraping API like ZenRows. It provides a complete toolkit to scrape without getting blocked, including auto-rotating premium proxies, optimized headers, anti-CAPTCHA, and more.
Like HtmlUnit, ZenRows offers headless browser functionality, but it's much easier to use and scale.
Let's see how ZenRows performs with the same webpage we tried to scrape earlier.
To get started, sign up to ZenRows for free, and you'll be directed to the Request Builder page.
Paste your target URL, select the JavaScript Rendering mode, and check the box for Premium Proxies to rotate proxies automatically. Select Java as the language, and it'll generate your request code on the right:
Although this code uses the Apache HttpClient, you can use any Java HTTP client. You only need to make your requests to the ZenRows API.
Copy the generated code to your favorite editor. Your new script should look like this:
import org.apache.hc.client5.http.fluent.Request;
public class APIRequest {
public static void main(final String... args) throws Exception {
String apiUrl = "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.amazon.com%2FLumineux-Teeth-Whitening-Strips-Treatments-Enamel-Safe%2Fdp%2FB082TPDTM2%2F&js_render=true&premium_proxy=true";
String response = Request.get(apiUrl)
.execute().returnContent().asString();
System.out.println(response);
}
}
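Alternatively, if you'd rather not add the Apache dependency, here's a minimal sketch of the same request using Java's built-in java.net.http client (available since Java 11). It targets the same ZenRows endpoint with the same parameters:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
public class APIRequest {
    public static void main(String[] args) throws Exception {
        // same ZenRows endpoint as above; replace <YOUR_ZENROWS_API_KEY> with your key
        String apiUrl = "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.amazon.com%2FLumineux-Teeth-Whitening-Strips-Treatments-Enamel-Safe%2Fdp%2FB082TPDTM2%2F&js_render=true&premium_proxy=true";
        // build a simple GET request
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(apiUrl)).GET().build();
        // send the request and print the response body
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
Either version sends an identical request, so the result below applies to both.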
Run it, and you'll get the page's HTML content.
<!DOCTYPE html>
<title>
Amazon.com: Lumineux Teeth Whitening Strips 21 Treatments...
</title>
Bingo! That's how easy it is to scrape with ZenRows.
Conclusion
Setting an HtmlUnit proxy in Java lets you route your requests through a different IP address. In this tutorial, you learned how to:
- Pass proxy settings as Web Client parameters.
- Rotate between proxies to avoid IP bans and rate limiting.
Still, these solutions aren't foolproof. Even premium proxies are still at risk of getting banned by more advanced anti-bot systems. To make sure you can bypass any block, consider using a web scraping API, such as ZenRows.