How to Solve Jsoup 403 Forbidden Error

July 1, 2024 · 8 min read

Are you web scraping with Java and jsoup and running into a 403 forbidden error?

Terminal
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403

This common issue can render your web scraper useless. But don't worry; there are a few easy-to-implement ways to deal with it.

In this article, you'll see what causes the error and learn three practical methods to resolve it, so you can carry on with your web scraping project uninterrupted.

Let's go!

Method 1: Set a Custom User Agent

The User Agent is one of the basic parameters checked by websites' protection systems to differentiate between humans and bots. Therefore, requests with missing or bot-like User Agents may result in a 403 forbidden error.

Fortunately, setting up a User Agent in jsoup is easy.

First, ensure you use an updated UA to increase your chances of avoiding the jsoup 403 error. It should contain the latest browser version, like the one below:

Terminal
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36

You can grab a suitable User Agent from the list of top User Agents for web scraping.

Let's set a User Agent with jsoup and send a request to https://httpbin.io/user-agent to check your current User Agent string:

scraper.java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
 
public class Scraper {
    public static void main(String[] args) throws Exception{

        // set up the connection options with jsoup
        String url = "https://httpbin.io/user-agent";
        Document doc = Jsoup
            .connect(url)
            .ignoreContentType(true)
            .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36")
            .get();
       
        String htmlOutput = doc.outerHtml();

        // print the output in the console
        System.out.println(htmlOutput);
    }
}

The code visits the test website and outputs your current User Agent string:

Output
<html>
 <head></head>
 <body>
  { "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36" }
 </body>
</html>

Your jsoup scraper now uses the specified User Agent.
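
If your scraper sends many requests, reusing a single string on every one of them can itself look bot-like. Here is a minimal sketch of picking a User Agent at random from a small pool. The `UserAgentPool` class and `randomUserAgent()` method are hypothetical helpers, not part of jsoup, and the second and third strings are illustrative examples you should replace with current ones. You would pass the returned value to `.userAgent(...)` exactly as in the snippet above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class UserAgentPool {
    // The first entry is the string used earlier in this article;
    // the other two are illustrative assumptions.
    static final List<String> USER_AGENTS = Arrays.asList(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0"
    );

    // return a randomly chosen User Agent from the pool
    static String randomUserAgent() {
        int index = ThreadLocalRandom.current().nextInt(USER_AGENTS.size());
        return USER_AGENTS.get(index);
    }

    public static void main(String[] args) {
        // with jsoup, this value would go into .userAgent(...)
        System.out.println(randomUserAgent());
    }
}
```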

That said, setting the User Agent alone may be insufficient to bypass blocks during web scraping. Another solution is to boost your request with a proxy. You'll learn how to do that in the next section.


Method 2: Implement IP Rotation With Proxies

Some websites limit the frequency of requests a user can send within a particular period. Requesting beyond the acceptable limit can cause the website to block your IP. This IP ban is a common cause of the jsoup 403 error, especially during high-volume scraping.

A proxy routes your request through another IP address, making it look like it comes from another location.

For this exercise, grab a proxy from the Free Proxy List and set it up, as shown in the snippet below. Then, send a request to https://httpbin.io/ip and print the website's HTML to check your current IP address.

scraper.java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
 
public class Scraper{
    public static void main(String[] args) throws Exception{

        // set up the connection options with jsoup
        String url = "https://httpbin.io/ip";
        Document doc = Jsoup
            .connect(url)
            .ignoreContentType(true)
            .proxy("35.185.196.38", 3128)
            .get(); 

        String htmlOutput = doc.outerHtml();
 
        // print the output in the console
        System.out.println(htmlOutput);
    }
}

The above code routes your request through the specified IP address, as shown:

Output
<html>
 <head></head>  
 <body>
  {"origin": "35.185.196.38:22673"}
 </body>        
</html>

Jsoup now sends your request through the provided proxy address.

However, a better solution is to rotate proxies to distribute your request across several locations. This technique increases your chances of avoiding blocks. Let's see how to do that with jsoup.

First, import the following classes and define a list of proxies in your Scraper class:

scraper.java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class Scraper {
    // list of proxy addresses and ports
    private static final List<Proxy> PROXIES = Arrays.asList(
            new Proxy("35.185.196.38", 3128),
            new Proxy("115.97.103.72", 3128),
            new Proxy("185.217.136.67", 1337)
            // add more proxies as needed
    );
}

Next, add a main method that requests the target website. It picks a random proxy for each attempt and retries with another one if the connection fails:

scraper.java
public class Scraper{
    //...

    public static void main(String[] args) throws Exception {
        String url = "https://httpbin.io/ip";
       
        // retry mechanism in case of failure
        boolean success = false;
        while (!success) {
            Proxy proxy = getRandomProxy();
            try {
                // set up the connection options with Jsoup
                Document doc = Jsoup
                        .connect(url)
                        .ignoreContentType(true) 
                        .proxy(proxy.address, proxy.port)
                        .get();
               
                String htmlOutput = doc.outerHtml();

                // print the output in the console
                System.out.println(htmlOutput);
               
                success = true; // exit the loop if successful
            } catch (IOException e) {
                System.err.println("Failed with proxy " + proxy + ": " + e.getMessage());
            }
        }
    }
}

Now, define a method that returns a random proxy from the list:

scraper.java
public class Scraper{

   //...

   // get a random proxy from the list
    private static Proxy getRandomProxy() {
        Random random = new Random();
        return PROXIES.get(random.nextInt(PROXIES.size()));
    }
}

Finally, define a Proxy class to store the proxy information:

scraper.java
public class Scraper{

//...

    // inner class to store proxy details
    static class Proxy {
        String address;
        int port;

        Proxy(String address, int port) {
            this.address = address;
            this.port = port;
        }

        @Override
        public String toString() {
            return address + ":" + port;
        }
    }
}

Put all the snippets together to get the following complete code:

scraper.java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class Scraper {
    // list of proxy addresses and ports
    private static final List<Proxy> PROXIES = Arrays.asList(
            new Proxy("35.185.196.38", 3128),
            new Proxy("115.97.103.72", 3128),
            new Proxy("185.217.136.67", 1337)
            // add more proxies as needed
    );

    public static void main(String[] args) throws Exception {
        String url = "https://httpbin.io/ip";
       
        // retry mechanism in case of failure
        boolean success = false;
        while (!success) {
            Proxy proxy = getRandomProxy();
            try {
                // set up the connection options with Jsoup
                Document doc = Jsoup
                        .connect(url)
                        .ignoreContentType(true)
                        .proxy(proxy.address, proxy.port)
                        .get();
               
                String htmlOutput = doc.outerHtml();

                // print the output in the console
                System.out.println(htmlOutput);
               
                success = true; // exit the loop if successful
            } catch (IOException e) {
                System.err.println("Failed with proxy " + proxy + ": " + e.getMessage());
            }
        }
    }

    // get a random proxy from the list
    private static Proxy getRandomProxy() {
        Random random = new Random();
        return PROXIES.get(random.nextInt(PROXIES.size()));
    }

    // inner class to store proxy details
    static class Proxy {
        String address;
        int port;

        Proxy(String address, int port) {
            this.address = address;
            this.port = port;
        }

        @Override
        public String toString() {
            return address + ":" + port;
        }
    }
}

The above code routes each request through a proxy picked at random from the pool, so the same IP address can show up more than once. Here's a sample result of three consecutive runs:

Output
<!--request 1-->
<html>
 <head></head>  
 <body>
  { "origin": "185.217.136.67:344324"}
 </body>        
</html>

<!--request 2-->
<html>
 <head></head>  
 <body>
  { "origin": "115.97.103.72:55667"}
 </body>        
</html>

<!--request 3-->
<html>
 <head></head>  
 <body>
  { "origin": "185.217.136.67:43454"}
 </body>        
</html>

Your jsoup scraper now rotates proxies by randomizing them from a pool. Great job!
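
Random selection can pick the same proxy several times while others sit idle, which is why requests 1 and 3 in the sample output share an IP address. If you'd rather spread requests evenly, a round-robin picker is one alternative. The following is only a sketch: `RoundRobinProxies` and `next()` are hypothetical names, and the `host:port` strings reuse the example proxies from this section.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class RoundRobinProxies {
    private final List<String> proxies;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinProxies(List<String> proxies) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("proxy list must not be empty");
        }
        this.proxies = proxies;
    }

    // return the next proxy in order, wrapping around at the end of the list
    String next() {
        int index = Math.floorMod(counter.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }

    public static void main(String[] args) {
        RoundRobinProxies pool = new RoundRobinProxies(Arrays.asList(
                "35.185.196.38:3128",
                "115.97.103.72:3128",
                "185.217.136.67:1337"));

        // the fourth call wraps around to the first proxy again
        for (int i = 0; i < 4; i++) {
            System.out.println(pool.next());
        }
    }
}
```

In the scraper, you would split the returned value into host and port (or store `Proxy` objects instead of strings) and pass them to `.proxy(...)`.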

However, free proxies are only good for testing and unsuitable for real-life projects because they're short-lived. The proxies used in this example may not work at the time of reading. Feel free to use new ones from the Free Proxy List.
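
Because free proxies fail so often, note that the `while (!success)` loop in the complete code above never exits if every proxy in the list is dead. One way to avoid that is to cap the number of attempts. The sketch below assumes a hypothetical `BoundedRetry.withRetries` helper and a `maxAttempts` parameter; neither comes from jsoup. In the scraper, the `Callable` would wrap the jsoup request made with a freshly picked proxy, while here a simulated failure stands in for it.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

class BoundedRetry {
    // run the task up to maxAttempts times;
    // rethrow the last IOException if every attempt fails
    static <T> T withRetries(Callable<T> task, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                last = e;
                System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};

        // stand-in for the jsoup request: fails twice, then succeeds
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated proxy failure");
            }
            return "ok";
        }, 5);

        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```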

The best option for large-scale projects is premium web scraping proxies with IP auto-rotation. Premium proxies let you save yourself the hassle of hardcoding the IP rotation or manually switching IP addresses.

Method 3: Use a Web Scraping API

Integrating a web scraping API into your scraper is the best way to handle the complexities of IP rotation, header configuration, CAPTCHA and Web Application Firewall (WAF) bypass, and more.

One of the leading web scraping APIs is ZenRows. It offers headless browser functionalities, premium proxy auto-rotation, updated User Agent implementation, and everything you need to bypass CAPTCHAs and other anti-bot measures at scale.

Let's see how ZenRows works by scraping the G2 Reviews page, a Cloudflare-protected website that blocks jsoup with the 403 forbidden error.

Sign up to open the ZenRows Request Builder. Paste the target URL in the link box, toggle the Boost mode to JS Rendering, and activate Premium Proxies. Select Java as your preferred language and choose the API connection mode. Copy and paste the generated code into your scraper file.

ZenRows Request Builder

The generated code uses the Apache HttpClient 5 Fluent API as the HTTP client. Ensure you add it to your Gradle dependencies:

build.gradle
dependencies {
    implementation "org.apache.httpcomponents.client5:httpclient5-fluent:5.1"
}

If using Maven, include it in your pom.xml file:

pom.xml
<dependency>
    <groupId>org.apache.httpcomponents.client5</groupId>
    <artifactId>httpclient5-fluent</artifactId>
    <version>5.1</version>
</dependency>

The generated code should look like this:

scraper.java
import org.apache.hc.client5.http.fluent.Request;

public class APIRequest {
    public static void main(final String... args) throws Exception {
        String apiUrl = "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fasana%2Freviews&js_render=true&premium_proxy=true";
        String response = Request.get(apiUrl)
                .execute().returnContent().asString();

        System.out.println(response);
    }
}

The code accesses the protected website and scrapes its full-page HTML:

Output
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
</head>
<body>
    <!-- other content omitted for brevity -->
</body>

Congratulations! You've just bypassed a Cloudflare-protected website with ZenRows.
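
Note that the `url` query parameter in the generated code is percent-encoded. If you want to send the same request for a different page, one option is to build the API URL with `URLEncoder` instead of encoding it by hand. This is a sketch: `ZenRowsUrl` and `buildApiUrl` are hypothetical helper names, while the endpoint and the `apikey`, `url`, `js_render`, and `premium_proxy` parameters are copied from the generated code above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class ZenRowsUrl {
    // build the API URL for a target page, percent-encoding the target
    // (URLEncoder.encode(String, Charset) requires Java 10 or later)
    static String buildApiUrl(String apiKey, String targetUrl) {
        String encodedTarget = URLEncoder.encode(targetUrl, StandardCharsets.UTF_8);
        return "https://api.zenrows.com/v1/?apikey=" + apiKey
                + "&url=" + encodedTarget
                + "&js_render=true&premium_proxy=true";
    }

    public static void main(String[] args) {
        String apiUrl = buildApiUrl(
                "<YOUR_ZENROWS_API_KEY>",
                "https://www.g2.com/products/asana/reviews");

        // pass this string to Request.get(...) as in the generated code
        System.out.println(apiUrl);
    }
}
```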

Conclusion

In this article, you've learned what causes the jsoup 403 forbidden error and three proven ways to deal with it.

The common culprits are a missing or misconfigured User Agent and IP bans. You can manually change the User Agent, route your requests through proxies, or combine the two methods. However, these solutions aren't sustainable at scale.

Integrating ZenRows, an all-in-one web scraping solution that works with any programming language, is the best way to bypass all blocks and avoid the need for manual configurations. Try ZenRows now and get your free API key.
