Puppeteer in Java for Web Scraping

Idowu Omisola
April 22, 2025 · 6 min read

Puppeteer is a Node.js library that lets you control headless Chrome or Chromium through the DevTools Protocol. Although it's built for Node.js, you can still use it in Java with the help of a working Puppeteer Java wrapper.

In this tutorial, you'll learn exactly how to scrape data using Puppeteer and Java.

How to Use Puppeteer in Java for Web Scraping

Follow the steps below to learn how to use Puppeteer in Java.

Step 1: Prerequisites

Before we dive into the code, make sure you have a recent Java Development Kit (JDK) installed and a build tool such as Maven set up.

To start with, create a Java project and include the Jvppeteer dependency. If you're using Maven, add this XML snippet to the <dependencies> section of your pom.xml file.

pom.xml
<dependency>
  <groupId>io.github.fanyong920</groupId>
  <artifactId>jvppeteer</artifactId>
  <version>3.3.2</version>
</dependency>
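
If you're using Gradle with the Kotlin DSL instead (this tutorial assumes Maven throughout), the equivalent declaration would be:

build.gradle.kts
dependencies {
    implementation("io.github.fanyong920:jvppeteer:3.3.2")
}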

Step 2: Building a Basic Web Scraper in Puppeteer with Java

There is currently no official Java support for Puppeteer. However, third-party community-driven wrappers, such as Jvppeteer, allow you to use Puppeteer in Java.

Jvppeteer is an open-source Puppeteer port that provides a Java interface for controlling Chrome and Firefox via the DevTools Protocol and WebDriver BiDi.

This means you can access Puppeteer functionalities, including automating browser interactions, rendering JavaScript, and simulating human browsing in your Java web scraper.

Note that Jvppeteer doesn't implement every Puppeteer feature, and unsupported functions throw an UnsupportedOperationException.

However, it's still a good option for simple web scraping operations.
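
If you're unsure whether a particular call is implemented, you can guard it defensively. Here's a minimal sketch, where someUnimplementedCall() is a hypothetical placeholder for whichever Jvppeteer method you're probing:

try {
    // hypothetical placeholder; replace with the Jvppeteer call you want to test
    someUnimplementedCall();
} catch (UnsupportedOperationException e) {
    // skip or fall back instead of crashing the scraper
    System.err.println("Jvppeteer doesn't support this yet: " + e.getMessage());
}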

For this tutorial, we'll use the ScrapingCourse E-commerce test site as the target page.

ScrapingCourse.com Ecommerce homepage

Here's a simple Puppeteer Java scraper using Jvppeteer.

It starts a Chrome browser, opens the target webpage, and downloads the HTML content.

Main.java
package com.example;

// import the required classes
import com.ruiyun.jvppeteer.api.core.Browser;
import com.ruiyun.jvppeteer.api.core.Page;
import com.ruiyun.jvppeteer.cdp.core.Puppeteer;
import com.ruiyun.jvppeteer.cdp.entities.LaunchOptions;

public class Main {
    public static void main(String[] args) {
        System.out.println("Launching browser...");

        // initialize launch options
        LaunchOptions launchOptions = LaunchOptions.builder()
            // uncomment the next line to run in GUI mode
            //.headless(false)
            .build();

        try (Browser cdpBrowser = Puppeteer.launch(launchOptions)) {
            // open a new page
            Page page = cdpBrowser.newPage();
           
            // navigate to the target URL
            page.goTo("https://www.scrapingcourse.com/ecommerce/");
           
            // retrieve the page's HTML content
            String pageContent = page.content();
            System.out.println(pageContent);

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Here's the result:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>
        Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com
    </title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <div class="beta site-title">
        <a href="https://www.scrapingcourse.com/ecommerce/" rel="home">
            Ecommerce Test Site to Learn Web Scraping
        </a>
    </div>
    <!-- other content omitted for brevity -->
</body>
</html>

Remember, websites can easily flag and block CDP-based tools due to their automation properties.

If you're getting blocked, consider the ZenRows Scraping Browser to bypass the website's bot detection. We'll explore this solution in more detail in a subsequent section.

Step 3: Parse Data from the Page

Parsing data from the downloaded HTML involves instructing Puppeteer to go through the DOM, locate elements, and extract their text content.

For this purpose, Puppeteer offers two main ways to locate elements in the HTML document: XPath expressions and CSS selectors.

Check out our XPath vs CSS selectors comparison guide for a detailed comparison of the two methods.

That said, we recommend CSS selectors as they're more intuitive and beginner-friendly.
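
For example, both of the following target a product's name element on our target page (we'll confirm these class names during inspection below); only the CSS form appears in the rest of this tutorial:

CSS:   li.product h2.product-name
XPath: //li[contains(@class, 'product')]//h2[contains(@class, 'product-name')]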

In this example, we'll extract each product's name, price, and image URL.

Let's begin!

Firstly, inspect the page and identify the CSS selectors that correspond to all the data points you want. Visit the page in a browser, right-click on the first product, and choose the Inspect option.

This opens up the Developer Tools window, as shown below.

scrapingcourse ecommerce homepage inspect first product li

Now, notice that each product is a list item with the class product and contains the following data points:

  • Product name: <h2> tag with class product-name.
  • Product price: <span> tag with class product-price.
  • Product image: <img> tag with class product-image.

Using this information, select all product items on the page, iterate through them, and get the product name, price, and image URL.

Although you can do this with Puppeteer, we recommend integrating with JSoup, a Java parsing library, as it's more intuitive and offers simpler syntax.
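
For reference, here's roughly what the pure-Puppeteer route looks like: run a snippet of JavaScript in the page via page.evaluate() (the same method the infinite-scrolling example later in this tutorial uses). This minimal sketch prints the product names only:

// sketch: extract product names in the browser itself, without JSoup;
// assumes evaluate() returns the arrow function's value as an Object
Object names = page.evaluate(
    "() => Array.from(document.querySelectorAll('li.product .product-name'))" +
    ".map(el => el.textContent.trim()).join('\\n')"
);
System.out.println(names);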

To add JSoup to your project, include the following XML snippet in your pom.xml <dependencies> section:

pom.xml
<dependency>
  <!-- jsoup HTML parser library @ https://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.18.3</version>
</dependency>

After that, import the required classes and parse the downloaded HTML using JSoup.

Main.java
package com.example;

// import the required classes
import com.ruiyun.jvppeteer.api.core.Browser;
import com.ruiyun.jvppeteer.api.core.Page;
import com.ruiyun.jvppeteer.cdp.core.Puppeteer;
import com.ruiyun.jvppeteer.cdp.entities.LaunchOptions;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;



public class Main {
    public static void main(String[] args) {
        System.out.println("Launching browser...");

        // initialize launch options
        LaunchOptions launchOptions = LaunchOptions.builder()
            // uncomment the next line to run in GUI mode
            //.headless(false)
            .build();

        try (Browser cdpBrowser = Puppeteer.launch(launchOptions)) {
            // open a new page
            Page page = cdpBrowser.newPage();
           
            // navigate to the target URL
            page.goTo("https://www.scrapingcourse.com/ecommerce/");
           
            // retrieve the page's HTML content
            String pageContent = page.content();

            // parse HTML using JSoup
            Document document = Jsoup.parse(pageContent);

            // select product items
            Elements products = document.select("li.product");
            // iterate through products and extract name, price, and image
            for (Element product : products) {
                String name = product.select(".product-name").text();
                String price = product.select(".product-price").text();
                String image = product.select(".product-image").attr("src");

                System.out.println("Product Name: " + name);
                System.out.println("Price: " + price);
                System.out.println("Image URL: " + image);
                System.out.println("----------------------");
            }
           
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

That's it!

This code extracts each product's name, price, and image URL. Here's the result:

Output
Product Name: Abominable Hoodie
Price: $69.00
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main.jpg
----------------------
Product Name: Adrienne Trek Jacket
Price: $57.00
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_main.jpg
----------------------
// ... omitted for brevity

Step 4: Export Scraped Data to a CSV File

Exporting data to CSV is essential for easy access and analysis. In Java, you can do so using the built-in FileWriter class.

Here's a step-by-step guide.

Import the required classes. Then, initialize an empty list and add the extracted data to this list.

Main.java
package com.example;

// import the required classes

// ...
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Main {
    // initialize an empty list to store scraped product data
    private static List<String[]> productData = new ArrayList<>();

    public static void main(String[] args) {
       // ...

        try (Browser cdpBrowser = Puppeteer.launch(launchOptions)) {
            // ...

            // ... scraping logic (inside the for loop from Step 3)

                // store the product details in the data list
                productData.add(new String[]{name, price, image});
            }
           
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

After that, initialize a FileWriter. Then, write the CSV headers and populate the rows with the scraped data. Let's abstract this into a reusable method defined alongside main().

Main.java
package com.example;

// import the required classes

// ...

public class Main {
    //...

    public static void main(String[] args) {
        //...
    }

    // method to export the scraped data to a CSV file
    private static void exportDataToCsv(String filePath) {
        try (FileWriter writer = new FileWriter(filePath)) {
            // write headers
            writer.append("Product Name,Price,Image URL\n");

            // write data rows
            for (String[] row : productData) {
                writer.append(String.join(",", row));
                writer.append("\n");
            }
            System.out.println("Data saved to " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

That's it!

Now, combine all the steps above and call the exportDataToCsv() method in main() to get the following complete code.

Main.java
package com.example;

// import the required classes
import com.ruiyun.jvppeteer.api.core.Browser;
import com.ruiyun.jvppeteer.api.core.Page;
import com.ruiyun.jvppeteer.cdp.core.Puppeteer;
import com.ruiyun.jvppeteer.cdp.entities.LaunchOptions;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;



public class Main {
    // initialize an empty list to store scraped product data
    private static List<String[]> productData = new ArrayList<>();

    public static void main(String[] args) {
        System.out.println("Launching browser...");

        // initialize launch options
        LaunchOptions launchOptions = LaunchOptions.builder()
            // uncomment the next line to run in GUI mode
            //.headless(false)
            .build();

        try (Browser cdpBrowser = Puppeteer.launch(launchOptions)) {
            // open a new page
            Page page = cdpBrowser.newPage();
           
            // navigate to the target URL
            page.goTo("https://www.scrapingcourse.com/ecommerce/");
           
            // retrieve the page's HTML content
            String pageContent = page.content();

            // parse HTML using JSoup
            Document document = Jsoup.parse(pageContent);

            // select product items
            Elements products = document.select("li.product");
            // iterate through products and extract name, price, and image
            for (Element product : products) {
                String name = product.select(".product-name").text();
                String price = product.select(".product-price").text();
                String image = product.select(".product-image").attr("src");

                // store the product details in the data list
                productData.add(new String[]{name, price, image});
            }

            // export data to CSV
            exportDataToCsv("products.csv");
           
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // method to export data to CSV file
    private static void exportDataToCsv(String filePath) {
        try (FileWriter writer = new FileWriter(filePath)) {
            // write headers
            writer.append("Product Name,Price,Image URL\n");
           
            // write data rows
            for (String[] row : productData) {
                writer.append(String.join(",", row));
                writer.append("\n");
            }
            System.out.println("Data saved to " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This code creates a new products.csv file in your project's root directory and exports the scraped data to it.

Here's a sample screenshot for reference.

Extracted Data in CSV File
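
One caveat: String.join(",", row) produces a broken row if a field ever contains a comma or a quote. The fields on this page don't, but if yours might, a small quoting helper (a hypothetical addition, not part of the code above) keeps the CSV valid:

// hypothetical helper: quote a field and double any embedded quotes,
// following the usual CSV escaping convention
private static String escapeCsv(String field) {
    return "\"" + field.replace("\"", "\"\"") + "\"";
}

You'd then write each row by joining the escaped fields instead of the raw ones.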

Congratulations! You've created a functional Puppeteer Java web scraper.

Interact With the Page Using Puppeteer in Java

Some web scraping scenarios require you to simulate user actions. In this section, you'll learn two popular browser interactions using Puppeteer in Java.

Handle Infinite Scrolling

Modern websites use infinite scrolling to update their data as you scroll down the page. This means that the entire web page content isn't loaded at once, and you must simulate the scrolling action to access the full HTML.

To scrape such pages using Puppeteer in Java, you must scroll down the page gradually until you reach the bottom, where no more content loads.

Let's put this into practice.

We'll use the following ScrapingCourse Infinite Scrolling test page as the target website for this example.

Infinite Scroll Demo

To get started, launch a Chrome browser and navigate to the target website as in the previous steps.

Main.java
package com.example;

// import the required classes
import com.ruiyun.jvppeteer.api.core.Browser;
import com.ruiyun.jvppeteer.api.core.Page;
import com.ruiyun.jvppeteer.cdp.core.Puppeteer;
import com.ruiyun.jvppeteer.cdp.entities.LaunchOptions;

public class Main {
    public static void main(String[] args) {
        System.out.println("Launching browser...");

        // initialize launch options
        LaunchOptions launchOptions = LaunchOptions.builder()
            // uncomment the next line to run in GUI mode
            //.headless(false)
            .build();

        try (Browser cdpBrowser = Puppeteer.launch(launchOptions)) {
            // open a new page
            Page page = cdpBrowser.newPage();
           
            // navigate to the target URL
            page.goTo("https://www.scrapingcourse.com/infinite-scrolling");
           

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Next, simulate the scrolling action.

To do so, get the page's initial scroll height. Then, using a loop, scroll to the bottom of the page and wait a few seconds for new content to load. If the scroll height stops changing, you've reached the actual bottom, so break the loop; otherwise, keep scrolling.

Main.java
public class Main {
    public static void main(String[] args) {
        //...
        try (Browser cdpBrowser = Puppeteer.launch(launchOptions)) {
            //...
            
            long lastHeight = ((Number) page.evaluate("() => document.body.scrollHeight")).longValue();

            while (true) {
                // scroll down
                page.evaluate("window.scrollTo(0, document.body.scrollHeight)");
                // wait for content to load
                Thread.sleep(3000);

                // get new scroll height
                long newHeight = ((Number) page.evaluate("() => document.body.scrollHeight")).longValue();

                if (newHeight == lastHeight) {
                    // stop scrolling if there is no more new content
                    break; 
                }
                lastHeight = newHeight;
            }

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

That's it. You've simulated the scrolling action using Puppeteer in Java.
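
The fixed Thread.sleep(3000) is simple but either wastes time or, on a slow connection, resumes before new content arrives. As a hedged alternative (using only the page.evaluate() calls shown above), you can poll the number of loaded items instead; .product-item is the selector the scraping logic below relies on:

// sketch: replace the scroll + fixed sleep with a short polling loop
long before = ((Number) page.evaluate(
        "() => document.querySelectorAll('.product-item').length")).longValue();
page.evaluate("window.scrollTo(0, document.body.scrollHeight)");
for (int i = 0; i < 10; i++) {
    Thread.sleep(500); // check every half second, up to ~5 seconds total
    long now = ((Number) page.evaluate(
            "() => document.querySelectorAll('.product-item').length")).longValue();
    if (now > before) break; // new items loaded; re-check the scroll height
}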

Add your scraping logic to extract product data and export it to CSV, as in the previous steps. Then, combine the code snippets above to get the following complete code.

Main.java
package com.example;

// import the required classes
import com.ruiyun.jvppeteer.api.core.Browser;
import com.ruiyun.jvppeteer.api.core.Page;
import com.ruiyun.jvppeteer.cdp.core.Puppeteer;
import com.ruiyun.jvppeteer.cdp.entities.LaunchOptions;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Main {
    // initialize an empty list to store scraped product data
    private static List<String[]> productData = new ArrayList<>();

    public static void main(String[] args) {
        System.out.println("Launching browser...");

        // initialize launch options
        LaunchOptions launchOptions = LaunchOptions.builder()
            // uncomment the next line to run in GUI mode
            //.headless(false)
            .build();

        try (Browser cdpBrowser = Puppeteer.launch(launchOptions)) {
            // open a new page
            Page page = cdpBrowser.newPage();
           
            // navigate to the target URL
            page.goTo("https://www.scrapingcourse.com/infinite-scrolling");

            long lastHeight = ((Number) page.evaluate("() => document.body.scrollHeight")).longValue();

            while (true) {
                // scroll down
                page.evaluate("window.scrollTo(0, document.body.scrollHeight)");
                // wait for content to load
                Thread.sleep(3000);

                // get new scroll height
                long newHeight = ((Number) page.evaluate("() => document.body.scrollHeight")).longValue();

                if (newHeight == lastHeight) {
                    // stop scrolling if there is no more new content
                    break; 
                }
                lastHeight = newHeight;
            }
            // retrieve the page's HTML content
            String pageContent = page.content();

            // parse HTML using JSoup
            Document document = Jsoup.parse(pageContent);

            // select product items
            Elements products = document.select(".product-item");
            // iterate through products and extract name, price, and image
            for (Element product : products) {
                String name = product.select(".product-name").text();
                String price = product.select(".product-price").text();
                String image = product.select(".product-image").attr("src");

                // store the product details in the data list
                productData.add(new String[]{name, price, image});
            }

            // export data to CSV
            exportDataToCsv("products.csv");


        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // method to export data to CSV file
    private static void exportDataToCsv(String filePath) {
        try (FileWriter writer = new FileWriter(filePath)) {
            // write headers
            writer.append("Product Name,Price,Image URL\n");
           
            // write data rows
            for (String[] row : productData) {
                writer.append(String.join(",", row));
                writer.append("\n");
            }
            System.out.println("Data saved to " + filePath);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This code navigates to the new target page, scrolls to the bottom, extracts the product data, and exports it to a CSV file.

Here's a sample screenshot for reference.


Congratulations!

Take Screenshots

Puppeteer allows you to capture screenshots of different specifications:

  • Full page: The entire web page, including parts that may require scrolling.
  • Visible area: Only what's visible in the browser window.
  • Specific element: A specific HTML element, such as a product card.

To capture a full-page screenshot, set the setFullPage screenshot option to true.

Main.java
package com.example;


// import the required classes
import com.ruiyun.jvppeteer.api.core.Browser;
import com.ruiyun.jvppeteer.api.core.Page;
import com.ruiyun.jvppeteer.cdp.core.Puppeteer;
import com.ruiyun.jvppeteer.cdp.entities.ScreenshotOptions;
import com.ruiyun.jvppeteer.cdp.entities.LaunchOptions;

public class Main {
    public static void main(String[] args) {
        System.out.println("Launching browser...");

        // initialize launch options
        LaunchOptions launchOptions = LaunchOptions.builder()
            // run in headless mode
            .headless(true)
            .build();

        try (Browser cdpBrowser = Puppeteer.launch(launchOptions)) {
            // open a new page
            Page page = cdpBrowser.newPage();
           
            // navigate to the target URL
            page.goTo("https://www.scrapingcourse.com/ecommerce/");
           
            // configure screenshot options
            ScreenshotOptions screenshotOptions = new ScreenshotOptions();
            screenshotOptions.setPath("full_page.png");
            screenshotOptions.setOmitBackground(true);
            screenshotOptions.setFullPage(true);
           
            // take a screenshot
            page.screenshot(screenshotOptions);
           
            System.out.println("Screenshot taken and saved");

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

For a visible area screenshot, simply omit the setFullPage option in your screenshot options.

Main.java
public class Main {
    public static void main(String[] args) {

        //...
        try (Browser cdpBrowser = Puppeteer.launch(launchOptions)) {
            //...
           
            // configure screenshot options
            ScreenshotOptions screenshotOptions = new ScreenshotOptions();
            screenshotOptions.setPath("visible_area.png");
            screenshotOptions.setOmitBackground(true);
           
            // take a screenshot
            page.screenshot(screenshotOptions);
           
            System.out.println("Screenshot taken and saved");

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Lastly, to take a screenshot of a specific element, select the element using a CSS selector and use the screenshot() method on the element.

Suppose you're interested in the first product on the target page. You'll need to grab the element as an ElementHandle and take its screenshot. Your script will look like this:

Main.java
package com.example;

import com.ruiyun.jvppeteer.api.core.Browser;
import com.ruiyun.jvppeteer.api.core.Page;
import com.ruiyun.jvppeteer.cdp.core.Puppeteer;
import com.ruiyun.jvppeteer.cdp.entities.LaunchOptions;
import com.ruiyun.jvppeteer.api.core.ElementHandle;

public class Main {
   public static void main(String[] args) {
       System.out.println("Launching browser...");

       // initialize launch options
       LaunchOptions launchOptions = LaunchOptions.builder()
               .headless(true)
               .build();

       try (Browser cdpBrowser = Puppeteer.launch(launchOptions)) {
           // open a new page
           Page page = cdpBrowser.newPage();

           // navigate to the target URL
           page.goTo("https://www.scrapingcourse.com/ecommerce/");

           // wait until the element is available
           page.waitForSelector(".product");

           // get the element
           ElementHandle product = (ElementHandle) page.evaluateHandle(
                   "document.querySelector('.product')"
           );

           // take a screenshot
           product.screenshot("specific_element.png");

           System.out.println("Screenshot taken and saved");

       } catch (Exception e) {
           e.printStackTrace();
       }
   }
}

Note that the ElementHandle.screenshot() method takes only a file path string, which is why no screenshot options appear in this case.

Avoid Getting Blocked While Scraping With Puppeteer

Getting blocked is a common challenge when web scraping with Puppeteer. This is because the headless browser often exhibits automation properties that are easily flagged by anti-bot solutions.

Here's a Puppeteer Java script that attempts to scrape an Antibot Challenge page.

Main.java
package com.example;

// import the required classes
import com.ruiyun.jvppeteer.api.core.Browser;
import com.ruiyun.jvppeteer.api.core.Page;
import com.ruiyun.jvppeteer.cdp.core.Puppeteer;
import com.ruiyun.jvppeteer.cdp.entities.LaunchOptions;

public class Main {
    public static void main(String[] args) {
        System.out.println("Launching browser...");

        // initialize launch options
        LaunchOptions launchOptions = LaunchOptions.builder()
            // run in headless mode
            .headless(true)
            .build();

        try (Browser cdpBrowser = Puppeteer.launch(launchOptions)) {
            // open a new page
            Page page = cdpBrowser.newPage();
           
            // navigate to the target URL
            page.goTo("https://www.scrapingcourse.com/antibot-challenge");
           
            // retrieve the page's HTML content
            String pageContent = page.content();
            System.out.println(pageContent);

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Here's the result:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <p>
        Verifying you are human. This may take a few seconds.
    </p>
    <!-- other content omitted for brevity -->
</body>
</html>

This response signifies that the website flagged Puppeteer as a bot and blocked the request.

Common recommendations for overcoming this challenge include rotating proxies and setting custom User Agents. However, these measures often fail against advanced anti-bot solutions.
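
For completeness, here's what the User Agent tweak would look like. This assumes Jvppeteer mirrors Puppeteer's page.setUserAgent() method (an assumption; check the wrapper's documentation for your version):

// assumption: Jvppeteer exposes Puppeteer's setUserAgent(); this hides one
// obvious automation signal but won't defeat advanced anti-bot systems
page.setUserAgent(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36");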

To avoid getting blocked while web scraping with Puppeteer, consider using ZenRows' Universal Scraper API. This tool is an all-in-one scraping solution that bypasses blocks and handles dynamic content extraction with its headless browser features.

A single API call is all you need to integrate the Universal Scraper API.

Here's a step-by-step guide on using the ZenRows Universal Scraper API to scrape the Antibot challenge page that blocked us earlier.

Sign up and go to the Request Builder. Then, paste the target URL in the link box and activate Premium Proxies and JS Rendering.

building a scraper with zenrows

Select Java as your preferred programming language and choose the API connection mode. Copy the generated code and paste it into your scraper.

The generated code should look like this:

Scraper.java
import org.apache.hc.client5.http.fluent.Request;

public class APIRequest {
   public static void main(final String... args) throws Exception {
       String apiUrl = "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.scrapingcourse.com%2Fantibot-challenge&js_render=true&premium_proxy=true";
       String response = Request.get(apiUrl)
               .execute().returnContent().asString();

       System.out.println(response);
   }
}
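
Note that this snippet depends on the Apache HttpClient 5 fluent API. If it isn't already in your project, add it to your pom.xml (the version below is a suggestion; use the latest available):

pom.xml
<dependency>
  <groupId>org.apache.httpcomponents.client5</groupId>
  <artifactId>httpclient5-fluent</artifactId>
  <version>5.3.1</version>
</dependency>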

Here's the result, showing you bypassed the challenge:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>

Congratulations! You can now bypass anti-bot restrictions and scrape at any scale using ZenRows' Universal Scraper API.

Conclusion

While Puppeteer is primarily a Node.js library, you can leverage its functionality in Java using a Puppeteer Java wrapper such as Jvppeteer.

However, since Puppeteer leaves traces of its automation properties, websites can easily flag your requests.

To avoid getting blocked when scraping with Puppeteer, use the ZenRows Universal Scraper API, a complete toolkit that lets you scrape any website confidently without limitations.

Sign up now to try ZenRows for free.
