Panther PHP Web Scraping: Step-By-Step Tutorial

May 14, 2024 · 12 min read

Panther web scraping is a powerful approach to extracting data from a site in PHP. The library controls a real-world browser, using the WebDriver protocol, and is perfect for scraping and testing.

This tutorial will cover the basics of Panther scraping and touch on more complex interactions. By the end of this guide, you'll know how to:

  • Set up a Composer project and install Panther.
  • Scrape data from a dynamic-content page.
  • Simulate user interactions such as scrolling, waiting, and clicking.
  • Deal with the challenge of getting blocked while scraping.

Let's dive in!

Why Use Panther for Web Scraping?

Panther is a standalone PHP library for scraping sites and running end-to-end tests in real browsers. It's popular in the PHP community, with thousands of stars on GitHub, partly because it implements Symfony's well-known BrowserKit and DomCrawler APIs.

Panther uses the WebDriver protocol to control browsers like Google Chrome and Firefox. With its intuitive API and browser automation capabilities, it's an excellent tool for both testing and PHP web scraping.

A popular alternative to Panther is Selenium. However, the PHP bindings of the Selenium WebDriver don't receive official updates. Learn more in our guide on Selenium PHP.

How to Scrape With Panther

Get started with Panther in PHP by learning how to scrape on this infinite scrolling demo page:

infinite scrolling demo page

This page dynamically loads new products via JavaScript as the user scrolls down, so it requires browser automation for scraping. A simple HTML parser can't retrieve this data, so you need to interact with the page using a tool like Panther.
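To see why a plain HTML parser falls short here, consider a minimal sketch using PHP's built-in DOM extension. The markup below is a hypothetical simplification of what a static HTTP client would receive from such a page: an empty container that JavaScript fills in later, so a static parse finds no product nodes at all.

```php
<?php

// static markup as a plain HTTP client would receive it: the product
// grid is an empty container that JavaScript populates after scrolling
// (hypothetical simplification, not the real page source)
$staticHtml = <<<'HTML'
<html>
  <body>
    <div id="product-grid"><!-- populated by JavaScript --></div>
  </body>
</html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($staticHtml);
$xpath = new DOMXPath($dom);

// look for elements carrying the "post" class, as the scraper will
$products = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' post ')]"
);

echo $products->length; // 0: the products only exist after JS runs
```

A browser-driven tool like Panther avoids this problem because it executes the page's JavaScript before you parse the DOM.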

Let's see how to extract some data from it!

Step 1: Install Panther in PHP

Before getting started, make sure your computer has PHP 8+ and Composer installed. Click the links, download them, and follow the installation wizard.

Next, open the terminal. Create a folder for your Panther web scraping project and enter it:

Terminal
mkdir panther-scraper
cd panther-scraper

Launch the init command to initialize a new Composer project inside it. Follow the instructions and answer the questions with the default options:

Terminal
composer init

Awesome! The panther-scraper folder now contains an empty Composer project. Load the project directory in a PHP IDE, such as PhpStorm or Visual Studio Code with the PHP extension.

Install the Panther Symfony component with the command below:

Terminal
composer req symfony/panther

Composer will show a warning message and ask you the question below. Answer "no," since this is the correct setup to use Panther for scraping in a production environment:

Output
The package you required is recommended to be placed in require-dev (because it is tagged as "testing") but you did not use --dev.
Do you want to re-run the command with --dev? [yes]? no
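Answering "no" places the package under the require key of your composer.json rather than require-dev, so it will also be installed in production. The file will then contain an entry along these lines (the version constraint below is illustrative):

```json
{
    "require": {
        "symfony/panther": "^2.1"
    }
}
```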

Keep in mind that Panther uses the WebDriver protocol to control the browser with which you crawl sites. Therefore, you need to download and set up the right WebDriver executables based on your browser version.

Automate that process with the dbrekelmans/browser-driver-installer package. It will retrieve the right ChromeDriver and geckodriver. Install the Composer library and execute it with these commands:

Terminal
composer require dbrekelmans/bdi
vendor/bin/bdi detect drivers

After downloading the WebDriver executables, dbrekelmans/bdi will print a message:

Output
[OK] chromedriver 123.0.6312.58 installed to drivers\chromedriver.exe 

Great! You now have everything you need to use Panther in PHP for web scraping.

Add a scraper.php file in the /src folder and initialize it with the code below. The first line contains the autoload import required by Composer. Next comes the Panther import:

scraper.php
<?php

use Symfony\Component\Panther\Client;

require_once("vendor/autoload.php");

// scraping logic...

Here we go! Your Panther scraping project is ready.

Step 2: Scrape Your Target Page's HTML

Use the code below to initialize a Chrome driver client that will control a local instance of Chrome:

scraper.php
$client = Client::createChromeClient();

By default, Panther will start Chrome in headless mode. To avoid that and see the actions made by your script in the browser, set the PANTHER_NO_HEADLESS environment variable to 1:

Terminal
export PANTHER_NO_HEADLESS=1

On Windows, use the equivalent PowerShell command:

Terminal
$Env:PANTHER_NO_HEADLESS=1

Then, use the request() method from $client to connect to the target page:

scraper.php
$crawler = $client->request("GET", "https://scrapingclub.com/exercise/list_infinite_scroll/");

The result of that method is a Crawler Symfony object, which exposes the methods to select HTML nodes and extract data from them.

Use the html() method of $crawler to retrieve the source HTML code of the current page. Print it in the terminal with echo:

scraper.php
$html = $crawler->html();
echo $html;

This is what your scraper.php file should contain:

scraper.php
<?php

use Symfony\Component\Panther\Client;

require_once("vendor/autoload.php");

// initialize a Chrome client instance
$client = Client::createChromeClient();

// connect to the target page
$crawler = $client->request("GET", "https://scrapingclub.com/exercise/list_infinite_scroll/");

// retrieve the HTML source code of the
// target page and print it
$html = $crawler->html();
echo $html;

Execute the Panther web scraping script in headed mode:

Terminal
php src/scraper.php

The scraper will open a Chrome window and visit the infinite scrolling demo page below. Then, it will close as the script execution terminates:

Scraping Club Webpage Screenshot

The PHP script will also print:

Output
<html class="h-full"><head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <meta name="description" content="Learn to scrape infinite scrolling pages"><title>Scraping Infinite Scrolling Pages (Ajax) | ScrapingClub</title>
  <link rel="icon" href="/static/img/icon.611132651e39.png" type="image/png">
  <!-- Omitted for brevity... -->

Perfect! That's the HTML source code of the target page.

Step 3: Extract the Data You Want

The Crawler object returned by Panther can parse HTML content and extract data from it. Suppose you want to retrieve the name and price information from the products on the page. These are the steps you have to take:

  1. Select the product HTML elements on the page through an effective node selection strategy.
  2. Collect the desired data from each of them.
  3. Store the scraped data in a PHP array.

A proper selection strategy usually relies on an XPath expression or CSS Selector. CSS selectors are short and intuitive, while XPath expressions are longer but more powerful. For more info, check out our guide on CSS Selector vs XPath.

Panther supports both CSS Selector and XPath via the filter() and filterXPath() methods, respectively. So, you have multiple options for selecting HTML nodes from the DOM.
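As a rough illustration of how the two relate, here is the `.post` CSS selector next to the XPath expression it corresponds to, evaluated with PHP's built-in DOM extension (Panther's filter() performs an equivalent CSS-to-XPath translation internally via symfony/css-selector). The sample markup is a hypothetical stand-in for the real page:

```php
<?php

// hypothetical markup mimicking the product grid on the demo page
$html = <<<'HTML'
<div>
  <div class="post"><h4>Short Dress</h4><h5>$24.99</h5></div>
  <div class="post"><h4>Fitted Dress</h4><h5>$34.99</h5></div>
  <div class="ad">Not a product</div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// the CSS selector ".post" written as XPath: any element whose class
// attribute contains the standalone token "post"
$nodes = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' post ')]"
);

// print the name of each matched product
foreach ($nodes as $node) {
    echo $node->getElementsByTagName("h4")->item(0)->textContent, "\n";
}
```

As you can see, the XPath version is noticeably more verbose, which is why the next steps stick to CSS selectors.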

Let's keep things simple and go for CSS selectors. Analyze the HTML code of a product node to figure out which CSS selectors you need to reach your goal. Open the target site in the browser, right-click on a product element, and inspect it with the DevTools:

DevTools Inspection

Expand the HTML code. Notice that each product has a post class. The product name is in an <h4> while the price is in an <h5>. This information is enough to define the selectors you need to perform Panther web scraping.

Follow the instructions below to learn how to get the name and price of the products on the page.

Initialize a $products array to keep track of the scraped data:

scraper.php
$products = [];

Use the filter() method to apply a CSS selector and select the HTML product nodes. Iterate over each product with each(), scrape the name and price, instantiate a new object, and add it to the $products list:

scraper.php
$crawler->filter(".post")->each(function ($productHTMLElement) use (&$products) {
  // scraping logic
  $name = $productHTMLElement->filter("h4")->eq(0)->text();
  $price = $productHTMLElement->filter("h5")->eq(0)->text();

  // instantiate a new product object
  // and add it to the list
  $product = [
      "name" => $name,
      "price" => $price,
  ];
  $products[] = $product;
});

Make sure the above Panther scraping logic works by logging $products in the terminal:

scraper.php
print_r($products);

scraper.php will now contain:

scraper.php
<?php

use Symfony\Component\Panther\Client;

require_once("vendor/autoload.php");

// initialize a Chrome client instance
$client = Client::createChromeClient();

// connect to the target page
$crawler = $client->request("GET", "https://scrapingclub.com/exercise/list_infinite_scroll/");

// where to store the scraped data
$products = [];

// select all product HTML elements on the page,
// iterate over them, and apply the scraping logic
$crawler->filter(".post")->each(function ($productHTMLElement) use (&$products) {
  // scraping logic
  $name = $productHTMLElement->filter("h4")->eq(0)->text();
  $price = $productHTMLElement->filter("h5")->eq(0)->text();

  // instantiate a new product object
  // and add it to the list
  $product = [
      "name" => $name,
      "price" => $price,
  ];
  $products[] = $product;
});

// print all products
print_r($products);

Launch the script, and it'll generate this output:

Output
Array
(
    [0] => Array
        (
            [name] => Short Dress
            [price] => $24.99
        )

    // ...

    [9] => Array
        (
            [name] => Fitted Dress
            [price] => $34.99
        )

)

Fantastic! The $products array stores the scraped objects with the data of interest. All that remains is to export the scraped information in a human-readable format.

Step 4: Convert Your Data Into a CSV File

The PHP standard library provides everything needed to export the scraped data to a CSV file. Use fopen() to create a products.csv file. Next, iterate over $products and employ fputcsv() to convert each product object to a CSV record and append it to the output file:

scraper.php
// create the CSV output file
$csvFilePath = "products.csv";
$csvFile = fopen($csvFilePath, "w");

// write the header row
$header = ["name", "price"];
fputcsv($csvFile, $header);

// add each product to the CSV file
foreach ($products as $product) {
    fputcsv($csvFile, $product);
}

// close the CSV file
fclose($csvFile);

Take a look at your final Panther script for web scraping:

scraper.php
<?php

use Symfony\Component\Panther\Client;

require_once("vendor/autoload.php");

// initialize a Chrome client instance
$client = Client::createChromeClient();

// connect to the target page
$crawler = $client->request("GET", "https://scrapingclub.com/exercise/list_infinite_scroll/");

// where to store the scraped data
$products = [];

// select all product HTML elements on the page,
// iterate over them, and apply the scraping logic
$crawler->filter(".post")->each(function ($productHTMLElement) use (&$products) {
  // scraping logic
  $name = $productHTMLElement->filter("h4")->eq(0)->text();
  $price = $productHTMLElement->filter("h5")->eq(0)->text();

  // instantiate a new product object
  // and add it to the list
  $product = [
      "name" => $name,
      "price" => $price,
  ];
  $products[] = $product;
});

// create the CSV output file
$csvFilePath = "products.csv";
$csvFile = fopen($csvFilePath, "w");

// write the header row
$header = ["name", "price"];
fputcsv($csvFile, $header);

// add each product to the CSV file
foreach ($products as $product) {
    fputcsv($csvFile, $product);
}

// close the CSV file
fclose($csvFile);

Launch it with the command below:

Terminal
php src/scraper.php

After the script execution, a products.csv file will appear in the root folder of your project. Open it, and you'll see these records:

CSV File

Congratulations! You now know the basics of web scraping in Panther.

Keep in mind that the current output involves only ten records. That’s because the page initially only shows a few products, and loads more via infinite scrolling. Go to the next section to learn how to scrape all products on the site!

Interacting With Web Pages in a Browser With Panther

Panther can reproduce many user interactions, such as waits, mouse movements, and more. Thanks to browser automation, your script appears to be a human navigating the site, which helps you avoid anti-bot measures.

The interactions Panther can simulate include:

  • Clicking elements and moving the mouse.
  • Waiting for elements on the page to be present, contain text, become enabled, etc.
  • Filling out input fields and submitting forms.
  • Following links.
  • Scrolling up and down the page.
  • Taking screenshots.

The library offers built-in methods for executing most of these operations. You also have the executeScript() method for running a JavaScript script directly on the page. With both tools, you can simulate any user interaction.

Let’s see how to retrieve all product data from the infinite scroll demo page and explore other popular Panther scraping interactions!

Scrolling

After the first load, the target page contains only ten products. When the user scrolls to the end of the page, the site loads new products dynamically. Panther doesn't have a method for simulating the scrolling interaction, so you need custom JavaScript logic.

This JavaScript script instructs the browser to scroll down 10 times at an interval of 0.5 seconds each:

scraper.php
// scroll down the page 10 times
const scrolls = 10
let scrollCount = 0

// scroll down and then wait for 0.5s
const scrollInterval = setInterval(() => {
  window.scrollTo(0, document.body.scrollHeight)
  scrollCount++

  if (scrollCount === scrolls) {
    clearInterval(scrollInterval)
  }
}, 500)

Store the script above in a string variable and pass it to the executeScript() method as follows:

scraper.php
$scrolling_script = <<<EOD
// scroll down the page 10 times
const scrolls = 10
let scrollCount = 0

// scroll down and then wait for 0.5s
const scrollInterval = setInterval(() => {
  window.scrollTo(0, document.body.scrollHeight)
  scrollCount++

  if (scrollCount === scrolls) {
    clearInterval(scrollInterval)
  }
}, 500)
EOD;

$client->executeScript($scrolling_script);

Instructing Panther to scroll down the page isn't enough. You also need to wait for the page to retrieve and render the products. To do so, stop the script execution for 10 seconds with sleep():

scraper.php
sleep(10);

Here's your new complete code:

scraper.php
<?php

use Symfony\Component\Panther\Client;

require_once("vendor/autoload.php");

// initialize a Chrome client instance
$client = Client::createChromeClient();

// connect to the target page
$crawler = $client->request("GET", "https://scrapingclub.com/exercise/list_infinite_scroll/");

// simulate the infinite scrolling interaction
$scrolling_script = <<<EOD
// scroll down the page 10 times
const scrolls = 10
let scrollCount = 0

// scroll down and then wait for 0.5s
const scrollInterval = setInterval(() => {
  window.scrollTo(0, document.body.scrollHeight)
  scrollCount++

  if (scrollCount === scrolls) {
    clearInterval(scrollInterval)
  }
}, 500)
EOD;
// launch the JS script on the page
$client->executeScript($scrolling_script);

// wait 10 seconds for the new products to load
sleep(10);

// where to store the scraped data
$products = [];

// select all product HTML elements on the page,
// iterate over them, and apply the scraping logic
$crawler->filter(".post")->each(function ($productHTMLElement) use (&$products) {
  // scraping logic
  $name = $productHTMLElement->filter("h4")->eq(0)->text();
  $price = $productHTMLElement->filter("h5")->eq(0)->text();

  // instantiate a new product object
  // and add it to the list
  $product = [
      "name" => $name,
      "price" => $price,
  ];
  $products[] = $product;
});

// create the CSV output file
$csvFilePath = "products.csv";
$csvFile = fopen($csvFilePath, "w");

// write the header row
$header = ["name", "price"];
fputcsv($csvFile, $header);

// add each product to the CSV file
foreach ($products as $product) {
    fputcsv($csvFile, $product);
}

// close the CSV file
fclose($csvFile);

Execute the Panther scraping script again:

Terminal
php src/scraper.php

The execution will take a while because of the sleep() instruction, so be patient.

This time, the products.csv output file will contain many more records than before.

Updated Products CSV File

Mission complete! You’ve just scraped all products from the target site.

Still, you can improve the script by waiting for the new product nodes to be on the page instead of using sleep().

Wait for Element

The current script uses a hard wait after the scroll-down interaction. That's a discouraged practice: a fixed wait both slows down your script and introduces flakiness, since any network slowdown can break the scraping logic.

Instead, use smart waits. They let you wait for a specific event to occur, such as the presence of a given node on the page.

The Panther browser client provides the waitFor() method to wait until a node is present on the page. Use it to wait up to 10 seconds for the 60th product to appear:

scraper.php
$client->waitFor(".post:nth-child(60)", 10);

That line should replace the sleep() instruction since it leads to the same result. The scrolls will trigger some AJAX calls to retrieve new products. After that, the script will automatically wait for those new products to be rendered on the page.

The definitive Panther web scraping script will be:

scraper.php
<?php

use Symfony\Component\Panther\Client;

require_once("vendor/autoload.php");

// initialize a Chrome client instance
$client = Client::createChromeClient();

// connect to the target page
$crawler = $client->request("GET", "https://scrapingclub.com/exercise/list_infinite_scroll/");

// simulate the infinite scrolling interaction
$scrolling_script = <<<EOD
// scroll down the page 10 times
const scrolls = 10
let scrollCount = 0

// scroll down and then wait for 0.5s
const scrollInterval = setInterval(() => {
  window.scrollTo(0, document.body.scrollHeight)
  scrollCount++

  if (scrollCount === scrolls) {
    clearInterval(scrollInterval)
  }
}, 500)
EOD;
// launch the JS script on the page
$client->executeScript($scrolling_script);

// wait up to 10 seconds for the 60th product
// to be on the page
$client->waitFor(".post:nth-child(60)", 10);

// where to store the scraped data
$products = [];

// select all product HTML elements on the page,
// iterate over them, and apply the scraping logic
$crawler->filter(".post")->each(function ($productHTMLElement) use (&$products) {
  // scraping logic
  $name = $productHTMLElement->filter("h4")->eq(0)->text();
  $price = $productHTMLElement->filter("h5")->eq(0)->text();

  // instantiate a new product object
  // and add it to the list
  $product = [
      "name" => $name,
      "price" => $price,
  ];
  $products[] = $product;
});

// create the CSV output file
$csvFilePath = "products.csv";
$csvFile = fopen($csvFilePath, "w");

// write the header row
$header = ["name", "price"];
fputcsv($csvFile, $header);

// add each product to the CSV file
foreach ($products as $product) {
    fputcsv($csvFile, $product);
}

// close the CSV file
fclose($csvFile);

Run it, and you'll get the same results as before much faster.

You now know how to extract data from each product on the site effectively and efficiently. It's time to explore other useful Panther interactions.

Wait for Page to Load

$client->request() automatically waits for the browser to trigger the load event on the page. It’s fired only once the whole page has loaded, including stylesheets, scripts, iframes, and images.

Modern pages are so dynamic that listening to the load event may not be enough to tell if the page has finished loading. AJAX requests and dynamic interactions can still change the DOM. For such complex scenarios, Panther offers the following waiting methods:

  • $client->waitFor(): Wait for the specified element to be attached to the DOM.
  • $client->waitForStaleness(): Wait for the specified element to be removed from the DOM.
  • $client->waitForVisibility(): Wait for the specified element to become visible.
  • $client->waitForInvisibility(): Wait for the specified element to become hidden.
  • $client->waitForElementToContain(): Wait for the given element to contain the specified text.
  • $client->waitForElementToNotContain(): Wait for the given element not to contain the given text.
  • $client->waitForEnabled(): Wait for the given element to become enabled.
  • $client->waitForDisabled(): Wait for the given element to become disabled.
  • $client->waitForAttributeToContain(): Wait for the specified HTML attribute of an element to contain some content.
  • $client->waitForAttributeToNotContain(): Wait for the specified HTML attribute of an element not to contain some content.
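All of these helpers follow the same underlying pattern: poll a condition at a fixed interval until it holds or a timeout expires. Here's a simplified pure-PHP sketch of that idea (an illustration of the concept, not Panther's actual implementation, which delegates to WebDriver's wait API):

```php
<?php

/**
 * Poll $condition every $intervalMs milliseconds until it returns true
 * or $timeoutSeconds elapse. Returns true on success, false on timeout.
 */
function waitUntil(callable $condition, int $timeoutSeconds = 10, int $intervalMs = 250): bool
{
    $deadline = microtime(true) + $timeoutSeconds;
    while (microtime(true) < $deadline) {
        if ($condition()) {
            return true;
        }
        usleep($intervalMs * 1000);
    }
    return false;
}

// toy condition: becomes true on the third poll
$calls = 0;
$ok = waitUntil(function () use (&$calls) {
    return ++$calls >= 3;
}, 2, 10);

var_dump($ok); // bool(true)
```

This is why smart waits are both faster and more robust than sleep(): they return as soon as the condition is met, and only fail after the full timeout has genuinely elapsed.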

Click Elements

The objects returned by filter() expose the click() method to simulate click interactions:

scraper.php
$crawler->filter("[type='submit']")->eq(0)->click();

This function tells the browser to send a mouse click event to the selected element. The browser will then run any onclick callback attached to the clicked node.

If the click() method triggers a page change (as in the snippet below), you'll have to adjust the parsing logic to the new DOM structure:

scraper.php
// click on the first product card on the page
$crawler->filter(".post")->eq(0)->click();

// you are now on the detail product page...
    
// new scraping logic...

// $crawler->filter(...)

Take a Screenshot

A webpage doesn't only contain text. Images are equally important and provide lots of useful information, such as visual insights into competitors’ sites.

$client has a takeScreenshot() method to take a screenshot of the current viewport:

scraper.php
// take a screenshot of the current viewport
$client->takeScreenshot("screenshot.png");

This instruction produces a screenshot.png file in the root folder of your project.

Congratulations! You've mastered Panther web scraping interactions.

Avoid Getting Blocked When Scraping With Panther

The biggest challenge for Panther web scraping is getting blocked by anti-bot solutions. Avoiding them requires making your requests more natural and random. As a starting point, you should set a real-world User-Agent header and use proxies to change your exit IP.

To set a custom user agent, pass it via the --user-agent flag to createChromeClient():

scraper.php
$custom_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";
$client = Client::createChromeClient(null, [
  "--user-agent=$custom_user_agent",
  // other options...
]);

Learn more from our guide to User Agents for web scraping.

Setting a proxy follows a similar process and occurs through the --proxy-server flag. To follow this tutorial, retrieve the URL of a free proxy from a site like Free Proxy List and then pass it to Chrome:

scraper.php
$proxy_url = "234.36.2.15:6813";
$client = Client::createChromeClient(null, [
  "--proxy-server=$proxy_url",
  // other options...
]);

Those two approaches are just baby steps to bypassing anti-bot systems. Complete solutions like Cloudflare will still be able to detect your Panther scraping script and recognize it as a bot. Verify that with the following script by targeting a Cloudflare-protected page from G2.com:

scraper.php
<?php

use Symfony\Component\Panther\Client;

require_once("vendor/autoload.php");

$custom_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";
$proxy_url = "234.36.2.15:6813";

// initialize a Chrome client instance
$client = Client::createChromeClient(null, [
  "--user-agent=$custom_user_agent",
  "--proxy-server=$proxy_url"
]);

// connect to the target page
$crawler = $client->request("GET", "https://www.g2.com/products/zapier/reviews");

// retrieve the HTML source code of the
// target page and print it
$html = $crawler->html();
echo $html;

This snippet will result in the following anti-bot page containing a CAPTCHA:


Should you give up? Of course not! The best way to solve this issue is to opt for a web scraping API, such as ZenRows. ZenRows seamlessly integrates with Panther to extend it with an AI-powered anti-bot toolkit that will help you successfully avoid all possible blocks.

Give Panther superpowers with ZenRows! Sign up for free, redeem your first 1,000 credits, and get to the Request Builder page below:

ZenRows Request Builder

Assume you want to scrape the G2 page protected with Cloudflare. Follow these steps:

  1. Paste the target URL (https://www.g2.com/products/zapier/reviews) into the "URL to Scrape" input.
  2. Click on "Premium Proxy" to enable IP rotation (User-Agent rotation and the AI-powered anti-bot toolkit are included by default).
  3. Select the “cURL” option on the right and then the “API” mode to get the full URL of the ZenRows API.

Pass the generated URL to Panther's request() method:

scraper.php
<?php

use Symfony\Component\Panther\Client;

require_once("vendor/autoload.php");

// initialize a Chrome client instance
$client = Client::createChromeClient();

// connect to the target page
$crawler = $client->request("GET", "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fzapier%2Freviews&js_render=true&premium_proxy=true");

// retrieve the HTML source code of the
// target page and print it
$html = $crawler->html();
echo $html;

Run it, and it'll return the HTML source code of the target G2.com page:

Output
<!DOCTYPE html>
<head>
  <meta charset="utf-8" />
  <link href="https://www.g2.com/assets/favicon-fdacc4208a68e8ae57a80bf869d155829f2400fa7dd128b9c9e60f07795c4915.ico" rel="shortcut icon" type="image/x-icon" />
  <title>Airtable Reviews 2024: Details, Pricing, &amp; Features | G2</title>
  <!-- omitted for brevity ... -->

Brilliant! You’ve just integrated ZenRows into the Panther web scraping library.

But what about anti-bot measures such as form CAPTCHAs that could still stop your script? Good news: ZenRows not only extends Panther but can also replace it completely.

As a cloud service, ZenRows also results in significant savings over the machine costs of running Panther's browsers yourself.

Conclusion

In this Panther scraping guide, you learned the fundamentals of PHP browser automation.

You saw the basics and then explored more advanced techniques. Now you know:

  • How to create a Composer project and install Panther.
  • How to use the library to extract data from a dynamic content page.
  • What user interactions you can simulate with Panther.
  • The challenges of scraping online data and how to address them.

No matter how good your browser automation is, anti-bot systems can still block it. Bypass them all with ZenRows, a web scraping API with browser automation functionality, IP rotation, and the most advanced anti-scraping bypass available. Scraping data from any site has never been easier!

Ready to get started?

Up to 1,000 URLs for free are waiting for you