Panther web scraping is a powerful approach to extracting data from a site in PHP. The library controls a real browser via the WebDriver protocol, making it a great fit for both scraping and testing.
This tutorial covers the basics of Panther scraping and then moves on to more complex interactions. By the end of this guide, you'll know how to set up Panther, extract data from a dynamic page, simulate user interactions, and deal with anti-bot blocks.
Let's dive in!
Why Use Panther for Web Scraping?
Panther is a standalone PHP library for scraping sites and running end-to-end tests in real browsers. It's popular in the PHP community, with thousands of stars on GitHub, largely because it implements Symfony's well-known BrowserKit and DomCrawler APIs.
Panther uses the WebDriver protocol to control browsers like Google Chrome and Firefox. With its intuitive API and browser automation capabilities, it's an excellent tool for both testing and PHP web scraping.
A popular alternative to Panther is Selenium. However, the PHP bindings of the Selenium WebDriver don't receive official updates. Learn more in our guide on Selenium PHP.
How to Scrape With Panther
Get started with Panther in PHP by learning how to scrape this infinite scrolling demo page: https://scrapingclub.com/exercise/list_infinite_scroll/
This page dynamically loads new products via JavaScript as the user scrolls down, so it requires browser automation for scraping. A simple HTML parser can't retrieve that data, so you need to interact with the page through a tool like Panther.
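To see this limitation for yourself, here's a minimal sketch (using only PHP's built-in DOM extension, not Panther) that downloads the page with a plain HTTP request and counts the product cards found in the static HTML. The products added later by JavaScript never appear in that count:
<?php
// fetch the raw HTML with a plain HTTP request (no JavaScript execution)
$html = file_get_contents("https://scrapingclub.com/exercise/list_infinite_scroll/");
// parse it with PHP's built-in DOM extension
libxml_use_internal_errors(true); // ignore HTML5 parsing warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
// count the product cards present in the static markup
$xpath = new DOMXPath($dom);
$cards = $xpath->query("//*[contains(@class, 'post')]");
echo "Product cards in the static HTML: " . $cards->length . PHP_EOL;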
Let's see how to extract some data from it!
Step 1: Install Panther in PHP
Before getting started, make sure your computer has PHP 8+ and Composer installed. Click the links, download them, and follow the installation wizard.
Next, open the terminal. Create a folder for your Panther web scraping project and enter it:
mkdir panther-scraper
cd panther-scraper
Launch the init command to initialize a new Composer project inside it. Follow the instructions and answer the questions with the default options:
composer init
Awesome! The panther-scraper folder now contains an empty Composer project. Load the project directory in a PHP IDE, such as PhpStorm or Visual Studio Code with the PHP extension.
Install the Panther Symfony component with the command below:
composer req symfony/panther
Composer will show a warning message and ask you the question below. Answer "no," since this is the correct setup to use Panther for scraping in a production environment:
The package you required is recommended to be placed in require-dev (because it is tagged as "testing") but you did not use --dev.
Do you want to re-run the command with --dev? [yes]? no
Keep in mind that Panther uses the WebDriver protocol to control the browser with which you crawl sites. Therefore, you need to download and set up the right WebDriver executables based on your browser version.
Automate that process with the dbrekelmans/browser-driver-installer package. It will retrieve the right ChromeDriver and geckodriver. Install the Composer library and execute it with these commands:
composer require dbrekelmans/bdi
vendor/bin/bdi detect drivers
After downloading the WebDriver executables, dbrekelmans/bdi will print a message:
[OK] chromedriver 123.0.6312.58 installed to drivers\chromedriver.exe
Great! You now have everything you need to use Panther in PHP for web scraping.
Add a scraper.php file in the /src folder and initialize it with the code below. The file imports the Panther Client class and loads the Composer autoloader:
<?php
use Symfony\Component\Panther\Client;
require_once("vendor/autoload.php");
// scraping logic...
Here we go! Your Panther scraping project is ready.
Step 2: Scrape Your Target Page's HTML
Use the code below to initialize a Chrome driver client that will control a local instance of Chrome:
$client = Client::createChromeClient();
By default, Panther will start Chrome in headless mode. To avoid that and see the actions made by your script in the browser, set the PANTHER_NO_HEADLESS env variable to 1:
export PANTHER_NO_HEADLESS=1
On Windows, use the equivalent PowerShell command:
$Env:PANTHER_NO_HEADLESS=1
Then, use the request() method from $client to connect to the target page:
$crawler = $client->request("GET", "https://scrapingclub.com/exercise/list_infinite_scroll/");
The result of that method is a Crawler Symfony object, which exposes the methods to select HTML nodes and extract data from them.
Use the html() method of $crawler to retrieve the source HTML code of the current page. Print it in the terminal with echo:
$html = $crawler->html();
echo $html;
This is what your scraper.php file should contain:
<?php
use Symfony\Component\Panther\Client;
require_once("vendor/autoload.php");
// initialize a Chrome client instance
$client = Client::createChromeClient();
// connect to the target page
$crawler = $client->request("GET", "https://scrapingclub.com/exercise/list_infinite_scroll/");
// retrieve the HTML source code of the
// target page and print it
$html = $crawler->html();
echo $html;
Execute the Panther web scraping script in headed mode:
php src/scraper.php
The scraper will open a Chrome window and visit the infinite scrolling demo page. Then, it will close as the script execution terminates:
The message "Chrome is being controlled by automated test software" means that Panther is controlling the browser via the WebDriver.
The PHP script will also print:
<html class="h-full"><head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="Learn to scrape infinite scrolling pages"><title>Scraping Infinite Scrolling Pages (Ajax) | ScrapingClub</title>
<link rel="icon" href="/static/img/icon.611132651e39.png" type="image/png">
<!-- Omitted for brevity... -->
Perfect! That's the HTML source code of the target page.
Step 3: Extract the Data You Want
The Crawler object returned by Panther can parse HTML content and extract data from it. Suppose you want to retrieve the name and price information from the products on the page. These are the steps you have to take:
- Select the product HTML elements on the page through an effective node selection strategy.
- Collect the desired data from each of them.
- Store the scraped data in a PHP array.
A proper selection strategy usually relies on an XPath expression or CSS Selector. CSS selectors are short and intuitive, while XPath expressions are longer but more powerful. For more info, check out our guide on CSS Selector vs XPath.
Panther supports both CSS selectors and XPath via the filter() and filterXPath() methods, respectively. So, you have multiple options for selecting HTML nodes from the DOM.
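For example, both of the lines below select the same product nodes (a quick sketch using the .post class you'll meet later in this tutorial), one with a CSS selector and one with an equivalent XPath expression:
// select the product cards with a CSS selector...
$byCss = $crawler->filter(".post");
// ...or with an equivalent XPath expression
$byXPath = $crawler->filterXPath("//*[contains(@class, 'post')]");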
Let's keep things simple and go for CSS selectors. Analyze the HTML code of a product node to figure out which CSS selectors you need to reach your goal. Open the target site in the browser, right-click on a product element, and inspect it with the DevTools:
Expand the HTML code. Notice that each product has a post class. The product name is in an <h4> while the price is in an <h5>. This information is enough to define the selectors you need to perform Panther web scraping.
Follow the instructions below to learn how to get the name and price of the products on the page.
Initialize a $products array to keep track of the scraped data:
$products = [];
Use the filter() method to apply a CSS selector and select the HTML product nodes. Iterate over each product with each(), scrape the name and price, instantiate a new object, and add it to the $products list:
$crawler->filter(".post")->each(function ($productHTMLElement) use (&$products) {
// scraping logic
$name = $productHTMLElement->filter("h4")->eq(0)->text();
$price = $productHTMLElement->filter("h5")->eq(0)->text();
// instantiate a new product object
// and add it to the list
$product = [
"name" => $name,
"price" => $price,
];
$products[] = $product;
});
Make sure the above Panther scraping logic works by logging $products in the terminal:
print_r($products);
scraper.php will now contain:
<?php
use Symfony\Component\Panther\Client;
require_once("vendor/autoload.php");
// initialize a Chrome client instance
$client = Client::createChromeClient();
// connect to the target page
$crawler = $client->request("GET", "https://scrapingclub.com/exercise/list_infinite_scroll/");
// where to store the scraped data
$products = [];
// select all product HTML elements on the page,
// iterate over them, and apply the scraping logic
$crawler->filter(".post")->each(function ($productHTMLElement) use (&$products) {
// scraping logic
$name = $productHTMLElement->filter("h4")->eq(0)->text();
$price = $productHTMLElement->filter("h5")->eq(0)->text();
// instantiate a new product object
// and add it to the list
$product = [
"name" => $name,
"price" => $price,
];
$products[] = $product;
});
// print all products
print_r($products);
Launch the script, and it'll generate this output:
Array
(
[0] => Array
(
[name] => Short Dress
[price] => $24.99
)
// ...
[9] => Array
(
[name] => Fitted Dress
[price] => $34.99
)
)
Fantastic! The $products array stores the scraped objects with the data of interest. All that remains is to export the scraped information in a human-readable format.
Step 4: Convert Your Data Into a CSV File
The PHP standard library provides everything needed to export the scraped data to a CSV file. Use fopen() to create a products.csv file. Next, iterate over $products and employ fputcsv() to convert each product object to a CSV record and append it to the output file:
// create the CSV output file
$csvFilePath = "products.csv";
$csvFile = fopen($csvFilePath, "w");
// write the header row
$header = ["name", "price"];
fputcsv($csvFile, $header);
// add each product to the CSV file
foreach ($products as $product) {
fputcsv($csvFile, $product);
}
// close the CSV file
fclose($csvFile);
Take a look at your final Panther script for web scraping:
<?php
use Symfony\Component\Panther\Client;
require_once("vendor/autoload.php");
// initialize a Chrome client instance
$client = Client::createChromeClient();
// connect to the target page
$crawler = $client->request("GET", "https://scrapingclub.com/exercise/list_infinite_scroll/");
// where to store the scraped data
$products = [];
// select all product HTML elements on the page,
// iterate over them, and apply the scraping logic
$crawler->filter(".post")->each(function ($productHTMLElement) use (&$products) {
// scraping logic
$name = $productHTMLElement->filter("h4")->eq(0)->text();
$price = $productHTMLElement->filter("h5")->eq(0)->text();
// instantiate a new product object
// and add it to the list
$product = [
"name" => $name,
"price" => $price,
];
$products[] = $product;
});
// create the CSV output file
$csvFilePath = "products.csv";
$csvFile = fopen($csvFilePath, "w");
// write the header row
$header = ["name", "price"];
fputcsv($csvFile, $header);
// add each product to the CSV file
foreach ($products as $product) {
fputcsv($csvFile, $product);
}
// close the CSV file
fclose($csvFile);
Launch it with the command below:
php src/scraper.php
After the script execution, a products.csv file will appear in the root folder of your project. Open it, and you'll see these records:
Congratulations! You now know the basics of web scraping with Panther.
Keep in mind that the current output involves only ten records. That’s because the page initially only shows a few products, and loads more via infinite scrolling. Go to the next section to learn how to scrape all products on the site!
Interacting With Web Pages in a Browser With Panther
Panther can reproduce many user interactions, such as waits, mouse movements, and more. Thanks to browser automation, your script appears to be a human user navigating the site, which helps you avoid anti-bot measures.
The interactions Panther can simulate include:
- Clicking elements and moving the mouse.
- Waiting for elements on the page to be present, contain text, become enabled, etc.
- Filling out input fields and submitting forms (see the short sketch below).
- Following links.
- Scrolling up and down the page.
- Taking screenshots.
The library offers built-in methods for executing most of these operations. You also have the executeScript() method for running a JavaScript script directly on the page. With both tools, you can simulate any user interaction.
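For example, here's a short sketch of the form interaction mentioned in the list above. It uses Panther's BrowserKit-style form API; the login URL and field names are hypothetical placeholders:
// hypothetical login page (URL and field names are placeholders)
$crawler = $client->request("GET", "https://example.com/login");
// select the form via its submit button and fill in the fields
$form = $crawler->selectButton("Log in")->form([
    "username" => "my-user",
    "password" => "my-password",
]);
// submit the form and get a crawler for the resulting page
$crawler = $client->submit($form);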
Let’s see how to retrieve all product data from the infinite scroll demo page and explore other popular Panther scraping interactions!
Scrolling
After the first load, the target page contains only ten products. When the user scrolls to the end of the page, the site loads new products dynamically. Panther doesn't have a method for simulating the scrolling interaction, so you need custom JavaScript logic.
This JavaScript snippet instructs the browser to scroll down the page 10 times, waiting 0.5 seconds between scrolls:
// scroll down the page 10 times
const scrolls = 10
let scrollCount = 0
// scroll down and then wait for 0.5s
const scrollInterval = setInterval(() => {
window.scrollTo(0, document.body.scrollHeight)
scrollCount++
if (scrollCount === scrolls) {
clearInterval(scrollInterval)
}
}, 500)
Store the script above in a string variable and pass it to the executeScript() method as follows:
$scrolling_script = <<<EOD
// scroll down the page 10 times
const scrolls = 10
let scrollCount = 0
// scroll down and then wait for 0.5s
const scrollInterval = setInterval(() => {
window.scrollTo(0, document.body.scrollHeight)
scrollCount++
if (scrollCount === scrolls) {
clearInterval(scrollInterval)
}
}, 500)
EOD;
$client->executeScript($scrolling_script);
Place the executeScript() instruction before the node selection logic. Otherwise, the page's DOM will not get updated and you'll still only see 10 products.
Instructing Panther to scroll down the page isn't enough. You also need to wait for the page to retrieve and render the products. To do so, stop the script execution for 10 seconds with sleep():
sleep(10);
Here's your new complete code:
<?php
use Symfony\Component\Panther\Client;
require_once("vendor/autoload.php");
// initialize a Chrome client instance
$client = Client::createChromeClient();
// connect to the target page
$crawler = $client->request("GET", "https://scrapingclub.com/exercise/list_infinite_scroll/");
// simulate the infinite scrolling interaction
$scrolling_script = <<<EOD
// scroll down the page 10 times
const scrolls = 10
let scrollCount = 0
// scroll down and then wait for 0.5s
const scrollInterval = setInterval(() => {
window.scrollTo(0, document.body.scrollHeight)
scrollCount++
if (scrollCount === scrolls) {
clearInterval(scrollInterval)
}
}, 500)
EOD;
// launch the JS script on the page
$client->executeScript($scrolling_script);
// wait 10 seconds for the new products to load
sleep(10);
// where to store the scraped data
$products = [];
// select all product HTML elements on the page,
// iterate over them, and apply the scraping logic
$crawler->filter(".post")->each(function ($productHTMLElement) use (&$products) {
// scraping logic
$name = $productHTMLElement->filter("h4")->eq(0)->text();
$price = $productHTMLElement->filter("h5")->eq(0)->text();
// instantiate a new product object
// and add it to the list
$product = [
"name" => $name,
"price" => $price,
];
$products[] = $product;
});
// create the CSV output file
$csvFilePath = "products.csv";
$csvFile = fopen($csvFilePath, "w");
// write the header row
$header = ["name", "price"];
fputcsv($csvFile, $header);
// add each product to the CSV file
foreach ($products as $product) {
fputcsv($csvFile, $product);
}
// close the CSV file
fclose($csvFile);
Execute the Panther scraping script again:
php src/scraper.php
The execution will take a while because of the sleep() instruction, so be patient.
This time, the products.csv output file will contain many more records than before.
Mission complete! You’ve just scraped all products from the target site.
Still, you can improve the script by waiting for the new product nodes to be on the page instead of using sleep().
Wait for Element
The current script uses a hard wait after the scroll-down interaction. That's a discouraged practice because a fixed wait makes the scraping logic flaky, leaving it vulnerable to network slowdowns, and slows the script down unnecessarily. Instead, use smart waits, which pause only until a specific event occurs, such as a given node appearing on the page.
The Panther browser client provides the waitFor() method to wait until a node is present on the page. Use it to wait up to 10 seconds for the 60th product to appear on the page:
$client->waitFor(".post:nth-child(60)", 10);
That line should replace the sleep() instruction since it leads to the same result. The scrolls will trigger some AJAX calls to retrieve new products. After that, the script will automatically wait for those new products to be rendered on the page.
The definitive Panther web scraping script will be:
<?php
use Symfony\Component\Panther\Client;
require_once("vendor/autoload.php");
// initialize a Chrome client instance
$client = Client::createChromeClient();
// connect to the target page
$crawler = $client->request("GET", "https://scrapingclub.com/exercise/list_infinite_scroll/");
// simulate the infinite scrolling interaction
$scrolling_script = <<<EOD
// scroll down the page 10 times
const scrolls = 10
let scrollCount = 0
// scroll down and then wait for 0.5s
const scrollInterval = setInterval(() => {
window.scrollTo(0, document.body.scrollHeight)
scrollCount++
if (scrollCount === scrolls) {
clearInterval(scrollInterval)
}
}, 500)
EOD;
// launch the JS script on the page
$client->executeScript($scrolling_script);
// wait up to 10 seconds for the 60th product
// to be on the page
$client->waitFor(".post:nth-child(60)", 10);
// where to store the scraped data
$products = [];
// select all product HTML elements on the page,
// iterate over them, and apply the scraping logic
$crawler->filter(".post")->each(function ($productHTMLElement) use (&$products) {
// scraping logic
$name = $productHTMLElement->filter("h4")->eq(0)->text();
$price = $productHTMLElement->filter("h5")->eq(0)->text();
// instantiate a new product object
// and add it to the list
$product = [
"name" => $name,
"price" => $price,
];
$products[] = $product;
});
// create the CSV output file
$csvFilePath = "products.csv";
$csvFile = fopen($csvFilePath, "w");
// write the header row
$header = ["name", "price"];
fputcsv($csvFile, $header);
// add each product to the CSV file
foreach ($products as $product) {
fputcsv($csvFile, $product);
}
// close the CSV file
fclose($csvFile);
Run it, and you'll get the same results as before, but much faster.
You now know how to extract data from each product on the site effectively and efficiently. It's time to explore other useful Panther interactions.
Wait for Page to Load
$client->request() automatically waits for the browser to trigger the load event on the page. It's fired only once the whole page has loaded, including stylesheets, scripts, iframes, and images.
Modern pages are so dynamic that listening to the load event may not be enough to tell if the page has finished loading. AJAX requests and dynamic interactions can still change the DOM. For such complex scenarios, Panther offers the following waiting methods (a combined sketch follows the list):
- $client->waitFor(): Wait for the specified element to be attached to the DOM.
- $client->waitForStaleness(): Wait for the specified element to be removed from the DOM.
- $client->waitForVisibility(): Wait for the specified element to become visible.
- $client->waitForInvisibility(): Wait for the specified element to become hidden.
- $client->waitForElementToContain(): Wait for the given element to contain the specified text.
- $client->waitForElementToNotContain(): Wait for the given element not to contain the given text.
- $client->waitForEnabled(): Wait for the given element to become enabled.
- $client->waitForDisabled(): Wait for the given element to become disabled.
- $client->waitForAttributeToContain(): Wait for the specified HTML attribute of an element to contain some content.
- $client->waitForAttributeToNotContain(): Wait for the specified HTML attribute of an element not to contain some content.
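As an illustration, the short sketch below combines a couple of these waits; the .loading-spinner and .results selectors are hypothetical:
// wait up to 5 seconds for a loading spinner to disappear (hypothetical selector)
$client->waitForInvisibility(".loading-spinner", 5);
// then wait up to 5 seconds for the results container to contain the word "results"
$client->waitForElementToContain(".results", "results", 5);
// the nodes are now safe to scrape
$items = $crawler->filter(".results .item")->each(
    fn ($node) => $node->text()
);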
Click Elements
The objects returned by filter() expose the click() method to simulate click interactions:
$crawler->filter("[type='submit']")->eq(0)->click();
This function tells the browser to send a mouse click event to the selected element. The browser will then run the onclick callback associated with the clicked node.
If the click() method triggers a page change (as in the snippet below), you'll have to adjust the parsing logic to the new DOM structure:
// click on the first product card on the page
$crawler->filter(".post")->eq(0)->click();
// you are now on the detail product page...
// new scraping logic...
// $crawler->filter(...)
Take a Screenshot
A webpage doesn't only contain text. Images are equally important and provide lots of useful information, such as visual insights into competitors’ sites.
$client has a takeScreenshot() method to take a screenshot of the current viewport:
// take a screenshot of the current viewport
$client->takeScreenshot("screenshot.png");
This instruction produces a screenshot.png file in the root folder of your project.
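Since the screenshot covers only the current viewport, one option for a larger capture (a sketch) is to start Chrome with a bigger window through the --window-size Chrome flag, passed as a browser argument to createChromeClient():
// start Chrome with a larger window so the viewport screenshot covers more content
$client = Client::createChromeClient(null, [
    "--window-size=1920,1080",
]);
$client->request("GET", "https://scrapingclub.com/exercise/list_infinite_scroll/");
$client->takeScreenshot("screenshot.png");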
Congratulations! You've mastered Panther web scraping interactions.
Avoid Getting Blocked When Scraping With Panther
The biggest challenge for Panther web scraping is getting blocked by anti-bot solutions. Avoiding them requires making your requests more natural and random. As a starting point, you should set a real-world User-Agent header and use proxies to change your exit IP.
To set a custom user agent, pass it in the --user-agent flag to createChromeClient():
$custom_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";
$client = Client::createChromeClient(null, [
"--user-agent=$custom_user_agent",
// other options...
]);
Learn more from our guide to User Agents for web scraping.
Setting a proxy follows a similar process and occurs through the --proxy-server flag. To follow this tutorial, retrieve the URL of a free proxy from a site like Free Proxy List and then pass it to Chrome:
$proxy_url = "234.36.2.15:6813";
$client = Client::createChromeClient(null, [
"--proxy-server=$proxy_url",
// other options...
]);
Use free proxies for learning purposes only, since they are short-lived and unreliable. By the time you read this tutorial, the proxy above will likely no longer work.
Those two approaches are just baby steps toward bypassing anti-bot systems. Full-fledged solutions like Cloudflare will still be able to detect your Panther scraping script and flag it as a bot. Verify that by targeting a Cloudflare-protected page from G2.com with the following script:
<?php
use Symfony\Component\Panther\Client;
require_once("vendor/autoload.php");
$custom_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";
$proxy_url = "234.36.2.15:6813";
// initialize a Chrome client instance
$client = Client::createChromeClient(null, [
"--user-agent=$custom_user_agent",
"--proxy-server=$proxy_url"
]);
// connect to the target page
$crawler = $client->request("GET", "https://www.g2.com/products/zapier/reviews");
// retrieve the HTML source code of the
// target page and print it
$html = $crawler->html();
echo $html;
This snippet will result in the following anti-bot page containing a CAPTCHA:
Should you give up? Of course not! The best way to solve this issue is to opt for a web scraping API, such as ZenRows. ZenRows seamlessly integrates with Panther to extend it with an AI-powered anti-bot toolkit that will help you successfully avoid all possible blocks.
ZenRows provides the same rendering and automation capabilities as Panther. Thus, you can replace Panther with ZenRows and a simple HTTP client.
Give Panther superpowers with ZenRows! Sign up for free, redeem your first 1,000 credits, and get to the Request Builder page below:
Assume you want to scrape the G2 page protected with Cloudflare. Follow these steps:
- Paste the target URL (https://www.g2.com/products/zapier/reviews) into the "URL to Scrape" input.
- Click on "Premium Proxy" to enable IP rotation (User-Agent rotation and the AI-powered anti-bot toolkit are included by default).
- Select the “cURL” option on the right and then the “API” mode to get the full URL of the ZenRows API.
Pass the generated URL to Panther's request() method:
<?php
use Symfony\Component\Panther\Client;
require_once("vendor/autoload.php");
// initialize a Chrome client instance
$client = Client::createChromeClient();
// connect to the target page
$crawler = $client->request("GET", "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fzapier%2Freviews&js_render=true&premium_proxy=true");
// retrieve the HTML source code of the
// target page and print it
$html = $crawler->html();
echo $html;
Run it, and it'll return the HTML source code of the target G2.com page:
<!DOCTYPE html>
<head>
<meta charset="utf-8" />
<link href="https://www.g2.com/assets/favicon-fdacc4208a68e8ae57a80bf869d155829f2400fa7dd128b9c9e60f07795c4915.ico" rel="shortcut icon" type="image/x-icon" />
<title>Airtable Reviews 2024: Details, Pricing, & Features | G2</title>
<!-- omitted for brevity ... -->
Brilliant! You’ve just integrated ZenRows into the Panther web scraping library.
But what about anti-bot measures such as form CAPTCHAs that could still stop your script? Good news: ZenRows not only extends Panther but can also replace it completely.
As a cloud service, ZenRows also brings significant savings compared to the cost of running browser automation infrastructure like Selenium or Panther yourself.
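For reference, here's a minimal sketch of that "simple HTTP client" route: calling the same ZenRows API endpoint with PHP's cURL extension instead of driving a browser with Panther (replace <YOUR_ZENROWS_API_KEY> with your own key):
<?php
// build the ZenRows API URL (same parameters as in the Panther example above)
$apiUrl = "https://api.zenrows.com/v1/"
    . "?apikey=<YOUR_ZENROWS_API_KEY>"
    . "&url=" . urlencode("https://www.g2.com/products/zapier/reviews")
    . "&js_render=true&premium_proxy=true";
// fetch the fully rendered, unblocked HTML with a plain cURL request
$ch = curl_init($apiUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
echo $html;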
Conclusion
In this Panther scraping guide, you learned the fundamentals of PHP browser automation.
You saw the basics and then explored more advanced techniques. Now you know:
- How to create a Composer project and install Panther.
- How to use the library to extract data from a dynamic content page.
- What user interactions you can simulate with Panther.
- The challenges of scraping online data and how to address them.
No matter how good your browser automation is, anti-bot systems can still block it. Bypass them all with ZenRows, a web scraping API with browser automation functionality, IP rotation, and the most advanced anti-scraping bypass available. Scraping data from any site has never been easier!