Web Scraping with Goutte PHP: Tutorial 2024

May 7, 2024 · 12 min read

Goutte web scraping is a popular approach to retrieving data from the web in PHP. The library simulates the behavior of a web browser and is now part of Symfony.

This guide will cover the basics and then explore more complex techniques. At the end of this tutorial, you'll know how to:

  • Scrape data from a web page with Goutte.
  • Crawl paginated pages.
  • Avoid getting blocked by anti-bot systems.
  • Deal with dynamic content.

Let's dive in!

Why Use Goutte PHP for Web Scraping?

Goutte is a powerful PHP web scraping and crawling library with thousands of stars on GitHub. The package has become popular thanks to its intuitive browser-like API, which makes it easier to extract data from HTML/XML web pages.

As covered in the docs, Goutte is now deprecated. That doesn't mean the library is no longer usable. Quite the opposite: Goutte is now a thin proxy for the HttpBrowser class from the Symfony BrowserKit component.

Goutte is part of Symfony and remains one of the most useful libraries for PHP web scraping. In the next section, you’ll learn how to use it.

Prerequisites

Follow the instructions below and set up a PHP environment for web scraping in Goutte.

Create the Project

Before getting started, make sure you meet the following prerequisites:

  • The latest version of PHP installed on your machine.
  • Composer, the PHP package manager.

If you're missing any of these components, download it from its official website and follow the installation wizard.

You now have everything you need to initialize a PHP Composer project. Create a folder for your Goutte web scraping project and enter it in the terminal:

Terminal
mkdir goutte-scraper
cd goutte-scraper

Then, launch the init command to create a new Composer project inside it. Follow the wizard and answer the questions as required:

Terminal
composer init

Perfect! goutte-scraper now contains a new Composer project.

Create a /src folder in the project, add a scraper.php file to it, and initialize it with the code below. The first line contains the autoload import required by Composer, followed by a simple log instruction:

scraper.php
<?php
require_once("vendor/autoload.php");

echo "Hello, World!";

You can run the PHP script with this command:

Terminal
php src/scraper.php

That should produce the following output:

Output
Hello, World!

Here we go! Your PHP setup is ready.

Install Goutte

Since Goutte is deprecated for standalone usage, you shouldn't install it directly. Instead, you need to use the BrowserKit and HttpClient components from Symfony. These two packages wrap Goutte and provide the same experience.

Install them with this Composer command:

Terminal
composer require symfony/http-client symfony/browser-kit

Then, import them in your scraper.php by adding the following two lines at the top of the file:

scraper.php
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

Your scraper.php file is ready to become a Goutte web scraping script:

scraper.php
<?php
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

require_once("vendor/autoload.php");

// scraping logic...

Learn how Goutte allows you to retrieve data from the Web in the next section.

Tutorial: Your First Web Scraper with Goutte

In this section, you'll use Goutte to extract all product data from an e-commerce website. The target site will be ScrapeMe, a platform with a paginated list of Pokémon products:

Scrapeme Homepage

Let’s follow the steps below to perform web scraping with Goutte!

Step 1: Get the HTML of Your Target Page

To start scraping a webpage, you need to connect to it and retrieve its HTML by making an HTTP GET request to the target page.

Initialize the Goutte-powered browser client:

scraper.php
// initialize a browser-like HTTP client
$browser = new HttpBrowser(HttpClient::create());

Then, connect to the desired page using the request() method. This returns a Crawler object that exposes the Goutte API required for web scraping:

scraper.php
$crawler = $browser->request("GET", "https://scrapeme.live/shop/");

Next, access the server response and its HTML content:

scraper.php
// get the response returned by the server
$response = $browser->getResponse();

// extract the HTML content and print it
$html = $response->getContent();
echo $html;

This is what your current scraper.php file should look like:

scraper.php
<?php
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

require_once("vendor/autoload.php");

// initialize a browser-like HTTP client
$browser = new HttpBrowser(HttpClient::create());

// make a GET request to the target site
$crawler = $browser->request("GET", "https://scrapeme.live/shop/");

// get the response returned by the server
$response = $browser->getResponse();

// extract the HTML content and print it
$html = $response->getContent();
echo $html;

The script will produce the following output:

Output
<!doctype html>
<html lang="en-GB">
 <head>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=2.0" />
  <link rel="profile" href="http://gmpg.org/xfn/11" />
  <link rel="pingback" href="https://scrapeme.live/xmlrpc.php" />
  <title>Products – ScrapeMe</title>

Fantastic! Your Goutte web scraping script connects to the target page. Now, get ready to extract some data.

Step 2: Extract Data from One Element

To collect data from HTML elements, you must first isolate them with an effective node selection strategy. To devise it, get familiar with the HTML of the target page.

Visit the target page in the browser and inspect a product HTML node with the DevTools:

DevTools Inspection

Expand the HTML code and analyze it. Note that you can select each product with the CSS selector that follows:

scraper.php
li.product

If you're not familiar with this syntax, li is the tag of the HTML element and product is its class.
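Behind the scenes, the Symfony CssSelector component translates CSS selectors like this one into XPath. If you're curious, you can reproduce the same selection with PHP's built-in DOM extension — a standalone sketch on an invented HTML fragment, not the real ScrapeMe markup:

```php
<?php
// a minimal HTML fragment standing in for the real page (hypothetical markup)
$html = <<<HTML
<ul>
  <li class="product">Bulbasaur</li>
  <li class="product featured">Ivysaur</li>
  <li class="ad">Not a product</li>
</ul>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);

// "li.product" in XPath: an <li> whose class attribute contains the "product" token
$xpath = new DOMXPath($doc);
$nodes = $xpath->query(
    "//li[contains(concat(' ', normalize-space(@class), ' '), ' product ')]"
);

echo $nodes->length, PHP_EOL; // 2: "ad" is excluded, "product featured" still matches
```

The token-based XPath expression is what makes `li.product` match elements with multiple classes, such as `class="product featured"`.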

Given a product HTML element, you can extract:

  • The URL from the <a> node.
  • The image URL from the <img> node.
  • The name from the <h2> node.
  • The price from the <span> node.

Before implementing the scraping logic, you have to install the CssSelector Symfony component:

Terminal
composer require symfony/css-selector

Call the filter() method on $crawler to apply a CSS selector to the page. It returns a new Crawler containing all nodes that match the specified selection strategy. Select the first product HTML element with the eq(0) method and then use text() and attr() to extract data from it:

scraper.php
// select the first product HTML element on the page
$productHTMLElement = $crawler->filter("li.product")->eq(0);

// scraping logic
$url = $productHTMLElement->filter("a")->eq(0)->attr("href");
$image = $productHTMLElement->filter("img")->eq(0)->attr("src");
$name = $productHTMLElement->filter("h2")->eq(0)->text();
$price = $productHTMLElement->filter("span")->eq(0)->text();

You can finally print the scraped data in the terminal with:

scraper.php
echo $url, PHP_EOL;
echo $image, PHP_EOL;
echo $name, PHP_EOL;
echo $price, PHP_EOL;

Integrate the above logic in scraper.php, and you'll get:

scraper.php
<?php
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

require_once("vendor/autoload.php");

// initialize a browser-like HTTP client
$browser = new HttpBrowser(HttpClient::create());

// make a GET request to the target site
$crawler = $browser->request("GET", "https://scrapeme.live/shop/");

// select the first product HTML element on the page
$productHTMLElement = $crawler->filter("li.product")->eq(0);

// scraping logic
$url = $productHTMLElement->filter("a")->eq(0)->attr("href");
$image = $productHTMLElement->filter("img")->eq(0)->attr("src");
$name = $productHTMLElement->filter("h2")->eq(0)->text();
$price = $productHTMLElement->filter("span")->eq(0)->text();

// log the scraped data
echo $url, PHP_EOL;
echo $image, PHP_EOL;
echo $name, PHP_EOL;
echo $price, PHP_EOL;

Launch the Goutte web scraping script, and it'll produce:

Output
https://scrapeme.live/shop/Bulbasaur/
https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png
Bulbasaur
£63.00

Awesome! You’ve just retrieved the data of interest from a single product HTML node on the page. Learn how to scrape all products in the next section.
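Note that the scraped price is a plain string that still contains the currency symbol. If you ever need it as a number (for sorting or aggregation, say), a small helper — a sketch, not part of the tutorial's scraper — can strip the non-numeric characters:

```php
<?php
// convert a scraped price string like "£63.00" into a float (helper sketch)
function parsePrice(string $rawPrice): float
{
    // keep only digits, the decimal point, and an optional minus sign
    $numeric = preg_replace("/[^0-9.\-]/", "", $rawPrice);
    return (float) $numeric;
}

echo parsePrice("£63.00"), PHP_EOL;    // 63
echo parsePrice("£1,159.50"), PHP_EOL; // 1159.5
```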

Step 3: Extract Data from Multiple Elements

The target web page contains multiple products, not just one. Initialize a new array to store all products:

scraper.php
$products = [];

At the end of the script, this array will store all scraped data objects.

Now, remove eq(0) from the first filter() instruction and use each() to iterate over all products. For each product node, apply the scraping logic, instantiate a new object, and add it to the list:

scraper.php
$crawler->filter("li.product")->each(function ($productHTMLElement) use (&$products) {
    // scraping logic
    $url = $productHTMLElement->filter("a")->eq(0)->attr("href");
    $image = $productHTMLElement->filter("img")->eq(0)->attr("src");
    $name = $productHTMLElement->filter("h2")->eq(0)->text();
    $price = $productHTMLElement->filter("span")->eq(0)->text();

    // instantiate a new product object
    $product = [
        "url" => $url,
        "image" => $image,
        "name" => $name,
        "price" => $price
    ];
    // add it to the list
    $products[] = $product;
});

Verify that the Goutte scraping logic above works with this log:

scraper.php
print_r($products);

The scraper.php script should currently contain:

scraper.php
<?php
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

require_once("vendor/autoload.php");

// initialize a browser-like HTTP client
$browser = new HttpBrowser(HttpClient::create());

// make a GET request to the target site
$crawler = $browser->request("GET", "https://scrapeme.live/shop/");

// where to store the scraped data
$products = [];

// select all product HTML elements on the page,
// iterate over them, and scrape them
$crawler->filter("li.product")->each(function ($productHTMLElement) use (&$products) {
    // scraping logic
    $url = $productHTMLElement->filter("a")->eq(0)->attr("href");
    $image = $productHTMLElement->filter("img")->eq(0)->attr("src");
    $name = $productHTMLElement->filter("h2")->eq(0)->text();
    $price = $productHTMLElement->filter("span")->eq(0)->text();

    // instantiate a new product object
    $product = [
        "url" => $url,
        "image" => $image,
        "name" => $name,
        "price" => $price
    ];
    // add it to the list
    $products[] = $product;
});

// log the scraped data
print_r($products);

Run it, and it'll generate this output:

Output
Array
(
    [0] => Array
        (
            [url] => https://scrapeme.live/shop/Bulbasaur/
            [image] => https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png
            [name] => Bulbasaur
            [price] => £63.00
        )

    // omitted for brevity...

    [15] => Array
        (
            [url] => https://scrapeme.live/shop/Pidgey/
            [image] => https://scrapeme.live/wp-content/uploads/2018/08/016-350x350.png
            [name] => Pidgey
            [price] => £159.00
        )

)

Terrific! The $products array stores the scraped objects with the desired data. Now you need to export this data to a readable format.

Step 4: Convert Scraped Data Into a CSV File

The PHP standard library provides all you need to create a CSV file and fill it with the scraped data. Use fopen() to create a products.csv file and populate it with fputcsv(). This will convert each product array object to a CSV record and append it to the output file:

scraper.php
// create the output CSV file
$csvFile = fopen("products.csv", "w");

// write the header row
$header = ["url", "image", "name", "price"];
fputcsv($csvFile, $header);

// add each product to the CSV file
foreach ($products as $product) {
    fputcsv($csvFile, $product);
}

// close the CSV file
fclose($csvFile);

Put it all together, and you'll get:

scraper.php
<?php
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

require_once("vendor/autoload.php");

// initialize a browser-like HTTP client
$browser = new HttpBrowser(HttpClient::create());

// make a GET request to the target site
$crawler = $browser->request("GET", "https://scrapeme.live/shop/");

// where to store the scraped data
$products = [];

// select all product HTML elements on the page,
// iterate over them, and scrape them
$crawler->filter("li.product")->each(function ($productHTMLElement) use (&$products) {
    // scraping logic
    $url = $productHTMLElement->filter("a")->eq(0)->attr("href");
    $image = $productHTMLElement->filter("img")->eq(0)->attr("src");
    $name = $productHTMLElement->filter("h2")->eq(0)->text();
    $price = $productHTMLElement->filter("span")->eq(0)->text();

    // instantiate a new product object
    $product = [
        "url" => $url,
        "image" => $image,
        "name" => $name,
        "price" => $price
    ];
    // add it to the list
    $products[] = $product;
});

// create the output CSV file
$csvFile = fopen("products.csv", "w");

// write the header row
$header = ["url", "image", "name", "price"];
fputcsv($csvFile, $header);

// add each product to the CSV file
foreach ($products as $product) {
    fputcsv($csvFile, $product);
}

// close the CSV file
fclose($csvFile);

Execute the script:

Terminal
php src/scraper.php

After the script execution, a products.csv file will appear in the project's folder. Open it, and you'll see:

CSV File

Et voilà! You just built a Goutte web scraping script!

Now, let’s move on to more advanced techniques.

Advanced Web Scraping Techniques with Goutte

Now that you know the basics of web scraping with Goutte, you're ready to learn about handling pagination, scraping dynamic content, and bypassing anti-bot systems. Let’s go!

Handle Pagination

The current script retrieves data from one web page. However, in most use cases, you’ll need more data from your target website.

What if you wanted to retrieve all the products? That's where web crawling comes in!

Web crawling is the process of automatically discovering web pages. Learn more in our guide on web crawling vs web scraping.

To implement web crawling in Goutte, you have to:

  1. Connect to a web page on the destination site.
  2. Extract the URLs from the pagination link nodes on the page and add them to an array.
  3. Repeat the cycle on a new page read from the array.

That loop only stops when there are no more pages to discover. Since this Goutte script is just a demo, let's limit the pages to crawl to 5:

scraper.php
// number of pages scraped
$pageCounter = 1;

// maximum number of pages to scrape
$pageLimit = 5;

You already know how to connect to a webpage in Goutte. The next step is to learn how to extract URLs from pagination link elements. Inspect their HTML nodes:

Inspect HTML Nodes

Here, you can see that you can select them all with this CSS selector:

scraper.php
a.page-numbers

Bear in mind that crawling a site isn't as easy as extracting links and following them blindly. You’d risk visiting the same pages multiple times. Avoid that by keeping track of the pages you have already accessed with two extra data structures:

  • pagesDiscovered: An array used as a set, storing all the URLs discovered during crawling.
  • pagesToScrape: An array used as a queue, storing the URLs of the pages the scraper still has to visit.
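The interplay between these two arrays is easier to see in isolation. The following toy crawl replaces real HTTP requests with a hard-coded link map (invented for illustration), but the discovered/to-scrape bookkeeping is the same as in the real script:

```php
<?php
// a fake site: each page lists the pagination links it exposes (invented data)
$links = [
    "/page/1/" => ["/page/2/", "/page/3/"],
    "/page/2/" => ["/page/1/", "/page/3/"],
    "/page/3/" => ["/page/1/", "/page/2/"],
];

$pagesDiscovered = ["/page/1/"]; // set: every URL ever seen
$pagesToScrape = ["/page/1/"];   // queue: URLs still waiting for a visit
$visited = [];

while (count($pagesToScrape) != 0) {
    // take the next URL from the front of the queue
    $pageUrl = array_shift($pagesToScrape);
    $visited[] = $pageUrl;

    foreach ($links[$pageUrl] as $newUrl) {
        // only enqueue URLs never seen before
        if (!in_array($newUrl, $pagesDiscovered)) {
            $pagesDiscovered[] = $newUrl;
            $pagesToScrape[] = $newUrl;
        }
    }
}

echo implode(" ", $visited), PHP_EOL; // /page/1/ /page/2/ /page/3/
```

Even though every page links back to the others, each one is visited exactly once thanks to the set check.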

Initialize both with the URL of the first product pagination page:

scraper.php
// the first page to visit in the crawling logic
$firstPageToScrape = "https://scrapeme.live/shop/page/1/";

// the Set of pages discovered during the crawling logic
$pagesDiscovered = [$firstPageToScrape];
// the list of remaining pages to scrape
$pagesToScrape = [$firstPageToScrape]; 

Next, implement the crawling logic as explained earlier with the following while loop:

scraper.php
while (count($pagesToScrape) != 0 && $pageCounter <= $pageLimit) {
    // retrieve the next URL to visit
    $pageUrl = array_shift($pagesToScrape);

    echo $pageUrl, PHP_EOL;

    // connect to the current page
    $crawler = $browser->request("GET", $pageUrl);

    // crawling logic
    $crawler->filter("a.page-numbers")->each(function ($paginationHTMLElement) use (&$pagesDiscovered, &$pagesToScrape) {
        // extract the current pagination URL
        $newPaginationUrl = $paginationHTMLElement->attr("href");
        // if the page discovered is new
        if (!in_array($newPaginationUrl, $pagesDiscovered)) {
            // if the page discovered needs to be scraped
            if (!in_array($newPaginationUrl, $pagesToScrape)) {
                $pagesToScrape[] = $newPaginationUrl;
            }
            $pagesDiscovered[] = $newPaginationUrl;
        }
    });

    // scraping logic...

    // increment the iterator counter
    $pageCounter++;
}

Extend scraper.php with the crawling logic above, and you'll have the final code:

scraper.php
<?php
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

require_once("vendor/autoload.php");

// initialize a browser-like HTTP client
$browser = new HttpBrowser(HttpClient::create());

// where to store the scraped data
$products = [];

// number of pages scraped
$pageCounter = 1;

// maximum number of pages to scrape
$pageLimit = 5;

// the first page to visit in the crawling logic
$firstPageToScrape = "https://scrapeme.live/shop/page/1/";

// the Set of pages discovered during the crawling logic
$pagesDiscovered = [$firstPageToScrape];
// the list of remaining pages to scrape
$pagesToScrape = [$firstPageToScrape];

// iterate until there are no pages to scrape
// or the limit is hit
while (count($pagesToScrape) != 0 && $pageCounter <= $pageLimit) {
    // retrieve the next URL to visit
    $pageUrl = array_shift($pagesToScrape);

    echo $pageUrl, PHP_EOL;

    // connect to the current page
    $crawler = $browser->request("GET", $pageUrl);

    // crawling logic
    $crawler->filter("a.page-numbers")->each(function ($paginationHTMLElement) use (&$pagesDiscovered, &$pagesToScrape) {
        // extract the current pagination URL
        $newPaginationUrl = $paginationHTMLElement->attr("href");
        // if the page discovered is new
        if (!in_array($newPaginationUrl, $pagesDiscovered)) {
            // if the page discovered needs to be scraped
            if (!in_array($newPaginationUrl, $pagesToScrape)) {
                $pagesToScrape[] = $newPaginationUrl;
            }
            $pagesDiscovered[] = $newPaginationUrl;
        }
    });

    // scraping logic
    $crawler->filter("li.product")->each(function ($productHTMLElement) use (&$products) {
        $url = $productHTMLElement->filter("a")->eq(0)->attr("href");
        $image = $productHTMLElement->filter("img")->eq(0)->attr("src");
        $name = $productHTMLElement->filter("h2")->eq(0)->text();
        $price = $productHTMLElement->filter("span")->eq(0)->text();

        // instantiate a new product object
        $product = [
            "url" => $url,
            "image" => $image,
            "name" => $name,
            "price" => $price
        ];
        // add it to the list
        $products[] = $product;
    });

    // increment the iterator counter
    $pageCounter++;
}


// create the output CSV file
$csvFile = fopen("products.csv", "w");

// write the header row
$header = ["url", "image", "name", "price"];
fputcsv($csvFile, $header);

// add each product to the CSV file
foreach ($products as $product) {
    fputcsv($csvFile, $product);
}

// close the CSV file
fclose($csvFile);

Launch the Goutte web scraping script:

Terminal
php src/scraper.php

The scraper will take a bit longer than before because it now has to go through 5 pages. This time, the products.csv file generated by the script will contain more records:

Updated CSV

Congrats! You’ve just learned how to perform web crawling and web scraping in Goutte!

Avoid Getting Blocked When Scraping With Goutte

With data being one of companies' most valuable assets, more and more sites are adopting anti-bot measures. These technologies can detect and block automated scripts, such as your Goutte scraper.

There are a few tips and tricks to perform web scraping without getting blocked. However, bypassing all anti-bot systems isn't easy. Advanced solutions such as Cloudflare will still be able to block your script.

Verify that by trying to retrieve the HTML of a G2 webpage protected by Cloudflare:

scraper.php
<?php
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

require_once("vendor/autoload.php");

// initialize a browser-like HTTP client
$browser = new HttpBrowser(HttpClient::create());

// make a GET request to the target site
$crawler = $browser->request("GET", "https://www.g2.com/products/zapier/reviews");

// get the response returned by the server
$response = $browser->getResponse();

// extract the HTML content and print it
$html = $response->getContent();
echo $html;

The Goutte web scraping script above will print the following 403 Forbidden error page:

Output
<!doctype html>
<html class="no-js" lang="en-US">
 <head> 
  <title>Attention Required! | Cloudflare</title> 
  <meta charset="UTF-8"> 
  <!-- omitted for brevity... -->

The most efficient solution to this problem is a web scraping API, such as ZenRows. It provides a top-notch anti-bot toolkit to bypass any block, along with extras such as IP and User-Agent rotation and CAPTCHA handling.

Use ZenRows in Goutte for maximum effectiveness. Sign up for free to get your first 1,000 credits, and then reach the Request Builder page:

ZenRows Request Builder

Assume you want to scrape the Cloudflare-protected G2.com page used before. Follow these steps:

  1. Paste the target URL (https://www.g2.com/products/zapier/reviews) into the "URL to Scrape" input.
  2. Click on "Premium Proxy" to enable IP rotation.
  3. Enable the "JS Rendering" feature (by default, User-Agent rotation and the AI-powered anti-bot toolkit are included).
  4. Select the “cURL” option on the right and then the “API” mode to get the full URL of the ZenRows API.
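If you prefer not to copy the URL from the Request Builder, you can also assemble it in PHP. The key detail is that the target URL must be query-string encoded, which http_build_query() handles; the API key below is a placeholder:

```php
<?php
// build the ZenRows API URL programmatically (placeholder API key)
$apiUrl = "https://api.zenrows.com/v1/?" . http_build_query([
    "apikey" => "<YOUR_ZENROWS_API_KEY>",
    "url" => "https://www.g2.com/products/zapier/reviews",
    "js_render" => "true",
    "premium_proxy" => "true",
]);

echo $apiUrl, PHP_EOL;
```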

Pass the generated URL as an argument in the request() method:

scraper.php
<?php
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

require_once("vendor/autoload.php");

// initialize a browser-like HTTP client
$browser = new HttpBrowser(HttpClient::create());

// make a GET request to the target site
// through ZenRows
$crawler = $browser->request("GET", "https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fzapier%2Freviews&js_render=true&premium_proxy=true");

// get the response returned by the server
$response = $browser->getResponse();

// extract the HTML content and print it
$html = $response->getContent();
echo $html;

Execute your scraping script again. This time, it'll print the source HTML code of the G2 page:

Output
<!DOCTYPE html>
<head>
  <meta charset="utf-8" />
  <link href="https://www.g2.com/assets/favicon-fdacc4208a68e8ae57a80bf869d155829f2400fa7dd128b9c9e60f07795c4915.ico" rel="shortcut icon" type="image/x-icon" />
  <title>Zapier Reviews 2024: Details, Pricing, &amp; Features | G2</title>
  <!-- omitted for brevity ... -->

Scrape Dynamic Content With Goutte

As mentioned before, Goutte provides a browser-like API. For example, selecting a link element gives you access to the click() method:

scraper.php
// select the link that contains the "Bulbasaur" text
// and click on it
$link = $crawler->selectLink("Bulbasaur")->link();
$crawler = $browser->click($link);

Similarly, you can fill out and submit a form with this syntax:

scraper.php
// select the "Sign-in" form
$form = $crawler->selectButton("Sign in")->form();
// fill out the form and submit it
$crawler = $browser->submit($form, ["login" => "<YOUR_USERNAME>", "password" => "<YOUR_PASSWORD>"]);

As you can see, writing Goutte web scraping logic is intuitive. At the same time, Goutte doesn't actually execute those instructions in a real browser. The browser-like API is just a convention used to simplify coding.

The consequence is that Goutte can only scrape static HTML pages. To retrieve data from dynamic pages that need JavaScript execution, you have to use a browser automation tool, the most popular of which is Selenium.

Learn how to use it in our guide on web scraping with Selenium PHP.

Conclusion

This tutorial guided you through the process of web scraping in Goutte. After learning both the fundamentals and the more advanced tricks, you've become a Goutte web scraping expert!

Goutte is a useful library for web scraping in PHP, especially for static sites. Its browser-like API makes it easy to retrieve data from a page. However, anti-scraping solutions that can block your script can still pose a huge challenge. The solution is ZenRows, a scraping API with the most effective anti-bot bypass capabilities. Extracting online data from any web page has never been easier!

Ready to get started?

Up to 1,000 URLs for free are waiting for you