Web Scraping With PHP: Step-By-Step Tutorial

May 17, 2024 · 12 min read

Do you want to start your first web scraping project with PHP? You've come to the right place!

In this article, you'll learn to build a PHP web scraper, from the basics to more advanced techniques.

Let's go!

Is PHP Good for Web Scraping?

While PHP is more commonly used for web development, it's perfectly good for scraping thanks to its scripting ability and built-in HTTP clients like cURL, which support proxy configuration.

Most developers go for alternatives such as Python and JavaScript for web scraping due to their popularity, simplicity, and the vast ecosystem of scraping libraries. However, PHP offers a few advantages as well, e.g., excellent flexibility and ease of deployment.

On top of that, PHP has many web scraping libraries, including HTTP clients like Guzzle and cURL, HTML parsers like Simple DOM Parser and DiDOM, and headless browsers like the PHP WebDriver.
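Alongside these libraries, PHP's built-in DOMDocument (with DOMXPath) can parse HTML with no third-party dependencies. Here's a minimal, self-contained sketch; the HTML snippet is made up for illustration:

```php
<?php

// parse a small HTML snippet with PHP's built-in DOMDocument
$html = '<ul class="products"><li class="product"><h2>Abominable Hoodie</h2></li></ul>';

$dom = new DOMDocument();
// suppress warnings for fragments that are not full, valid documents
@$dom->loadHTML($html);

// DOMXPath is DOMDocument's native query mechanism
$xpath = new DOMXPath($dom);
$title = $xpath->query('//li[@class="product"]/h2')->item(0);

echo $title->textContent; // prints "Abominable Hoodie"
```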

Now that you know PHP can be used for web scraping, let's move to the step-by-step guide on how to do it.


Prerequisites

Before starting, you'll need a few installations and configurations. Let's go through them all.

Set Up the Environment

This tutorial uses PHP 8.3+, so the first step is to download and install the latest version from the official PHP download website. You can also develop your scraper using any IDE, but this tutorial uses VS Code on a Windows operating system.

Set Up a PHP Project

After the initial installation steps, create a new project folder with a scraper.php file and open that folder in VS Code.

You'll use PHP's built-in cURL as the HTTP client and parse HTML using the Simple HTML DOM Parser.

Download Simple HTML DOM Parser from SourceForge and extract the zipped file. Open the extracted folder, then copy and paste the simple_html_dom.php file into your project's root directory.

Your project structure should look like this:

Example
└── 📁PHP-scraping-tutorial
        └── scraper.php
        └── simple_html_dom.php
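Before running anything, you can optionally confirm that the cURL extension is enabled in your PHP build with a quick sanity check:

```php
<?php

// sanity check: cURL ships with most PHP builds but can be disabled
if (!function_exists('curl_init')) {
    exit("cURL is not enabled; add extension=curl to your php.ini\n");
}

echo "cURL " . curl_version()["version"] . " is ready\n";
```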

You're ready to scrape with PHP!

How to Scrape a Website in PHP

In this section, you'll extract product information from the ScrapingCourse e-commerce demo website, starting with the full-page HTML. You'll then parse that HTML to extract data from one element and scale your code to scrape an entire page before writing the extracted data to a CSV.

See what the target website looks like:

Scrapingcourse Ecommerce Store

Now, let's start by requesting the full-page HTML.

Step 1: Retrieve the HTML of Your Target Page

The first step is to obtain the target website's HTML using the cURL HTTP client, so you need to ensure that the client works as expected. Let's see how to achieve that.

Create a cURL instance and set up its options. Then, execute the cURL session to visit the target web page. Validate the response and print the HTML content:

scraper.php
<?php

// specify the target website's URL
$url = "https://scrapingcourse.com/ecommerce/";

// initialize a cURL session
$curl = curl_init();

// set the website URL
curl_setopt($curl, CURLOPT_URL, $url);

// return the response as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

// follow redirects
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); 

// ignore SSL verification (not recommended in production)
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

// execute the cURL session
$htmlContent = curl_exec($curl);

// check for errors
if ($htmlContent === false) {

    // handle the error
    $error = curl_error($curl);
    echo "cURL error: " . $error;
    exit;
}

// print the HTML content
echo $htmlContent;

// close cURL session
curl_close($curl);
?>

Run the code with the following command:

Terminal
php scraper.php

The code outputs the full-page HTML of the target website. Here's an abridged version of the result, showing the page title with other content omitted:

Output
<!DOCTYPE html>
<html lang="en-US">
<head>
    <!-- ... -->

    <title>Ecommerce Test Site to Learn Web Scraping – ScrapingCourse.com</title>

    <!-- ... -->
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count">Showing 1–16 of 188 results</p>
    <ul class="products columns-4">
        <!-- ... -->
      
        <li>
            <h2 class="woocommerce-loop-product__title">Abominable Hoodie</h2>
            <span class="price">
                <span class="woocommerce-Price-amount amount">
                    <bdi>
                        <span class="woocommerce-Price-currencySymbol">$</span>69.00
                    </bdi>
                </span>
            </span>
            <a aria-describedby="This product has multiple variants. The options may ...">Select options</a>
        </li>
      
        <!-- ... other products omitted for brevity -->
    </ul>
</body>
</html>

Your HTTP client works! Let's start extracting specific product information.

Step 2: Get Data From One Element

Getting data from one element requires parsing the requested HTML using the Simple HTML DOM Parser library and selecting the target element with a CSS selector.

Let's quickly inspect the target page's HTML to view its element attributes. Open the target website via a browser like Chrome, right-click the first product, and click Inspect:

Scrapingcourse First Element Inspection

To begin content extraction, include the Simple HTML DOM Parser library in your scraper file, request the website using cURL, and parse it with the parser library:

scraper.php
// include the Simple HTML DOM parser library
include_once("simple_html_dom.php");

// specify the target website's URL
$url = "https://scrapingcourse.com/ecommerce/";

// initialize a cURL session
$curl = curl_init();

// set the website URL
curl_setopt($curl, CURLOPT_URL, $url);

// return the response as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

// follow redirects
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); 

// ignore SSL verification (not recommended in production)
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

// execute the cURL session
$htmlContent = curl_exec($curl);

// check for errors
if ($htmlContent === false) {
    // handle the error
    $error = curl_error($curl);
    echo "cURL error: " . $error;
    exit;
}

// close cURL session
curl_close($curl);

// create a new Simple HTML DOM instance and parse the HTML
$html = str_get_html($htmlContent);

Next, use the parser's CSS selector support to extract the first matching elements from the current page:

scraper.php
// ...

// find the first product's name
$name = $html->find(".woocommerce-loop-product__title", 0);

// find the first product image
$image = $html->find("img", 0);

// find the first product's price
$price = $html->find("span.price", 0);

PHP doesn't convert HTML entities by default, so you need to decode the price string for the currency symbol to display correctly in your output. Finally, print the extracted data and clean up the parser resources:

scraper.php
// ...

// decode the HTML entity in the currency symbol
$decodedPrice = html_entity_decode($price->plaintext);

// print the extracted data
echo "Name: $name->plaintext \n";
echo "Price: $decodedPrice \n";
echo "Image URL: $image->src \n";

// clean up resources
$html->clear();

Combine the snippets, and your final code should look like this:

scraper.php
<?php

// include the Simple HTML DOM parser library
include_once("simple_html_dom.php");

// specify the target website's URL
$url = "https://scrapingcourse.com/ecommerce/";

// initialize a cURL session
$curl = curl_init();

// set the website URL
curl_setopt($curl, CURLOPT_URL, $url);

// return the response as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

// follow redirects
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); 

// ignore SSL verification (not recommended in production)
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

// execute the cURL session
$htmlContent = curl_exec($curl);

// check for errors
if ($htmlContent === false) {
    // handle the error
    $error = curl_error($curl);
    echo "cURL error: " . $error;
    exit;
}

// close cURL session
curl_close($curl);

// create a new Simple HTML DOM instance and parse the HTML
$html = str_get_html($htmlContent);

// find the first product's name
$name = $html->find(".woocommerce-loop-product__title", 0);

// find the first product image
$image = $html->find("img", 0);

// find the first product's price
$price = $html->find("span.price", 0);

// decode the HTML entity in the currency symbol
$decodedPrice = html_entity_decode($price->plaintext);

// print the extracted data
echo "Name: $name->plaintext \n";
echo "Price: $decodedPrice \n";
echo "Image URL: $image->src \n";

// clean up resources
$html->clear();
?>

The code extracts the first product's name, price, and image URL, as shown:

Output
Name: Abominable Hoodie
Price: $ 69.00
Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg

You can now extract data from one element using PHP. Let's scale that code to get all the content on that page.

Step 3: Get Data From All Elements

To get the specified data from all the elements on the target page, you'll extract product information from each parent element containing the target content.

It requires modifying the previous code by iterating through the product containers to extract each product's name, price, and image URL.

If you inspect the product containers, you'll see that each product is inside a list element:

Scrapingcourse Product Container Inspection

Modify the previous code to extract all the product containers on that page. Then, specify an empty array to collect the extracted data:

scraper.php
// ...

// obtain the product containers
$products = $html->find(".product");

// create an empty product array to collect the extracted data
$productData = array();

Use foreach to iterate through the containers and obtain the specified data from each parent element.

Confirm the presence of the product elements, decode the price currency symbol as done previously, and append the extracted data to the empty product data array. Print the product data array to view the scraped content. Then, clean up all parser resources:

scraper.php
// ...

// loop through the product container to extract its elements
foreach ($products as $product) {

    // find the name elements within the current product element
    $name = $product->find(".woocommerce-loop-product__title", 0);

    // find the image elements within the current product element
    $image = $product->find("img", 0);

    // find the price elements within the current product element
    $price = $product->find("span.price", 0);

    // check if the target elements exist with the required attributes
    if (
        $name && $price && $image 
        && isset($name->plaintext)
        && isset($price->plaintext) 
        && isset($image->src)
        
    ) {

        // decode the price symbol to $
        $decodedPrice = html_entity_decode($price->plaintext);

        // create an array of the extracted data
        $productInfo = array(
            "Name" => $name->plaintext,
            "Price" => $decodedPrice,
            "Image URL" => $image->src
        );

        // append the extracted data to the empty product array
        $productData[] = $productInfo;
    }
}

// print the extracted products
print_r($productData);

// clean up resources
$html->clear();

After modifying the previous code with the two snippets, your final code should look like this:

scraper.php
<?php

// include the Simple HTML DOM parser library
include_once("simple_html_dom.php");

// specify the target website's URL
$url = "https://scrapingcourse.com/ecommerce/";

// initialize a cURL session
$curl = curl_init();

// set the website URL
curl_setopt($curl, CURLOPT_URL, $url);

// return the response as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

// follow redirects
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); 

// ignore SSL verification (not recommended in production)
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

// execute the cURL session
$htmlContent = curl_exec($curl);

// check for errors
if ($htmlContent === false) {
    // handle the error
    $error = curl_error($curl);
    echo "cURL error: " . $error;
    exit;
}

// close cURL session
curl_close($curl);

// create a new Simple HTML DOM instance
$html = str_get_html($htmlContent);

// obtain the product containers
$products = $html->find(".product");

// create an empty product array to collect the extracted data
$productData = array();

// loop through the product container to extract its elements
foreach ($products as $product) {

    // find the name elements within the current product element
    $name = $product->find(".woocommerce-loop-product__title", 0);

    // find the image elements within the current product element
    $image = $product->find("img", 0);

    // find the price elements within the current product element
    $price = $product->find("span.price", 0);

    // check if the target elements exist with the required attributes
    if (
        $name && $price && $image 
        && isset($name->plaintext)
        && isset($price->plaintext) 
        && isset($image->src)
        
    ) {

        // decode the price symbol to $
        $decodedPrice = html_entity_decode($price->plaintext);

        // create an array of the extracted data
        $productInfo = array(
            "Name" => $name->plaintext,
            "Price" => $decodedPrice,
            "Image URL" => $image->src
        );

        // append the extracted data to the empty product array
        $productData[] = $productInfo;
    }
}

// print the extracted products
print_r($productData);

// clean up resources
$html->clear();
?>

The code above retrieves the names, prices, and image URLs of all the products on the target page into an array, as shown:

Output
(
    [0] => Array
        (
            [Name] => Abominable Hoodie
            [Price] => $ 69.00 
            [Image URL] => https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg
        )
        
    // ... other products omitted for brevity
        
    [15] => Array
        (
            [Name] => Artemis Running Short
            [Price] => $ 45.00
            [Image URL] => https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wsh04-black_main-324x324.jpg
        )
)

You've just scraped an entire web page using cURL and the Simple HTML DOM Parser in PHP. Let's take it further by writing the data into a CSV file.

Step 4: Export Your Data to a CSV File

Exporting the extracted data to a CSV lets you store it for further operations. Let's extend the previous code to achieve that.

Specify the CSV file path, open the file, and input the data headers. Write the extracted products to the CSV file using a foreach loop and close the file:

scraper.php
// ...

// define the path to the CSV file
$csvFilePath = "products.csv";

// open the CSV file for writing
$file = fopen($csvFilePath, "w");

// write the header row to the CSV file
fputcsv($file, array_keys($productData[0]));

// write each product's data to the CSV file
foreach ($productData as $product) {
    fputcsv($file, $product);
}

// close the CSV file
fclose($file);

// output a message indicating successful CSV creation
echo "CSV file created successfully: $csvFilePath";

Merge the snippet with the previous code. Your final code should look like this:

scraper.php
<?php

// include the Simple HTML DOM parser library -> download from source and paste its simple_html_dom.php file in your project folder
include_once("simple_html_dom.php");

// specify the target website's URL
$url = "https://scrapingcourse.com/ecommerce/";

// initialize a cURL session
$curl = curl_init();

// set the website URL
curl_setopt($curl, CURLOPT_URL, $url);

// return the response as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

// follow redirects
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); 

// ignore SSL verification (not recommended in production)
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

// execute the cURL session
$htmlContent = curl_exec($curl);

// check for errors
if ($htmlContent === false) {
    // handle the error
    $error = curl_error($curl);
    echo "cURL error: " . $error;
    exit;
}

// close cURL session
curl_close($curl);

// create a new Simple HTML DOM instance
$html = str_get_html($htmlContent);

// obtain the product containers
$products = $html->find(".product");

// create an empty product array to collect the extracted data
$productData = array();

// loop through the product container to extract its elements
foreach ($products as $product) {

    // find the name elements within the current product element
    $name = $product->find(".woocommerce-loop-product__title", 0);

    // find the image elements within the current product element
    $image = $product->find("img", 0);

    // find the price elements within the current product element
    $price = $product->find("span.price", 0);

    // check if the target elements exist with the required attributes
    if (
        $name && $price && $image 
        && isset($name->plaintext)
        && isset($price->plaintext) 
        && isset($image->src)
        
    ) {

        // decode the price symbol to $
        $decodedPrice = html_entity_decode($price->plaintext);

        // create an array of the extracted data
        $productInfo = array(
            "Name" => $name->plaintext,
            "Price" => $decodedPrice,
            "Image URL" => $image->src
        );

        // append the extracted data to the empty product array
        $productData[] = $productInfo;
    }
}

// define the path to the CSV file
$csvFilePath = "products.csv";

// open the CSV file for writing
$file = fopen($csvFilePath, "w");

// write the header row to the CSV file
fputcsv($file, array_keys($productData[0]));

// write each product's data to the CSV file
foreach ($productData as $product) {
    fputcsv($file, $product);
}

// close the CSV file
fclose($file);

// output a message indicating successful CSV creation
echo "CSV file created successfully: $csvFilePath";

// clean up resources
$html->clear();
?>

The code writes the extracted content to a CSV file as expected. See the result below.

Extracted Data in CSV File

Great job! Your scraper now saves extracted data into a CSV file. Want to take your web scraping skills to the next level? Keep reading to learn more advanced web scraping techniques.

Advanced Web Scraping Techniques With PHP

Web scraping at scale involves a few more advanced methods and concepts, including crawling, dealing with dynamic content, and bypassing anti-bot measures. In this section, you'll learn more about each of them.

Web Crawling With PHP

Web crawling involves following links to a website's other pages and scraping their content. It's an essential technique for scraping paginated websites, such as the e-commerce demo website you scraped in the previous section.

The scraper you built previously only extracts data from a single page. Let's scale it to crawl and scrape content from all pages.

First, inspect the target website to see how it handles pagination. Open the page via a browser like Chrome, right-click the next page button, and select Inspect to view its element:

Scrapingcourse Navigation Link Inspection

Then, include the parser library in your code, define the target URL, and specify an empty product array to collect the extracted data:

scraper.php
// include the Simple HTML DOM parser library
include_once("simple_html_dom.php");

// specify the target website's URL
$url = "https://scrapingcourse.com/ecommerce/";

// initialize an array to store all product data
$productData = array();

Define a scraper function that accepts a URL argument. Initialize a cURL instance, set cURL options, and execute a session to open the target website. Validate the response and close the cURL session:

scraper.php
// ...

// create a function to scrape product data from a given URL
function scraper($url) {

    // log the currently scraped page
    echo "Scraping page: $url\n";

    // initialize a cURL session
    $curl = curl_init();

    // set the website URL
    curl_setopt($curl, CURLOPT_URL, $url);

    // return the response as a string
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

    // follow redirects
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); 

    // ignore SSL verification (not recommended in production)
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

    // execute the cURL session
    $htmlContent = curl_exec($curl);

    // check for errors
    if ($htmlContent === false) {
        // handle the error
        $error = curl_error($curl);
        echo "cURL error: " . $error;
        exit;
    }

    // close cURL session
    curl_close($curl);
}

Now, parse the response HTML with a Simple HTML DOM Parser instance and extract all product containers:

scraper.php
// ...

// create a function to scrape product data from a given URL
function scraper($url) {

    // ...

    // create a new Simple HTML DOM instance
    $html = str_get_html($htmlContent);

    // obtain the product containers
    $products = $html->find(".product");
}

The next step is to navigate the website's pages.

Start a foreach loop to extract the target content from each product container, validate the presence of each product's information elements, and decode the price symbol as before. Append the collected data to the product data array. Then, find the next-page button element, extract its link, and call the scraper recursively on that link until the last page:

scraper.php
// ...

// create a function to scrape product data from a given URL
function scraper($url) {

    // ...

    // loop through the product container to extract its elements
    foreach ($products as $product) {

        // find the name elements within the current product element
        $name = $product->find(".woocommerce-loop-product__title", 0);

        // find the image elements within the current product element
        $image = $product->find("img", 0);

        // find the price elements within the current product element
        $price = $product->find("span.price", 0);

        // check if the target elements exist with the required attributes
        if (
            $name && $price && $image 
            && isset($name->plaintext)
            && isset($price->plaintext) 
            && isset($image->src)
            
        ) {

            // decode the price symbol to $
            $decodedPrice = html_entity_decode($price->plaintext);

            // create an array of the extracted data
            $productInfo = array(
                "Name" => $name->plaintext,
                "Price" => $decodedPrice,
                "Image URL" => $image->src
            );

            // append the extracted data to the product array
            global $productData;
            $productData[] = $productInfo;
        }
    }

    // check if there is a next page
    $nextPageLink = $html->find("a.next", 0);
    if ($nextPageLink) {
        $nextPageUrl = $nextPageLink->href;

        // scrape data from the next page
        scraper($nextPageUrl);
    }
}

Finally, call the scraper function with the initial URL to scrape the first page; recursion handles the subsequent pages. Print the product data array to view the extracted content:

scraper.php
// ...

// call the function to start scraping from the initial URL
scraper($url);

// print the extracted products
print_r($productData);

Your final code should look like this after combining all the snippets:

scraper.php
<?php

// include the Simple HTML DOM parser library
include_once("simple_html_dom.php");

// specify the target website's URL
$url = "https://scrapingcourse.com/ecommerce/";

// initialize an array to store all product data
$productData = array();

// create a function to scrape product data from a given URL
function scraper($url) {

    // log the currently scraped page
    echo "Scraping page: $url\n";

    // initialize a cURL session
    $curl = curl_init();

    // set the website URL
    curl_setopt($curl, CURLOPT_URL, $url);

    // return the response as a string
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

    // follow redirects
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); 

    // ignore SSL verification (not recommended in production)
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

    // execute the cURL session
    $htmlContent = curl_exec($curl);

    // check for errors
    if ($htmlContent === false) {
        // handle the error
        $error = curl_error($curl);
        echo "cURL error: " . $error;
        exit;
    }

    // close cURL session
    curl_close($curl);

    // create a new Simple HTML DOM instance
    $html = str_get_html($htmlContent);

    // obtain the product containers
    $products = $html->find(".product");

    // loop through the product container to extract its elements
    foreach ($products as $product) {

        // find the name elements within the current product element
        $name = $product->find(".woocommerce-loop-product__title", 0);

        // find the image elements within the current product element
        $image = $product->find("img", 0);

        // find the price elements within the current product element
        $price = $product->find("span.price", 0);

        // check if the target elements exist with the required attributes
        if (
            $name && $price && $image 
            && isset($name->plaintext)
            && isset($price->plaintext) 
            && isset($image->src)
            
        ) {

            // decode the price symbol to $
            $decodedPrice = html_entity_decode($price->plaintext);

            // create an array of the extracted data
            $productInfo = array(
                "Name" => $name->plaintext,
                "Price" => $decodedPrice,
                "Image URL" => $image->src
            );

            // append the extracted data to the product array
            global $productData;
            $productData[] = $productInfo;
        }
    }

    // check if there is a next page
    $nextPageLink = $html->find("a.next", 0);
    if ($nextPageLink) {
        $nextPageUrl = $nextPageLink->href;

        // scrape data from the next page
        scraper($nextPageUrl);
    }
}

// call the function to start scraping from the initial URL
scraper($url);

// print the extracted products
print_r($productData);
?>

The above code navigates all pages on the target website and scrapes the names, prices, and image URLs of all its products:

Output
(
    [0] => Array
        (
            [Name] => Abominable Hoodie
            [Price] => $ 69.00 
            [Image URL] => https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg
        )
        
    // ... other products omitted for brevity    
    
    [187] => Array
        (
            [Name] => Zoltan Gym Tee
            [Price] => $ 29.00
            [Image URL] => https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/ms06-blue_main-324x324.jpg
        )
)

Great job! Now, you know how to implement pagination with PHP's cURL and Simple HTML DOM Parser. Next, you'll learn how to avoid getting blocked while scraping with PHP.

Avoid Getting Blocked When Scraping With PHP

Anti-bot measures are a big challenge in web scraping, as many websites employ them to stop you from accessing and extracting their content. You'll need to bypass them to scrape without getting blocked.

You can avoid blocks by configuring your scraper to use premium web scraping proxies and rotate your IP address, preventing potential IP bans. Another option is to optimize your request headers to mimic a legitimate user.
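Here's a sketch of both ideas combined; the proxy URLs and User-Agent string below are placeholders, not working values:

```php
<?php

// sketch: configure a cURL handle with a rotating proxy and
// browser-like headers (proxy URLs below are placeholders)
$proxies = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
];

function makeStealthHandle(string $url, array $proxies)
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

    // rotate: pick a random proxy from the pool for each request
    curl_setopt($curl, CURLOPT_PROXY, $proxies[array_rand($proxies)]);

    // mimic a real browser with a User-Agent and common request headers
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36");
    curl_setopt($curl, CURLOPT_HTTPHEADER, [
        "Accept: text/html,application/xhtml+xml",
        "Accept-Language: en-US,en;q=0.9",
    ]);

    return $curl;
}

// no request is executed here; pass the handle to curl_exec() as before
$curl = makeStealthHandle("https://scrapingcourse.com/ecommerce/", $proxies);
```

Picking a different proxy per request spreads your traffic across IP addresses, which reduces the chance of any single address being rate-limited or banned.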

While these measures can increase your chance of evading detection, they're usually insufficient against advanced AI-powered anti-bot measures like Cloudflare, Akamai, and DataDome.

For example, the current scraper won't work with a Cloudflare-protected website like the G2 Reviews page.

Try it out by replacing the target URL with G2's URL:

scraper.php
<?php

// specify the target website's URL
$url = "https://www.g2.com/products/asana/reviews";

// initialize a cURL session
$curl = curl_init();

// set the website URL
curl_setopt($curl, CURLOPT_URL, $url);

// return the response as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

// follow redirects
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); 

// ignore SSL verification (not recommended in production)
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

// execute the cURL session
$htmlContent = curl_exec($curl);

// check for errors
if ($htmlContent === false) {

    // handle the error
    $error = curl_error($curl);
    echo "cURL error: " . $error;
    exit;
}

// print the HTML content
echo $htmlContent;

// close cURL session
curl_close($curl);
?>

The scraper's HTTP client (cURL) got blocked by Cloudflare Turnstile:

Output
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
  
    <!--  ...    -->
  
    <title>Attention Required! | Cloudflare</title>
</head>

The best way to avoid anti-bot measures while scraping with PHP or any other language is to use a web scraping API like ZenRows. It fixes your request headers, auto-rotates premium proxies, and bypasses CAPTCHAs and any other anti-bot system, allowing you to scrape any website without limitations.

Let's use ZenRows to scrape the G2 page that blocked you earlier and see how it works.

Sign up to open the ZenRows Request Builder. Paste the target URL in the link box, set the Boost mode to JS Rendering, and activate Premium Proxies. Select PHP as your preferred language and choose the API connection mode. Copy and paste the generated code into your scraper file.

ZenRows Request Builder

Here's a slightly modified version of the generated code:

scraper.php
<?php

// format the request parameters
$apiUrl = 
    "https://api.zenrows.com/v1/" .
    "?apikey=<YOUR_ZENROWS_API_KEY>" .
    "&url=" . urlencode("https://www.g2.com/products/asana/reviews") .
    "&js_render=true&premium_proxy=true";

// set cURL options
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $apiUrl);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// disable SSL verification
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

// execute the request
$response = curl_exec($ch);

// get the output
if ($response === false) {
    echo "Error: " . curl_error($ch);
} else {
    echo $response . PHP_EOL;
}

curl_close($ch);
?>

The code accesses the protected website and scrapes its full-page HTML. See the result below, showing the page title with omitted content:

Output
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
</head>
<body>
    <!-- other content omitted for brevity -->
</body>

Congratulations! You've just scraped a Cloudflare-protected website with the ZenRows API.

Scrape with a Headless Browser in PHP

Many websites use JavaScript to render content. Such websites require a headless browser like Selenium since ordinary HTML parsers like the Simple HTML DOM Parser can't execute JavaScript.

Let's scrape the ScrapingCourse infinite scrolling challenge with Selenium to see how it works. Below is a demonstration of how the page renders content:

infinite scrolling demo

To use Selenium in PHP, ensure you've downloaded and installed Composer on your computer. Create a new project folder and open your command prompt to that directory. Then, run the following command, pressing Enter through the on-screen prompts to keep the default settings:

Terminal
composer init

Install the Selenium WebDriver:

Terminal
composer require php-webdriver/webdriver

You'll also need the CSS Selector package to locate elements easily. Install it with Composer:

Terminal
composer require symfony/css-selector

Next, download the Selenium Server to your project directory. Note that it requires Java 11+ installed on your computer.

Start the Selenium server with the following command, replacing <version> with the version you downloaded:

Terminal
java -jar selenium-server-<version>.jar standalone --selenium-manager true

The above starts the Selenium server on port 4444.
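Before creating a WebDriver session, you can optionally confirm the server is actually up. The sketch below (an assumption based on Selenium Grid's standard /status endpoint, which returns JSON with a `value.ready` flag) polls that endpoint with cURL:

```php
<?php

// optional sanity check: returns true if the Selenium server at $host
// reports itself as ready, false if it is unreachable or not ready
function seleniumReady(string $host): bool {
    $ch = curl_init(rtrim($host, "/") . "/status");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 3);
    $body = curl_exec($ch);
    curl_close($ch);

    // the server is down or unreachable
    if ($body === false) {
        return false;
    }

    // read the ready flag from the JSON status response
    $json = json_decode($body, true);
    return (bool) ($json["value"]["ready"] ?? false);
}

echo seleniumReady("http://localhost:4444/")
    ? "Selenium is ready" . PHP_EOL
    : "Selenium is not reachable" . PHP_EOL;
?>
```

If this prints "Selenium is not reachable", double-check that the `java -jar` command above is still running in another terminal.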

Now, create a new "scraper.php" file in your project root folder, and let's develop the code to scrape product names, prices, and image URLs from the target website.

Import the required libraries, specify the local host address, and set the Chrome options to start the Chrome browser in headless mode. Then, instantiate a ChromeDriver with the local server address and your desired capabilities:

scraper.php
<?php

// declare a WebDriver namespace
namespace Facebook\WebDriver;

// import the required libraries
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Chrome\ChromeOptions;

require_once("vendor/autoload.php");

// specify the URL to the local Selenium Server
$host = "http://localhost:4444/";

// specify the desired capabilities
$capabilities = DesiredCapabilities::chrome();
$chromeOptions = new ChromeOptions();

// run Chrome in headless mode
$chromeOptions->addArguments(["--headless"]);

// register the Chrome options
$capabilities->setCapability(ChromeOptions::CAPABILITY_W3C, $chromeOptions);

// initialize a driver to control a Chrome instance
$driver = RemoteWebDriver::create($host, $capabilities);

Create a scraper function that accepts the driver as an argument and iterates through each product container to extract the target data. This function uses the cssSelector locator to find elements:

scraper.php
// ...

// function to extract data from the page
function scraper($driver) {

    // maximize the window
    $driver->manage()->window()->maximize();
    // extract the product container
    $products = $driver->findElements(WebDriverBy::cssSelector(".flex.flex-col.items-center.rounded-lg"));
    
    // loop through the product container to extract names and prices
    foreach ($products as $product) {
        $name = $product->findElement(WebDriverBy::cssSelector(".self-start.text-left.w-full > span:first-child"))->getText();
        $price = $product->findElement(WebDriverBy::cssSelector(".text-slate-600"))->getText();
        $image_url = $product->findElement(WebDriverBy::cssSelector("img"))->getAttribute("src");
        
        // output the extracted data
        echo "Name: $name\n";
        echo "Price: $price\n";
        echo "Image URL: $image_url\n";
    }
}

Open the target page and record its initial scroll height. Start a while loop that repeatedly scrolls to the bottom of the page, waits for new content to load, and compares the new scroll height with the previous one. Once the height stops changing, all content has loaded, so execute the scraper function and exit the loop:

scraper.php
// ...

// open the target page
$driver->get("https://www.scrapingcourse.com/infinite-scrolling");

$lastHeight = $driver->executeScript("return document.body.scrollHeight");
while (true) {
    // Scroll down to bottom
    $driver->executeScript("window.scrollTo(0, document.body.scrollHeight);");

    // wait for the page to load
    sleep(2);

    // get the new height and compare with last height
    $newHeight = $driver->executeScript("return document.body.scrollHeight");
    if ($newHeight == $lastHeight) {
        // extract data once all content has loaded
        scraper($driver);
        break;
    }
    $lastHeight = $newHeight;
}

// close the browser
$driver->quit();

Merge the snippets to get the following final code:

scraper.php
<?php
namespace Facebook\WebDriver;

// import the required libraries
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Chrome\ChromeOptions;

require_once("vendor/autoload.php");

// specify the URL to the local Selenium Server
$host = "http://localhost:4444/";

// specify the desired capabilities
$capabilities = DesiredCapabilities::chrome();
$chromeOptions = new ChromeOptions();

// run Chrome in headless mode
$chromeOptions->addArguments(["--headless"]);

// register the Chrome options
$capabilities->setCapability(ChromeOptions::CAPABILITY_W3C, $chromeOptions);

// initialize a driver to control a Chrome instance
$driver = RemoteWebDriver::create($host, $capabilities);

// function to extract data from the page
function scraper($driver) {

    // maximize the window
    $driver->manage()->window()->maximize();
    // extract the product container
    $products = $driver->findElements(WebDriverBy::cssSelector(".flex.flex-col.items-center.rounded-lg"));
    
    // loop through the product container to extract names and prices
    foreach ($products as $product) {
        $name = $product->findElement(WebDriverBy::cssSelector(".self-start.text-left.w-full > span:first-child"))->getText();
        $price = $product->findElement(WebDriverBy::cssSelector(".text-slate-600"))->getText();
        $image_url = $product->findElement(WebDriverBy::cssSelector("img"))->getAttribute("src");
        
        // output the extracted data
        echo "Name: $name\n";
        echo "Price: $price\n";
        echo "Image URL: $image_url\n";
    }
}

// open the target page
$driver->get("https://www.scrapingcourse.com/infinite-scrolling");

$lastHeight = $driver->executeScript("return document.body.scrollHeight");
while (true) {
    // Scroll down to bottom
    $driver->executeScript("window.scrollTo(0, document.body.scrollHeight);");

    // wait for the page to load
    sleep(2);

    // get the new height and compare with last height
    $newHeight = $driver->executeScript("return document.body.scrollHeight");
    if ($newHeight == $lastHeight) {
        // extract data once all content has loaded
        scraper($driver);
        break;
    }
    $lastHeight = $newHeight;
}

// close the browser
$driver->quit();
?>

The code scrolls the entire page and extracts all product names, prices, and image URLs:

Output
Name: Chaz Kangeroo Hoodie
Price: $52
Image URL: https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh01-gray_main.jpg

// ... other products omitted for brevity

Name: Breathe-Easy Tank
Price: $34
Image URL: https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wt09-white_main.jpg

You've just used Selenium to extract content from a page that loads content with infinite scrolling. Congratulations!

Conclusion

In this tutorial, you've learned the basic and advanced techniques of scraping the web with PHP. You now know how to:

  • Get the full-page HTML of a website with the cURL HTTP client.
  • Extract a single element from a target web page.
  • Scrape all the content on a web page.
  • Write extracted data to a CSV file.
  • Handle dynamic web pages, such as infinite scrolling with Selenium.
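As a quick recap of the CSV step, here's a minimal sketch using PHP's built-in fputcsv. The sample rows and the products.csv filename are illustrative placeholders:

```php
<?php

// sample scraped data: [name, price] pairs
$products = [
    ["Chaz Kangeroo Hoodie", "$52"],
    ["Breathe-Easy Tank", "$34"],
];

// open the output file for writing
$file = fopen("products.csv", "w");

// write the header row, then one row per product
fputcsv($file, ["Name", "Price"]);
foreach ($products as $row) {
    fputcsv($file, $row);
}

fclose($file);

echo "Wrote " . count($products) . " products to products.csv" . PHP_EOL;
?>
```

In a real scraper, you'd append to `$products` inside your extraction loop and write the file once at the end.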

Remember that most websites apply anti-bot measures, which may block your web scraper. We recommend integrating ZenRows to bypass them and scrape any website without getting blocked. Try ZenRows for free!

Ready to get started?

Up to 1,000 URLs for free are waiting for you