
Web Crawler With PHP: Step-by-Step Tutorial

Yuvraj Chandra
January 30, 2025 · 8 min read

Do you want to develop your PHP scraper into a full-fledged web crawler? You're in the right place!

In this tutorial, you'll learn basic and advanced techniques for building and optimizing a PHP web crawler.

Let's go!

What Is Web Crawling?

Web crawling involves discovering and following links automatically across web pages. Unlike web scraping, which collects data directly from specific web pages, the primary goal of crawling is to track and navigate links. 

That said, crawling can be a subset of scraping, as it usually includes collecting data from visited links. To dive deeper into the differences and connections between these techniques, check out our complete guide on web crawling vs. web scraping.

Note that you'll need to consider factors such as the crawl depth and frequency during crawling to avoid endless execution and potential bans.

Now that you have a better understanding of what web crawling is, let's dive into the tutorial.

Build Your First PHP Web Crawler

In this tutorial, you'll crawl the E-commerce Challenge page using cURL and PHP Simple HTML DOM. Here's what the page looks like:

ScrapingCourse.com Ecommerce homepage

The demo website features many links, including carts, categories, paginated product pages, and more. Your PHP web crawler will follow some of these links and scrape specific product information, such as product names, prices, and images.

Prerequisites for PHP Crawling

Before you begin, ensure you set up the following requirements.

  • PHP: This tutorial uses PHP 8+. If you haven't already, download and install the latest version from the official download site.
  • Code Editor: This tutorial uses VS Code on a Windows machine. However, you can use any suitable code editor.
  • cURL and PHP Simple HTML DOM Parser: You'll handle HTTP requests using PHP's built-in cURL package and parse HTML using the Simple HTML DOM Parser. Download PHP Simple HTML DOM Parser from SourceForge and extract the zipped folder. Open the extracted folder, then copy and paste the simple_html_dom.php file into your project root folder.

All done? You're now ready to crawl websites with PHP.
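
To confirm the parser is wired up correctly, you can run a quick check in a throwaway file (the file name below is just an example):

test_parser.php
<?php
// include the parser from your project root and parse a small HTML string
include 'simple_html_dom.php';

$html = str_get_html('<p class="greeting">Hello, crawler!</p>');

// should print "Hello, crawler!"
echo $html->find('.greeting', 0)->plaintext;
?>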

Step 1: Follow All the Links on a Website

The first step is to request the target website using cURL. This action retrieves the website's HTML in preparation for parsing.

Specify the target URL and create a function to fetch the target website via a url parameter. This function initializes cURL with the required parameters and returns the website's HTML content. You'll parse this returned HTML later using the simple_html_dom library:

scraper.php
<?php
// define the target URL
$targetUrl = 'https://www.scrapingcourse.com/ecommerce/';

// function to fetch webpage content using cURL
function fetchContent($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);

    $response = curl_exec($ch);
    if (curl_errno($ch)) {
        echo "cURL error: " . curl_error($ch) . "\n";
        return false;
    }
    curl_close($ch);
   
    // return the response
    return $response;
}
// execute the function
$content = fetchContent($targetUrl);

echo $content;

?>

Let's now expand the above code to follow links on the target website. 

First, import simple_html_dom and add the target URL to an array of URLs to visit. Then, cap the crawl by setting maxCrawlLength to 20. This limit is essential to avoid endless crawling:

scraper.php
<?php

// include the simple_html_dom library
include('simple_html_dom.php');

// ...

// create an initial array of URLs to visit
$urlsToVisit = [$targetUrl];

// define the crawl limit
$maxCrawlLength = 20;
// ...
?>

Now, create a crawler function that accepts the target URL, the URLs to visit, and the maximum crawl length. Set the initial crawl count to 0. Then, start a while loop that checks if the array containing the URLs to visit is empty. The while condition stops the crawl once the function hits the crawl limit. Shift to the next URL in the urlsToVisit array and increment the crawl counter:

scraper.php
// ...

// crawler function
function crawler($targetUrl, $urlsToVisit, $maxCrawlLength) {
    $crawlCount = 0;

    while (!empty($urlsToVisit) && $crawlCount <= $maxCrawlLength) {
        // get the next URL from the list
        $currentUrl = array_shift($urlsToVisit);
        $crawlCount++;
    }
}

Open a try block and visit the current URL by calling the previous fetchContent function. Parse the HTML output using the string parser from simple_html_dom and find all the a tags on the current page using a foreach loop. 

Format every relative URL into an absolute one for uniformity. Then, implement a logic to limit duplicate crawls. The entire function returns the urlsToVisit array:

scraper.php
// ...

function crawler($targetUrl, $urlsToVisit, $maxCrawlLength) {
    // ...

    while (!empty($urlsToVisit) && $crawlCount <= $maxCrawlLength) {

        // ...
        try {
            // fetch the webpage content
            $htmlContent = fetchContent($currentUrl);
            if (!$htmlContent) continue;

            // parse the webpage content
            $html = str_get_html($htmlContent);
            if (!$html) {
                echo "Failed to parse HTML for $currentUrl\n";
                continue;
            }

            // find all <a> tags with href attributes
            foreach ($html->find('a[href]') as $linkElement) {
                $url = $linkElement->href;

                // handle relative URLs
                if (!preg_match('/^http/', $url)) {
                    $url = rtrim($targetUrl, '/') . '/' . ltrim($url, '/');
                }

                // check if the URL is within the target domain and has not already been visited
                if (strpos($url, $targetUrl) === 0 && !in_array($url, $urlsToVisit) && $url !== $currentUrl) {
                    $urlsToVisit[] = $url;
                }
            }

        } catch (Exception $e) {
            // handle any exceptions
            echo "Error fetching $currentUrl: " . $e->getMessage() . "\n";
        }
    }

    // return the URLs to visit
    return $urlsToVisit;
}

Finally, execute the crawler function to view the crawled links:

scraper.php
// ...

// execute the crawler
print_r(crawler($targetUrl, $urlsToVisit, $maxCrawlLength));

Now, before moving to the next step, let’s review the complete snippet as it stands so far:

scraper.php
<?php

// include the simple_html_dom library
include('simple_html_dom.php');

// define the target URL and initialize the list of URLs to visit
$targetUrl = 'https://www.scrapingcourse.com/ecommerce/';
$urlsToVisit = [$targetUrl];

// define the crawl limit
$maxCrawlLength = 20;

// function to fetch webpage content using cURL
function fetchContent($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);

    $response = curl_exec($ch);
    if (curl_errno($ch)) {
        echo "cURL error: " . curl_error($ch) . "\n";
        return false;
    }
    curl_close($ch);

    return $response;
}

// crawler function
function crawler($targetUrl, $urlsToVisit, $maxCrawlLength) {
    $crawlCount = 0;

    while (!empty($urlsToVisit) && $crawlCount <= $maxCrawlLength) {
        // get the next URL from the list
        $currentUrl = array_shift($urlsToVisit);
        $crawlCount++;

        try {
            // fetch the webpage content
            $htmlContent = fetchContent($currentUrl);
            if (!$htmlContent) continue;

            // parse the webpage content
            $html = str_get_html($htmlContent);
            if (!$html) {
                echo "Failed to parse HTML for $currentUrl\n";
                continue;
            }

            // find all <a> tags with href attributes
            foreach ($html->find('a[href]') as $linkElement) {
                $url = $linkElement->href;

                // handle relative URLs
                if (!preg_match('/^http/', $url)) {
                    $url = rtrim($targetUrl, '/') . '/' . ltrim($url, '/');
                }

                // check if the URL is within the target domain and has not already been visited

                if (strpos($url, $targetUrl) === 0 && !in_array($url, $urlsToVisit) && $url !== $currentUrl) {
                    $urlsToVisit[] = $url;
                }
            }

        } catch (Exception $e) {
            // handle any exceptions
            echo "Error fetching $currentUrl: " . $e->getMessage() . "\n";
        }
    }

    // return the URLs to visit
    return $urlsToVisit;
}

// execute the crawler
print_r(crawler($targetUrl, $urlsToVisit, $maxCrawlLength));
?>

Execute the code, and you'll get the following links:

Output
(
    [0] => https://www.scrapingcourse.com/ecommerce/?add-to-cart=2740
    [1] => https://www.scrapingcourse.com/ecommerce/product/ajax-full-zip-sweatshirt/

    // ... omitted for brevity

    [25] => https://www.scrapingcourse.com/ecommerce/page/1/
    [26] => https://www.scrapingcourse.com/ecommerce/page/5/

    //... omitted for brevity
)

Nice! Your crawler now follows links on the target website. Let's improve it to target product URLs and scrape their content.

Step 2: Extract Data From Your Crawler

You'll now scrape product names, prices, and image URLs from specific pages. To achieve that, you'll need to filter the crawled links and focus on paginated product pages.

Modify the previous scraper to specify an array to collect the extracted product data. Since product pages have the page/<PAGE_NUMBER> path in their URL, use regex to match this pattern:

scraper.php
// ...

// to store scraped product data
$productData = [];

// define a regex to match the pagination pattern
$pagePattern = '/page\/\d+/i';

// ...

Update the crawler function to accept the pagePattern and productData parameters (the latter passed by reference so the function can append to it). Implement the scraping logic only when the crawled URL matches the page pattern or is the target URL itself:

scraper.php
// ...

// crawler function
function crawler($urlsToVisit, $maxCrawlLength, $targetUrl, $pagePattern, &$productData) {
    $crawledCount = 0;

    while (!empty($urlsToVisit) && $crawledCount <= $maxCrawlLength) {

        try {

            // ...

            // extract product information from paginated product pages only
            if (preg_match($pagePattern, $currentUrl) || $currentUrl === $targetUrl) {
               
                // retrieve all product containers
                foreach ($html->find('.product') as $productElement) {
                    $data = [];
                    // remove HTML entities from the price element (&#36; to $)
                    $price = $productElement->find('.price', 0)->plaintext ?? 'N/A';
                    $price = html_entity_decode($price);
                    $price = preg_replace('/\s+/', '', trim($price));

                    $data = [
                        'url' => $productElement->find('.woocommerce-LoopProduct-link', 0)->href ?? 'N/A',
                        'image' => $productElement->find('.product-image', 0)->src ?? 'N/A',
                        'name' => trim($productElement->find('.product-name', 0)->plaintext ?? 'N/A'),
                        'price' => $price,
                    ];

                    // append the scraped data to the productData array
                    $productData[] = $data;
                }
            }
        } catch (Exception $e) {
            // ... error handling
        }
    }
}

// ...

Your crawler now filters product pages successfully to extract data from them. Let's store the extracted data in the next step.

Step 3: Export the Scraped Data to CSV

Data storage is essential for record keeping, further analysis, referencing, and more. You can store the extracted data in JSON, CSV, or XLSX format, or in a local or remote database.

Let's write the data to CSV for simplicity.

To save the scraped product data as CSV, execute the crawler function. Then, create the header row and write each product record as a row:

scraper.php
//...

// execute the crawler
crawler($urlsToVisit, $maxCrawlLength, $targetUrl, $pagePattern, $productData);

// write product data to CSV file
$header = "Url,Image,Name,Price\n";
$csvRows = [];
foreach ($productData as $item) {
    $csvRows[] = implode(',', [
        $item['url'],
        $item['image'],
        $item['name'],
        $item['price']
    ]);
}
$csvData = $header . implode("\n", $csvRows);

// save to CSV file
file_put_contents('products.csv', $csvData);
echo "CSV file has been successfully created!";

Combine the snippets from all the steps, and you'll get this full code:

scraper.php
<?php
// include the simple_html_dom library
require 'simple_html_dom.php';

// specify the URL of the site to crawl
$targetUrl = 'https://www.scrapingcourse.com/ecommerce/';

// add the target URL to an array of URLs to visit
$urlsToVisit = [$targetUrl];

// define the desired crawl limit
$maxCrawlLength = 20;

// to store scraped product data
$productData = [];

// define a regex to match the pagination pattern
$pagePattern = '/page\/\d+/i';

// function to fetch webpage content using cURL
function fetchContent($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);

    $response = curl_exec($ch);
    if (curl_errno($ch)) {
        echo "cURL error: " . curl_error($ch) . "\n";
        return false;
    }
    curl_close($ch);

    return $response;
}

// crawl function
function crawler($urlsToVisit, $maxCrawlLength, $targetUrl, $pagePattern, &$productData) {
    $crawledCount = 0;

    while (!empty($urlsToVisit) && $crawledCount <= $maxCrawlLength) {
        $currentUrl = array_shift($urlsToVisit);
        $crawledCount++;

        try {
            // request the target website
            $htmlContent = fetchContent($currentUrl);
            if (!$htmlContent) continue;
           
            // parse the webpage content
            $html = str_get_html($htmlContent);

            if (!$html) {
                echo "Failed to parse HTML for $currentUrl\n";
                continue;
            }

            // find all links on the page
            foreach ($html->find('a[href]') as $linkElement) {
                $url = $linkElement->href;

                // handle relative URLs
                if (!preg_match('/^http/', $url)) {
                    $url = rtrim($targetUrl, '/') . '/' . ltrim($url, '/');
                }

                // check if the URL is within the target domain and has not already been visited
                if (strpos($url, $targetUrl) === 0 && !in_array($url, $urlsToVisit) && $url !== $currentUrl) {
                    $urlsToVisit[] = $url;
                }
            }

            // extract product information from paginated product pages only
            if (preg_match($pagePattern, $currentUrl) || $currentUrl === $targetUrl) {
               
                // retrieve all product containers
                foreach ($html->find('.product') as $product) {
                    $data = [];
       
                    // remove HTML entities from the price element (&#36; to $)
                    $price = $product->find('.price', 0)->plaintext ?? 'N/A';
                    $price = html_entity_decode($price);
                    $price = preg_replace('/\s+/', '', trim($price));
       
                    $data['url'] = $product->find('.woocommerce-LoopProduct-link', 0)->href ?? 'N/A';
                    $data['image'] = $product->find('.product-image', 0)->src ?? 'N/A';
                    $data['name'] = trim($product->find('.product-name', 0)->plaintext ?? 'N/A');
                    $data['price'] = $price;
       
                    // append the scraped data to the productData array
                    $productData[] = $data;
                }
            }
        } catch (Exception $e) {
            echo "Error fetching $currentUrl: " . $e->getMessage() . PHP_EOL;
        }
    }
}

// execute the crawler
crawler($urlsToVisit, $maxCrawlLength, $targetUrl, $pagePattern, $productData);

// write product data to CSV file
$header = "Url,Image,Name,Price\n";
$csvRows = [];
foreach ($productData as $item) {
    $csvRows[] = implode(',', [
        $item['url'],
        $item['image'],
        $item['name'],
        $item['price']
    ]);
}
$csvData = $header . implode("\n", $csvRows);

// save to CSV file
file_put_contents('products.csv', $csvData);
echo "CSV file has been successfully created!";
?>

The above crawler stores the scraped data into a CSV file inside your project root folder. Here's the CSV file content:

scrapingcourse ecommerce product output csv

You did it! You just created your first PHP web crawler using cURL and Simple HTML DOM. 

That's not all. You can still improve your web crawler with some essential tweaks.

Optimize Your PHP Web Crawler

The current crawler already handles features like crawl depth and basic duplicate checks. However, you can add more advanced features and improve existing ones to make it more efficient. 

Avoid Duplicate Crawling

The current duplicate check doesn't account for trivial formatting differences, such as letter casing, redundant query strings, and trailing slashes, so the same page can be queued under several URL variants.

To prevent link duplication more effectively, start by creating a URL normalizer function that converts crawled links into lowercase, absolute URLs:

scraper.php
//...

// function to normalize URLs
function normalizeUrl($url, $baseUrl) {
    $parsedUrl = parse_url($url);
    // if the URL is already absolute, return it
    if (isset($parsedUrl['scheme'])) {
        return strtolower($url);
    }
   
    // otherwise, join it with the base URL
    return strtolower(rtrim($baseUrl, '/') . '/' . ltrim($url, '/'));
}

//...
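
This normalizer only lowercases URLs and resolves relative paths. If you also want to ignore query strings, fragments, and trailing slashes, you could extend it along these lines. This is a sketch that assumes query parameters never change the page content on the target site, and the function name is illustrative:

scraper.php
// stricter normalizer: also drops query strings, fragments, and trailing slashes
function normalizeUrlStrict($url, $baseUrl) {
    // resolve relative URLs against the base first
    if (!isset(parse_url($url)['scheme'])) {
        $url = rtrim($baseUrl, '/') . '/' . ltrim($url, '/');
    }

    $parts = parse_url(strtolower($url));
    $path = rtrim($parts['path'] ?? '', '/');

    // rebuild the URL without the query string or fragment
    return $parts['scheme'] . '://' . $parts['host'] . $path;
}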

Now, define a visitedUrls array to track crawled URLs and update the crawler function parameters to include it. Adjust the while loop to check the crawl limit against the number of visited URLs instead of the crawl counter. Then, modify the code to normalize the crawled links and mark the visited ones:

scraper.php
// ...

// define the visited URL array
$visitedUrls = [];

function crawler(&$urlsToVisit, &$visitedUrls, $maxCrawlLength, $targetUrl, $pagePattern, &$productData) {
    while (!empty($urlsToVisit) && count($visitedUrls) <= $maxCrawlLength) {

        //...

        // normalize the URL
        $normalizedUrl = normalizeUrl($currentUrl, $targetUrl);
        if (in_array($normalizedUrl, $visitedUrls)) continue;

        // mark URL as visited
        $visitedUrls[] = $normalizedUrl;

        // ... other crawling logic

    }
}

Update the try block to request the normalized URL. Similarly, normalize the crawled links and check their absolute URLs against visitedUrls and urlsToVisit. Finally, keep the scraping logic gated so data is extracted only when the normalized URL matches the page pattern or the target URL:

scraper.php
        try {
            // request the URL
            $htmlContent = fetchContent($normalizedUrl);
            if (!$htmlContent) continue;

            // parse the website's HTML
            // ...
            if (!$html) {
                echo "Failed to parse HTML for $normalizedUrl\n";
                // ...
            }

            // find all links on the page
            foreach ($html->find('a[href]') as $linkElement) {
                // ...
                $absoluteUrl = normalizeUrl($url, $targetUrl);

                if (strpos($absoluteUrl, $targetUrl) === 0 &&
                    !in_array($absoluteUrl, $visitedUrls) &&
                    !in_array($absoluteUrl, $urlsToVisit)) {
                    $urlsToVisit[] = $absoluteUrl;
                }
            }

            // extract product information from paginated pages and target URL
            if (preg_match($pagePattern, $normalizedUrl) || $normalizedUrl === $targetUrl) {

                //... scraping logic

            }
        } catch (Exception $e) {
            //... error handling
        }

Your PHP web crawler will now prevent duplicate crawling more efficiently. Let's improve the code further with priority queueing. 

Prioritize Specific Pages

Setting queue priorities allows you to crawl specific URLs before the others. This approach can also increase the number of target pages you can crawl within the chosen crawl depth.

Let's prioritize product pages in this example.

Replace the urlsToVisit array with two arrays: a high-priority queue and a low-priority queue. Include both queues as crawler function parameters and use them in the while loop condition. Pull the next URL from the high-priority queue first, and route newly discovered links that match the pagination pattern into the high-priority queue, sending everything else to the low-priority one:

scraper.php
// ...

// high-priority and low-priority queues
$highPriorityQueue = [$targetUrl];
$lowPriorityQueue = [$targetUrl];

// ...

function crawler(&$highPriorityQueue, &$lowPriorityQueue, &$visitedUrls, $maxCrawlLength, $targetUrl, $pagePattern, &$productData) {
    while ((count($highPriorityQueue) > 0 || count($lowPriorityQueue) > 0) && count($visitedUrls) <= $maxCrawlLength) {
        // check for URLs in high-priority queue first
        if (count($highPriorityQueue) > 0) {
            $currentUrl = array_shift($highPriorityQueue);
        } else {
            // otherwise, get the next URL from the low-priority queue
            $currentUrl = array_shift($lowPriorityQueue);
        }

        // ...

        try {

            // ...

            // find all links on the page
            foreach ($html->find('a[href]') as $linkElement) {

                // ...

                if (strpos($absoluteUrl, $targetUrl) === 0 &&
                    //...
                    !in_array($absoluteUrl, $highPriorityQueue) &&
                    !in_array($absoluteUrl, $lowPriorityQueue)) {

                        // prioritize paginated pages
                        if (preg_match($pagePattern, $absoluteUrl)) {
                            $highPriorityQueue[] = $absoluteUrl;
                        } else {
                            $lowPriorityQueue[] = $absoluteUrl;
                        }
                }
            }
            // ...scraping logic
        } catch (Exception $e) {
            // ...error handling
        }
    }
}

The above modification ensures your crawler follows paginated product links first, increasing your chances of reaching more product pages and extracting more data within the crawl limit.
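
As a design note, two plain arrays are easy to follow, but PHP's built-in SplPriorityQueue can express the same idea with a single structure. Here's a minimal standalone sketch, assuming you give paginated pages a higher numeric priority:

// alternative: a single SplPriorityQueue instead of two arrays
$pagePattern = '/page\/\d+/i';
$queue = new SplPriorityQueue();

// seed the queue with the target URL at a low priority
$queue->insert('https://www.scrapingcourse.com/ecommerce/', 0);

// when discovering a link, give paginated product pages a higher priority
$absoluteUrl = 'https://www.scrapingcourse.com/ecommerce/page/2/';
$priority = preg_match($pagePattern, $absoluteUrl) ? 10 : 0;
$queue->insert($absoluteUrl, $priority);

// extract() always returns the highest-priority URL first
while (!$queue->isEmpty()) {
    echo $queue->extract() . PHP_EOL;
}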

Now, let's leverage the power of sessions.

Maintain a Single Crawl Session

Keeping a single crawl session can prevent unnecessary reconnections and is handy for persisting session cookies across several requests.

To track session cookies, update the fetchContent function to store cookies and retrieve them from a cookies file:

scraper.php
//...

// cURL helper function to handle requests and session management
function fetchContent($url, $cookieFile) {
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // store cookies here
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // send cookies from here

    $response = curl_exec($ch);
   
    // check for cURL errors
    if (curl_errno($ch)) {
        echo 'cURL error: ' . curl_error($ch);
    }
   
    curl_close($ch);
   
    return $response;
}

Specify the cookie file inside the crawler function and pass it as an argument when calling the fetchContent function:

scraper.php
//...

function crawler(&$highPriorityQueue, &$lowPriorityQueue, &$visitedUrls, $maxCrawlLength, $targetUrl, $pagePattern, &$productData) {
    $cookieFile = 'cookies.txt';
    while ((count($highPriorityQueue) > 0 || count($lowPriorityQueue) > 0) && count($visitedUrls) <= $maxCrawlLength) {
        //...
        try {
            // request the URL
            $htmlContent = fetchContent($normalizedUrl, $cookieFile);
            //... other crawling logic
        } catch (Exception $e) {
            //... error handling
        }
    }
}

The above modification creates a cookies file in your project root folder. cURL writes the cookies it receives into this file and sends them back on subsequent requests, keeping a single consistent session across the crawl.

Let's combine all the snippets, and you'll get this final code:

scraper.php
<?php
// include simple_html_dom.php
include_once 'simple_html_dom.php';

// function to normalize URLs
function normalizeUrl($url, $baseUrl) {
    $parsedUrl = parse_url($url);
    if (isset($parsedUrl['scheme'])) {
        return strtolower($url);
    }
    return strtolower(rtrim($baseUrl, '/') . '/' . ltrim($url, '/'));
}

// cURL helper function to handle requests and session management
function fetchContent($url, $cookieFile) {
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // store cookies here
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // send cookies from here

    $response = curl_exec($ch);
   
    // check for cURL errors
    if (curl_errno($ch)) {
        echo 'cURL error: ' . curl_error($ch);
    }
   
    curl_close($ch);
   
    return $response;
}

// specify the target URL of the site to crawl
$targetUrl = 'https://www.scrapingcourse.com/ecommerce/';

// high-priority and low-priority queues
$highPriorityQueue = [$targetUrl];
$lowPriorityQueue = [$targetUrl];

// define the visited URL array
$visitedUrls = [];

// define the desired crawl limit
$maxCrawlLength = 20;

// to store scraped product data
$productData = [];

// define a regex to match the pagination pattern
$pagePattern = '/page\/\d+/i';

function crawler(&$highPriorityQueue, &$lowPriorityQueue, &$visitedUrls, $maxCrawlLength, $targetUrl, $pagePattern, &$productData) {
    $cookieFile = 'cookies.txt';
    while ((count($highPriorityQueue) > 0 || count($lowPriorityQueue) > 0) && count($visitedUrls) <= $maxCrawlLength) {
        // check for URLs in high-priority queue first
        if (count($highPriorityQueue) > 0) {
            $currentUrl = array_shift($highPriorityQueue);
        } else {
            // otherwise, get the next URL from the low-priority queue
            $currentUrl = array_shift($lowPriorityQueue);
        }

        // normalize the URL
        $normalizedUrl = normalizeUrl($currentUrl, $targetUrl);
        if (in_array($normalizedUrl, $visitedUrls)) continue;

        // mark URL as visited
        $visitedUrls[] = $normalizedUrl;

        try {
            // request the URL
            $htmlContent = fetchContent($normalizedUrl, $cookieFile);
            if (!$htmlContent) continue;

            // parse the website's HTML
            $html = str_get_html($htmlContent);
            if (!$html) {
                echo "Failed to parse HTML for $normalizedUrl\n";
                continue;
            }

            // find all links on the page
            foreach ($html->find('a[href]') as $linkElement) {
                $url = $linkElement->href;
                $absoluteUrl = normalizeUrl($url, $targetUrl);

                if (strpos($absoluteUrl, $targetUrl) === 0 &&
                    !in_array($absoluteUrl, $visitedUrls) &&
                    !in_array($absoluteUrl, $highPriorityQueue) &&
                    !in_array($absoluteUrl, $lowPriorityQueue)) {

                        // prioritize paginated pages
                        if (preg_match($pagePattern, $absoluteUrl)) {
                            $highPriorityQueue[] = $absoluteUrl;
                        } else {
                            $lowPriorityQueue[] = $absoluteUrl;
                        }
                }
            }

            // extract product information from paginated pages and target URL
            if (preg_match($pagePattern, $normalizedUrl) || $normalizedUrl === $targetUrl) {

                foreach ($html->find('.product') as $product) {
                    $data = [];
                    $price = $product->find('.price', 0)->plaintext ?? 'N/A';
                    $price = html_entity_decode($price);
                    $price = preg_replace('/\s+/', '', trim($price));

                    $data['url'] = $product->find('.woocommerce-LoopProduct-link', 0)->href ?? 'N/A';
                    $data['image'] = $product->find('.product-image', 0)->src ?? 'N/A';
                    $data['name'] = trim($product->find('.product-name', 0)->plaintext ?? 'N/A');
                    $data['price'] = $price;

                    $productData[] = $data;
                }
            }
        } catch (Exception $e) {
            echo "Error fetching $currentUrl: " . $e->getMessage() . PHP_EOL;
        }
    }
}

// execute the crawler
crawler($highPriorityQueue, $lowPriorityQueue, $visitedUrls, $maxCrawlLength, $targetUrl, $pagePattern, $productData);

// write product data to CSV file
$header = "Url,Image,Name,Price\n";
$csvRows = [];
foreach ($productData as $item) {
    $csvRows[] = implode(',', [
        $item['url'],
        $item['image'],
        $item['name'],
        $item['price']
    ]);
}
$csvData = $header . implode("\n", $csvRows);
file_put_contents('products.csv', $csvData);
echo "CSV file has been successfully created!";
?>

Bravo! You've now built an advanced web crawler with PHP. Despite its robustness, you still have to deal with anti-bot measures. So, how can you do that? Let's find out.

Avoid Getting Blocked While Crawling With PHP

Getting blocked is common during web crawling because the process involves visiting and extracting data from many pages. To guarantee successful crawling, you'll need to optimize your crawler to avoid detection.

You can reduce the chances of detection by setting proxies, spoofing browser-like request headers, and reducing your request frequency. However, these techniques are insufficient at scale, especially when dealing with advanced anti-bot measures.
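
For context, here's roughly what those basic tweaks look like with cURL. The proxy address and User-Agent string below are placeholders you'd replace with your own values, and the function name is illustrative:

// a variant of fetchContent with a proxy, browser-like headers, and a delay
function fetchContentPolitely($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    // route the request through a proxy (placeholder address)
    curl_setopt($ch, CURLOPT_PROXY, 'http://<PROXY_HOST>:<PROXY_PORT>');

    // spoof a browser-like User-Agent and common headers
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36');
    curl_setopt($ch, CURLOPT_HTTPHEADER, [
        'Accept-Language: en-US,en;q=0.9',
        'Accept: text/html,application/xhtml+xml',
    ]);

    $response = curl_exec($ch);
    curl_close($ch);

    // throttle the crawl: wait 1-3 seconds before the next request
    sleep(rand(1, 3));

    return $response;
}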

The easiest way to crawl any website without getting blocked is to use a web scraping API such as ZenRows' Universal Scraper API. ZenRows handles complex setups under the hood, including premium proxy rotation, request header optimization, cookie management for session persistence, JavaScript rendering, anti-bot auto-bypass, and more.

Let's use ZenRows to access and scrape the full-page HTML of this Antibot Challenge page to see how it works.

Sign up and open the ZenRows Request Builder. Paste the target URL in the link box, and activate Premium Proxies and JS Rendering.

building a scraper with zenrows

Choose PHP as your programming language and select the API connection mode. Copy the generated code into a PHP file; it should look like this:

scraper.php
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.scrapingcourse.com%2Fantibot-challenge&js_render=true&premium_proxy=true');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
echo $response . PHP_EOL;
curl_close($ch);
?>

The above code outputs the protected site's full-page HTML:

Output
<html lang="en">
<head>
    <!-- ... -->
    <title>Antibot Challenge - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body>
    <!-- ... -->
    <h2>
        You bypassed the Antibot challenge! :D
    </h2>
    <!-- other content omitted for brevity -->
</body>
</html>

Congratulations! 🎉 Your crawler now bypasses anti-bot detection using ZenRows' Universal Scraper API. 

Web Crawling Tools for PHP

Although your current crawler uses cURL and Simple HTML DOM, here are some more PHP web crawling tools you can use, depending on your project scope.

ZenRows

ZenRows is one of the top scraping solutions. It offers a complete toolkit for crawling any website at scale without getting blocked. It's the best choice for handling edge cases like memory management, premium proxy rotation, concurrency, automatic retries, fingerprinting evasion, JavaScript rendering, anti-bot auto-bypass, and more. The good news is that you get all these benefits with only a few lines of code.

Spatie Crawler

Spatie Crawler (spatie/crawler) is an open-source PHP crawler that uses Guzzle to execute concurrent requests under the hood. Although Spatie Crawler doesn't execute JavaScript by default, you can add JavaScript rendering using Browsershot, a Puppeteer-based package.

Roach

Roach is another open-source PHP web crawling tool. Roach's architecture is similar to Python's Scrapy, making it feature-rich and well-suited for complex crawling tasks. The library has advanced crawling features like item pipelines for data processing and supports custom middleware. You can integrate Roach easily with web development frameworks like Laravel and Symfony and use the crawling library directly in your applications.

PHP Crawling Best Practices and Considerations

Although you've already optimized your PHP crawler with additional features, a few more best practices will help you handle edge cases and scale further:

Parallel Crawling and Concurrency

Currently, your crawler handles requests sequentially, which slows down the crawling process because it waits for each response before sending the next request. Concurrency allows your crawler to fetch multiple pages simultaneously instead of waiting on each response, which can speed up crawling significantly.

To adapt your crawler for concurrent requests, update the fetchContent function to register each request with a cURL multi handle created by curl_multi_init(). Note that the sketch below still drives one URL per loop iteration; for real parallelism, you'd add several handles to the multi handle before calling curl_multi_exec(). Here's the updated crawler:

scraper.php
<?php
// include simple_html_dom.php
include_once 'simple_html_dom.php';

// function to normalize URLs
function normalizeUrl($url, $baseUrl) {
    $parsedUrl = parse_url($url);
    if (isset($parsedUrl['scheme'])) {
        return strtolower($url);
    }
    return strtolower(rtrim($baseUrl, '/') . '/' . ltrim($url, '/'));
}

// cURL helper function to handle requests and session management
function fetchContent($url, $cookieFile, $multiHandle, &$multiHandles) {
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // store cookies here
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // send cookies from here

    // add the handle to the multi-handle
    curl_multi_add_handle($multiHandle, $ch);
    $multiHandles[$url] = $ch;
    return $ch;
}

// specify the target URL of the site to crawl
$targetUrl = 'https://www.scrapingcourse.com/ecommerce/';

// high-priority and low-priority queues
$highPriorityQueue = [$targetUrl];
$lowPriorityQueue = [$targetUrl];

// define the visited URL array
$visitedUrls = [];

// define the desired crawl limit
$maxCrawlLength = 20;

// to store scraped product data
$productData = [];

// define a regex to match the pagination pattern
$pagePattern = '/page\/\d+/i';

function crawler(&$highPriorityQueue, &$lowPriorityQueue, &$visitedUrls, $maxCrawlLength, $targetUrl, $pagePattern, &$productData) {
    $cookieFile = 'cookies.txt';
    $multiHandle = curl_multi_init(); // multi handle for concurrency
    $multiHandles = [];

    // to handle the number of simultaneous requests
    while ((count($highPriorityQueue) > 0 || count($lowPriorityQueue) > 0) && count($visitedUrls) <= $maxCrawlLength) {
        // check for URLs in high-priority queue first
        if (count($highPriorityQueue) > 0) {
            $currentUrl = array_shift($highPriorityQueue);
        } else {
            // otherwise, get the next URL from the low-priority queue
            $currentUrl = array_shift($lowPriorityQueue);
        }

        // normalize the URL
        $normalizedUrl = normalizeUrl($currentUrl, $targetUrl);
        if (in_array($normalizedUrl, $visitedUrls)) continue;

        // mark URL as visited
        $visitedUrls[] = $normalizedUrl;

        try {
            // fetch content using multi-curl
            $ch = fetchContent($normalizedUrl, $cookieFile, $multiHandle, $multiHandles);

            // execute all multi-curl requests
            $active = null;
            do {
                curl_multi_exec($multiHandle, $active);
            } while ($active);

            // check if we have a response for the current URL
            $htmlContent = curl_multi_getcontent($ch);
            if (!$htmlContent) continue;

            // parse the website's HTML
            $html = str_get_html($htmlContent);
            if (!$html) {
                echo "Failed to parse HTML for $normalizedUrl\n";
                continue;
            }

            // find all links on the page
            foreach ($html->find('a[href]') as $linkElement) {
                $url = $linkElement->href;
                $absoluteUrl = normalizeUrl($url, $targetUrl);

                if (strpos($absoluteUrl, $targetUrl) === 0 &&
                    !in_array($absoluteUrl, $visitedUrls) &&
                    !in_array($absoluteUrl, $highPriorityQueue) &&
                    !in_array($absoluteUrl, $lowPriorityQueue)) {

                        // prioritize paginated pages
                        if (preg_match($pagePattern, $absoluteUrl)) {
                            $highPriorityQueue[] = $absoluteUrl;
                        } else {
                            $lowPriorityQueue[] = $absoluteUrl;
                        }
                }
            }

            // extract product information from paginated pages and target URL
            if (preg_match($pagePattern, $normalizedUrl) || $normalizedUrl === $targetUrl) {

                foreach ($html->find('.product') as $product) {
                    $data = [];
                    $price = $product->find('.price', 0)->plaintext ?? 'N/A';
                    $price = html_entity_decode($price);
                    $price = preg_replace('/\s+/', '', trim($price));

                    $data['url'] = $product->find('.woocommerce-LoopProduct-link', 0)->href ?? 'N/A';
                    $data['image'] = $product->find('.product-image', 0)->src ?? 'N/A';
                    $data['name'] = trim($product->find('.product-name', 0)->plaintext ?? 'N/A');
                    $data['price'] = $price;

                    $productData[] = $data;
                }
            }
        } catch (Exception $e) {
            echo "Error fetching $currentUrl: " . $e->getMessage() . PHP_EOL;
        }

        // remove the completed request from the multi-handle
        curl_multi_remove_handle($multiHandle, $ch);
    }

    // close multi-handle
    curl_multi_close($multiHandle);
}

// execute the crawler
crawler($highPriorityQueue, $lowPriorityQueue, $visitedUrls, $maxCrawlLength, $targetUrl, $pagePattern, $productData);

// write product data to CSV file
$header = "Url,Image,Name,Price\n";
$csvRows = [];
foreach ($productData as $item) {
    $csvRows[] = implode(',', [
        $item['url'],
        $item['image'],
        $item['name'],
        $item['price']
    ]);
}
$csvData = $header . implode("\n", $csvRows);
file_put_contents('products.csv', $csvData);
echo "CSV file has been successfully created!";
?>

Nice! You just supercharged your crawler with concurrency.

Crawling JavaScript Rendered Pages in PHP

Some websites render content dynamically with JavaScript. The current target site is static, so it loads content immediately. However, dynamic pages take time to render and typically require a browser to execute JavaScript.

Standard HTTP clients like cURL can't handle JavaScript rendering because they only fetch raw HTML without executing scripts. The best way to access a dynamic website during crawling is to simulate a browser environment using a browser automation tool like Selenium.

Here's a sample Selenium script to access a dynamic page like the JS Rendering challenge page:

scraper.php
<?php

namespace Facebook\WebDriver;

// import the required libraries
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Chrome\ChromeOptions;

require_once('vendor/autoload.php');

// specify the URL to the local Selenium Server
$host = 'http://localhost:4444/';

// specify the desired capabilities
$capabilities = DesiredCapabilities::chrome();
$chromeOptions = new ChromeOptions();

// run Chrome in headless mode
$chromeOptions->addArguments(['--headless']);

// register the Chrome options
$capabilities->setCapability(ChromeOptions::CAPABILITY_W3C, $chromeOptions);

// initialize a driver to control a Chrome instance
$driver = RemoteWebDriver::create($host, $capabilities);

// maximize the window
$driver->manage()->window()->maximize();

// open the target page
$driver->get('https://www.scrapingcourse.com/javascript-rendering');

// extract the HTML page source and print it
$html = $driver->getPageSource();
echo $html;

// close the driver and release its resources
$driver->close();
?>

However, the above script requires some configurations that are beyond the scope of this article. Check out our tutorial on web scraping with Selenium in PHP for a detailed guide.

Distributed Web Crawling in PHP

Web crawling can be resource-intensive at scale and consumes significant memory, CPU, and bandwidth. This behavior can slow down your local machine, especially when running the crawler on a single node. 

Distributed web crawling helps spread the crawling workload across multiple nodes, reducing the strain on a single machine. This technique improves performance and makes your spider more fault-tolerant. So, even if one node fails, the distributed system can reassign the crawling task to other nodes.
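
In PHP, the usual building block for a distributed setup is a shared queue that every worker node pulls from. Here's a minimal sketch of a worker, assuming the phpredis extension is installed and a Redis server runs on localhost; the queue name is arbitrary:

worker.php
<?php
// a minimal distributed worker that pulls URLs from a shared Redis queue
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// seed the queue once (any node can do this)
$redis->rPush('crawl:queue', 'https://www.scrapingcourse.com/ecommerce/');

// each worker pops URLs until the queue is empty
while ($url = $redis->lPop('crawl:queue')) {
    echo "Crawling: $url\n";
    // fetch and parse $url here, then push newly discovered links back:
    // $redis->rPush('crawl:queue', $newUrl);
}
?>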

Conclusion

In this article, you've learned the basic to advanced concepts of PHP web crawling using cURL and the Simple HTML DOM Parser. You now know how to:

  • Crawl a website to follow links.
  • Extract data from crawled links.
  • Store the extracted data in a CSV file.
  • Optimize your PHP web crawler to avoid duplicate crawls, prioritize specific links, and manage sessions.

Despite building a robust web crawler, remember that anti-bot measures will still block you if you don't implement measures to bypass them. We recommend using ZenRows, an all-in-one scraping solution, to avoid anti-bot detection at scale during crawling.

Try ZenRows for free now!
