How to Parse HTML With PHP [2024 Tutorial]

August 22, 2024 · 3 min read

Are you looking for the best way to parse HTML with PHP? You're in the right place!

This tutorial will show you how to set up a PHP project with the php-html-parser library and use it to extract data from HTML content seamlessly.

Let's go!

Step #1: Install PHP HTML Parser

To get started, create a new directory and run the following command to initialize a Composer project:

Terminal
composer init

Follow the prompts to set up your project. You can accept the default values by pressing Enter.

Use Composer to download and install the php-html-parser library:

Terminal
composer require paquettg/php-html-parser

Once you have installed the library, your project directory should look like this:

Example
your-html-parser-project/
├── composer.json
├── composer.lock
└── vendor/
    ├── autoload.php
    ├── paquettg/
    │   └── php-html-parser/
    └── ...

Step #2: Extract HTML

Time to extract some data. We'll target the ScrapingCourse E-commerce demo page.

ScrapingCourse.com Ecommerce homepage

Let's use the cURL HTTP client to fetch the webpage content. Include the Composer autoloader to load installed packages and import the Dom class from the php-html-parser library.

Example
<?php

require "vendor/autoload.php";

use PHPHtmlParser\Dom;

?>

Then, initialize a cURL session to fetch the HTML content from the target URL (ScrapingCourse E-commerce demo page). You need to set some cURL options:

  • CURLOPT_URL to set the URL to fetch.
  • CURLOPT_RETURNTRANSFER to ensure the response is returned as a string.
  • CURLOPT_FOLLOWLOCATION to follow any redirects.
  • CURLOPT_SSL_VERIFYPEER, set to false, to skip SSL certificate verification. This keeps the demo simple; leave verification enabled in production code.

Example
<?php

// ...

// specify the target website's URL
$url = "https://scrapingcourse.com/ecommerce/";

// initialize a cURL session
$curl = curl_init();

// set the website URL
curl_setopt($curl, CURLOPT_URL, $url);

// return the response as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

// follow redirects
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); 

// ignore SSL verification
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

?>

Execute the cURL session and store the response. Check for errors and handle them appropriately. After closing the cURL session, use php-html-parser to parse the fetched HTML content and output the entire HTML.

Example
<?php

// ...

// execute the cURL session
$htmlContent = curl_exec($curl);

// check for errors
if ($htmlContent === false) {
    // handle the error
    $error = curl_error($curl);
    echo "cURL error: " . $error;
    exit;
}

// close cURL session
curl_close($curl);

// parse the HTML content using php-html-parser
$dom = new Dom;
$dom->loadStr($htmlContent);

// print the entire HTML content
echo $dom->outerHtml;
?>

Here's what your final script should look like:

Example
<?php

require "vendor/autoload.php";

use PHPHtmlParser\Dom;

// specify the target website's URL
$url = "https://scrapingcourse.com/ecommerce/";

// initialize a cURL session
$curl = curl_init();

// set the website URL
curl_setopt($curl, CURLOPT_URL, $url);

// return the response as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

// follow redirects
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); 

// ignore SSL verification
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

// execute the cURL session
$htmlContent = curl_exec($curl);

// check for errors
if ($htmlContent === false) {
    // handle the error
    $error = curl_error($curl);
    echo "cURL error: " . $error;
    exit;
}

// close cURL session
curl_close($curl);

// parse the HTML content using php-html-parser
$dom = new Dom;
$dom->loadStr($htmlContent);

// print the entire HTML content
echo $dom->outerHtml;
?>

Running this script will give you the following output:

Output
<!DOCTYPE html>
<html lang="en-US">
<head>
    <!--- ... --->
  
    <title>Ecommerce Test Site to Learn Web Scraping &#8211; ScrapingCourse.com</title>
    <!--- ... --->
    
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count" id="result-count" data-testid="result-count" data-sorting="count"> Showing 1&ndash;16 of 188 results</p>
    <!--- ... --->
    
    <ul class="products columns-4" id="product-list" data-testid="product-list" data-products="list">
    <!--- ... --->

    </ul>
    
    <!--- ... --->
</body>
</html>

Congratulations! You've successfully extracted the HTML content of the target page using the cURL HTTP client.
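The error check in the script only catches transport-level failures, where curl_exec returns false. A server can still answer with an error page and a 4xx or 5xx status. The following is a minimal sketch of one way to cover both cases. The fetchHtml helper name is hypothetical, and unlike the script above it leaves SSL verification at cURL's default, so adapt it to your setup.

```php
<?php

// Fetch a page and return its HTML, or throw if the request fails.
// fetchHtml is a hypothetical helper name, not part of any library.
function fetchHtml(string $url): string
{
    $curl = curl_init();

    curl_setopt_array($curl, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body as a string
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
    ]);

    $body = curl_exec($curl);

    // curl_exec() returns false on transport-level failures
    // (DNS errors, refused connections, and so on).
    if ($body === false) {
        throw new RuntimeException("cURL error: " . curl_error($curl));
    }

    // A completed transfer can still carry an HTTP error status.
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        throw new RuntimeException("Unexpected HTTP status: " . $status);
    }

    return $body;
}
```

With a helper like this, the inline cURL calls could be replaced by a single fetchHtml($url) call wrapped in a try/catch block.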

Step #3: Parse Data

There are two primary methods of parsing data from HTML: CSS selectors and XPath. You can learn more about the differences between these methods in our XPath vs CSS Selector guide. For this tutorial, we'll use CSS selectors because they are simpler and more user-friendly.
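To make the comparison concrete, here is a minimal sketch of what the XPath approach can look like. It uses PHP's built-in DOMDocument and DOMXPath classes rather than php-html-parser, and the inline HTML is made-up sample markup, so treat it as an illustration only.

```php
<?php

// Hypothetical sample markup standing in for a product listing.
$html = '<ul>'
    . '<li><h2 class="woocommerce-loop-product__title">Abominable Hoodie</h2></li>'
    . '<li><h2 class="woocommerce-loop-product__title">Aeon Capri</h2></li>'
    . '</ul>';

// Load the snippet with PHP's built-in DOM extension.
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();

// XPath equivalent of the CSS selector "h2.woocommerce-loop-product__title".
$xpath = new DOMXPath($document);
$query = '//h2[contains(concat(" ", normalize-space(@class), " "), " woocommerce-loop-product__title ")]';

$names = [];
foreach ($xpath->query($query) as $node) {
    $names[] = trim($node->textContent);
}

print_r($names);
```

The XPath expression needs the concat/normalize-space idiom just to match one class name, while the CSS version is a single short selector. That difference is the main reason the rest of this tutorial sticks with CSS selectors.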

To start, we'll extract the product name. Use your browser's DevTools to locate the HTML element that contains the product name. First, open the target webpage in your browser. Right-click on the first product name and select "Inspect" or "Inspect Element". The DevTools window will open, highlighting the HTML element. You'll see that the product names are inside <h2> tags with the class woocommerce-loop-product__title.

ScrapingCourse First Product Name Inspection

With this information, you can extract the product names. Use the find method of the Dom object to locate all elements that match the given CSS selector.

Here's the modified script to parse the product names:

Example
<?php

require "vendor/autoload.php";

use PHPHtmlParser\Dom;

// specify the target website's URL
$url = "https://scrapingcourse.com/ecommerce/";

// initialize a cURL session
$curl = curl_init();

// set the website URL
curl_setopt($curl, CURLOPT_URL, $url);

// return the response as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

// follow redirects
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); 

// ignore SSL verification
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

// execute the cURL session
$htmlContent = curl_exec($curl);

// check for errors
if ($htmlContent === false) {
    // handle the error
    $error = curl_error($curl);
    echo "cURL error: " . $error;
    exit;
}

// close cURL session
curl_close($curl);

// parse the HTML content using php-html-parser
$dom = new Dom;
$dom->loadStr($htmlContent);

// extract product names
$productNames = $dom->find(".woocommerce-loop-product__title");
foreach ($productNames as $name) {
    echo "Product Name: " . $name->text . "\n";
}
?>

The code will print out the extracted product names:

Output
Product Name: Abominable Hoodie
Product Name: Adrienne Trek Jacket
Product Name: Aeon Capri
Product Name: Aero Daily Fitness Tee
Product Name: Aether Gym Pant
Product Name: Affirm Water Bottle
Product Name: Aim Analog Watch
Product Name: Ajax Full-Zip Sweatshirt
Product Name: Ana Running Short
Product Name: Angel Light Running Short
Product Name: Antonia Racer Tank
Product Name: Apollo Running Short
Product Name: Arcadio Gym Short
Product Name: Argus All-Weather Tank
Product Name: Ariel Roll Sleeve Sweatshirt
Product Name: Artemis Running Short

Now, let's extract all the product details, including the product name, price, and image URL.

To keep each product's name, price, and image together, extract them from the parent element that wraps a single product rather than querying the whole page for each field.

Inspect the first product container. You'll see that each product is inside a list element with the product class:

ScrapingCourse First Product Parent Element Inspection

To extract the product price, use the CSS selector .price .woocommerce-Price-amount. In the DevTools, notice that the product price is contained within a <span> element with the class woocommerce-Price-amount, which is nested inside another <span> element with the class price.

Use PHP's strip_tags function to get the text content without any HTML tags. Then, use the html_entity_decode function to convert any HTML entities, such as &#36;, to the characters they represent (here, the dollar sign).
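Here is a small, standalone sketch of that cleanup step using only built-in PHP functions. The cleanPrice helper name and the sample markup are assumptions made for illustration; the real inner HTML of the price element may differ slightly.

```php
<?php

// Turn the inner HTML of a price element into plain text.
// cleanPrice is a hypothetical helper name used for illustration.
function cleanPrice(string $priceHtml): string
{
    // Drop the markup but keep the text between the tags.
    $text = strip_tags($priceHtml);

    // Decode entities such as &#36; into the characters they stand for.
    return trim(html_entity_decode($text, ENT_QUOTES | ENT_HTML5, "UTF-8"));
}

// Hypothetical sample resembling the markup inside a price element.
$sample = '<bdi><span class="woocommerce-Price-currencySymbol">&#36;</span>69.00</bdi>';

echo cleanPrice($sample), "\n";
```

The order matters: stripping tags first and decoding afterwards means a decoded character is never mistaken for the start of a tag.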

In the DevTools, you'll find that each product image is contained within an <img> tag, and the image's URL is specified in the src attribute. Use the getAttribute method to retrieve this src attribute value, which provides the URL of the product image.

Here's the code implementing this parsing logic:

Example
<?php

require "vendor/autoload.php";

use PHPHtmlParser\Dom;

// specify the target website's URL
$url = "https://scrapingcourse.com/ecommerce/";

// initialize a cURL session
$curl = curl_init();

// set the website URL
curl_setopt($curl, CURLOPT_URL, $url);

// return the response as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

// follow redirects
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); 

// ignore SSL verification
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

// execute the cURL session
$htmlContent = curl_exec($curl);

// check for errors
if ($htmlContent === false) {
    // handle the error
    $error = curl_error($curl);
    echo "cURL error: " . $error;
    exit;
}

// close cURL session
curl_close($curl);

// parse the HTML content using php-html-parser
$dom = new Dom;
$dom->loadStr($htmlContent);

// extract product elements
$productElements = $dom->find(".product");

// initialize an array to hold product data
$products = [];

// iterate over each product element to extract data
foreach ($productElements as $element) {
    // extract product name
    $productName = $element->find(".woocommerce-loop-product__title")->text;

    // extract product price
    $priceElement = $element->find(".price .woocommerce-Price-amount");
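    // find() returns a collection object, which is always truthy, so check its size instead
    $productPrice = count($priceElement) > 0 ? strip_tags($priceElement->innerHtml) : "N/A";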
    $productPrice = html_entity_decode($productPrice);

    // extract product image URL
    $productImage = $element->find("img")->getAttribute("src");

    // collect product data into an array
    $products[] = [
        "name" => $productName,
        "price" => $productPrice,
        "image" => $productImage
    ];
}

// print the extracted product data
foreach ($products as $product) {
    echo "Product Name: " . $product["name"] . "\n";
    echo "Product Price: " . $product["price"] . "\n";
    echo "Product Image URL: " . $product["image"] . "\n";
    echo "-----------------------------------\n";
}
?>

You'll get the following output displaying all the product information:

Output
Product Name: Abominable Hoodie
Product Price: $69.00
Product Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg
-----------------------------------
Product Name: Adrienne Trek Jacket
Product Price: $57.00
Product Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_main-324x324.jpg
-----------------------------------
Product Name: Aeon Capri
Product Price: $48.00
Product Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wp07-black_main-324x324.jpg
-----------------------------------
// omitted for brevity

Congratulations! You've successfully extracted the complete product data.
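If you want to experiment with the selectors without hitting the network, the same parsing logic can be run against an inline HTML string. The sketch below assumes php-html-parser is installed through Composer as in Step #1. The markup is hypothetical, and its second product deliberately has no price, to show how a missing element can be handled by counting the matches.

```php
<?php

require "vendor/autoload.php";

use PHPHtmlParser\Dom;

// Hypothetical markup: the second product has no price element.
$html = '<ul>'
    . '<li class="product">'
    . '<img src="https://example.com/hoodie.jpg">'
    . '<h2 class="woocommerce-loop-product__title">Abominable Hoodie</h2>'
    . '<span class="price"><span class="woocommerce-Price-amount">&#36;69.00</span></span>'
    . '</li>'
    . '<li class="product">'
    . '<img src="https://example.com/capri.jpg">'
    . '<h2 class="woocommerce-loop-product__title">Aeon Capri</h2>'
    . '</li>'
    . '</ul>';

$dom = new Dom;
$dom->loadStr($html);

$products = [];
foreach ($dom->find(".product") as $element) {
    // each find() call returns a collection that may be empty
    $nameNodes  = $element->find(".woocommerce-loop-product__title");
    $priceNodes = $element->find(".price .woocommerce-Price-amount");
    $imageNodes = $element->find("img");

    $products[] = [
        "name"  => count($nameNodes) > 0 ? $nameNodes[0]->text : "N/A",
        "price" => count($priceNodes) > 0
            ? html_entity_decode(strip_tags($priceNodes[0]->innerHtml))
            : "N/A",
        "image" => count($imageNodes) > 0 ? $imageNodes[0]->getAttribute("src") : "N/A",
    ];
}

print_r($products);
```

Counting the matches before reading from them keeps one incomplete product from stopping the whole loop.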

Step #4: Export Data to CSV

Now, it's time to export the extracted data to a CSV file. Define the path for the CSV file and open it in write mode. Add headers to label the columns appropriately. Iterate over the array of products and write each product's data to the CSV file. Finally, close the CSV file to ensure all data is properly saved.

Modify your previous script, and your final code should look like this:

Example
<?php

require "vendor/autoload.php";

use PHPHtmlParser\Dom;

// specify the target website's URL
$url = "https://scrapingcourse.com/ecommerce/";

// initialize a cURL session
$curl = curl_init();

// set the website URL
curl_setopt($curl, CURLOPT_URL, $url);

// return the response as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

// follow redirects
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); 

// ignore SSL verification
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

// execute the cURL session
$htmlContent = curl_exec($curl);

// check for errors
if ($htmlContent === false) {
    // handle the error
    $error = curl_error($curl);
    echo "cURL error: " . $error;
    exit;
}

// close cURL session
curl_close($curl);

// parse the HTML content using php-html-parser
$dom = new Dom;
$dom->loadStr($htmlContent);

// extract product elements
$productElements = $dom->find(".product");

// initialize an array to hold product data
$products = [];

// iterate over each product element to extract data
foreach ($productElements as $element) {
    // extract product name
    $productName = $element->find(".woocommerce-loop-product__title")->text;

    // extract product price
    $priceElement = $element->find(".price .woocommerce-Price-amount");
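    // find() returns a collection object, which is always truthy, so check its size instead
    $productPrice = count($priceElement) > 0 ? strip_tags($priceElement->innerHtml) : "N/A";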
    $productPrice = html_entity_decode($productPrice);

    // extract product image URL
    $productImage = $element->find("img")->getAttribute("src");

    // collect product data into an array
    $products[] = [
        "name" => $productName,
        "price" => $productPrice,
        "image" => $productImage
    ];
}

// define the CSV file path
$csvFile = "products.csv";

// open the CSV file for writing
$fp = fopen($csvFile, "w");

// add the headers to the CSV file
fputcsv($fp, ["Name", "Price", "Image URL"]);

// add the product data to the CSV file
foreach ($products as $product) {
    fputcsv($fp, [$product["name"], $product["price"], $product["image"]]);
}

// close the CSV file
fclose($fp);

echo "Data successfully exported to $csvFile\n";
?>

This is how your exported CSV file will look:

Extracted Data in CSV File

Congratulations! Your scraper now saves the extracted data into a CSV file.
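The export step can also be isolated into a function that checks whether the file opened correctly. This is a minimal sketch under a few assumptions: writeProductsCsv is a hypothetical helper name, the sample row is made up, and the output goes to the system temp directory so it doesn't overwrite your real products.csv.

```php
<?php

// Write product rows to a CSV file and fail loudly if the file cannot be opened.
// writeProductsCsv is a hypothetical helper name used for illustration.
function writeProductsCsv(string $path, array $products): void
{
    $fp = fopen($path, "w");
    if ($fp === false) {
        throw new RuntimeException("Could not open $path for writing");
    }

    // Delimiter, enclosure, and escape are passed explicitly so the call
    // does not depend on version-specific defaults.
    fputcsv($fp, ["Name", "Price", "Image URL"], ",", '"', "\\");

    foreach ($products as $product) {
        fputcsv($fp, [$product["name"], $product["price"], $product["image"]], ",", '"', "\\");
    }

    fclose($fp);
}

// Hypothetical sample row in the same shape as the scraper's $products array.
$sample = [
    ["name" => "Abominable Hoodie", "price" => '$69.00', "image" => "https://example.com/hoodie.jpg"],
];

$path = sys_get_temp_dir() . "/products-sample.csv";
writeProductsCsv($path, $sample);

echo file_get_contents($path);
```

Note that fputcsv wraps any field containing a space in double quotes, so "Image URL" and "Abominable Hoodie" appear quoted in the raw file while the price and URL do not.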

Conclusion

In this tutorial, you've learned how to integrate the php-html-parser library into your PHP project, fetch HTML content using cURL, and extract specific data such as product names, prices, and image URLs. You also learned how to export the extracted data to a CSV file.

This article provides a solid introduction to HTML parsing in PHP. To dive deeper into more advanced topics, including handling JavaScript-heavy pages and parsing multiple pages, check out our tutorial on web scraping with PHP. This will help you expand your scraping skills and tackle more complex scraping tasks.
