Are you looking for the best way to parse HTML with PHP? You're in the right place!
This tutorial will show you how to set up a PHP project with the php-html-parser library and use it to extract data from HTML content seamlessly.
Let's go!
Step #1: Install PHP HTML Parser
To get started, create a new directory and run the following command to initialize a Composer project:
composer init
Follow the prompts to set up your project. You can accept the default values by pressing Enter.
Use Composer to download and install the php-html-parser library:
composer require paquettg/php-html-parser
Once you have installed the library, your project directory should look like this:
your-html-parser-project/
├── composer.json
├── composer.lock
└── vendor/
    ├── autoload.php
    ├── paquettg/
    │   └── php-html-parser/
    │       └── ...
    └── ...
Step #2: Extract HTML
Time to extract some data. We'll target the ScrapingCourse E-commerce demo page.
Let's use the cURL HTTP client to fetch the webpage content. Include the Composer autoloader to load installed packages and import the Dom class from the php-html-parser library.
<?php
require "vendor/autoload.php";
use PHPHtmlParser\Dom;
?>
Then, initialize a cURL session to fetch the HTML content from the target URL (ScrapingCourse E-commerce demo page). You need to set some cURL options:
- CURLOPT_URL to set the URL to fetch.
- CURLOPT_RETURNTRANSFER to ensure the response is returned as a string.
- CURLOPT_FOLLOWLOCATION to follow any redirects.
- CURLOPT_SSL_VERIFYPEER to ignore SSL certificate verification for simplicity.
<?php
// ...
// specify the target website's URL
$url = "https://scrapingcourse.com/ecommerce/";
// initialize a cURL session
$curl = curl_init();
// set the website URL
curl_setopt($curl, CURLOPT_URL, $url);
// return the response as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
// follow redirects
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
// ignore SSL verification
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
?>
Execute the cURL session and store the response. Check for errors and handle them appropriately. After closing the cURL session, use php-html-parser to parse the fetched HTML content and output the entire HTML.
<?php
// ...
// execute the cURL session
$htmlContent = curl_exec($curl);
// check for errors
if ($htmlContent === false) {
    // handle the error
    $error = curl_error($curl);
    echo "cURL error: " . $error;
    exit;
}
// close cURL session
curl_close($curl);
// parse the HTML content using php-html-parser
$dom = new Dom;
$dom->loadStr($htmlContent);
// print the entire HTML content
echo $dom->outerHtml;
?>
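The error check above only catches transport failures: curl_exec() still returns a body when the server answers with a status such as 403 or 404. A minimal sketch of an extra status check follows. The helper names fetchHtml and isSuccessStatus are hypothetical, and treating only 2xx responses as success is an assumption, not something the tutorial prescribes.

```php
<?php
// Hypothetical helper: treats only 2xx responses as success.
function isSuccessStatus(int $statusCode): bool
{
    return $statusCode >= 200 && $statusCode < 300;
}

// Hypothetical wrapper around the cURL calls used in this tutorial.
// Returns the response body, or null on a transport error or non-2xx status.
function fetchHtml(string $url): ?string
{
    $curl = curl_init();
    // set the same four options as the tutorial, in a single call
    curl_setopt_array($curl, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_SSL_VERIFYPEER => false, // mirrors the tutorial's simplification
    ]);

    $body = curl_exec($curl);
    if ($body === false) {
        echo "cURL error: " . curl_error($curl) . "\n";
        return null;
    }

    // read the status code of the final response (after any redirects)
    $statusCode = (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if (!isSuccessStatus($statusCode)) {
        echo "Unexpected HTTP status: " . $statusCode . "\n";
        return null;
    }

    // the handle is released when $curl goes out of scope
    return $body;
}
```

With this in place, the script could call `$htmlContent = fetchHtml($url);` and stop early when the result is null, instead of passing an error page to the parser.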
Here's what your final script should look like:
<?php
require "vendor/autoload.php";
use PHPHtmlParser\Dom;
// specify the target website's URL
$url = "https://scrapingcourse.com/ecommerce/";
// initialize a cURL session
$curl = curl_init();
// set the website URL
curl_setopt($curl, CURLOPT_URL, $url);
// return the response as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
// follow redirects
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
// ignore SSL verification
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
// execute the cURL session
$htmlContent = curl_exec($curl);
// check for errors
if ($htmlContent === false) {
    // handle the error
    $error = curl_error($curl);
    echo "cURL error: " . $error;
    exit;
}
// close cURL session
curl_close($curl);
// parse the HTML content using php-html-parser
$dom = new Dom;
$dom->loadStr($htmlContent);
// print the entire HTML content
echo $dom->outerHtml;
?>
Running this script will give you the following output:
<!DOCTYPE html>
<html lang="en-US">
<head>
<!--- ... --->
<title>Ecommerce Test Site to Learn Web Scraping – ScrapingCourse.com</title>
<!--- ... --->
</head>
<body class="home archive ...">
<p class="woocommerce-result-count" id="result-count" data-testid="result-count" data-sorting="count"> Showing 1–16 of 188 results</p>
<!--- ... --->
<ul class="products columns-4" id="product-list" data-testid="product-list" data-products="list">
<!--- ... --->
</ul>
<!--- ... --->
</body>
</html>
Congratulations! You've successfully extracted the HTML content of the target page using the cURL HTTP client.
If you get blocked while trying to get the HTML of a webpage, consider using a web scraping API, such as ZenRows. ZenRows works perfectly with PHP and provides a complete toolkit to bypass anti-bot systems and scrape uninterrupted.
Step #3: Parse Data
There are two primary methods of parsing data from HTML: CSS selectors and XPath. You can learn more about the differences between these methods in our XPath vs CSS Selector guide. For this tutorial, we'll use CSS selectors because they are simpler and more user-friendly.
To start, we'll extract the product name. Use your browser's DevTools to locate the HTML element that contains the product name. First, open the target webpage in your browser. Right-click on the first product name and select "Inspect" or "Inspect Element". The DevTools window will open, highlighting the HTML element. You'll see that the product names are inside <h2> tags with the class woocommerce-loop-product__title.
With this information, you can extract the product names. Use the find method of the Dom object to locate all elements that match the given CSS selector.
Here's the modified script to parse the product names:
<?php
require "vendor/autoload.php";
use PHPHtmlParser\Dom;
// specify the target website's URL
$url = "https://scrapingcourse.com/ecommerce/";
// initialize a cURL session
$curl = curl_init();
// set the website URL
curl_setopt($curl, CURLOPT_URL, $url);
// return the response as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
// follow redirects
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
// ignore SSL verification
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
// execute the cURL session
$htmlContent = curl_exec($curl);
// check for errors
if ($htmlContent === false) {
    // handle the error
    $error = curl_error($curl);
    echo "cURL error: " . $error;
    exit;
}
// close cURL session
curl_close($curl);
// parse the HTML content using php-html-parser
$dom = new Dom;
$dom->loadStr($htmlContent);
// extract product names
$productNames = $dom->find(".woocommerce-loop-product__title");
foreach ($productNames as $name) {
    echo "Product Name: " . $name->text . "\n";
}
?>
The code will print out the extracted product names:
Product Name: Abominable Hoodie
Product Name: Adrienne Trek Jacket
Product Name: Aeon Capri
Product Name: Aero Daily Fitness Tee
Product Name: Aether Gym Pant
Product Name: Affirm Water Bottle
Product Name: Aim Analog Watch
Product Name: Ajax Full-Zip Sweatshirt
Product Name: Ana Running Short
Product Name: Angel Light Running Short
Product Name: Antonia Racer Tank
Product Name: Apollo Running Short
Product Name: Arcadio Gym Short
Product Name: Argus All-Weather Tank
Product Name: Ariel Roll Sleeve Sweatshirt
Product Name: Artemis Running Short
Now, let's extract all the product details, including the product name, price, and image URL.
To get the required data from all the elements, you need to extract product information from each parent element containing the target content.
Inspect the first product container. You'll see that each product is inside a list element with the product class.
To extract the product price, use the CSS selector .price .woocommerce-Price-amount. In the DevTools, notice that the product price is contained within a <span> element with the class woocommerce-Price-amount, which is nested inside another <span> element with the class price.
Use the strip_tags function to get the text content without any HTML tags. Also, use the html_entity_decode function to convert any HTML entities, such as &#36;, into their characters (here, the dollar sign).
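The two cleanup steps can be tried in isolation. The sketch below uses a hypothetical helper, cleanPrice, and an assumed markup fragment that resembles what innerHtml may return for a price element. The real markup on the page may differ.

```php
<?php
// Hypothetical helper: strips markup from a price fragment and decodes entities.
function cleanPrice(string $innerHtml): string
{
    // remove nested tags first, so decoded text is never mistaken for markup
    $text = strip_tags($innerHtml);
    // convert entities such as &#36; into the characters they represent
    $text = html_entity_decode($text);
    return trim($text);
}

// Assumed fragment, for illustration only
$fragment = '<bdi><span class="woocommerce-Price-currencySymbol">&#36;</span>69.00</bdi>';

echo cleanPrice($fragment) . "\n"; // prints $69.00
```

Stripping tags before decoding is a deliberate order: decoding first could turn escaped text like &lt;b&gt; into real tags that strip_tags would then remove.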
In the DevTools, you'll find that each product image is contained within an <img> tag, and the image's URL is specified in the src attribute. Use the getAttribute method to retrieve this src attribute value, which provides the URL of the product image.
Here's the code implementing this parsing logic:
<?php
require "vendor/autoload.php";
use PHPHtmlParser\Dom;
// specify the target website's URL
$url = "https://scrapingcourse.com/ecommerce/";
// initialize a cURL session
$curl = curl_init();
// set the website URL
curl_setopt($curl, CURLOPT_URL, $url);
// return the response as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
// follow redirects
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
// ignore SSL verification
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
// execute the cURL session
$htmlContent = curl_exec($curl);
// check for errors
if ($htmlContent === false) {
    // handle the error
    $error = curl_error($curl);
    echo "cURL error: " . $error;
    exit;
}
// close cURL session
curl_close($curl);
// parse the HTML content using php-html-parser
$dom = new Dom;
$dom->loadStr($htmlContent);
// extract product elements
$productElements = $dom->find(".product");
// initialize an array to hold product data
$products = [];
// iterate over each product element to extract data
foreach ($productElements as $element) {
    // extract product name
    $productName = $element->find(".woocommerce-loop-product__title")->text;
    // extract product price; find() returns a collection object, which is
    // always truthy, so check how many elements it matched instead
    $priceElement = $element->find(".price .woocommerce-Price-amount");
    $productPrice = count($priceElement) > 0 ? strip_tags($priceElement->innerHtml) : "N/A";
    $productPrice = html_entity_decode($productPrice);
    // extract product image URL
    $productImage = $element->find("img")->getAttribute("src");
    // collect product data into an array
    $products[] = [
        "name" => $productName,
        "price" => $productPrice,
        "image" => $productImage
    ];
}
// print the extracted product data
foreach ($products as $product) {
    echo "Product Name: " . $product["name"] . "\n";
    echo "Product Price: " . $product["price"] . "\n";
    echo "Product Image URL: " . $product["image"] . "\n";
    echo "-----------------------------------\n";
}
?>
You'll get the following output displaying all the product information:
Product Name: Abominable Hoodie
Product Price: $69.00
Product Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg
-----------------------------------
Product Name: Adrienne Trek Jacket
Product Price: $57.00
Product Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_main-324x324.jpg
-----------------------------------
Product Name: Aeon Capri
Product Price: $48.00
Product Image URL: https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wp07-black_main-324x324.jpg
-----------------------------------
// omitted for brevity
Congratulations! You've successfully extracted the complete product data.
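One detail about checking whether a selector matched anything: PHP treats every object as truthy, so a result set that is an object passes a plain truthiness test even when it is empty. The sketch below illustrates this with the built-in ArrayObject as a stand-in, on the assumption that the library's find() result is a countable collection object. The helper textOrDefault is hypothetical.

```php
<?php
// ArrayObject stands in for a collection of matched elements with no matches
$noMatches = new ArrayObject([]);

// any object is truthy in PHP, even when it holds no items
var_dump((bool) $noMatches);     // bool(true)

// count() reflects whether anything was actually matched
var_dump(count($noMatches) > 0); // bool(false)

// Hypothetical helper: returns a fallback when a countable result set is empty
function textOrDefault(Countable $matches, callable $extract, string $default = "N/A"): string
{
    return count($matches) > 0 ? $extract($matches) : $default;
}

// the callback is never invoked here because there is nothing to extract from
echo textOrDefault($noMatches, fn($m) => "unused") . "\n"; // prints N/A
```

In the scraper, a call along the lines of `textOrDefault($element->find(".price .woocommerce-Price-amount"), fn($m) => strip_tags($m->innerHtml))` would follow the same pattern, provided the assumption about the return type holds.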
Step #4: Export Data to CSV
Now, it's time to export the extracted data to a CSV file. Define the path for the CSV file and open it in write mode. Add headers to label the columns appropriately. Iterate over the array of products and write each product's data to the CSV file. Finally, close the CSV file to ensure all data is properly saved.
Modify your previous script, and your final code should look like this:
<?php
require "vendor/autoload.php";
use PHPHtmlParser\Dom;
// specify the target website's URL
$url = "https://scrapingcourse.com/ecommerce/";
// initialize a cURL session
$curl = curl_init();
// set the website URL
curl_setopt($curl, CURLOPT_URL, $url);
// return the response as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
// follow redirects
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
// ignore SSL verification
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
// execute the cURL session
$htmlContent = curl_exec($curl);
// check for errors
if ($htmlContent === false) {
    // handle the error
    $error = curl_error($curl);
    echo "cURL error: " . $error;
    exit;
}
// close cURL session
curl_close($curl);
// parse the HTML content using php-html-parser
$dom = new Dom;
$dom->loadStr($htmlContent);
// extract product elements
$productElements = $dom->find(".product");
// initialize an array to hold product data
$products = [];
// iterate over each product element to extract data
foreach ($productElements as $element) {
    // extract product name
    $productName = $element->find(".woocommerce-loop-product__title")->text;
    // extract product price; find() returns a collection object, which is
    // always truthy, so check how many elements it matched instead
    $priceElement = $element->find(".price .woocommerce-Price-amount");
    $productPrice = count($priceElement) > 0 ? strip_tags($priceElement->innerHtml) : "N/A";
    $productPrice = html_entity_decode($productPrice);
    // extract product image URL
    $productImage = $element->find("img")->getAttribute("src");
    // collect product data into an array
    $products[] = [
        "name" => $productName,
        "price" => $productPrice,
        "image" => $productImage
    ];
}
// define the CSV file path
$csvFile = "products.csv";
// open the CSV file for writing
$fp = fopen($csvFile, "w");
// add the headers to the CSV file
fputcsv($fp, ["Name", "Price", "Image URL"]);
// add the product data to the CSV file
foreach ($products as $product) {
    fputcsv($fp, [$product["name"], $product["price"], $product["image"]]);
}
// close the CSV file
fclose($fp);
echo "Data successfully exported to $csvFile\n";
?>
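Open products.csv in a spreadsheet or text editor to check the result: it contains a Name, Price, Image URL header row followed by one row per product.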
Congratulations! Your scraper now saves the extracted data into a CSV file.
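One way to check the export without opening a spreadsheet is to read the file back with fgetcsv. The sketch below is self-contained: it writes two hypothetical sample rows (with placeholder example.com image URLs) to a temporary file in the same way as the script above, then reads them back. The separator, enclosure, and escape arguments are passed explicitly, using PHP's long-standing default values.

```php
<?php
// Hypothetical sample rows, shaped like the $products array built by the scraper
$products = [
    ["name" => "Abominable Hoodie", "price" => '$69.00', "image" => "https://example.com/hoodie.jpg"],
    ["name" => "Aeon Capri", "price" => '$48.00', "image" => "https://example.com/capri.jpg"],
];

// write the rows to a temporary file, header first
$csvFile = tempnam(sys_get_temp_dir(), "products_");
$fp = fopen($csvFile, "w");
if ($fp === false) {
    exit("Could not open $csvFile for writing\n");
}
fputcsv($fp, ["Name", "Price", "Image URL"], ",", '"', "\\");
foreach ($products as $product) {
    fputcsv($fp, [$product["name"], $product["price"], $product["image"]], ",", '"', "\\");
}
fclose($fp);

// read the file back and collect every row
$rows = [];
$fp = fopen($csvFile, "r");
while (($row = fgetcsv($fp, 0, ",", '"', "\\")) !== false) {
    $rows[] = $row;
}
fclose($fp);
unlink($csvFile);

// the first row is the header; the rest should match the written products
$header = array_shift($rows);
echo implode(" | ", $header) . "\n";
echo count($rows) . " product rows read back\n";
```

Checking fopen's return value, as done here, also guards against a path that is not writable, which the export would otherwise only surface as warnings.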
Conclusion
In this tutorial, you've learned how to integrate the php-html-parser library into your PHP project, fetch HTML content using cURL, and extract specific data such as product names, prices, and image URLs. You also learned how to export the extracted data to a CSV file.
This article provides a solid introduction to HTML parsing in PHP. To dive deeper into more advanced topics, including handling JavaScript-heavy pages and parsing multiple pages, check out our tutorial on web scraping with PHP. This will help you expand your scraping skills and tackle more complex scraping tasks.