Do you want to give your PHP web scraper full-fledged data extraction capabilities? Roach, a Scrapy-like PHP scraping framework, will help you achieve that.
In this tutorial, you'll explore how Roach works and how to build a web scraper, including more advanced features such as web crawling and custom pipeline processing. You'll learn how to:
- Create a spider to get a page's HTML.
- Extract specific data elements.
- Scrape multiple pages.
- Process and store data with custom pipelines.
Let's go!
Why Use Roach PHP for Web Scraping?
Roach is a feature-rich web scraping and crawling framework for PHP. Like Python's Scrapy, it uses individual spiders to handle each scraping job. Under the hood, it relies on Symfony's DomCrawler component to parse HTML content.
The library has an item pipeline to facilitate data processing after extraction. If your PHP web scraper demands more features as you scale, you can write custom middleware and extensions or use built-in ones.
Roach also provides adapters for Laravel and Symfony integration, making the library easy to use directly in your web application. Although Roach doesn't support browser automation by default, you can install Browsershot, a third-party extension, to execute JavaScript while scraping.
Ready to scrape with Roach? Let's start with the prerequisites.
Prerequisites
Roach requires PHP version 8.1 or above. This tutorial uses PHP 8.2 on a Windows operating system.
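You can confirm your installed PHP version by running the following command in your terminal; the output should show 8.1 or higher:
php -v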
You can code along using any integrated development environment (IDE); this tutorial uses VS Code.
Now, let's create your Roach project.
Create a Project
Roach follows the regular PHP project pattern. To start, create a new project folder on your computer.
Navigate to that folder in your terminal and run the following Composer command to create a new PHP project:
composer init
Respond to the interactive prompts by pressing "Enter" to accept the defaults, or typing "n" and pressing "Enter" where relevant.
Open your project via your code editor, and you'll see a composer.json file. Open that file and reconfigure it like so:
{
    "name": "<YOUR_USERNAME>/myscraper",
    "type": "project",
    "autoload": {
        "psr-4": {
            "App\\": "src/"
        }
    },
    "authors": [
        {
            "name": "<YOUR NAME>",
            "email": "<YOUR_EMAIL_ADDRESS>"
        }
    ],
    "require": {
        "php": "^8.2"
    }
}
The next step is to install the Roach package.
Install Roach PHP
At the time of writing, the latest Roach release is in the 3.x series. Run the following command to install the latest stable version:
composer require roach-php/core
Your composer.json file should now require the latest version of Roach, as shown:
{
    // ...
    "require": {
        // ...,
        "roach-php/core": "^3.2"
    }
}
If you use Windows, you may encounter cURL certificate authority (CA) issues while running Roach because Windows doesn't include a CA bundle by default. A quick fix is to download a CA certificate bundle and add its path to your php.ini file.
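For example, if you download the Mozilla cacert.pem bundle and save it on disk (the path below is only an example; adjust it to wherever you stored the file), you can point PHP's cURL and OpenSSL extensions to it in php.ini:
; php.ini
curl.cainfo = "C:\php\extras\ssl\cacert.pem"
openssl.cafile = "C:\php\extras\ssl\cacert.pem"
Restart your terminal session afterward so the new setting takes effect.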
Let's set up your project folders and files in the next section.
Set Up Project Folders and Files
The recommended way to build your Roach scraper is to write the spider class inside a dedicated Spiders folder. An advantage of this approach is that it lets you run multiple spiders from a single PHP index file.
Create a new src/Spiders directory in your project root folder, and add a new Scraper.php file to that directory. Then, create an index.php file in your project root. This file will handle individual spider execution.
Your project directory should look like this:
projectRoot
├─ composer.json
├─ composer.lock
├─ index.php
├─ src
│  └─ Spiders
│     └─ Scraper.php
└─ vendor
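If Composer later complains that it can't find your App\ classes (for example, after changing the autoload section of composer.json), regenerate the autoloader:
composer dump-autoload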
Once done, you're all set and ready to build your first Roach scraper!
Tutorial: Your First Web Scraper with Roach
To learn how Roach works, you'll use it to scrape product information from Scraping Course, a demo e-commerce website, starting with full-page HTML extraction. You'll then scrape specific elements before moving on to advanced concepts such as crawling and data storage.
This is what the target website looks like:
Let's scrape it!
Step 1: Create Your Spider to Get the Page's HTML
A Roach spider is a PHP class that requests and parses HTML responses. Let's see how to get the target website's full-page HTML using Roach.
Open your Scraper.php file and import the following modules:
// specify the namespace
namespace App\Spiders;
// import the required modules
use Generator;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Downloader\Middleware\UserAgentMiddleware;
Define a Scraper class that extends Roach's BasicSpider class. The class begins with a startUrls array containing the target URL; Roach automatically dispatches a request to each URL in this array:
// ...
class Scraper extends BasicSpider
{
    // define the URLs to scrape
    /**
     * @var string[]
     */
    public array $startUrls = [
        "https://www.scrapingcourse.com/ecommerce/"
    ];
}
Next, specify a downloader middleware array that includes the built-in request deduplication and User Agent middleware. The request deduplication middleware prevents Roach from dispatching duplicate requests to the same URL, while the User Agent middleware sets a browser-like User-Agent header to boost your spider's success rate:
// ...
class Scraper extends BasicSpider
{
    // ...

    // declare download middleware
    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
        [UserAgentMiddleware::class, ["userAgent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"]],
    ];
}
Finally, define the parse method to collect the website's response as HTML. This method returns a generator and yields the extracted content to the item pipeline:
// ...
class Scraper extends BasicSpider
{
    // ...

    // create the parser function
    public function parse(Response $response): Generator
    {
        // extract the entire HTML content of the page
        $html = $response->getBody();

        yield $this->item([
            "html" => $html,
        ]);
    }
}
Combine the snippets, and you'll get this final code:
<?php

// specify the namespace
namespace App\Spiders;

// import the required modules
use Generator;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Downloader\Middleware\UserAgentMiddleware;

class Scraper extends BasicSpider
{
    // define the URLs to scrape
    /**
     * @var string[]
     */
    public array $startUrls = [
        "https://www.scrapingcourse.com/ecommerce/"
    ];

    // declare download middleware
    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
        [UserAgentMiddleware::class, ["userAgent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"]],
    ];

    // create the parser function
    public function parse(Response $response): Generator
    {
        // extract the entire HTML content of the page
        $html = $response->getBody();

        yield $this->item([
            "html" => $html,
        ]);
    }
}
Now, it's time to run the spider.
Open the index.php file in your project root folder. Point to the autoload.php file and import Roach and the Scraper class. Then, execute the Scraper class:
<?php
// require the autoload file
require "vendor/autoload.php";
// import Roach and the Scraper class
use App\Spiders\Scraper;
use RoachPHP\Roach;
// execute the spider to extract the HTML content
Roach::collectSpider(Scraper::class);
Run the index.php file as shown:
php index.php
The spider outputs the target website's HTML as expected. See the result below:
<!DOCTYPE html>
<html lang="en-US">
<head>
    <!-- ... -->
    <title>Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com</title>
    <!-- ... -->
</head>
<body class="home archive ...">
    <p class="woocommerce-result-count" id="result-count">Showing 1-16 of 188 results</p>
    <ul class="products columns-4" id="product-list">
        <!-- ... -->
    </ul>
</body>
</html>
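Roach::collectSpider() also returns the scraped items, so you can process the results directly in index.php. Here's a minimal sketch, assuming each returned item exposes its data via the all() method (the same method used by the custom pipeline later in this tutorial):
<?php
// require the autoload file
require "vendor/autoload.php";

// import Roach and the Scraper class
use App\Spiders\Scraper;
use RoachPHP\Roach;

// run the spider and collect the scraped items as an array
$items = Roach::collectSpider(Scraper::class);

// print each item's data as JSON
foreach ($items as $item) {
    echo json_encode($item->all()) . PHP_EOL;
}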
Congratulations, you've built your first Roach spider! Take it further by targeting specific product elements.
Step 2: Extract Specific Data
To extract specific data, first inspect the target page to identify its CSS selectors. You'll then obtain each element using its CSS selector.
Let's extract product names, prices, and image URLs to see how this works.
Open the target website via a browser like Chrome. Right-click the first product and go to Inspect. You'll see that each product's information is inside a list item (li) tag:
To scrape the product information under each product container (the li tag), obtain each container and loop through them, extracting the target elements with their CSS selectors. You'll also need to import Symfony's DomCrawler Crawler class, as shown in the full code further down. Modify the previous parse function like this:
// ...
class Scraper extends BasicSpider
{
    // ...

    // create the parser function
    public function parse(Response $response): Generator
    {
        // extract each product container and loop through it
        $items = $response
            ->filter("li.product")
            ->each(fn(Crawler $node) => [
                "name" => $node->filter(".woocommerce-loop-product__title")->text(),
                "price" => $node->filter(".price span")->text(),
                "url" => $node->filter("a")->link()->getUri(),
                "image" => $node->filter("img")->attr("src"),
            ]);

        // pass the extracted content to an item pipeline
        foreach ($items as $item) {
            yield $this->item($item);
        }
    }
}
Here's the updated full code:
<?php

// specify the namespace
namespace App\Spiders;

// import the required modules
use Generator;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Downloader\Middleware\UserAgentMiddleware;
use Symfony\Component\DomCrawler\Crawler;

class Scraper extends BasicSpider
{
    // define the URLs to scrape
    /**
     * @var string[]
     */
    public array $startUrls = [
        "https://www.scrapingcourse.com/ecommerce/"
    ];

    // declare download middleware
    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
        [UserAgentMiddleware::class, ["userAgent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"]],
    ];

    // create the parser function
    public function parse(Response $response): Generator
    {
        // extract each product container and loop through it
        $items = $response
            ->filter("li.product")
            ->each(fn(Crawler $node) => [
                "name" => $node->filter(".woocommerce-loop-product__title")->text(),
                "price" => $node->filter(".price span")->text(),
                "url" => $node->filter("a")->link()->getUri(),
                "image" => $node->filter("img")->attr("src"),
            ]);

        // pass the extracted content to an item pipeline
        foreach ($items as $item) {
            yield $this->item($item);
        }
    }
}
Execute the code by running the index.php file. You'll get the following output:
{
    "name": "Abominable Hoodie",
    "price": "$69",
    "url": "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/",
    "image": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg"
}

// ... other products omitted for brevity

{
    "name": "Artemis Running Short",
    "price": "$45",
    "url": "https://www.scrapingcourse.com/ecommerce/product/artemis-running-short/",
    "image": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wsh04-black_main-324x324.jpg"
}
Great job! Your Roach spider now extracts specific data.
Still, Roach has a few more tricks up its sleeve. Let's try implementing advanced features like crawling and custom pipeline processors.
Advanced Web Scraping Techniques With Roach PHP
In this section, you'll use Roach to execute advanced tasks, including scraping multiple pages and customizing an item pipeline to store the scraped data in a CSV file.
Crawling With Roach to Scrape Multiple Pages
The previous scraper only scrapes one page. However, the target website (ScrapingCourse) contains a paginated product list with a navigation bar whose next button opens each subsequent page. To scrape multiple pages, you need to crawl the website.
Roach supports crawling by default. To scrape paginated websites, it follows the next page link to extract content from each page.
Let's inspect the next page element before proceeding.
Right-click the next button on the navigation bar and click Inspect. The navigation button has the class name .next. This element is no longer present once you reach the last page:
Your spider only requires a few modifications to follow and scrape all 12 product pages.
Extract the next page link element and add logic to terminate the crawl once Roach gets to the last page and can't find the next page button. Note the count() check: filter() always returns a Crawler object, so you must confirm that it actually matched a node before following the link:
// ...

// create the parser function
public function parse(Response $response): Generator
{
    // ...

    // find the next page link element
    $nextPageLink = $response->filter(".next");

    // follow the next page link only if it exists on the current page
    if ($nextPageLink->count() > 0) {
        $nextPageUrl = $nextPageLink->link()->getUri();
        yield $this->request("GET", $nextPageUrl);
    }
}
Merge the snippet with your spider, and you'll get this updated final code:
<?php

// specify the namespace
namespace App\Spiders;

// import the required modules
use Generator;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Downloader\Middleware\UserAgentMiddleware;
use Symfony\Component\DomCrawler\Crawler;

class Scraper extends BasicSpider
{
    // define the URLs to scrape
    /**
     * @var string[]
     */
    public array $startUrls = [
        "https://www.scrapingcourse.com/ecommerce/"
    ];

    // declare download middleware
    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
        [UserAgentMiddleware::class, ["userAgent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"]],
    ];

    // create the parser function
    public function parse(Response $response): Generator
    {
        // extract each product container and loop through it
        $items = $response
            ->filter("li.product")
            ->each(function (Crawler $node) {
                return [
                    "name" => $node->filter(".woocommerce-loop-product__title")->text(),
                    "price" => $node->filter(".price span")->text(),
                    "url" => $node->filter("a")->link()->getUri(),
                    "image" => $node->filter("img")->attr("src"),
                ];
            });

        // pass the extracted content to an item pipeline
        foreach ($items as $item) {
            yield $this->item($item);
        }

        // find the next page link element
        $nextPageLink = $response->filter(".next");

        // follow the next page link only if it exists on the current page
        if ($nextPageLink->count() > 0) {
            $nextPageUrl = $nextPageLink->link()->getUri();
            yield $this->request("GET", $nextPageUrl);
        }
    }
}
The code crawls the entire website and scrapes all its content. Here's the output:
{
    "name": "Abominable Hoodie",
    "price": "$69",
    "url": "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/",
    "image": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main-324x324.jpg"
}

// ... other products omitted for brevity

{
    "name": "Zoltan Gym Tee",
    "price": "$29.00",
    "url": "https://www.scrapingcourse.com/ecommerce/product/zoltan-gym-tee/",
    "image": "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/ms06-blue_main-324x324.jpg"
}
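Before unleashing a crawler on a real website, it's good practice to throttle it. Roach's BasicSpider exposes concurrency and request delay settings you can override in your spider; the values below are just an example, so check the properties available in the Roach version you're using:
// ...
class Scraper extends BasicSpider
{
    // ...

    // limit how many requests Roach sends in parallel (example value)
    public int $concurrency = 1;

    // wait this many seconds between consecutive requests (example value)
    public int $requestDelay = 2;
}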
Avoid Getting Blocked When Scraping With Roach
Anti-bot measures are the biggest threats to any web scraping project. You'll need to find a way to handle them and scrape without getting blocked.
Although Roach supports customizing the User Agent via middleware, that alone isn't enough to avoid anti-bot detection. For example, a heavily protected website like G2 Reviews will block Roach.
Try it out with the following code:
<?php

// specify the namespace
namespace App\Spiders;

// import the required modules
use Generator;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Downloader\Middleware\UserAgentMiddleware;
use RoachPHP\Roach;

class Scraper extends BasicSpider
{
    // define the URLs to scrape
    /**
     * @var string[]
     */
    public array $startUrls = [
        "https://www.g2.com/products/asana/reviews"
    ];

    // declare download middleware
    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
        [UserAgentMiddleware::class, ["userAgent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"]],
    ];

    // create the parser function
    public function parse(Response $response): Generator
    {
        // extract the entire HTML content of the page
        $html = $response->getBody();

        yield $this->item([
            "html" => $html,
        ]);
    }
}
The Roach spider above gets blocked by Cloudflare:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <!-- ... -->
    <title>Attention Required! | Cloudflare</title>
</head>
This doesn't bode well for web scraping at scale. However, there are solutions to this problem.
The most effective way to extract content without detection is to use a web scraping API, such as ZenRows. It's an all-in-one toolkit that bypasses CAPTCHAs and other anti-bot systems. You can also use it as a headless browser, and ZenRows will modify your request headers and auto-rotate premium proxies, helping you scrape any website like a legitimate user.
ZenRows also integrates perfectly with all programming languages. Let's see how to use it as your PHP web scraper by scraping the G2 Reviews page that blocked you previously.
Sign up to open the Request Builder. Paste the target URL in the link box, activate Premium Proxies, and set the Boost mode to JS Rendering. Select PHP as your preferred language and choose the API connection mode. Click "Try it" to run the code inside the builder. You can also copy and paste the generated code into your PHP scraper file to run it locally.
The generated code should look like this:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fasana%2Freviews&js_render=true&premium_proxy=true');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
echo $response . PHP_EOL;
curl_close($ch);
?>
Running the above code extracts the protected website's full-page HTML:
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <link href="https://www.g2.com/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <title>Asana Reviews, Pros + Cons, and Top Rated Features</title>
</head>
<body>
    <!-- other content omitted for brevity -->
</body>
</html>
Congratulations! Your PHP web scraper now bypasses anti-bot detection using ZenRows.
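If you want more than the raw HTML, you can feed the ZenRows response into Symfony's DomCrawler component, which is already installed as a Roach dependency, and reuse the same selector-based extraction you used earlier. Here's a minimal sketch; the title selector is just an example:
<?php
// require the autoload file
require "vendor/autoload.php";

use Symfony\Component\DomCrawler\Crawler;

// fetch the page through ZenRows (same request as the generated snippet above)
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://api.zenrows.com/v1/?apikey=<YOUR_ZENROWS_API_KEY>&url=https%3A%2F%2Fwww.g2.com%2Fproducts%2Fasana%2Freviews&js_render=true&premium_proxy=true');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($ch);
curl_close($ch);

// load the returned HTML into DomCrawler and extract the page title
$crawler = new Crawler($html);
echo $crawler->filter("title")->text() . PHP_EOL;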
Process and Store Data With Custom Pipelines
The item pipeline in Roach allows you to modify and store data during web scraping.
You'll need a custom storage pipeline processor to store the extracted data in a CSV file. Let's write that custom processor to see how it works.
Each custom pipeline should handle a specific task. This technique makes your code more modular and less verbose.
Create a new ItemPipeline folder in your src directory. Then, add a new CsvExportProcessor.php file to this folder. Your new project structure should look like this:
projectRoot
├─ composer.json
├─ composer.lock
├─ index.php
├─ src
│  ├─ Spiders
│  │  └─ Scraper.php
│  └─ ItemPipeline
│     └─ CsvExportProcessor.php
└─ vendor
Import the required modules and create a new CSV processor class that implements Roach's item processor interface. Use the Configurable trait, and define a constructor that opens a new CSV file in append mode:
// specify the namespace
namespace App\ItemPipeline;

// import the required modules
use RoachPHP\ItemPipeline\ItemInterface;
use RoachPHP\ItemPipeline\Processors\ItemProcessorInterface;
use RoachPHP\Support\Configurable;

class CsvExportProcessor implements ItemProcessorInterface
{
    // allow configuration
    use Configurable;

    // file handle for the CSV output
    private $file;

    public function __construct()
    {
        // open the CSV file in append mode
        $this->file = fopen("product.csv", "a");
    }
}
Now, define an item processor method inside that class. This method receives each scraped item from the pipeline and writes its data to a new row of the CSV file. Finally, close the file in the destructor:
// ...

public function processItem(ItemInterface $item): ItemInterface
{
    // obtain the extracted data from the item
    $data = $item->all();

    // write the data to the CSV file
    fputcsv($this->file, $data);

    return $item;
}

// close the file
public function __destruct()
{
    fclose($this->file);
}
Here's the complete code after merging both snippets:
<?php

// specify the namespace
namespace App\ItemPipeline;

// import the required modules
use RoachPHP\ItemPipeline\ItemInterface;
use RoachPHP\ItemPipeline\Processors\ItemProcessorInterface;
use RoachPHP\Support\Configurable;

class CsvExportProcessor implements ItemProcessorInterface
{
    // allow configuration
    use Configurable;

    // file handle for the CSV output
    private $file;

    public function __construct()
    {
        // open the CSV file in append mode
        $this->file = fopen("product.csv", "a");
    }

    public function processItem(ItemInterface $item): ItemInterface
    {
        // obtain the extracted data from the item
        $data = $item->all();

        // write the data to the CSV file
        fputcsv($this->file, $data);

        return $item;
    }

    // close the file
    public function __destruct()
    {
        fclose($this->file);
    }
}
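If you'd also like a header row in the CSV file, one option (a small variation on the constructor above, not part of the original processor) is to write it whenever the file is still empty. The column names below simply mirror the item keys yielded by the spider:
public function __construct()
{
    // open the CSV file in append mode
    $this->file = fopen("product.csv", "a");

    // write a header row only when the file is still empty
    if (filesize("product.csv") === 0) {
        fputcsv($this->file, ["name", "price", "url", "image"]);
    }
}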
The next step is to use your pipeline processor in your Roach spider.
Import the above CSV export pipeline class into your Scraper.php file and add it to the item processors array:
// import the required modules
// ...
use App\ItemPipeline\CsvExportProcessor;

class Scraper extends BasicSpider
{
    // ...

    // add the custom item processor
    public array $itemProcessors = [
        CsvExportProcessor::class
    ];

    // ... the parser function
}
Here's your new Scraper.php file:
<?php

// specify the namespace
namespace App\Spiders;

// import the required modules
use Generator;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Downloader\Middleware\UserAgentMiddleware;
use Symfony\Component\DomCrawler\Crawler;
use App\ItemPipeline\CsvExportProcessor;

class Scraper extends BasicSpider
{
    // define the URLs to scrape
    /**
     * @var string[]
     */
    public array $startUrls = [
        "https://www.scrapingcourse.com/ecommerce/"
    ];

    // declare download middleware
    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
        [UserAgentMiddleware::class, ["userAgent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"]],
    ];

    // add the custom item processor
    public array $itemProcessors = [
        CsvExportProcessor::class
    ];

    // create the parser function
    public function parse(Response $response): Generator
    {
        // extract each product container and loop through it
        $items = $response
            ->filter("li.product")
            ->each(function (Crawler $node) {
                return [
                    "name" => $node->filter(".woocommerce-loop-product__title")->text(),
                    "price" => $node->filter(".price span")->text(),
                    "url" => $node->filter("a")->link()->getUri(),
                    "image" => $node->filter("img")->attr("src"),
                ];
            });

        // pass the extracted content to an item pipeline
        foreach ($items as $item) {
            yield $this->item($item);
        }

        // find the next page link element
        $nextPageLink = $response->filter(".next");

        // follow the next page link only if it exists on the current page
        if ($nextPageLink->count() > 0) {
            $nextPageUrl = $nextPageLink->link()->getUri();
            yield $this->request("GET", $nextPageUrl);
        }
    }
}
Run your index.php file, and you'll see the product.csv file in your project directory:
Great job! You've just stored your scraped data in a CSV file with a custom item pipeline.
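Custom processors can also transform data before it's stored. You can chain them by adding more classes to the $itemProcessors array; each processor receives the item returned by the previous one. For illustration only, here's a hedged sketch of a hypothetical PriceNormalizerProcessor that strips the currency symbol from the price field, assuming ItemInterface exposes the get() and set() methods shown in the Roach documentation:
<?php

namespace App\ItemPipeline;

use RoachPHP\ItemPipeline\ItemInterface;
use RoachPHP\ItemPipeline\Processors\ItemProcessorInterface;
use RoachPHP\Support\Configurable;

// hypothetical example: normalize the scraped price before the CSV processor stores it
class PriceNormalizerProcessor implements ItemProcessorInterface
{
    use Configurable;

    public function processItem(ItemInterface $item): ItemInterface
    {
        // turn a string like "$69.00" into a plain number (assumes get()/set() per the Roach docs)
        $price = (float) str_replace(["$", ","], "", $item->get("price"));
        $item->set("price", $price);

        return $item;
    }
}
To use it, list PriceNormalizerProcessor before CsvExportProcessor in the spider's $itemProcessors array so the CSV receives the cleaned value.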
Conclusion
You've learned how the Roach library works and applied its basic and advanced concepts to extract content in PHP. You now know how to:
- Extract full-page HTML with a basic Roach spider.
- Scrape specific elements with Roach.
- Use Roach's crawling capability to scrape data from multiple pages.
- Store extracted data in a CSV file using a custom pipeline processor.
However, despite these features, Roach can't bypass advanced anti-bot detection systems on its own. To get past all blocks, we recommend using ZenRows, a complete web scraping toolkit that can replace any scraping library and scale your web scraping efforts.
Try ZenRows for free now without a credit card!