Roach PHP - Complete Web Scraping toolkit for PHP

Roach is a complete web scraping toolkit for PHP. It is heavily inspired by (read: a shameless clone of) the popular Scrapy package for Python.

Roach allows us to define spiders that crawl and scrape web documents. But wait, there’s more. Roach isn’t just a simple crawler; it also includes an entire pipeline to clean, persist, and otherwise process the extracted data. It’s your all-in-one resource for web scraping in PHP.

Framework Agnostic

Roach doesn’t depend on a specific framework. Instead, you can use the core package on its own or install one of the framework-specific adapters. Currently, there’s a first-party adapter for using Roach in your Laravel projects, with more on the way.

Built With Extensibility in Mind

Roach is built from the ground up with extensibility in mind. In fact, most of Roach’s built-in behavior works exactly the same way as any custom extension or middleware would.

Want to store the scraped information in your persistence layer of choice? Roach has got you covered: just write an appropriate item processor.

Want to add custom HTTP headers to every outgoing request based on some condition? Sure thing, sounds like a job for a downloader middleware.

Want to post a message to the company Slack after a run has finished to gloat about how well your spider works? I... guess you could write an extension for that and listen for the corresponding event.
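As a rough sketch of the first two cases, here is what an item processor and a downloader (request) middleware might look like. It assumes the `ItemProcessorInterface`, `RequestMiddlewareInterface`, and `Configurable` trait shipped with `roach-php/core`, so it is worth checking against the version you have installed. The class names, the `X-Crawl-Section` header, and the output file are made up for illustration.

```php
<?php

declare(strict_types=1);

use RoachPHP\Downloader\Middleware\RequestMiddlewareInterface;
use RoachPHP\Http\Request;
use RoachPHP\ItemPipeline\ItemInterface;
use RoachPHP\ItemPipeline\Processors\ItemProcessorInterface;
use RoachPHP\Support\Configurable;

// Hypothetical item processor: appends every scraped item to a JSON Lines file.
final class JsonLinesItemProcessor implements ItemProcessorInterface
{
    use Configurable;

    public function processItem(ItemInterface $item): ItemInterface
    {
        // One JSON document per line; LOCK_EX guards against interleaved writes.
        file_put_contents(
            $this->outputPath(),
            json_encode($item->all()) . PHP_EOL,
            FILE_APPEND | LOCK_EX,
        );

        // Return the item so later processors in the pipeline still receive it.
        return $item;
    }

    // Assumed location, chosen only for this example.
    private function outputPath(): string
    {
        return sys_get_temp_dir() . '/roach-items.jsonl';
    }
}

// Hypothetical downloader middleware: adds a header only to documentation requests.
final class DocsHeaderMiddleware implements RequestMiddlewareInterface
{
    use Configurable;

    public function handleRequest(Request $request): Request
    {
        // The condition: tag only requests whose URL contains "/docs/".
        if (str_contains($request->getUri(), '/docs/')) {
            return $request->addHeader('X-Crawl-Section', 'docs');
        }

        return $request;
    }
}
```

Both classes would then be registered on a spider, presumably through its `$itemProcessors` and `$downloaderMiddleware` properties.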

Installing Roach as a Standalone Package

Most projects will want to install one of Roach’s framework-specific adapters to help cut down on the configuration and boilerplate necessary. If you want to use Roach as a standalone package, however, you can do so by installing the core package via Composer.

composer require roach-php/core

That’s all there is to it.

Using Roach Inside a Framework

First-party adapter packages help you integrate Roach seamlessly with your favorite web framework. Check out the corresponding documentation to learn more.

Spiders

Define how websites get crawled and how data is scraped from their pages.

Spiders are classes which define how a website will get processed. This includes both crawling for links and extracting data from specific pages (scraping).

Example Spider

It's easiest to explain all the different parts of a spider by looking at an example. Here's a spider that extracts the title and subtitle of a page from this very documentation.

<?php

use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

class RoachDocsSpider extends BasicSpider
{
    /**
     * @var string[]
     */
    public array $startUrls = [
        'https://roach-php.dev/docs/spiders'
    ];

    public function parse(Response $response): \Generator
    {
        $title = $response->filter('h1')->text();

        $subtitle = $response
            ->filter('main > div:nth-child(2) p:first-of-type')
            ->text();

        yield $this->item([
            'title' => $title,
            'subtitle' => $subtitle,
        ]);
    }
}

Here’s how this spider will be processed:

1. Roach starts by sending requests to all URLs defined inside the $startUrls property of the spider. In our case, there’s only the single URL https://roach-php.dev/docs/spiders.
2. The response of each request gets passed to the parse method of the spider.
3. Inside the parse method, we filter the response using CSS selectors to extract both the title and subtitle. Check out the page on scraping responses for more information.
4. We then yield an item from our method by calling $this->item(...) and passing in an array of our data.
5. The item then gets sent through the item processing pipeline.
6. Since there are no further requests to be sent, the spider closes.
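
To see these steps in action outside of a framework, the spider can be started from a plain PHP script. The sketch below assumes the `RoachPHP\Roach` entry point and its `collectSpider()` method from `roach-php/core`, and that the spider class lives in a hypothetical `RoachDocsSpider.php` file next to the script. It also needs network access to reach the start URL.

```php
<?php

declare(strict_types=1);

use RoachPHP\Roach;

require __DIR__ . '/vendor/autoload.php';
require __DIR__ . '/RoachDocsSpider.php'; // hypothetical file containing the spider above

// Runs the spider until no requests are left and returns every item
// that made it through the item processing pipeline.
$items = Roach::collectSpider(RoachDocsSpider::class);

foreach ($items as $item) {
    // get() reads a single key from the array that was passed to $this->item().
    printf("%s: %s\n", $item->get('title'), $item->get('subtitle'));
}
```

If the calling code doesn't need the items, `Roach::startSpider()` should run the spider the same way without collecting them.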