spatie/crawler is a PHP package by Freek Van der Herten for crawling websites concurrently using Guzzle promises. It was recently updated to version 9, introducing a new CrawlResponse object, improved scope controls, testing utilities, and more.
Key features include:
- Handling crawl events via closure callbacks and observer classes
- A CrawlResponse object with typed accessors
- Collecting URLs and controlling crawl scope
- Testing with fake()
- And more...
Handling Crawl Events
The crawler supports two approaches for handling crawl events: closure callbacks and observer classes. The closure approach looks like this:
```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;

Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) {
        echo "{$url}: {$response->status()}\n";
    })
    ->start();
```
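The observer-class approach is the second option mentioned above. The post doesn't spell out the observer API, so the base class name, method names, and registration method below are assumptions that simply mirror the closure signatures:

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObserver; // assumed base class name
use Spatie\Crawler\CrawlResponse;

// Hypothetical observer; the method name is assumed to mirror
// the onCrawled() closure signature shown in this post.
class LoggingObserver extends CrawlObserver
{
    public function crawled(string $url, CrawlResponse $response): void
    {
        echo "{$url}: {$response->status()}\n";
    }
}

Crawler::create('https://example.com')
    ->addObserver(new LoggingObserver()) // assumed registration method
    ->start();
```

Observers are handy when the same crawl logic is reused across commands or jobs, since they keep the callbacks in one testable class.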
The onFailed() and onFinished() handlers follow the same pattern for error handling and post-crawl logic. There's also an onWillCrawl() hook that is called before a URL is crawled.
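Assuming those hooks take the same closure shape as onCrawled(), wiring them together might look like this (the exact onFailed() and onWillCrawl() parameters aren't shown in the post, so the signatures here are assumptions):

```php
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->onWillCrawl(function (string $url) { // signature assumed
        echo "About to crawl: {$url}\n";
    })
    ->onFailed(function (string $url, \Throwable $exception) { // signature assumed
        echo "Failed {$url}: {$exception->getMessage()}\n";
    })
    ->onFinished(function () {
        echo "Crawl finished\n";
    })
    ->start();
```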
CrawlResponse
Each crawled URL delivers a CrawlResponse object with typed accessors for common inspection needs:
```php
Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) {
        if ($response->wasRedirected()) {
            echo "Redirected from: " . implode(' → ', $response->redirectHistory()) . "\n";
        }

        $dom = $response->dom(); // Symfony DomCrawler instance
    })
    ->start();
```
The object also exposes body(), header(), and transferStats() for timing data.
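Those accessors could be combined to log response size, content type, and timing. This is a sketch: the header() lookup-by-name and the transferStats() return type are assumptions (Guzzle's TransferStats class does expose getTransferTime()):

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;

Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) {
        $bytes = strlen($response->body());
        $type = $response->header('Content-Type'); // lookup-by-name assumed

        // Assuming transferStats() returns a Guzzle TransferStats instance
        $seconds = $response->transferStats()?->getTransferTime();

        echo "{$url}: {$bytes} bytes, {$type}, {$seconds}s\n";
    })
    ->start();
```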
Collecting URLs and Controlling Scope
With the crawler you can control scope and collect URLs without crawling each link individually. This is useful when you want to crawl a page for links—even filtering by internal links only—and return them without processing:
```php
$urls = Crawler::create('https://example.com')
    ->internalOnly()
    ->depth(3)
    ->foundUrls();
```
Testing with fake()
Spatie consistently ships excellent test helpers with its packages, and the crawler is no different. The fake() method lets you test crawl logic without making real HTTP requests: pass a map of URLs to HTML strings and the crawler uses those as responses:
```php
Crawler::create('https://example.com')
    ->fake([
        'https://example.com' => '<html><a href="/about">About</a></html>',
        'https://example.com/about' => '<html>About page</html>',
    ])
    ->foundUrls();
```
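Because fake() avoids the network entirely, it slots neatly into a unit test. A sketch, assuming a PHPUnit test case and that foundUrls() returns an iterable of URL strings:

```php
use Spatie\Crawler\Crawler;
use PHPUnit\Framework\TestCase;

class CrawlerTest extends TestCase
{
    public function test_it_discovers_linked_pages(): void
    {
        $urls = Crawler::create('https://example.com')
            ->fake([
                'https://example.com' => '<html><a href="/about">About</a></html>',
                'https://example.com/about' => '<html>About page</html>',
            ])
            ->foundUrls();

        // The fake root page links to /about, so it should be discovered.
        $this->assertContains('https://example.com/about', $urls);
    }
}
```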
Other Highlights
- Throttling: FixedDelayThrottle for a fixed delay between requests, AdaptiveThrottle to back off based on server response times
- retry(): automatic retries on connection errors and 5xx responses
- stream(): opt-in streaming to reduce memory usage on large crawls
- FinishReason enum: start() returns Completed, CrawlLimitReached, TimeLimitReached, or Interrupted
- JavaScript rendering: a JavaScriptRenderer interface with a CloudflareRenderer included and spatie/browsershot as a suggested driver
- And more
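Several of these options compose on a single crawl. A sketch assuming the class and method names listed above; the argument shapes (throttle delay unit, retry count, namespaces) are guesses not confirmed by the post:

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\FinishReason;        // enum named in this post
use Spatie\Crawler\FixedDelayThrottle;  // throttle class named in this post

$reason = Crawler::create('https://example.com')
    ->throttle(new FixedDelayThrottle(500)) // assumed: delay in milliseconds
    ->retry(3)                              // assumed: maximum retry count
    ->stream()                              // opt-in streaming for large crawls
    ->start();

// Per the post, start() returns a FinishReason enum case.
if ($reason === FinishReason::CrawlLimitReached) {
    echo "Stopped early: crawl limit reached\n";
}
```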
You can find the full source at spatie/crawler on GitHub.