spatie/crawler is a PHP package by Freek Van der Herten for crawling websites concurrently using Guzzle promises. It was recently updated to version 9, introducing a new CrawlResponse object, improved scope controls, testing utilities, and more.
Key features include:
- Handling crawl events via closure callbacks and observer classes
- A CrawlResponse object with typed accessors
- Collecting URLs and controlling crawl scope
- Testing with fake()
- And more...
Handling Crawl Events
The crawler supports two approaches for handling crawl events: closure callbacks and observer classes. The closure approach looks like this:
```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;

Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) {
        echo "{$url}: {$response->status()}\n";
    })
    ->start();
```
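The observer-class approach is the second option mentioned above. The post doesn't spell out the observer API, so the base class name, method names, and registration method below are assumptions that simply mirror the closure signatures:

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObserver; // assumed base class name
use Spatie\Crawler\CrawlResponse;

// Hypothetical observer; the method name is assumed to mirror
// the onCrawled() closure signature shown in this post.
class LoggingObserver extends CrawlObserver
{
    public function crawled(string $url, CrawlResponse $response): void
    {
        echo "{$url}: {$response->status()}\n";
    }
}

Crawler::create('https://example.com')
    ->addObserver(new LoggingObserver()) // assumed registration method
    ->start();
```

Observers are handy when the same crawl logic is reused across commands or jobs, since they keep the callbacks in one testable class.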
The onFailed() and onFinished() handlers follow the same pattern for error handling and post-crawl logic. There's also an onWillCrawl() hook that is called before a URL is crawled.
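Assuming those hooks take the same closure shape as onCrawled(), wiring them together might look like this (the exact onFailed() and onWillCrawl() parameters aren't shown in the post, so the signatures here are assumptions):

```php
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->onWillCrawl(function (string $url) { // signature assumed
        echo "About to crawl: {$url}\n";
    })
    ->onFailed(function (string $url, \Throwable $exception) { // signature assumed
        echo "Failed {$url}: {$exception->getMessage()}\n";
    })
    ->onFinished(function () {
        echo "Crawl finished\n";
    })
    ->start();
```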
CrawlResponse
Each crawled URL delivers a CrawlResponse object with typed accessors for common inspection needs:
```php
Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) {
        if ($response->wasRedirected()) {
            echo "Redirected from: " . implode(' → ', $response->redirectHistory()) . "\n";
        }

        $dom = $response->dom(); // Symfony DomCrawler instance
    })
    ->start();
```
The object also exposes body(), header(), and transferStats() for timing data.
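Those accessors could be combined to log response size, content type, and timing. This is a sketch: the header() lookup-by-name and the transferStats() return type are assumptions (Guzzle's TransferStats class does expose getTransferTime()):

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;

Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) {
        $bytes = strlen($response->body());
        $type = $response->header('Content-Type'); // lookup-by-name assumed

        // Assuming transferStats() returns a Guzzle TransferStats instance
        $seconds = $response->transferStats()?->getTransferTime();

        echo "{$url}: {$bytes} bytes, {$type}, {$seconds}s\n";
    })
    ->start();
```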
Collecting URLs and Controlling Scope
With the crawler you can control scope and collect URLs without crawling each link individually. This is useful when you want to crawl a page for links—even filtering by internal links only—and return them without processing:
```php
$urls = Crawler::create('https://example.com')
    ->internalOnly()
    ->depth(3)
    ->foundUrls();
```
Testing with fake()
Spatie consistently ships excellent test helpers with its packages, and the crawler is no different. The fake() method lets you test crawl logic without making real HTTP requests: pass a map of URLs to HTML strings and the crawler uses those as responses:
```php
Crawler::create('https://example.com')
    ->fake([
        'https://example.com' => '<html><a href="/about">About</a></html>',
        'https://example.com/about' => '<html>About page</html>',
    ])
    ->foundUrls();
```
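Because fake() avoids the network entirely, it slots neatly into a unit test. A sketch, assuming a PHPUnit test case and that foundUrls() returns an iterable of URL strings:

```php
use Spatie\Crawler\Crawler;
use PHPUnit\Framework\TestCase;

class CrawlerTest extends TestCase
{
    public function test_it_discovers_linked_pages(): void
    {
        $urls = Crawler::create('https://example.com')
            ->fake([
                'https://example.com' => '<html><a href="/about">About</a></html>',
                'https://example.com/about' => '<html>About page</html>',
            ])
            ->foundUrls();

        // The fake root page links to /about, so it should be discovered.
        $this->assertContains('https://example.com/about', $urls);
    }
}
```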
Other Highlights
- Throttling: FixedDelayThrottle for a fixed delay between requests, AdaptiveThrottle to back off based on server response times
- retry(): automatic retries on connection errors and 5xx responses
- stream(): opt-in streaming to reduce memory usage on large crawls
- FinishReason enum: start() returns Completed, CrawlLimitReached, TimeLimitReached, or Interrupted
- JavaScript rendering: a JavaScriptRenderer interface with a CloudflareRenderer included and spatie/browsershot as a suggested driver
- And more
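Several of these options compose on a single crawl. A sketch assuming the class and method names listed above; the argument shapes (throttle delay unit, retry count, namespaces) are guesses not confirmed by the post:

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\FinishReason;        // enum named in this post
use Spatie\Crawler\FixedDelayThrottle;  // throttle class named in this post

$reason = Crawler::create('https://example.com')
    ->throttle(new FixedDelayThrottle(500)) // assumed: delay in milliseconds
    ->retry(3)                              // assumed: maximum retry count
    ->stream()                              // opt-in streaming for large crawls
    ->start();

// Per the post, start() returns a FinishReason enum case.
if ($reason === FinishReason::CrawlLimitReached) {
    echo "Stopped early: crawl limit reached\n";
}
```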
You can find the full source at spatie/crawler on GitHub.