A PHP Package for Concurrent Website Crawling

spatie/crawler is a PHP package by Freek Van der Herten for crawling websites concurrently using Guzzle promises. It was recently updated to version 9, introducing a new CrawlResponse object, improved scope controls, testing utilities, and more.

Key features include:

  • Handling crawl events via closure callbacks and observer classes
  • CrawlResponse object with typed accessors
  • Collecting URLs and controlling crawl scope
  • Testing with fake()
  • And more...

Handling Crawl Events

The crawler supports two approaches for handling crawl events: closure callbacks and observer classes. The closure approach looks like this:

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;

Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) {
        echo "{$url}: {$response->status()}\n";
    })
    ->start();

The onFailed() and onFinished() handlers follow the same pattern for handling errors and post-crawl logic. There's also onWillCrawl(), which is called before a URL is crawled.
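
As a rough sketch of how these handlers might be combined, the snippet below keeps a running tally and prints it when the crawl ends. The arguments passed to onFailed() and onFinished() are not shown in this article, so the closures here accept whatever they are given and ignore it; the $stats array is purely illustrative.

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;

// Hypothetical tally, shared with the closures by reference.
$stats = ['crawled' => 0, 'failed' => 0];

Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) use (&$stats) {
        $stats['crawled']++;
    })
    // Assumption: the exact handler arguments are not documented here,
    // so a variadic parameter absorbs them without relying on their types.
    ->onFailed(function (...$args) use (&$stats) {
        $stats['failed']++;
    })
    ->onFinished(function (...$args) use (&$stats) {
        echo "Crawled: {$stats['crawled']}, failed: {$stats['failed']}\n";
    })
    ->start();
```

Once you have checked the real signatures in the package documentation, typed parameters would be the better choice over the variadic placeholders.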

CrawlResponse

Each crawled URL delivers a CrawlResponse object with typed accessors for common inspection needs:

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;

Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) {
        // Print the redirect chain when the request was redirected
        if ($response->wasRedirected()) {
            echo "Redirected from: " . implode(' → ', $response->redirectHistory()) . "\n";
        }

        $dom = $response->dom(); // Symfony DomCrawler instance
    })
    ->start();

The object also exposes body(), header(), and transferStats() for timing data.
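
Because dom() returns a Symfony DomCrawler instance, the usual DomCrawler methods are available inside the handler. The following is a minimal sketch that collects page titles for successful responses; the $titles array is a hypothetical name, and the integer cast on status() is defensive because its return type is not stated here.

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;

// Hypothetical map of URL => page title.
$titles = [];

Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) use (&$titles) {
        // Only record pages that came back with a 200 status.
        if ((int) $response->status() !== 200) {
            return;
        }

        // filterXPath() and text() are Symfony DomCrawler methods;
        // text('') falls back to an empty string when no <title> exists.
        $titles[$url] = $response->dom()->filterXPath('//title')->text('');
    })
    ->start();

print_r($titles);
```

filterXPath() is used rather than filter() so the sketch does not depend on the separate symfony/css-selector component.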

Collecting URLs and Controlling Scope

The crawler lets you control scope and collect URLs without processing each page yourself. This is useful when you want to crawl a site for links, optionally limited to internal links only, and have them returned without further processing:

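use Spatie\Crawler\Crawler;

$urls = Crawler::create('https://example.com')
    ->internalOnly() // only follow links on the same site
    ->depth(3)       // stop following links beyond three levels
    ->foundUrls();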

Testing with fake()

Spatie consistently ships excellent test helpers with its packages, and the crawler is no different. The fake() method lets you test crawl logic without making real HTTP requests. Pass a map of URLs to HTML strings and the crawler uses those as responses:

use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->fake([
        'https://example.com' => '<html><a href="/about">About</a></html>',
        'https://example.com/about' => '<html>About page</html>',
    ])
    ->foundUrls();

Other Highlights

  • Throttling: FixedDelayThrottle for a fixed delay between requests, AdaptiveThrottle to back off based on server response times
  • retry(): automatic retries on connection errors and 5xx responses
  • stream(): opt-in streaming to reduce memory usage on large crawls
  • FinishReason enum: start() returns Completed, CrawlLimitReached, TimeLimitReached, or Interrupted
  • JavaScript rendering: a JavaScriptRenderer interface with a CloudflareRenderer included and spatie/browsershot as a suggested driver
  • And more
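
Since start() returns a FinishReason, a match expression is a natural way to branch on the outcome. The sketch below declares a local stand-in enum with the four cases listed above so that it runs without the package installed; in a real project you would import the package's own enum instead (its namespace is not shown in this article). describeFinish() is a hypothetical helper, and the messages are only illustrative.

```php
// Stand-in for the package's FinishReason enum (assumption: same case names).
enum FinishReason
{
    case Completed;
    case CrawlLimitReached;
    case TimeLimitReached;
    case Interrupted;
}

// Hypothetical helper that turns a finish reason into a log-friendly message.
function describeFinish(FinishReason $reason): string
{
    // match is exhaustive here: every enum case has its own arm.
    return match ($reason) {
        FinishReason::Completed => 'completed',
        FinishReason::CrawlLimitReached => 'stopped early: crawl limit reached',
        FinishReason::TimeLimitReached => 'stopped early: time limit reached',
        FinishReason::Interrupted => 'stopped early: interrupted',
    };
}

echo describeFinish(FinishReason::CrawlLimitReached), "\n";
```

With the real package, the value passed to describeFinish() would be the return value of start().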

You can find the full source at spatie/crawler on GitHub.

Paul Redmond

Staff writer at Laravel News. Full stack web developer and author.
