mehrabx / web-crawler
A web crawler package
pkg:composer/mehrabx/web-crawler
Requires
- ext-curl: *
- ext-dom: *
- ext-libxml: *
- guzzlehttp/guzzle: ^7.4
This package is auto-updated.
Last update: 2025-10-10 19:21:48 UTC
README
PHP Web Crawler
This library is a PHP web crawler that takes a collection of URLs and DOM selects, crawls the corresponding web pages, and runs customized analyzers on each page.
Installation
Install this library using Composer:
composer require mehrabx/web-crawler
Usage
The current version uses XPath expressions to select elements.
// Map each URL to the XPath select(s) to run on that page
$urls = [
    'https://test.exp/?page=1' => ["//img[@class='type1']", "//a[@class='type1']"],
    'https://test.exp/?page=2' => ["//img[@class='type2']"],
    'https://test.exp/?page=3' => "//img[@class='type3']", // a single select may be given as a string
];

// Returns an array of results
return \Crawler\Facades\CrawlFacade::make($urls)->start();
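To make the select syntax concrete, here is a minimal sketch of what an XPath expression such as `//img[@class='type1']` matches, using only PHP's built-in `DOMDocument` and `DOMXPath` classes from ext-dom. The `selectAttribute` helper and the sample HTML are hypothetical illustrations, not part of this package, and the sketch makes no claim about how the crawler evaluates selects internally.

```php
<?php
// Hypothetical helper (not part of mehrabx/web-crawler): evaluates an XPath
// expression against an HTML string and collects one attribute per match.
function selectAttribute(string $html, string $expression, string $attribute): array
{
    $dom = new DOMDocument();
    // Real-world HTML is often malformed; buffer libxml warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($expression);
    if ($nodes === false) {
        // query() returns false for a malformed expression, e.g. a missing "]".
        return [];
    }

    $values = [];
    foreach ($nodes as $node) {
        if ($node instanceof DOMElement && $node->hasAttribute($attribute)) {
            $values[] = $node->getAttribute($attribute);
        }
    }
    return $values;
}

// Sample page standing in for a crawled document (assumed content).
$html = <<<HTML
<html><body>
<img class="type1" src="/a.png">
<img class="type2" src="/b.png">
<a class="type1" href="/next">Next</a>
</body></html>
HTML;

$images = selectAttribute($html, "//img[@class='type1']", 'src'); // ['/a.png']
$links  = selectAttribute($html, "//a[@class='type1']", 'href');  // ['/next']
```

Note that `@class='type1'` compares the whole attribute value, so an element with `class="type1 large"` would not match; `contains(@class, 'type1')` is a looser alternative.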
Options
sleep
To avoid being blocked by the target site, you can set a sleep time between crawling each URL:
$urls = [
    'https://test.exp/?page=1' => ["//img[@class='type1']", "//a[@class='type1']"],
    'https://test.exp/?page=2' => ["//img[@class='type2']"],
];

// Set a 10-second sleep time between URLs
return \Crawler\Facades\CrawlFacade::make($urls)->sleep(10)->start();
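The idea of pausing between URLs can be sketched independently of the package. The `crawlWithDelay` function below is hypothetical and only illustrates the general throttling pattern: it pauses before every request except the first. Whether the library pauses before, after, or only between requests is not stated in this README, so treat the placement here as an assumption. The fetch and pause steps are passed in as callables so the sketch runs without network access or real waiting.

```php
<?php
// Hypothetical throttling loop (not the package's implementation).
// $fetch receives a URL and returns whatever the caller wants to collect.
// $pause receives the delay in seconds; it defaults to PHP's built-in sleep().
function crawlWithDelay(array $urls, callable $fetch, int $seconds, ?callable $pause = null): array
{
    $pause = $pause ?? 'sleep';
    $results = [];
    $isFirst = true;

    foreach ($urls as $url) {
        if (!$isFirst) {
            // Wait only between requests, never before the first one.
            $pause($seconds);
        }
        $isFirst = false;
        $results[$url] = $fetch($url);
    }
    return $results;
}

// Usage with stand-ins: a fake fetcher and a pause that records instead of sleeping.
$pauses = [];
$results = crawlWithDelay(
    ['https://test.exp/?page=1', 'https://test.exp/?page=2'],
    function (string $url): string {
        return 'body of ' . $url;
    },
    10,
    function (int $seconds) use (&$pauses): void {
        $pauses[] = $seconds;
    }
);
```

With two URLs the recording pause is called once, so `$pauses` ends up as `[10]`.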
default select
You can set a default select. URLs that have no selects of their own use it:
$urls = [
    'https://test.exp/?page=1', // this URL has no select of its own
    'https://test.exp/?page=2' => ["//img[@class='type2']"],
];

return \Crawler\Facades\CrawlFacade::make($urls)
    ->defaultSelect("//img[@class='type1']")
    ->start();
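The URL list mixes three shapes: a bare URL, a URL mapped to a single select string, and a URL mapped to an array of selects. The following sketch shows one way such an array could be normalized so that every URL maps to a list of selects, with the default filled in where none was given. The `normalizeTargets` function is a hypothetical illustration and is not taken from the package's source.

```php
<?php
// Hypothetical normalizer (not part of mehrabx/web-crawler).
// Returns an array of the form [url => [select, select, ...]].
function normalizeTargets(array $urls, ?string $defaultSelect = null): array
{
    $targets = [];
    foreach ($urls as $key => $value) {
        if (is_int($key)) {
            // Bare URL: PHP assigned it an integer key, so the URL is the value.
            $targets[$value] = $defaultSelect === null ? [] : [$defaultSelect];
        } else {
            // URL with its own select(s): wrap a single string into a one-item list.
            $targets[$key] = (array) $value;
        }
    }
    return $targets;
}

$targets = normalizeTargets(
    [
        'https://test.exp/?page=1',
        'https://test.exp/?page=2' => ["//img[@class='type2']"],
        'https://test.exp/?page=3' => "//img[@class='type3']",
    ],
    "//img[@class='type1']"
);
// $targets['https://test.exp/?page=1'] is ["//img[@class='type1']"]
```

Normalizing up front keeps the rest of a crawl loop simple, since it can always iterate over a list of selects per URL.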