gyaaniguy / pcrawl
PHP web scraping and crawling library, with support for multiple clients, fast parsing, debugging, and on-the-fly changes to various options.
Requires
- php: >=7.4
- ext-curl: *
- ext-json: *
- gravitypdf/querypath: ^3.0
- guzzlehttp/guzzle: ^7.5
Requires (Dev)
- phpunit/phpunit: 7.*
PCrawl
PCrawl is a PHP library for crawling and scraping web pages.
It supports multiple clients (cURL, Guzzle) and provides options to debug, modify, and parse responses.
Features
- Rapidly create custom clients. Fluently change clients and client options, such as the user agent, with method chaining.
- Responses can be modified using reusable callback functions.
- Debug responses using different criteria: HTTP code, regex, etc.
- Parse responses using the QueryPath library. Several convenience functions are provided.
- Fluent API: debuggers, clients, and response-modifier objects can be swapped on the fly!
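As a quick illustration of swapping clients on the fly, here is a minimal sketch built only from calls that appear in the full example below (the URL is a placeholder):

```php
// Sketch: swap clients on the same Request object between fetches.
$req = new Request();

$req->setClient(new GuzzleClient());
$guzzleRes = $req->get('https://example.com'); // fetched via Guzzle

$curlClient = new CurlClient();
$curlClient->setRedirects(1);                  // tweak a client option
$req->setClient($curlClient);
$curlRes = $req->get('https://example.com');   // same request, new client
```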
Full Example
We'll fetch a page that fails, detect the failure using a debugger, and finally change client options to fetch the page correctly.
- Set up some clients
```php
// Simple client.
$gu = new GuzzleClient();

// Custom client that does not allow redirects.
$uptightNoRedirectClient = new CurlClient();
$uptightNoRedirectClient->setRedirects(0); // disable redirects

// Custom client - a thin wrapper around the curl client.
class ConvertToHttpsClient extends CurlClient
{
    public function get(string $url, array $options = []): PResponse
    {
        $url = str_replace('http://', 'https://', $url);
        return parent::get($url, $options);
    }
}
```
- Let's make some debugger objects
```php
$redirectDetector = new ResponseDebug();
$redirectDetector->setMustNotExistHttpCodes([301, 302, 303, 307, 308]);

$fullPageDetector = new ResponseDebug();
$fullPageDetector->setMustExistRegex(['#</html>#']);
```
- Start fetching!
For testing, we will fetch a page with a client that does not follow redirects, then use $redirectDetector to detect a 301. If one is found, we change the client options to follow redirects and fetch again.
```php
$req = new Request();
$url = "http://www.whatsmyua.info";
$req->setClient($uptightNoRedirectClient);
$count = 0;
do {
    $res = $req->get($url);
    $redirectDetector->setResponse($res);
    if ($redirectDetector->isFail()) {
        var_dump($redirectDetector->getFailDetail());
        $uptightNoRedirectClient->setRedirects(1); // enable redirects
        $res = $req->get($url);
    }
} while ($redirectDetector->isFail() && $count++ < 1);
```
- Use the fullPageDetector to check whether the page is complete, then parse the response body using the parser.
```php
if ($fullPageDetector->setResponse($res)->isFail()) {
    var_dump($fullPageDetector->getFailDetail());
} else {
    $parser = new ParserCommon($res->getBody());
    $h1 = $parser->find('h1')->text();
    $htmlClass = $parser->find('html')->attr('class');
}
```
Note: the debuggers, clients, and parsers can all be reused.
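For example, the same debugger instance can be pointed at a fresh response; a short sketch reusing $fullPageDetector and $req from above (the URL is a placeholder):

```php
// Reuse the existing debugger: setResponse() swaps the response
// under inspection without rebuilding the failure criteria.
$res2 = $req->get('https://example.com');
if ($fullPageDetector->setResponse($res2)->isFail()) {
    var_dump($fullPageDetector->getFailDetail());
}
```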
Detailed Usage
Usage can be divided into several parts, starting with installation.
Installation
- Composer:
```sh
composer init                         # for new projects
composer config minimum-stability dev # will be removed once the package is stable
composer require gyaaniguy/pcrawl
composer update
```
Then in PHP:
```php
include __DIR__ . '/vendor/autoload.php';
```
- GitHub:
```sh
git clone git@github.com:gyaaniguy/PCrawl.git # clone the repo
cd PCrawl
composer update                               # install dependencies
mv ../PCrawl /desired/location                # move the directory to the desired location
```
Then in PHP (assuming PCrawl sits next to your script):
```php
require __DIR__ . '/../PCrawl/vendor/autoload.php';
```
TODO list
- Leverage guzzlehttp asynchronous support
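For reference, this is what concurrent fetching looks like with Guzzle's own promise API today; this is plain Guzzle, not a PCrawl feature, and the URLs are placeholders:

```php
use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;

// Plain Guzzle concurrency: fire both requests, then wait for both.
$client = new Client();
$promises = [
    'ua'   => $client->getAsync('https://www.whatsmyua.info'),
    'html' => $client->getAsync('https://example.com'),
];
$responses = Utils::unwrap($promises); // blocks until all complete; throws if any fails
echo $responses['ua']->getStatusCode();
```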
Standards
- PSR-12
- PHPUnit tests