zrashwani / arachnid
A crawler to find all unique internal pages on a given website
Installs: 20 195
Dependents: 0
Suggesters: 0
Security: 0
Stars: 253
Watchers: 21
Forks: 60
Open Issues: 5
Requires
- php: >=7.2.0
- ext-spl: *
- fabpot/goutte: ^4.0
- guzzlehttp/psr7: ^1.4
- psr/log: ^1.1
- symfony/panther: ^1.0
- tightenco/collect: ^v8.34
Requires (Dev)
- dbrekelmans/bdi: ^0.3.0
- monolog/monolog: ^1.22
- phpunit/phpunit: ^7.5|^8.5|^9.2
- squizlabs/php_codesniffer: ^3.5
- symfony/var-dumper: ^v4.4
This package is not auto-updated.
Last update: 2024-10-28 17:02:23 UTC
README
This library will crawl all unique internal links found on a given website up to a specified maximum page depth.
This library uses the symfony/panther and FriendsOfPHP/Goutte libraries to scrape site pages and extract the main SEO-related info, including: title, h1 elements, h2 elements, statusCode, contentType, meta description, meta keyword and canonicalLink.
This library is based on the original blog post by Zeid Rashwani here:
http://zrashwani.com/simple-web-spider-php-goutte
Josh Lockhart adapted the original blog post's code (with permission) for Composer and Packagist and updated the syntax to conform with the PSR-2 coding standard.
How to Install
You can install this library with Composer. Drop this into your composer.json manifest file:
{
    "require": {
        "zrashwani/arachnid": "dev-master"
    }
}
Then run composer install.
Getting Started
Basic Usage:
Here's a quick demo to crawl a website:
<?php
require 'vendor/autoload.php';

$url = 'http://www.example.com';
$linkDepth = 3;

// Initiate crawl, by default it will use http client (GoutteClient)
$crawler = new \Arachnid\Crawler($url, $linkDepth);
$crawler->traverse();

// Get link data
$links = $crawler->getLinksArray(); //to get links as objects use getLinks() method
print_r($links);
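To illustrate what a depth limit such as $linkDepth means, here is a minimal, self-contained sketch of a depth-limited breadth-first traversal over a hypothetical in-memory link map. It is not the library's implementation: the $site array, the crawlToDepth function, and the rule that internal links start with "/" are all assumptions made for illustration, and the library may count depth differently.

```php
<?php
// Hypothetical in-memory "site": each page maps to the links found on it.
// A real crawl would fetch each page over HTTP instead.
$site = [
    '/'            => ['/about', '/blog', 'https://other.example/'],
    '/about'       => ['/', '/team'],
    '/blog'        => ['/blog/post-1', '/about'],
    '/team'        => ['/team/alice'],
    '/blog/post-1' => ['/'],
    '/team/alice'  => [],
];

/**
 * Breadth-first traversal that records each internal page once,
 * together with the depth at which it was first seen.
 */
function crawlToDepth(array $site, string $start, int $maxDepth): array
{
    $seen = [$start => 0];
    $queue = [[$start, 0]];

    while ($queue) {
        [$page, $depth] = array_shift($queue);
        if ($depth >= $maxDepth) {
            continue; // do not follow links beyond the depth limit
        }
        foreach ($site[$page] ?? [] as $link) {
            // Skip external links (assumed: anything not starting with "/")
            // and pages that were already recorded.
            if ($link[0] !== '/' || isset($seen[$link])) {
                continue;
            }
            $seen[$link] = $depth + 1;
            $queue[] = [$link, $depth + 1];
        }
    }

    return $seen;
}

$found = crawlToDepth($site, '/', 2);
print_r($found);
```

With a limit of 2, the page /team/alice is never recorded because it is only reachable at depth 3, and the external link is ignored entirely.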
Enabling Headless Browser mode:
Headless browser mode can be enabled so that the crawler uses the Chrome engine in the background, which is useful for getting the contents of JavaScript-based sites.
The enableHeadlessBrowserMode method sets the scraping adapter to PantherChromeAdapter, which is based on the Symfony Panther library:
$crawler = new \Arachnid\Crawler($url, $linkDepth);
$crawler->enableHeadlessBrowserMode()
    ->traverse()
    ->getLinksArray();
To use this, you need chromedriver installed on your machine. You can use dbrekelmans/browser-driver-installer to install chromedriver locally:
composer require --dev dbrekelmans/bdi
./vendor/bin/bdi driver:chromedriver drivers
Advanced Usage:
Set additional options on the underlying HTTP client by passing an array of options to the constructor, or by creating an HTTP client scraper with the desired options:
<?php
use \Arachnid\Adapters\CrawlingFactory;

//third parameter is the options used to configure http client
$clientOptions = ['auth_basic' => array('username', 'password')];
$crawler = new \Arachnid\Crawler('http://github.com', 2, $clientOptions);

//or by creating and setting scrap client
$options = array(
    'verify_host' => false,
    'verify_peer' => false,
    'timeout' => 30,
);
$scrapperClient = CrawlingFactory::create(CrawlingFactory::TYPE_HTTP_CLIENT, $options);
$crawler->setScrapClient($scrapperClient);
You can inject a PSR-3 compliant logger object to monitor crawler activity (like Monolog):
<?php
$crawler = new \Arachnid\Crawler($url, $linkDepth); // ... initialize crawler

//set logger for crawler activity (compatible with PSR-3)
$logger = new \Monolog\Logger('crawler logger');
$logger->pushHandler(new \Monolog\Handler\StreamHandler(sys_get_temp_dir().'/crawler.log'));
$crawler->setLogger($logger);
?>
You can restrict the crawler to pages that match specific criteria by passing a callback closure to the filterLinks method:
<?php
//filter links according to specific callback as closure
$links = $crawler->filterLinks(function($link) {
        //crawling only links with /blog/ prefix
        return (bool)preg_match('/.*\/blog.*$/u', $link);
    })
    ->traverse()
    ->getLinks();
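The pattern in the example above matches any link that contains "/blog" anywhere, not only links whose path starts with /blog/. The standalone sketch below compares that predicate with a stricter variant on a few sample URLs. The $blogPrefix closure and the sample URLs are hypothetical, and whether the crawler hands the callback absolute URLs or relative paths is an assumption, so check your predicate against the links you actually see.

```php
<?php
// Predicate copied from the filterLinks example above.
$containsBlog = function ($link) {
    return (bool) preg_match('/.*\/blog.*$/u', $link);
};

// Stricter variant (hypothetical, not part of the library):
// the URL path itself must start with /blog/.
$blogPrefix = function ($link) {
    $path = parse_url($link, PHP_URL_PATH);
    return is_string($path) && strncmp($path, '/blog/', 6) === 0;
};

// Sample URLs, made up for illustration.
$samples = [
    'http://www.example.com/blog/post-1',
    'http://www.example.com/about',
    'http://www.example.com/news/blogroll',
];

foreach ($samples as $link) {
    printf(
        "%-40s contains=%s prefix=%s\n",
        $link,
        var_export($containsBlog($link), true),
        var_export($blogPrefix($link), true)
    );
}
```

Either closure can be passed to filterLinks in place of the inline one. The stricter variant is worth considering when paths such as /news/blogroll should not be crawled.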
You can use the LinksCollection class to get simple statistics about the links, as follows:
<?php
// Assumption: LinksCollection lives in the same Arachnid namespace as Crawler
use \Arachnid\LinksCollection;

$links = $crawler->traverse()
    ->getLinks();
$collection = new LinksCollection($links);

//getting broken links
$brokenLinks = $collection->getBrokenLinks();

//getting links for specific depth
$depth2Links = $collection->getByDepth(2);

//getting external links inside site
$externalLinks = $collection->getExternalLinks();
How to Contribute
- Fork this repository
- Create a new branch for each feature or improvement
- Apply your code changes along with corresponding unit tests
- Send a pull request from each feature branch
It is very important to separate new features or improvements into separate feature branches, and to send a pull request for each branch. This allows me to review and pull in new features or improvements individually.
All pull requests must adhere to the PSR-2 standard.
System Requirements
- PHP 7.2.0+
Authors
- Josh Lockhart https://github.com/codeguy
- Zeid Rashwani http://zrashwani.com
License
MIT Public License