diggin / diggin-robotrules
parser/handler for Robots Exclusion Protocol (robots.txt and more)
0.10.0
2016-02-26 14:56 UTC
Requires
- php: >=5.3.4
Requires (Dev)
- phpunit/phpunit: ~4.8
- satooshi/php-coveralls: ~0.6
- zendframework/zend-uri: ~2.3
This package is auto-updated.
Last update: 2024-12-06 16:11:39 UTC
README
PHP parser/handler for Robots Exclusion Protocol (robots.txt and more..)
Features
- implements http://www.robotstxt.org/norobots-rfc.txt
  - [DONE] "3.2.2 The Allow and Disallow lines" - as test-case
  - [DONE] "4.Examples" as test-case
- passing Nutch's test code ref
  - [DONE] @see tests/Diggin/RobotRules/Imported/NutchTest.php
- parsing & handling html-meta
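As a rough illustration of the "3.2.2 The Allow and Disallow lines" rule from the draft linked above (lines are matched in order as path prefixes, the first match wins, and a path with no match is allowed), here is a minimal standalone sketch. The function `isPathAllowed` and the sample record are hypothetical and are not part of the Diggin\RobotRules API; the sketch also skips the draft's %xx decoding and treats an empty value as matching nothing.

```php
<?php
/**
 * Hypothetical helper, NOT part of Diggin\RobotRules.
 *
 * @param array  $lines ordered list of array('allow'|'disallow', pathPrefix)
 * @param string $path  URL path to check
 * @return bool
 */
function isPathAllowed(array $lines, $path)
{
    foreach ($lines as $line) {
        list($type, $prefix) = $line;
        if ($prefix === '') {
            continue; // assumption of this sketch: an empty value matches nothing
        }
        if (strpos($path, $prefix) === 0) {
            // First matching line decides the result.
            return $type === 'allow';
        }
    }
    // No line matched: the default is "allowed".
    return true;
}

// Illustrative record; order matters because the first match is used.
$record = array(
    array('allow', '/org/plans.html'),
    array('disallow', '/org/'),
);

var_dump(isPathAllowed($record, '/org/plans.html'));
var_dump(isPathAllowed($record, '/org/about.html'));
var_dump(isPathAllowed($record, '/index.html'));
```

Swapping the two lines of `$record` would make `/org/plans.html` disallowed, since `/org/` would then match first.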
ToDos
- handle Crawl-Delay
- sync with, or test a few patterns against, Google's robots.txt testing tool
- rewrite with PHPPEG (the current preg_*-based parser is difficult to work with)
- more tests and ongoing refactoring
USAGE
```php
<?php
use Diggin\RobotRules\Accepter\TxtAccepter;
use Diggin\RobotRules\Parser\TxtStringParser;

$robotstxt = <<<'ROBOTS'
# sample robots.txt
User-agent: YourCrawlerName
Disallow:

User-agent: *
Disallow: /aaa/ #comment
ROBOTS;

$accepter = new TxtAccepter;
$accepter->setRules(TxtStringParser::parse($robotstxt));

$accepter->setUserAgent('foo');
var_dump($accepter->isAllow('/aaa/')); //false
var_dump($accepter->isAllow('/b.html')); //true

$accepter->setUserAgent('YourCrawlerName');
var_dump($accepter->isAllow('/aaa/')); // true
```
INSTALL
Diggin_RobotRules follows PSR-0, so register the Diggin\RobotRules namespace with your class loader.
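If you are not using a class loader that already understands PSR-0, registering the namespace by hand could look roughly like the sketch below. The helper `psr0RelativePath` is a hypothetical name, and the `src` base directory is an assumption about where the library sources were unpacked; adjust both to your setup.

```php
<?php
/**
 * Hypothetical helper: maps a fully qualified class name to its
 * PSR-0 relative file path.
 */
function psr0RelativePath($class)
{
    $class = ltrim($class, '\\');
    $path = '';
    $pos = strrpos($class, '\\');
    if ($pos !== false) {
        // Namespace separators become directory separators.
        $path = str_replace('\\', '/', substr($class, 0, $pos)) . '/';
        $class = substr($class, $pos + 1);
    }
    // PSR-0: underscores in the class name (not the namespace) also map to directories.
    return $path . str_replace('_', '/', $class) . '.php';
}

// Assumption: the library sources live under ./src
$baseDir = __DIR__ . '/src';

spl_autoload_register(function ($class) use ($baseDir) {
    // Only handle classes in the Diggin\RobotRules namespace.
    if (strpos(ltrim($class, '\\'), 'Diggin\\RobotRules\\') !== 0) {
        return;
    }
    $file = $baseDir . '/' . psr0RelativePath($class);
    if (is_file($file)) {
        require $file;
    }
});
```

With this in place, `Diggin\RobotRules\Accepter\TxtAccepter` would be looked up at `src/Diggin/RobotRules/Accepter/TxtAccepter.php`.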
To install via Composer:
- $ php composer.phar require diggin/diggin-robotrules "dev-master"
License
Diggin_RobotRules is licensed under the New BSD License.
References & alternatives in other languages
- Perl
- Python
- Ruby