vipnytt / robotstxtparser
Robots.txt parsing library, with full support for every directive and specification.
Installs: 609 077
Dependents: 6
Suggesters: 1
Security: 0
Stars: 26
Watchers: 3
Forks: 6
Open Issues: 1
Requires
- php: ^7.3 || ^8.0
- ext-curl: *
- ext-mbstring: *
- composer/ca-bundle: ^1.0
- vipnytt/useragentparser: ^1.0
Requires (Dev)
- phpunit/phpunit: ^9.0
README
Robots.txt parser
An easy-to-use, extensible robots.txt parser library with full support for literally every directive and specification on the Internet.
Use cases:
- Permission checks
- Fetch crawler rules
- Sitemap discovery
- Host preference
- Dynamic URL parameter discovery
- robots.txt rendering
Advantages
(compared to most other robots.txt libraries)
- Automatic robots.txt download. (optional; see the sketch after this list)
- Integrated caching system. (optional)
- Crawl-delay handler.
- Documentation available.
- Support for literally every single directive, from every specification.
- HTTP status code handler, according to Google's spec.
- Dedicated User-Agent parser and group determiner library, for maximum accuracy.
- Provides additional data like preferred host, dynamic URL parameters, Sitemap locations, etc.
- Protocols supported: HTTP, HTTPS, FTP, SFTP and FTP/S.
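As a rough illustration of the automatic-download point, here is a minimal sketch contrasting the two client classes used later in this README: UriClient fetches and parses the remote robots.txt itself, while TxtClient parses content you have already downloaded, together with the HTTP status code you received. The host and the rule content below are made up for this sketch.

<?php

// Automatic download: fetches http://example.com/robots.txt itself.
$remote = new vipnytt\RobotsTxtParser\UriClient('http://example.com');

// Manual mode: you supply the base URI, the HTTP status code and the raw content.
$content = "User-agent: *\nDisallow: /admin";
$local = new vipnytt\RobotsTxtParser\TxtClient('http://example.com', 200, $content);

var_dump($local->userAgent('MyBot')->isAllowed('http://example.com/somepage.html')); // bool(true)
var_dump($local->userAgent('MyBot')->isDisallowed('http://example.com/admin'));      // bool(true)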
Requirements:
- PHP ^7.3 || ^8.0, with the curl and mbstring extensions (see the Requires list above; Composer installs the remaining dependencies automatically).
Installation
The recommended way to install the robots.txt parser is through Composer. Add this to your composer.json file:

{
    "require": {
        "vipnytt/robotstxtparser": "^2.1"
    }
}

Then run: composer update
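Alternatively, assuming Composer is installed and available on your PATH, the same dependency can be added in a single step:

composer require "vipnytt/robotstxtparser:^2.1"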
Getting started
Basic usage example
<?php

$client = new vipnytt\RobotsTxtParser\UriClient('http://example.com');

if ($client->userAgent('MyBot')->isAllowed('http://example.com/somepage.html')) {
    // Access is granted
}

if ($client->userAgent('MyBot')->isDisallowed('http://example.com/admin')) {
    // Access is denied
}
A small excerpt of basic methods
<?php

// Syntax: $baseUri, [$statusCode:int|null], [$robotsTxtContent:string], [$encoding:string], [$byteLimit:int|null]
$client = new vipnytt\RobotsTxtParser\TxtClient('http://example.com', 200, $robotsTxtContent);

// Permission checks
$allowed = $client->userAgent('MyBot')->isAllowed('http://example.com/somepage.html'); // bool
$denied = $client->userAgent('MyBot')->isDisallowed('http://example.com/admin'); // bool

// Crawl delay rules
$crawlDelay = $client->userAgent('MyBot')->crawlDelay()->getValue(); // float | int

// Dynamic URL parameters
$cleanParam = $client->cleanParam()->export(); // array

// Preferred host
$host = $client->host()->export(); // string | null
$host = $client->host()->getWithUriFallback(); // string
$host = $client->host()->isPreferred(); // bool

// XML Sitemap locations
$sitemaps = $client->sitemap()->export(); // array
The above is just a taste of the basics; a whole range of more advanced and specialized methods is available for almost any purpose. Visit the cheat-sheet for the technical details.
Visit the Documentation for more information.
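To show how these pieces fit together, here is a minimal, hypothetical fetch loop that respects both the permission rules and the crawl-delay value. The URL queue and the sleep handling are illustrative only; the library's own crawl-delay handler (see the documentation) offers more than this.

<?php

$client = new vipnytt\RobotsTxtParser\UriClient('http://example.com');
$bot = $client->userAgent('MyBot');

// Crawl-delay value from robots.txt, as float|int (see the excerpt above).
$delay = $bot->crawlDelay()->getValue();

// Hypothetical crawl queue, for illustration only.
$queue = [
    'http://example.com/somepage.html',
    'http://example.com/admin',
];

foreach ($queue as $url) {
    if ($bot->isDisallowed($url)) {
        continue; // Skip anything robots.txt forbids for MyBot.
    }
    // ... fetch $url with your own HTTP client here ...
    usleep((int) ($delay * 1000000)); // Wait between requests, as requested by robots.txt.
}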
Directives
Specifications
- Google robots.txt specifications
- Yandex robots.txt specifications
- W3C Recommendation HTML 4.01 specification
- Sitemaps.org protocol
- Sean Conner: "An Extended Standard for Robot Exclusion"
- Martijn Koster: "A Method for Web Robots Control"
- Martijn Koster: "A Standard for Robot Exclusion"
- RFC 7231, 2616
- RFC 7230, 2616
- RFC 5322, 2822, 822
- RFC 3986, 1808
- RFC 1945
- RFC 1738
- RFC 952