baraja-core / webcrawler
Simple package to load a list of URLs and build a sitemap.
Requires
- php: ^8.0
- ext-curl: *
- nette/http: ^3.0
- nette/utils: ^4.0
Requires (Dev)
- phpstan/extension-installer: ^1.1
- phpstan/phpstan: ^1.0
- phpstan/phpstan-deprecation-rules: ^1.0
- phpstan/phpstan-nette: ^1.0
- phpstan/phpstan-strict-rules: ^1.0
- roave/security-advisories: dev-master
- spaze/phpstan-disallowed-calls: ^2.0
This package is auto-updated.
Last update: 2024-10-09 20:58:51 UTC
Web crawler
A simple library for crawling websites by following links, with minimal dependencies.
📦 Installation
It's best to use Composer for installation, and you can also find the package on Packagist and GitHub.
To install, simply use the command:
$ composer require baraja-core/webcrawler
You can use the package manually by creating an instance of the internal classes, or register a DIC extension to link the services directly to the Nette Framework.
How to use
The crawler can run without any dependencies.
With the default settings, create an instance and call the crawl() method:
    $crawler = new \Baraja\WebCrawler\Crawler;
    $result = $crawler->crawl('https://example.com');
The $result variable will contain an entity of type CrawledResult.
Advanced checking of multiple URLs
In a real-world case, you often need to download multiple URLs within a single domain and check whether specific URLs work.
Simple example:
    $crawler = new \Baraja\WebCrawler\Crawler;
    $result = $crawler->crawlList(
        'https://example.com', // Starting (main) URL
        [ // Additional URLs
            'https://example.com/error-404',
            '/robots.txt', // Relative links are also allowed
            '/web.config',
        ]
    );
Note: The robots.txt file and the sitemap are downloaded automatically if they exist.
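Relative entries such as '/robots.txt' have to be interpreted against the starting URL. The sketch below only illustrates that idea and is not taken from the package: resolveAgainstStartUrl() is a hypothetical helper, it handles only root-relative paths, and the way the crawler resolves links internally may differ.

```php
<?php

/**
 * Hypothetical helper (NOT part of baraja-core/webcrawler).
 * Resolves a root-relative path such as "/robots.txt" against
 * the scheme, host and optional port of the starting URL.
 */
function resolveAgainstStartUrl(string $startUrl, string $link): string
{
    // Absolute http(s) links are returned unchanged.
    if (preg_match('~^https?://~i', $link) === 1) {
        return $link;
    }

    $parts = parse_url($startUrl);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException('Starting URL must be absolute: ' . $startUrl);
    }

    // Rebuild the origin; any path of the starting URL is ignored.
    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Simplification: every non-absolute link is treated as root-relative.
    return $origin . '/' . ltrim($link, '/');
}
```

With the starting URL 'https://example.com', the entry '/web.config' would resolve to 'https://example.com/web.config' under these assumptions, while 'https://example.com/error-404' is passed through as-is.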
Settings
In the constructor of the Crawler service, you can define your project-specific configuration, for example:
    $crawler = new \Baraja\WebCrawler\Crawler(
        new \Baraja\WebCrawler\Config([
            // key => value
        ])
    );
No value is required. Pass the configuration as a key-value array.
Configuration options:
📄 License
baraja-core/webcrawler is licensed under the MIT license. See the LICENSE file for more details.