crwlr / crawler
Web crawling and scraping library.
Fund package maintenance: otsch
Installs: 6 143
Dependents: 2
Suggesters: 0
Security: 0
Stars: 334
Watchers: 4
Forks: 12
Open Issues: 2
Requires
- php: ^8.1
- ext-dom: *
- adbario/php-dot-notation: ^3.1
- chrome-php/chrome: ^1.7
- crwlr/html-2-text: ^0.1.0
- crwlr/robots-txt: ^1.1
- crwlr/schema-org: ^0.2|^0.3
- crwlr/url: ^2.1
- crwlr/utils: ^1.1
- guzzlehttp/guzzle: ^7.4
- psr/log: ^2.0|^3.0
- psr/simple-cache: ^1.0|^2.0|^3.0
- symfony/css-selector: ^6.0|^7.0
- symfony/dom-crawler: ^6.0|^7.0
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.6
- mockery/mockery: ^1.5
- pestphp/pest: ^2.3|^3.0
- phpstan/extension-installer: ^1.1
- phpstan/phpstan: ^1.4
- phpstan/phpstan-mockery: ^1.0
- phpstan/phpstan-phpunit: ^1.0
- spatie/invade: ^2.0
- symfony/process: ^6.0|^7.0
Suggests
- ext-zlib: Needed to uncompress compressed responses
This package is auto-updated.
Last update: 2024-11-06 22:04:09 UTC
README
Library for Rapid (Web) Crawler and Scraper Development
This library provides a framework and many ready-to-use, so-called steps that you can combine as building blocks to create your own crawlers and scrapers.
To give you an overview, here's a list of things it helps you with (short code sketches follow the list):
- Crawler Politeness 😇 (respecting robots.txt, throttling,...)
- Load URLs using
  - a (PSR-18) HTTP client (default is of course Guzzle)
  - or a headless browser (Chrome) to get the page source after JavaScript execution
- Get absolute links from HTML documents 🔗
- Get sitemaps from robots.txt and get all URLs from those sitemaps
- Crawl (load) all pages of a website 🕷
- Use cookies (or don't) 🍪
- Use any HTTP method (GET, POST, ...) and send any headers or body
- Easily iterate over paginated list pages 🔁
- Extract data from:
  - HTML (and XML) using CSS selectors or XPath queries
  - JSON using dot notation
  - CSV by mapping columns
- Extract schema.org structured data in JSON-LD format from HTML documents
- Keep memory usage low by using PHP Generators 💪
- Cache HTTP responses during development, so you don't have to load pages again and again after every code change
- Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)
- And a lot more...
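
For example, a minimal crawler that loads a listing page, follows the article links, and extracts data from each article could look roughly like this (a sketch; the URL and CSS selectors are placeholders you'd adapt to the target site):

```php
<?php

require 'vendor/autoload.php';

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// A bot user agent makes the crawler identify itself and respect robots.txt.
$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');

$crawler->input('https://www.example.com/articles');

$crawler
    ->addStep(Http::get())                       // load the listing page
    ->addStep(Html::getLinks('#list a.article')) // get absolute article links
    ->addStep(Http::get())                       // load each article page
    ->addStep(
        Html::first('article')->extract([
            'title' => 'h1',
            'date'  => '.published-date',
        ])
    );

foreach ($crawler->run() as $result) {
    var_dump($result->toArray());
}
```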
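Discovering URLs via sitemaps works the same way, by chaining steps from the Sitemap class. A sketch, assuming the site lists its sitemaps in its robots.txt:

```php
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Sitemap;

$crawler->input('https://www.example.com');

$crawler
    ->addStep(Sitemap::getSitemapsFromRobotsTxt()) // sitemap URLs from robots.txt
    ->addStep(Http::get())                         // load each sitemap
    ->addStep(Sitemap::getUrlsFromSitemap());      // yield all URLs they contain
```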
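Paginated list pages can be iterated with the paginate() method on HTTP steps. Another sketch; the selector pointing to the pagination links is again a placeholder:

```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/listing');

$crawler
    ->addStep(Http::get()->paginate('#pagination')) // load page 1, then follow pagination links
    ->addStep(Html::each('#list .item')->extract(['title' => 'h3']));
```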
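Response caching and headless Chrome rendering are both configured on the crawler's loader. A sketch, assuming a writable ./cache directory next to the script:

```php
use Crwlr\Crawler\Cache\FileCache;

// Cache responses on disk, so repeated runs during development
// don't hit the target site again after every code change.
$crawler->getLoader()->setCache(new FileCache(__DIR__ . '/cache'));

// Load pages via headless Chrome to get the source after JavaScript execution.
$crawler->getLoader()->useHeadlessBrowser();
```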
Documentation
You can find the documentation at crwlr.software.
Contributing
If you'd like to contribute something to this package, please read the contribution guide (CONTRIBUTING.md).