crwlr / crawler
Web crawling and scraping library.
Fund package maintenance: otsch
Installs: 6 143
Dependents: 2
Suggesters: 0
Security: 0
Stars: 334
Watchers: 4
Forks: 12
Open Issues: 2
Requires
- php: ^8.1
- ext-dom: *
- adbario/php-dot-notation: ^3.1
- chrome-php/chrome: ^1.7
- crwlr/html-2-text: ^0.1.0
- crwlr/robots-txt: ^1.1
- crwlr/schema-org: ^0.2|^0.3
- crwlr/url: ^2.1
- crwlr/utils: ^1.1
- guzzlehttp/guzzle: ^7.4
- psr/log: ^2.0|^3.0
- psr/simple-cache: ^1.0|^2.0|^3.0
- symfony/css-selector: ^6.0|^7.0
- symfony/dom-crawler: ^6.0|^7.0
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.6
- mockery/mockery: ^1.5
- pestphp/pest: ^2.3|^3.0
- phpstan/extension-installer: ^1.1
- phpstan/phpstan: ^1.4
- phpstan/phpstan-mockery: ^1.0
- phpstan/phpstan-phpunit: ^1.0
- spatie/invade: ^2.0
- symfony/process: ^6.0|^7.0
Suggests
- ext-zlib: Needed to uncompress compressed responses
This package is auto-updated.
Last update: 2024-11-06 22:04:09 UTC
README
Library for Rapid (Web) Crawler and Scraper Development
This library provides a framework and many ready-to-use, so-called steps that you can combine as building blocks to create your own crawlers and scrapers.
To give you an overview, here's a list of things it helps you with (short code sketches follow the list):
- Crawler Politeness 😇 (respecting robots.txt, throttling,...)
- Load URLs using
  - a (PSR-18) HTTP client (default is of course Guzzle)
  - or a headless browser (Chrome) to get the page source after JavaScript execution
- Get absolute links from HTML documents 🔗
- Get sitemaps from robots.txt and get all URLs from those sitemaps
- Crawl (load) all pages of a website 🕷
- Use cookies (or don't) 🍪
- Use any HTTP method (GET, POST, ...) and send any headers or body
- Easily iterate over paginated list pages 🔁
- Extract data from:
  - HTML (and XML) using CSS selectors or XPath queries
  - JSON using dot notation
  - CSV by mapping columns
- Extract schema.org structured data in JSON-LD format from HTML documents
- Keep memory usage low by using PHP Generators 💪
- Cache HTTP responses during development, so you don't have to load pages again and again after every code change
- Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)
- And a lot more...
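
For example, a minimal crawler that loads a listing page, follows the article links, and extracts data from each article could look roughly like this (a sketch; the URL and CSS selectors are placeholders you'd adapt to the target site):

```php
<?php

require 'vendor/autoload.php';

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// A bot user agent makes the crawler identify itself and respect robots.txt.
$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');

$crawler->input('https://www.example.com/articles');

$crawler
    ->addStep(Http::get())                       // load the listing page
    ->addStep(Html::getLinks('#list a.article')) // get absolute article links
    ->addStep(Http::get())                       // load each article page
    ->addStep(
        Html::first('article')->extract([
            'title' => 'h1',
            'date'  => '.published-date',
        ])
    );

foreach ($crawler->run() as $result) {
    var_dump($result->toArray());
}
```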
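Discovering URLs via sitemaps works the same way, by chaining steps from the Sitemap class. A sketch, assuming the site lists its sitemaps in its robots.txt:

```php
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Sitemap;

$crawler->input('https://www.example.com');

$crawler
    ->addStep(Sitemap::getSitemapsFromRobotsTxt()) // sitemap URLs from robots.txt
    ->addStep(Http::get())                         // load each sitemap
    ->addStep(Sitemap::getUrlsFromSitemap());      // yield all URLs they contain
```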
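Paginated list pages can be iterated with the paginate() method on HTTP steps. Another sketch; the selector pointing to the pagination links is again a placeholder:

```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/listing');

$crawler
    ->addStep(Http::get()->paginate('#pagination')) // load page 1, then follow pagination links
    ->addStep(Html::each('#list .item')->extract(['title' => 'h3']));
```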
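Response caching and headless Chrome rendering are both configured on the crawler's loader. A sketch, assuming a writable ./cache directory next to the script:

```php
use Crwlr\Crawler\Cache\FileCache;

// Cache responses on disk, so repeated runs during development
// don't hit the target site again after every code change.
$crawler->getLoader()->setCache(new FileCache(__DIR__ . '/cache'));

// Load pages via headless Chrome to get the source after JavaScript execution.
$crawler->getLoader()->useHeadlessBrowser();
```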
Documentation
You can find the documentation at crwlr.software.
Contributing
If you'd like to contribute something to this package, please read the contribution guide (CONTRIBUTING.md).