nelexa / roach-php-bundle
Symfony bundle for roach-php/core
Type: symfony-bundle
Requires
- php: >= 8.0
- roach-php/core: ~1.1.0
- symfony/config: ^6.0
- symfony/console: ^6.0
- symfony/dependency-injection: ^6.0
- symfony/http-kernel: ^6.0
- symfony/serializer: ^6.0
Requires (Dev)
- psalm/plugin-phpunit: ^0.16.1
- psalm/plugin-symfony: ^3.1
- roave/security-advisories: dev-latest
- symfony/framework-bundle: ^6.0
- symfony/maker-bundle: ^1.37
- symfony/phpunit-bridge: ^6.0
- symfony/var-dumper: ^6.0
- vimeo/psalm: ^4.21
Suggests
- spatie/browsershot: Required to execute Javascript in spiders
README
roach-php-bundle
Symfony bundle for Roach PHP.
Roach is a complete web scraping toolkit for PHP. It is heavily inspired by the popular Scrapy package for Python.
The Symfony bundle mostly provides the necessary container bindings for the various services Roach uses, as well as making certain configuration options available via a config file. To learn about how to actually start using Roach itself, check out the rest of the documentation.
Installing the Symfony bundle
Add nelexa/roach-php-bundle to your composer.json file:
composer require nelexa/roach-php-bundle
Versions & Dependencies
Register the bundle
Add the bundle to config/bundles.php (Symfony Flex does this automatically):
return [
    // ...
    \Nelexa\RoachPhpBundle\RoachPhpBundle::class => ['all' => true],
];
Available Commands
The Roach Symfony bundle registers a few console commands to make our development experience as pleasant as possible.
Run spider
php bin/console roach:run
After that, you will get the entire list of available spiders.
Choose a spider class:
[0] App\Spider\GoogleSpider
[1] App\Spider\FacebookSpider
[2] App\Spider\TwitterSpider
Simply select the desired spider (▼ or ▲) or enter its number and press Enter.
You can also pass the spider class name, or one of its aliases, as the first argument.
For example, if you have a class App\Spider\GoogleSpider, you can use any of the following aliases: GoogleSpider, google_spider or google.
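The alias pattern in the example above (short class name, snake_case name, and the snake_case name without the "spider" suffix) can be sketched in a few lines of PHP. The function name spiderAliases is hypothetical and this is not the bundle's actual implementation; it only reproduces the three aliases listed for App\Spider\GoogleSpider.

```php
<?php

/**
 * Hypothetical helper, for illustration only: derive the aliases
 * described above from a fully qualified spider class name.
 *
 * @return string[]
 */
function spiderAliases(string $fqcn): array
{
    // "App\Spider\GoogleSpider" -> "GoogleSpider"
    $pos = strrpos($fqcn, '\\');
    $short = $pos === false ? $fqcn : substr($fqcn, $pos + 1);

    // "GoogleSpider" -> "google_spider"
    $snake = strtolower((string) preg_replace('/(?<!^)[A-Z]/', '_$0', $short));

    // "google_spider" -> "google"
    $base = (string) preg_replace('/_spider$/', '', $snake);

    // Drop duplicates (e.g. when the class name has no "Spider" suffix).
    return array_values(array_unique([$short, $snake, $base]));
}

print_r(spiderAliases('App\Spider\GoogleSpider'));
```

Under this assumed rule, a class without the "Spider" suffix would simply yield two aliases instead of three.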
php bin/console roach:run google
Sometimes it is useful to override the number of concurrent requests and the delay between requests. To do this, pass the --concurrency and --delay options.
php bin/console roach:run google --concurrency 8 --delay 2
These options override the $concurrency and $requestDelay public properties of your spider.
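For context, here is a minimal sketch of a spider that declares these two properties, assuming the BasicSpider base class and Response type from roach-php/core. The start URL, the property values and the extracted field are placeholders chosen for illustration, and the snippet needs roach-php/core installed to run.

```php
<?php

namespace App\Spider;

use Generator;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

final class GoogleSpider extends BasicSpider
{
    /** Placeholder URL, replace with the pages you want to crawl. */
    public array $startUrls = ['https://example.com'];

    /** Number of concurrent requests; --concurrency overrides this. */
    public int $concurrency = 2;

    /** Delay in seconds between requests; --delay overrides this. */
    public int $requestDelay = 1;

    public function parse(Response $response): Generator
    {
        // Emit one item per page containing the document title.
        yield $this->item([
            'title' => $response->filter('title')->text(),
        ]);
    }
}
```

With this class in place, running the command with --concurrency 8 --delay 2 would use 8 and 2 instead of the values declared on the class.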
Add the --output (-o) option to save the collected data to a JSON file.
php bin/console roach:run google --output 'path/to/data.json'
Starting the REPL
Roach ships with an interactive shell (often called a Read-Evaluate-Print-Loop, or REPL for short) which makes prototyping our spiders a breeze. We can use the provided roach:shell command to launch a new REPL session.
php bin/console roach:shell "https://roach-php.dev/docs/introduction"
Generator classes
First, install Symfony MakerBundle.
composer require --dev symfony/maker-bundle
Create a new roach spider class
php bin/console make:roach:spider
Create a new roach extension class
php bin/console make:roach:extension
Create a new roach item processor class
php bin/console make:roach:item:processor
Create a new roach downloader request middleware class
php bin/console make:roach:middleware:downloader:request
Create a new roach downloader response middleware class
php bin/console make:roach:middleware:downloader:response
Create a new roach spider item middleware class
php bin/console make:roach:middleware:spider:item
Create a new roach spider request middleware class
php bin/console make:roach:middleware:spider:request
Create a new roach spider response middleware class
php bin/console make:roach:middleware:spider:response
Credits
Changelog
Changes are documented in the releases page.
License
The MIT License (MIT). Please see LICENSE for more information.