simgroep / concurrent-spider-bundle
Symfony bundle for running a distributed web page crawler
Installs: 177
Dependents: 0
Suggesters: 0
Security: 0
Stars: 5
Watchers: 12
Forks: 8
Open Issues: 0
Type: symfony-bundle
Requires
- php: >=5.4.0
- nelmio/solarium-bundle: ^2.3
- phpoffice/phpword: ^0.13,>=0.13.1
- predis/predis: ^1.1
- snc/redis-bundle: ^2.0.1
- symfony/process: ^3.1
- symfony/symfony: ^2.7||^3.0
- vdb/php-spider: ^0.2
- videlalvaro/php-amqplib: ~2
Requires (Dev)
- phpunit/phpunit: ~4
- satooshi/php-coveralls: ~0.6
README
This bundle provides a set of commands to run a distributed web page crawler. Crawled web pages are saved to Solr.
Installation
Install it with Composer:
composer require simgroep/concurrent-spider-bundle dev-master
Then add it to your AppKernel.php:
new Simgroep\ConcurrentSpiderBundle\SimgroepConcurrentSpiderBundle(),
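For context, a minimal sketch of where that line goes inside registerBundles() in app/AppKernel.php (the other bundles shown are only illustrative):

public function registerBundles()
{
    $bundles = array(
        new Symfony\Bundle\FrameworkBundle\FrameworkBundle(),
        // ... your other bundles ...
        new Simgroep\ConcurrentSpiderBundle\SimgroepConcurrentSpiderBundle(),
    );

    return $bundles;
}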
You also need to install Xpdf (http://www.foolabs.com/xpdf/); only the pdftotext utility has to be functional from the command line:
/path_to_command/pdftotext pdffile.pdf
Configuration
Only minimal configuration is necessary. The crawler needs to know the mapping you're using in Solr so it can save documents. The only mandatory part of the config is "mapping"; the other values are optional:
simgroep_concurrent_spider:
    http_user_agent: "PHP Concurrent Spider"
    rabbitmq.host: localhost
    rabbitmq.port: 5672
    rabbitmq.user: guest
    rabbitmq.password: guest
    queue.discoveredurls_queue: discovered_urls
    queue.indexer_queue: indexer
    solr.host: localhost
    solr.port: 8080
    solr.path: /solr
    mapping:
        id: #required
        title: #required
        content: #required
        url: #required
        tstamp: ~
        date: ~
        publishedDate: ~
How does it work?
You start the crawler with:
app/console simgroep:start-crawler https://github.com
This adds one job to the queue to crawl the URL https://github.com. Then run the following process in the background to start crawling:
app/console simgroep:crawl
It's recommended to use a tool to keep the crawler process running in the background; we recommend Supervisord. You can run as many threads as you like (and as your machine can handle), but be careful not to flood the website: every thread acts as a visitor on the website you're crawling.
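A minimal Supervisord program entry could look like the sketch below; the program name, PHP and project paths, log locations, and process count are assumptions for illustration:

[program:spider-crawl]
; run several simgroep:crawl workers and restart them if they exit
command=/usr/bin/php /var/www/app/console simgroep:crawl
numprocs=4
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
stdout_logfile=/var/log/spider-crawl.out.log
stderr_logfile=/var/log/spider-crawl.err.log

Each process consumes jobs from the queue independently, so numprocs effectively controls how many concurrent visitors hit the site being crawled.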
Architecture
This bundle uses RabbitMQ to maintain a queue of URLs that should be indexed, and Solr to store the crawled web pages.