dmoraschi / sitemap-common
Sitemap generator and crawler library
Installs: 105
Dependents: 0
Suggesters: 0
Security: 0
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
pkg:composer/dmoraschi/sitemap-common
Requires
- php: >=5.6
- guzzlehttp/guzzle: ~6.0
Requires (Dev)
- mockery/mockery: @stable
- phpunit/phpunit: 4.*@stable
- satooshi/php-coveralls: dev-master
This package is not auto-updated.
Last update: 2025-10-16 00:37:21 UTC
README
This package provides all of the components needed to crawl a website and to build and write sitemap files.
An example console application using the library: dmoraschi/sitemap-app
Installation
Run the following command, providing the latest stable version (e.g. v1.0.0):
composer require dmoraschi/sitemap-common
or add the following to your composer.json file:
"dmoraschi/sitemap-common": "1.0.*"
SiteMapGenerator
Basic usage
$generator = new SiteMapGenerator( new FileWriter($outputFileName), new XmlTemplate() );
Add a URL:
$generator->addUrl($url, $frequency, $priority);
Add a single SiteMapUrl object or an array of them:
$siteMapUrl = new SiteMapUrl( new Url($url), $frequency, $priority );

$generator->addSiteMapUrl($siteMapUrl);
$generator->addSiteMapUrls([ $siteMapUrl, $siteMapUrl2 ]);
Set the URLs of the sitemap via SiteMapUrlCollection:
$siteMapUrl = new SiteMapUrl( new Url($url), $frequency, $priority );
$collection = new SiteMapUrlCollection([ $siteMapUrl, $siteMapUrl2 ]);

$generator->setCollection($collection);
Generate the sitemap:
$generator->execute();
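Putting the pieces together, here is a minimal end-to-end sketch. The output filename, the URLs, and the 'daily'/'monthly' frequency and 1.0/0.8 priority values are illustrative assumptions following the standard sitemap protocol, not values taken from the library's documentation; any required use statements are omitted.

// Minimal sketch: build a sitemap for two URLs and write it out via FileWriter.
// Frequency/priority values are assumptions based on the sitemap protocol.
$generator = new SiteMapGenerator(
    new FileWriter('sitemap.xml'),
    new XmlTemplate()
);

$generator->addUrl('https://example.com/', 'daily', 1.0);
$generator->addSiteMapUrl(
    new SiteMapUrl( new Url('https://example.com/about'), 'monthly', 0.8 )
);

// Writes the generated sitemap.
$generator->execute();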
Crawler
Basic usage
$crawler = new Crawler( new Url($baseUrl), new RegexBasedLinkParser(), new HttpClient() );
You can tell the Crawler not to visit certain URLs by adding policies. Below are the default policies provided by the library:
$crawler->setPolicies([
    'host' => new SameHostPolicy($baseUrl),
    'url'  => new UniqueUrlPolicy(),
    'ext'  => new ValidExtensionPolicy(),
]);

// or

$crawler->setPolicy('host', new SameHostPolicy($baseUrl));
SameHostPolicy, UniqueUrlPolicy, and ValidExtensionPolicy are provided with the library; you can define your own policies by implementing the Policy interface.
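For illustration only, a custom policy might look like the sketch below. The isValid() method name and the string cast on the Url object are assumptions; check the package's Policy interface for the real contract before using it.

// Hypothetical custom policy that skips any URL containing "/private/".
// The isValid() method name and the (string) cast on Url are assumptions;
// align them with the actual Policy interface in the package.
class NoPrivatePathsPolicy implements Policy
{
    public function isValid(Url $url)
    {
        return strpos((string) $url, '/private/') === false;
    }
}

$crawler->setPolicy('no-private', new NoPrivatePathsPolicy());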
Calling the crawl function, the object will start from the base URL given to the constructor and crawl the web pages down to the specified depth passed as an argument.
The function returns an array of all the unique visited URLs:
$urls = $crawler->crawl($deep);
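Since crawl() returns the visited URLs, its result can be fed straight into a SiteMapGenerator. A minimal sketch, assuming a depth of 3 and 'weekly'/0.5 frequency and priority values; whether each returned element is a string or a Url object determines which add method fits best.

// Crawl up to 3 levels deep, then register every visited URL as a sitemap entry.
// Depth and frequency/priority values here are illustrative assumptions.
$urls = $crawler->crawl(3);

foreach ($urls as $url) {
    $generator->addUrl($url, 'weekly', 0.5);
}

$generator->execute();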
You can also instruct the Crawler to collect custom data while visiting the web pages by adding Collectors to the main object:
$crawler->setCollectors([
    'images' => new ImageCollector()
]);

// or

$crawler->setCollector('images', new ImageCollector());
And then retrieve the collected data:
$crawler->crawl($deep);

$imageCollector = $crawler->getCollector('images');
$data = $imageCollector->getCollectedData();
ImageCollector is provided by the library; you can define your own collectors by implementing the Collector interface.
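As a rough sketch only, a custom collector could follow the shape below. Apart from setCollector(), getCollector(), getCollectedData(), and crawl(), which appear above, everything here is an assumption, in particular the collect() hook and what gets passed to it; check the package's Collector interface for the real contract.

// Hypothetical collector that gathers <title> tags from visited pages.
// The collect($pageContent) hook is assumed; only getCollectedData()
// appears in the README (on ImageCollector).
class TitleCollector implements Collector
{
    private $titles = [];

    public function collect($pageContent)
    {
        if (preg_match('#<title>(.*?)</title>#is', (string) $pageContent, $match)) {
            $this->titles[] = trim($match[1]);
        }
    }

    public function getCollectedData()
    {
        return $this->titles;
    }
}

$crawler->setCollector('titles', new TitleCollector());
$crawler->crawl($deep);

$titles = $crawler->getCollector('titles')->getCollectedData();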