dmoraschi/sitemap-common

Sitemap generator and crawler library

v1.1.0 2016-08-21 23:07 UTC




This package provides all of the components needed to crawl a website and to build and write sitemap files.

Example of console application using the library: dmoraschi/sitemap-app

Installation

Run the following command, providing the latest stable version (e.g. v1.0.0):

composer require dmoraschi/sitemap-common

or add the following to your composer.json file:

"dmoraschi/sitemap-common": "1.0.*"

SiteMapGenerator

Basic usage

$generator = new SiteMapGenerator(
    new FileWriter($outputFileName),
    new XmlTemplate()
);

Add a URL:

$generator->addUrl($url, $frequency, $priority);
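For example, assuming the frequency is a standard sitemap change-frequency string and the priority a float between 0.0 and 1.0 (check the SiteMapUrl class for the exact expected values), a call might look like:

$generator->addUrl('http://www.example.com/about', 'weekly', 0.8);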

Add a single SiteMapUrl object or an array of them:

$siteMapUrl = new SiteMapUrl(
    new Url($url), $frequency, $priority
);
$siteMapUrl2 = new SiteMapUrl(
    new Url($anotherUrl), $frequency, $priority
);

$generator->addSiteMapUrl($siteMapUrl);

$generator->addSiteMapUrls([
    $siteMapUrl, $siteMapUrl2
]);

Set the URLs of the sitemap via SiteMapUrlCollection:

$siteMapUrl = new SiteMapUrl(
    new Url($url), $frequency, $priority
);

// $siteMapUrl2 is a second SiteMapUrl instance created the same way
$collection = new SiteMapUrlCollection([
    $siteMapUrl, $siteMapUrl2
]);

$generator->setCollection($collection);

Generate the sitemap:

$generator->execute();
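Putting the pieces together, a minimal end-to-end sketch (assuming $outputFileName points to a writable path and the library classes are imported) might look like:

$outputFileName = 'sitemap.xml';

$generator = new SiteMapGenerator(
    new FileWriter($outputFileName),
    new XmlTemplate()
);

$generator->addUrl('http://www.example.com/', 'daily', 1.0);
$generator->addUrl('http://www.example.com/contact', 'monthly', 0.5);

// Writes the sitemap XML to sitemap.xml
$generator->execute();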

Crawler

Basic usage

$crawler = new Crawler(
    new Url($baseUrl),
    new RegexBasedLinkParser(),
    new HttpClient()
);

You can tell the Crawler not to visit certain URLs by adding policies. Below are the default policies provided by the library:

$crawler->setPolicies([
    'host' => new SameHostPolicy($baseUrl),
    'url'  => new UniqueUrlPolicy(),
    'ext'  => new ValidExtensionPolicy(),
]);
// or
$crawler->setPolicy('host', new SameHostPolicy($baseUrl));

SameHostPolicy, UniqueUrlPolicy, and ValidExtensionPolicy are provided with the library; you can define your own policies by implementing the Policy interface, as sketched below.
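As a rough illustration, a custom policy could reject URLs that contain a query string. The method name below is an assumption, so check the Policy interface in the library for the actual contract:

class NoQueryStringPolicy implements Policy
{
    // Hypothetical method name; the real Policy interface may differ.
    public function isValid(Url $url)
    {
        // Skip any URL that contains a query string.
        return strpos((string) $url, '?') === false;
    }
}

$crawler->setPolicy('no-query', new NoQueryStringPolicy());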

Calling the crawl function, the object will start from the base URL given in the constructor and crawl the web pages down to the depth passed as an argument. The function returns an array of all unique visited URLs:

$urls = $crawler->crawl($deep);
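The returned URLs can then be fed straight into a SiteMapGenerator. A sketch, assuming every page gets the same frequency and priority and that the returned Url objects can be cast to strings:

$urls = $crawler->crawl(2); // crawl two levels deep

foreach ($urls as $url) {
    // Add every discovered URL to the sitemap with default values.
    $generator->addUrl((string) $url, 'weekly', 0.5);
}

$generator->execute();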

You can also instruct the Crawler to collect custom data while visiting web pages by adding Collectors to the main object:

$crawler->setCollectors([
    'images' => new ImageCollector()
]);
// or
$crawler->setCollector('images', new ImageCollector());

And then retrieve the collected data:

$crawler->crawl($deep);

$imageCollector = $crawler->getCollector('images');
$data = $imageCollector->getCollectedData();

ImageCollector is provided by the library; you can define your own collector by implementing the Collector interface, as sketched below.
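For example, a collector that records every page title might look roughly like this. Only getCollectedData() appears in the snippets above; the collect() hook is an assumption, so check the Collector interface for the actual contract:

class TitleCollector implements Collector
{
    private $titles = [];

    // Hypothetical hook; the real Collector interface may receive
    // the visited page in a different way.
    public function collect($pageContent)
    {
        if (preg_match('/<title>(.*?)<\/title>/is', $pageContent, $matches)) {
            $this->titles[] = trim($matches[1]);
        }
    }

    public function getCollectedData()
    {
        return $this->titles;
    }
}

$crawler->setCollector('titles', new TitleCollector());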