survos/site-discovery-bundle

Discover hosted SaaS tenant sites (*.example.com) via web archive indexes. Currently supports Internet Archive CDX API only.

Maintainers

Package info

github.com/survos/site-discovery-bundle

Type:symfony-bundle

pkg:composer/survos/site-discovery-bundle

Fund package maintenance!

kbond

Statistics

Installs: 5

Dependents: 1

Suggesters: 0

Stars: 0

Open Issues: 0

2.0.126 2026-03-04 13:16 UTC

This package is auto-updated.

Last update: 2026-03-11 03:31:52 UTC


README

Discovers tenant hostnames under a shared SaaS domain — e.g. *.pastperfectonline.com, *.omeka.net — by querying web archive indexes.

Current backend: Internet Archive CDX API only. Common Crawl support is planned but not yet implemented.

Requirements: PHP 8.4+, Symfony 8.0+

Background: how the CDX API works

The Internet Archive crawls the web continuously and stores every URL in a CDX (Capture inDeX). URLs are sorted in SURT (Sort-friendly URI Reordering Transform) order, which reverses domain label order:

https://fauquierhistory.pastperfectonline.com/path
  →  com,pastperfectonline,fauquierhistory)/path

For a SaaS platform like PastPerfect Online, every tenant has a subdomain. Their SURT keys look like:

com,pastperfectonline,fauquierhistory)/
com,pastperfectonline,bainbridgehistorymuseum)/advancedsearch

The tenant slug (fauquierhistory) sits between the shared SURT prefix (com,pastperfectonline,) and the closing ). This bundle queries CDX for all URLs under the registered domain, filters to subdomain-only rows, and extracts unique slugs.

Computing the SURT prefix

Reverse the domain labels, join with commas, add a trailing comma:

Domain SURT prefix
pastperfectonline.com com,pastperfectonline,
omeka.net net,omeka,
myheritage.com com,myheritage,
arcgis.com com,arcgis,

Installation

composer require survos/site-discovery-bundle

Register if not using Symfony Flex:

// config/bundles.php
return [
    Survos\SiteDiscoveryBundle\SurvosSiteDiscoveryBundle::class => ['all' => true],
];

Configuration

# config/packages/survos_site_discovery.yaml
survos_site_discovery:
    user_agent: "MyApp SiteDiscovery"   # defaults to "SurvosSiteDiscoveryBundle"

Console command

site:discover <domain> <surtPrefix> [options]

Arguments

Argument Description
domain Bare registered domain, e.g. pastperfectonline.com
surtPrefix SURT prefix for subdomain rows, e.g. com,pastperfectonline,

Options

Option Default Description
--output stdout Write JSONL to this file path
--limit 0 Stop after N unique sites (0 = unlimited). Always use a small number during development.
--page-size 5000 CDX rows per API request (max ~10 000)
--scheme https URL scheme used in base_url

Examples

# Discover PastPerfect Online sites, print to stdout (first 5 for testing)
bin/console site:discover pastperfectonline.com com,pastperfectonline, --limit=5

# Write to a JSONL file
bin/console site:discover pastperfectonline.com com,pastperfectonline, \
    --output=var/discovery/pastperfect-sites.jsonl

# Discover Omeka.net sites
bin/console site:discover omeka.net net,omeka, \
    --output=var/discovery/omeka-sites.jsonl

# Full discovery — slow, expect 10–30 s per CDX page
bin/console site:discover pastperfectonline.com com,pastperfectonline, \
    --output=var/discovery/pastperfect-sites.jsonl

Output JSONL shape

One JSON object per line:

{
  "slug":           "fauquierhistory",
  "host":           "fauquierhistory.pastperfectonline.com",
  "base_url":       "https://fauquierhistory.pastperfectonline.com",
  "discovered_via": "internet_archive_cdx",
  "validated":      false,
  "validated_at":   null
}

PHP API

Inject CdxDiscoveryService

use Survos\SiteDiscoveryBundle\Service\CdxDiscoveryService;
use Survos\SiteDiscoveryBundle\Model\DiscoveredSite;

final class MyHarvester
{
    public function __construct(
        private readonly CdxDiscoveryService $cdx,
    ) {}

    public function run(): void
    {
        foreach ($this->cdx->discover('pastperfectonline.com', 'com,pastperfectonline,') as $site) {
            // $site is a DiscoveredSite value object
            echo $site->slug;     // "fauquierhistory"
            echo $site->host;     // "fauquierhistory.pastperfectonline.com"
            echo $site->baseUrl;  // "https://fauquierhistory.pastperfectonline.com"

            $row = $site->toArray(); // JSONL-ready associative array
        }
    }
}

CdxDiscoveryService::discover() signature

public function discover(
    string $domain,       // e.g. "pastperfectonline.com"
    string $surtPrefix,   // e.g. "com,pastperfectonline,"
    string $scheme   = 'https',
    int    $limit    = 0,     // 0 = unlimited; set small for dev/testing
    int    $pageSize = 5000,
): \Generator  // yields DiscoveredSite

DiscoveredSite value object

final readonly class DiscoveredSite
{
    public string $slug;           // "fauquierhistory"
    public string $host;           // "fauquierhistory.pastperfectonline.com"
    public string $baseUrl;        // "https://fauquierhistory.pastperfectonline.com"
    public string $discoveredVia;  // "internet_archive_cdx"

    public function toArray(): array;  // JSONL-ready
}

CDX API technical notes

These notes are provided for agents and developers integrating with or extending this bundle.

Endpoint: https://web.archive.org/cdx/search/cdx

Parameters used by this bundle:

Parameter Value Purpose
url e.g. pastperfectonline.com Registered domain (no wildcard)
matchType domain Returns all URLs in the entire domain tree
output json Structured response; row 0 is a header array
fl urlkey Only fetch the SURT key column — cheapest option
collapse urlkey CDX-level deduplication
filter urlkey:{surtPrefix}[a-z0-9] Restrict to subdomain rows; skips bare domain entries
limit 5000 Rows per page
showResumeKey true Enables pagination
resumeKey {key from previous page} Continue from prior page

Pagination: when showResumeKey=true, the last row of each page is a resume-key string (not a urlkey). It does NOT start with the SURT prefix — that is how we distinguish it from real data rows. Pass it as resumeKey on the next request.

Why fl=urlkey instead of fl=original: the original field contains the raw crawled URL, which requires URL parsing to extract the hostname. The urlkey encodes the slug directly and unambiguously. One regex match is all that is needed.

Why matchType=domain instead of matchType=host: matchType=host with a wildcard (*.example.com) returns empty results. matchType=domain with the bare registered domain returns the full tree.

Latency: CDX API requests with matchType=domain can take 10–30 seconds per page. The response is streamed; the bundle waits for the full response. Plan accordingly.

Coverage gaps: sites blocked by robots.txt during crawl, or newer than the most recent IA crawl, will not appear. Use the output as a candidate list to be validated, not as a definitive registry.

Rate limiting

The CDX API is free and unauthenticated. The Internet Archive does not publish a formal rate limit, but hammering the API with parallel requests is antisocial. This bundle makes one sequential request per page. Do not add concurrency.

Downstream validation

This bundle only discovers candidate hostnames. It does not validate that a host is currently live or that it is still running the expected platform. Add a probe step in your consumer bundle, e.g.:

// Pseudo-code — implement in your bundle
$response = $httpClient->request('GET', $site->baseUrl . '/AdvancedSearch');
$isLive = $response->getStatusCode() === 200
    && str_contains($response->getContent(), 'pastperfectonline');

Planned backends

  • Common Crawl Host Index (higher coverage; requires DuckDB or Athena)
  • Static seed file (CSV/JSONL of known hosts, for offline or pre-seeded use)

Pull requests for additional backends are welcome. Implement CdxDiscoveryService as a reference — yield DiscoveredSite objects, accept a $limit parameter, stream results lazily.

License

MIT