jbsommeling / scannr
A Laravel package that crawls websites to detect broken links, redirect chains, HTTPS downgrades, and more. Includes JavaScript rendering support for SPAs.
Requires
- php: ^8.4
- ext-libxml: *
- ext-simplexml: *
- illuminate/console: ^12.0
- illuminate/support: ^12.0
- spatie/browsershot: ^5.2
- symfony/css-selector: ^7.0|^8.0
- symfony/dom-crawler: ^7.0|^8.0
Requires (Dev)
- mockery/mockery: ^1.6
- orchestra/testbench: ^10.0
- phpunit/phpunit: ^11.5.3
This package is auto-updated.
Last update: 2026-03-25 08:40:09 UTC
README
Scannr
A Laravel package that crawls websites to detect broken links, redirect chains, HTTPS downgrades, and more. Includes JavaScript rendering support for SPAs. Use it as a dev dependency in your Laravel project or as a Docker-based GitHub Action in your CI/CD pipeline.
Features
- BFS Crawling — Systematically crawls websites level by level
- Broken Link Detection — Identifies links returning 4xx/5xx status codes
- Multi-Element Scanning — Extracts URLs from
<a>,<link>,<script>,<img>, media elements, download triggers, and inline JS handlers - JavaScript Rendering — Scan SPAs (React, Vue, Angular) with headless browser support via
--jsflag - Smart JS Detection — Automatically detect SPAs and enable JS rendering only when needed via
--smart-jsflag - Sitemap Integration — Discover URLs from XML, HTML, or plain text sitemaps (including sitemap index files)
- robots.txt Support — Respects
Disallowrules andCrawl-delaydirectives - Redirect Chain Tracking — Follows and reports redirect chains, including loop detection
- HTTPS Downgrade Detection — Warns when redirects downgrade from HTTPS to HTTP
- Link Flag System — Every link is tagged with observation flags (discovery method, platform traits, technical anomalies, validation outcome)
- Integrity Scoring — Weight-based scoring model that rates your site's link health (A–F)
- Quality Gates — Fail CI/CD when critical issues are found or the integrity score drops below a configurable threshold
- Rate Limit Backoff — Automatic exponential backoff on HTTP 429 responses
- Noise URL Filtering — Hides XML namespaces, CDN preconnect hints, and JS framework error docs
- Multiple Output Formats — Table, JSON, or CSV output
- Domain Validation — Warns when the scanned URL doesn't match your
APP_URL
Responsible Use
Scannr is intended for scanning websites you own or have explicit permission to test.
Users are responsible for ensuring their usage complies with:
- Website terms of service
- Applicable laws and regulations
- robots.txt directives
Do not use Scannr to:
- Crawl websites without permission
- Overload servers or bypass rate limits
- Access protected or private content
The authors are not responsible for misuse of this tool.
Installation
As a Laravel Dev Dependency
composer require jbsommeling/scannr
The package auto-discovers its service provider. To publish the configuration:
php artisan vendor:publish --tag=scannr-config
To publish the migration (for --queue support):
php artisan vendor:publish --tag=scannr-migrations php artisan migrate
JavaScript Rendering (Optional)
For --js and --smart-js support, install Puppeteer:
npm install puppeteer
CLI Usage
Basic Scan
php artisan site:scan https://example.com
Command Signature
php artisan site:scan {url} [options]
Options
| Option | Default | Description |
|---|---|---|
--depth=N |
3 | Maximum crawl depth from the starting URL |
--max=N |
300 | Maximum number of URLs to scan |
--timeout=N |
5 | Request timeout in seconds |
--format=FORMAT |
table | Output format: table, json, or csv |
--status=FILTER |
all | Filter results: all, ok, or broken |
--filter=TYPE |
all | Filter by element type: all, a, link, script, img, media, form |
--scan-elements=TYPES |
all | Element types to scan: all, or comma-separated (e.g., a,img) |
--sitemap |
false | Use sitemap.xml to discover URLs before crawling |
--js |
false | Enable JavaScript rendering for SPA/React sites |
--smart-js |
false | Auto-detect SPAs and enable JS rendering when needed |
--no-robots |
false | Ignore robots.txt rules |
--advanced |
false | Show noise URLs (XML namespaces, CDN hints, etc.) |
--strip-params=PARAMS |
— | Additional tracking parameters to strip (comma-separated) |
--delay-min=N |
config | Minimum delay between requests in milliseconds |
--delay-max=N |
config | Maximum delay between requests in milliseconds |
--queue |
false | Dispatch scan as a background job |
--fail-on-critical |
false | Fail with exit code 1 if critical issues are found |
--min-rating=GRADE |
none | Minimum acceptable rating: excellent, good, needs_attention, none |
Examples
# Show only broken links php artisan site:scan https://example.com --status=broken # Deep scan with sitemap discovery php artisan site:scan https://example.com --depth=5 --max=1000 --sitemap # Scan a React SPA with JSON output php artisan site:scan https://my-spa.com --js --format=json # Only scan images and anchors php artisan site:scan https://example.com --scan-elements=a,img --status=broken # Throttle requests (200–500ms random delay between each request) php artisan site:scan https://example.com --delay-min=200 --delay-max=500 # Fail if any critical issues found or rating drops below "Good" php artisan site:scan https://example.com --fail-on-critical --min-rating=good # Strict quality gate: require "Excellent" rating php artisan site:scan https://example.com --fail-on-critical --min-rating=excellent
Example Output
Table (default)
Site Scan: https://www.example.com
========================================
Robots.txt Crawl-delay: 1s (using 1000ms-1000ms delay)
Robots.txt: respecting 4 Disallow/Allow rule(s)
11/30 [▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░░░░░░░░░] 36% 30/30 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100%
🟢 Site Integrity Score: 100.0 / 100 — Excellent
Link Integrity: 100.0 / 100
Security Hygiene: 100.0 / 100
Technical Hygiene: 100.0 / 100
Redirect Health: 100.0 / 100
Link Verifiability: 100.0 / 100
Critical Issues: 0
Broken Links: 0
Warnings: 0
Manual Verification: 0
Summary:
Total scanned: 11
Pages scanned: 2
Internal links: 1
Assets scanned: 9
External links: 1
Working (2xx): 11
Redirects: 0
Broken: 0
Timeouts: 0
+----------------------------------------------+--------------------------+----------+--------+----------+
| URL | Source | Element | Status | Type |
+----------------------------------------------+--------------------------+----------+--------+----------+
| https://www.example.com | start | <a> | 200 | internal |
| https://example.com/favicon.svg | https://www.example.com | <link> | 200 | internal |
| https://example.com/favicon.ico | https://www.example.com | <link> | 200 | internal |
| https://example.com/assets/app-Dk29a1xC.css | https://www.example.com | <link> | 200 | internal |
| https://example.com/assets/app-Bx7L3f2q.js | https://www.example.com | <script> | 200 | internal |
| https://fonts.googleapis.com/css2?family=... | https://www.example.com | <link> | 200 | external |
+----------------------------------------------+--------------------------+----------+--------+----------+
JSON (--format=json)
{
"summary": {
"totalScanned": 8,
"ok": 8,
"redirects": 0,
"broken": 0,
"timeouts": 0,
"redirectChainCount": 0,
"totalRedirectHops": 0,
"httpsDowngrades": 0,
"criticalCount": 0,
"warningCount": 0,
"lowConfidenceCount": 0,
"pagesScanned": 2,
"internalLinks": 1,
"assetsScanned": 6,
"externalLinks": 1
},
"integrityScore": {
"overallScore": 100,
"grade": "Excellent",
"components": {
"linkIntegrity": 100,
"securityHygiene": 100,
"technicalHygiene": 100,
"redirectHealth": 100,
"linkVerifiability": 100
}
},
"links": [
{
"url": "https://www.example.com",
"sourcePage": "start",
"status": "200",
"type": "internal",
"sourceElement": "a",
"analysis": {
"flags": [],
"severity": "info",
"confidence": "high",
"verification": "none"
},
"redirect": {
"chain": [],
"isLoop": false,
"hasHttpsDowngrade": false
},
"network": {
"retryAfter": null
}
}
],
"brokenLinks": []
}
GitHub Action
Use Scannr in your CI/CD pipeline to catch broken links on every deploy.
Basic Usage
name: Scan for broken links on: push: branches: [main] schedule: - cron: '0 6 * * 1' # Weekly Monday 6am jobs: scan: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: JBSommeling/scannr@v1 with: url: https://example.com
All Options
- uses: JBSommeling/scannr@v1 with: url: https://example.com # Required depth: 5 # Max crawl depth (default: 3) max: 500 # Max URLs to scan (default: 300) timeout: 10 # Request timeout in seconds (default: 5) format: table # Output: table, json, csv (default: table) status: broken # Filter: all, ok, broken (default: all) filter: all # Element type filter (default: all) scan-elements: a,img # Elements to scan (default: all) sitemap: true # Use sitemap.xml (default: false) js: true # JS rendering for SPAs (default: false) smart-js: true # Auto-detect SPAs (default: false) no-robots: false # Ignore robots.txt (default: false) advanced: false # Show noise URLs (default: false) strip-params: ref,tracker # Extra tracking params to strip delay-min: 200 # Min delay between requests in ms delay-max: 500 # Max delay between requests in ms fail-on-broken: true # Fail step on broken links (default: true) fail-on-critical: true # Fail step on critical issues (default: true) min-rating: good # Minimum integrity rating (default: good)
Examples
Scan a SPA with JavaScript rendering:
- uses: JBSommeling/scannr@v1 with: url: https://my-react-app.com js: true status: broken
Get JSON output for downstream processing:
- uses: JBSommeling/scannr@v1 id: scan with: url: https://example.com format: json fail-on-broken: false - name: Process results if: steps.scan.outputs.exit-code != '0' run: echo "Broken links detected!"
Deep weekly audit with sitemap:
- uses: JBSommeling/scannr@v1 with: url: https://example.com depth: 5 max: 1000 sitemap: true smart-js: true
Quality gate — require "Excellent" rating:
- uses: JBSommeling/scannr@v1 with: url: https://example.com fail-on-critical: true min-rating: excellent
Relaxed quality gate — only fail on critical issues:
- uses: JBSommeling/scannr@v1 with: url: https://example.com fail-on-critical: true min-rating: none
URL Restriction
Scannr is designed for scanning your own websites, not arbitrary third-party sites.
Local Dev Dependency
When running via php artisan site:scan, Scannr compares the scan URL's domain against your APP_URL environment variable. If they don't match, a warning is displayed:
⚠ The scan URL domain (other-site.com) does not match your APP_URL (https://myapp.com).
Scannr is intended for scanning your own websites.
GitHub Action (.scannr.yml)
In CI/CD, you can create a .scannr.yml file in your repository root to declare which domains are allowed:
# .scannr.yml allowed_domains: - example.com - staging.example.com - www.example.com
When .scannr.yml is present, the GitHub Action will block scans of URLs that don't match any listed domain.
If no .scannr.yml exists, the URL input is trusted (since you control your own workflow file).
Configuration
Publish the config file to customize scanner behavior:
php artisan vendor:publish --tag=scannr-config
This creates config/scannr.php with settings for:
- Tracking Parameters — URL params to strip during normalization (utm_*, fbclid, etc.)
- User Agent — The User-Agent header sent with requests
- Rate Limiting — Delay between requests, backoff on 429 responses
- Hard Limits — Maximum caps for depth and URL count
- JavaScript Rendering — Custom paths to Node, npm, and Chrome binaries
- Noise URL Detection — Namespace domains, preconnect hints, framework error patterns
- Integrity Scoring — Penalty weights, confidence multipliers, grade thresholds
- External Platforms — Domains with known bot protection (LinkedIn, GitHub, etc.)
Testing
composer test
License
This package is open-sourced software licensed under the MIT license.