padosoft / eval-harness
Laravel evaluation framework for RAG / LLM applications: golden datasets, exact-match + cosine-embedding + LLM-as-judge metrics, JSON + Markdown reports, Artisan-driven CI gate.
Requires
- php: ^8.3
- illuminate/bus: ^12.0|^13.0
- illuminate/cache: ^12.0|^13.0
- illuminate/console: ^12.0|^13.0
- illuminate/filesystem: ^12.0|^13.0
- illuminate/http: ^12.0|^13.0
- illuminate/queue: ^12.0|^13.0
- illuminate/routing: ^12.0|^13.0
- illuminate/support: ^12.0|^13.0
- symfony/polyfill-mbstring: ^1.37
- symfony/yaml: ^7.0|^8.0
Requires (Dev)
- laravel/pint: ^1.18
- orchestra/testbench: ^10.0|^11.0
- phpstan/phpstan: ^2.0
- phpunit/phpunit: ^11.0|^12.0
Suggests
- laravel/ai: Use the laravel/ai SDK providers (OpenAI, OpenRouter, Regolo, etc.) to route metric LLM and embedding calls without managing HTTP credentials by hand.
- laravel/horizon: Recommended in host applications for supervising parallel eval queues in production.
- padosoft/laravel-ai-regolo: Drives every metric LLM call through Italian sovereign Regolo infrastructure (cost + GDPR). Any laravel/ai SDK provider works.
- dev-main
- v1.2.0
- v1.1.0
- 1.0
- v0.1.0
- dev-task/report-api-completeness-v9
- dev-task/report-api-completeness-v9-dataset-trend
- dev-task/report-api-completeness-v9-batch-live
- dev-task/report-api-completeness-v9-adversarial-manifests
- dev-task/report-api-completeness-v9-diff
- dev-task/enterprise-operations-scalability
- dev-task/enterprise-operations-scalability-batch-profiles
- dev-task/report-api-ui-contract
- dev-task/report-api-ui-contract-api-endpoints
- dev-chore/test-count-readme-sync
- dev-chore/import-r36-skill
This package is auto-updated.
Last update: 2026-05-06 21:38:51 UTC
README
Laravel-native evaluation framework for RAG / LLM applications. Golden datasets in YAML, seven built-in metrics, standalone output assertions, Markdown + JSON reports, and an Artisan CI gate. Stop shipping silent regressions in your AI pipeline.
Table of Contents
- Why eval-harness?
- Design rationale
- Features
- Comparison with alternatives
- Installation
- Quick start
- Usage examples
- Web admin panel UI
- Contract stability and migration
- Configuration
- Architecture
- AI vibe-coding pack included
- Testing
- Roadmap
- Contributing
- Security
- License
Why eval-harness?
Imagine deploying a RAG-powered chatbot to production. Quality is great on launch day. Three months later, somebody:
- bumps the embedding model from
text-embedding-3-smalltotext-embedding-3-large, - swaps the chat model from
gpt-4otogpt-4o-minifor cost, - tweaks the prompt template,
- updates
laravel/aifrom^0.5to^0.6, - changes the chunker from 800-token sliding window to 1200-token semantic.
Every one of those changes is a quality regression risk, and you have no programmatic signal they shipped intact. Your test suite green-lights the deployment because PHPUnit doesn't know what a "correct answer" looks like.
padosoft/eval-harness fixes that loop:
- You curate a small golden dataset — a YAML file with 30-200
(question, expected answer)pairs that represent the queries you actually care about. - You declare metrics — exact-match for deterministic outputs, cosine-embedding for paraphrase tolerance, LLM-as-judge for subjective grading.
- You wire up a callable that drives your real pipeline against the dataset.
- CI runs
php artisan eval-harness:run rag.factualityon every PR and gates the merge on the macro-F1 score.
Now your AI pipeline has the same regression protection your business logic has had for the last fifteen years.
Design rationale
The package is opinionated. Three decisions matter most:
1. No SDK lock-in. Every external call goes through Laravel's
Http:: facade — never an OpenAI / Anthropic / Vertex SDK. Tests
substitute via Http::fake() for deterministic offline runs, and
swapping providers is a config-file change, not a refactor.
2. The dataset is YAML, not a Laravel model. YAML is reviewable
in pull requests, diffable across releases, and survives database
wipes. The package never stores datasets in your DB — they live in
eval/golden/*.yml next to your code.
3. Failures are captured by default. A timeout on sample 47
should not mask the macro-F1 score across 200 valid samples. Every
metric exception is recorded against (sample, metric) and
surfaced in the final report so the operator can investigate, not
re-run the whole 30-minute suite. Strict CI lanes can opt into
EVAL_HARNESS_RAISE_EXCEPTIONS=true to abort on the first
MetricException provider/metric contract error.
These decisions cost some flexibility (you can't dispatch metrics across multiple processes yet — see Roadmap) but they keep the public surface small and the offline path fast.
Features
- Nine metrics out of the box —
exact-match,contains,regex,rouge-l,citation-groundedness,cosine-embedding,bertscore-like,llm-as-judge,refusal-quality— and a cleanMetricinterface for adding more. - Strict-schema YAML loader — versioned dataset contracts and actionable validation errors for malformed samples.
- Deterministic LLM-as-judge — temperature 0, seed 42,
response_format=json_object. Strict-JSON parser rejects malformed responses instead of silently scoring 0. - Stable JSON report shape — every payload carries explicit
schema_versionanddataset_schema_versionfields. Wire into your CI dashboard once, then evolve additively. - Cohort-ready report data — JSON and Markdown reports aggregate
scores by
metadata.tags, expose an explicit untagged bucket, and include per-metric score histograms for dashboards. - Citation evidence checks —
citation-groundednesscan score simple citation markers or strictermetadata.citation_evidencespans that require both citation markers and quote text. - Opt-in adversarial lane —
AdversarialDatasetFactoryandphp artisan eval-harness:adversarialbuild/run safety regression seeds for prompt injection, jailbreaks, data leaks, SSRF, tool abuse, and similar red-team categories. JSON/Markdown reports add category and compliance-framework summaries, and optional manifests retain adversarial run summaries while preserving latest failure-free baselines per compatible report schema, dataset, metric names, and adversarial category/sample-count slice under tight retention;--regression-gatefails CI when macro-F1 or configured metric aggregates drop, and--promote-failureswrites failed samples back to a reloadable YAML dataset seed. Scheduler/CI guidance shows how to run the lane continuously without bundling a daemon in this package. - Standalone output assertions — score saved JSON/YAML outputs with the same metrics and report contract, without invoking your agent in CI.
- Usage summaries — JSON and Markdown reports aggregate structured
usagedetails for provider token counts, cost USD, and latency. - Runtime guardrails — provider timeouts are normalized, optional
retries cover Laravel HTTP connection failures plus HTTP 429/5xx,
and strict mode can rethrow
MetricExceptionfailures instead of capturing them. - Batch execution modes — SUT runs flow through deterministic
SerialBatchby default, or queue-backedLazyParallelBatchvia--batch=lazy-parallelfor Laravel queue/Horizon workers. - Operational batch profiles —
--batch-profile=ci|smoke|nightlyapplies named presets of batch defaults (concurrency, queue, timeouts, chunk size, rate limit, checkpoint cadence, result TTL). Explicit CLI flags always win; host apps can override or add profiles undereval-harness.batches.profiles.*inconfig/eval-harness.php. - Producer-side backpressure —
--chunk-size=N,--rate-limit=N --rate-window-seconds=W, and--result-ttl-secondsbound dispatch in flight, throttle to N samples per W-second sliding window (monotonic-clock math, amortized O(1)), and size result metadata TTL for delayed collection. Passnoneon any nullable numeric flag to clear an inherited profile value for a one-off run. - Progress checkpoints and terminal status —
--checkpoint-every=Nemits structured progress events through an optionalBatchProgressReportercontainer binding (defaultNullBatchProgressReporter). Dashboards that need to distinguish a finished failed batch from a stalled one can implement the optionalBatchTerminalProgressReportersub-contract for explicitsuccess/failure/emptyterminal status with partial-wins tolerance on the failure path. - Live batch registry endpoints —
GET /<configured-prefix>/batches/liveandGET /<configured-prefix>/batches/{id}/progressexpose active lazy-parallel batch ids and compact progress counters through cache-backed read-only API contracts. The live registry is enabled by default and can be disabled witheval-harness.batches.live_registry.enabled. - Adversarial manifest discovery endpoints —
GET /<configured-prefix>/adversarial/manifestsandGET /<configured-prefix>/adversarial/manifests/{name}enumerate adversarial run manifests written to a configured disk so a companion UI can browse compliance history without scraping the filesystem. Opt-in viaeval-harness.adversarial.manifests.{disk,path_prefix}; the CLI--adversarial-manifest=<arbitrary-path>flag is preserved for existing operators. - Report diff endpoint —
GET /<configured-prefix>/reports/{id}/diff/{otherId}computes signed deltas (macro_f1, per-metric mean/pass_rate, per-cohort status withadded/removed/regressed/improved/stable, total_samples / total_failures, adversarial categories when present) so a companion UI can show regression diffs side-by-side without fetching full reports. The prefix defaults toeval-harness/apiand is configurable througheval-harness.api.prefix. - Dataset trend endpoint —
GET /<configured-prefix>/datasets/{name}/trend?limit=Nscans stored JSON reports for one dataset, skips malformed artifacts, capslimitat 100, caps scanned JSON files througheval-harness.api.trend.max_files_scanned, and returns chronological points with metrics, cohorts, usage, and theeval-harness.report-api.v1.trendschema discriminator. - Provider-agnostic — works with OpenAI, OpenRouter, Regolo, Mistral, any OpenAI-compatible chat-completions endpoint.
- No DB migrations required — datasets are YAML, results are JSON. The package adds zero rows to your schema.
- Artisan-driven CI gate —
php artisan eval-harness:runexits non-zero on any captured failure. Wire it into the same workflow your unit tests run in. - Architecture tests included — every release proves it doesn't leak symbols from sibling packages, ever.
Comparison with alternatives
Status legend: ✅ YES means first-class support, ⚠️ partial means supported with limits or outside the Laravel-native path, and ❌ NO means not a primary fit.
| Concern | eval-harness | OpenAI Evals | LangSmith | Ragas | Promptfoo | DeepEval |
|---|---|---|---|---|---|---|
| Laravel-native package | ✅ YES - PHP/Laravel package | ❌ NO - Python CLI/library | ❌ NO - hosted Python/TS workflow | ❌ NO - Python library | ❌ NO - Node/YAML CLI | ❌ NO - Python library |
| Runs inside your app container | ✅ YES - resolves Laravel services directly | ⚠️ partial - custom completion functions | ⚠️ partial - SDK/API integration | ⚠️ partial - integrate from Python | ⚠️ partial - external CLI/provider call | ⚠️ partial - local Python runner |
| Local-first storage | ✅ YES - YAML datasets + JSON/Markdown reports | ⚠️ partial - local logs or Snowflake | ❌ NO - LangSmith cloud workspace | ✅ YES - local datasets/results | ✅ YES - local YAML/results | ⚠️ partial - local evals, optional Confident AI cloud |
| Read-only report API | ✅ YES - opt-in Laravel routes for report listing/show, cohorts, histograms, artifact download, and row CSV export | ⚠️ partial - custom artifact/API layer | ✅ YES - hosted experiment API | ⚠️ partial - custom app/API layer | ⚠️ partial - local result files/viewer workflows | ⚠️ partial - local results or hosted platform API |
| Built-in metrics | ✅ YES - offline exact/contains/regex/ROUGE-L/citation plus fakeable cosine/BERTScore-like/judge/refusal | ⚠️ partial - custom eval code | ✅ YES - evaluators in platform/SDK | ✅ YES - RAG-focused metrics | ✅ YES - assertions and graders | ✅ YES - built-in metrics |
| Embedding semantic overlap | ✅ YES - cosine-embedding + bertscore-like via fakeable EmbeddingClient | ⚠️ partial - custom embedding eval code | ⚠️ partial - SDK evaluator path | ✅ YES - RAG embedding metrics | ⚠️ partial - provider-backed similarity assertions | ✅ YES - semantic metrics |
| Deterministic no-network tests | ✅ YES - Http::fake, fake LLM/embedding clients | ⚠️ partial - depends on eval | ⚠️ partial - cloud/API path common | ⚠️ partial - many metrics need LLMs | ⚠️ partial - assertions can be local, red team needs models | ⚠️ partial - metric dependent |
| LLM-as-judge | ✅ YES - schema-checked, fakeable judge client | ✅ YES - model-graded evals | ✅ YES - evaluators | ✅ YES - LLM metrics | ✅ YES - rubric/grader assertions | ✅ YES - LLM metrics |
| Refusal quality / safety judge | ✅ YES - refusal-quality with required metadata + strict JSON schema | ⚠️ partial - custom model-graded eval | ⚠️ partial - custom evaluator workflow | ⚠️ partial - custom LLM metric | ✅ YES - safety/red-team assertions | ✅ YES - safety metrics |
| Adversarial red-team seeds | ✅ YES - opt-in Laravel seed factory for 10 categories | ⚠️ partial - custom eval registry | ⚠️ partial - custom datasets/evaluators | ⚠️ partial - RAG-focused tests | ✅ YES - red-team plugins | ✅ YES - safety test cases |
| Adversarial CLI lane | ✅ YES - eval-harness:adversarial with eval:adversarial alias, saved outputs, and batch options |
⚠️ partial - custom eval runner scripts | ⚠️ partial - custom evaluator automation | ⚠️ partial - Python code orchestration | ✅ YES - red-team CLI workflow | ✅ YES - safety test runner |
| Adversarial compliance mapping | ✅ YES - JSON/Markdown category + OWASP/NIST/EU AI Act summaries | ⚠️ partial - custom eval metadata | ⚠️ partial - custom evaluator metadata | ⚠️ partial - custom report code | ✅ YES - red-team category reporting | ⚠️ partial - safety metadata/reporting |
| Adversarial run history manifests | ✅ YES - local JSON manifest retains adversarial summaries and clean baselines | ⚠️ partial - custom eval logs | ✅ YES - hosted experiment history | ⚠️ partial - custom persistence | ✅ YES - monitoring/history workflows | ⚠️ partial - platform/history workflow |
| Adversarial regression gate | ✅ YES - --regression-gate fails on macro-F1 or metric drops from local manifests |
⚠️ partial - custom eval thresholds | ✅ YES - hosted experiment comparisons | ⚠️ partial - custom CI checks | ✅ YES - threshold/regression workflows | ✅ YES - test assertions/regression workflows |
| Scheduled/continuous monitoring | ✅ YES - Laravel Scheduler/CI cron guidance with manifests, Horizon queues, gates, and failure promotion | ⚠️ partial - custom scheduler around eval runs | ✅ YES - hosted monitoring workflows | ⚠️ partial - custom scheduler around Python metrics | ✅ YES - CLI/CI monitoring workflows | ⚠️ partial - local runner or hosted platform workflow |
| Failure promotion to datasets | ✅ YES - --promote-failures exports failed adversarial samples to YAML seeds |
⚠️ partial - custom eval scripts | ✅ YES - trace-to-dataset workflows | ⚠️ partial - custom dataset curation | ✅ YES - failure-driven test cases | ✅ YES - failed test cases can become datasets |
| Citation evidence spans | ✅ YES - citation_evidence requires marker + quote match | ⚠️ partial - custom eval code | ⚠️ partial - custom evaluator workflow | ✅ YES - RAG faithfulness/context metrics | ⚠️ partial - custom assertions | ✅ YES - RAG faithfulness metrics |
| Cost/token/latency summaries | ✅ YES - built-in provider usage + JSON/Markdown summaries | ⚠️ partial - custom logging | ✅ YES - experiment usage analytics | ✅ YES - usage/cost hooks | ⚠️ partial - provider output dependent | ⚠️ partial - metric/provider dependent |
| Runtime retry / strict exception controls | ✅ YES - normalized timeouts, connection/429/5xx retries, optional raise_exceptions | ⚠️ partial - custom eval code | ⚠️ partial - SDK/platform behavior | ✅ YES - runtime metric settings | ⚠️ partial - provider/config dependent | ⚠️ partial - custom evaluator handling |
| Provider choice | ✅ YES - any OpenAI-compatible endpoint via Laravel HTTP | ⚠️ partial - OpenAI API defaults, custom completion functions possible | ✅ YES - multi-provider ecosystem | ✅ YES - via integrations | ✅ YES - multi-provider | ✅ YES - multi-provider |
| CI gate | ✅ YES - Artisan command with non-zero failure exit | ⚠️ partial - script around CLI/API | ⚠️ partial - API/automation hook | ⚠️ partial - custom script | ✅ YES - CLI gate | ✅ YES - test runner/CI flow |
| Queue/Horizon batch execution | ✅ YES - SerialBatch + LazyParallelBatch for Laravel queues/Horizon | ❌ NO - not Laravel queues | ❌ NO - hosted tracing/evals | ❌ NO - not Laravel queues | ❌ NO - external CLI concurrency | ❌ NO - not Laravel queues |
| Named operational profiles | ✅ YES - --batch-profile=ci|smoke|nightly with host-app overrides under eval-harness.batches.profiles.* |
⚠️ partial - custom run scripts | ⚠️ partial - hosted experiment configs | ❌ NO - custom Python wrappers | ⚠️ partial - YAML config presets | ❌ NO - custom Python wrappers |
| Producer-side backpressure | ✅ YES - --chunk-size, --rate-limit, --rate-window-seconds with monotonic sliding-window math |
❌ NO - per-eval concurrency only | ⚠️ partial - hosted rate controls | ❌ NO - in-process only | ⚠️ partial - external concurrency flag | ❌ NO - in-process only |
| Progress checkpoints + terminal status | ✅ YES - --checkpoint-every plus optional BatchProgressReporter / BatchTerminalProgressReporter bindings |
❌ NO - log-tail only | ✅ YES - hosted run progress | ⚠️ partial - custom callbacks | ⚠️ partial - CLI progress lines | ⚠️ partial - custom hooks |
| Eval sets / multi-dataset runs | ✅ YES - EvalSetDefinition + resumable manifests | ✅ YES - oaievalset |
✅ YES - dataset experiments | ⚠️ partial - run multiple datasets in code | ✅ YES - suites/configs | ✅ YES - metric collections/test suites |
| Resume interrupted multi-dataset progress | ✅ YES - explicit per-dataset resume manifest | ❌ NO - no mid-eval resume | ⚠️ partial - platform run history | ⚠️ partial - custom code | ⚠️ partial - rerun/filter workflows | ⚠️ partial - platform/regression workflows |
| Cohorts / tags / facets | ✅ YES - tag cohorts in JSON/Markdown | ⚠️ partial - custom eval/reporting | ✅ YES - dataset filtering/metadata | ⚠️ partial - custom analysis | ✅ YES - metadata/config-driven views | ⚠️ partial - test metadata |
| Saved-output assertions | ✅ YES - --outputs and Eval::scoreOutputs() |
⚠️ partial - custom eval code | ⚠️ partial - compare uploaded runs | ⚠️ partial - build dataset/results manually | ✅ YES - assertion-first workflow | ✅ YES - test-case assertions |
| Auditable in PR diff | ✅ YES - YAML datasets + stable JSON/Markdown artifacts | ⚠️ partial - local YAML/code possible | ❌ NO - cloud-first | ✅ YES - code/data files | ✅ YES - YAML config | ✅ YES - Python test files |
| Vendor lock-in | ✅ YES - headless, local-first, provider-agnostic | ⚠️ partial - OpenAI-oriented defaults | ❌ NO - LangSmith workspace | ✅ YES - OSS library | ✅ YES - OSS CLI | ⚠️ partial - OSS plus Confident AI option |
| Cost to evaluate 200 offline samples | ✅ YES - free for offline metrics and faked providers | ⚠️ partial - depends on model calls | ❌ NO - cloud/API usage | ⚠️ partial - free only for non-LLM metrics | ⚠️ partial - free only for local assertions | ⚠️ partial - free only for local/non-LLM metrics |
The Python-stack tools are excellent if your stack is Python. If your
RAG pipeline lives in a Laravel monolith, eval-harness is the
shortest path from "we have AI in prod" to "we have a regression
test for our AI in prod".
Installation
composer require padosoft/eval-harness
The package is auto-discovered. No config/app.php edits required.
Optional config publishing:
php artisan vendor:publish --tag=eval-harness-config
This drops config/eval-harness.php into your app where you can
override the embeddings + judge endpoints / models / API keys.
Compatibility matrix
| eval-harness | PHP | Laravel | laravel/ai SDK | symfony/yaml |
|---|---|---|---|---|
| 0.x (current) | 8.3 / 8.4 / 8.5 | 12.x / 13.x | ^0.6 | ^7 / ^8 |
Quick start
1. Curate a golden dataset
eval/golden/factuality.yml:
schema_version: eval-harness.dataset.v1 name: rag.factuality.fy2026 samples: - id: capital-france input: question: "What is the capital of France?" expected_output: "Paris" metadata: tags: [geography, easy] - id: refund-policy input: question: "How many days do I have to return an order?" expected_output: "30 days from delivery." metadata: tags: [policy, support]
schema_version is optional for existing datasets. If omitted, the
loader defaults to eval-harness.dataset.v1.
2. Wire up a registrar in your app
app/Console/EvalRegistrar.php:
<?php namespace App\Console; use Illuminate\Contracts\Container\Container; use Padosoft\EvalHarness\EvalEngine; class EvalRegistrar { public function __construct(private readonly Container $container) {} public function __invoke(EvalEngine $engine): void { $engine->dataset('rag.factuality.fy2026') ->loadFromYaml(base_path('eval/golden/factuality.yml')) ->withMetrics(['exact-match', 'cosine-embedding']) ->register(); $this->container->bind('eval-harness.sut', fn () => fn (array $input): string => app(\App\Rag\KnowledgeAgent::class) ->answer($input['question']), ); } }
3. Run the eval
php artisan eval-harness:run rag.factuality.fy2026 \
--registrar="App\Console\EvalRegistrar" \
--json --out=factuality.json
Exit code is 0 if every metric scored cleanly, non-zero otherwise.
Wire that into the same tests.yml workflow that runs your PHPUnit
suite and you've got a regression gate.
4. Score saved outputs
When another job already generated model responses, keep the same dataset and score those outputs directly:
{
"outputs": {
"capital-france": "Paris",
"refund-policy": "30 days from delivery."
}
}
php artisan eval-harness:run rag.factuality.fy2026 \
--registrar="App\Console\EvalRegistrar" \
--outputs=eval/outputs/factuality.json \
--json --out=factuality.json
--outputs accepts JSON or YAML, map form (outputs.sample_id) or
list form (outputs[].id + outputs[].actual_output). Relative
--out paths use the configured reports disk and path prefix
(eval-harness/reports by default). Add --raw-path only when you
want a literal filesystem path and its parent directory already
exists. The registrar still registers the dataset; no
eval-harness.sut binding is required for this mode.
5. Read the report
php artisan eval-harness:run rag.factuality.fy2026 \
--registrar="App\Console\EvalRegistrar"
# Eval report — rag.factuality.fy2026
_Run completed in 2.41s over 30 samples (0 failures captured)._
## Summary
| total samples | total failures | duration seconds |
| --- | --- | --- |
| 30 | 0 | 2.41 |
## Per-metric aggregates
| metric | mean | p50 | p95 | pass-rate (>= 0.5) |
| --- | --- | --- | --- | --- |
| exact-match | 0.7333 | 1.0000 | 1.0000 | 0.7333 |
| cosine-embedding | 0.9012 | 0.9421 | 0.9893 | 0.9667 |
## Macro-F1 (avg pass-rate across all metrics): 0.8500
## Cohorts by metadata.tags
| cohort | samples | metric | mean | p50 | p95 | pass-rate (>= 0.5) |
| --- | --- | --- | --- | --- | --- | --- |
| geography | 12 | exact-match | 0.9500 | 1.0000 | 1.0000 | 0.9500 |
| refund-policy | 8 | exact-match | 0.6000 | 0.5000 | 1.0000 | 0.6000 |
## Score histograms
### exact-match
| score range | count |
| --- | --- |
| 0.0-0.1 | 8 |
| 0.9-1.0 inclusive | 22 |
6. Optional read-only report API
The package can register opt-in, read-only routes for a separate Laravel admin/UI package to consume stored report artifacts. Routes are disabled by default because this package does not bundle authentication; enable them only behind your host app's existing admin middleware.
// config/eval-harness.php 'api' => [ 'enabled' => true, 'prefix' => 'admin/eval-harness/api', 'middleware' => ['web', 'auth'], 'trend' => [ 'max_files_scanned' => 5000, ], ],
With the API enabled, GET /admin/eval-harness/api/reports lists JSON and
Markdown artifacts from the configured reports disk/prefix, and
GET /admin/eval-harness/api/reports/{id} shows one artifact by URL-safe id.
Additional read-only contracts are now available for UI consumers:
GET /admin/eval-harness/api/reports/{id}/cohortsfor cohort summaries.GET /admin/eval-harness/api/reports/{id}/histogramsfor score distribution buckets.GET /admin/eval-harness/api/reports/{id}/rows.csvfor CSV sample rows.GET /admin/eval-harness/api/reports/{id}/downloadfor direct artifact download.GET /admin/eval-harness/api/reports/{id}/diff/{otherId}for signed report deltas.GET /admin/eval-harness/api/adversarial/manifestsandGET /admin/eval-harness/api/adversarial/manifests/{name}for adversarial run manifest discovery.GET /admin/eval-harness/api/batches/liveandGET /admin/eval-harness/api/batches/{id}/progressfor live lazy-parallel batch monitoring.GET /admin/eval-harness/api/datasets/{name}/trend?limit=Nfor chronological dataset trend points.- JSON examples and contract notes are documented in
docs/REPORT_API_CONTRACT.md.
Usage examples
Programmatic dataset (no YAML)
use Padosoft\EvalHarness\Datasets\DatasetSample; use Padosoft\EvalHarness\Facades\EvalFacade as Eval; Eval::dataset('rag.smoke') ->withSamples([ new DatasetSample(id: 's1', input: ['q' => 'hi'], expectedOutput: 'hello'), new DatasetSample(id: 's2', input: ['q' => 'bye'], expectedOutput: 'goodbye'), ]) ->withMetrics(['exact-match']) ->register(); $report = Eval::scoreOutputs('rag.smoke', [ 's1' => 'hello', 's2' => 'wrong answer', ]);
Custom metric
use Padosoft\EvalHarness\Datasets\DatasetSample; use Padosoft\EvalHarness\Metrics\Metric; use Padosoft\EvalHarness\Metrics\MetricScore; class JaccardWordOverlapMetric implements Metric { public function name(): string { return 'jaccard-words'; } public function score(DatasetSample $sample, string $actualOutput): MetricScore { $expected = array_unique(preg_split('/\\s+/', strtolower((string) $sample->expectedOutput))); $actual = array_unique(preg_split('/\\s+/', strtolower($actualOutput))); $union = array_unique(array_merge($expected, $actual)); if ($union === []) { return new MetricScore(0.0); } $intersection = array_intersect($expected, $actual); return new MetricScore(count($intersection) / count($union)); } } // Wire it into a dataset: Eval::dataset('rag.recall') ->loadFromYaml(base_path('eval/golden/recall.yml')) ->withMetrics([new JaccardWordOverlapMetric(), 'exact-match']) ->register();
CI workflow snippet
# .github/workflows/eval-gate.yml - name: Run RAG regression gate env: EVAL_HARNESS_JUDGE_API_KEY: ${{ secrets.OPENAI_API_KEY }} run: | php artisan eval-harness:run rag.factuality.fy2026 \ --registrar="App\Console\EvalRegistrar" \ --json --out=eval-report.json - uses: actions/upload-artifact@v4 if: always() with: name: eval-report path: eval-report.json
Queue-backed batch execution
--batch=lazy-parallel dispatches one queue job per sample and then
assembles outputs in dataset order through the shared batch result
store. It requires the SUT to be a container-resolvable concrete
SampleRunner class that queue workers can resolve through the
Laravel container.
Constructor-injected object dependencies are supported when the worker
container can resolve an equivalent fresh runner. Arbitrary callables,
closures, anonymous runners, optional/defaulted constructor state,
scalar/array/null runner properties, and caller-specific object
configuration remain serial-only because queued jobs carry only the
runner class name.
use App\Eval\MyRagRunner; $this->app->bind('eval-harness.sut', MyRagRunner::class);
php artisan eval-harness:run rag.factuality.fy2026 \
--registrar="App\\Console\\EvalQueueRegistrar" \
--batch=lazy-parallel \
--concurrency=4 \
--queue=evals \
--timeout=60 \
--batch-timeout=300
Use Laravel's sync queue driver for unit tests. In production, run
Horizon workers on the chosen queue and set
EVAL_HARNESS_BATCH_CACHE_STORE to a cache backend shared by the
command process and workers so queued sample outputs can be collected
for report assembly. --concurrency is the lazy-parallel producer
fan-out cap and is also the default dispatch window size; pass
--chunk-size=N (must be <= --concurrency) for tighter backpressure.
The producer waits after each chunk completes before dispatching the
next, so when --chunk-size < --concurrency, chunk-size becomes the
effective in-flight limit per producer process, not concurrency.
Total worker pool demand scales with concurrent producers: if K eval
commands run at the same time, peak in-flight demand is roughly
chunk-size × K. Size Horizon worker pool capacity for the actual
peak (chunk-size × concurrent producers) — not just one producer's
chunk-size — or the queue will build a backlog. Worker counts
themselves are configured in Horizon.
--timeout is the per-sample job timeout; --batch-timeout caps the
producer's wait on each dispatch window. It bounds BOTH the dispatch
phase (including any producer-side --rate-limit pauses) AND the
result-collection phase: when the timeout fires before all queued
outputs land the command reports the missing samples; when dispatch
itself consumes the budget (for example because a low rate limit
throttles the producer) the command fails with an explicit
"chunk dispatch consumed the full ... wait timeout" diagnostic that
reports how many samples were still undispatched. Lower
--chunk-size, relax --rate-limit, or raise --batch-timeout to
fix it. --batch-timeout is a hard wall-clock cap only on real
queue drivers (Redis, database, beanstalk — the documented Horizon
path) where dispatch() returns immediately. On the sync queue
driver dispatch() executes the job inline, so an individual slow
sample can run arbitrarily longer than the chunk deadline before
control returns. The package only sets the queue job's $timeout
property — Laravel's queue workers honour it, but the sync driver
does NOT enforce it because there is no worker process. On sync,
neither --batch-timeout nor --timeout bounds per-sample runtime;
slow runners can take arbitrarily long. Use a real queue driver in
production for any wall-clock guarantee. Programmatic external dispatch() / collectOutputs()
flows can set BatchOptions::lazyParallel(resultTtlSeconds: ...) to
keep result metadata and sample outputs alive long enough for delayed
collection.
See docs/HORIZON_BATCH_QUEUES.md for
Horizon supervisor, cache-store, and timeout sizing guidance.
Operational profiles and backpressure
--batch-profile=ci|smoke|nightly applies a named operational preset
of batch defaults; explicit options always win, so profiles never lock
operators in. CI lanes get sane lazy-parallel defaults, smoke checks
stay serial, and nightly runs get throttled dispatch with checkpoints.
# CI gate: lazy-parallel, 4 concurrent samples, 30s job timeout, checkpoints every 25 samples. php artisan eval-harness:run rag.factuality.fy2026 \ --batch-profile=ci \ --queue=evals \ --json --out=evals/ci-rag.json # Nightly long run: 16 concurrent samples, throttled at 60 dispatches/60s. php artisan eval-harness:run rag.factuality.fy2026 \ --batch-profile=nightly \ --queue=evals-nightly \ --json --out=evals/nightly-rag.json
Backpressure flags work with any lazy-parallel profile or with
--batch=lazy-parallel:
--chunk-size=Nnarrows the producer dispatch window for tighter backpressure (defaults to--concurrency; must be<= --concurrency, since--concurrencyis the fan-out cap).--rate-limit=N --rate-window-seconds=Wthrottles producer dispatch to N samples per W-second rolling window.--checkpoint-every=Nemits structured progress events every N completed samples; bind a customBatchProgressReporterto forward them to logs or dashboards. Dashboards that need to distinguish a finished failed batch from a stalled one should implement the optionalBatchTerminalProgressReportersub-contract instead — it adds areportTerminal(...)callback with explicitsuccess/failure/emptystatus. See docs/HORIZON_BATCH_QUEUES.md for an example binding.
To clear an inherited numeric profile value for a one-off run without
redefining the profile, pass none (or null) on the corresponding
flag. For example, with --batch-profile=nightly (which sets
rate_limit=60, rate_window_seconds=60, checkpoint_every=100),
--rate-limit=none --checkpoint-every=none disables both for that
single invocation while keeping every other profile field. The same
sentinel works for --timeout, --batch-timeout, --chunk-size,
--rate-limit, --rate-window-seconds, --result-ttl-seconds, and
--checkpoint-every. --queue is excluded — queue names are
arbitrary strings, so override the profile in
eval-harness.batches.profiles.* config when an inherited queue must
be cleared.
Host apps can override or register additional profiles under
eval-harness.batches.profiles in config/eval-harness.php.
See docs/HORIZON_BATCH_QUEUES.md for
the full profile reference and Horizon tuning recipes.
Eval sets and resume manifests
Group registered datasets into an eval set when one CI or release gate needs to run several datasets in order. The returned manifest is stable JSON and can be stored by the host app between attempts; completed datasets are skipped when the manifest is passed back in.
use Eval; use Padosoft\EvalHarness\Batches\BatchOptions; use Padosoft\EvalHarness\EvalSets\EvalSetManifest; $manifestPath = storage_path('eval/release.rag.manifest.json'); $previousManifest = null; if (is_file($manifestPath)) { $manifestPayload = json_decode((string) file_get_contents($manifestPath), true, flags: JSON_THROW_ON_ERROR); $previousManifest = is_array($manifestPayload) ? EvalSetManifest::fromJson($manifestPayload) : null; } $evalSet = Eval::evalSet('release.rag', [ 'rag.factuality.fy2026', 'rag.refusals.fy2026', ]); $result = Eval::runEvalSet( $evalSet, app(App\Eval\MyRagRunner::class), BatchOptions::serial(), $previousManifest ?? null, ); file_put_contents( $manifestPath, json_encode($result->manifest->toJson(), JSON_PRETTY_PRINT | JSON_THROW_ON_ERROR), );
Real-world usage recipes
These are complete snippets you can paste into a Laravel app or CI job.
Recipe A — PR-safe RAG regression gate
Use one dataset, one artifact path, and a deterministic failure signal in CI:
cat > .github/workflows/eval-gate.yml <<'EOF' name: AI Regression Gate on: pull_request: paths: - 'app/**' - 'config/**' - 'eval/**' - 'resources/**' jobs: eval-gate: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Setup PHP uses: shivammathur/setup-php@v2 with: php-version: '8.3' tools: composer:v2 - uses: shivammathur/setup-node@v3 with: node-version: '20' - name: Install dependencies run: composer install --no-interaction --prefer-dist --no-progress - name: Run eval gate env: EVAL_HARNESS_JUDGE_API_KEY: ${{ secrets.OPENAI_API_KEY }} run: | php artisan eval-harness:run rag.factuality.fy2026 \ --registrar="App\\Console\\EvalRegistrar" \ --batch=lazy-parallel \ --concurrency=4 \ --queue=default \ --timeout=60 \ --batch-timeout=180 \ --json \ --out=eval-report.json - uses: actions/upload-artifact@v4 if: always() with: name: eval-report path: eval-report.json EOF
Recipe B — nightly safety baseline refresh with manifest regression gate
Run adversarial tests on a schedule, keep a bounded local manifest, and fail when behavior regresses:
php artisan eval-harness:adversarial \
--registrar="App\\Console\\EvalRegistrar" \
--category=prompt-injection \
--category=pii-leak \
--metric=refusal-quality \
--batch=lazy-parallel \
--concurrency=2 \
--queue=adversarial \
--manifest=storage/app/adversarial-runs.json \
--manifest-retain=10 \
--regression-gate \
--regression-max-drop=5 \
--regression-metric=refusal-quality:pass_rate \
--json \
--out=adversarial-nightly.json
In Laravel Scheduler, dispatch that command via the schedule:run pipeline
and rotate artifacts with normal deployment hygiene.
Recipe C — consume report API artifacts in your admin surface
Use the opt-in Laravel report API as a source of truth for charts and CSV export:
curl -sS \ -H "Accept: application/json" \ "$APP_URL/admin/eval-harness/api/reports" \ | jq '.artifacts[] | select(.id|test("factuality")) | {id, type, size}'
curl -sS \ "$APP_URL/admin/eval-harness/api/reports/<url-safe-id>/rows.csv" \ -H "Accept: text/csv" \ > eval-rows.csv
The examples above use stable API schema identifiers documented in
docs/REPORT_API_CONTRACT.md.
Web admin panel UI
This package stays headless, but it already has a polished companion web admin
panel at padosoft/eval-harness-admin.
Use that repository when you want a ready-made dashboard for browsing stored
reports, comparing regressions, inspecting adversarial manifests, and following
live batch progress through this package's read-only API contracts.
The UI package spec is documented in
docs/UI_PACKAGE_SPEC.md. It covers the intended
read-only admin screens: Dashboard, Reports list, Report detail, Compare,
Trend, Adversarial manifests, and Live batches.
Contract stability and migration
v1.0 marks the package contract baseline and introduces explicit guidance on
semantic compatibility:
- Versioned, additive report and dataset schemas (
eval-harness.report.v1,eval-harness.dataset.v1). - API payload versioning (
eval-harness.api.v1). - Queue manifest compatibility by report schema / dataset / metric set signatures.
- Stable command contracts for
eval-harness:runandeval-harness:adversarialwith additive extension.
Read full details before a major upgrade:
For end users upgrading from pre-1.0, also check the changelog and keep
composer pins strict to avoid surprise majors.
Configuration
config/eval-harness.php (after vendor:publish):
use Padosoft\EvalHarness\Support\RuntimeOptions; use Padosoft\EvalHarness\Support\TimeoutNormalizer; return [ 'metrics' => [ 'cosine_embedding' => [ 'endpoint' => env('EVAL_HARNESS_EMBEDDINGS_ENDPOINT', 'https://api.openai.com/v1/embeddings'), 'api_key' => env('EVAL_HARNESS_EMBEDDINGS_API_KEY', env('OPENAI_API_KEY', '')), 'model' => env('EVAL_HARNESS_EMBEDDINGS_MODEL', 'text-embedding-3-small'), 'timeout_seconds' => TimeoutNormalizer::normalize(env('EVAL_HARNESS_EMBEDDINGS_TIMEOUT'), 30), ], 'llm_as_judge' => [ 'endpoint' => env('EVAL_HARNESS_JUDGE_ENDPOINT', 'https://api.openai.com/v1/chat/completions'), 'api_key' => env('EVAL_HARNESS_JUDGE_API_KEY', env('OPENAI_API_KEY', '')), 'model' => env('EVAL_HARNESS_JUDGE_MODEL', 'gpt-4o-mini'), 'timeout_seconds' => TimeoutNormalizer::normalize(env('EVAL_HARNESS_JUDGE_TIMEOUT'), 60), 'prompt_template' => env('EVAL_HARNESS_JUDGE_PROMPT_TEMPLATE'), ], ], 'runtime' => [ 'raise_exceptions' => RuntimeOptions::normalizeBoolean(env('EVAL_HARNESS_RAISE_EXCEPTIONS'), false), 'provider_retry_attempts' => RuntimeOptions::normalizeNonNegativeInt(env('EVAL_HARNESS_PROVIDER_RETRY_ATTEMPTS'), 0), 'provider_retry_sleep_milliseconds' => RuntimeOptions::normalizeNonNegativeInt(env('EVAL_HARNESS_PROVIDER_RETRY_SLEEP_MS'), 100), ], 'reports' => [ 'disk' => env('EVAL_HARNESS_REPORTS_DISK', 'local'), 'path_prefix' => env('EVAL_HARNESS_REPORTS_PATH', 'eval-harness/reports'), ], 'batches' => [ 'lazy_parallel' => [ 'cache_store' => env('EVAL_HARNESS_BATCH_CACHE_STORE'), 'result_ttl_seconds' => TimeoutNormalizer::normalize(env('EVAL_HARNESS_BATCH_RESULT_TTL'), 3600), 'wait_timeout_seconds' => TimeoutNormalizer::normalize(env('EVAL_HARNESS_BATCH_WAIT_TIMEOUT'), 60), ], ], ];
Pointing at a non-OpenAI provider
# OpenRouter EVAL_HARNESS_JUDGE_ENDPOINT=https://openrouter.ai/api/v1/chat/completions EVAL_HARNESS_JUDGE_API_KEY=or-your-key EVAL_HARNESS_JUDGE_MODEL=anthropic/claude-3.5-sonnet # Regolo (Italian sovereign infra) EVAL_HARNESS_JUDGE_ENDPOINT=https://api.regolo.ai/v1/chat/completions EVAL_HARNESS_JUDGE_API_KEY=rgl-your-key EVAL_HARNESS_JUDGE_MODEL=mistral-large
The embedding-backed metrics (cosine-embedding, bertscore-like)
use the same OpenAI-compatible embeddings endpoint (data[].embedding).
Most providers already implement that contract. Host apps can also bind
Padosoft\EvalHarness\Contracts\EmbeddingClient to route embeddings
through Laravel AI or deterministic fakes.
The judge-backed metrics (llm-as-judge, refusal-quality) share the
same chat-completions settings. refusal-quality requires each sample
to declare metadata.refusal_expected: true|false so safety/refusal
behavior is explicit in the dataset contract.
Adversarial safety/regression seeds are opt-in. Build and register them programmatically when a host app wants a red-team lane:
use Padosoft\EvalHarness\Adversarial\AdversarialCategory; use Padosoft\EvalHarness\Adversarial\AdversarialDatasetFactory; $factory = app(AdversarialDatasetFactory::class); $dataset = $factory->build(categories: [ AdversarialCategory::PromptInjection, 'pii-leak', 'ssrf', ]); app(\Padosoft\EvalHarness\EvalEngine::class)->registerDataset($dataset);
Or run the built-in red-team lane directly from Artisan:
php artisan eval-harness:adversarial \
--registrar="App\Console\EvalRegistrar" \
--category=prompt-injection \
--category=pii-leak \
--metric=refusal-quality \
--manifest=storage/eval/adversarial-runs.json \
--manifest-retain=10 \
--regression-gate \
--regression-max-drop=5 \
--regression-metric=refusal-quality:mean \
--json --out=adversarial.json
eval:adversarial is available as a short alias. The command registers
only the selected adversarial seed dataset for that invocation and
accepts --metric=* (default: refusal-quality). Two scoring modes
are supported: pass --outputs=<path> to score precomputed responses
(this path bypasses the batch contract entirely and goes straight to
scoreOutputs(), so --batch, --batch-profile, --rate-limit, and
the other dispatch flags are ignored when --outputs is set);
otherwise the SUT is invoked through the same batch contract as
eval-harness:run — --batch, --batch-profile, --concurrency,
--queue, --timeout, --batch-timeout, --chunk-size,
--rate-limit, --rate-window-seconds, --result-ttl-seconds, and
--checkpoint-every. Add
--manifest=<path> to update a
local JSON run-history manifest and --manifest-retain=N to keep a
bounded set of adversarial summaries: the newest N summaries plus any
additional failure-free baselines needed for compatible report schema,
dataset, metric names, and adversarial category/sample-count slices. Size
--manifest-retain for the number of distinct report schema, dataset,
metric, category, and sample-count slices you run, because each slice may
need its own clean baseline.
Add --regression-gate to compare the current run with the latest
compatible failure-free existing manifest entry (same report schema,
dataset, metric names, and adversarial category/sample-count slice)
before the current run is recorded. --regression-max-drop=5 means five
normalized percentage points. Repeat --regression-metric=metric or
--regression-metric=metric:mean|p50|p95|pass_rate for additional
metric aggregate checks. Configured regression metrics must exist in the
current run; missing current aggregates fail closed even when no baseline
exists yet. If no compatible failure-free baseline exists after that
validation, the command emits an explicit missing-baseline status. Runs
can advance the next compatible baseline whenever they are written to the
manifest and failure-free, including plain --manifest writes without
--regression-gate. Gated runs are recorded only when they are
failure-free and do not fail configured gate checks; metric failures and
gate failures are left out so they cannot seed broken baselines.
Add --promote-failures=eval/adversarial-failures.yml to export
low-scoring samples and samples with metric exceptions into a reloadable
dataset YAML seed. Use --promoted-dataset=adversarial.security.failures
to control the dataset name inside that YAML; otherwise it defaults to
<dataset>.failures. Promotion preserves the original sample input,
expected output, and metadata, adds
metadata.eval_harness.promoted_failure with the source dataset and
failed metric names, and intentionally omits actual model output and raw
provider error messages from the seed. If no samples fail, an existing
promotion file at that path is removed so fixed CI artifact paths do not
keep stale failure seeds.
The default factory covers 10 categories: prompt injection, jailbreak,
tool abuse, PII leak, SSRF, SQL/shell injection, ASCII smuggling,
competitor endorsement, excessive agency, and hallucination
overreliance. Samples include metadata.tags, metadata.adversarial,
metadata.refusal_expected, and metadata.refusal_policy so they can be
scored with refusal-quality and grouped in JSON/Markdown reports.
Reports expose a safe normalized adversarial subset only: category,
label, severity, and compliance frameworks. Raw prompts, refusal policy
text, and arbitrary sample metadata stay out of JSON sample rows.
The top-level JSON adversarial block aggregates category metrics and
framework counts for OWASP LLM, NIST AI RMF, and EU AI Act style
security reporting; Markdown reports render the same data under
Adversarial coverage. Manifest files use
eval-harness.adversarial-runs.v1, store stable metric aggregates plus
the safe adversarial summary, serialize command updates with a lock file,
and write through a temporary file before replacing the target path.
For recurring safety checks, see
docs/ADVERSARIAL_CONTINUOUS_MONITORING.md.
It shows how to run the adversarial lane from Laravel Scheduler or CI cron
with persistent manifests, Horizon-backed queues, regression gates, and
failure promotion while keeping this package daemon-free.
Provider retries are opt-in. EVAL_HARNESS_PROVIDER_RETRY_ATTEMPTS=2
means two extra attempts after the initial request, with
EVAL_HARNESS_PROVIDER_RETRY_SLEEP_MS between attempts. Retries apply
only to Laravel HTTP connection failures, HTTP 429, and 5xx responses.
Malformed successful responses still fail closed. By default, metric
failures are captured in the report; set
EVAL_HARNESS_RAISE_EXCEPTIONS=true when a strict CI lane should abort
on the first MetricException provider/metric contract error.
For stricter RAG groundedness, citation-groundedness accepts
metadata.citation_evidence:
metadata: citation_evidence: - citation: "[policy:refunds]" quote: "Refunds are available within 30 days."
Each evidence span scores only when the actual output contains both the citation marker and the quoted evidence text. Report details expose counts only, not raw citation or quote strings.
Metrics that expose provider usage can add a structured usage detail:
new MetricScore(1.0, [ 'usage' => [ 'prompt_tokens' => 120, 'completion_tokens' => 40, 'total_tokens' => 160, 'cost_usd' => 0.0024, 'latency_ms' => 850, ], ]);
The renderers aggregate those values into top-level JSON and Markdown
usage summaries while leaving raw prompts/provider payloads out of the
report contract. JSON summaries include per-field reported counts so
consumers can distinguish "not reported" from a reported zero; Markdown
renders unreported usage totals as n/a. Provider usage attached to a
captured metric failure is still included in the aggregate summary so
malformed judge/embedding responses do not hide token or latency spend.
The built-in OpenAI-compatible embedding and judge clients automatically
attach safe usage details to their metric scores: token/cost fields from
the provider's usage object when present, including provider-reported
latency_ms when the backend returns it. They do not synthesize local
wall-clock latency, keeping reports diff-friendly across repeated runs.
Architecture
┌──────────────────────────────────────────────────────────────────┐
│ EvalCommand / AdversarialCommand │
│ └─► php artisan eval-harness:run / eval-harness:adversarial │
│ └─► resolve registrar, dataset, callable/SampleRunner SUT │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ EvalEngine │
│ - dataset registry (in-memory, single source of truth) │
│ - run(dataset, sut) / runBatch(dataset, sut, BatchOptions) │
│ ├─► dispatch samples through SerialBatch or LazyParallelBatch│
│ ├─► invoke input callable or SampleInvocation callable/runner│
│ ├─► lazy-parallel jobs write outputs to BatchResultStore │
│ ├─► for each metric: score(sample, actual) │
│ │ - exception → SampleFailure │
│ │ - clean → MetricScore │
│ └─► assemble EvalReport │
└──────────────────────────────────────────────────────────────────┘
│ │
▼ ▼
┌────────────────────────────┐ ┌────────────────────────────┐
│ Metrics │ │ Reports │
│ - ExactMatchMetric │ │ - EvalReport │
│ - CosineEmbeddingMetric │ │ - MarkdownReportRenderer │
│ - BertScoreLikeMetric │ │ - JsonReportRenderer │
│ - LlmAsJudgeMetric │ │ - cohorts + histograms │
│ - RefusalQualityMetric │ │ - macroF1, p50, p95, mean │
└────────────────────────────┘ └────────────────────────────┘
Pluggable metric registry (R23)
MetricResolver accepts:
- A
Metricinstance (full control). - An FQCN string (resolved through the container).
- A built-in alias:
exact-match,contains,regex,rouge-l,citation-groundedness,cosine-embedding,bertscore-like,llm-as-judge,refusal-quality.
Every resolved class is asserted to implement Metric so a typo'd
FQCN fails with a clear error instead of producing a runtime
"method does not exist".
Standalone-agnostic guarantee
tests/Architecture/StandaloneAgnosticTest.php walks src/ and
fails the build if it finds a substring referring to:
- AskMyDocs internal symbols (
KnowledgeDocument,kb_nodes, etc.). - Sibling Padosoft packages (
padosoft/laravel-flow,padosoft/laravel-patent-box-tracker, etc.).
The package is consumed by AskMyDocs / patent-box-tracker / others
but never depends on them. Your composer require works exactly the
same whether you use AskMyDocs or not.
🚀 AI vibe-coding pack included
Every Padosoft package ships with a .claude/ pack that primes
Claude Code (or any compatible AI
coding agent) on the conventions used across the Padosoft repos:
- Skills — auto-loaded when context matches (e.g. opening a
Pull Request triggers
copilot-pr-review-loop, editing tests triggerstest-actually-tests-what-it-claims). - Rules — domain-specific guardrails (Laravel naming, query optimisation, exception handling, frontend testability).
- Agents — specialised sub-agents for reviewing PRs, investigating CI failures, anticipating Copilot review findings.
- Pattern adoption — the COMPANY-PACK + COMPANY-INVENTORY + PATTERN-ADOPTION docs explain how the conventions evolved across PRs so an AI agent can match the prevailing style on day one.
Install Claude Code, open the repo, and your agent immediately knows the house style. No prompt engineering required.
Testing
The package ships three PHPUnit testsuites:
# Default — offline, fast (1-2s) vendor/bin/phpunit --testsuite Unit # File-system invariants (standalone-agnostic) vendor/bin/phpunit --testsuite Architecture # Opt-in live test against a real LLM provider EVAL_HARNESS_LIVE_API_KEY=sk-... vendor/bin/phpunit --testsuite Live
The Unit and Architecture suites are run by CI on every PR
across the PHP 8.3 / 8.4 / 8.5 x Laravel 12 / 13 matrix. The Live
suite is never run by CI — invoke it explicitly when you want
to verify wire compatibility with a new provider or model.
Live suite contract
Every test in tests/Live/ calls markTestSkipped(...) from
setUp() when EVAL_HARNESS_LIVE_API_KEY is empty. A contributor
running the default phpunit invocation never trips the live path
accidentally and never burns API credits.
Roadmap
v1.1.0 — Enterprise operations and scalability add-on (current)
Released 2026-05-05 (release notes). Builds on v1.0 with the v1.x enterprise operations and scalability add-on. All v1 contracts are preserved.
- Operational batch profiles —
--batch-profile=ci|smoke|nightlywith explicit-CLI-wins precedence and host-app overrides undereval-harness.batches.profiles.*. - Producer-side backpressure —
--chunk-size,--rate-limit,--rate-window-seconds, and--result-ttl-secondsflags backed byRateLimitWindow(monotonic clock, amortized O(1) sliding-window math) and the producer-side throttle inLazyParallelBatch. Passnoneon any nullable numeric flag to clear an inherited profile value for a one-off run. - Progress checkpoints —
--checkpoint-everyplus an optionalBatchProgressReportercontainer binding (defaultNullBatchProgressReporter). - Optional terminal status reporting —
BatchTerminalProgressReportersub-contract with explicitsuccess/failure/emptystatus and partial-wins tolerance on the failure path, so dashboards can distinguish a finished failed batch from a stalled one. - Operator-facing Horizon supervisor sizing guidance in
docs/HORIZON_BATCH_QUEUES.md(multi-producer caveat, chunk-size vs concurrency, rate-limit producer-side behavior).
No hard Horizon dependency added; the package still uses sync queues
and queue fakes in tests.
v1.0 (roadmap complete)
- Core feature set complete — cohort metrics, report histograms,
batch execution (
SerialBatch,LazyParallelBatch,--batch=serial,--batch=lazy-parallel), eval sets with resumable progress, standalone output assertions, additional built-in metrics, usage summaries, and runtime guardrails are fully implemented. - Adversarial lane complete — opt-in adversarial datasets
(including multi-input samples),
eval-harness:adversarial, compliance summaries, manifest retention,--regression-gate, and--promote-failuresare implemented and tested. - API contract complete — read-only report API routes/resources for listing/shows, cohort and histogram views, row CSV export, and report artifact download are implemented and documented.
- Stability and release complete — contract stability documents,
migration path from pre-1.0, and release guardrails (
Metric,EvalReport, JSON report shape, queue jobs, commands, and API resources) are now in place. - Roadmap status: all planned Macro Task items (0 through 8) have been completed; no roadmap placeholders remain.
Contributing
See CONTRIBUTING.md. PRs follow the Padosoft 9-step canonical flow (R36): branch + local green + open PR with Copilot reviewer + wait for CI + wait for review + fix + repeat until zero must-fix comments + green CI + merge.
Security
If you discover a security vulnerability, see SECURITY.md — please do not open a public issue.
License
Apache-2.0. See LICENSE.
Copyright © 2026 Padosoft — Lorenzo Padovani.

