aysnc / llm-eval
Simple LLM evaluation framework for PHP.
Installs: 2
Dependents: 0
Suggesters: 0
Security: 0
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
pkg:composer/aysnc/llm-eval
Requires
- php: ^8.3
- guzzlehttp/guzzle: ^7.10
- symfony/console: ^7.4
Requires (Dev)
- aws/aws-sdk-php: ^3.0
- friendsofphp/php-cs-fixer: ^3.92
- phpstan/phpstan: ^2.1
- phpunit/phpunit: ^11.0
Suggests
- aws/aws-sdk-php: Required for BedrockProvider to use Claude via AWS Bedrock (^3.0)
This package is not auto-updated.
Last update: 2026-02-11 12:33:30 UTC
README
A PHP package for evaluating LLM outputs. Test your prompts, validate responses, and ensure your AI features work correctly.
Installation
composer require aysnc/llm-eval
Configuration
Create llm-eval.php in your project root:
<?php

use Aysnc\AI\LlmEval\Providers\AnthropicProvider;

return [
    'provider' => new AnthropicProvider(getenv('ANTHROPIC_API_KEY')),
    'directory' => __DIR__ . '/evals',
    'cache' => true,
    'cacheTtl' => 0,
    'parallel' => false,
    'concurrency' => 5,
];
| Option | Type | Default | Description |
|---|---|---|---|
| `provider` | `ProviderInterface` | — | The LLM provider shared across all eval files |
| `directory` | `string` | `'evals'` | Directory containing your eval files |
| `cache` | `bool\|string` | `false` | `true` uses `.llm-cache/`, or pass a custom path |
| `cacheTtl` | `int` | `0` | Cache lifetime in seconds (`0` = forever) |
| `parallel` | `bool` | `false` | Run evals in parallel by default |
| `concurrency` | `int` | `0` | Max concurrent requests when parallel (`0` = unlimited) |
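For example, a config that caches responses in a custom directory for an hour and runs evals in parallel could look like the sketch below; the cache path, TTL, and concurrency values are illustrative choices, not defaults.

<?php

use Aysnc\AI\LlmEval\Providers\AnthropicProvider;

return [
    'provider' => new AnthropicProvider(getenv('ANTHROPIC_API_KEY')),
    'directory' => __DIR__ . '/evals',
    'cache' => __DIR__ . '/.llm-cache',  // custom cache path instead of `true`
    'cacheTtl' => 3600,                  // expire cached responses after one hour
    'parallel' => true,                  // run evals in parallel by default
    'concurrency' => 10,                 // cap concurrent requests
];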
Quick Start
1. Create an eval file in your evals directory. Each file returns an LlmEval instance:
<?php
// evals/simple.php

use Aysnc\AI\LlmEval\Dataset\Dataset;
use Aysnc\AI\LlmEval\LlmEval;

$dataset = Dataset::fromArray([
    ['prompt' => 'What is 2+2? Reply with just the number.', 'expected' => '4'],
    ['prompt' => 'What is the capital of France? Reply with just the city name.', 'expected' => 'Paris'],
    ['prompt' => 'Is the sky blue? Reply with just yes or no.', 'expected' => 'yes'],
]);

return LlmEval::create('quick-start')
    ->dataset($dataset)
    ->assertions(function ($expect, $testCase): void {
        $expect->contains($testCase->getExpected(), caseSensitive: false);
    });
2. Run it:
vendor/bin/llm-eval run
LLM-Eval Runner
===============
PASS quick-start - Case 0
PASS quick-start - Case 1
PASS quick-start - Case 2
Summary
-------
Total 3
Passed 3
Failed 0
Pass Rate 100.0%
Duration 1.24s
The config supplies the LLM provider and the eval file defines what to test; no ->provider() or ->runAll() calls are needed in the eval file itself.
Core Concepts
An evaluation has three parts: a provider (which LLM to call), a dataset (prompts + expected answers), and assertions (how to check the response).
Datasets
A dataset is a collection of test cases. Each test case has a prompt and an optional expected value.
// Inline array
$dataset = Dataset::fromArray([
    ['prompt' => 'What is 2+2?', 'expected' => '4'],
]);

// CSV file (columns: prompt, expected)
$dataset = Dataset::fromCsv(__DIR__ . '/data/capitals.csv');

// JSON file (array of objects with prompt + expected keys)
$dataset = Dataset::fromJson(__DIR__ . '/data/questions.json');
The expected key can be a single value or multiple named values:
// Single — accessed via $testCase->getExpected()
['prompt' => 'What is 2+2?', 'expected' => '4']

// Multiple — accessed via $testCase->getExpected('name'), $testCase->getExpected('age')
['prompt' => 'Return JSON with name and age.', 'expected' => ['name' => 'Alice', 'age' => '30']]
CSV files use column prefixes for multiple values: expected_name, expected_age.
Any keys that aren't prompt or expected become metadata, accessible via $testCase->getData('key').
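As a sketch, a dataset row with an extra key and an assertions callback that reads it back via getData(); the 'category' key and the length check are illustrative, not part of the package's API.

$dataset = Dataset::fromArray([
    [
        'prompt' => 'What is the capital of France? Reply with just the city name.',
        'expected' => 'Paris',
        'category' => 'geography', // neither prompt nor expected, so it becomes metadata
    ],
]);

return LlmEval::create('metadata-example')
    ->dataset($dataset)
    ->assertions(function ($expect, $testCase): void {
        $expect->contains($testCase->getExpected());

        // Read metadata back with getData()
        if ($testCase->getData('category') === 'geography') {
            $expect->maxLength(50);
        }
    });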
Assertions
Assertions define what "correct" means for a response. You chain them inside the assertions() callback.
Text
$expect->contains('Paris');
$expect->contains('paris', caseSensitive: false);
$expect->notContains('London');
$expect->matchesRegex('/\d{4}-\d{2}-\d{2}/');
$expect->minLength(10);
$expect->maxLength(500);
JSON
$expect->isJson();
Custom
$expect->assert(new MyCustomAssertion());
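The custom assertion contract is not documented in this README, so the following is only a sketch. It assumes a hypothetical AssertionInterface with assert() and failureMessage() methods; check the package source for the real interface before relying on these names.

// Hypothetical interface name and namespace — verify against the package source.
use Aysnc\AI\LlmEval\Assertions\AssertionInterface;

final class ContainsNoApologyAssertion implements AssertionInterface
{
    public function assert(string $response): bool
    {
        // Fail if the model apologizes instead of answering.
        return stripos($response, 'sorry') === false;
    }

    public function failureMessage(): string
    {
        return 'Response contains an apology';
    }
}

// Used inside the assertions() callback:
$expect->assert(new ContainsNoApologyAssertion());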
There are also assertions for tool calls, multi-turn conversations, and LLM-as-judge — covered in the sections below.
Testing Scenarios
Structured Output
Validate that the LLM returns well-formed JSON with the right content. Combine isJson() with contains() or multiple expected values.
// evals/json-output.php
$dataset = Dataset::fromArray([
    [
        'prompt' => 'Return a JSON object with keys "name" and "age". Use name "Alice" and age 30. Only output JSON.',
        'expected' => ['name' => 'Alice', 'age' => '30'],
    ],
    [
        'prompt' => 'Return a JSON array of three colors: red, green, blue. Only output JSON.',
        'expected' => 'red',
    ],
]);

return LlmEval::create('json-output')
    ->dataset($dataset)
    ->assertions(function ($expect, $testCase): void {
        $expect->isJson()
            ->contains($testCase->getExpected())
            ->contains($testCase->getExpected('name'));
    });
Tool Call Testing
Test that your LLM calls tools with the right parameters — without executing a full conversation loop. This uses LlmEval::create() (not createConversation) since you're only checking the first response.
// evals/tool-test.php
$tools = [
    [
        'name' => 'get_weather',
        'description' => 'Get weather for a location',
        'input_schema' => [
            'type' => 'object',
            'properties' => [
                'location' => ['type' => 'string'],
            ],
            'required' => ['location'],
        ],
    ],
];

return LlmEval::create('tool-test')
    ->option('tools', $tools)
    ->dataset($dataset)
    ->assertions(function ($expect): void {
        $expect->calledTool('get_weather');
        $expect->toolCallHasParam('get_weather', 'location', 'Paris');
    });
Available tool call assertions:
$expect->calledTool('get_weather');
$expect->calledTool('get_weather', times: 2);
$expect->toolCallHasParam('get_weather', 'location');
$expect->toolCallHasParam('get_weather', 'location', 'Paris');
$expect->calledToolCount(3);
$expect->didNotCallTool('dangerous_function');
Multi-Turn Conversations
Test agentic workflows where the LLM calls tools, receives results, and continues reasoning. Use LlmEval::createConversation() with a tool executor that returns simulated results.
// evals/math-agent.php
use Aysnc\AI\LlmEval\Dataset\Dataset;
use Aysnc\AI\LlmEval\LlmEval;
use Aysnc\AI\LlmEval\Providers\CallableToolExecutor;
use Aysnc\AI\LlmEval\Providers\ToolCall;
use Aysnc\AI\LlmEval\Providers\ToolResult;

$tools = [
    [
        'name' => 'calculate',
        'description' => 'Evaluate a math expression',
        'input_schema' => [
            'type' => 'object',
            'properties' => [
                'expression' => ['type' => 'string'],
            ],
            'required' => ['expression'],
        ],
    ],
];

$executor = new CallableToolExecutor([
    'calculate' => function (ToolCall $tc): ToolResult {
        $expr = $tc->getParam('expression');
        $result = match ($expr) {
            '6 * 7', '6*7' => '42',
            default => 'unknown',
        };

        return new ToolResult($tc->id, $result);
    },
]);

$dataset = Dataset::fromArray([
    ['prompt' => 'Use the calculate tool to compute 6 * 7.', 'expected' => '42'],
]);

return LlmEval::createConversation('math-agent')
    ->withTools($tools)
    ->executor($executor)
    ->dataset($dataset)
    ->assertions(function ($expect, $testCase): void {
        $expect->contains($testCase->getExpected())
            ->usedTool('calculate')
            ->turnCount(2);
    });
Available conversation assertions:
$expect->turnCount(2);
$expect->usedTool('calculate');
$expect->conversationContains('42');
Multi-Turn Datasets
Use a turns array to test conversations with multiple user messages. Each turn has its own prompt and optional expected values for per-turn assertions. Use getTurn() to access the 1-indexed turn number.
$dataset = Dataset::fromArray([
    [
        'turns' => [
            ['prompt' => 'What is the weather in Paris?', 'expected' => '22'],
            ['prompt' => 'Now check Tokyo', 'expected' => '18'],
            ['prompt' => 'Which city was warmer?', 'expected' => 'Paris'],
        ],
    ],
]);

return LlmEval::createConversation('multi-turn')
    ->withTools($tools)
    ->executor($executor)
    ->dataset($dataset)
    ->assertions(function ($expect, $testCase): void {
        $expect->contains($testCase->getExpected());

        if ($testCase->getTurn() <= 2) {
            $expect->usedTool('get_weather');
        }
    });
LLM-as-Judge
Use one LLM to evaluate another's response quality. Instead of checking for exact strings, you describe what "good" looks like and a judge model scores the response 0-100%.
// evals/quality-check.php
use Aysnc\AI\LlmEval\Providers\AnthropicProvider;

$judge = new AnthropicProvider(getenv('ANTHROPIC_API_KEY'));

return LlmEval::create('quality-check')
    ->dataset($dataset)
    ->assertions(function ($expect) use ($judge): void {
        $expect->judgedBy(
            judge: $judge,
            criteria: 'Is this response helpful, accurate, and concise?',
            threshold: 0.8,
        );
    });
Judging Conversations
For multi-turn conversations, you can use judgedBy() inside assertions() to judge per-turn, or use ->judge() on the eval to run a single evaluation after all turns complete. The judge receives the full conversation history — all messages, tool calls, and results.
return LlmEval::createConversation('multi-turn')
    ->withTools($tools)
    ->executor($executor)
    ->dataset($dataset)
    ->assertions(function ($expect, $testCase): void {
        $expect->contains($testCase->getExpected());
    })
    ->judge($judge, 'Did the model correctly identify the warmer city based on the earlier temperatures?');
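A per-turn variant of the same idea is sketched below, using judgedBy() inside the assertions callback instead of ->judge(); the criteria text and threshold are illustrative.

return LlmEval::createConversation('multi-turn')
    ->withTools($tools)
    ->executor($executor)
    ->dataset($dataset)
    ->assertions(function ($expect, $testCase) use ($judge): void {
        // Judge each turn's response individually rather than the whole conversation.
        $expect->judgedBy(
            judge: $judge,
            criteria: 'Does this turn answer the user question using the tool results?',
            threshold: 0.7,
        );
    });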
CLI Runner
# Run all eval files in the evals directory
vendor/bin/llm-eval run

# Run a specific eval file
vendor/bin/llm-eval run my-test

# Run in parallel
vendor/bin/llm-eval run --parallel --concurrency=10

# Verbose mode — shows judge reasoning and tool calls for passing tests
vendor/bin/llm-eval run -v

# JSON output
vendor/bin/llm-eval run --format=json

# Clear response cache
vendor/bin/llm-eval cache:clear

# Scaffold a new eval file
vendor/bin/llm-eval init
Output
LLM-Eval Runner
===============
Running evaluations...
PASS simple - Case 0
PASS simple - Case 1
FAIL simple - Case 2
Got: "The sky appears blue due to Rayleigh scattering..."
→ Text does not contain "yes"
PASS conversation-json - compare-two-cities - Turn 1
PASS conversation-json - compare-two-cities - Turn 2
PASS conversation-json - compare-two-cities - Turn 3
→ Score: 100% (threshold: 70%) - The response correctly identifies Paris as the warmer city.
PASS llm-judge - photosynthesis
→ Score: 95% (threshold: 70%) - Clear, accurate explanation mentioning plants and sunlight.
Summary
-------
Total 7
Passed 6
Failed 1
Pass Rate 85.7%
Duration 4.32s
With -v, passing tests also show judge scores and tool call details.
Providers
Anthropic Claude
Direct API access. Get your key at console.anthropic.com.
$provider = new AnthropicProvider(
    apiKey: getenv('ANTHROPIC_API_KEY'),
);
Default model: claude-sonnet-4-20250514
AWS Bedrock
Uses the Converse API — works with Claude, Titan, Llama, Mistral, and other Bedrock models. Requires composer require aws/aws-sdk-php. See AWS Bedrock docs.
use Aysnc\AI\LlmEval\Providers\BedrockProvider;

// Explicit credentials
$provider = new BedrockProvider(
    region: 'us-east-1',
    accessKeyId: 'AKIA...',
    secretAccessKey: 'secret...',
);

// Or default credential chain (env vars, ~/.aws/credentials, IAM role)
$provider = new BedrockProvider(region: 'us-east-1');
Default model: anthropic.claude-3-5-sonnet-20241022-v2:0
Changing the Model
Use ->model() to override the default model for any provider:
return LlmEval::create('eval-name')
    ->model('claude-opus-4-20250514')
    ->dataset($dataset)
    ->assertions($assertions);
This works with both AnthropicProvider (Anthropic model IDs) and BedrockProvider (Bedrock model IDs).
You can also set ->maxTokens(2048) to override the default max tokens (1024).
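For example, both overrides can be chained on the same eval (the eval name here is illustrative):

return LlmEval::create('long-answers')
    ->model('claude-opus-4-20250514') // Anthropic or Bedrock model ID, depending on the provider
    ->maxTokens(2048)                 // raise the default 1024-token limit
    ->dataset($dataset)
    ->assertions($assertions);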
Programmatic API
If you need to run evals from PHP code — inside a test suite, a CI script, or anywhere you want to work with the results directly — use ->provider() and ->runAll():
$provider = new AnthropicProvider(getenv('ANTHROPIC_API_KEY'));

$results = LlmEval::create('quick-start')
    ->provider($provider)
    ->dataset($dataset)
    ->assertions(function ($expect, $testCase): void {
        $expect->contains($testCase->getExpected());
    })
    ->runAll();

echo "Pass rate: {$results->passRatePercent()}\n"; // Pass rate: 100.0%
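As a sketch of the test-suite use case, the same call chain can drive a PHPUnit assertion; only passRatePercent() is used here since no other result methods are documented, and its return format is inferred from the echo example above.

use Aysnc\AI\LlmEval\Dataset\Dataset;
use Aysnc\AI\LlmEval\LlmEval;
use Aysnc\AI\LlmEval\Providers\AnthropicProvider;
use PHPUnit\Framework\TestCase;

final class PromptEvalTest extends TestCase
{
    public function testQuickStartEvalPasses(): void
    {
        $provider = new AnthropicProvider(getenv('ANTHROPIC_API_KEY'));

        $results = LlmEval::create('quick-start')
            ->provider($provider)
            ->dataset(Dataset::fromArray([
                ['prompt' => 'What is 2+2? Reply with just the number.', 'expected' => '4'],
            ]))
            ->assertions(function ($expect, $testCase): void {
                $expect->contains($testCase->getExpected());
            })
            ->runAll();

        // Format inferred from the echo example above ("Pass rate: 100.0%").
        $this->assertSame('100.0%', $results->passRatePercent());
    }
}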
Requirements
- PHP 8.3+
- guzzlehttp/guzzle ^7.10
- aws/aws-sdk-php ^3.0 (optional, for Bedrock)