vojtech-dobes / php-grammar-processing
Library for tokenization, abstract syntax tree parsing & interpretation
Installs: 1 072
Dependents: 4
Suggesters: 0
Security: 0
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
pkg:composer/vojtech-dobes/php-grammar-processing
Requires
- php: ~8.4
- ext-pcre: *
Requires (Dev)
- nette/tester: ^2.5.4
- phpstan/phpstan: ^2.1.12
- phpstan/phpstan-strict-rules: ^2.0.4
- spaze/phpstan-disallowed-calls: ^4.5.0
- tracy/tracy: ^2.10.9
- vojtech-dobes/php-codestyle: ~0.2.0
This package is auto-updated.
Last update: 2025-10-09 08:56:51 UTC
README
This library provides convenient APIs for describing a context-free grammar, and subsequently using it to build an abstract syntax tree and analyze it for an arbitrary purpose.
Installation
To install the latest version, run the following command:
composer require vojtech-dobes/php-grammar-processing
Minimum supported PHP version is 8.4.
Documentation
Note
All code examples assume classes are imported from Vojtechdobes\GrammarProcessing namespace.
Let's say in variable $source we have source text of a language that we want to process. For example, we want to evaluate a mathematical expression:
$source = '2 + 5';
Processing of the source text happens in few phases:
$$Tokenization \to Parsing \to Interpretation$$
- In Tokenization phase, source text is split into basic small units called lexical tokens.
- In Parsing phase, abstract syntax tree is constructed from these lexical tokens.
- In Interpretation phase, abstract syntax tree is evaluated to a final arbitrary result that can be used further.
To carry out Tokenization and Parsing phases, we first have to define grammar of the language.
Grammar definition
Grammar is defined by creating Grammar object.
$grammar = new Grammar( // Tokenization lexicalSymbols: $lexicalSymbols, syntaxTokenSymbols: $syntaxTokenSymbols, ignoredTokenSymbols: $ignoredTokenSymbols, // Parsing syntacticSymbols: $syntacticSymbols, );
-
$lexicalSymbolsarray<string, Vocabulary\Symbol>Lexical symbols represent production rules for the source text. Production rule is provided as implementation of
Vocabulary\Symbolinterface. Example:$lexicalSymbols = [ 'Digit' => new Vocabulary\Regexp('[0-9]'), 'Integer' => new Vocabulary\Sequence([ new Vocabulary\Repeat(new Vocabulary\Nonterminal('Digit'), min: 1, max: null), new Vocabulary\NegativeLookahead(new Vocabulary\Nonterminal('Digit')), ]), 'Operator' => new Vocabulary\OneOf([ new Vocabulary\Literal('+'), new Vocabulary\Literal('-'), ]), 'Whitespace' => new Vocabulary\Regexp('\s'), ];
All available implementations of
Vocabulary\Symbolare listed below. ThroughVocabulary\Nonterminalinstance lexical symbol can be based on other lexical symbols. -
$syntaxTokenSymbolslist<string>From lexical symbols we select those that will become constituent items of the resulting token stream. Every part of the source text (except for ignored parts) must fit into one of syntax token symbols, otherwise
UnexpectedTokenExceptionwill be thrown.$syntaxTokenSymbols = ['Integer', 'Operator'];
-
list<string>Not all parts of the source code are relevant in later stages, e.g. arbitrary whitespace or comments can be often discarded. We can list lexical symbols that cover such parts of the source code:
$ignoredTokenSymbols = ['Whitespace'];
-
$syntacticSymbolsarray<string, Vocabulary\Symbol>Every syntactic symbol represents a syntactically meaningful sequence of lexical tokens. Like
$lexicalSymbols, this parameter is a map where key is name of a symbol and value is production rule in form ofVocabulary\Symbolinstance. ThroughVocabulary\Nonterminalinstance syntactic symbol can contain both lexical symbols and other syntactic symbols, thus giving rise to a tree structure. Example:$syntacticSymbols = [ 'BinaryOperation' => new Vocabulary\OneOf([ new Vocabulary\Nonterminal('Addition'), new Vocabulary\Nonterminal('Subtraction'), ]), 'Addition' => new Vocabulary\Sequence([ new Vocabulary\Nonterminal('Integer'), new Vocabulary\Literal('+'), new Vocabulary\Nonterminal('Integer'), ]), 'Subtraction' => new Vocabulary\Sequence([ new Vocabulary\Nonterminal('Integer'), new Vocabulary\Literal('-'), new Vocabulary\Nonterminal('Integer'), ]), ];
Grammar usage
With Grammar instance, we can produce abstract syntax tree. We have to choose root syntactical symbol, which is expected to encompass the whole source text (the whole token stream).
$abstractSyntaxTree = $grammar->parseSource($source, rootSymbol: 'BinaryOperation');
In case of an error, UnexpectedTokenException will be thrown.
The parseSource call carries out both Tokenization and Parsing phase. Alternatively we can execute them manually:
$tokenStream = $grammar->tokenizeSource($source); $abstractSyntaxTree = $grammar->parseTokenStream($tokenStream, rootSymbol: 'BinaryOperation');
Interpretation
With our example mathematical expression, result of Interpretation phase should be a number 7. To do so, we will make following call:
$result = $abstractSyntaxTree->interpret([ 'Addition' => new AdditionNodeInterpretation(), 'Integer' => new IntegerNodeInterpretation(), 'Subtraction' => new SubtractionNodeInterpretation(), ]);
Values in the provided map are custom implementations of NodeInterpretation interface. Each is responsible to evaluate node in the abstract syntax tree it's registered for. Our AdditionNodeInterpretation implementation could look like this:
class AdditionNodeInterpretation implements NodeInterpretation { public function evaluate(Node $node): Generator { return yield $node->value[0] + yield $node->value[2]; } }
What's happening here? In $syntacticSymbols we have defined Addition to be a Sequence with 3 elements. First and last elements are the operands of our mathematical addition operation. The middle element in the sequence (literal +) doesn't play role in our interpretation anymore - we already know we are in Addition node which is responsible for... adding things up!
To make sense of the operands, we can yield subnodes of the abstract syntax tree in order to interpret them first. In this case $node->value[0] & $node->value[2] represent Integer nodes, and so IntegerNodeInterpretation will be used to interpret them.
class IntegerNodeInterpretation implements NodeInterpretation { public function evaluate(Node $node): Generator { return (int) yield $node->value; } }
Because Integer is lexical token, value of the node is Token from the token stream. Yielding instance of Token simply returns it's string value, which we can safely cast to PHP int type.
Available symbols
Vocabulary\LiteralVocabulary\NegativeLookaheadVocabulary\NonterminalVocabulary\OneOfVocabulary\RegexpVocabulary\SequenceVocabulary\Subtract