lastdragon-ru / diy-parser
Yet another package to simplify writing fast and memory-effective parsers that can parse infinite strings.
Requires
- php: ^8.3|^8.4
- ext-mbstring: *
Requires (Dev)
- lastdragon-ru/lara-asp-testing: dev-main
- mockery/mockery: ^1.6.6
- phpunit/phpunit: ^11.2.0|^12.0.0
This package is auto-updated.
Last update: 2025-08-21 10:54:56 UTC
README
There are several tools to generate full-featured parsers even for PHP1. They are overkill when you just need to parse something simple. In such cases, you might decide to create your own parser. There are a lot of articles/examples on the web, and actually it is not too difficult as you may think. This is yet another package to simplify writing fast and memory-effective parsers that can parse infinite strings.
Requirements
Requirement | Constraint | Supported by |
---|---|---|
PHP | ^8.4 |
HEAD , 9.2.0 |
^8.3 |
HEAD , 9.2.0 |
Installation
composer require lastdragon-ru/diy-parser
Introduction
As an example, I will show how to create a parser for mathematical expressions like 2 - (1 + 2) / 3
and how to calculate them as a bonus.
<?php declare(strict_types = 1); namespace LastDragon_ru\DiyParser\Docs\Examples; use LastDragon_ru\DiyParser\Docs\Calculator\Calculator; use LastDragon_ru\LaraASP\Dev\App\Example; Example::dump( (new Calculator())->calculate('2 - (1 + 2) / 3'), );
The (new Calculator())->calculate('2 - (1 + 2) / 3')
is:
1
Our mathematical expression will be a string, subject to the following rules:
- numbers (positive integers only)
- operators (addition
+
, subtraction-
, multiplication*
, and division*
) - sub-expressions inside
(
and)
- space(s)
- Numbers and sub-expressions should be separated by an operator
We will start from the end, from the complete implementation of the parser, and then I will describe what is going inside it. So
<?php declare(strict_types = 1); namespace LastDragon_ru\DiyParser\Docs\Calculator; use LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNode; use LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNodeChild; use LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNodeFactory; use LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode; use LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorAdditionNode; use LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorDivisionNode; use LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorMultiplicationNode; use LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorNode; use LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorSubtractionNode; use LastDragon_ru\DiyParser\Iterables\TransactionalIterable; use LastDragon_ru\DiyParser\Tokenizer\Token; use LastDragon_ru\DiyParser\Tokenizer\Tokenizer; use LogicException; use function filter_var; use function preg_match; use const FILTER_NULL_ON_FAILURE; use const FILTER_VALIDATE_INT; class Parser { public function __construct() { // empty } public function parse(string $pattern): ?ExpressionNode { $node = null; try { $iterable = (new Tokenizer(Name::class))->tokenize([$pattern]); $node = $this->parseExpression($iterable); } catch (LogicException) { // The `$pattern` is not a valid expression } return $node; } /** * @param iterable<mixed, Token<Name>> $iterable */ protected function parseExpression(iterable $iterable): ?ExpressionNode { $iterable = new TransactionalIterable($iterable, 64, 1); $factory = new ExpressionNodeFactory(); while ($iterable->valid()) { $factory->push($this->parseExpressionChild($iterable)); } return $factory->create(); } /** * @param TransactionalIterable<Token<Name>> $iterable */ protected function parseExpressionChild(TransactionalIterable $iterable): ?ExpressionNodeChild { return $this->parseSubExpression($iterable) ?? $this->parseOperator($iterable) ?? $this->parseNumber($iterable) ?? $this->parseSpace($iterable); } /** * @param TransactionalIterable<Token<Name>> $iterable */ protected function parseSubExpression(TransactionalIterable $iterable): ?ExpressionNode { // Is `(`? if ($iterable[0]?->is(Name::LeftParenthesis) !== true) { return null; } // Begin $iterable->begin(); $iterable->next(); // Parse $node = null; $factory = new ExpressionNodeFactory(); while ($iterable->valid()) { // Is `)`? if ($iterable[0]?->is(Name::RightParenthesis) === true) { $node = $factory->create(); $iterable->next(); break; } // Child $factory->push( $this->parseExpressionChild($iterable), ); } // Commit $iterable->end($node); // Return return $node; } /** * @param TransactionalIterable<Token<Name>> $iterable */ protected function parseOperator(TransactionalIterable $iterable): ?OperatorNode { $node = match ($iterable[0]->name ?? null) { Name::Plus => new OperatorAdditionNode(), Name::Minus => new OperatorSubtractionNode(), Name::Asterisk => new OperatorMultiplicationNode(), Name::Slash => new OperatorDivisionNode(), default => null, }; if ($node !== null) { $iterable->next(); } return $node; } /** * @param TransactionalIterable<Token<Name>> $iterable */ protected function parseNumber(TransactionalIterable $iterable): ?NumberNode { // String? if ($iterable[0]?->is(null) !== true) { return null; } // Number? $number = $iterable[0]->value; $number = preg_match('/^[0-9]+$/u', $number) === 1 ? filter_var($number, FILTER_VALIDATE_INT, FILTER_NULL_ON_FAILURE) : null; $node = $number !== null ? new NumberNode($number) : null; if ($node !== null) { $iterable->next(); } // Return return $node; } /** * @param TransactionalIterable<Token<Name>> $iterable */ protected function parseSpace(TransactionalIterable $iterable): null { // Only spaces allowed here if ($iterable[0]?->is(Name::Space) !== true) { throw new LogicException('The string is not a mathematical expression.'); } else { $iterable->next(); } return null; } }
And the resulting AST:
<?php declare(strict_types = 1); namespace LastDragon_ru\DiyParser\Docs\Examples; use LastDragon_ru\DiyParser\Docs\Calculator\Parser; use LastDragon_ru\LaraASP\Dev\App\Example; Example::dump( (new Parser())->parse('2 - (1 + 2) / 3'), );
The (new Parser())->parse('2 - (1 + 2) / 3')
is:
LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNode {
+children: [
LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
+value: 2
},
LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorSubtractionNode {},
LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNode {
+children: [
LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
+value: 1
},
LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorAdditionNode {},
LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
+value: 2
},
]
},
LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorDivisionNode {},
LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
+value: 3
},
]
}
Tokenizer
First, we should split the input string into tokens that is happening inside the Parser::parse()
. The Tokenizer
accept a backed enum(s) and uses their values as delimiters (value can be a single character or a string). According to our rules, we need the following tokens:
<?php declare(strict_types = 1); namespace LastDragon_ru\DiyParser\Docs\Calculator; enum Name: string { case Plus = '+'; case Minus = '-'; case Asterisk = '*'; case Slash = '/'; case Space = ' '; case LeftParenthesis = '('; case RightParenthesis = ')'; }
After tokenization, we will have the list of tokens for further processing. Note that Token::$name
contains the enum case or null
for strings.
<?php declare(strict_types = 1); namespace LastDragon_ru\DiyParser\Docs\Examples; use LastDragon_ru\DiyParser\Docs\Calculator\Name; use LastDragon_ru\DiyParser\Tokenizer\Tokenizer; use LastDragon_ru\LaraASP\Dev\App\Example; use function iterator_to_array; $input = '2 - (1 + 2) / 3'; $tokenizer = new Tokenizer(Name::class); $tokens = $tokenizer->tokenize([$input]); Example::dump(iterator_to_array($tokens));
Example output
The iterator_to_array($tokens)
is:
[
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: null
+value: "2"
+offset: 0
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: LastDragon_ru\DiyParser\Docs\Calculator\Name {#1
+name: "Space"
+value: " "
}
+value: " "
+offset: 1
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: LastDragon_ru\DiyParser\Docs\Calculator\Name {
+name: "Minus"
+value: "-"
}
+value: "-"
+offset: 2
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: LastDragon_ru\DiyParser\Docs\Calculator\Name {#1}
+value: " "
+offset: 3
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: LastDragon_ru\DiyParser\Docs\Calculator\Name {
+name: "LeftParenthesis"
+value: "("
}
+value: "("
+offset: 4
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: null
+value: "1"
+offset: 5
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: LastDragon_ru\DiyParser\Docs\Calculator\Name {#1}
+value: " "
+offset: 6
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: LastDragon_ru\DiyParser\Docs\Calculator\Name {
+name: "Plus"
+value: "+"
}
+value: "+"
+offset: 7
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: LastDragon_ru\DiyParser\Docs\Calculator\Name {#1}
+value: " "
+offset: 8
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: null
+value: "2"
+offset: 9
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: LastDragon_ru\DiyParser\Docs\Calculator\Name {
+name: "RightParenthesis"
+value: ")"
}
+value: ")"
+offset: 10
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: LastDragon_ru\DiyParser\Docs\Calculator\Name {#1}
+value: " "
+offset: 11
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: LastDragon_ru\DiyParser\Docs\Calculator\Name {
+name: "Slash"
+value: "/"
}
+value: "/"
+offset: 12
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: LastDragon_ru\DiyParser\Docs\Calculator\Name {#1}
+value: " "
+offset: 13
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: null
+value: "3"
+offset: 14
},
]
Parser
Next, we pass tokens into Parser::parseExpression()
where we create the instance of TransactionalIterable
and start converting tokens into AST. The TransactionalIterable
is very important - it's not only return current/next/previous token and to navigate back and forward, but support transactions to process nested structures (e.g. sub-expressions) and rollback if parsing failed (eg if )
is missed).
<?php declare(strict_types = 1); namespace LastDragon_ru\DiyParser\Docs\Examples; use LastDragon_ru\DiyParser\Docs\Calculator\Name; use LastDragon_ru\DiyParser\Iterables\TransactionalIterable; use LastDragon_ru\DiyParser\Tokenizer\Tokenizer; use LastDragon_ru\LaraASP\Dev\App\Example; $input = '1 + 2'; $tokens = (new Tokenizer(Name::class))->tokenize([$input]); $iterable = new TransactionalIterable($tokens, 5, 5); Example::dump($iterable[0]); Example::dump($iterable[4]); $iterable->next(2); $iterable->begin(); // start nested Example::dump($iterable[-2]); Example::dump($iterable[0]); Example::dump($iterable[2]); $iterable->rollback(); // oops Example::dump($iterable[0]);
Example output
The $iterable[0]
is:
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: null
+value: "1"
+offset: 0
}
The $iterable[4]
is:
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: null
+value: "2"
+offset: 4
}
The $iterable[-2]
is:
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: null
+value: "1"
+offset: 0
}
The $iterable[0]
is:
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: LastDragon_ru\DiyParser\Docs\Calculator\Name {
+name: "Plus"
+value: "+"
}
+value: "+"
+offset: 2
}
The $iterable[2]
is:
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: null
+value: "2"
+offset: 4
}
The $iterable[0]
is:
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: LastDragon_ru\DiyParser\Docs\Calculator\Name {
+name: "Plus"
+value: "+"
}
+value: "+"
+offset: 2
}
All other methods just create AST nodes and check that the expression is correct. You may notice the ExpressionNodeFactory
class, which is a subclass of NodeParentFactory
. In our case, the ExpressionNodeFactory
helps to simplify the code, and it also checks that an operator between numbers/expressions (one of our requirements).
In the general case, the main reason of NodeParentFactory
is merging sequence of child nodes (of the same class) into one node. The Tokenizer
may generate a lot of "string" tokens (especially if escaping is supported), but we usually want only one (abc
) node in AST instead of multiple (a
, b
, and c
) nodes. If you use static analysis tools like PHPStan, the class will guaranties the type safety too.
<?php declare(strict_types = 1); namespace LastDragon_ru\DiyParser\Docs\Examples; use LastDragon_ru\DiyParser\Ast\NodeChild; use LastDragon_ru\DiyParser\Ast\NodeParentFactory; use LastDragon_ru\DiyParser\Ast\NodeParentImpl; use LastDragon_ru\DiyParser\Ast\NodeString; use LastDragon_ru\LaraASP\Dev\App\Example; use Override; // phpcs:disable PSR1.Files.SideEffects // phpcs:disable PSR1.Classes.ClassDeclaration.MultipleClasses /** * @implements NodeChild<ParentNode> */ class ChildNode extends NodeString implements NodeChild { // empty } /** * @extends NodeParentImpl<ChildNode> */ class ParentNode extends NodeParentImpl { // empty } /** * @extends NodeParentFactory<ParentNode, ChildNode> */ class ParentNodeFactory extends NodeParentFactory { /** * @inheritDoc */ #[Override] protected function onCreate(array $children): ?object { return $children !== [] ? new ParentNode($children) : null; } /** * @inheritDoc */ #[Override] protected function onPush(array $children, ?object $node): bool { return true; } } $factory = new ParentNodeFactory(); $factory->push(new ChildNode('a')); $factory->push(new ChildNode('b')); $factory->push(new ChildNode('c')); Example::dump($factory->create()); // create and reset Example::dump($factory->create()); // `null`
The $factory->create()
is:
LastDragon_ru\DiyParser\Docs\Examples\ParentNode {
+children: [
LastDragon_ru\DiyParser\Docs\Examples\ChildNode {
+string: "abc"
},
]
}
The $factory->create()
is:
null
AST
As you can see, our AST for mathematical expressions doesn't have references to parent/next/previous nodes. It makes the parser simple, and reduce memory usage. The Cursor
class can be used to navigate the AST in all directions. Please note that your nodes must implement NodeParent
and NodeChild
if you want to use the cursor.
<?php declare(strict_types = 1); namespace LastDragon_ru\DiyParser\Docs\Examples; use LastDragon_ru\DiyParser\Ast\Cursor; use LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNode; use LastDragon_ru\DiyParser\Docs\Calculator\Parser; use LastDragon_ru\LaraASP\Dev\App\Example; use function assert; // Parse $ast = (new Parser())->parse('2 - (1 + 2) / 3'); assert($ast instanceof ExpressionNode); // Create the cursor $cursor = new Cursor($ast); // Children can be iterated directly foreach ($cursor as $child) { if ($child->node instanceof ExpressionNode) { Example::dump($child->node); break; } } // Also possible to get n-th child Example::dump($cursor[2]); // And next/previous Example::dump($cursor[2]->next->node ?? null); Example::dump($cursor[2]->previous->node ?? null);
Example output
LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNode {
+children: [
LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
+value: 1
},
LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorAdditionNode {},
LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
+value: 2
},
]
}
The $cursor[2]
is:
LastDragon_ru\DiyParser\Ast\Cursor {
+node: LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNode {#1
+children: [
LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
+value: 1
},
LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorAdditionNode {},
LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
+value: 2
},
]
}
+parent: LastDragon_ru\DiyParser\Ast\Cursor {
+node: LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNode {
+children: [
LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
+value: 2
},
LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorSubtractionNode {},
LastDragon_ru\DiyParser\Docs\Calculator\Ast\ExpressionNode {#1},
LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorDivisionNode {},
LastDragon_ru\DiyParser\Docs\Calculator\Ast\NumberNode {
+value: 3
},
]
}
+parent: null
+index: null
}
+index: 2
}
The $cursor[2]->next->node ?? null
is:
LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorDivisionNode {}
The $cursor[2]->previous->node ?? null
is:
LastDragon_ru\DiyParser\Docs\Calculator\Ast\OperatorSubtractionNode {}
Escape
Of course, it is supported. You just need to specify the token to use as escaping string:
<?php declare(strict_types = 1); namespace LastDragon_ru\DiyParser\Docs\Examples; use LastDragon_ru\DiyParser\Tokenizer\Tokenizer; use LastDragon_ru\LaraASP\Dev\App\Example; use function iterator_to_array; // phpcs:disable PSR1.Files.SideEffects // phpcs:disable PSR1.Classes.ClassDeclaration.MultipleClasses enum Name: string { case Slash = '/'; case Backslash = '\\'; } $input = 'a/b\\/\\c'; $tokenizer = new Tokenizer(Name::class, Name::Backslash); $tokens = $tokenizer->tokenize([$input]); Example::dump(iterator_to_array($tokens));
The iterator_to_array($tokens)
is:
[
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: null
+value: "a"
+offset: 0
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: LastDragon_ru\DiyParser\Docs\Examples\Name {
+name: "Slash"
+value: "/"
}
+value: "/"
+offset: 1
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: null
+value: "b"
+offset: 2
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: null
+value: "/"
+offset: 4
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: LastDragon_ru\DiyParser\Docs\Examples\Name {
+name: "Backslash"
+value: "\"
}
+value: "\"
+offset: 5
},
LastDragon_ru\DiyParser\Tokenizer\Token {
+name: null
+value: "c"
+offset: 6
},
]
Limitations
Internally, the package uses preg_match()
with u
modifier to split string(s) into tokens because this is probably the faster way in PHP (if you have a better way, please create issue/pr/etc). The con is internal encoding is always UTF-8
. Thus, if you need to parse strings in other encoding, you must convert them into UTF-8
before parsing.
Multiple tokens that are the same string/character are not supported yet. Moreover, if conflicted, no error will be reported. The priority is undefined.
The input string potentially may have infinite length. But, Token::$offset
has int
type so its max value is PHP_INT_MAX
. Also, it may be limited by the size of the TransactionalIterable
buffers (actual for nested nodes).
There are no line numbers, only Token::$offset
from the beginning of the string. If you need them, you need to add token(s) for EOL and process/calculate within the parser.
There is also no special code to check nested/closed parentheses or throw an error if string cannot be parsed. It also should be done within parser implementation.
Bonus
For calculate mathematical expressions, we are using shunting yard algorithm to convert our AST into Reverse Polish Notation. Implementation is ever simpler than usual, because (...)
represented as sub-nodes and no need any special actions, please see the ExpressionNode
class.
Upgrading
Please follow Upgrade Guide.
Contributing
This package is the part of Awesome Set of Packages for Laravel. Please use the main repository to report issues, send pull requests, or ask questions.