dekor / php-syntax-tree-builder
This library allows you to build your custom language by describing lexing/parsing rules.
Requires
- php: ^7.0
This package is auto-updated.
Last update: 2024-10-18 12:54:54 UTC
README
This library allows you to build your custom grammar left-right, RD (recursive descent) parser by describing lexing/parsing rules. It builds on Finite-state machine and gives you an instance of Abstract syntax tree.
Inspired by lex/yacc for c/c++ (but not the same!).
Installation
composer install dekor/php-syntax-tree-builder
How it works?
Building grammar takes 2 main steps:
- lexing - splitting origin file into language lexems.
- parser - using finite-state machine, parsing lexems sequence by some grammary rules.
Lexing
This part is very simple. All you need is to describe pairs regex => lexem name. Here is a simple list if lexer:
$lexer = new \Dekor\PhpSyntaxTreeBuilder\Lexer([ '[\s\t\n]+' => '', '\;' => ';', '=' => '=', '\$([a-z]+)' => 'VAR', 'if' => 'IF', '([0-9]+)' => 'NUMBER', '\"(.*)\"' => 'STRING', 'print' => 'PRINT', '(\+|\-|\*\/)' => 'MATH_OPERATOR', ]); $lexems = $lexer->parse($pathToFile);
The result of parse
method is an array of lexems. In case when lexer can't determine
current symbol sequence as any of described lexems, it throws LexerAnalyseException
.
This could be catched and processed.
Parsing
This part is more complex. When you describing construction, you need to have some boot construction which will be the point, at which finite-state machine starts:
$parser = new \Dekor\PhpSyntaxTreeBuilder\Parser([ 'startFrom' => 'g:php', ], [ 'g:php' => [ 'sequence' => [ 'OPENING_PHP_TAG', 'g:statements', ], ], 'g:statements' => [ 'sequence' => [ 'g:statement', '?g:statements', ], ], 'g:statement' => [ 'or' => [ 'g:var_assign', 'g:if', ], ], ... ]);
As you can see in this sample, in the config we say that it will start parsing from g:php.
Why do we add g:
prefix? Actually, this have only visual purposes: it can be easily
understanded, when you mention group and when - token.
Lets go forward. Here we can see sequence
and or
. In each group you can have
only one of that. What does it mean? In case, when we
have sequence
, the parser will foreach through each element of sequence and try to
parse lexems in this order. In case when we have group mentioned within sequence,
it will be recursively parsed. This means that parser will check lexems for opening php
lexem, then it will parse statements. Statements consists of another sequence: statement
and
?statements
. Statement here is a single statement which you can see below, ?statements
means
that after single statement, there will be another statement. But, the ?
symbol tells parser
that it is optional, which means that parser will try to parse group, in case of unseccess it will
finish continue with the following element of the sequence.
Here below, we can see or
. This allows parser to try parsing of each group from the list.
Once parser matches construction, it continues with the group that works in current statement.
This construction allows you to split your alghoritm in multiple branches. FYI, it trying
parsing from left to right.
What can go wrong?
In some cases you may have recursive grammary which may be looped infinitely.
As this parser is left-right one, here takes place Left recursion.
You may check the article in order to resolve this. This particular situation is resolved for
formula parsing in the /demo/php/demo.php
. Please, check the grammar section.
Usage
Here is a simple structure of usage:
<?php require __DIR__ . '/../../vendor/autoload.php'; $lexer = new \Dekor\PhpSyntaxTreeBuilder\Lexer([ '[\s\t\n]+' => '', '([0-9]+)' => 'NUMBER', '\"(.*)\"' => 'STRING', // lexing rules ]); $lexems = $lexer->parse($inFile); $parser = new \Dekor\PhpSyntaxTreeBuilder\Parser([ 'startFrom' => 'g:statements', ], [ 'g:statements' => [ 'sequence' => [ 'g:statement', '?g:statements', ], ], 'g:statement' => [ 'or' => [ 'g:string', 'g:number', ], ], 'g:string' => [ 'sequence' => [ 'STRING', ], ], 'g:number' => [ 'sequence' => [ 'NUMBER', ], ], ]); try { file_put_contents( 'out.json', json_encode($parser->parse($lexems), JSON_PRETTY_PRINT) ); } catch (\Throwable $e) { echo get_class($e) . ': ' . $e->getMessage(); }
Try our all the demo in order to get some good understanding on how this works and how it can be applied.
ASTNode
You can see class Dekor\PhpSyntaxTreeBuilder\ASTNode
. It is our implementation of Abstract Syntax Tree node.
Parser::parse(Lexems[] $lexems)
usually returns ASTNode
instance (may return array of nodes).
You may try and out.json in our demos in order to get some understanding on how it look like.
In case you have your own ASTNode implementation or classes composition (different classes for each node type), you may override node making by adding such construction:
'g:php' => [ 'sequence' => [ 'OPENING_PHP_TAG', 'g:statements', ], 'callback' => function($lexems, $subGroupsSequence) { return new \Dekor\PhpSyntaxTreeBuilder\ASTNode('PHP', $subGroupsSequence, $lexems); } ],
How it works? Right after the sequence parsed, the callback (if it exists) will be executed.
It should always return node object. $subGroupsSequence
is an array with parsed child groups. In this example,
it's g:statements
. $lexems
here is a simple list of lexems which were parsed in this sequence (excluding nested groups lexems).
Have any ideas/need assistance?
You can contact me by email: CONCAT("denis", "koronets", "@", "woo.zp.ua") // sorry for obfuscation, some kind of spam protection :)