charm / lexer
A fast and powerful streaming lexer for tokenizing formulas, programming languages or written languages.
Requires (Dev)
- phpunit/phpunit: ^9.5
Last update: 2024-10-29 20:57:11 UTC
README
A streaming lexer which uses regular expressions to match tokens.
Note: many useful regexes are predefined as string constants on the Lexer class.
Basic usage
$lexer = new Charm\Lexer(
    // regex patterns
    [
        'NUM' => [
            '\b(?<!\.)(([1-9][0-9]*|[0-9])\.[0-9]+)(?!\.)\b', // matches floats
            '(?<!\.)\b([1-9][0-9]*|[0-9])\b(?!\.)', // matches integers
        ],
        'DOUBLE_QUOTED_STRING' => '\"(\\\\\\\\|\\\\"|[^"])*"', // matches "this \" string"
        'SINGLE_QUOTED_STRING' => '\'(\\\\\\\\|\\\\\'|[^\'])*\'', // matches 'this \' string'
        'IDENTIFIER' => '\b[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*', // matches C/JavaScript/PHP style variable and function names
    ],
    // exact match patterns
    [
        'BINARY_OPERATORS' => [
            '+', '-', '*', '/',
        ]
    ]
);
foreach ($lexer->tokenize("123 + 10 * 5") as $token) {
    echo "Token: ".$token->content."\n";
    echo " - kind = ".$token->kind."\n";
}
Whitespace will be automatically discarded. To avoid discarding it, construct Lexer with null for the third argument.
$lexer = new Charm\Lexer($regexPatterns, $stringPatterns, null);
When whitespace is not discarded, it must also be captured by a regex pattern such as '\s+'.
Performance
The lexer performs well as long as the chunks/strings you provide it with are long enough. It works by building one large regular expression, then creating a Token instance for every match and tagging it with the token kind that matched.
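The single-regex approach described above can be sketched in a few lines of plain PHP. This is a minimal illustration under assumptions, not the package's actual implementation: the function name sketchTokenize, the '~' delimiter, and the array shape of the yielded tokens are all hypothetical.

```php
<?php
// Hypothetical sketch, NOT part of charm/lexer: combine named patterns into
// one alternation and report which named group matched at each position.
function sketchTokenize(array $patterns, string $input): Generator
{
    $parts = [];
    foreach ($patterns as $kind => $regex) {
        // Each token kind becomes a named capture group.
        $parts[] = '(?<' . $kind . '>' . $regex . ')';
    }
    // The A modifier anchors every match at the current offset.
    // Assumption: no pattern contains the '~' delimiter.
    $combined = '~' . implode('|', $parts) . '~A';

    $offset = 0;
    $length = strlen($input);
    while ($offset < $length) {
        $ok = preg_match($combined, $input, $m, PREG_UNMATCHED_AS_NULL, $offset);
        if (!$ok || $m[0] === '') {
            throw new RuntimeException("Unexpected input at offset $offset");
        }
        foreach ($patterns as $kind => $unused) {
            if (isset($m[$kind])) {
                // The first named group that took part in the match wins.
                yield ['kind' => $kind, 'content' => $m[$kind]];
                break;
            }
        }
        $offset += strlen($m[0]);
    }
}

// Usage: whitespace is matched like any other kind and skipped by the caller.
$patterns = ['NUM' => '\d+', 'OP' => '[+\-*/]', 'WS' => '\s+'];
foreach (sketchTokenize($patterns, "123 + 10 * 5") as $token) {
    if ($token['kind'] !== 'WS') {
        echo $token['kind'] . ': ' . $token['content'] . "\n";
    }
}
```

One regex call per token keeps the matching inside the regex engine, which is likely why longer inputs amortize the setup cost better than many short ones.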
To avoid yielding a partial token, it will never yield the last token before the input stream has ended.
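The reason for holding back the last token is easiest to see when the input arrives in chunks: a number split across two chunks would otherwise be emitted as two tokens. The following is a rough sketch of that idea, assuming a hypothetical sketchStream function and a simple buffer; the package's internals may differ.

```php
<?php
// Hypothetical sketch, NOT part of charm/lexer: yield tokens from a stream of
// chunks, holding back the last match until more input (or the end) arrives.
function sketchStream(iterable $chunks, string $regex): Generator
{
    $buffer = '';
    foreach ($chunks as $chunk) {
        $buffer .= $chunk;
        preg_match_all($regex, $buffer, $matches, PREG_OFFSET_CAPTURE);
        $found = $matches[0];
        // The last match might continue in the next chunk, so keep it back.
        $last = array_pop($found);
        foreach ($found as $match) {
            yield $match[0];
        }
        if ($last !== null) {
            // Restart the buffer at the held-back token.
            $buffer = substr($buffer, $last[1]);
        }
    }
    // The stream has ended, so everything left in the buffer is complete.
    if (preg_match_all($regex, $buffer, $matches)) {
        foreach ($matches[0] as $text) {
            yield $text;
        }
    }
}

// Usage: "123" and "45" are both split across chunk boundaries.
$tokens = iterator_to_array(
    sketchStream(['12', '3 + 4', '5'], '~\d+|[+\-*/]~'),
    false
);
echo implode(' ', $tokens) . "\n";
```

Without the hold-back step, the first chunk alone would have produced the token 12, which is not a token of the full input.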