charm/lexer

A fast and powerful streaming lexer for tokenizing formulas, programming languages or written languages.

1.0.0 2021-11-15 01:03 UTC

This package is not auto-updated.

Last update: 2025-01-21 22:07:25 UTC


README

A streaming lexer which uses regular expressions to match tokens.

Note: many useful regexes are available as string constants on the Lexer class.

Basic usage

$lexer = new Charm\Lexer(
    // regex patterns
    [
        'NUM' => [
            '\b(?<!\.)(([1-9][0-9]*|[0-9])\.[0-9]+)(?!\.)\b',           // matches floats
            '(?<!\.)\b([1-9][0-9]*|[0-9])\b(?!\.)',                     // matches integers
        ],
        'DOUBLE_QUOTED_STRING' => '\"(\\\\\\\\|\\\\"|[^"])*"',          // matches "this \" string"
        'SINGLE_QUOTED_STRING' => '\'(\\\\\\\\|\\\\\'|[^\'])*\'',       // matches 'this \' string'
        'IDENTIFIER' => '\b[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*',   // matches C/JavaScript/PHP style variable and function names
    ],
    // exact match patterns
    [
        'BINARY_OPERATORS' => [
            '+', '-', '*', '/',
        ]
    ]
);

foreach ($lexer->tokenize("123 + 10 * 5") as $token) {
    echo "Token: ".$token->content."\n";
    echo " - kind = ".$token->kind."\n";
}

Whitespace is discarded automatically. To keep it, construct the Lexer with null as the third argument.

$lexer = new Charm\Lexer($regexPatterns, $stringPatterns, null);

When whitespace is kept, it must also be matched by one of your regex patterns, for example '\s+'.

Performance

The lexer performs well as long as the chunks or strings you feed it are long enough. It works by combining all patterns into one large regular expression and then creating a Token instance for every match, tagged with the token kind that matched.
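The single-large-regex idea can be sketched as below. This is a minimal illustration under stated assumptions, not the code charm/lexer actually runs: the function names `buildCombinedRegex` and `sketchTokenize`, the `WS` kind used for discarding whitespace, and the `~` delimiter (patterns are assumed not to contain `~`) are all hypothetical.

```php
<?php
// Hypothetical helper (not part of charm/lexer): wraps each pattern in a
// named group and joins them into one alternation, anchored with the A flag.
function buildCombinedRegex(array $patterns): string
{
    $parts = [];
    foreach ($patterns as $kind => $pattern) {
        $parts[] = "(?<$kind>$pattern)";
    }
    return '~' . implode('|', $parts) . '~A';
}

// Hypothetical tokenizer: walks the input with a single regex and records
// which named group matched. Tokens of kind $discard are dropped.
function sketchTokenize(string $input, array $patterns, ?string $discard = 'WS'): array
{
    $regex = buildCombinedRegex($patterns);
    $tokens = [];
    $offset = 0;
    $length = strlen($input);
    while ($offset < $length) {
        // An empty match would never advance, so treat it as an error too.
        if (!preg_match($regex, $input, $m, 0, $offset) || $m[0] === '') {
            throw new RuntimeException("Unexpected input at offset $offset");
        }
        foreach ($patterns as $kind => $_) {
            // Groups that did not take part in the match are empty or unset.
            if (isset($m[$kind]) && $m[$kind] !== '') {
                if ($kind !== $discard) {
                    $tokens[] = ['kind' => $kind, 'content' => $m[$kind]];
                }
                break;
            }
        }
        $offset += strlen($m[0]);
    }
    return $tokens;
}

$tokens = sketchTokenize("123 + 10 * 5", [
    'NUM' => '\d+',
    'OP'  => '[+\-*/]',
    'WS'  => '\s+',
]);
foreach ($tokens as $token) {
    echo $token['kind'] . ': ' . $token['content'] . "\n";
}
```

One regex call per token keeps most of the work inside the regex engine, which is one plausible reason longer inputs amortize the setup cost better.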

To avoid yielding a partial token, the lexer holds back the last matched token until the input stream has ended.
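The reason for holding back the last token shows up when a token is split across two chunks. The following is a rough sketch of that behaviour, assuming a hypothetical generator `sketchStream` that receives chunks and a ready-made regex. It is not the library's implementation, and it silently skips text that matches nothing.

```php
<?php
// Hypothetical sketch (not charm/lexer code): yields matched strings from a
// sequence of chunks, always withholding the most recent match because the
// next chunk might extend it.
function sketchStream(iterable $chunks, string $regex): Generator
{
    $buffer = '';
    foreach ($chunks as $chunk) {
        $buffer .= $chunk;
        preg_match_all($regex, $buffer, $matches, PREG_OFFSET_CAPTURE);
        $found = $matches[0];
        // The last match may be incomplete, so keep it in the buffer.
        $held = array_pop($found);
        foreach ($found as [$text]) {
            yield $text;
        }
        if ($held !== null) {
            // Restart the buffer at the byte offset of the held-back match.
            $buffer = substr($buffer, $held[1]);
        }
    }
    // The stream has ended, so whatever remains is complete.
    if (preg_match_all($regex, $buffer, $matches)) {
        yield from $matches[0];
    }
}

// "123" arrives split as "12" and "3"; it is still yielded as one token.
$stream = sketchStream(['12', '3 + 4'], '~\d+|[+\-*/]~');
foreach ($stream as $text) {
    echo $text . "\n";
}
```

Without the hold-back step, the first chunk would yield `12` as a complete number before the `3` arrived.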