kuria/parser

Character-by-character string parsing library

v4.0.0 2018-08-05 22:17 UTC

Last update: 2024-08-22 17:27:13 UTC


Features

  • line number tracking (can be disabled for performance)
  • supports CR, LF and CRLF line endings
  • verbose exceptions
  • many methods to navigate and operate the parser
    • forward / backward peeking and seeking
    • forward / backward character consumption
    • state stack
  • character types
  • expectations

Requirements

  • PHP 7.1+

Usage

Creating a parser

Create a new parser instance with string input.

The parser begins at the first character.

<?php

use Kuria\Parser\Parser;

$input = 'foo bar baz';

$parser = new Parser($input);

Parser properties

The parser has several public properties that can be used to inspect its current state:

  • $parser->i - current position
  • $parser->char - current character (or NULL at the end of input)
  • $parser->lastChar - last character (or NULL at the start of input)
  • $parser->line - current line (or NULL if line tracking is disabled)
  • $parser->end - end of input indicator (TRUE at the end, FALSE otherwise)
  • $parser->vars - user-defined variables attached to the current state

Warning

All of the public properties (with the exception of $parser->vars) are read-only and must not be modified directly by the calling code.

Use the built-in parser methods to mutate the parser state. See Parser method overview.
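
As a rough illustration of the properties listed above, the following sketch reads them before and after consuming one character with eat(). It assumes the package is installed and Composer's autoloader has already been loaded; the variable names are only illustrative.

```php
<?php

use Kuria\Parser\Parser;

$parser = new Parser('foo bar baz');

// the parser begins at the first character
$startChar = $parser->char;     // 'f'
$startLast = $parser->lastChar; // NULL, nothing has been consumed yet
$startEnd = $parser->end;       // FALSE

// change the state through a parser method, never by assigning to the properties
$eaten = $parser->eat();        // returns 'f' and moves on to 'o'

var_dump($startChar, $startLast, $startEnd, $eaten, $parser->char, $parser->lastChar);
```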

Parser method overview

Refer to doc comments of the respective methods for more information.

Also see Character types.

Static methods

  • getCharType($char): int - determine character type
  • getCharTypeName($charType): string - get human-readable character type name

Instance methods

  • getInput(): string - get the input string
  • setInput($input): void - replace the input string (this also resets the parser)
  • getLength(): int - get length of the input string
  • isTrackingLineNumbers(): bool - see if line number tracking is enabled
  • type(): int - get type of the current character
  • is(...$types): bool - check whether the current character is of one of the specified types
  • atNewline(): bool - see if the parser is at the start of a newline sequence
  • eat(): ?string - go to the next character and return the current one (returns NULL at the end)
  • spit(): ?string - go to the previous character and return the current one (returns NULL at the beginning)
  • shift(): ?string - go to the next character and return it (returns NULL at the end)
  • unshift(): ?string - go to the previous character and return it (returns NULL at the beginning)
  • peek($offset, $absolute = false): ?string - get character at the given offset or absolute position (does not affect state)
  • seek($offset, $absolute = false): void - alter current position
  • reset(): void - reset states, vars and rewind to the beginning
  • rewind(): void - rewind to the beginning
  • eatChar($char): ?string - consume specific character and return the next character
  • tryEatChar($char): bool - attempt to consume specific character and return success state
  • eatType($type): string - consume all characters of the specified type
  • eatTypes($typeMap): string - consume all characters of the specified types
  • eatWs(): string - consume whitespace, if any
  • eatUntil($delimiterMap, $skipDelimiter = true, $allowEnd = false): string - consume all characters until the specified delimiters
  • eatUntilEol($skip = true): string - consume all characters until end of line or input
  • eatEol(): string - consume end of line sequence
  • eatRest(): string - consume remaining characters
  • getChunk($start, $end): string - get chunk of the input (does not affect state)
  • detectEol(): ?string - find and return the next end of line sequence (does not affect state)
  • countStates(): int - get number of stored states
  • pushState(): void - store the current state
  • revertState(): void - revert to the last stored state and pop it
  • popState(): void - pop the last stored state without reverting to it
  • clearStates(): void - throw away all stored states
  • expectEnd(): void - ensure that the parser is at the end
  • expectNotEnd(): void - ensure that the parser is not at the end
  • expectChar($expectedChar): void - ensure that the current character matches the expectation
  • expectCharType($expectedType): void - ensure that the current character is of the given type
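
The sketch below strings a few of the methods above together: consuming by character type, peeking ahead, and using the state stack to backtrack. It is an illustration based only on the method descriptions in this list, not an excerpt from the library's own documentation, and it assumes Composer's autoloader has already been loaded.

```php
<?php

use Kuria\Parser\Parser;

$parser = new Parser('foo bar baz');

// remember the current state so we can backtrack later
$parser->pushState();

// consume the first word and the whitespace after it
$word = $parser->eatType(Parser::C_STR); // "foo"
$parser->eatWs();

// inspect the current and the following character without moving
$current = $parser->char;  // "b"
$next = $parser->peek(1);  // "a"

// return to the stored state (this also pops it from the stack)
$parser->revertState();
$afterRevert = $parser->char; // "f" again

// consume everything that is left
$rest = $parser->eatRest();   // "foo bar baz"
$parser->expectEnd();         // would throw if the parser were not at the end

var_dump($word, $current, $next, $afterRevert, $rest);
```

Pairing pushState() with revertState() is a convenient way to try one interpretation of the input and fall back to another if it does not fit; popState() discards the stored state when the attempt succeeds.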

Example INI parser implementation

<?php

use Kuria\Parser\Parser;

/**
 * INI parser (example)
 */
class IniParser
{
    /**
     * Parse an INI string
     */
    public function parse(string $string): array
    {
        // create parser
        $parser = new Parser($string);

        // prepare variables
        $data = [];
        $currentSection = null;

        // parse
        while (!$parser->end) {
            // skip whitespace
            $parser->eatWs();
            if ($parser->end) {
                break;
            }

            // parse the current thing
            if ($parser->char === '[') {
                // a section
                $currentSection = $this->parseSection($parser);
            } elseif ($parser->char === ';') {
                // a comment
                $this->skipComment($parser);
            } else {
                // a key=value pair
                [$key, $value] = $this->parseKeyValue($parser);

                // add to output
                if ($currentSection === null) {
                    $data[$key] = $value;
                } else {
                    $data[$currentSection][$key] = $value;
                }
            }
        }

        return $data;
    }

    /**
     * Parse a section and return its name
     */
    private function parseSection(Parser $parser): string
    {
        // we should be at the [ character now, eat it
        $parser->eatChar('[');

        // eat everything until ]
        $sectionName = $parser->eatUntil(']');

        return $sectionName;
    }

    /**
     * Skip a commented-out line
     */
    private function skipComment(Parser $parser): void
    {
        // we should be at the ; character now, eat it
        $parser->eatChar(';');

        // eat everything until the end of line
        $parser->eatUntilEol();
    }

    /**
     * Parse a key=value pair
     */
    private function parseKeyValue(Parser $parser): array
    {
        // we should be at the first character of the key
        // eat characters until = is found
        $key = $parser->eatUntil('=');

        // eat everything until the end of line
        // that is our value
        $value = trim($parser->eatUntilEol());

        return [$key, $value];
    }
}

Using the parser

<?php

$iniParser = new IniParser();

$iniString = <<<INI
; An example comment
name=Foo
type=Bar

[options]
size=150x100
onload=
INI;

$data = $iniParser->parse($iniString);

print_r($data);

Output:

Array
(
    [name] => Foo
    [type] => Bar
    [options] => Array
        (
            [size] => 150x100
            [onload] =>
        )

)

Character types

The list below shows the default character types.

These types are available as constants on the Parser class:

  • Parser::C_NONE - no character (NULL)
  • Parser::C_WS - whitespace (tab, linefeed, vertical tab, form feed, carriage return and space)
  • Parser::C_NUM - numeric character (0-9)
  • Parser::C_STR - string character (a-z, A-Z, _ and any 8-bit char)
  • Parser::C_CTRL - control character (ASCII 127 and ASCII < 32 except whitespace)
  • Parser::C_SPECIAL - !"#$%&'()*+,-./:;<=>?@[\]^`{|}~
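
A small sketch of how these constants might be used together with the static getCharType() method and the is() instance method from the method overview. The sample characters and variable names are arbitrary, and Composer's autoloader is assumed to be loaded.

```php
<?php

use Kuria\Parser\Parser;

// classify single characters using the static helper
$letterType = Parser::getCharType('a'); // Parser::C_STR
$digitType = Parser::getCharType('7');  // Parser::C_NUM
$spaceType = Parser::getCharType(' ');  // Parser::C_WS
$equalsType = Parser::getCharType('='); // Parser::C_SPECIAL

// check the type of the parser's current character
$parser = new Parser('42 apples');
$startsWithNumber = $parser->is(Parser::C_NUM);            // true, "4" is numeric
$isWordOrSpace = $parser->is(Parser::C_STR, Parser::C_WS); // false

// human-readable type name, handy for error messages
echo Parser::getCharTypeName($letterType), "\n";
```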

Customizing character types

Character types can be customized by extending the base Parser class.

The following example changes "-" and "." from C_SPECIAL to C_STR and inherits everything else.

<?php

use Kuria\Parser\Parser;

class CustomParser extends Parser
{
    const CHAR_TYPE_MAP = [
        '-' => self::C_STR,
        '.' => self::C_STR,
    ] + parent::CHAR_TYPE_MAP; // inherit everything else
}

// usage example
$parser = new CustomParser('foo-bar.baz');

var_dump($parser->eatType(CustomParser::C_STR));

Output:

string(11) "foo-bar.baz"