phug/reader

Pug (ex-Jade) string reader for PHP, HTML template engine structured by indentation


README

What is Phug Reader?

The Reader-class is a small utility that can parse and scan strings for specific entities.

It's mostly based on Regular Expressions, but also brings in tools to scan strings and expressions of any kind (e.g. string escaping, bracket counting etc.)

The string passed to the reader is swallowed byte by byte through consume-mechanisms. When the string is empty, the parsing is done (usually).

This class is specifically made for lexical analysis and expression-validation.

Installation

Install via Composer

composer require phug/reader

Usage

Basics

The process of reading with the Phug\Reader involves peeking and consuming. You peek, check, if it's what you wanted and if it is, you consume.

read-methods on the Reader will peek and consume automatically until they found what you searched for. match-methods work like peek, but work with regular expressions.

Lets create a small example code to parse:

$code = 'someString = "some string"';

Now we create a reader for that code

$reader = new Reader($code);

If you want a fixed encoding, use the second $encoding parameter.

Now we can do our reading process. First we want to read our identifier. We can do that easily with readIdentifier() which returns null if no identifier has been encountered and the identifier found otherwise. It will stop on anything that is not an identifier-character (The space after the identifier, in this case)

$identifier = $reader->readIdentifier();
if ($identifier === null) {
    throw new Exception("Failed to read: Identifier expected");
}
    
var_dump($identifier); //`someString`

To get to our =-character directly, we can just skip all spaces we encounter. This also allows for any spacing you want (e.g. you can indent the above with tabs if you like)

$reader->readSpaces();

If we need the spaces, we can always catch the returned result. If no spaces are encountered, it just returns null.

Now we want to parse the assignment-operator (=) (or rather, validate that it's there)

if (!$reader->peekChar('=')) {
    throw new Exception("Failed to read: Assignment expected");
}
    
//Consume the result, since we're `peek`ing, not `read`ing.
$reader->consume();

Skip spaces again

$reader->readSpaces();

and read the string. If no quote-character (" or ') is encountered, it will return null. Otherwise, it will return the (already parsed) string, without quotes. Notice that you have to check null explicitly, since we could also have an empty string ("") which evaluates to true in PHP

$string = $reader->readString();

if ($string === null) {
    throw new Exception("Failed to read: Expected string");
}

var_dump($string); //`some string`

The quote-style encountered will be escaped by default, so you can scan "some \" string" correctly. If you want to add other escaping, use the first parameter of readString.

Now you have all parts parsed to make up your actual action

echo "Set `$identifier` to `$string`"; //Set `someString` to `some string`

and it was validated on that way.

This was just a small example, Phug Reader is made for loop-parsing.

Build a small tokenizer

use Phug\Reader;

//Some C-style example code
$code = 'someVar = {a, "this is a string (really, it \"is\")", func(b, c), d}';

$reader = new Reader($code);

$tokens = [];
$blockLevel = 0;
$expressionLevel = 0;
while ($reader->hasLength()) {
    
    //Skip spaces of any kind.
    $reader->readSpaces();
    
    //Scan for identifiers
    if ($identifier = $reader->readIdentifier()) {
        
        $tokens[] = ['type' => 'identifier', 'name' => $identifier];
        continue;
    }
    
    //Scan for Assignments
    if ($reader->peekChar('=')) {
        
        $reader->consume();
        $tokens[] = ['type' => 'assignment'];
        continue;
    }
    
    //Scan for strings
    if (($string = $reader->readString()) !== null) {
        
        $tokens[] = ['type' => 'string', 'value' => $string];
        continue;
    }
    
    //Scan block start
    if ($reader->peekChar('{')) {
        
        $reader->consume();
        $blockLevel++;
        $tokens[] = ['type' => 'blockStart'];
        continue;
    }
    
    //Scan block end
    if ($reader->peekChar('}')) {
    
        $reader->consume();
        $blockLevel--;
        $tokens[] = ['type' => 'blockEnd'];
        continue;
    }
    
    //Scan parenthesis start
    if ($reader->peekChar('(')) {
        
        $reader->consume();
        $expressionLevel++;
        $tokens[] = ['type' => 'listStart'];
        continue;
    }
    
    //Scan parenthesis end
    if ($reader->peekChar(')')) {
        
        $reader->consume();
        $expressionLevel--;
        $tokens[] = ['type' => 'listEnd'];
        continue;
    }
    
    //Scan comma
    if ($reader->peekChar(',')) {
        
        $reader->consume();
        $tokens[] = ['type' => 'next'];
        continue;
    }

    throw new \Exception(
        "Unexpected ".$reader->peek(10)
    );
}

if ($blockLevel || $expressionLevel)
    throw new \Exception("Unclosed bracket encountered");

var_dump($tokens);
/* Output:
[
    ['type' => 'identifier', 'name' => 'someVar'],
    ['type' => 'assignment'],
    ['type' => 'blockStart'],
    ['type' => 'identifier', 'name' => 'a'],
    ['type' => 'next'],
    ['type' => 'string', 'value' => 'this is a string (really, it "is")'],
    ['type' => 'next'],
    ['type' => 'identifier', 'name' => 'func'],
    ['type' => 'listStart'],
    ['type' => 'identifier', 'name' => 'b'],
    ['type' => 'next'],
    ['type' => 'identifier', 'name' => 'c'],
    ['type' => 'listEnd'],
    ['type' => 'next'],
    ['type' => 'identifier', 'name' => 'd'],
    ['type' => 'blockEnd']
]
*/

Keep expressions intact

Sometimes you want to keep expressions intact, e.g. when you allow inclusion of third-party-code that needs to be parsed separately.

The Reader brings a bracket-counting-utility that can do just that exactly. Let's take Jade as an example:

a(href=getUri('/abc', true), title=(title ? title : 'Sorry, no title.'))

To parse this, let's do the following:

//Scan Identifier ("a")
$identifier = $reader->readIdentifier();

$attributes = [];
//Enter an attribute block if available
if ($reader->peekChar('(')) {

    $reader->consume();
    while ($reader->hasLength()) {
    
    
        //Ignore spaces
        $reader->readSpaces();
    
    
        //Scan the attribute name
        if (!($name = $this->readIdentifier())) {
            throw new \Exception("Attributes need a name!");
        }
        
        
        //Ignore spaces
        $reader->readSpaces();
        
        
        //Make sure there's a =-character
        if (!$reader->peekChar('=')) {
            throw new \Exception("Failed to read: Expected attribute value");
        }
            
        $reader->consume();
        
        
        //Ignore spaces
        $reader->readSpaces();
        
        
        //Read the expression until , or ) is encountered
        //It will ignore , and ) inside any kind of brackets and count brackets correctly until we actually
        //reached the end-bracket
        $value = $reader->readExpression([',', ')']);
        
        
        //Add the attribute to our attribute array
        $attributes[$name] = $value;
        
        
        //If we don't encounter a , to go on, we break the loop
        if (!$reader->peekChar(',')) {
            break;
        }
            
            
        //Else we consume the , and continue our attribute parsing
        $reader->consume();       
    }
    
    //Now make sure we actually closed our attribute block correctly.
    if (!$reader->peekChar(')')) {
        throw new \Exception("Failed to read: Expected closing bracket");
    }
}


$element = ['identifier' => $identifier, 'attributes' => $attributes];

var_dump($element);
/* Output:
[
    'identifier' => 'a',
    'attributes' => [
        'href' => 'getUri(\'/abc\', true)',
        'title' => '(title ? title : \'Sorry, no title.\')'
    ]
]
*/

You now got a parser for (really, really basic) Jade-elements! It can handle as many attributes as you like with all possible values you could think of without ever breaking the listing, regardless of contained commas and brackets.

Digging deeper, the Phug Reader is actually able to lex source code and text of any kind.