README

This is the Markdown parser for Chyrp Lite. It is a set of PHP classes for converting Markdown to HTML, with a focus on speed and simplicity. The parser combines a one-shot parsing strategy with use of fast string functions wherever possible to increase performance. The parser is modular and extensible; a new flavor of Markdown can be defined by extending the base class and adding the desired traits, or an existing flavor can be extended to recognize additional elements by adding traits.

Currently the following Markdown flavors are supported:

CommonMark
GitHub-Flavored Markdown
GitLab-Flavored Markdown
Chyrp-Flavoured Markdown

Requirements

PHP 8.0+ is required.
UTF-8 is the only supported text encoding.

Multibyte String and IntlChar are recommended for full functionality but not required.

Performance

Parsing is slower than with Parsedown and cebe\Markdown, but conformance to the CommonMark standard is significantly higher, as verified by a comprehensive test suite. The table below reports the average parse time over 30,000 iterations for the 27 KB source of John Gruber's Markdown syntax documentation.

Parser	Time to parse	CommonMark conformance
\cebe\markdown\Markdown 1.2.1	2.7 milliseconds	41%
\cebe\markdown\GithubMarkdown	3.9 milliseconds	-
Parsedown 1.8.0	4.0 milliseconds	48%
\xenocrat\markdown\Markdown	5.9 milliseconds	95%
\xenocrat\markdown\GithubMarkdown	8.2 milliseconds	-
\xenocrat\markdown\GitlabMarkdown	8.2 milliseconds	-
\xenocrat\markdown\ChyrpMarkdown	9.6 milliseconds	-
\Michelf\Markdown 2.0.0	15.5 milliseconds	36%
\Michelf\MarkdownExtra	24.0 milliseconds	39%

Test environment: PHP 8.1.0, Windows 11, AMD Ryzen 7 2700X, 32 GB RAM.

Usage

The first step is to choose the Markdown flavor and instantiate the parser:

CommonMark:
$parser = new \xenocrat\markdown\Markdown();
GitHub-Flavored Markdown:
$parser = new \xenocrat\markdown\GithubMarkdown();
GitLab-Flavored Markdown:
$parser = new \xenocrat\markdown\GitlabMarkdown();
Chyrp-Flavoured Markdown:
$parser = new \xenocrat\markdown\ChyrpMarkdown();

The next step is to call the parser method:

Use parse() for parsing the text using the full Markdown language;
Use parseParagraph() to parse only inline elements in the text.

Here are some examples:

// CommonMark; parse full text
$parser = new \xenocrat\markdown\Markdown();
echo $parser->parse($markdown);

// GFM
$parser = new \xenocrat\markdown\GithubMarkdown();
echo $parser->parse($markdown);

// GLFM
$parser = new \xenocrat\markdown\GitlabMarkdown();
echo $parser->parse($markdown);

// CFM
$parser = new \xenocrat\markdown\ChyrpMarkdown();
echo $parser->parse($markdown);

// CommonMark; parse only inline elements (useful for one-line descriptions)
$parser = new \xenocrat\markdown\Markdown();
echo $parser->parseParagraph($markdown);

You may adjust the properties on the parser object before parsing – see public properties below.
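
As a brief illustration of adjusting properties before parsing, the sketch below sets a few of the public properties documented in this README on a GithubMarkdown instance. It assumes the package has been installed with Composer; the autoloader path is an assumption and may differ in your project.

```php
<?php
// Assumption: Composer's autoloader lives in ./vendor relative to this script.
require __DIR__ . '/vendor/autoload.php';

$parser = new \xenocrat\markdown\GithubMarkdown();

// Adjust public properties before calling parse().
$parser->html5 = true;                // HTML5 output instead of HTML4
$parser->enableNewlines = true;       // convert newlines to <br/> tags
$parser->renderCheckboxInputs = true; // task items as inputs instead of emoji

echo $parser->parse("- [x] Done\n- [ ] Pending");
```

Because the properties are read during parsing, they can be changed between calls on the same parser object.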

Methods

`parse`

Description

public Parser::parse(
    string $text
): string

Parses text using the full Markdown language.

Parameters

text

A UTF-8 encoded string of text to parse.

Return Values

Returns a UTF-8 encoded string of parsed markup.

`parseParagraph`

Description

public Parser::parseParagraph(
    string $text
): string

Parses only inline elements in the text.

Parameters

text

A UTF-8 encoded string of text to parse.

Return Values

Returns a UTF-8 encoded string of parsed markup.

`getContextId`

Description

public Parser::getContextId(
): string

Get the identifier for this rendering context.

Return Values

Returns a string containing the context ID.

`setContextId`

Description

public Parser::setContextId(
    string $string
): string

Set the identifier for this rendering context.

Parameters

string

A UTF-8 encoded string of text to use as the identifier. Any occurrences of the characters &, <, >, ", and (space) will be removed from the string.

Return Values

Returns a string containing the new context ID.
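
The character removal described above can be pictured with the following minimal sketch. The function sanitizeContextId() is a hypothetical stand-alone helper written for this illustration; it is not part of the library and does not claim to match the parser's internal implementation.

```php
<?php
declare(strict_types=1);

// Hypothetical helper, not part of the library: mimics the documented
// clean-up of context IDs by removing &, <, >, " and space characters.
function sanitizeContextId(string $id): string
{
    return str_replace(['&', '<', '>', '"', ' '], '', $id);
}

echo sanitizeContextId('post 42 "draft"'), "\n"; // prints post42draft
```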

Properties

`html5`

Description

public bool Parser::html5 = false;

Whether to enable HTML5 output instead of HTML4.

`maximumNestingLevel`

Description

public int Parser::maximumNestingLevel = 32;

The maximum level of nested elements to parse.

`maximumNestingLevelThrow`

Description

public bool Parser::maximumNestingLevelThrow = false;

Whether to throw an exception if the maximum nesting level is exceeded.

`maximumExecutionTime`

Description

public float Parser::maximumExecutionTime = 10.0;

The maximum execution time for parsing in seconds.

`maximumExecutionTimeThrow`

Description

public bool Parser::maximumExecutionTimeThrow = false;

Whether to throw an exception if the maximum execution time is exceeded.

`convertTabsToSpaces`

Description

public bool Parser::convertTabsToSpaces = false;

Whether to convert all tabs into 1-4 spaces before parsing.
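
A tab becomes between one and four spaces because it advances the text to the next tab stop. The sketch below shows one way this could work, assuming tab stops every four columns as described by CommonMark. The function expandTabs() is hypothetical and is not the parser's own code.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: expand each tab to the next tab stop (every 4 columns).
// Assumes valid UTF-8 input, the only encoding the parser supports.
function expandTabs(string $line, int $tabStop = 4): string
{
    $out = '';
    $column = 0;
    // Split into UTF-8 characters without requiring the mbstring extension.
    foreach (preg_split('//u', $line, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if ($char === "\t") {
            $spaces = $tabStop - ($column % $tabStop); // always 1 to 4
            $out .= str_repeat(' ', $spaces);
            $column += $spaces;
        } else {
            $out .= $char;
            $column++;
        }
    }
    return $out;
}
```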

`keepListStartNumber`

Description

public bool Markdown::keepListStartNumber = true;

Whether to keep the starting numbers of ordered lists.

`keepReversedList`

Description

public bool Markdown::keepReversedList = false;

Whether to enable ordered lists with descending numbers.

`headlineAnchors`

Description

public bool Markdown::headlineAnchors = false;

Whether to add GitHub-style anchors when rendering headings.

`renderLazyImages`

Description

public bool Markdown::renderLazyImages = false;

Whether to render images with a deferred loading attribute.

`enableImageDimensions`

Description

public bool Markdown::enableImageDimensions = true;

Whether to enable extended syntax for image dimensions.

`enableNewlines`

Description

public bool GithubMarkdown::enableNewlines = false;
public bool GitlabMarkdown::enableNewlines = false;

Whether to convert all newlines in the text to <br/> tags.

`renderCheckboxInputs`

Description

public bool GithubMarkdown::renderCheckboxInputs = false;
public bool GitlabMarkdown::renderCheckboxInputs = false;

Whether to render task items as inputs instead of emoji.

`disallowedRawHTML`

Description

public bool GithubMarkdown::disallowedRawHTML = true;

Whether to enable section 6.11 of the GFM specification.

`renderFrontMatter`

Description

public bool GitlabMarkdown::renderFrontMatter = true;

Whether to render front matter blocks as code.

`renderOrderedToc`

Description

public bool GitlabMarkdown::renderOrderedToc = false;

Whether to render the table of contents as an ordered list.

`renderLazyMedia`

Description

public bool GitlabMarkdown::renderLazyMedia = false;
public bool ChyrpMarkdown::renderLazyMedia = false;

Whether to render video and audio with a deferred loading attribute.

Security considerations

By design, Markdown allows HTML to be included within the Markdown text, meaning that the input may contain JavaScript and CSS styles. This makes Markdown very flexible for creating output that is not limited by the Markdown syntax, but it comes with a security risk if you are parsing untrusted input (see XSS for an overview).

The GitHub-Flavored Markdown specification includes an extension to CommonMark, Disallowed Raw HTML (section 6.11), which defines a subset of raw HTML to be filtered and rendered as text in the output. In default configuration, this parser implements section 6.11 of the GFM specification when parsing with the GithubMarkdown class and classes that extend it.
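
To give a feel for what this filter does, here is a simplified stand-alone sketch. Section 6.11 of the GFM specification lists the tags title, textarea, style, xmp, iframe, noembed, noframes, script, and plaintext, and filters them by replacing the leading < with the entity &lt;. The function filterDisallowedRawHtml() is hypothetical; it illustrates the idea and is not the parser's implementation.

```php
<?php
declare(strict_types=1);

// Hypothetical sketch of the GFM "Disallowed Raw HTML" idea: escape the
// leading "<" of the listed tags (opening or closing), case-insensitively.
function filterDisallowedRawHtml(string $html): string
{
    $tags = 'title|textarea|style|xmp|iframe|noembed|noframes|script|plaintext';
    // Match "<" only when followed by an optional "/", a listed tag name,
    // and then whitespace, "/", ">" or the end of the string.
    $pattern = '~<(?=/?(?:' . $tags . ')(?:[\s/>]|$))~i';
    return preg_replace($pattern, '&lt;', $html);
}
```

Tags that are not on the list, such as <b> or <em>, pass through unchanged, which is why a dedicated HTML sanitizer is still advisable for untrusted input.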

If you are parsing user input or any other type of untrusted input, you are strongly advised to process the resulting HTML with tools like HTML Purifier that filter out all elements which you have chosen to disallow.

Extended image syntax

By default, LinkTrait enables an extension to the Markdown syntax for specifying the intrinsic dimensions of an image. The HTML width and height attributes can be specified as ![title](url){width} or ![title](url){width:height}, where width and height are integers between 1 and 999999999. The value 0 is valid but ignored. To disable this extended syntax, set the enableImageDimensions property to false.
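
For example, ![Logo](logo.png){200:100} requests a width of 200 and a height of 100. The sketch below shows how such a suffix might be recognized. The function parseImageDimensions() is hypothetical, and treating 0 as "no attribute" is an assumption drawn from the statement that 0 is valid but ignored.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: parse a "{width}" or "{width:height}" suffix.
// Returns null if the suffix does not match the extended syntax.
function parseImageDimensions(string $suffix): ?array
{
    // One to nine digits covers the documented range up to 999999999.
    if (!preg_match('/^\{(\d{1,9})(?::(\d{1,9}))?\}$/', $suffix, $m)) {
        return null;
    }
    $width = (int) $m[1];
    $height = isset($m[2]) ? (int) $m[2] : 0;
    return [
        // Assumption: a value of 0 is accepted but produces no attribute.
        'width' => $width > 0 ? $width : null,
        'height' => $height > 0 ? $height : null,
    ];
}
```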

Use of special characters

The parser inserts some special characters into text during processing: the substitute character (Unicode codepoint U+001A) and the object replacement character (Unicode codepoint U+FFFC). These characters are extremely unlikely to appear in Markdown input text and are therefore deemed safe for internal use. The characters are removed from output text.

Extending the language

Markdown consists of two types of language elements - let's call them block and inline elements, similar to what you have in HTML with <div> and <span>. Block elements are normally spread over several lines and are separated by blank lines. The most basic block element is a paragraph (<p>). Inline elements are elements that are added inside of block elements i.e. inside of text.

This Markdown parser allows you to extend the Markdown language by changing the behavior of existing elements and also adding new block and inline elements. You do this by extending from the parser class and adding/overriding class methods and properties. For the different element types there are different ways to extend them, as you will see in the following sections.

Adding block elements

The Markdown text is parsed line by line, and each non-empty line is identified as one of the block element types. To identify a line as the beginning of a block element, the parser calls all protected class methods whose names begin with identify. An identify method returns true if it has identified the block element it is responsible for, or false if the line does not match its requirements.

Parsing a block element is done in three steps:

Identifying the method responsible for parsing a block, by calling all detected identify{blockName}() methods until one returns true.
Consuming all the lines belonging to a block, by iterating over the lines starting from the identified line until an end condition occurs. This step is implemented by a method named consume{blockName}(), where {blockName} is the same name as used for the identify method above. The consume method takes the lines array and the number of the current line. It returns two values: an array representing the block element in the abstract syntax tree of the Markdown document, and the number of the next line to parse. In the abstract syntax array the first element is the name of the element; you are free to define all other array elements.
Rendering the element. After all blocks have been consumed, each block is rendered using the method render{elementName}() where elementName refers to the name of the element in the abstract syntax tree.
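
The three steps above can be sketched with a toy, self-contained parser. ToyBlockParser and its "!!! " note block are invented for this illustration. The sketch mirrors the identify/consume/render naming convention but does not extend the library's Parser class, whose real method signatures may differ.

```php
<?php
declare(strict_types=1);

// Toy line-based parser for illustration only; not the library's Parser.
class ToyBlockParser
{
    public function parse(string $text): string
    {
        $lines = explode("\n", $text);
        $total = count($lines);
        $blocks = [];
        $i = 0;
        while ($i < $total) {
            if (trim($lines[$i]) === '') { // blank lines separate blocks
                $i++;
                continue;
            }
            // Step 1: identify. Step 2: consume, which returns [block, next line].
            $name = $this->detectBlockType($lines[$i]);
            [$block, $i] = $this->{'consume' . $name}($lines, $i);
            $blocks[] = $block;
        }
        // Step 3: render each block by the element name stored at index 0.
        $html = '';
        foreach ($blocks as $block) {
            $html .= $this->{'render' . ucfirst($block[0])}($block);
        }
        return $html;
    }

    private function detectBlockType(string $line): string
    {
        $methods = get_class_methods($this);
        sort($methods); // identify methods are tried in alphabetical order
        foreach ($methods as $method) {
            if (str_starts_with($method, 'identify') && $this->$method($line)) {
                return substr($method, 8);
            }
        }
        return 'Paragraph'; // fallback block type
    }

    protected function identifyNote(string $line): bool
    {
        return str_starts_with($line, '!!! ');
    }

    protected function consumeNote(array $lines, int $current): array
    {
        $content = [];
        $total = count($lines);
        for ($i = $current; $i < $total && $this->identifyNote($lines[$i]); $i++) {
            $content[] = substr($lines[$i], 4);
        }
        return [['note', 'content' => implode(' ', $content)], $i];
    }

    protected function renderNote(array $block): string
    {
        return '<aside>' . htmlspecialchars($block['content']) . "</aside>\n";
    }

    protected function consumeParagraph(array $lines, int $current): array
    {
        $content = [];
        $total = count($lines);
        for ($i = $current; $i < $total && trim($lines[$i]) !== ''; $i++) {
            $content[] = trim($lines[$i]);
        }
        return [['paragraph', 'content' => implode(' ', $content)], $i];
    }

    protected function renderParagraph(array $block): string
    {
        return '<p>' . htmlspecialchars($block['content']) . "</p>\n";
    }
}
```

Adding another block type to this toy only requires three more methods that follow the same naming pattern, which is the property the real parser relies on when traits contribute block elements.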

Adding inline elements

Adding inline elements is done differently from block elements because they are parsed using string markers in the text. An inline element is identified by a marker of one or more characters that marks the possible beginning of an inline element (e.g. [ marks the possible beginning of a link or ` marks possible inline code).

Parsing an inline element is done in two steps:

Parsing the element. Parsing methods for inline elements are protected and have names beginning with parse. Additionally, a matching method suffixed with Markers is needed to register a parse method for one or more markers, e.g. parseEscape() and parseEscapeMarkers(). The parse method is called when any of its registered markers is found in the text. As an argument, the parse method takes the text starting at the position of the marker. The parse method returns an array containing an element to be added to the abstract syntax tree and the offset of the text it has parsed from the input. All text up to this offset is removed from the Markdown before the search continues for the next marker.
Rendering the element. Each element is rendered using the method render{elementName}() where elementName refers to the name of the element in the abstract syntax tree.
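
The marker mechanism can be sketched with another toy class. ToyInlineParser, its convert() method, and the ~~ strike element are invented for this illustration. The sketch follows the parse/Markers/render naming convention described above but is not the library's implementation.

```php
<?php
declare(strict_types=1);

// Toy marker-driven inline parser for illustration only.
class ToyInlineParser
{
    /** @var array<string, string> marker => name of its parse method */
    private array $markers = [];

    public function __construct()
    {
        // Register each parseX() method through its parseXMarkers() companion.
        foreach (get_class_methods($this) as $method) {
            if (str_starts_with($method, 'parse') && str_ends_with($method, 'Markers')) {
                foreach ($this->$method() as $marker) {
                    $this->markers[$marker] = substr($method, 0, -7);
                }
            }
        }
    }

    public function convert(string $text): string
    {
        $html = '';
        while ($text !== '') {
            // Find the earliest registered marker in the remaining text.
            $pos = null;
            $parser = null;
            foreach ($this->markers as $marker => $method) {
                $found = strpos($text, $marker);
                if ($found !== false && ($pos === null || $found < $pos)) {
                    $pos = $found;
                    $parser = $method;
                }
            }
            if ($pos === null) { // no markers left: the rest is plain text
                $html .= htmlspecialchars($text);
                break;
            }
            $html .= htmlspecialchars(substr($text, 0, $pos));
            // The parse method receives the text starting at the marker and
            // returns [element, offset of the text it has parsed].
            [$element, $offset] = $this->$parser(substr($text, $pos));
            $html .= $this->{'render' . ucfirst($element[0])}($element);
            $text = substr($text, $pos + $offset);
        }
        return $html;
    }

    protected function parseStrikeMarkers(): array
    {
        return ['~~'];
    }

    protected function parseStrike(string $text): array
    {
        if (preg_match('/^~~(.+?)~~/', $text, $m)) {
            return [['strike', 'content' => $m[1]], strlen($m[0])];
        }
        // Not a valid element: emit the first marker character as text.
        return [['text', 'content' => $text[0]], 1];
    }

    protected function renderStrike(array $element): string
    {
        return '<del>' . htmlspecialchars($element['content']) . '</del>';
    }

    protected function renderText(array $element): string
    {
        return htmlspecialchars($element['content']);
    }
}
```

Returning a plain text element with an offset of 1 when the marker turns out not to start a valid element lets the search continue without losing input.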

Composing your own Markdown flavor

This Markdown parser is composed of traits so it is very easy to create your own Markdown flavor by adding and/or removing the single feature traits.

Designing your Markdown flavor consists of four steps:

Select a base class to extend;
Select language feature traits;
Define escapable characters;
Optionally add custom rendering behavior.

Select a base class

If you want to extend a flavor and add features, you can use one of the existing classes as your base class. If you want to define a subset of the Markdown language, i.e. remove some of the features, your class must extend Parser directly.

Select language feature traits

In general, just adding traits with use is enough. During parsing, block identifiers added by traits are sorted and called in alphabetical order. This can be a problem if you create a trait to parse a block type that must be identified early. You can override the alphabetical order by defining the property blockPriorities in your Markdown flavor and supplying a predefined call order for block identifier methods. Any methods detected at runtime that are not listed in the predefined call order are called in alphabetical order after all predefined methods have been called.
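
The resulting call order can be pictured with the following stand-alone sketch. The function orderBlockIdentifiers() and the method names passed to it are hypothetical; the exact format the parser expects for blockPriorities is not shown here, so this only illustrates the ordering rule.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: predefined identifiers first (in the given order),
// then all remaining detected identifiers in alphabetical order.
function orderBlockIdentifiers(array $detected, array $priorities): array
{
    // Keep only priorities that were actually detected, preserving their order.
    $first = array_values(array_intersect($priorities, $detected));
    // Everything else is sorted alphabetically and appended.
    $rest = array_values(array_diff($detected, $priorities));
    sort($rest, SORT_STRING);
    return array_merge($first, $rest);
}

print_r(orderBlockIdentifiers(
    ['identifyUl', 'identifyCode', 'identifyFrontMatter', 'identifyHr'],
    ['identifyFrontMatter', 'identifyHr']
));
```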

If you use HeadlineTrait, LinkTrait, or FootnoteTrait it may be useful to implement prepare() to reset variables before parsing to ensure you get a reusable parser object.

Define escapable characters

Depending on the language features you have chosen to implement, a different set of characters must be defined as escapable using a backslash (\) for literal use in Markdown text. The parser defines only backslash as escapable (\\) initially.

Add custom rendering behavior

Optionally you can adjust rendering behavior by overriding some methods. Refer to the consumeParagraph() method of the various Markdown flavors for inspiration on different rules defining which elements are allowed to interrupt a paragraph.

Acknowledgements

Carsten Brandt would like to thank @erusev for creating Parsedown which heavily influenced this work and provided the idea of the line based parsing approach.

Authors

This software was created by the following people:

cebe/markdown: Carsten Brandt
xenocrat/chyrp-markdown: Daniel Pimley

License

This software is open source and licensed under the MIT License. See LICENSE for details.
