dealnews/datocms-html-to-structured-text

Convert HTML to DatoCMS Structured Text (DAST) format

Installs: 0

Dependents: 0

Suggesters: 0

Security: 0

Stars: 0

Watchers: 0

Forks: 0

Open Issues: 0

pkg:composer/dealnews/datocms-html-to-structured-text

0.1.0 2026-02-23 22:10 UTC

This package is auto-updated.

Last update: 2026-02-23 22:33:03 UTC


README

Convert HTML to DatoCMS Structured Text (DAST format). PHP port of the official JavaScript library.

Requirements

  • PHP 8.2+
  • DOM extension
  • libxml extension
  • Composer

Installation

composer require dealnews/datocms-html-to-structured-text

Basic Usage

<?php
require_once 'vendor/autoload.php';

use DealNews\HtmlToStructuredText\Converter;

// Create converter instance
$converter = new Converter();

// Simple HTML
$html = '<h1>DatoCMS</h1><p>The best <strong>headless CMS</strong>.</p>';

$dast = $converter->convert($html);
// Returns:
// [
//     'schema' => 'dast',
//     'document' => [
//         'type' => 'root',
//         'children' => [...]
//     ]
// ]

Features

  • ✅ Converts HTML to valid DAST documents
  • ✅ Supports all standard HTML elements
  • ✅ Custom handlers for specialized conversions
  • ✅ DOM preprocessing hooks
  • ✅ Configurable allowed blocks, marks, and heading levels
  • ✅ Mark extraction from inline CSS styles
  • ✅ URL resolution with <base> tag support
  • ✅ Type-safe with comprehensive PHPDoc

Supported Elements

Block Elements

HTML DAST Node Notes
<h1> - <h6> heading Level extracted from tag
<p> paragraph
<ul>, <ol> list Style: bulleted/numbered
<li> listItem
<blockquote> blockquote
<pre>, <code> code Language from class attribute
<hr> thematicBreak

Inline Elements

HTML Mark Notes
<strong>, <b> strong
<em>, <i> emphasis
<u> underline
<s>, <strike> strikethrough
<mark> highlight
<code> (inline) code In paragraph context
<a> link With URL and optional meta
<br> span with \n

Ignored Elements

Scripts, styles, and media elements are ignored: <script>, <style>, <video>, <audio>, <iframe>, <embed>

Advanced Usage

Custom Handlers

Override default conversion for specific elements:

use DealNews\HtmlToStructuredText\Converter;
use DealNews\HtmlToStructuredText\Options;
use DealNews\HtmlToStructuredText\Handlers;

$converter = new Converter();
$options = new Options();

// Custom h1 handler - adds prefix to all h1 headings
$options->handlers['h1'] = function (
    callable $create_node,
    \DOMNode $node,
    $context
) {
    // Use default handler
    $result = Handlers::heading($create_node, $node, $context);
    
    // Modify result
    if (isset($result['children'][0]['value'])) {
        $result['children'][0]['value'] = '' . $result['children'][0]['value'];
    }
    
    return $result;
};

$html = '<h1>Important</h1>';
$dast = $converter->convert($html, $options);
// H1 will have "★ Important" as text

Preprocessing

Modify the DOM before conversion:

$options = new Options();

// Convert all <div> tags to <p> tags
$options->preprocess = function (\DOMDocument $doc): void {
    $divs = [];
    foreach ($doc->getElementsByTagName('div') as $div) {
        $divs[] = $div;
    }
    
    foreach ($divs as $div) {
        $p = $doc->createElement('p');
        while ($div->firstChild) {
            $p->appendChild($div->firstChild);
        }
        $div->parentNode->replaceChild($p, $div);
    }
};

$html = '<div>Content</div>';
$dast = $converter->convert($html, $options);
// Div becomes paragraph in DAST

Configuring Allowed Blocks

Control which block types are allowed:

$options = new Options();
$options->allowed_blocks = ['paragraph', 'list']; // Only paragraphs and lists

$html = '<h1>Title</h1><p>Text</p>';
$dast = $converter->convert($html, $options);
// H1 will be converted to paragraph

Configuring Allowed Marks

Control which text marks are allowed:

$options = new Options();
$options->allowed_marks = ['strong']; // Only bold

$html = '<p><strong>Bold</strong> and <em>italic</em></p>';
$dast = $converter->convert($html, $options);
// Only strong mark will be applied, emphasis ignored

Configuring Heading Levels

Control which heading levels are preserved:

$options = new Options();
$options->allowed_heading_levels = [1, 2]; // Only H1 and H2

$html = '<h1>H1</h1><h3>H3</h3>';
$dast = $converter->convert($html, $options);
// H3 will be converted to paragraph

Options Reference

Options Class

class Options {
    // Whether to preserve newlines in text
    public bool $newlines = false;
    
    // Custom handler overrides
    public array $handlers = [];
    
    // Preprocessing function
    public $preprocess = null;
    
    // Allowed block types
    public array $allowed_blocks = [
        'blockquote', 'code', 'heading', 'link', 'list'
    ];
    
    // Allowed mark types
    public array $allowed_marks = [
        'strong', 'code', 'emphasis', 'underline', 
        'strikethrough', 'highlight'
    ];
    
    // Allowed heading levels (1-6)
    public array $allowed_heading_levels = [1, 2, 3, 4, 5, 6];
}

API Reference

Converter::convert(string $html, ?Options $options = null): ?array

Converts HTML string to DAST document.

Parameters:

  • $html - HTML string to convert
  • $options - Optional conversion options

Returns: DAST document array or null if empty

Throws: ConversionError if conversion fails

Converter::convertDocument(\DOMDocument $doc, ?Options $options = null): ?array

Converts a DOMDocument to DAST (for pre-parsed HTML).

Parameters:

  • $doc - DOMDocument to convert
  • $options - Optional conversion options

Returns: DAST document array or null if empty

Throws: ConversionError if conversion fails

Special Features

Code Block Language Detection

The library extracts programming language from code block class names:

$html = '<pre><code class="language-javascript">const x = 1;</code></pre>';
$dast = $converter->convert($html);
// Result will have: ['type' => 'code', 'language' => 'javascript', 'code' => 'const x = 1;']

Default prefix is language- but can be customized in context.

Link Meta Extraction

Link meta attributes (target, rel, title) are extracted:

$html = '<a href="https://example.com" target="_blank" rel="noopener">Link</a>';
$dast = $converter->convert($html);
// Result will have meta array: [['id' => 'target', 'value' => '_blank'], ...]

Inline Style Mark Extraction

The library can extract marks from inline CSS styles:

$html = '<span style="font-weight: bold">Bold via style</span>';
$dast = $converter->convert($html);
// Creates span with strong mark

Supported style properties:

  • font-weight: bold or font-weight > 400strong
  • font-style: italicemphasis
  • text-decoration: underlineunderline

URL Resolution with Base Tag

The <base> tag is respected for relative URL resolution:

$html = '<base href="https://example.com/"><a href="/page">Link</a>';
$dast = $converter->convert($html);
// Link URL will be resolved to: https://example.com/page

Error Handling

The library throws ConversionError exceptions when conversion fails:

use DealNews\HtmlToStructuredText\ConversionError;

try {
    $dast = $converter->convert($html);
} catch (ConversionError $e) {
    echo "Conversion failed: " . $e->getMessage();
    $node = $e->getNode(); // Get problematic DOM node if available
}

Edge Cases

Whitespace Handling

  • Single whitespace-only spans are removed when wrapped
  • Newlines in text are preserved if $options->newlines = true
  • In headings, newlines are converted to spaces (headings can't have line breaks)

Nested Lists

Nested lists are fully supported:

$html = '<ul><li>Item<ul><li>Nested</li></ul></li></ul>';
// Converts correctly to nested list structure

Mixed Inline/Block Content

Links and other hybrid elements are handled correctly:

$html = '<a href="#"><span>Inline</span><p>Block</p></a>';
// Properly splits into separate nodes

Differences from JavaScript Version

  1. No Promises: PHP handlers return directly (synchronous)
  2. No Hast: Works directly with PHP DOMDocument instead of intermediate tree
  3. Array Structure: DAST nodes are arrays (not objects)
  4. Error Handling: Uses exceptions instead of rejection

Development

Running Tests

composer install
./vendor/bin/phpunit

Current test coverage: 86%+

Running Examples

php examples/basic.php
php examples/custom_handlers.php
php examples/preprocessing.php

License

BSD 3-Clause License - see LICENSE file for details

Credits

This is a PHP port of the official DatoCMS HTML to Structured Text JavaScript library.

Ported and maintained by DealNews.

Related Projects