tekintian/html-cleaner

A PHP tool for cleaning HTML generated by the Typora editor. It removes redundant spaces and useless attributes and optimizes the HTML structure while preserving the content.


1.0.0 2025-11-15 05:01 UTC

This package is auto-updated.

Last update: 2025-11-17 08:47:57 UTC



Features

  • Remove Inline Styles: Eliminates all inline style attributes
  • Clean Typora Attributes: Removes Typora-specific attributes like cid, mdtype, md-inline, etc.
  • Optimize Class Attributes: Filters out md-* classes while preserving useful ones
  • Remove Empty Tags: Cleans up empty span and div tags
  • Simplify Tag Structure: Optimizes nested tag structures
  • Clean Whitespace: Removes redundant spaces and normalizes whitespace
  • Process Code Blocks: Preserves language classes and removes br tags in pre tags
  • External Link Handling: Adds target="_blank" to external links
  • Auto Tag Links: Automatically adds links to specified keywords
  • Environment Aware: Smart debug output control based on environment (dev, testing, prod)
  • Configurable Behavior: Environment variables for customizing behavior
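To make the attribute-related features concrete, here is a minimal sketch of how inline styles, Typora attributes, and md-* classes could be stripped with regular expressions. The function name sketchCleanAttributes is hypothetical and the patterns are assumptions; this is not the library's own code, and it only handles double-quoted attribute values.

```php
<?php
// Illustrative sketch only; not the library's actual implementation.

function sketchCleanAttributes(string $html): string
{
    // Drop style="..." and the Typora attributes named in the feature list.
    $html = preg_replace('/\s+(?:style|cid|mdtype|md-inline)\s*=\s*"[^"]*"/i', '', $html);

    // Keep only class names that do not start with "md-".
    return preg_replace_callback(
        '/\s+class\s*=\s*"([^"]*)"/i',
        static function (array $m): string {
            $names = preg_split('/\s+/', trim($m[1]), -1, PREG_SPLIT_NO_EMPTY);
            $kept = array_filter($names, static function (string $name): bool {
                return strpos($name, 'md-') !== 0;
            });
            // Remove the whole attribute when no class survives.
            return $kept ? ' class="' . implode(' ', $kept) . '"' : '';
        },
        $html
    );
}

echo sketchCleanAttributes('<p style="color: red;" cid="n1" class="md-p note">Hi</p>'), PHP_EOL;
// prints: <p class="note">Hi</p>
```

A regex approach like this is fast but does not understand HTML, so text that merely looks like an attribute would be altered as well.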

Installation

Composer Installation

composer require tekintian/html-cleaner

Manual Installation

Download the package and include the autoloader:

require_once 'vendor/autoload.php';

Usage

Basic Usage

use tekintian\HtmlCleaner\HtmlCleaner;

// Clean HTML string
$dirtyHtml = '<p style="color: red;">Content</p>';
$cleanHtml = HtmlCleaner::clean($dirtyHtml);

// Clean HTML file
$cleanedHtml = HtmlCleaner::cleanFile('input.html', 'output.html');

With Custom Tag Links

use tekintian\HtmlCleaner\HtmlCleaner;

$tagLinks = [
    'PHP' => 'https://www.php.net/manual/en/',
    'JavaScript' => 'https://developer.mozilla.org/en-US/docs/Web/JavaScript',
    'Python' => 'https://docs.python.org/3/',
];

$cleanedHtml = HtmlCleaner::clean($html, $tagLinks);
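As a rough picture of what keyword linking involves, the sketch below wraps each keyword found in text in a link, while leaving tag markup and existing links untouched. sketchAddTagLinks is a hypothetical name and the matching rules (case-sensitive, whole words, every occurrence) are assumptions; the library may behave differently.

```php
<?php
// Illustrative sketch only; not the library's actual implementation.

function sketchAddTagLinks(string $html, array $tagLinks): string
{
    if ($tagLinks === []) {
        return $html;
    }

    // One alternation pattern, so markup inserted for one keyword is never re-scanned.
    $quoted = array_map(
        static function ($keyword): string {
            return preg_quote((string) $keyword, '/');
        },
        array_keys($tagLinks)
    );
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/';

    // With DELIM_CAPTURE, even indexes hold text and odd indexes hold tags.
    $parts = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $insideAnchor = false;

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            if (preg_match('/^<a\b/i', $part)) {
                $insideAnchor = true;
            } elseif (preg_match('/^<\/a\s*>/i', $part)) {
                $insideAnchor = false;
            }
            continue;
        }
        if ($insideAnchor) {
            continue; // Never nest a link inside an existing link.
        }
        $parts[$i] = preg_replace_callback(
            $pattern,
            static function (array $m) use ($tagLinks): string {
                $url = htmlspecialchars((string) $tagLinks[$m[0]], ENT_QUOTES);
                return '<a href="' . $url . '">' . $m[0] . '</a>';
            },
            $part
        );
    }

    return implode('', $parts);
}
```

The sketch does not skip pre or code blocks, which a real implementation would probably want to leave alone.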

Using Individual Processing Methods

use tekintian\HtmlCleaner\HtmlCleaner;

// Custom processing pipeline
$html = HtmlCleaner::unifiedAttributeProcessing($dirtyHtml);
$html = HtmlCleaner::removeEmptyTags($html);
$html = HtmlCleaner::simplifyTags($html);
$html = HtmlCleaner::cleanWhitespace($html);

// Skip specific steps if not needed
// $html = HtmlCleaner::removeUselessAttributes($html);
// $html = HtmlCleaner::processPreTags($html);

// Apply custom processing between steps
$html = str_replace('<br>', '<br />', $html);

$cleanedHtml = $html;

Debug Output Control

// Set APP_DEBUG environment variable to control debug output
putenv('APP_DEBUG=true'); // Shows debug output
// putenv('APP_DEBUG=false'); // No debug output (default)

// Alternative: Use APP_ENV for backward compatibility
putenv('APP_ENV=dev'); // Also shows debug output

use tekintian\HtmlCleaner\HtmlCleaner;

$html = HtmlCleaner::clean($dirtyHtml);
// With debug enabled: Shows processing progress
// With debug disabled: Silent operation

API Reference

Main Methods

HtmlCleaner::clean(string $html, array|null $tagLinks = null): string

Cleans HTML content and returns the cleaned version.

Parameters:

  • $html: HTML content to clean
  • $tagLinks: Optional tag link configuration array [keyword => URL]

Returns: Cleaned HTML content

HtmlCleaner::cleanFile(string $inputFile, string|null $outputFile = null, array|null $tagLinks = null): string

Cleans an HTML file and saves the result.

Parameters:

  • $inputFile: Input file path
  • $outputFile: Output file path (auto-generated if null)
  • $tagLinks: Optional tag link configuration array [keyword => URL]

Returns: Cleaned HTML content

Throws: Exception if file operations fail
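The read, clean, write flow described for cleanFile can be sketched as follows. Everything here is an assumption for illustration: sketchCleanFile is a hypothetical function, the cleaner is passed in as a callable so the example runs on its own, and the "_cleaned" suffix for the auto-generated output name is invented because the README does not state the real naming rule.

```php
<?php
// Illustrative sketch only; not the library's actual implementation.

function sketchCleanFile(string $inputFile, ?string $outputFile, callable $cleaner): string
{
    if (!is_file($inputFile) || !is_readable($inputFile)) {
        throw new RuntimeException("Cannot read input file: {$inputFile}");
    }

    $html = file_get_contents($inputFile);
    if ($html === false) {
        throw new RuntimeException("Failed to read input file: {$inputFile}");
    }

    $cleaned = $cleaner($html);

    if ($outputFile === null) {
        // Hypothetical naming rule: page.html becomes page_cleaned.html.
        $info = pathinfo($inputFile);
        $outputFile = $info['dirname'] . DIRECTORY_SEPARATOR
            . $info['filename'] . '_cleaned.' . ($info['extension'] ?? 'html');
    }

    if (file_put_contents($outputFile, $cleaned) === false) {
        throw new RuntimeException("Failed to write output file: {$outputFile}");
    }

    return $cleaned;
}
```

Because RuntimeException extends Exception, a caller that wraps the call in try/catch for Exception handles both the read and the write failure.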

Individual Processing Methods

HtmlCleaner::unifiedAttributeProcessing(string $html): string

Processes HTML attributes in one combined pass instead of several separate loops.

Parameters:

  • $html: HTML content to process

Returns: HTML content with processed attributes

HtmlCleaner::removeEmptyTags(string $html): string

Removes empty span and div tags from HTML content.

Parameters:

  • $html: HTML content to process

Returns: HTML content with empty tags removed
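One way to remove empty span and div tags, including nested ones, is to repeat a regex replacement until nothing changes. The sketch below assumes that "empty" means no content other than whitespace; sketchRemoveEmptyTags is a hypothetical name, not the library's method.

```php
<?php
// Illustrative sketch only; not the library's actual implementation.

function sketchRemoveEmptyTags(string $html): string
{
    // Repeat until stable so that <div><span></span></div> collapses completely.
    do {
        $html = preg_replace('/<(span|div)\b[^>]*>\s*<\/\1>/i', '', $html, -1, $count);
    } while ($count > 0);

    return $html;
}

echo sketchRemoveEmptyTags('<div><span> </span></div><p>Keep</p>'), PHP_EOL;
// prints: <p>Keep</p>
```

The loop matters: a single pass removes the inner span but leaves the now-empty div behind.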

HtmlCleaner::simplifyTags(string $html): string

Simplifies tag structure by optimizing nested tags.

Parameters:

  • $html: HTML content to process

Returns: Simplified HTML content

HtmlCleaner::cleanAllTagSpaces(string $html): string

Cleans all redundant spaces within HTML tags.

Parameters:

  • $html: HTML content to process

Returns: HTML content with cleaned tag spaces

HtmlCleaner::removeUselessAttributes(string $html): string

Removes useless attributes from HTML content.

Parameters:

  • $html: HTML content to process

Returns: HTML content with useless attributes removed

HtmlCleaner::cleanWhitespace(string $html): string

Cleans whitespace characters and normalizes formatting.

Parameters:

  • $html: HTML content to process

Returns: HTML content with cleaned whitespace

HtmlCleaner::processPreTags(string $html): string

Processes pre tags: preserves language class names and removes br tags inside them.

Parameters:

  • $html: HTML content to process

Returns: HTML content with processed pre tags
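The br handling can be pictured as a replacement that is scoped to pre blocks only. In this sketch each br inside a pre becomes a newline, which is an assumption; the README only says that br tags are removed. The name sketchStripBrInPre is hypothetical.

```php
<?php
// Illustrative sketch only; not the library's actual implementation.

function sketchStripBrInPre(string $html): string
{
    return preg_replace_callback(
        '/<pre\b[^>]*>.*?<\/pre>/is',
        static function (array $m): string {
            // Replace <br>, <br/> and <br /> with a newline, inside this <pre> only.
            return preg_replace('/<br\s*\/?>/i', "\n", $m[0]);
        },
        $html
    );
}
```

The opening pre tag is matched but never rewritten, so a class such as language-php passes through unchanged, and br tags outside pre blocks are left alone.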

Processing Steps

The cleaner performs the following operations in sequence:

  1. Unified Attribute Processing: Runs the attribute clean-up loops as one combined pass
  2. Empty Tag Removal: Removes empty span and div tags
  3. Tag Structure Simplification: Optimizes nested tag structures
  4. Space Cleaning: Removes redundant spaces in tags
  5. Useless Attribute Removal: Eliminates empty and unnecessary attributes
  6. Whitespace Normalization: Cleans up whitespace characters
  7. Pre Tag Processing: Handles code blocks and language classes
  8. External Link Processing: Adds target="_blank" to external links
  9. Tag Link Addition: Automatically adds links to specified keywords
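The sequence above is a chain of string-to-string steps, each receiving the previous step's output. A generic runner for such a chain might look like this sketch; sketchRunPipeline and the two stand-in steps are hypothetical and much simpler than the library's own methods.

```php
<?php
// Illustrative sketch only: runs string-to-string steps in the order given.

function sketchRunPipeline(string $html, callable ...$steps): string
{
    foreach ($steps as $step) {
        $html = $step($html);
    }
    return $html;
}

// Two stand-in steps, far simpler than the real ones.
$stripStyles = static function (string $html): string {
    return preg_replace('/\s+style\s*=\s*"[^"]*"/i', '', $html);
};
$dropEmptySpans = static function (string $html): string {
    return preg_replace('/<span\b[^>]*>\s*<\/span>/i', '', $html);
};

$result = sketchRunPipeline(
    '<p style="color: red;">Hi<span style="x"></span></p>',
    $stripStyles,
    $dropEmptySpans
);
echo $result, PHP_EOL;
// prints: <p>Hi</p>
```

Assuming the individual processing methods are public and static, as the API reference suggests, they could be passed in the same way using PHP's array callable form, for example [HtmlCleaner::class, 'removeEmptyTags'].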

Configuration

Environment Variables

  • APP_DEBUG: Set to true or 1 to enable debug output (the recommended switch)
  • APP_ENV: Environment mode (dev, testing, prod); dev and testing also enable debug output, which is kept for backward compatibility
  • HTML_ADD_TAG_LINK: Set to true to enable automatic tag linking
  • HTTP_HOST: Current host for external link detection (default: 'dev.tekin.cn')
  • REMOVE_HTML_SPAN: Set to true to remove all span tags (aggressive mode)
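For reference, boolean flags and the host default listed above could be read from the environment along these lines. The helper names sketchEnvFlag and sketchCurrentHost are hypothetical, and accepting values such as "on" or "yes" is a side effect of FILTER_VALIDATE_BOOLEAN rather than something the README promises.

```php
<?php
// Illustrative sketch only; not the library's actual implementation.

function sketchEnvFlag(string $name, bool $default = false): bool
{
    $value = getenv($name);
    if ($value === false) {
        return $default; // Variable is not set at all.
    }
    // "1", "true", "on" and "yes" count as true; anything else as false.
    return filter_var($value, FILTER_VALIDATE_BOOLEAN);
}

function sketchCurrentHost(): string
{
    $host = getenv('HTTP_HOST');
    // Fall back to the default host named in the README.
    return ($host === false || $host === '') ? 'dev.tekin.cn' : $host;
}

putenv('REMOVE_HTML_SPAN=true');
var_dump(sketchEnvFlag('REMOVE_HTML_SPAN')); // bool(true)
```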

Debug Control Behavior

Setting                  | Debug Output | Use Case
APP_DEBUG=true or 1      | ✅ Enabled   | Development and debugging
APP_ENV=dev or testing   | ✅ Enabled   | Backward compatibility
APP_DEBUG=false or unset | ❌ Disabled  | Production deployment

Note: APP_DEBUG takes precedence over APP_ENV for debug control.
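One reading of this precedence rule is sketched below: if APP_DEBUG is set, it decides on its own, and APP_ENV is consulted only when APP_DEBUG is absent. The function sketchIsDebugEnabled is hypothetical, and the library may resolve edge cases differently.

```php
<?php
// Illustrative sketch only; one possible interpretation of the precedence rule.

function sketchIsDebugEnabled(): bool
{
    $debug = getenv('APP_DEBUG');
    if ($debug !== false && $debug !== '') {
        // APP_DEBUG is set, so APP_ENV is ignored.
        return in_array(strtolower($debug), ['true', '1'], true);
    }

    // Fallback kept for backward compatibility.
    return in_array(getenv('APP_ENV'), ['dev', 'testing'], true);
}

putenv('APP_DEBUG=false');
putenv('APP_ENV=dev');
var_dump(sketchIsDebugEnabled()); // bool(false): APP_DEBUG wins over APP_ENV
```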

Customizing External Link Detection

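Override the getCurrentHost() method to customize external link detection. PHP does not let a subclass override a private method, so this only takes effect if the parent class declares getCurrentHost() as protected or public and calls it through static::.

use tekintian\HtmlCleaner\HtmlCleaner;

class CustomHtmlCleaner extends HtmlCleaner {
    // Must be protected or public; a private method in the parent cannot be overridden.
    protected static function getCurrentHost() {
        return 'your-domain.com'; // Custom host for external link detection
    }
}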
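To show what the host value is used for, here is a minimal sketch of external link detection: a link counts as external when its URL has a host that differs from the current one, and such links receive target="_blank". Both function names are hypothetical and the regex only handles double-quoted href values; the library's own logic is not shown in the README.

```php
<?php
// Illustrative sketch only; not the library's actual implementation.

function sketchIsExternalLink(string $url, string $currentHost): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    // Relative URLs, fragments and mailto: links have no host: treat them as internal.
    if (!is_string($host) || $host === '') {
        return false;
    }
    return strcasecmp($host, $currentHost) !== 0;
}

function sketchMarkExternalLinks(string $html, string $currentHost): string
{
    return preg_replace_callback(
        '/<a\b([^>]*)\bhref="([^"]*)"([^>]*)>/i',
        static function (array $m) use ($currentHost): string {
            $hasTarget = stripos($m[1] . $m[3], 'target=') !== false;
            if ($hasTarget || !sketchIsExternalLink($m[2], $currentHost)) {
                return $m[0]; // Leave internal links and explicit targets untouched.
            }
            return '<a' . $m[1] . 'href="' . $m[2] . '"' . $m[3] . ' target="_blank">';
        },
        $html
    );
}

echo sketchMarkExternalLinks('<a href="https://www.php.net/">PHP</a>', 'dev.tekin.cn'), PHP_EOL;
// prints: <a href="https://www.php.net/" target="_blank">PHP</a>
```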

Performance

The tool is optimized for performance:

  • Efficient Regex Patterns: Uses optimized regular expressions
  • Single Pass Processing: Combines multiple operations where possible
  • Memory Efficient: Processes large files with minimal memory usage

Examples

Before Cleaning

<h1 style="color: red; font-size: 24px;" cid="n0" mdtype="heading">
    <span style="font-weight: bold;" md-inline="plain">Title</span>
</h1>

After Cleaning

<h1>
    <strong>Title</strong>
</h1>

File Structure

html-cleaner/
├── HtmlCleaner.php         # Main cleaner class
├── index.php               # Usage example
├── readme.md               # English documentation
├── readme_zh.md            # Chinese documentation
└── tests/                  # Test files
    ├── 1.html              # Original HTML file
    ├── f1.html             # Template file
    ├── final_cleaned.html  # Cleaned HTML
    └── f1_final.html       # Final template with cleaned content

Testing

Running Tests

The project includes comprehensive unit tests to ensure code quality and functionality. To run the tests:

# Install dependencies (if not already installed)
composer install

# Run all tests
./vendor/bin/phpunit tests/

# Run specific test file
./vendor/bin/phpunit tests/HtmlCleanerTest.php

# Run tests with detailed output
./vendor/bin/phpunit --verbose tests/

Test Coverage

The test suite covers:

  • Basic HTML Cleaning: Core functionality testing
  • Debug Output Control: Environment-based debug behavior
  • Individual Processing Methods: Each public method has dedicated tests
  • File Operations: File input/output handling
  • Environment Variables: Configuration-based behavior
  • Complex HTML Structures: Advanced HTML processing scenarios
  • Error Handling: Exception and edge case testing

Test Environment

Tests are configured to run in PHP 7.2+ environments and include:

  • Environment Management: Proper setup and teardown of environment variables
  • Output Buffering: Testing debug output behavior
  • File System Operations: Temporary file creation and cleanup
  • Mock Data: Comprehensive test cases with various HTML inputs

Continuous Integration

To integrate testing into your development workflow:

# Example GitHub Actions configuration
name: PHP Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup PHP
        uses: shivammathur/setup-php@v2
        with:
          php-version: '7.2'
      - name: Install dependencies
        run: composer install --prefer-dist --no-progress
      - name: Execute tests
        run: ./vendor/bin/phpunit tests/

Test Examples

See the tests/ directory for complete test implementations, including:

  • HtmlCleanerTest.php: Main test class with 23 test methods
  • Test files demonstrating various HTML cleaning scenarios
  • Examples of custom processing pipelines using individual methods

Browser Compatibility

The cleaned HTML is compatible with all modern browsers and maintains semantic structure.

SEO Benefits

  • Reduced File Size: Smaller HTML files load faster
  • Clean Markup: Search engines can better understand content structure
  • Semantic HTML: Preserves meaningful tag structure

License

This project is open source and available under the MIT License.

Support

For issues and feature requests, please visit the GitHub repository.

Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues for discussion.

Changelog

Version 1.1

  • Added comprehensive English documentation
  • Improved code comments and documentation
  • Enhanced tag link functionality
  • Better external link detection

Version 1.0

  • Initial release with core cleaning functionality