tekintian / html-cleaner
A powerful PHP tool for cleaning HTML content generated by Typora editor. Removes redundant spaces, useless attributes, and optimizes HTML structure while preserving content integrity.
pkg:composer/tekintian/html-cleaner
Requires
- php: >=7.2.0
Requires (Dev)
- phpunit/phpunit: ^7.5
Features
- Remove Inline Styles: Eliminates all inline `style` attributes
- Clean Typora Attributes: Removes Typora-specific attributes like `cid`, `mdtype`, `md-inline`, etc.
- Optimize Class Attributes: Filters out `md-*` classes while preserving useful ones
- Remove Empty Tags: Cleans up empty `span` and `div` tags
- Simplify Tag Structure: Optimizes nested tag structures
- Clean Whitespace: Removes redundant spaces and normalizes whitespace
- Process Code Blocks: Preserves language classes and removes `br` tags in `pre` tags
- External Link Handling: Adds `target="_blank"` to external links
- Auto Tag Links: Automatically adds links to specified keywords
- Environment Aware: Smart debug output control based on environment (dev, testing, prod)
- Configurable Behavior: Environment variables for customizing behavior
Installation
Composer Installation
composer require tekintian/html-cleaner
Manual Installation
Download the package and include the autoloader:
require_once 'vendor/autoload.php';
Usage
Basic Usage
    use tekintian\HtmlCleaner\HtmlCleaner;

    // Clean HTML string
    $dirtyHtml = '<p style="color: red;">Content</p>';
    $cleanHtml = HtmlCleaner::clean($dirtyHtml);

    // Clean HTML file
    $cleanedHtml = HtmlCleaner::cleanFile('input.html', 'output.html');
With Custom Tag Links
    use tekintian\HtmlCleaner\HtmlCleaner;

    $tagLinks = [
        'PHP' => 'https://www.php.net/manual/en/',
        'JavaScript' => 'https://developer.mozilla.org/en-US/docs/Web/JavaScript',
        'Python' => 'https://docs.python.org/3/',
    ];

    $cleanedHtml = HtmlCleaner::clean($html, $tagLinks);
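To make the [keyword => URL] mapping concrete, here is a minimal, self-contained sketch of how keyword linking could work. It is not the package's implementation: the function name `addKeywordLinks` is hypothetical, and it assumes every occurrence is linked, that matching is case-sensitive, and that text already inside an `<a>` element is skipped.

```php
<?php
// Hypothetical helper for illustration only; not part of tekintian/html-cleaner.
function addKeywordLinks(string $html, array $tagLinks): string
{
    if (!$tagLinks) {
        return $html;
    }

    // One pattern for all keywords, so inserted URLs are never re-scanned.
    $quoted = array_map(function ($keyword): string {
        return preg_quote((string) $keyword, '/');
    }, array_keys($tagLinks));
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/';

    // Split into tags and text so replacements never touch tag names or attributes.
    $parts = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $insideAnchor = false;

    foreach ($parts as $i => $part) {
        if ($part !== '' && $part[0] === '<') {
            // Track <a>...</a> so existing links are left alone.
            if (preg_match('/^<a\b/i', $part)) {
                $insideAnchor = true;
            } elseif (preg_match('/^<\/a\s*>/i', $part)) {
                $insideAnchor = false;
            }
            continue;
        }
        if ($insideAnchor) {
            continue;
        }
        $parts[$i] = preg_replace_callback($pattern, function (array $m) use ($tagLinks): string {
            $url = htmlspecialchars($tagLinks[$m[0]], ENT_QUOTES);
            return '<a href="' . $url . '">' . $m[0] . '</a>';
        }, $part);
    }

    return implode('', $parts);
}

$links = ['PHP' => 'https://www.php.net/manual/en/'];
echo addKeywordLinks('<p>PHP and <a href="/js">PHP docs</a></p>', $links), PHP_EOL;
// prints <p><a href="https://www.php.net/manual/en/">PHP</a> and <a href="/js">PHP docs</a></p>
```

Splitting on tags first keeps the keyword pattern away from attribute values. The sketch still links keywords inside `pre` or `code` text, which a real implementation may want to exclude.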
Using Individual Processing Methods
    use tekintian\HtmlCleaner\HtmlCleaner;

    // Custom processing pipeline
    $html = HtmlCleaner::unifiedAttributeProcessing($dirtyHtml);
    $html = HtmlCleaner::removeEmptyTags($html);
    $html = HtmlCleaner::simplifyTags($html);
    $html = HtmlCleaner::cleanWhitespace($html);

    // Skip specific steps if not needed
    // $html = HtmlCleaner::removeUselessAttributes($html);
    // $html = HtmlCleaner::processPreTags($html);

    // Apply custom processing between steps
    $html = str_replace('<br>', '<br />', $html);

    $cleanedHtml = $html;
Debug Output Control
    use tekintian\HtmlCleaner\HtmlCleaner;

    // Set the APP_DEBUG environment variable to control debug output
    putenv('APP_DEBUG=true');     // Shows debug output
    // putenv('APP_DEBUG=false'); // No debug output (default)

    // Alternative: use APP_ENV for backward compatibility
    putenv('APP_ENV=dev');        // Also shows debug output

    $html = HtmlCleaner::clean($dirtyHtml);
    // With debug enabled: shows processing progress
    // With debug disabled: silent operation
API Reference
Main Methods
HtmlCleaner::clean(string $html, array|null $tagLinks = null): string
Cleans HTML content and returns the cleaned version.
Parameters:
- $html: HTML content to clean
- $tagLinks: Optional tag link configuration array [keyword => URL]
Returns: Cleaned HTML content
HtmlCleaner::cleanFile(string $inputFile, string|null $outputFile = null, array|null $tagLinks = null): string
Cleans an HTML file and saves the result.
Parameters:
- $inputFile: Input file path
- $outputFile: Output file path (auto-generated if null)
- $tagLinks: Optional tag link configuration array [keyword => URL]
Returns: Cleaned HTML content
Throws: Exception if file operations fail
Individual Processing Methods
HtmlCleaner::unifiedAttributeProcessing(string $html): string
Processes HTML attributes in a unified manner (combining multiple loops).
Parameters:
$html: HTML content to process
Returns: HTML content with processed attributes
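As an illustration of the kind of attribute cleanup described in the feature list, the following sketch strips inline styles, a few Typora attributes, and `md-*` class names using regular expressions. The function `stripTyporaAttributes` is hypothetical, the attribute list is limited to the ones this README names, and the package's own method may work differently.

```php
<?php
// Hypothetical helper for illustration only; not part of tekintian/html-cleaner.
function stripTyporaAttributes(string $html): string
{
    // Drop style="..." and the Typora attributes named in the feature list.
    $html = preg_replace('/\s+(?:style|cid|mdtype|md-inline)\s*=\s*"[^"]*"/i', '', $html);

    // Filter md-* names out of class lists; drop the attribute if nothing is left.
    return preg_replace_callback('/\s+class\s*=\s*"([^"]*)"/i', function (array $m): string {
        $kept = array_filter(
            preg_split('/\s+/', trim($m[1]), -1, PREG_SPLIT_NO_EMPTY),
            function (string $name): bool {
                return strpos($name, 'md-') !== 0;
            }
        );
        return $kept ? ' class="' . implode(' ', $kept) . '"' : '';
    }, $html);
}

$dirty = '<p style="color: red;" cid="n2" mdtype="paragraph" class="md-end-block note">Hi</p>';
echo stripTyporaAttributes($dirty), PHP_EOL;
// prints <p class="note">Hi</p>
```

The sketch only handles double-quoted attribute values and does not distinguish markup from text that happens to look like an attribute.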
HtmlCleaner::removeEmptyTags(string $html): string
Removes empty span and div tags from HTML content.
Parameters:
$html: HTML content to process
Returns: HTML content with empty tags removed
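A minimal sketch of empty-tag removal, assuming that "empty" means the element contains only whitespace. The function `removeEmptySpansAndDivs` is a hypothetical stand-in, not the package's code.

```php
<?php
// Hypothetical helper for illustration only; not part of tekintian/html-cleaner.
function removeEmptySpansAndDivs(string $html): string
{
    // Repeat until stable: removing an inner empty tag can leave its parent empty.
    do {
        $html = preg_replace('/<(span|div)\b[^>]*>\s*<\/\1>/i', '', $html, -1, $count);
    } while ($count > 0);

    return $html;
}

echo removeEmptySpansAndDivs('<div><span> </span></div><p>Kept <span>text</span></p>'), PHP_EOL;
// prints <p>Kept <span>text</span></p>
```

The loop matters for nested wrappers: a single pass removes the inner `span` but leaves the now-empty `div` behind.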
HtmlCleaner::simplifyTags(string $html): string
Simplifies tag structure by optimizing nested tags.
Parameters:
$html: HTML content to process
Returns: Simplified HTML content
HtmlCleaner::cleanAllTagSpaces(string $html): string
Cleans all redundant spaces within HTML tags.
Parameters:
$html: HTML content to process
Returns: HTML content with cleaned tag spaces
HtmlCleaner::removeUselessAttributes(string $html): string
Removes useless attributes from HTML content.
Parameters:
$html: HTML content to process
Returns: HTML content with useless attributes removed
HtmlCleaner::cleanWhitespace(string $html): string
Cleans whitespace characters and normalizes formatting.
Parameters:
$html: HTML content to process
Returns: HTML content with cleaned whitespace
HtmlCleaner::processPreTags(string $html): string
Processes pre tags, preserves language class names and removes br tags.
Parameters:
$html: HTML content to process
Returns: HTML content with processed pre tags
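The following sketch shows one way to handle `br` tags inside `pre` blocks while leaving the rest of the document alone. It makes an assumption the README does not state: each `br` is replaced with a newline so that code lines do not run together, whereas the package may simply delete them. The function name `stripBrInsidePre` is hypothetical.

```php
<?php
// Hypothetical helper for illustration only; not part of tekintian/html-cleaner.
function stripBrInsidePre(string $html): string
{
    // Match each <pre>...</pre> block (non-greedy, across lines) and rewrite only its contents.
    return preg_replace_callback('/<pre\b[^>]*>.*?<\/pre>/is', function (array $m): string {
        // Assumption: <br>, <br/> and <br /> become real newlines.
        return preg_replace('/<br\s*\/?>/i', "\n", $m[0]);
    }, $html);
}

$html = '<p>a<br>b</p><pre class="language-php">line1<br/>line2</pre>';
echo stripBrInsidePre($html), PHP_EOL;
```

The `br` in the paragraph is untouched and the `class="language-php"` attribute on the `pre` tag is preserved because the opening tag is passed through unchanged.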
Processing Steps
The cleaner performs the following operations in sequence:
- Unified Attribute Processing: Combines multiple attribute processing loops
- Empty Tag Removal: Removes empty `span` and `div` tags
- Tag Structure Simplification: Optimizes nested tag structures
- Space Cleaning: Removes redundant spaces in tags
- Useless Attribute Removal: Eliminates empty and unnecessary attributes
- Whitespace Normalization: Cleans up whitespace characters
- Pre Tag Processing: Handles code blocks and language classes
- External Link Processing: Adds `target="_blank"` to external links
- Tag Link Addition: Automatically adds links to specified keywords
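The sequence above is a pipeline of string-to-string steps. The sketch below shows that pattern with three simplified stand-in steps; the closures and the `runPipeline` function are hypothetical and far less thorough than the package's own methods.

```php
<?php
// Hypothetical stand-in steps; each takes an HTML string and returns an HTML string.
$steps = [
    'strip styles' => function (string $html): string {
        return preg_replace('/\s+style\s*=\s*"[^"]*"/i', '', $html);
    },
    'remove empty spans' => function (string $html): string {
        return preg_replace('/<span\b[^>]*>\s*<\/span>/i', '', $html);
    },
    'collapse whitespace' => function (string $html): string {
        return trim(preg_replace('/\s{2,}/', ' ', $html));
    },
];

// Feed the output of each step into the next one, in array order.
function runPipeline(string $html, array $steps): string
{
    return array_reduce($steps, function (string $carry, callable $step): string {
        return $step($carry);
    }, $html);
}

echo runPipeline('<p style="color: red;">Hello   <span></span>world</p>', $steps), PHP_EOL;
// prints <p>Hello world</p>
```

Order matters here: whitespace is collapsed after empty tags are removed, so the gap left behind by a deleted tag is normalized in the same run.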
Configuration
Environment Variables
- APP_DEBUG: Set to `true` or `1` to enable debug output (recommended)
- APP_ENV: Environment mode (dev, testing, prod); also controls debug output for backward compatibility
- HTML_ADD_TAG_LINK: Set to `true` to enable automatic tag linking
- HTTP_HOST: Current host for external link detection (default: 'dev.tekin.cn')
- REMOVE_HTML_SPAN: Set to `true` to remove all span tags (aggressive mode)
Debug Control Behavior
| Setting | Debug Output | Use Case |
|---|---|---|
| `APP_DEBUG=true` or `1` | ✅ Enabled | Development and debugging |
| `APP_ENV=dev` or `testing` | ✅ Enabled | Backward compatibility |
| `APP_DEBUG=false` or unset | ❌ Disabled | Production deployment |
Note: APP_DEBUG takes precedence over APP_ENV for debug control.
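The precedence rule can be sketched as a small function. This is one reading of the table, not the package's code: `isDebugEnabled` is a hypothetical name, and the sketch assumes that any non-empty APP_DEBUG value decides the outcome on its own, with APP_ENV consulted only when APP_DEBUG is unset or empty.

```php
<?php
// Hypothetical helper for illustration only; not part of tekintian/html-cleaner.
function isDebugEnabled(): bool
{
    $debug = getenv('APP_DEBUG');
    if ($debug !== false && $debug !== '') {
        // APP_DEBUG is set, so it takes precedence over APP_ENV.
        return in_array(strtolower($debug), ['true', '1'], true);
    }

    // Fall back to APP_ENV only when APP_DEBUG is unset or empty.
    return in_array(getenv('APP_ENV'), ['dev', 'testing'], true);
}

putenv('APP_ENV=dev');
putenv('APP_DEBUG=false');
var_dump(isDebugEnabled()); // bool(false): APP_DEBUG wins over APP_ENV
```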
Customizing External Link Detection
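Override the getCurrentHost() method to customize external link detection. In PHP a private static method cannot be overridden, so the subclass below declares it as protected; this only takes effect if HtmlCleaner itself declares getCurrentHost() as protected and calls it through late static binding (static::). If it does not, set the HTTP_HOST environment variable instead.
    use tekintian\HtmlCleaner\HtmlCleaner;

    class CustomHtmlCleaner extends HtmlCleaner
    {
        protected static function getCurrentHost()
        {
            return 'your-domain.com'; // Custom host for external link detection
        }
    }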
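For context, here is a minimal sketch of how a current host can be used to decide which links are external. The function `addBlankTargetToExternalLinks` is hypothetical; it assumes a link is external when its URL has a host that differs from the current one, and it leaves relative URLs and links that already carry a `target` attribute unchanged.

```php
<?php
// Hypothetical helper for illustration only; not part of tekintian/html-cleaner.
function addBlankTargetToExternalLinks(string $html, string $currentHost): string
{
    return preg_replace_callback('/<a\b([^>]*)>/i', function (array $m) use ($currentHost): string {
        $attrs = $m[1];
        if (!preg_match('/\bhref\s*=\s*"([^"]*)"/i', $attrs, $href)) {
            return $m[0]; // No href: nothing to classify.
        }

        $host = parse_url($href[1], PHP_URL_HOST);
        // Relative URLs have no host; same-host links are internal.
        if (!is_string($host) || strcasecmp($host, $currentHost) === 0) {
            return $m[0];
        }
        if (preg_match('/\btarget\s*=/i', $attrs)) {
            return $m[0]; // Respect an existing target attribute.
        }

        return '<a' . $attrs . ' target="_blank">';
    }, $html);
}

$html = '<a href="/about">About</a> <a href="https://www.php.net/">PHP</a> '
      . '<a href="https://example.com/x">Home</a>';
echo addBlankTargetToExternalLinks($html, 'example.com'), PHP_EOL;
// prints <a href="/about">About</a> <a href="https://www.php.net/" target="_blank">PHP</a> <a href="https://example.com/x">Home</a>
```

In real use the second argument could come from `getenv('HTTP_HOST')`, matching the environment variable listed under Configuration.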
Performance
The tool is optimized for performance:
- Efficient Regex Patterns: Uses optimized regular expressions
- Single Pass Processing: Combines multiple operations where possible
- Memory Efficient: Processes large files with minimal memory usage
Examples
Before Cleaning
    <h1 style="color: red; font-size: 24px;" cid="n0" mdtype="heading">
      <span style="font-weight: bold;" md-inline="plain">Title</span>
    </h1>
After Cleaning
    <h1>
      <strong>Title</strong>
    </h1>
File Structure
html-cleaner/
├── HtmlCleaner.php # Main cleaner class
├── index.php # Usage example
├── readme.md # English documentation
├── readme_zh.md # Chinese documentation
└── tests/ # Test files
    ├── 1.html # Original HTML file
    ├── f1.html # Template file
    ├── final_cleaned.html # Cleaned HTML
    └── f1_final.html # Final template with cleaned content
Testing
Running Tests
The project includes comprehensive unit tests to ensure code quality and functionality. To run the tests:
    # Install dependencies (if not already installed)
    composer install

    # Run all tests
    ./vendor/bin/phpunit tests/

    # Run specific test file
    ./vendor/bin/phpunit tests/HtmlCleanerTest.php

    # Run tests with detailed output
    ./vendor/bin/phpunit --verbose tests/
Test Coverage
The test suite covers:
- Basic HTML Cleaning: Core functionality testing
- Debug Output Control: Environment-based debug behavior
- Individual Processing Methods: Each public method has dedicated tests
- File Operations: File input/output handling
- Environment Variables: Configuration-based behavior
- Complex HTML Structures: Advanced HTML processing scenarios
- Error Handling: Exception and edge case testing
Test Environment
Tests are configured to run in PHP 7.2+ environments and include:
- Environment Management: Proper setup and teardown of environment variables
- Output Buffering: Testing debug output behavior
- File System Operations: Temporary file creation and cleanup
- Mock Data: Comprehensive test cases with various HTML inputs
Continuous Integration
To integrate testing into your development workflow:
    # Example GitHub Actions configuration
    name: PHP Tests
    on: [push, pull_request]
    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2
          - name: Setup PHP
            uses: shivammathur/setup-php@v2
            with:
              php-version: '7.2'
          - name: Install dependencies
            run: composer install --prefer-dist --no-progress
          - name: Execute tests
            run: ./vendor/bin/phpunit tests/
Test Examples
See the tests/ directory for complete test implementations, including:
- HtmlCleanerTest.php: Main test class with 23 test methods
- Test files demonstrating various HTML cleaning scenarios
- Examples of custom processing pipelines using individual methods
Browser Compatibility
The cleaned HTML is compatible with all modern browsers and maintains semantic structure.
SEO Benefits
- Reduced File Size: Smaller HTML files load faster
- Clean Markup: Search engines can better understand content structure
- Semantic HTML: Preserves meaningful tag structure
License
This project is open source and available under the MIT License.
Support
For issues and feature requests, please visit the GitHub repository.
Contributing
Contributions are welcome! Please feel free to submit pull requests or open issues for discussion.
Changelog
Version 1.1
- Added comprehensive English documentation
- Improved code comments and documentation
- Enhanced tag link functionality
- Better external link detection
Version 1.0
- Initial release with core cleaning functionality