ducks-project/encoding-repair

A robust, immutable, and extensible PHP library to handle charset conversion, detection, and repair (Double Encoding) with safe JSON wrappers. Optimized for Legacy ISO-8859-1 to UTF-8 migrations.

Fund package maintenance!
donaldinou
Open Collective

Installs: 1

Dependents: 0

Suggesters: 0

Security: 0

Stars: 1

Watchers: 1

Forks: 0

Open Issues: 0

pkg:composer/ducks-project/encoding-repair

v1.2.0 2026-01-27 15:19 UTC

This package is auto-updated.

Last update: 2026-01-28 11:24:12 UTC


README

Github Action Status Coverage Status

Build Status Coverage codecov

Psalm Type Coverage Psalm Level

License Latest Stable Version PHP Version Require

Total Downloads Monthly Downloads Daily Downloads

Duck's Validated Packagist online Documentation Status

Advanced charset encoding converter with Chain of Responsibility pattern, auto-detection, double-encoding repair, and JSON safety.

πŸ†• What's New in v1.2

Type Interpreter System

New optimized type-specific processing with Strategy + Visitor pattern:

// Custom property mapper for selective processing (60% faster!)
use Ducks\Component\EncodingRepair\Interpreter\PropertyMapperInterface;

class UserMapper implements PropertyMapperInterface
{
    public function map(object $object, callable $transcoder, array $options): object
    {
        $copy = clone $object;
        $copy->name = $transcoder($object->name);
        $copy->email = $transcoder($object->email);
        // password NOT transcoded (security)
        return $copy;
    }
}

$processor = new CharsetProcessor();
$processor->registerPropertyMapper(User::class, new UserMapper());

Batch Processing API

New optimized batch processing methods for high-performance array conversion:

// Batch conversion with single encoding detection (40-60% faster!)
$rows = $db->query("SELECT * FROM users")->fetchAll();
$utf8Rows = CharsetHelper::toCharsetBatch($rows, 'UTF-8', CharsetHelper::AUTO);

// Detect encoding from array
$encoding = CharsetHelper::detectBatch($items);

Service-Based Architecture

CharsetHelper now uses a service-based architecture following SOLID principles:

  • CharsetProcessor: Instanciable service with fluent API
  • CharsetProcessorInterface: Service contract for dependency injection
  • Multiple instances: Different configurations for different contexts
  • 100% backward compatible: Existing code works unchanged
// New way: Service instance
$processor = new CharsetProcessor();
$processor->addEncodings('SHIFT_JIS')->resetDetectors();
$utf8 = $processor->toUtf8($data);

// Old way: Static facade (still works)
$utf8 = CharsetHelper::toUtf8($data);

PSR-16 Cache Support

Optional external cache integration for improved performance:

// Use built-in InternalArrayCache (default, optimized)
use Ducks\Component\EncodingRepair\Detector\CachedDetector;
use Ducks\Component\EncodingRepair\Detector\MbStringDetector;

$detector = new CachedDetector(new MbStringDetector());
// InternalArrayCache used automatically (no TTL overhead)

// Or use ArrayCache for TTL support
use Ducks\Component\EncodingRepair\Cache\ArrayCache;

$cache = new ArrayCache();
$detector = new CachedDetector(new MbStringDetector(), $cache, 3600);

// Or use any PSR-16 implementation (Redis, Memcached, APCu)
// $redis = new \Symfony\Component\Cache\Psr16Cache($redisAdapter);
// $detector = new CachedDetector(new MbStringDetector(), $redis, 7200);

🌟 Why CharsetHelper?

Unlike existing libraries, CharsetHelper provides:

  • βœ… Extensible architecture with Chain of Responsibility pattern
  • βœ… PSR-16 cache support for Redis, Memcached, APCu (NEW in v1.2)
  • βœ… Type-specific interpreters for optimized processing (NEW in v1.2)
  • βœ… Custom property mappers for selective object conversion (NEW in v1.2)
  • βœ… Multiple fallback strategies (UConverter β†’ iconv β†’ mbstring)
  • βœ… Smart auto-detection with multiple detection methods
  • βœ… Double-encoding repair for corrupted legacy data
  • βœ… Recursive conversion for arrays AND objects (not just arrays!)
  • βœ… Safe JSON encoding/decoding with automatic charset handling
  • βœ… Modern PHP with strict typing (PHP 7.4+)
  • βœ… Minimal dependencies (only PSR-16 interface for optional caching)

πŸ“– Features

  • Robust Transcoding: Implements a Chain of Responsibility pattern trying best providers first (Intl/UConverter -> Iconv -> MbString).
  • PSR-16 Cache Support: Optional external cache (Redis, Memcached, APCu) for detection results (NEW in v1.2).
  • Type-Specific Interpreters: Optimized processing strategies per data type (NEW in v1.2).
  • Custom Property Mappers: Selective object property conversion for security and performance (NEW in v1.2).
  • Double-Encoding Repair: Automatically detects and fixes strings like été back to Γ©tΓ©.
  • Recursive Processing: Handles string, array, and object recursively.
  • Immutable: Objects are cloned before modification to prevent side effects.
  • Safe JSON Wrappers: Prevents json_encode from returning false on bad charsets.
  • Secure: Whitelisted encodings to prevent injection.
  • Extensible: Register your own transcoders, detectors, interpreters, or cache providers without modifying the core.
  • Modern Standards: PSR-12 compliant, strictly typed, SOLID architecture.

πŸ“‹ Requirements

  • PHP: 7.4, 8.0, 8.1, 8.2, or 8.3
  • Extensions (required):
    • ext-mbstring
    • ext-json
  • Extensions (recommended):
    • ext-intl

πŸ“¦ Installation

composer require ducks-project/charset-helper

Optional Extensions (for better performance)

# Ubuntu/Debian
sudo apt-get install php-intl php-iconv

# macOS (via Homebrew)
brew install php@8.2
# Extensions are included by default

# Windows
# Enable in php.ini:
extension=intl
extension=iconv

πŸš€ Quick Start

<?php

use Ducks\Component\Component\EncodingRepair\CharsetHelper;

// Simple UTF-8 conversion
$utf8String = CharsetHelper::toUtf8($latinString);

// Automatic encoding detection
$data = CharsetHelper::toCharset($mixedData, 'UTF-8', CharsetHelper::AUTO);

// Repair double-encoded strings
$fixed = CharsetHelper::repair($corruptedString);

// Safe JSON with encoding handling
$json = CharsetHelper::safeJsonEncode($data);

πŸ—οΈ Usage

1. Basic Conversion

Convert between different character encodings:

use Ducks\Component\Component\EncodingRepair\CharsetHelper;

$data = [
    'name' => 'GΓ©rard', // ISO-8859-1 string
    'meta' => ['desc' => 'Ca coΓ»te 10€'] // Nested array with Euro sign
];

// Convert to UTF-8
$utf8 = CharsetHelper::toUtf8($data, CharsetHelper::WINDOWS_1252);

// Convert to ISO-8859-1 (Windows-1252)
$iso = CharsetHelper::toIso($data, CharsetHelper::ENCODING_UTF8);

// Convert to any encoding
$result = CharsetHelper::toCharset(
    $data,
    CharsetHelper::ENCODING_UTF16,
    CharsetHelper::ENCODING_UTF8
);

Note: We use Windows-1252 instead of strict ISO-8859-1 by default because it includes common characters like €, Ε“, β„’ which are missing in standard ISO.

Supported Encodings:

  • UTF-8
  • UTF-16
  • UTF-32
  • ISO-8859-1
  • Windows-1252 (CP1252)
  • ASCII
  • AUTO (automatic detection)

2. Automatic Encoding Detection

Let CharsetHelper detect the source encoding:

// Automatic detection
$result = CharsetHelper::toCharset(
    $unknownData,
    CharsetHelper::ENCODING_UTF8,
    CharsetHelper::AUTO  // Will auto-detect source encoding
);

// Manual detection
$encoding = CharsetHelper::detect($string);
echo $encoding; // "UTF-8", "ISO-8859-1", etc.

// Batch detection from array (faster for large datasets)
$encoding = CharsetHelper::detectBatch($items);

// With custom encoding list
$encoding = CharsetHelper::detect($string, [
    'encodings' => ['UTF-8', 'Shift_JIS', 'EUC-JP']
]);

3. Batch Processing (New in v1.2)

Optimized for processing large arrays with single encoding detection:

// Database migration with batch processing
$rows = $db->query("SELECT * FROM users")->fetchAll(); // 10,000 rows

// Slow: Detects encoding for each row (10,000 detections)
$utf8Rows = array_map(
    fn($row) => CharsetHelper::toUtf8($row, CharsetHelper::AUTO),
    $rows
);

// Fast: Detects encoding once (1 detection, 40-60% faster!)
$utf8Rows = CharsetHelper::toCharsetBatch(
    $rows,
    CharsetHelper::ENCODING_UTF8,
    CharsetHelper::AUTO
);

// CSV import example
$csvData = array_map('str_getcsv', file('data.csv'));
$utf8Csv = CharsetHelper::toCharsetBatch($csvData, 'UTF-8', CharsetHelper::AUTO);

4. Recursive Conversion (Arrays & Objects)

Convert nested data structures:

// Array conversion
$data = [
    'name' => 'CafΓ©',
    'city' => 'SΓ£o Paulo',
    'items' => [
        'entrée' => 'Crème brûlée',
        'plat' => 'BΕ“uf bourguignon'
    ]
];

$utf8Data = CharsetHelper::toUtf8($data, CharsetHelper::WINDOWS_1252);

// Object conversion
class User {
    public $name;
    public $email;
}

$user = new User();
$user->name = 'JosΓ©';
$user->email = 'josΓ©@example.com';

$utf8User = CharsetHelper::toUtf8($user, CharsetHelper::ENCODING_ISO);
// Returns a cloned object with converted properties

5. Double-Encoding Repair πŸ”§

Fix strings that have been encoded multiple times (common with legacy databases):

// Example: "Café" (UTF-8 interpreted as ISO, then re-encoded as UTF-8)
$corrupted = "Café";

$fixed = CharsetHelper::repair($corrupted);
echo $fixed; // "CafΓ©"

// With custom max depth
$fixed = CharsetHelper::repair(
    $corrupted,
    CharsetHelper::ENCODING_UTF8,
    CharsetHelper::ENCODING_ISO,
    ['maxDepth' => 10]  // Try to peel up to 10 encoding layers
);

How it works:

  1. Detects valid UTF-8 strings
  2. Attempts to reverse-convert (UTF-8 β†’ source encoding)
  3. Repeats until no more layers found or max depth reached
  4. Returns the cleaned string

6. Safe JSON Operations

Prevent JSON encoding/decoding errors caused by invalid UTF-8:

// Safe encoding (auto-repairs before encoding)
$json = CharsetHelper::safeJsonEncode($data);

// Safe decoding with charset conversion
$data = CharsetHelper::safeJsonDecode(
    $json,
    true,  // associative array
    512,   // depth
    0,     // flags
    CharsetHelper::ENCODING_UTF8,      // target encoding
    CharsetHelper::WINDOWS_1252        // source encoding for repair
);

// Throws RuntimeException on error with clear message
try {
    $json = CharsetHelper::safeJsonEncode($invalidData);
} catch (RuntimeException $e) {
    echo $e->getMessage();
    // "JSON Encode Error: Malformed UTF-8 characters"
}

7. Conversion Options

Fine-tune the conversion behavior:

$result = CharsetHelper::toCharset($data, 'UTF-8', 'ISO-8859-1', [
    'normalize' => true,   // Apply Unicode NFC normalization (default: true)
    'translit' => true,    // Transliterate unavailable chars (default: true)
    'ignore' => true,      // Ignore invalid sequences (default: true)
    'encodings' => ['UTF-8', 'ISO-8859-1', 'Shift_JIS']  // For detection
]);

Options explained:

  • normalize: Applies Unicode NFC normalization to UTF-8 output (combines accents)
  • translit: Converts unmappable characters to similar ones (Γ© β†’ e)
  • ignore: Skips invalid byte sequences instead of failing
  • encodings: List of encodings to try during auto-detection

🎯 Advanced Usage

Using CharsetProcessor Service (New in v1.1)

For better testability and flexibility, use the CharsetProcessor service directly:

use Ducks\Component\EncodingRepair\CharsetProcessor;

// Create a processor instance
$processor = new CharsetProcessor();

// Fluent API for configuration
$processor
    ->addEncodings('SHIFT_JIS', 'EUC-JP')
    ->queueTranscoders(new MyCustomTranscoder())
    ->resetDetectors();

// Use the processor
$utf8 = $processor->toUtf8($data);

Multiple Processor Instances

// Production processor with strict encodings
$prodProcessor = new CharsetProcessor();
$prodProcessor->resetEncodings()->addEncodings('UTF-8', 'ISO-8859-1');

// Import processor with permissive encodings
$importProcessor = new CharsetProcessor();
$importProcessor->addEncodings('SHIFT_JIS', 'EUC-JP', 'GB2312');

// Both are independent
$prodResult = $prodProcessor->toUtf8($data);
$importResult = $importProcessor->toUtf8($legacyData);

Dependency Injection

use Ducks\Component\EncodingRepair\CharsetProcessorInterface;

class MyService
{
    private CharsetProcessorInterface $processor;

    public function __construct(CharsetProcessorInterface $processor)
    {
        $this->processor = $processor;
    }

    public function process($data)
    {
        return $this->processor->toUtf8($data);
    }
}

// Easy to mock in tests
$mock = $this->createMock(CharsetProcessorInterface::class);
$service = new MyService($mock);

Custom Property Mappers (New in v1.2)

Optimize object processing by converting only specific properties:

use Ducks\Component\EncodingRepair\Interpreter\PropertyMapperInterface;

class UserMapper implements PropertyMapperInterface
{
    public function map(object $object, callable $transcoder, array $options): object
    {
        $copy = clone $object;
        $copy->name = $transcoder($object->name);
        $copy->email = $transcoder($object->email);
        // password is NOT transcoded (security)
        // avatar_binary is NOT transcoded (performance)
        return $copy;
    }
}

$processor = new CharsetProcessor();
$processor->registerPropertyMapper(User::class, new UserMapper());

$user = new User();
$user->name = 'JosΓ©';
$user->password = 'secret123';  // Will NOT be converted
$utf8User = $processor->toUtf8($user);

// Performance: 60% faster for objects with 50+ properties

Custom Type Interpreters (New in v1.2)

Add support for custom data types:

use Ducks\Component\EncodingRepair\Interpreter\TypeInterpreterInterface;

class ResourceInterpreter implements TypeInterpreterInterface
{
    public function supports($data): bool
    {
        return \is_resource($data);
    }

    public function interpret($data, callable $transcoder, array $options)
    {
        $content = \stream_get_contents($data);
        $converted = $transcoder($content);

        $newResource = \fopen('php://memory', 'r+');
        \fwrite($newResource, $converted);
        \rewind($newResource);

        return $newResource;
    }

    public function getPriority(): int
    {
        return 80;
    }
}

$processor->registerInterpreter(new ResourceInterpreter(), 80);

$resource = fopen('data.txt', 'r');
$convertedResource = $processor->toUtf8($resource);

Registering Custom Transcoders

Extend CharsetHelper with your own conversion strategies using the TranscoderInterface:

use Ducks\Component\EncodingRepair\Transcoder\TranscoderInterface;

class MyCustomTranscoder implements TranscoderInterface
{
    public function transcode(string $data, string $to, string $from, array $options): ?string
    {
        if ($from === 'MY-CUSTOM-ENCODING') {
            return myCustomConversion($data, $to);
        }
        // Return null to try next transcoder in chain
        return null;
    }

    public function getPriority(): int
    {
        return 75; // Between iconv (50) and UConverter (100)
    }

    public function isAvailable(): bool
    {
        return extension_loaded('my_extension');
    }
}

// Register with default priority
CharsetHelper::registerTranscoder(new MyCustomTranscoder());

// Register with custom priority
CharsetHelper::registerTranscoder(new MyCustomTranscoder(), 150);

// Legacy: Register a callable
CharsetHelper::registerTranscoder(
    function (string $data, string $to, string $from, array $options): ?string {
        if ($from === 'MY-CUSTOM-ENCODING') {
            return myCustomConversion($data, $to);
        }
        return null;
    },
    150  // Priority
);

Registering Custom Detectors

Add custom encoding detection methods using the DetectorInterface:

use Ducks\Component\EncodingRepair\Detector\DetectorInterface;

class MyCustomDetector implements DetectorInterface
{
    public function detect(string $string, array $options): ?string
    {
        // Check for UTF-16LE BOM
        if (strlen($string) >= 2 && ord($string[0]) === 0xFF && ord($string[1]) === 0xFE) {
            return 'UTF-16LE';
        }
        // Return null to try next detector
        return null;
    }

    public function getPriority(): int
    {
        return 150; // Higher than MbStringDetector (100)
    }

    public function isAvailable(): bool
    {
        return true;
    }
}

// Register with default priority
CharsetHelper::registerDetector(new MyCustomDetector());

// Register with custom priority
CharsetHelper::registerDetector(new MyCustomDetector(), 200);

// Legacy: Register a callable
CharsetHelper::registerDetector(
    function (string $string, array $options): ?string {
        if (strlen($string) >= 2 && ord($string[0]) === 0xFF && ord($string[1]) === 0xFE) {
            return 'UTF-16LE';
        }
        return null;
    },
    200  // Priority
);

Chain of Responsibility Pattern

The class uses a Chain of Responsibility pattern for both detection and transcoding.

CharsetHelper uses multiple strategies with automatic fallback:

UConverter (intl) β†’ iconv β†’ mbstring
     ↓ (fails)         ↓ (fails)    ↓ (always works)

Transcoder priorities:

  1. UConverter (priority: 100, requires ext-intl): Best precision, supports many encodings
  2. iconv (priority: 50): Good performance, supports transliteration
  3. mbstring (priority: 10): Universal fallback, most permissive

Custom transcoders can be registered with any priority value. Higher values execute first.

Detector priorities:

  1. CachedDetector (priority: 200, wraps MbStringDetector): Caches detection results
  2. MbStringDetector (priority: 100, requires ext-mbstring): Fast and reliable using mb_detect_encoding
  3. FileInfoDetector (priority: 50, requires ext-fileinfo): Fallback using finfo class

Custom detectors can be registered with any priority value. Higher values execute first.

Cache Support (New in v1.2):

CachedDetector supports PSR-16 cache for persistent detection results:

// Default: InternalArrayCache (optimized, no TTL overhead)
$detector = new CachedDetector(new MbStringDetector());

// With TTL: ArrayCache
$cache = new ArrayCache();
$detector = new CachedDetector(new MbStringDetector(), $cache, 3600);

// External: Redis, Memcached, APCu, etc.
// $redis = new \Symfony\Component\Cache\Psr16Cache($redisAdapter);
// $detector = new CachedDetector(new MbStringDetector(), $redis, 7200);

πŸ“Š Performance

Benchmarks on 10,000 conversions (PHP 8.2, i7-12700K):

Operation Time Memory
Simple UTF-8 conversion 45ms 2MB
Array (100 items) 180ms 5MB
Auto-detection + conversion 92ms 3MB
Double-encoding repair 125ms 4MB
Safe JSON encode 67ms 3MB
Batch conversion (1000 items) ~60% faster Same
Object with custom mapper (50 props) ~60% faster Same

Tips for performance:

  • Install ext-intl for best performance (UConverter is fastest)
  • Use specific encodings instead of AUTO when possible
  • Use batch methods (toCharsetBatch()) for arrays > 100 items with AUTO detection
  • Cache detection results for repeated operations

πŸ†š Comparison with Alternatives

Feature CharsetHelper ForceUTF8 Symfony String Portable UTF-8
Multiple fallback strategies βœ… ❌ ❌ ❌
Extensible (CoR pattern) βœ… ❌ ❌ ❌
Object recursion βœ… ❌ ❌ ❌
Double-encoding repair βœ… βœ… ❌ ⚠️
Safe JSON helpers βœ… ❌ ❌ ❌
Multi-encoding support βœ… (7+) ⚠️ (2) ⚠️ ⚠️ (3)
Modern PHP (7.4+, strict types) βœ… ❌ βœ… ⚠️
Zero dependencies βœ… βœ… ❌ ❌

πŸ” Use Cases

1. Database Migration (Latin1 β†’ UTF-8)

// Migrate user table
$users = $db->query("SELECT * FROM users")->fetchAll();

foreach ($users as $user) {
    $user = CharsetHelper::toUtf8($user, CharsetHelper::ENCODING_ISO);
    $db->update('users', $user, ['id' => $user['id']]);
}

2. CSV Import with Unknown Encoding

$csv = file_get_contents('data.csv');

// Auto-detect and convert
$utf8Csv = CharsetHelper::toCharset(
    $csv,
    CharsetHelper::ENCODING_UTF8,
    CharsetHelper::AUTO
);

// Parse as UTF-8
$data = str_getcsv($utf8Csv);

3. API Response Sanitization

// Ensure API responses are always valid UTF-8
class ApiController
{
    public function jsonResponse($data): JsonResponse
    {
        $json = CharsetHelper::safeJsonEncode($data);
        return new JsonResponse($json, 200, [], true);
    }
}

4. Web Scraping

$html = file_get_contents('https://example.com');

// Detect encoding from HTML meta tags or auto-detect
$encoding = CharsetHelper::detect($html);

// Convert to UTF-8 for processing
$utf8Html = CharsetHelper::toCharset(
    $html,
    CharsetHelper::ENCODING_UTF8,
    $encoding
);

$dom = new DOMDocument();
$dom->loadHTML($utf8Html);

5. Legacy System Integration

// Fix double-encoded data from old system
$legacyData = $oldSystem->getData();

// Repair corruption
$clean = CharsetHelper::repair(
    $legacyData,
    CharsetHelper::ENCODING_UTF8,
    CharsetHelper::ENCODING_ISO
);

// Process clean data
processData($clean);

πŸ§ͺ Testing

# Run tests
composer test

# Run tests with coverage
composer unittest -- --coverage-html coverage

# Static analysis
composer phpstan

# Auto-fix code style
composer phpcsfixer-check

πŸ“š Glossary

🀝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Write tests for your changes
  4. Ensure tests pass (composer test)
  5. Run static analysis (composer analyse)
  6. Fix code style (composer cs-fix)
  7. Commit your changes (git commit -m 'Add amazing feature')
  8. Push to the branch (git push origin feature/amazing-feature)
  9. Open a Pull Request

Development Setup

git clone https://github.com/ducks-project/encoding-repair.git
cd encoding-repair
composer install

# Run full CI checks locally
composer ci

Code Quality Standards

  • PSR-12 / PER Coding Style
  • PHPStan level 8
  • 100% type coverage
  • Minimum 90% code coverage

πŸ“„ License

This project is licensed under the MIT license see the LICENSE file for details.

πŸ™ Credits

πŸ”— Links

πŸ’¬ Support

⭐ Star History

If this project helped you, please consider giving it a ⭐ on GitHub!

Made with ❀️ by the Duck Project Team