szopen / similarity
A similarity library for string and date matching
pkg:composer/szopen/similarity
Requires
- php: >=8.2
- ext-iconv: *
- ext-intl: *
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.88
- phpstan/phpstan: ^2.1
- phpunit/phpunit: ^12.3
Last update: 2025-11-13 19:27:27 UTC
README
A lightweight PHP library for measuring the similarity between strings and dates with flexible normalization, transliteration, and fuzzy comparison strategies.
Overview
The Similarity package provides utilities to compare textual or date-based inputs and determine how closely they match.
It is designed to be extensible, locale-aware, and suitable for fuzzy matching, which makes it a good fit for use cases such as data deduplication, data cleaning, or record linkage.
The core idea is that different types of data (like names, text, or dates) require different similarity strategies, so the library offers multiple specialized classes.
Main Components
StringFuzzySimilarity
Compares two strings with a fuzzy matching algorithm.
Useful when dealing with typos, transliteration differences, or minor formatting variations.
- Normalization: Removes punctuation, trims whitespace, and can optionally transliterate text to Latin.
- Algorithm: Combines similar_text and normalized Levenshtein distance.
- Use case: Matching names like José vs Jose, or McDonald’s vs McDonalds.

    $similarity = new StringFuzzySimilarity(
        new StringNormalizer(
            new TransliteratorFactory()
        )
    );

    $result = $similarity->similarity('José García', 'Jose Garcia'); // e.g., 0.97

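The README does not say how the two metrics are combined. As a minimal sketch, assuming an equal-weight average of PHP's built-in similar_text() and a length-normalized levenshtein(), the blend could look like this. The function name blendedSimilarity and the 50/50 weighting are assumptions for illustration, not the library's documented formula.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: blends similar_text() with a normalized Levenshtein score.
 * The equal weighting is an assumption, not the library's documented formula.
 * Both built-ins work on bytes, so multibyte input should be transliterated first.
 */
function blendedSimilarity(string $a, string $b): float
{
    if ($a === $b) {
        return 1.0; // also covers two empty strings
    }

    // similar_text() reports a percentage through its third (by-reference) argument.
    similar_text($a, $b, $percent);
    $similarText = $percent / 100.0;

    // levenshtein() returns an edit distance; dividing by the longer length
    // maps it to the 0..1 range, and subtracting from 1 turns it into a similarity.
    $maxLength = max(strlen($a), strlen($b));
    $levenshtein = 1.0 - levenshtein($a, $b) / $maxLength;

    return ($similarText + $levenshtein) / 2.0;
}

echo blendedSimilarity('Leandro', 'Leandor'), PHP_EOL;
```

Averaging keeps the result in the 0..1 range because both inputs already are; a different weighting would only change how strongly each metric pulls the score.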
DateFuzzySimilarity
Compares two date values even if they use different formats or delimiters.
- Normalization: Converts supported formats (e.g., YYYY-MM-DD, DD/MM/YYYY, MM.DD.YYYY) into a canonical form.
- Algorithm: Computes a similarity score based on date component proximity (e.g., days, months, years).
- Use case: Matching 12-03-1990 and 1990/03/12 as the same date.

    $similarity = new DateFuzzySimilarity(
        new DateFuzzySimilarityConfiguration(
            new DatePartsWeights(),
            new DateDiffPenalty()
        ),
        new DateNormalizer()
    );

    $result = $similarity->similarity('1990-03-12', '12/03/1990'); // 1.0

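To illustrate the normalization step, here is a minimal sketch that unifies delimiters and tries a short list of formats with DateTimeImmutable::createFromFormat(). The function name normalizeDate, the format list, and the day-first priority are assumptions; a value such as 12-03-1990 is ambiguous between day-first and month-first, and the library's DateNormalizer may resolve it differently.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the date as Y-m-d, or null if no format matches.
 * The format list and its day-first priority are assumptions for this sketch.
 */
function normalizeDate(string $input): ?string
{
    // Unify delimiters so "12/03/1990", "12.03.1990" and "12-03-1990" look alike.
    $unified = strtr(trim($input), ['/' => '-', '.' => '-']);

    // The leading "!" resets unparsed fields (such as the time) instead of using "now".
    foreach (['!Y-m-d', '!d-m-Y'] as $format) {
        $date = DateTimeImmutable::createFromFormat($format, $unified);

        // getLastErrors() returns false on PHP 8.2+ when parsing was clean.
        $errors = DateTimeImmutable::getLastErrors();
        $clean = $errors === false
            || ($errors['warning_count'] === 0 && $errors['error_count'] === 0);

        // Warnings flag impossible dates (e.g. 31 February) that PHP would roll over.
        if ($date !== false && $clean) {
            return $date->format('Y-m-d');
        }
    }

    return null;
}

var_dump(normalizeDate('12/03/1990') === normalizeDate('1990-03-12')); // bool(true)
```

Once both inputs are in the same canonical form, they can be compared component by component.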
StringExactSimilarity
Compares two strings for an exact match after normalization.
- Normalization: Cleans strings but does not introduce fuzziness.
- Algorithm: Returns 1.0 if normalized strings are identical, otherwise 0.0.
- Use case: Validating IDs, codes, or fields that must match exactly.

    $similarity = new StringExactSimilarity();

    $result = $similarity->similarity('ABC123', 'abc123'); // 1.0

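The example above scores 'ABC123' and 'abc123' as 1.0, which suggests that normalization folds case. A minimal sketch of that behavior, assuming normalization means trimming whitespace and ASCII lower-casing (the function name exactSimilarity and these exact rules are assumptions, not the library's implementation):

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: exact comparison after a simple normalization pass.
 * Trimming and ASCII case folding are assumed rules for this sketch.
 */
function exactSimilarity(string $a, string $b): float
{
    // strtolower() only folds ASCII letters on PHP 8.2+, which is enough for IDs and codes.
    $normalize = static fn (string $value): string => strtolower(trim($value));

    return $normalize($a) === $normalize($b) ? 1.0 : 0.0;
}

var_dump(exactSimilarity('ABC123', 'abc123')); // float(1)
var_dump(exactSimilarity('ABC123', 'ABC124')); // float(0)
```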
Factory Usage Example
The library provides a simple factory for creating the right similarity strategy depending on the data type or context.

    use Szopen\Similarity\SimilarityFactory;

    $factory = new SimilarityFactory(
        new DateFuzzySimilarityConfiguration(
            new DatePartsWeights(),
            new DateDiffPenalty(),
        )
    );

    // Create the comparator for each requested strategy
    $stringSim = $factory->create(SimilarityFactory::STRING_FUZZY);
    $dateSim = $factory->create(SimilarityFactory::DATE_FUZZY);

    // Compute similarity
    echo $stringSim->similarity('Leandro', 'Leandor'); // e.g., 0.9
    echo $dateSim->similarity('2023-11-12', '12.11.2023'); // 1.0

You can extend or customize the factory to add your own similarity strategies.
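As a sketch of what a custom strategy might look like, the class below scores two strings by token overlap (Jaccard index), a technique chosen here purely for illustration. The README only shows a similarity() method, so the interface name SimilarityStrategy and its signature are assumptions, and how such a class is registered with the factory is not covered here.

```php
<?php
declare(strict_types=1);

// Hypothetical contract: the interface name and signature are assumptions
// modeled on the similarity() calls shown in this README.
interface SimilarityStrategy
{
    public function similarity(string $a, string $b): float;
}

final class TokenJaccardSimilarity implements SimilarityStrategy
{
    public function similarity(string $a, string $b): float
    {
        $tokensA = $this->tokens($a);
        $tokensB = $this->tokens($b);

        if ($tokensA === [] && $tokensB === []) {
            return 1.0; // two empty inputs are treated as identical
        }

        // Jaccard index: shared tokens divided by all distinct tokens.
        $shared = count(array_intersect($tokensA, $tokensB));
        $total = count(array_unique(array_merge($tokensA, $tokensB)));

        return (float) ($shared / $total);
    }

    /** @return list<string> Distinct lower-cased, whitespace-separated tokens. */
    private function tokens(string $value): array
    {
        $parts = preg_split('/\s+/', strtolower(trim($value)), -1, PREG_SPLIT_NO_EMPTY);

        return array_values(array_unique($parts === false ? [] : $parts));
    }
}

$strategy = new TokenJaccardSimilarity();
echo $strategy->similarity('Anna Maria Kowalska', 'Maria Kowalska'), PHP_EOL;
```

Token overlap ignores word order, which can suit multi-part names where the parts are sometimes swapped.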
Installation
composer require szopen/similarity
Requirements
- PHP 8.2+
- ext-intl (recommended for proper transliteration)
- ext-iconv (recommended as a fallback for ext-intl)
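The roles of the two extensions can be sketched as follows: try ext-intl first, then fall back to ext-iconv. This is an illustration only; the function name toLatinAscii and the transliterator ID 'Any-Latin; Latin-ASCII' are assumptions, and the library's TransliteratorFactory may be configured differently.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: transliterates text to ASCII, preferring ext-intl
 * and falling back to ext-iconv. Returns the input unchanged if neither works.
 */
function toLatinAscii(string $text): string
{
    if (function_exists('transliterator_transliterate')) {
        // ICU compound ID: convert any script to Latin, then strip diacritics.
        $result = transliterator_transliterate('Any-Latin; Latin-ASCII', $text);
        if ($result !== false) {
            return $result;
        }
    }

    if (function_exists('iconv')) {
        // //TRANSLIT output depends on the platform's iconv implementation,
        // so results can differ between systems.
        $result = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $text);
        if ($result !== false) {
            return $result;
        }
    }

    return $text;
}

echo toLatinAscii('José García'), PHP_EOL;
```

Because iconv transliteration varies by platform, ext-intl is the more predictable choice when it is available.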
License
This project is licensed under the MIT License.
Contributing
Contributions are welcome!
Please open an issue or submit a pull request if you’d like to add new normalization strategies or similarity metrics.