serafim / tf-idf
Library to calculate TF-IDF (Term Frequency - Inverse Document Frequency) for generic documents
Requires
- php: ^8.1
- ext-intl: *
- ext-mbstring: *
- voku/portable-utf8: ^6.0
- voku/stop-words: ^2.0
Requires (Dev)
- phpunit/phpunit: ^9.5.20
- squizlabs/php_codesniffer: ^3.7
- symfony/var-dumper: ^5.4|^6.0
- vimeo/psalm: ^5.6
This package is auto-updated.
Last update: 2024-10-21 16:58:44 UTC
README
Introduction
TF-IDF is an information retrieval technique used to rank the importance of words in a document. It is based on the idea that a word is more relevant to a document the more often it appears there, and less distinctive the more documents of the collection contain it.
TF-IDF is the product of Term Frequency and Inverse Document Frequency. Here’s the formula for TF-IDF calculation.
TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)
Term Frequency
Term frequency is the ratio of the number of occurrences of a certain word to the total number of words in the document. It measures the importance of the word within a single document:

$$\mathrm{tf}(t, d) = \frac{n_t}{\sum_k n_k}$$

where $n_t$ is the number of occurrences of the word $t$ in the document $d$, and the denominator is the total number of words in the document.
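The ratio described above can be sketched in a few lines of plain PHP. The helper name `termFrequency` is hypothetical and is not part of the library's API; the sketch also assumes the document has already been split into words.

```php
<?php
declare(strict_types=1);

/**
 * Term frequency: occurrences of $term divided by the total number of tokens.
 * Hypothetical helper for illustration; not part of serafim/tf-idf.
 *
 * @param list<string> $tokens The document, already split into words.
 */
function termFrequency(string $term, array $tokens): float
{
    $total = \count($tokens);

    if ($total === 0) {
        return 0.0; // An empty document has no terms; avoid division by zero.
    }

    // array_count_values() maps each distinct token to its number of occurrences.
    $counts = \array_count_values($tokens);

    return ($counts[$term] ?? 0) / $total;
}

// "php" occurs 2 times among 4 tokens.
$tokens = ['php', 'runs', 'php', 'code'];
$tf = termFrequency('php', $tokens);
```

A term that does not occur in the document gets a frequency of zero.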
Inverse Document Frequency
The inverse of the frequency with which a certain word occurs in the documents of the collection. The concept was introduced by Karen Spärck Jones. Accounting for IDF reduces the weight of commonly used words. There is only one IDF value for each unique word within a given collection of documents.
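$$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{\, d_i \in D \mid t \in d_i \,\}|}$$

where
- $|D|$ is the number of documents in the collection;
- $|\{\, d_i \in D \mid t \in d_i \,\}|$ is the number of documents in the collection $D$ in which the word $t$ occurs (that is, where its occurrence count is non-zero: $n_t \neq 0$).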
The choice of the base of the logarithm in the formula does not matter, since changing the base changes the weight of each word by a constant factor, which does not affect the weight ratio.
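As a rough illustration of the IDF definition and of the remark about the logarithm base, the sketch below computes IDF in plain PHP. The function name `inverseDocumentFrequency` and the optional `$base` parameter are assumptions made for this example and are not the library's API.

```php
<?php
declare(strict_types=1);

/**
 * Inverse document frequency: log(N / df).
 * Hypothetical helper for illustration; not part of serafim/tf-idf.
 *
 * @param int   $documentCount     N, the number of documents in the collection.
 * @param int   $documentsWithTerm df, the number of documents containing the term.
 * @param float $base              Logarithm base (natural logarithm by default).
 */
function inverseDocumentFrequency(
    int $documentCount,
    int $documentsWithTerm,
    float $base = M_E
): float {
    if ($documentCount <= 0 || $documentsWithTerm <= 0) {
        throw new \InvalidArgumentException('Both counts must be positive.');
    }

    return \log($documentCount / $documentsWithTerm, $base);
}

// Weight of a rare term (1 of 8 documents) relative to a common one (4 of 8),
// once with the natural logarithm and once with base 2.
$ratioNatural = inverseDocumentFrequency(8, 1) / inverseDocumentFrequency(8, 4);
$ratioBase2   = inverseDocumentFrequency(8, 1, 2.0) / inverseDocumentFrequency(8, 4, 2.0);
```

Both ratios equal ln 8 / ln 2 = 3 up to floating-point rounding, which is the sense in which the base does not affect the relative weights. A term that occurs in every document gets an IDF of zero.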
Thus, the TF-IDF measure is the product of two factors:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$
High weight in TF-IDF will be given to words with high frequency within a particular document and low frequency in other documents.
Installation
The library is available as a Composer package and can be installed by running the following command in the root of your project:
$ composer require serafim/tf-idf
Quick Start
Getting information about words:
$vectorizer = new \Serafim\TFIDF\Vectorizer();

$vectorizer->addFile(__DIR__ . '/path/to/file-1.txt');
$vectorizer->addFile(__DIR__ . '/path/to/file-2.txt');

foreach ($vectorizer->compute() as $document => $entries) {
    var_dump($document);

    foreach ($entries as $entry) {
        var_dump($entry);
    }
}
Example Result:
Serafim\TFIDF\Document\FileDocument {
    locale: "ru_RU"
    pathname: "/home/example/how-it-works.md"
}
Serafim\TFIDF\Entry {
    term: "работает"
    occurrences: 4
    df: 1
    tf: 0.012012012012012
    idf: 0.69314718055995
    tfidf: 0.0083260922589783
}
Serafim\TFIDF\Entry {
    term: "php"
    occurrences: 26
    df: 2
    tf: 0.078078078078078
    idf: 0.0
    tfidf: 0.0
}
Serafim\TFIDF\Entry {
    term: "запуска"
    occurrences: 2
    df: 1
    tf: 0.006006006006006
    idf: 0.69314718055995
    tfidf: 0.0041630461294892
}
// ...etc...
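The figures in the example result above can be checked by hand. The sketch below recomputes them in plain PHP, without the library, under two assumptions inferred from the printed values rather than stated by the source: the document contains 333 counted words (4 / 0.012012...), and the collection holds 2 documents with IDF computed using the natural logarithm (a term with df: 2 has idf: 0.0, and ln 2 ≈ 0.693).

```php
<?php
declare(strict_types=1);

// Assumed inputs, inferred from the example output (not stated by the source).
$totalWords = 333; // 4 occurrences / tf 0.012012... = 333 words
$documents  = 2;   // ln(2 / 1) = 0.6931..., ln(2 / 2) = 0

// Occurrence and document-frequency counts copied from the example output.
$rows = [
    'работает' => ['occurrences' => 4,  'df' => 1],
    'php'      => ['occurrences' => 26, 'df' => 2],
    'запуска'  => ['occurrences' => 2,  'df' => 1],
];

$scores = [];

foreach ($rows as $term => $row) {
    $tf  = $row['occurrences'] / $totalWords;  // term frequency
    $idf = \log($documents / $row['df']);      // inverse document frequency
    $scores[$term] = ['tf' => $tf, 'idf' => $idf, 'tfidf' => $tf * $idf];
}

foreach ($scores as $term => $score) {
    \printf("%s: tf=%.6f idf=%.6f tfidf=%.6f\n", $term, $score['tf'], $score['idf'], $score['tfidf']);
}
```

The term "php" is the most frequent word in the document, yet it scores zero because it occurs in both documents of the collection.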
Adding Documents
The IDF (Inverse Document Frequency) calculation requires several documents in the corpus. You can add them using any of the following methods:
$vectorizer = new \Serafim\TFIDF\Vectorizer();

$vectorizer->addFile(__DIR__ . '/path/to/file.txt');
$vectorizer->addFile(new \SplFileInfo(__DIR__ . '/path/to/file.txt'));
$vectorizer->addText('example text');
$vectorizer->addStream(fopen(__DIR__ . '/path/to/file.txt', 'rb'));

// OR
$vectorizer->add(new class implements \Serafim\TFIDF\Document\TextDocumentInterface {
    public function getLocale(): string { /* ... */ }
    public function getContent(): string { /* ... */ }
});
Creating Documents
$vectorizer = new \Serafim\TFIDF\Vectorizer();

$file = $vectorizer->createFile(__DIR__ . '/path/to/file.txt');
$text = $vectorizer->createText('example text');
$stream = $vectorizer->createStream(fopen(__DIR__ . '/path/to/file.txt', 'rb'));
Computing
To calculate TF-IDF for every loaded document, use the compute(): iterable method:
foreach ($vectorizer->compute() as $document => $result) {
    // $document = object(Serafim\TFIDF\Document\DocumentInterface)
    // $result = list<object(Serafim\TFIDF\Entry)>
}
To calculate TF-IDF for a given document against the loaded documents, use the computeFor(StreamingDocumentInterface|TextDocumentInterface): iterable method:
$text = $vectorizer->createText('example text');

$result = $vectorizer->computeFor($text);

// $result = list<object(Serafim\TFIDF\Entry)>
Custom Memory Driver
By default, all counters are kept in memory. This is fast, but a large corpus can exhaust the available memory. You can write your own driver if you need to reduce memory usage.
use Serafim\TFIDF\Vectorizer;
use Serafim\TFIDF\Memory\FactoryInterface;
use Serafim\TFIDF\Memory\MemoryInterface;

$vectorizer = new Vectorizer(
    memory: new class implements FactoryInterface {
        // Method for creating a memory area for counters
        public function create(): MemoryInterface
        {
            return new class implements MemoryInterface, \IteratorAggregate {
                // Increment counter for the given term.
                public function inc(string $term): void { /* ... */ }

                // Return counter value for the given term or
                // 0 if the counter is not found.
                public function get(string $term): int { /* ... */ }

                // Should return TRUE if there is a counter for the
                // specified term.
                public function has(string $term): bool { /* ... */ }

                // Returns the number of registered counters.
                public function count(): int { /* ... */ }

                // Returns a list of terms and counter values in
                // format: [ WORD => 42 ]
                public function getIterator(): \Traversable { /* ... */ }

                // Destruction of the allocated memory area.
                public function __destruct() { /* ... */ }
            };
        }
    }
);
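To make the counter contract above more concrete, here is a minimal stand-alone sketch that fills in the method bodies with a plain PHP array. The class name `ArrayCounters` is hypothetical. It deliberately does not reference the library, so it runs on its own; to plug it into the vectorizer it would presumably also have to declare `implements MemoryInterface`. An array backend does not save memory by itself, so treat it only as a reference point before swapping the storage for a file, database, or cache.

```php
<?php
declare(strict_types=1);

/**
 * Array-backed counter store exposing the methods listed in the driver example.
 * Hypothetical class for illustration; not part of serafim/tf-idf.
 *
 * @implements \IteratorAggregate<array-key, int>
 */
final class ArrayCounters implements \IteratorAggregate, \Countable
{
    /** @var array<array-key, int> Note: PHP stores numeric-string terms as int keys. */
    private array $counters = [];

    // Increment the counter for the given term, creating it if needed.
    public function inc(string $term): void
    {
        $this->counters[$term] = ($this->counters[$term] ?? 0) + 1;
    }

    // Return the counter value, or 0 if the term was never counted.
    public function get(string $term): int
    {
        return $this->counters[$term] ?? 0;
    }

    // TRUE if a counter exists for the term.
    public function has(string $term): bool
    {
        return isset($this->counters[$term]);
    }

    // Number of distinct terms that have a counter.
    public function count(): int
    {
        return \count($this->counters);
    }

    // Iterate as [ term => count ].
    public function getIterator(): \Traversable
    {
        return new \ArrayIterator($this->counters);
    }
}

$counters = new ArrayCounters();
foreach (['php', 'code', 'php'] as $term) {
    $counters->inc($term);
}
```

After the loop, the store holds two counters: "php" with a value of 2 and "code" with a value of 1.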
Custom Stop Words
If you need your own set of "stop words" (terms that are excluded from the result), specify a custom implementation.
Please note that by default, the list of stop words from the voku/stop-words package is used.
use Serafim\TFIDF\Vectorizer;
use Serafim\TFIDF\StopWords\FactoryInterface;
use Serafim\TFIDF\StopWords\StopWordsInterface;

$vectorizer = new Vectorizer(
    stopWords: new class implements FactoryInterface {
        public function create(string $locale): StopWordsInterface
        {
            // You can use a different set of stop word drivers depending
            // on the locale ("$locale" argument) of the document.
            return new class implements StopWordsInterface {
                // TRUE should be returned if the word should be ignored.
                // For example prepositions.
                public function match(string $term): bool
                {
                    return \in_array($term, ['and', 'or', /* ... */], true);
                }
            };
        }
    }
);
Custom Locale
use Serafim\TFIDF\Vectorizer;
use Serafim\TFIDF\Locale\IntlRepository;

$vectorizer = new Vectorizer(
    locales: new class extends IntlRepository {
        // Specifying the default locale
        public function getDefault(): string
        {
            return 'en_US';
        }
    }
);
Custom Tokenizer
If the default way of splitting text into words does not suit you, you can write your own tokenizer.
use Serafim\TFIDF\Vectorizer;
use Serafim\TFIDF\Tokenizer\TokenizerInterface;
use Serafim\TFIDF\Document\StreamingDocumentInterface;
use Serafim\TFIDF\Document\TextDocumentInterface;

$vectorizer = new Vectorizer(
    tokenizer: new class implements TokenizerInterface {
        // Please note that there can be several types of document:
        //  - Text Document: One that contains text in string representation.
        //  - Streaming Document: One that can be read and may contain a
        //    large amount of data.
        public function tokenize(StreamingDocumentInterface|TextDocumentInterface $document): iterable
        {
            $content = $document instanceof StreamingDocumentInterface
                ? \stream_get_contents($document->getContentStream())
                : $document->getContent();

            // Please note that the document also contains the locale, based on
            // which the term (word) separation logic can change.
            //
            // i.e. `if ($document->getLocale() === 'ar') { ... }`
            //
            return \preg_split('/[\s,]+/isum', $content);
        }
    }
);
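The splitting step inside a custom tokenizer can be prototyped and tested separately from the library. The sketch below is one possible variant, not the library's default behaviour: it splits on any run of characters that are neither letters nor digits, drops empty pieces, and lowercases the result. The function name `splitIntoTerms` is hypothetical, and the sketch relies on the mbstring extension, which the package already requires.

```php
<?php
declare(strict_types=1);

/**
 * Split UTF-8 text into lowercase terms.
 * Hypothetical helper for illustration; not part of serafim/tf-idf.
 *
 * @return list<string>
 */
function splitIntoTerms(string $content): array
{
    // Split on every run of characters that is neither a letter nor a digit;
    // PREG_SPLIT_NO_EMPTY drops the empty pieces at the edges of the text.
    $terms = \preg_split('/[^\p{L}\p{N}]+/u', $content, -1, \PREG_SPLIT_NO_EMPTY);

    if ($terms === false) {
        // preg_split() returns false on failure, e.g. for invalid UTF-8 input.
        throw new \InvalidArgumentException('Content could not be split; is it valid UTF-8?');
    }

    // Lowercase each term so that "PHP" and "php" are counted together.
    return \array_map(
        static fn (string $term): string => \mb_strtolower($term, 'UTF-8'),
        $terms
    );
}

$terms = splitIntoTerms('PHP works, PHP runs!');
```

Because every non-letter is treated as a separator, this variant also splits words at apostrophes and hyphens, which may or may not be desirable for a given locale.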