namoshek / laravel-scout-database
A generic Laravel Scout driver which performs full-text search on indexed model data using an SQL database as storage backend. Indexed data is stored in normalized form, allowing efficient search.
Installs: 60 347
Dependents: 0
Suggesters: 0
Security: 0
Stars: 11
Watchers: 3
Forks: 6
Open Issues: 0
Requires
- php: ^8.0
- ext-pdo: *
- illuminate/contracts: ^9.0|^10.0
- illuminate/database: ^9.0|^10.0
- illuminate/support: ^9.0|^10.0
- laravel/scout: ^9.0|^10.0
- staudenmeir/laravel-cte: ^1.0
- wamania/php-stemmer: ^2.0|^3.0
Requires (Dev)
- orchestra/testbench: ^7.0|^8.0
- squizlabs/php_codesniffer: ^3.5
README
This package provides a generic Laravel Scout driver which performs full-text search on indexed model data using an SQL database as storage backend. Indexed data is stored in normalized form, allowing efficient and fuzzy search which does not require a full and/or exact match.
This driver is an alternative to teamtnt/laravel-scout-tntsearch-driver
.
The primary difference is that this driver provides fewer features (like geo search).
Instead, it works with all database systems supported by Laravel itself (which are basically all PDO drivers).
Also, the search algorithm is slightly different.
All tests are run through GitHub Actions for PHP 8.0, 8.1 and 8.2 on the following database systems:
- SQLite 3
- MySQL 8.0
- PostgreSQL 13.1
- SQL Server 2017
Actual limitations regarding supported database systems are mostly related to the use of Common Table Expression using staudenmeir/laravel-cte. Please make sure your database system is supported before using the package, or you might run into database errors.
Installation
You can install the package via composer:
composer require namoshek/laravel-scout-database
After installing the package, the configuration file as well as the migrations need to be published:
php artisan vendor:publish --provider="Namoshek\Scout\Database\ScoutDatabaseServiceProvider" --tag="config" php artisan vendor:publish --provider="Namoshek\Scout\Database\ScoutDatabaseServiceProvider" --tag="migrations"
If you would like to use a different table prefix than scout_
for the tables created by this package,
you should change the configuration as well as the copied migrations accordingly.
When you have done so, you can then apply the database migrations:
php artisan migrate
Upgrading from v0.x
to v1.x
Migrations
With the new version, the database schema has changed and new migrations need to be published using:
php artisan vendor:publish --provider="Namoshek\Scout\Database\ScoutDatabaseServiceProvider" --tag="migrations"
The same hint as mentioned above in the Installation section applies to the newly published migrations as well.
Config
The configuration has been reduced slightly and you might want to compare the new configuration file with the old one to remove obsolete settings. Skipping this part has no negative impact on the performance of the Scout driver, though.
Commands
The \Namoshek\Scout\Database\Commands\CleanWordsTable::class
command has been removed and you should un-schedule it, if you added it previously.
Noteworthy
Most occurrences of protected
fields and methods have been changed to private
to simplify development in regard to backwards-compatibility breaking changes in the future.
If you have not been actively overriding parts of the implementation, this does not affect you at all.
Configuration
In order to instruct Scout to use the driver provided by this package, you need to change the driver
option in config/scout.php
to database
. If you did not change the Scout configuration file, you can also set the SCOUT_DRIVER
environment variable to database
instead.
All available configuration options of the package itself can be found in config/scout-database.php
.
The options are described thoroughly in the file itself. By default, the package uses the UnicodeTokenizer
and the PorterStemmer
which is suitable for the English language. The search adds a trailing wildcard to the
last token and not all search terms need to be found in order for a document to show up in the results (there must be at least one match though).
You may also add a wildcard to each search token by enabling wildcard_all_tokens
in the config file altough this is not recommended for performance reasons.
A basic installation most likely does not require you to change any of these settings. Just to make sure, you should have a look at the
connection
option though. If you want to change this, do so before running the migrations or the tables will be created using the wrong
database connection.
Supported Tokenizers
Currently, only a UnicodeTokenizer
is available. It will split strings at any character which is neither
a letter, nor a number according to the \p{L}
and \p{N}
regex patterns. This means that dots, colons, dashes, whitespace, etc. are split criteria.
If you have different requirements for a tokenizer, you can provide your own implementation via the configuration. Just make sure it implements the
Tokenizer
interface.
Supported Stemmers
Currently, all stemmers implemented by the wamania/php-stemmer
package are available. A wrapper class
has been added for each of them:
DanishStemmer
DutchStemmer
EnglishStemmer
FrenchStemmer
GermanStemmer
ItalianStemmer
NorwegianStemmer
NullStemmer
(can be used to disable stemming)PorterStemmer
(default, same asEnglishStemmer
)PortugueseStemmer
RomanianStemmer
RussianStemmer
SpanishStemmer
SwedishStemmer
If you have different requirements for a stemmer, you can provide your own implementation via the configuration. Just make sure it implements the
Stemmer
interface.
Usage
The package follows the available use cases described in the official Scout documentation. Please be aware of the listed limitations though.
How does it work?
The Indexing
The search driver internally uses a single table, which contains terms and the association to documents. When indexing documents (i.e. adding
or updating models in the search index) the engine will use the configured tokenizer to split the input of each column into tokens.
The tokenizer configured by default simply splits inputs into words consisting of any unicode letter or number, which means any other character
like ,
, .
, -
, _
, !
, ?
, /
, whitespace and all other special characters are considered separators for the tokens and will be removed
by the tokenizer. This way such characters will never end up in the search index itself.
After the inputs have been tokenized, each token (and at this point we actually expect our tokens to be words) is run through the configured
stemmer to retrieve the stem (i.e. root word). Performing this action allows us to search for similar words later.
The PorterStemmer
for example will produce intellig
as output for both intelligent
as well as
intelligence
as input. How this helps when searching will be clear in a moment.
Finally, the results of this process are stored in the database. The index table is filled with the results of the stemming process and the associations to the indexed models (model type and identifier). On top of that, for each row in the index, the database also contains the number of occurences in a document. We use this information for scoring within the search part of our engine.
The Search
When executing a search query, the same tokenizing and stemming process as used for indexing is applied to the search query string. The result of this process is a list of stems (or root words) which are then used to perform the actual search. Depending on the configuration of the package, the search will return documents which contain at least one or all of the stems. This is done by calculating a score for each match in the index based on the inverse document frequency (i.e. the ratio between indexed documents and documents containing one of the searched words), the term frequency (i.e. the number of occurrences of a search term within a document) and the term deviation (which is only relevant for the wildcard search). Returned are documents ordered by their score in descending order, until the desired limit is reached.
Extending the Search Index
It is possible to extend the search index table (scout_index
) with custom columns.
During indexing, these columns may be filled with custom content and during searching the searches can be scoped to these columns (exact match).
This feature is particularly useful when working with a multi-tenancy application where the search index is used by multiple tenants.
Example Migration
In our example, we add a mandatory tenant_id
column to the search index.
return new class extends Migration { public function up(): void { Schema::table('scout_index', function (Blueprint $table) { $table->uuid('tenant_id'); }); } public function down(): void { Schema::table('scout_index', function (Blueprint $table) { $table->dropColumn(['tenant_id']); }); } };
Indexing Example
The tenant_id
is added during indexing for each model:
class User extends Model { public function toSearchableArray(): array { return [ 'id' => $this->id, 'name' => $this->name, 'tenant_id' => new StandaloneField($this->tenant_id), ]; } }
Search Example
The tenant_id
is filtered during search based on the $tenantId
, which may for example be taken from the HTTP request:
User::search('Max Mustermann') ->where('tenant_id', $tenantId) ->get();
Limitations
Obviously, this package does not provide a search engine which (even remotely) brings the performance and quality a professional search engine like Elasticsearch offers. This solution is meant for smaller to medium-sized projects which are in need of a rather simple-to-setup solution.
Also worth noting, the following Scout features are currently not implemented:
- Soft Deletes
- Search custom index using
User::search('Mustermann')->within('users_without_admins')
- Search with custom order using
User::search('Musterfrau')->orderBy('age', 'desc')
- Implementing this feature would be difficult in combination with the scoring algorithm. Only the result of the database query could be ordered, while this could then lead to issues with pagination.
Known Issues
One issue with this search engine is that it can lead to issues if multiple queue workers work on the indexing of a single document concurrently (database will deadlock). To circumvent this issue, a the number of attempts used for transactions is configurable. By default, each transaction is tried a maximum of three times if a deadlock (or any other error) occurs.
License
The MIT License (MIT). Please see License File for more information.