mkcg / php-query-model
Agnostic model to efficiently query and scroll any kind of data (SQL, Search engine, HTTP API, CSV, ...) and push them anywhere with a ETL
Installs: 1 199
Dependents: 0
Suggesters: 0
Security: 0
Stars: 6
Watchers: 3
Forks: 2
Open Issues: 2
pkg:composer/mkcg/php-query-model
Requires
- php: ^7.2.0
- psr/http-client: ^1.0
Suggests
- algolia/algoliasearch-client-php: Needed to use the algolia driver
- doctrine/dbal: Needed to use the doctrine driver
- ethanhann/redisearch-php: Needed to use the redisearch driver
- guzzle/guzzle: Needed to use the Guzzle adapter
- guzzlehttp/guzzle: Needed to use the GuzzleHttp adapter
- mongodb/mongodb: Needed to use the Mongodb driver
- predis/predis: Needed to use the redis driver
README
Simple multi-database library to search content on different engines and aggregate those results into document-oriented structures.
The library define the class MKCG\Model\DBAL\QueryEngine to build documents using different Drivers.
It also defines a simple ETL to be able to easily and efficiently synchronize content between different datasources.
Engine API
The QueryEngine API define two methods : query() and scroll().
Each one can fetch and build documents using the provided \MKCG\Model\Model with the appropriate \MKCG\Model\DBAL\QueryCriteria.
However, the scroll() method return a \Generator and internally performs multiple batches to efficiently scroll big collections.
Examples
$model = Schema\User::make('default', 'user') ->with(Schema\Address::make()) ->with(Schema\Post::make()); $criteria = (new QueryCriteria()) ->forCollection('user') ->addFilter('status', FilterInterface::FILTER_IN, [ 2 , 3 , 5 , 7 ]) ->addFilter('registered_at', FilterInterface::FILTER_GREATER_THAN_EQUAL, '2000-01-01') ->addSort('firstname', 'ASC') ->addSort('lastname', 'ASC') ->setLimit(10) ->forCollection('addresses') ->setLimitByParent(2) ->forCollection('posts') ->addFilter('title', FilterInterface::FILTER_FULLTEXT_MATCH, 'ab') ; $users = $engine->query($model, $criteria); echo json_encode($users->getContent(), JSON_PRETTY_PRINT) . "\n"; echo "\nFound : " . $users->getCount() . " users\n"; $iterator = $engine->scroll($model, $criteria); foreach ($iterator as $user) { echo json_encode($user, JSON_PRETTY_PRINT) . "\n"; }
Drivers definition
Each Driver is responsible to perform queries on a single datasource (database, HTTP API, local files, ...) and must :
- implements the MKCG\Model\DBAL\Drivers\DriverInterface
- registered inside the QueryEngine
Example
use MKCG\Model\DBAL\QueryEngine; use MKCG\Model\DBAL\Drivers; $mongoClient = new MongoDB\Client('mongodb://root:password@mongodb'); $redisClient = new \Predis\Client([ 'scheme' => 'tcp', 'host' => 'redisearch', 'port' => 6379 ]); $sqlConnection = \Doctrine\DBAL\DriverManager::getConnection([ 'user' => 'root', 'password' => 'root', 'host' => 'mysql', 'driver' => 'pdo_mysql', ]); $engine = (new QueryEngine('mysql')) ->registerDriver(new Drivers\Doctrine($sqlConnection), 'mysql') ->registerDriver(new Drivers\CsvReader($fixturePath), 'csv') ->registerDriver(new Drivers\RssReader(new Adapters\Guzzle), 'rss') ->registerDriver(new Drivers\SitemapReader(new Adapters\Guzzle), 'sitemap') ->registerDriver(new Drivers\Http(new Adapters\Guzzle), 'http') ->registerDriver(new Drivers\HttpRobot(new Adapters\Guzzle), 'http_robot') ->registerDriver(new Drivers\MongoDB($mongoClient), 'mongodb')
Runtime behaviors
Drivers
| Type | Name | Description | 
|---|---|---|
| Document-oriented database | MongoDB | Driver for MongoDB 3.6+ | 
| Relational database | Doctrine | Doctrine DBAL Adapter (MySQL, MariaDB are supported, other might not) | 
| Search engine | Elasticsearch | Driver for Elasticsearch 5+ (Work in progress) | 
| Search engine | Redisearch | Driver for redisearch | 
| File reader | CsvReader | |
| HTTP | Http | Interact with remote url (Work in progress) | 
| HTTP | HttpRobot | Parse robots.txt from remote url (Work in progress) | 
| RSS | RssReader | Extract RSS from remote url | 
| Sitemap | SitemapReader | Extract Sitemap urlset from remote url | 
Features supported by driver
| Driver | Scrollable | Filterable | Sortable | Aggregatable | Count | 
|---|---|---|---|---|---|
| CsvReader | YES | YES | NO | YES | YES | 
| Doctrine | YES | YES | YES | YES | YES | 
| Http | YES | YES | NO | NO | NO | 
| HttpRobot | YES | NO | NO | NO | NO | 
| MongoDB | YES | YES | YES | YES | YES | 
| Redisearch | YES | YES | YES | YES | YES | 
| RssReader | YES | YES | NO | NO | NO | 
| SitemapReader | YES | YES | NO | NO | NO | 
| Algolia | |||||
| Cassandra | |||||
| Elasticsearch | |||||
| PostgreSQL | |||||
| ScyllaDB | |||||
| Solr | 
Query criteria options
HTTP-based drivers :
- HTTP
- HttpRobot
- RssReader
- SitemapReader
- Elasticsearch
Result-based filterable drivers :
- CsvReader
- RssReader
- SitemapReader
| Option | Drivers | Description | 
|---|---|---|
| case_sensitive | MongoDB, Result-based filterable drivers | Perform case sensitive FILTER_FULLTEXT_MATCHsearch , default :false | 
| filepath | CsvReader | Absolute or relative filepath of the CSV | 
| delimiter | CsvReader | Field delimiter | 
| json_formatter | HTTP | Format JSON response body using a callback | 
| multiple_requests | none , used by the QueryEngine | Disable sub-requests batching when including sub-models | 
| url | HTTP-based drivers | Define the URL to use to query | 
| url_generator | HTTP-based drivers | Use a callback to generate the URL to use based on the Query | 
| max_query_time | HTTP-based drivers , MongoDB | Max query time in milliseconds , default : 5000(5 seconds) | 
| allow_partial | MongoDB , Elasticsearch | Allow partial results to be returned , default : false | 
| readPreference | MongoDB | https://docs.mongodb.com/manual/core/read-preference/index.html | 
| readConcern | MongoDB | https://docs.mongodb.com/manual/reference/read-concern/index.html | 
| batchSize | MongoDB | https://docs.mongodb.com/manual/reference/method/cursor.batchSize/index.html | 
| diacriticSensitive | MongoDB | https://docs.mongodb.com/manual/reference/operator/query/text/index.html | 
| name | all | Overrides the name defined into the SchemaInterface | 
When both url_generator and url are provided, then only url_generator is used.
Filters
| Name | Constant name | Description | 
|---|---|---|
| IN | FILTER_IN | |
| NOT IN | FILTER_NOT_IN | |
| GT | FILTER_GREATER_THAN | |
| GTE | FILTER_GREATER_THAN_EQUAL | |
| LT | FILTER_LESS_THAN | |
| LTE | FILTER_LESS_THAN_EQUAL | |
| MATCH | FILTER_FULLTEXT_MATCH | Text search | 
| CUSTOM | FILTER_CUSTOM | Allow to use a callable to apply complex filters | 
Constants are defined by the interface MKCG\Model\DBAL\FilterInterface
Filters supported by driver
| Driver | IN | NOT IN | GT | GTE | LT | LTE | MATCH | CUSTOM | 
|---|---|---|---|---|---|---|---|---|
| Http | NO | NO | NO | NO | NO | NO | NO | NO | 
| HttpRobot | NO | NO | NO | NO | NO | NO | NO | NO | 
| Doctrine | YES | YES | YES | YES | YES | YES | Interpreted as LIKE "%value%" | YES | 
| Elasticsearch | YES | YES | YES | YES | YES | YES | YES , using elasticsearch matchfilter | WIP | 
| MongoDB | YES | YES | YES | YES | YES | YES | YES , using mongodb $textoperator | YES | 
| Redisearch | YES | YES | YES | YES | YES | YES | YES , using redisearch searchsyntax | NO | 
| CsvReader | YES | YES | YES | YES | YES | YES | Interpreted as LIKE "%value%" | YES | 
| RssReader | YES | YES | YES | YES | YES | YES | Interpreted as LIKE "%value%" | YES | 
| SitemapReader | YES | YES | YES | YES | YES | YES | Interpreted as LIKE "%value%" | YES | 
CUSTOM filter type
Custom filters can be applied by providing a callable to the QueryCriteria instance :
(new QueryCriteria()) ->forCollection('order') ->addCallableFilter(function(Query $query, ...$arguments) { // do something })
The first argument of the callable SHOULD always be the Query instance.
Other arguments might change depending on the driver.
Some Driver apply filters on fetched results and expect a false return value when the filter does not match. Internaly they apply a array_filter on each fetched result before :
- CsvReader
- RssReader
- SitemapReader
callable arguments by Driver
| Driver | First argument | Second argument | 
|---|---|---|
| Doctrine | \MKCG\Model\DBAL\Query | \Doctrine\DBAL\Query\QueryBuilder | 
| CsvReader | \MKCG\Model\DBAL\Query | arrayrepresenting a raw item | 
| RssReader | \MKCG\Model\DBAL\Query | arrayrepresenting a raw item | 
| SitemapReader | \MKCG\Model\DBAL\Query | arrayrepresenting a raw item | 
| MongoDB | \MKCG\Model\DBAL\Quert | arrayrepresenting the filters passed as first argument of\MongoDB\Collection::find() | 
ETL
A deadly simple ETL is defined as a single class \MKCG\Model\ETL.
It can be used in combination with the QueryEngine scroll API to transform then push content to different loaders;
Example
function pipelineEtl(QueryEngine $engine) { $model = Schema\Product::make('default', 'products'); $criteria = (new QueryCriteria()) ->forCollection('products') ->addFilter('sku.color', FilterInterface::FILTER_IN, ['aqua', 'purple']) ; $iterator = $engine->scroll($model, $criteria, 100); $pushed = ETL::extract($engine->scroll($model, $criteria, 100), 1000, 500) ->transform(function($item) { return [ 'id' => $item['_id'], 'sku' => $item['sku'] ]; }) ->transform(function($item) { return $item + [ 'sku_count' => count($item['sku'] ?? []) ]; }) ->load(function(iterable $bulk) { echo sprintf("[ETL] Loader 1 - Loading %d elements\n", count($bulk)); }) ->load(function(iterable $bulk) { echo sprintf("[ETL] Loader 2 - Loading %d elements\n", count($bulk)); }) ->load(function(iterable $bulk) { echo sprintf("[ETL] Loader 3 - Loading %d elements\n", count($bulk)); }) ->run(); echo sprintf("[ETL] Pushed %d elements\n", $pushed); }
Aggregations
| NAME | Description | 
|---|---|
| TERMS | Number of distinct elements by field , with the field filters considered by the aggregation | 
| FACET | Number of distinct elements by field , with the field filters excluded by the aggregation | 
| AVERAGE | Average value of a numeric field | 
| MIN | Min value of a field | 
| MAX | Max value of a field | 
| QUANTILE | Quantile value of a field | 
Aggregrations supported
| Driver | TERMS | FACET | AVERAGE | MIN | MAX | QUANTILE | 
|---|---|---|---|---|---|---|
| CsvReader | NO | NO | YES | YES | YES | NO | 
| Doctrine | YES | YES | YES | YES | YES | YES | 
| MongoDB | YES | YES | YES | YES | YES | YES | 
| Redisearch | YES | YES | YES | YES | YES | YES | 
Test and examples
No tests are provided although some will be made using Behat for the release of the version 1.0.0.
However a fully functionnal example is provided in examples/ and build documents using different kinds of Drivers
From : ./examples
docker-compose up --build -d docker exec -it php_query_model sh -c "cd /home/php-query-model/examples && composer install" docker exec -it php_query_model sh -c "php /home/php-query-model/examples/index.php"
By default this will run only two functions (located in index.php)
pipelineEtl($engine); searchOrder($engine); // searchProducts($engine); // searchGithubRobot($engine); // searchSitemaps($engine); // searchPackages($engine); // searchUsers($engine); // searchHackerNews($engine);
The pipelineEtl use the Engine scroll API to iterates a list of Product stored in MongoDB and apply different transformationsbefore pushing content with three loaders using the ETL component.
The searchOrder use the Engine scroll API to :
- scan a CSVfile containing ecommerceOrder- then inject their corresponding Productstored onMongoDB
- then inject their correspondng customers stored as UserintoMySQL- with their first two defined Addressalso stored inMysql
- and all their Poststored inMysql
 
- with their first two defined 
 
- then inject their corresponding 
You might want to uncomment the other search functions to execute HTTP queries and fetch :
- https://github.com/robots.txt with searchGithubRobot
- https://news.ycombinator.com/ top stories with searchHackerNews
- https://packagist.org/feeds/packages.rss RSS feed with searchPackages
- https://www.sitemaps.org/sitemap.xml Sitemap with searchSitemaps
Roadmap
Expected features
Work In progress
| Feature | Description | 
|---|---|
| Elasticsearch Driver | |
| Callable validation | Validate callable arguments using Reflection and PHP tokens | 
Backlog
| Feature | Description | 
|---|---|
| Async HTTP requests | Perform non-blocking HTTP requests | 
| Lazy requests | Only perform requests when the content is manipulated | 
| Cacheable requests | Cache results and detect what to invalidate using surrogate keys | 
| Content synchronizer | Use streamed eventlog to synchronize content between each datasource | 
| Error handling strategies | Allow to apply different strategy in case of a failure : crash, retry, fallback... | 
| Generate schema classes | Generate Schema classes by analyzing each database schema | 
| Content lifecycle | Allow to create / update / delete content | 
Drivers "nice to have"
Database Drivers
- Algolia
- ArangoDB
- Cassandra
- \Illuminate\Eloquent (library used by Laravel)
- Neo4J
- PostgreSQL
- ScyllaDB
- Solr
Streaming
- Kafka
- MySQL binlog
- RabbitMQ
Storage
- AWS S3
- File system
- OpenIO
Infrastructure
- AWS
- OVH
Service
- Cloudinary
- Sendinblue
Social Network
Contribution
Feel free to open a merge request for any suggestion or to contribute to this project.