jpuck / etl
Tools for extract, transform, and load processes.
Installs: 41
Dependents: 0
Suggesters: 0
Security: 0
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 12
pkg:composer/jpuck/etl
Requires
- php: ^7.0
- jpuck/phpdev: ^1.3
- sabre/xml: ^1.4
Requires (Dev)
- phpunit/phpunit: ^5.5
README
This is a collection of PHP 7 classes useful for extracting, transforming, and loading data between sources.
Hierarchical XML and JSON can be automatically converted to relational SQL. Support includes extracting data documents from a file system or REST API, and then loading the data into a DBMS like Microsoft SQL Server.
Values are surveyed for datatypes, numeric cardinality, and unique natural key candidates. Then this information is used to create a normalized multi-table database structure suited to insert the data.
| Branch | Tests | Code Coverage |
|---|---|---|
| master | ||
| dev |
Requirements
PHP >= 7.0
Installation
This library is registered on packagist and can be easily installed into your project using composer.
composer require jpuck/etl
Getting Started
There are 3 basic groups of interrelated classes: Sources provide Data which have Schemata.
-
Sources
Sources extend the abstract
Sourceclass and transportDatumobjects. In particular, the abstractDBclass has concrete class implementations such asMicrosoftSQLServer. -
Data
Data classes extend
Datumand must implement a valid parser, satisfied byParseValidator. It uses theSchematizerto construct the object from raw data, which can be overridden by passing an existingSchema. -
Schemata
A
Schemais a concrete class with aValidatorto enforce structure. TheMergerclass is for combining Schemas to create super-set Schemas. TheDDLtrait is used by theDBclass to generate SQL Data Definition Language which contains abstract methods to be implemented by a specific database management system.
Schematizer
The Schematizer class is for surveying the structure of the data.
It includes node names, the count of distinct element groupings,
numeric cardinality for relationships between subnodes, and descriptive
statistics about the values including uniqueness. Categorically, it recognizes
datetime, integer, and decimal datatypes. Decimals will include scale and
precision measurements suitable for SQL.
Schematizer::getPrecision returns the scale and precision of
numeric values suitable for the SQL DECIMAL(scale,precision) datatype.
This function has notable behavior in that trailing zeros are discarded
when passed as raw PHP float types. However, when passed as a string, then the
trailing zeros are preserved in the precision.
See SchematizerUtilitiesTest::precisionDataProvider for examples.
Note that in the XML class, parsed values are represented as strings
in PHP, so trailing zeros should be represented in the precision values.
node name
├── count
│ ├── max
│ │ ├── measure
│ │ └── value
│ └── min
│ ├── measure
│ └── value
├── unique (all values)
├── primaryKey
├── varchar ────┐
│ ├── max │
│ │ ├── measure │
│ │ └── value │
│ └── min │
│ ├── measure │
│ └── value │
├── datetime ├ datatypes
│ ├── max │
│ │ └── value │
│ └── min │
│ └── value │
├── int/decimal │
│ ├── max │
│ │ └── value │
│ └── min │
│ └── value ────┘
├── scale ────┐
│ ├── max │
│ │ ├── measure │
│ │ └── value │
│ └── min │
│ ├── measure │
│ └── value │
├── precision ├ if decimal
│ ├── max │
│ │ ├── measure │
│ │ └── value │
│ └── min │
│ ├── measure │
│ └── value ────┘
├── children
│ ├── distinct (count of children)
│ └── count
│ ├── max
│ │ └── measure
│ └── min
│ └── measure
├── attributes
│ └── ... (excluding count, which must be 1)
└── elements
└── ...
Database Connections
The DB class requires an instance of PDO in the constructor to connect,
but it is possible to pass a null value if only utilizing the class for DDL.
SQL Data Definition Language
When one-to-many XML nodes are used to represent one-to-one relationships, then
the Schematizer recognizes this and a DDL class flattens them as columns
on a table. If a node has more than one of its name or grandchildren, then the
one-to-many relationship is preserved in a separate normalized table. Surrogate
keys are created to maintain the Primary/Foreign Key referential integrity.
If the Schema has a primaryKey set, then that field will be used for DDL
generation instead of the surrogate. However, this Schema must also be passed
to the Datum constructor prior to being used with DB::insert, otherwise the
surrogate keys will be used by default and will result in a failed insertion if
the surrogate columns don't exist.
Saving Schemas
Generating Schemas can take a long time and may require customization, such
as adding primaryKey flags or removing unwanted fields to be ignored. Here are
some examples for exporting and importing:
$xml = file_get_contents("sample.xml"); $xml = new XML($xml); $schema = $xml->schema(); // normal JSON_PRETTY_PRINT echo $schema;
By simply echoing the object, output can be redirected to a file from console:
php script.php > myschema.json
Use file_put_contents to write to disk.
Schema::toJSON accepts all the json_encode options.
$string = $schema->toJSON(JSON_UNESCAPED_UNICODE); file_put_contents('myschema.json', $string); // native php array $array = var_export($schema->toArray(), true); $array = "<?php return $array;"; file_put_contents('myschema.php', $array);
Import any of those formats the same way by passing the filename, an array, or a JSON string to the constructor.
$schemas []= new Schema('myschema.php'); $schemas []= new Schema('myschema.json'); $schemas []= new Schema($schema->toArray()); $schemas []= new Schema($schema->toJSON()); $schemas []= new Schema($schema->toJSON(JSON_PRETTY_PRINT)); foreach($schemas as $s){ var_dump($schema == $s); }
Override the internal Schematizer by passing a Schema to the Datum
constructor.
$schema = new Schema('myschema.json'); $xml = file_get_contents("sample.xml"); $xml = new XML($xml, $schema);
You can also pass the Schema override through Source::fetch
$credentials = [ 'url' => 'https://api.example.com', 'username' => 'user', 'password' => 'P@55w0rd', ]; $source = new REST($credentials); $xml = $source->fetch('endpoint', XML::class, new Schema('myschema.json'));
Development
The development dependencies can be installed by running composer with or
without the --dev option (enabled by default).
composer install --dev
Testing
Tests are written for PHPUnit which is included as a composer dev-dependency. To run the whole test suite, then execute this command from the shell console:
php vendor/bin/phpunit
You might also be interested in an easy to read checklist output:
php vendor/bin/phpunit --testdox
When stepping through breakpoints in an IDE, like Netbeans, it's helpful to see the current test name output by setting the run configuration to debug:
php vendor/bin/phpunit --debug
A code coverage report is available if you have the xdebug extension
installed. In addition to the console text summary report, a full HTML GUI is
generated to explore in the coverage folder. The easiest way to view this is
to boot up a dev server:
php -S localhost:8080 -t coverage/
Database Testing
You must create the file (or symbolic link) tests/data/pdos/pdo.php in order
to run the database tests. This should simply return a PDO instance,
for example:
<?php return (function(){ $hostname = 'sql.example.com'; $database = 'mydb'; $username = 'user'; $password = 'P@55w0rd'; // https://www.microsoft.com/en-us/download/details.aspx?id=50419 $driver = 'ODBC Driver 13 for SQL Server'; $pdo = new PDO("odbc:Driver=$driver; Server=$hostname; Database=$database", $username, $password ); $pdo->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION ); return $pdo; })();