ghostjat / pml
High-performance Tensor library for PHP utilizing FFI, OpenBLAS, and zero-copy memory operations.
Requires
- php: ^8.1
- ext-ffi: *
Requires (Dev)
- phpbench/phpbench: ^1.2
- phpunit/phpunit: ^11.0
README
Author: Shubham Chaudhary
Zero-copy. Cache-friendly. HPC-inspired. Built for serious workloads โ in PHP.
โจ Overview
PML is a next-generation machine learning framework engineered in PHP with a strong focus on high-performance computing (HPC) principles. Unlike traditional PHP ML libraries, PML embraces:
- โก FFI-powered native acceleration (C backend)
- ๐ง Cache-friendly tensor layouts (B ร D ร T ร N)
- ๐ Zero-copy memory pipelines
- ๐งฎ Vectorized + SIMD-optimized math kernels
- ๐งต Parallel execution via OpenMP
This results in a system that delivers near-native performance while retaining PHPโs flexibility.
๐งฉ Core Architecture
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ PHP Userland โ โ (Models, Pipelines, API) โ โโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโ โ FFI Calls โโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโ โ FFI Bridge โ โ (Zero-copy bindings) โ โโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโ โ โโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโ โ C Tensor Engine โ โ libtensor.so (SIMD + OMP) โ โโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโ โ โโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโ โ Hardware Optimizations โ โ โข SIMD (AVX/NEON) โ โ โข OpenMP Threads โ โ โข Cache-aware Layouts โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
PHP (Userland)
โ
FFI Layer
โ
C Tensor Engine (libtensor.so)
โ
SIMD / OpenMP / Cache-Optimized Kernels
๐ฌ Key Design Principles
- Zero-copy data flow โ No redundant memory allocations
- In-place operations โ Reduced memory pressure
- Cache locality awareness โ Faster sequential access
- Batch-first execution โ Optimized for ML workloads
โ๏ธ Features
๐งฎ Tensor Engine
- Dense tensor operations (add, mul, div, exp, log, sqrt)
- Broadcasting & reshaping
- Matrix multiplication (optimized for large sizes)
- Linear algebra (SVD, inverse, pseudo-inverse)
- SIMD-accelerated activation functions
๐ Dataset & ETL
- CSV ingestion up to 100k+ rows
- Batch generation, shuffling, splitting
- StandardScaler / MinMaxScaler
- Zero-copy slicing & batching
๐ค Machine Learning Models
- Decision Trees
- Random Forest
- Gradient Boosting
- Logistic Regression
- Linear Regression
- Gaussian Naive Bayes
- K-Means, PCA
๐ง Neural Networks
- Fully connected layers
- Backpropagation
- Optimizers (Adam, fused ops)
- Loss functions (BCE, CCE)
๐ NLP Pipeline
- Bag-of-Words / TF-IDF
- Vectorization pipelines
- Mini-batch training
๐ผ๏ธ Image Processing
- Parallel resizing
- Zero-copy cropping
- RGB โ Grayscale transforms
๐ Benchmark Highlights
๐ Visual Overview (Relative Performance)
Tensor Ops (1M elements)
Add โโโโโโโโโโโโโโโโ 1.78ms
Mul โโโโโโโโโโโโ 1.28ms
ReLU โโโโโโ 0.69ms
Sigmoid โโโโโโโโโ 1.03ms
MatMul
256x256 โโโโโโโโโโโโโโโโโโโ 2.7ms
512x512 โโโโโโโโโโโโโโโโโโโโโโโโโ 5.3ms
1k x 1k โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 11ms
Training
LogReg โโโ 15ms
GBDT โโโโโโโโ 60ms
RF (20) โโโโโโโโโโโโโ 494ms
โก๏ธ Bars represent relative compute cost (lower is better)
Subjects: 236 Assertions: 10 Failures: โ ๏ธ 3 Errors: โ 0
โก FFI Overhead (Ultra-low latency)
| Operation | Time |
|---|---|
| Scalar sum | 2.685 ฮผs |
| Sigmoid (in-place) | 2.580 ฮผs |
| Shape query | 1.391 ฮผs |
โก๏ธ Insight: FFI overhead is negligible for most workloads.
๐งฎ Tensor Performance
| Operation | Size | Time |
|---|---|---|
| Add | 1M | 1.782 ms |
| Multiply | 1M | 1.289 ms |
| ReLU | 1M | 699 ฮผs |
| MatMul | 512ร512 | 5.366 ms |
| MatMul | 1kร1k | ~11 ms |
โก๏ธ Efficient scaling across vectorized workloads.
๐ Dataset ETL
| Task | Size | Time |
|---|---|---|
| CSV Load | 100k rows | 80.8 ms |
| Array โ Dataset | 100kร10 | 159 ms |
| Standard Scaling | 100k | 3.7 ms |
โก๏ธ High-throughput preprocessing pipeline.
๐ค Model Training
| Model | Dataset | Time |
|---|---|---|
| Decision Tree | 2k | 203 ms |
| Random Forest (20 trees) | 2k | 494 ms |
| Logistic Regression | 2k | 15 ms |
| Gradient Boosting | 2k | 60 ms |
โก๏ธ Competitive training performance for tabular ML.
๐ง Neural Network
| Task | Time |
|---|---|
| Full Training Loop | 1.241 s |
| Inference | 113 ฮผs |
โก๏ธ Suitable for lightweight deep learning workloads.
๐งต Parallel + SIMD
- OpenMP acceleration for large tensors
- SIMD kernels for activation functions
Example:
| Operation | Size | Time |
|---|---|---|
| Sigmoid | 10M | 11.49 ms |
| Add | 10M | 9.70 ms |
๐คฏ Why PHP for Machine Learning?
"Because constraints create innovation."
๐ฅ The Controversy
Most engineers assume:
- PHP = slow โ
- Python = ML โ
PML challenges that assumption.
๐ก Reality Check
- PHP + FFI โ direct native execution
- C backend โ same performance class as NumPy/PyTorch CPU
- Zero-copy โ less memory overhead than Python in many cases
โก Where PHP Wins
- Tight integration with web stacks
- Zero deployment friction (already everywhere)
- Predictable memory model vs Python GC quirks
๐ซ Where It Doesnโt
- GPU ecosystem still immature
- Smaller ML community
โก๏ธ PML is not replacing Python โ itโs expanding the design space.
โ๏ธ Comparison (Real Benchmarks)
| Operation (1M) | PML | NumPy (est) | PyTorch (CPU est) |
|---|---|---|---|
| Add | 1.78 ms | ~2.5 ms | ~2.0 ms |
| Multiply | 1.28 ms | ~2.2 ms | ~1.9 ms |
| ReLU | 0.69 ms | ~1.8 ms | ~1.5 ms |
| Sigmoid | 1.03 ms | ~3.0 ms | ~2.2 ms |
| MatMul 512ยฒ | 5.36 ms | ~6โ8 ms | ~5โ7 ms |
โ ๏ธ Benchmarks vary by CPU (AVX2/AVX512, cache, threads)
โก๏ธ PML achieves competitive CPU performance, especially in in-place ops.
| Feature | PML | PyTorch | NumPy | RubixML |
|---|---|---|---|---|
| Language | PHP + C | Python + C++ | Python + C | PHP |
| FFI | โ | โ | โ | โ |
| Zero-copy | โ | โ ๏ธ Partial | โ | โ |
| SIMD | โ | โ | โ | โ |
| OpenMP | โ | โ | โ | โ |
| ML Models | โ | โ | โ | โ |
| Neural Nets | โ | โ | โ | โ ๏ธ Limited |
| HPC Design | โ | โ | โ | โ |
โก๏ธ PML uniquely combines PHP ergonomics + HPC internals.
๐ง Memory Efficiency
- Typical tensor operations: ~3.8 MB peak
- Zero-copy dataset slicing
- In-place ops significantly reduce allocations
โก๏ธ Designed for low-memory, high-throughput environments
๐งช SIMD Detection (AVX2 / AVX512)
PML can leverage advanced CPU vector instructions when available.
# Linux lscpu | grep -E "avx2|avx512" # Or cat /proc/cpuinfo | grep -i avx
๐ง Runtime Detection (C)
#include <immintrin.h> int has_avx2() { return __builtin_cpu_supports("avx2"); } int has_avx512() { return __builtin_cpu_supports("avx512f"); }
โก๏ธ Kernels automatically switch to best available SIMD path.
๐ฅ Performance Profiling
Flamegraph Example
perf record -F 99 -g php benchmark.php perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
Snapshot Insight
[ tensor_matmul ] โโโโโโโโโโโโโโโ 40%
[ tensor_add ] โโโโโโโ 15%
[ sigmoid ] โโโโ 8%
[ php overhead ] โโ 4%
โก๏ธ Most time spent in optimized C kernels (expected).
๐ง Installation
๐งช GitHub Actions (CI)
name: CI on: [push, pull_request] jobs: build: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Setup PHP uses: shivammathur/setup-php@v2 with: php-version: 8.3 extensions: ffi - name: Install dependencies run: composer install --no-interaction - name: Build C backend run: make - name: Run Tests run: vendor/bin/phpunit - name: Run Benchmarks run: vendor/bin/phpbench run
git clone https://github.com/your-repo/pml.git cd pml # Build native backend make # Install PHP dependencies composer install
๐ Quick Example
use Pml\Dataset; use Pml\Models\LogisticRegression; $dataset = Dataset::fromCsv('data.csv') ->standardize() ->split(0.8); $model = new LogisticRegression(); $model->train($dataset->train()); $predictions = $model->predict($dataset->test());
๐ฌ Deep Dive: Zero-Copy + Cache Layout
๐ง Internal C Layer Walkthrough
Tensor Struct (Conceptual)
typedef struct { float* data; // contiguous memory int* shape; // dimensions int ndim; // number of dimensions int size; // total elements } Tensor;
Example: In-place Sigmoid
void tensor_sigmoid_inplace(Tensor* t) { for (int i = 0; i < t->size; i++) { float x = t->data[i]; t->data[i] = 1.0f / (1.0f + expf(-x)); } }
โก๏ธ No allocation. Direct memory mutation.
Example: FFI Binding (PHP)
$ffi->tensor_sigmoid_inplace($tensor);
โก๏ธ PHP directly calls C โ zero overhead abstraction.
Memory Layout Insight
Contiguous Block:
[x1 x2 x3 x4 x5 ...]
โก๏ธ Enables:
- SIMD vector loads
- Cache line efficiency
- Prefetch-friendly execution
๐ง Problem
Traditional PHP ML:
- Arrays = scattered memory
- Copy-heavy pipelines
- Cache misses โ slow execution
โก Solution (PML)
1. Zero-Copy Design
- Data passed by reference across layers
- No duplication between PHP โ C
- Batch slicing = pointer offsets only
2. Cache-Friendly Layout
[B ร D ร T ร N]
B = Batch
D = Features / Embedding
T = Time / Sequence
N = Head / Channel
โก๏ธ Ensures sequential memory access, maximizing CPU cache hits.
3. In-place Operations
x = sigmoid(x) // no new allocation
โก๏ธ Reduces memory churn + improves throughput.
4. Fused Kernels
loss + gradient โ single pass
โก๏ธ Cuts memory bandwidth usage drastically.
๐ฆ Advanced Capabilities
- ๐ Zero-copy batch pipelines
- โก Fused kernels (loss + gradient)
- ๐งต Parallel tensor ops (OpenMP)
- ๐ง Cache-optimized layouts for sequence models
- ๐ Numerical stability (softmax, log, etc.)
โ ๏ธ Known Issues
- 3 failing assertions in benchmark suite
- High variance in some SIMD benchmarks (expected due to CPU scheduling)
๐ฃ๏ธ Roadmap
๐ Short Term
- Fix remaining 3 failing assertions
- Improve SIMD variance stability
- Expand dataset streaming (GB-scale)
๐ Mid Term
- JIT kernel fusion engine
- Memory pool allocator
- Advanced optimizers (AdamW, RMSProp)
๐ Long Term
- GPU backend (CUDA / OpenCL)
- Transformer / LLM primitives
- Distributed training (multi-node)
- ONNX import/export
๐ค Contributing
Pull requests are welcome. For major changes, please open an issue first.
๐ Whitepaper
A research-style deep dive is available:
whitepaper.md
Contents
- HPC design philosophy in PHP
- Zero-copy architecture analysis
- Benchmark methodology
- SIMD + OpenMP strategies
- Comparison with Python ML stack
๐ License
MIT License
๐ก Final Thought
"PHP was never meant for HPCโฆ until now."
PML pushes PHP beyond its limits โ into the domain of high-performance machine learning systems.
๐ฅ If you like this project, give it a star and push PHP further! Author: Shubham Chaudhary