README

A PHP 7.4+ compatible port of OpenAI's tiktoken tokenizer.

This package is a backward-compatible fork of yethee/tiktoken that brings tiktoken functionality to PHP 7.4+, making it available to projects that have not yet migrated to PHP 8.1+.

Features

✅ PHP 7.4+ compatibility (downgraded from PHP 8.1+)
✅ Support for OpenAI model families from GPT-3 through GPT-4o (GPT-3.5, GPT-4, GPT-4o, etc.)
✅ Multiple encoding formats (r50k_base, p50k_base, cl100k_base, o200k_base)
✅ On-disk caching of vocabulary files (system temp directory by default)
✅ Optional FFI-based native library support for better performance
✅ API-compatible with the original yethee/tiktoken PHP port

Installation

composer require purewater2011/tiktoken-php7

Requirements

PHP 7.4 or higher
ext-ffi (optional, for LibEncoder performance boost)

Quick Start

<?php

use Purewater2011\TiktokenPhp7\EncoderProvider;

$provider = new EncoderProvider();

// Get encoder for a specific model
$encoder = $provider->getForModel('gpt-3.5-turbo');
$tokens = $encoder->encode('Hello, world!');
print_r($tokens);
// Output: [9906, 11, 1917, 0]

// Decode tokens back to text
$text = $encoder->decode($tokens);
echo $text; // Output: "Hello, world!"

// Get encoder by encoding name
$encoder = $provider->get('cl100k_base');
$tokens = $encoder->encode('Hello, world!');
print_r($tokens);
// Output: [9906, 11, 1917, 0]

Supported Models

Each supported model family maps to one of the following encodings:

Model Family	Encoding
GPT-4o, GPT-4o mini	o200k_base
GPT-4, GPT-3.5-turbo	cl100k_base
GPT-3 (Davinci, Curie, etc.)	p50k_base
GPT-3 (Ada, Babbage)	r50k_base

Advanced Usage

Encoding in Chunks

For processing large texts, you can encode in chunks:

$encoder = $provider->getForModel('gpt-4');
$chunks = $encoder->encodeInChunks($largeText, 1000); // Max 1000 tokens per chunk

foreach ($chunks as $chunk) {
    echo "Chunk has " . count($chunk) . " tokens\n";
}

Custom Cache Directory

By default, vocabulary files are cached in the system temp directory. You can customize this:

// Via environment variable
putenv('TIKTOKEN_CACHE_DIR=/path/to/cache');

// Or via method call
$provider = new EncoderProvider();
$provider->setVocabCache('/path/to/cache');
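
Either option works best when the target directory already exists and is writable. A minimal sketch of preparing one before passing it to setVocabCache(); the helper name prepareCacheDir and the tiktoken-cache subdirectory name are hypothetical and not part of the package:

```php
<?php

/**
 * Hypothetical helper (not part of the package): returns a writable cache
 * directory, creating it if needed.
 */
function prepareCacheDir(?string $preferred = null): string
{
    if ($preferred === null) {
        // Fall back to TIKTOKEN_CACHE_DIR, then to a subdirectory of the
        // system temp directory (the subdirectory name is an assumption).
        $envDir = getenv('TIKTOKEN_CACHE_DIR');
        $preferred = ($envDir !== false && $envDir !== '')
            ? $envDir
            : sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'tiktoken-cache';
    }

    // The second is_dir() check covers a directory created concurrently.
    if (!is_dir($preferred) && !mkdir($preferred, 0775, true) && !is_dir($preferred)) {
        throw new RuntimeException("Cannot create cache directory: $preferred");
    }
    if (!is_writable($preferred)) {
        throw new RuntimeException("Cache directory is not writable: $preferred");
    }

    return $preferred;
}

// Usage with the provider from the example above:
// $provider->setVocabCache(prepareCacheDir('/path/to/cache'));
```

Failing early here gives a clearer error than a failed vocabulary download or write later on.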

Using Custom Vocabulary Loader

use Purewater2011\TiktokenPhp7\Vocab\Loader\DefaultVocabLoader;

$provider = new EncoderProvider();
$provider->setVocabLoader(new DefaultVocabLoader('/custom/cache/path'));

Performance Optimization with LibEncoder (Experimental)

For better performance with large texts, you can use the FFI-based LibEncoder:

use Purewater2011\TiktokenPhp7\Encoder\LibEncoder;
use Purewater2011\TiktokenPhp7\EncoderProvider;

// Initialize the library path
LibEncoder::init('/path/to/libtiktoken_php.so');

// Use LibEncoder for better performance
$provider = new EncoderProvider(true);
$encoder = $provider->getForModel('gpt-4');
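
Because ext-ffi is optional, it can be worth checking the environment before enabling LibEncoder. A minimal sketch; canUseLibEncoder is a hypothetical helper, and it only checks that the extension is loaded and the file is readable, not that FFI is permitted by the ffi.enable setting:

```php
<?php

/**
 * Hypothetical helper (not part of the package): reports whether the
 * FFI-based encoder can plausibly be used on this machine.
 */
function canUseLibEncoder(string $libPath): bool
{
    // ext-ffi must be loaded and the shared library must exist and be readable.
    return extension_loaded('ffi') && is_file($libPath) && is_readable($libPath);
}

$libPath = '/path/to/libtiktoken_php.so';
$useLib = canUseLibEncoder($libPath);

// With the package installed, the flag can drive the setup shown above:
// if ($useLib) {
//     LibEncoder::init($libPath);
// }
// $provider = new EncoderProvider($useLib);
```

When the check fails, the provider is created without the flag and the pure PHP encoder is used.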

Building the Native Library

If you want to use LibEncoder, you need to build the Rust library:

Requirements

Rust >= 1.85

Build Steps

git clone https://github.com/purewater2011/tiktoken-php7.git
cd tiktoken-php7
cargo build --release

Copy the appropriate binary for your platform (Cargo writes release builds to target/release by default):

libtiktoken_php.so (Linux)
libtiktoken_php.dylib (macOS)
tiktoken_php.dll (Windows)
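
The file name for the current platform can also be selected at runtime. A minimal sketch, assuming the three file names listed above; nativeLibraryFileName is a hypothetical helper, not part of the package:

```php
<?php

/**
 * Hypothetical helper (not part of the package): maps an OS family to
 * the library file name listed above.
 */
function nativeLibraryFileName(string $osFamily = PHP_OS_FAMILY): string
{
    switch ($osFamily) {
        case 'Linux':
            return 'libtiktoken_php.so';
        case 'Darwin': // macOS
            return 'libtiktoken_php.dylib';
        case 'Windows':
            return 'tiktoken_php.dll';
        default:
            throw new RuntimeException("No known library name for OS family: $osFamily");
    }
}

echo nativeLibraryFileName('Linux'), "\n"; // libtiktoken_php.so

// Assuming Cargo's default output directory inside the cloned repository:
// $libPath = __DIR__ . '/target/release/' . nativeLibraryFileName();
```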

Token Counting Examples

$provider = new EncoderProvider();
$encoder = $provider->getForModel('gpt-3.5-turbo');

// Count tokens in a message
$message = "How many tokens is this?";
$tokenCount = count($encoder->encode($message));
echo "Token count: $tokenCount\n";

// Useful for staying within API limits
$maxTokens = 4096;
$prompt = "Your long prompt here...";
$promptTokens = count($encoder->encode($prompt));

if ($promptTokens > $maxTokens) {
    echo "Prompt too long! Tokens: $promptTokens, Max: $maxTokens\n";
}
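
A model's limit typically covers the prompt and the reply together, so it can help to reserve room for the reply when checking a prompt. A minimal sketch; promptFitsBudget and the $countTokens callable are hypothetical, and the word-based counter is only a stand-in so the snippet runs without the package installed:

```php
<?php

/**
 * Hypothetical helper: true when the prompt plus a reserved reply budget
 * fits within the token limit.
 *
 * @param callable $countTokens function (string): int
 */
function promptFitsBudget(callable $countTokens, string $prompt, int $maxTokens, int $reservedForReply = 0): bool
{
    return $countTokens($prompt) + $reservedForReply <= $maxTokens;
}

// Stand-in counter: one "token" per whitespace-separated word. With the
// package installed, wrap the real encoder instead:
// $countTokens = function (string $text) use ($encoder): int {
//     return count($encoder->encode($text));
// };
$countTokens = function (string $text): int {
    return count(preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY));
};

var_dump(promptFitsBudget($countTokens, 'How many tokens is this?', 4096, 500)); // bool(true)
var_dump(promptFitsBudget($countTokens, 'How many tokens is this?', 5, 1));      // bool(false)
```

Passing the counter as a callable keeps the check independent of which encoder is in use.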

Differences from Original

This package keeps the API of the original yethee/tiktoken, with these changes:

PHP 7.4+ compatibility instead of PHP 8.1+
Updated namespace: Purewater2011\TiktokenPhp7 instead of Yethee\Tiktoken
Compatible dependency versions for PHP 7.4
PHP 8.1+ syntax rewritten as PHP 7.4-compatible code

Migration Guide

If you're migrating from yethee/tiktoken:

Update your composer requirement:

composer remove yethee/tiktoken
composer require purewater2011/tiktoken-php7

Update namespace imports:

// Old
use Yethee\Tiktoken\EncoderProvider;

// New
use Purewater2011\TiktokenPhp7\EncoderProvider;

All other usage remains identical!
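
For codebases with many imports, the namespace update can be scripted. A minimal sketch of a one-off rewrite; migrateNamespace is a hypothetical helper, and it only replaces references that continue past the namespace root (for example in use statements and fully qualified class names):

```php
<?php

/**
 * Hypothetical one-off migration helper: rewrites references to the old
 * namespace in a PHP source string.
 */
function migrateNamespace(string $source): string
{
    return str_replace('Yethee\\Tiktoken\\', 'Purewater2011\\TiktokenPhp7\\', $source);
}

$old = 'use Yethee\Tiktoken\EncoderProvider;';
echo migrateNamespace($old), "\n";
// use Purewater2011\TiktokenPhp7\EncoderProvider;

// Applying it to a file (review the diff before committing):
// file_put_contents($path, migrateNamespace(file_get_contents($path)));
```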

Limitations

GPT-2 encoding is not supported
Special tokens (like <|endofprompt|>) are not supported
LibEncoder::encodeInChunks() method is not yet implemented
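
Until LibEncoder::encodeInChunks() is available, one naive workaround is to encode the whole text and split the resulting token list. This is a sketch under the assumption that encode() returns a flat list of integers, as in the Quick Start output. A split may fall in the middle of a multi-byte character, so individual chunks are not guaranteed to decode to valid UTF-8. The function name chunkTokens is hypothetical:

```php
<?php

/**
 * Hypothetical fallback: splits an already-encoded token list into
 * chunks of at most $maxTokensPerChunk tokens.
 *
 * @param int[] $tokens
 * @return int[][]
 */
function chunkTokens(array $tokens, int $maxTokensPerChunk): array
{
    if ($maxTokensPerChunk < 1) {
        throw new InvalidArgumentException('Chunk size must be at least 1.');
    }

    return array_chunk($tokens, $maxTokensPerChunk);
}

// Token IDs taken from the Quick Start example.
$chunks = chunkTokens([9906, 11, 1917, 0], 3);
echo count($chunks), "\n"; // 2

// With the package installed:
// $chunks = chunkTokens($encoder->encode($largeText), 1000);
```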

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT

Credits

Original tiktoken implementation by OpenAI
PHP port by yethee
PHP 7.4 compatibility by purewater2011
