purewater2011/tiktoken-php7

PHP 7.4+ compatible version of tiktoken - OpenAI's tiktoken tokenizer ported to PHP

dev-master 2025-09-10 08:05 UTC

Last update: 2025-09-10 08:06:54 UTC



A PHP 7.4+ compatible port of OpenAI's tiktoken tokenizer.

This package is a backward-compatible fork that brings tiktoken functionality to PHP 7.4+, making it accessible to projects that haven't yet migrated to PHP 8.1+.

Features

  • ✅ PHP 7.4+ compatibility (downgraded from PHP 8.1+)
  • ✅ Support for OpenAI models from GPT-3 through GPT-4o (GPT-3.5, GPT-4, GPT-4o, etc.); GPT-2 encoding is excluded
  • ✅ Multiple encoding formats (r50k_base, p50k_base, cl100k_base, o200k_base)
  • ✅ Efficient caching system
  • ✅ Optional FFI-based native library support for better performance
  • ✅ API-compatible with the original yethee/tiktoken package

Installation

composer require purewater2011/tiktoken-php7

Requirements

  • PHP 7.4 or higher
  • ext-ffi (optional, for LibEncoder performance boost)

Quick Start

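<?php

// Composer autoloader; adjust the path to match your project layout
require __DIR__ . '/vendor/autoload.php';

use Purewater2011\TiktokenPhp7\EncoderProvider;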

$provider = new EncoderProvider();

// Get encoder for a specific model
$encoder = $provider->getForModel('gpt-3.5-turbo');
$tokens = $encoder->encode('Hello, world!');
print_r($tokens);
// Output: [9906, 11, 1917, 0]

// Decode tokens back to text
$text = $encoder->decode($tokens);
echo $text; // Output: "Hello, world!"

// Get encoder by encoding name
$encoder = $provider->get('cl100k_base');
$tokens = $encoder->encode('Hello, world!');
print_r($tokens);
// Output: [9906, 11, 1917, 0]

Supported Models

The table below lists the supported model families and the encoding each one uses (GPT-2 is not supported, see Limitations):

Model Family                  | Encoding
GPT-4o, GPT-4o mini           | o200k_base
GPT-4, GPT-3.5-turbo          | cl100k_base
GPT-3 (Davinci, Curie, etc.)  | p50k_base
GPT-3 (Ada, Babbage)          | r50k_base
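
To make the table concrete, here is a minimal sketch of a prefix-based lookup from model name to encoding name. It is not the package's own lookup (use `getForModel()` for that); the function name `guessEncodingName` and the prefix list are assumptions chosen for illustration.

```php
<?php

// Hypothetical helper, not part of the package: maps a model name to the
// encoding named in the table above, using simple prefix matching.
function guessEncodingName(string $model): ?string
{
    // Order matters: 'gpt-4o' must be tested before 'gpt-4'.
    // The prefix list is an assumption for illustration, not exhaustive.
    $prefixes = [
        'gpt-4o'        => 'o200k_base',
        'gpt-4'         => 'cl100k_base',
        'gpt-3.5-turbo' => 'cl100k_base',
        'text-davinci-' => 'p50k_base',
        'ada'           => 'r50k_base',
        'babbage'       => 'r50k_base',
    ];

    foreach ($prefixes as $prefix => $encoding) {
        // strncmp keeps this compatible with PHP 7.4 (no str_starts_with)
        if (strncmp($model, $prefix, strlen($prefix)) === 0) {
            return $encoding;
        }
    }

    return null; // unknown model: let the caller decide what to do
}
```

The returned name could then be passed to `$provider->get(...)`; a `null` result signals a model the sketch does not know about.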

Advanced Usage

Encoding in Chunks

For processing large texts, you can encode in chunks:

$encoder = $provider->getForModel('gpt-4');
$chunks = $encoder->encodeInChunks($largeText, 1000); // Max 1000 tokens per chunk

foreach ($chunks as $chunk) {
    echo "Chunk has " . count($chunk) . " tokens\n";
}
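
To check that a chunked result stays within the limit, a small summary helper can be useful. The sketch below is not part of the package: `summarizeChunks` is a hypothetical name, and it assumes `encodeInChunks()` returns a list of token arrays, as the loop above suggests.

```php
<?php

// Hypothetical helper: summarizes a list of token arrays.
// Assumes each element of $chunks is itself an array of token ids.
function summarizeChunks(array $chunks): array
{
    $sizes = array_map('count', $chunks);

    return [
        'chunks'  => count($sizes),                    // number of chunks
        'total'   => array_sum($sizes),                // tokens across all chunks
        'largest' => $sizes === [] ? 0 : max($sizes),  // size of the biggest chunk
    ];
}

// Possible usage with the package (commented out so the sketch runs alone):
// $summary = summarizeChunks($encoder->encodeInChunks($largeText, 1000));
// if ($summary['largest'] > 1000) { /* limit was not respected */ }
```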

Custom Cache Directory

By default, vocabulary files are cached in the system temp directory. You can customize this:

// Via environment variable
putenv('TIKTOKEN_CACHE_DIR=/path/to/cache');

// Or via method call
$provider = new EncoderProvider();
$provider->setVocabCache('/path/to/cache');
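
An application that accepts the cache location from several sources may want one place that decides which wins. The following is a minimal sketch under stated assumptions: `resolveCacheDir` is a hypothetical helper, and the precedence (explicit argument, then `TIKTOKEN_CACHE_DIR`, then the system temp directory) is this sketch's choice, not documented library behavior.

```php
<?php

// Hypothetical helper: picks a cache directory for vocabulary files.
// Precedence is an assumption of this sketch:
//   1. explicit argument, 2. TIKTOKEN_CACHE_DIR, 3. system temp directory.
function resolveCacheDir(?string $explicit = null): string
{
    if ($explicit !== null && $explicit !== '') {
        return $explicit;
    }

    $fromEnv = getenv('TIKTOKEN_CACHE_DIR');
    if ($fromEnv !== false && $fromEnv !== '') {
        return $fromEnv;
    }

    return sys_get_temp_dir();
}

// Possible usage with the package (commented out so the sketch runs alone):
// $provider->setVocabCache(resolveCacheDir());
```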

Using Custom Vocabulary Loader

use Purewater2011\TiktokenPhp7\Vocab\Loader\DefaultVocabLoader;

$provider = new EncoderProvider();
$provider->setVocabLoader(new DefaultVocabLoader('/custom/cache/path'));

Performance Optimization with LibEncoder (Experimental)

For better performance with large texts, you can use the FFI-based LibEncoder:

use Purewater2011\TiktokenPhp7\Encoder\LibEncoder;
use Purewater2011\TiktokenPhp7\EncoderProvider;

// Initialize the library path
LibEncoder::init('/path/to/libtiktoken_php.so');

// Use LibEncoder for better performance
$provider = new EncoderProvider(true);
$encoder = $provider->getForModel('gpt-4');
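
Because ext-ffi is optional and LibEncoder is experimental, it may be worth checking the prerequisites before opting in. This is a minimal sketch, not part of the package; `canUseLibEncoder` is a hypothetical helper that only checks that the FFI extension is loaded and the library file is readable.

```php
<?php

// Hypothetical helper: reports whether the prerequisites for the FFI-based
// encoder appear to be met. It does not load or validate the library itself.
function canUseLibEncoder(string $libraryPath): bool
{
    return extension_loaded('ffi')
        && is_file($libraryPath)
        && is_readable($libraryPath);
}

// Possible usage with the package (commented out so the sketch runs alone):
// $path = '/path/to/libtiktoken_php.so';
// if (canUseLibEncoder($path)) {
//     LibEncoder::init($path);
//     $provider = new EncoderProvider(true);
// } else {
//     $provider = new EncoderProvider(); // pure-PHP encoder
// }
```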

Building the Native Library

If you want to use LibEncoder, you need to build the Rust library:

Requirements

  • Rust toolchain with Cargo (used by the build steps)

Build Steps

git clone https://github.com/purewater2011/tiktoken-php7.git
cd tiktoken-php7
cargo build --release

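Copy the binary for your platform from target/release/ (Cargo's default output directory for release builds) to a path your application can read: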

  • libtiktoken_php.so (Linux)
  • libtiktoken_php.dylib (macOS)
  • tiktoken_php.dll (Windows)
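
The file name can also be selected at runtime. The sketch below maps PHP's built-in `PHP_OS_FAMILY` constant to the binary names listed above; `nativeLibraryName` is a hypothetical helper, not something the package provides.

```php
<?php

// Hypothetical helper: returns the native library file name for an OS family.
// The names come from the list above; other platforms yield null.
function nativeLibraryName(string $osFamily = PHP_OS_FAMILY): ?string
{
    switch ($osFamily) {
        case 'Linux':
            return 'libtiktoken_php.so';
        case 'Darwin': // macOS
            return 'libtiktoken_php.dylib';
        case 'Windows':
            return 'tiktoken_php.dll';
        default:
            return null;
    }
}
```

A caller could join the result with its own library directory and pass the full path to `LibEncoder::init()`.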

Token Counting Examples

$provider = new EncoderProvider();
$encoder = $provider->getForModel('gpt-3.5-turbo');

// Count tokens in a message
$message = "How many tokens is this?";
$tokenCount = count($encoder->encode($message));
echo "Token count: $tokenCount\n";

// Useful for staying within API limits
$maxTokens = 4096;
$prompt = "Your long prompt here...";
$promptTokens = count($encoder->encode($prompt));

if ($promptTokens > $maxTokens) {
    echo "Prompt too long! Tokens: $promptTokens, Max: $maxTokens\n";
}
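
In practice the limit usually has to cover the reply as well as the prompt. Here is a minimal sketch of that arithmetic, with hypothetical helper names (`fitsBudget`, `truncateTokens`) that operate on plain token counts and token arrays.

```php
<?php

// Hypothetical helper: true when the prompt plus the tokens reserved for the
// model's reply fit inside the context limit.
function fitsBudget(int $promptTokens, int $contextLimit, int $reservedForReply = 0): bool
{
    return $promptTokens + $reservedForReply <= $contextLimit;
}

// Hypothetical helper: keeps only the first $maxTokens token ids.
// Note: cutting a token list at an arbitrary point can split a multi-byte
// character, so the decoded text may end in an incomplete character.
function truncateTokens(array $tokens, int $maxTokens): array
{
    return array_slice($tokens, 0, max(0, $maxTokens));
}

// Possible usage with the package (commented out so the sketch runs alone):
// $tokens = $encoder->encode($prompt);
// if (!fitsBudget(count($tokens), 4096, 256)) {
//     $tokens = truncateTokens($tokens, 4096 - 256);
// }
```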

Differences from Original

This package keeps the API of the original yethee/tiktoken, with these changes:

  • PHP 7.4+ compatibility instead of PHP 8.1+
  • Updated namespace: Purewater2011\TiktokenPhp7 instead of Yethee\Tiktoken
  • Compatible dependency versions for PHP 7.4
  • All modern PHP 8.1+ syntax converted to PHP 7.4 compatible code

Migration Guide

If you're migrating from yethee/tiktoken:

  1. Update your Composer requirement:

    composer remove yethee/tiktoken
    composer require purewater2011/tiktoken-php7

  2. Update namespace imports:

    // Old
    use Yethee\Tiktoken\EncoderProvider;

    // New
    use Purewater2011\TiktokenPhp7\EncoderProvider;

  3. All other usage remains identical.
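
For codebases with many imports, the namespace change in step 2 can be scripted. The following is a rough sketch, not a tool shipped with the package: `migrateNamespace` is a hypothetical function that rewrites the namespace prefix in a source string. It only matches the prefix followed by a backslash, so review the result before committing.

```php
<?php

// Hypothetical helper: rewrites the old namespace prefix to the new one.
// Plain string replacement; it does not parse PHP, so comments and strings
// containing the old prefix are rewritten too.
function migrateNamespace(string $source): string
{
    return str_replace(
        'Yethee\\Tiktoken\\',
        'Purewater2011\\TiktokenPhp7\\',
        $source
    );
}

// Possible usage on a single file (commented out; the path is a placeholder):
// $path = 'src/Example.php';
// file_put_contents($path, migrateNamespace(file_get_contents($path)));
```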

Limitations

  • GPT-2 encoding is not supported
  • Special tokens (like <|endofprompt|>) are not supported
  • LibEncoder::encodeInChunks() method is not yet implemented
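
Until chunked encoding is available for LibEncoder, one possible stand-in is to encode the whole text and split the resulting token list. This is a sketch under stated assumptions, not an equivalent of `encodeInChunks()`: `chunkTokens` is a hypothetical helper, and because it splits at fixed token counts, a chunk boundary can fall inside a multi-byte character, so individual chunks may not decode to valid UTF-8 on their own.

```php
<?php

// Hypothetical helper: splits a flat list of token ids into chunks of at
// most $maxTokensPerChunk tokens each.
function chunkTokens(array $tokens, int $maxTokensPerChunk): array
{
    if ($maxTokensPerChunk < 1) {
        throw new InvalidArgumentException('Chunk size must be at least 1.');
    }

    return array_chunk($tokens, $maxTokensPerChunk);
}

// Possible usage with the package (commented out so the sketch runs alone):
// $chunks = chunkTokens($encoder->encode($largeText), 1000);
```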

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT

Credits