mlocati / unipoints
A Unicode Codepoint library for PHP
Requires
- php: ^8.1
Requires (Dev)
- phpunit/phpunit: ^9.6
This package is auto-updated.
Last update: 2025-01-16 09:59:33 UTC
Simplified Unicode Terminology
Codepoints
Codepoints are characters, spaces, symbols, punctuation marks, separators, and so on: the single units that make up text.
Blocks
Codepoints are grouped in blocks, that is, groups of contiguous codepoints that are part of a common set.
Examples:
- a is contained in the Basic Latin block
- α is contained in the Greek and Coptic block
- 𝅘𝅥𝅮 is contained in the Musical Symbols block
- ↩ is contained in the Arrows block
- ☂ is contained in the Miscellaneous Symbols block
Planes
Planes are ranges of 65,536 contiguous codepoints. A plane may contain zero, one or many blocks.
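Since every plane spans exactly 65,536 (0x10000) codepoints, the plane a codepoint belongs to follows from integer division. The sketch below illustrates that arithmetic only; planeIndex is a hypothetical helper and not part of this library.

```php
<?php

// Hypothetical helper (not part of the library): each plane spans
// 0x10000 (65,536) codepoints, so the plane index is the codepoint
// number divided by 0x10000, discarding the remainder.
function planeIndex(int $codepoint): int
{
    if ($codepoint < 0 || $codepoint > 0x10FFFF) {
        throw new InvalidArgumentException('Not a Unicode codepoint');
    }

    return intdiv($codepoint, 0x10000);
}

echo planeIndex(0x61), "\n";    // 'a' (U+0061): prints 0, the Basic Multilingual Plane
echo planeIndex(0x1D160), "\n"; // U+1D160, in the Musical Symbols block: prints 1
```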
General Category
This library also provides the general category of every codepoint, so you can tell whether a codepoint is a lowercase letter, a symbol, a punctuation mark, and so on.
Surrogate Codepoints
To represent more codepoints than fit in 16 bits, Unicode introduced "surrogates". A single character (or punctuation mark, ...) outside the 16-bit range can be represented by combining two consecutive surrogates (a "high surrogate" followed by a "low surrogate"). Surrogate codepoints therefore have a meaning only in pairs.
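As a rough illustration of how a pair is formed, the sketch below applies the standard UTF-16 arithmetic: subtract 0x10000 from the codepoint, then split the remaining 20 bits into two 10-bit halves. The function names toSurrogatePair and fromSurrogatePair are hypothetical and are not provided by this library.

```php
<?php

// Hypothetical helpers (not part of the library) sketching the
// UTF-16 surrogate arithmetic.

/** @return array{int, int} [high surrogate, low surrogate] */
function toSurrogatePair(int $codepoint): array
{
    if ($codepoint < 0x10000 || $codepoint > 0x10FFFF) {
        throw new InvalidArgumentException('Only codepoints above U+FFFF use a surrogate pair');
    }
    $offset = $codepoint - 0x10000; // at most 20 significant bits

    return [
        0xD800 + ($offset >> 10),   // high surrogate: top 10 bits
        0xDC00 + ($offset & 0x3FF), // low surrogate: bottom 10 bits
    ];
}

function fromSurrogatePair(int $high, int $low): int
{
    if ($high < 0xD800 || $high > 0xDBFF || $low < 0xDC00 || $low > 0xDFFF) {
        throw new InvalidArgumentException('Not a valid high/low surrogate pair');
    }

    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

[$high, $low] = toSurrogatePair(0x1F600);
printf("U+%04X U+%04X\n", $high, $low); // prints "U+D83D U+DE00"
```

Feeding the two values back into fromSurrogatePair returns the original codepoint, 0x1F600.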
Sample Usage
Codepoints are listed in the string-backed MLUnipoints\Codepoint enum.
The string value of each enum case is the Unicode symbol itself, so to get the case for a, for example, you can simply write:

    use MLUnipoints\Codepoint;

    $codepoint = Codepoint::from('a');
Since the MLUnipoints\Codepoint enum is rather big (it can use tens of MB of memory when it is autoloaded), you can instead use the block-specific enums defined under the MLUnipoints\Codepoint namespace. This requires that you know the block in advance.
For example:
    use MLUnipoints\Codepoint;

    $codepoint = Codepoint\Basic_Latin::from('a');
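When you are not sure a character belongs to the block you picked, note that PHP backed enums offer two lookups: from() throws a ValueError when no case matches, while tryFrom() returns null. The sketch below shows the difference with a tiny stand-in enum, DemoBasicLatin, which is hypothetical and only there so the example runs without the package installed; the case names are illustrative.

```php
<?php

// Hypothetical stand-in for a block-specific enum; the real ones live
// under the MLUnipoints\Codepoint namespace and have many more cases.
enum DemoBasicLatin: string
{
    case LATIN_SMALL_LETTER_A = 'a';
    case LATIN_SMALL_LETTER_B = 'b';
}

// tryFrom() returns null instead of throwing when no case matches.
$found = DemoBasicLatin::tryFrom('a');
$missing = DemoBasicLatin::tryFrom('α'); // Greek letter: not in this enum

echo $found?->name ?? 'not found', "\n";   // prints "LATIN_SMALL_LETTER_A"
echo $missing?->name ?? 'not found', "\n"; // prints "not found"
```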
Every case of the MLUnipoints\Codepoint enum has a MLUnipoints\Info\CodepointInfo attribute.
You can retrieve this attribute by writing:

    use MLUnipoints\Codepoint;
    use MLUnipoints\Info\CodepointInfo;

    $codepoint = Codepoint::from('a');
    $codepointInfo = CodepointInfo::from($codepoint);
This attribute provides the numeric value of the codepoint, the Unicode name, the general category, and (if you don't use the block-specific enums) the block.
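For readers unfamiliar with attributes on enum cases, the sketch below shows how PHP in general lets you attach an attribute to a case and read it back through reflection. It is not the library's implementation: DemoInfo, DemoCodepoint and demoInfoOf are hypothetical names invented for this illustration, and how CodepointInfo::from() works internally is not described here.

```php
<?php

// Hypothetical attribute and enum, for illustration only.
// Enum cases are class constants, hence TARGET_CLASS_CONSTANT.
#[Attribute(Attribute::TARGET_CLASS_CONSTANT)]
final class DemoInfo
{
    public function __construct(
        public readonly int $id,
        public readonly string $name,
    ) {
    }
}

enum DemoCodepoint: string
{
    #[DemoInfo(id: 97, name: 'LATIN SMALL LETTER A')]
    case LATIN_SMALL_LETTER_A = 'a';
}

// Reads the DemoInfo attribute attached to the given enum case.
function demoInfoOf(DemoCodepoint $case): DemoInfo
{
    $reflection = new ReflectionEnumBackedCase($case::class, $case->name);
    $attributes = $reflection->getAttributes(DemoInfo::class);

    return $attributes[0]->newInstance();
}

$info = demoInfoOf(DemoCodepoint::from('a'));
echo $info->id, ' ', $info->name, "\n"; // prints "97 LATIN SMALL LETTER A"
```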
In the same way, you can retrieve the details of the block, the plane and the general category.
For example, this code:
    use MLUnipoints\Codepoint;
    use MLUnipoints\Info\BlockInfo;
    use MLUnipoints\Info\CategoryInfo;
    use MLUnipoints\Info\CodepointInfo;
    use MLUnipoints\Info\PlaneInfo;

    $codepoint = Codepoint::from('a');
    $codepointInfo = CodepointInfo::from($codepoint);
    $categoryInfo = CategoryInfo::from($codepointInfo->category);
    $blockInfo = BlockInfo::from($codepointInfo->block);
    $planeInfo = PlaneInfo::from($blockInfo->plane);

    echo 'Codepoint: ', $codepointInfo->id, "\n";
    echo 'Codepoint name: ', $codepointInfo->name, "\n";
    echo 'Codepoint general category: ', $categoryInfo->description, "\n";
    foreach ($categoryInfo->parentCategories as $parentCategory) {
        echo 'Codepoint parent general category: ', CategoryInfo::from($parentCategory)->description, "\n";
    }
    echo 'Block name: ', $blockInfo->name, "\n";
    echo 'Plane name: ', $planeInfo->name, "\n";
    echo 'Plane short name: ', $planeInfo->shortName, "\n";
will output:
Codepoint: 97
Codepoint name: LATIN SMALL LETTER A
Codepoint general category: a lowercase letter
Codepoint parent general category: a cased letter
Codepoint parent general category: a letter
Block name: Basic Latin
Plane name: Basic Multilingual Plane
Plane short name: BMP
You can also use the Unicode enums to print out characters and symbols.
For example:
    use MLUnipoints\Codepoint;

    echo Codepoint::SUN_BEHIND_CLOUD->value;
will print
⛅
Do you really want to say thank you?
You can offer me a monthly coffee or a one-time coffee 😉