robthree/humanoid

Friendly ID generator. Converts integers to words and back

1.1.0 2022-10-18 21:10 UTC

This package is auto-updated.

Last update: 2024-10-19 05:54:21 UTC


README

Packagist Packagist

This class can be used to generate "friendly id's" where numerical ID's are replaced with words. A well known example is Gfycat which uses "adjectiveadjectiveanimal": https://gfycat.com/gracefulspanishgemsbuck.

This class has two methods: create(int $id) which returns a string generated from the id and parse(string $text) which returns the id from the parsed text. It supports custom word / category lists, an optional separator and optional formatting.

Quickstart

Installation:

composer require robthree/humanoid

Usage:

use RobThree\HumanoID\HumanoIDs;

// Create new instance of HumanoID via the builder
$zooIdGen = HumanoIDs::zooIdGenerator();

// Convert ID to HumanoID
$zooId = $zooIdGen->create(96712);
echo sprintf("HumanoID   : %s\n", $zooId);
    
// Convert back to ID
$id = $zooIdGen->parse($zooId);
echo sprintf("Decoded ID : %d\n", $id);

Output (depending on the wordlist used):

HumanoId   : 'sick-yellow-wolf'
Decoded ID : 96712

We provide a zooIdGenerator and a spaceIdGenerator out-of-the box but it's easy (and recommended) to create your own HumanoID by implementing the HumanoIDInterface.

API

The HumanoID has a constructor with 4 arguments; all of which but the first are optional:

  • $wordSets: The words-structure (see below) to use as 'dictionary'
  • $categories (optional, see below): if you want to use a different order for categories than the default order (which is the order of the keys of the $words argument)
  • $separator (optional, see below): the separator, if any, to use
  • $format (optional, see below): the format to use

The HumanoID has two public methods:

  • create(int $id): string: Converts an integer into an ID
  • parse(string $text): int: Converts text into an integer

Word lists and categories

You can use custom word lists; you can store these anywhere you want like in a JSON file or in a database. As long as you initialize the HumanoID class with the following data structure:

[
    'adjectives' => ['big', 'smart', 'funky'],
    'colors'     => ['red', 'green', 'blue'],
    'animals'    => ['cow', 'whale', 'monkey'],
]

The HumanoID will automatically determine which 'categories' are available. In the above example generated HumanoIDs would take the form adjective-color-animal. Whenever this should turn out to be not enough, the HumanoID automatically repeats the first category as often as needed; so this would result in adjective-adjective-color-animal or even adjective-adjective-adjective-color-animal and so on. However, the order of the categories can be specified by passing an array of words to the $categories argument of the HumanoID class. You could, for example, pass ['colors', 'adjectives', 'animals'] which will result in HumanoIDs that take the form color-adjective-animal or, again, when this should not be enough: color-color-color-adjective-animal.

Ofcourse you don't have to use adjectives, colors and animals. It can be anything you want. So, more generalized, you can provide any data structure in the form

[
    'category1' =>  ['value', 'value', 'value', ...],
    'category2' =>  ['value', 'value', 'value', ...],
    ...
]

Separator

By default HumanoID uses the - character to separate words. This results in HumanoIDs like big-red-whale. You can specify any desired string as a separator; it helps if the separator string is not contained in any of the words.

It is possible to specify an empty ('') or null separator. This will result in HumanoIDs like bigredwhale. This is the closest to what Gfycat url's look like. However, you need to take extreme care that the words don't overlap. If, for example, the adjectives would contain both old and cold a HumanoID like genericoldpanda will result in an ambiguous result ("generi", "cold", "panda" vs. "generic", "old", "panda"). With a carefully generated wordlist this shouldn't have to be a problem.

Formats

A few formats are supported which can be specified when constructing an instance of the HumanoID class. The currently supported formats are provided via a WordFormatOption Enum class.

The options provided are:

  • WordFormatOption::ucfirst,
  • WordFormatOption::lcfirst,
  • WordFormatOption::upper,
  • WordFormatOption::lower, and
  • no-format (null).

All options do what their name implies; so ucfirst would result in Big-Red-Whale and upper in BIG-RED-WHALE. The "no-format" option just keeps the words intact as formatted from the provided wordset.

How it works

ID (integer) to HumanoID conversion

The create(int $id): string method takes the ID and, basically, does a base conversion similarly to how you would convert the decimal value 967 to the hexadecimal value 3C7.

However, this time we don't have 16 'digits' (0..9, A..F), but any number of words representing a digit.

HumanoID to ID (integer) conversion

The parse(string $text): int method does, basically the opposite of the create method; it takes a string and tries to do another base conversion, similarly as how you would convert the hexadecimal value 3C7 to the decimal value 967. However, this time it's a bit more complicated...

If we could assume there will always be a separator 'per digit' in the string, we could simply split the string at the separator and do our calculations. Even if the separator would not be used but, for example, the ucfirst option (resulting in BigRedWhale) we could split out the words pretty easily.

However; We wanted to stay as close as possible to the Gfycat implementation. And that complicates things. This, basically, meant we had the following requirements: the url should be case insensitive and contain an optional separator.

The 'decoding' of HumanoIDs relies on a lookup table which is created when the HumanoID class is initialized (which, by the way, is a pretty expensive operation; keep the instance around as long as you can if you need to generate or parse more than one HumanoID!).

We won't go into too much detail, but in essence a tree is created on a per-character-basis in reverse order. When decoding HumanoIDs the algorithm starts at the end working it's way to the beginning of the string while meanwhile working it's way down this tree and looking up word indices in their respective categories. Whenever an index is determined it can be used in the 'base-N' conversion and the algorithm continues until the beginning of the HumanoIDs is reached or a lookup failed.

General advice

  • Don't change your wordlist once you go into production. Imagine reassigning or reordering the value of the values A..F in the hexadecimal system. It will be very hard, if not impossible, to make this work correctly without resulting in incorrectly converted HumanoIDs to ID's or causing ambiguous results etc.
  • Use large word-lists. Don't go overboard, but categories with a handful of words don't help much (unless you don't mind either long HumanoIDs.(red-blue-blue-red-red-blue-funky-monkey for example) or have some more smaller categories).
  • Whatever wordlist/separator/format you decide on, once you picked it, you're stuck with it (unless you want to break all your HumanoIDs or you'll need to do some (on-the-fly?) conversion.
  • When not using any separator, try to use longer, unique, words that are not contained in other words (so, for example, avoid "old, cold" or "expensive,inexpensive").
  • If ambiguous words without a separator are unavoidable or desired you could consider only using the create(int $id) method and storing the result alongside the id with your data. Apply a unique constraint and index if you can. That way you can use the field with the HumanoID value to do a lookup.
  • As mentioned earlier; try to keep this class around for as long as possible. The constructor contains some fairly CPU intensive code (building the lookup table), so ideally you keep an instance alive for as long as possible.

Notes

The 'ambiguous words' problem can probably be solved in a later version by changing the iteratively lookup process into a recusive algorithm; that way when a lookup turns out to fail (again, cold, old for example) the next value can be tried recursively until the HumanoID is completely decoded correctly (or still fail as a whole).

Benchmarks

Some basic, preliminary benchmarks can be found here, more are in the works.

Pronunciation

HumanoID is pronounced "humano i d" (/ˈhjuːmənəʊ aɪ diː/), but "humanoid" (/ˈhjuːmənɔɪd/) is fine too if you like to please the robots 🤖

License

Licensed under MIT license. See LICENSE for details.

Logo and icon based on icons made by Vectorslab (#7323564) and Those Icons (#523788) over at FlatIcon.