Introducing the Token Map

There’s a new data structure coming in WordPress 6.6: the WP_Token_Map, but what is it? The ongoing work in the HTML API required the ability to find named character references like … in an HTMLHTML HyperText Markup Language. The semantic scripting language primarily used for outputting content in web browsers. document, but given that there are 2,231 names this is no easy task. The WP_Token_Map is a purpose-built class designed to take a large set of static string tokens with static string replacements, and then be used as a lookup and replacement mechanism. That’s all very technical, so let’s see it in action and consider how it might be useful to you.

As an important note: you probably don’t need this because it’s an optimized data structure built for lookup with a very large set of tokens. The examples are unrealistically short but are meant only to illustrate the class usage in a readable form. If working with small datasets it’s possibly faster for most purposes to stick with familiar tools like in_array() or associative arrays.

The Token Map, in short, is a new semantic utility meant to answer one simple question: given a string and a byte offset into that string, does the next sequence of bytes match one of a known set of tokens, and if so, what is the replacement for that token? It’s designed as a low-level utility for use in combination with other string parsing logic.

Table of Contents

  1. The Token Map is an optimization.
  2. What does the Token Map provide?
    1. Replacing animal Emoji.
    2. Quick set membership test.
  3. Creating Token Maps.
    1. From an associative array.
    2. From an optimized pre-computed table.
  4. When should Token Maps be used?
  5. Conclusion

The Token Map is an optimization.

Consider having a relatively large set of lookup tokens or strings. This could be, for example, the HTML5 named character references, or it could be the names given to various emoji like :jazz_hands:. When working with these kinds of replacement sets, we often want to replace all of the occurrences of any of the tokens with their respective replacements. We might write code like the following:

$animals = array(
	':cat:'     => '🐈',
	':dog:'     => '🐶',
	':fish:'    => '🐟',
	':mammoth:' => '🦣',
	':seal:'    => '🦭',
);

$output = str_replace( array_keys( $animals ), array_values( $animals ), $input );

This works pretty well here with a small set of tokens where none of them overlap. But if the list of replacements is very large the performance can degrade quickly as PHPPHP The web scripting language in which WordPress is primarily architected. WordPress requires PHP 5.6.20 or higher has to check for each of the potential tokens starting at every character in the input string. Worse, if not careful, some tokens could overlap and be incorrectly replaced.

$a_words = array(
	'a'       => '🅰',
	'aa'      => '😩',
	'ar'      => '🏴‍☠️',
	'arizona' => '🌵',
);

$output = str_replace( …, …, 'another arid day for the arizona aardvarks' );

Without looking it up, can you say what the replacements here would be? It’s difficult to say because the replacement tokens form substrings and supersets of each other. This is how many of the HTML character references are. In fact, some of the named character references appear with and without their trailing semicolon, meaning that it’s critical to replace longer matches before any potential shorter matches could be replaced.

Enough waiting, here’s the replacement!

🅰nother 🅰rid d🅰y for the 🅰rizon🅰 🅰🅰rdv🅰rks

Wild isn’t it? The only replacement that occurred was for a, because it matched before any of the other tokens had a chance to match.

It’s possible to work around this by sorting the array by its keys, but there still remain a couple of problems. First, with very large sets of tokens the performance balloons. Second, str_replace() is an all-or-nothing system. It doesn’t provide a way to start processing and then stop early.

The Token Map has taken these considerations into play to make this kind of transformation efficient.

What does the Token Map provide?

The Token Map has one primary responsibility: is there a known token at a given location within a string, and if so, what is its replacement value? It answers this question in a slightly unexpected way; consider this example code.

$replacement = $html5_named_character_references->read_token( $attribute_value, $at, $token_length );
if ( isset( $replacement ) ) {
    $found_token = substr( $attribute_value, $at, $token_length );
    echo "'{$found_token}' transforms into '{$replacement}'";
    $at += $token_length;
}

The primary method for a token map is read_token(). This takes a string, a byte-offset into the string for where to start looking, and a third argument passed by reference. Like with preg_match(), this argument ($token_length) is passed by reference and doesn’t communicate anything to the method. Rather, it’s used to hold the resulting length of the found input token in bytes, to be used by the calling code – it’s an “out” argument.

Instead of returning two values (which requires creating an array or a class and allocating an additional string) the Token Map only returns a single value. If it failed to find a token then it returns null, but if it does find one, it returns the replacement value and will set the value of the “out” argument to the length of the token that it found. If the calling code only cares about the replacement value then it never has to create the “matched token.” While individually this may not seem like much, when processing large documents with thousands of potential matches the performance adds up and this is one more way that the Token Map does its job efficiently.

The read_token() method can be used to identify substrings within a larger string, but unlike strpos(), it doesn’t instantly jump to the next match within the string. It’s designed to look at a specific byte offset and indicate if a token is found only at that location. The calling code needs to advance its own cursor before asking it to read a token. This may seem inconvenient, but it’s a consequence of the Token Map being designed to be part of other systems, instead of being a complete system in itself.

Replacing animal Emoji.

Consider the example from above with the animal Emoji slugs. Every Emoji slug starts with a :, meaning it’s possible to quickly find all potential places where a token may begin, and then rely on the Token Map for confirmation and replacement. The following code will return a new string from the $input with all of the corresponding animal names replaced by their Emoji.

function replace_animals( $input ) {
	$animals = WP_Token_Map::from_array( array(
		':cat:'         => '🐈',
		':caterpillar:' => '🐛',
		':dog:'         => '🐶',
		':fish:'        => '🐟',
		':mammoth:'     => '🦣',
		':monkey:'      => '🐒',
	) );

	$at     = 0;
	$was_at = 0;
	$end    = strlen( $input );
	$output = '';

	while ( $at < $end ) {
		$next_at = strpos( $input, ':', $at );

		// There is no more possibility to find the Emoji slugs.
		if ( false === $next_at ) {
			$output .= substr( $input, $was_at );
			return $output;
		}

		// An Emoji slug might have been found.
		$emoji = $animals->read_token( $input, $next_at, $slug_length );

		// If none was found then continue;
		if ( null === $emoji ) {
			$at = $next_at + 1;
			continue;
		}

		$output .= substr( $input, $was_at, $next_at - $was_at );
		$output .= $emoji;
		$was_at  = $next_at + $slug_length;
		$at      = $was_at;
	}

	return $output . substr( $input, $was_at );
}

And what do you know, we’ve built a text parser! It works too – we replaced only the Emoji slugs for ones that are recognized and the others are left in place.

php > var_dump( replace_animals( 'A :cat: and a :dog: walk up to a :bear: eating a :fish:.' ) );
string(52) "A 🐈 and a 🐶 walk up to a :bear: eating a 🐟."

Notice how this differs from calling preg_replace_callback( '~:[^:]+:~', $replacer ). There’s a ticking performance bomb in this code because [^:]+ can match an extremely long part of the input document before calling the replacer, allocating potentially megabytes of content in the meantime. The Token Map only needs to examine at most as many bytes as are in the longest token; though usually it’s considerably fewer.

Quick set membership test.

The Token Map can also be used for quickly determining if a given token is inside a set of known tokens. This is what contains( $token ) provides.

$animals->contains( ':cat:' );    // returns `true`.
$animals->contains( ':badger:' ); // returns `false`.

The original class started out this way as WP_Token_Set, providing a means for improving the lookup of HTML named character references inside of esc_attr(). It currently relies on a basic check for if ( ! in_array( $possible_name, $allowed_names, true ) ).

Creating Token Maps.

There are two ways to create a Token Map: one is from an associative array whose keys are the tokens and whose values are the replacements. The other is an optimization for large token maps (referring again here to HTML with more than two thousand tokens).

From an associative array.

The normative or basic form of creating a Token Map is to construct an associative array whose keys are the tokens and whose values are the replacements. This can be passed in to the static constructor function and the result will be the optimized map.

$animals = WP_Token_Map::from_array( array(
	':cat:'         => '🐈',
	':caterpillar:' => '🐛',
	':dog:'         => '🐶',
	':fish:'        => '🐟',
	':mammoth:'     => '🦣',
	':monkey:'      => '🐒',
) );

From an optimized pre-computed table.

After creating a token map in the normal way above, it’s possible to export it as PHP source code. This export creates a fast-loading structure that skips the initial sorting and processing necessary to build the internal lookup tables. This initial computation is pretty fast, but there’s no need to repeat it if creating a Token Map repeatedly, or with very large data sets. If your mappings won’t be changing, you might as well precompute them.

echo $animals->precomputed_php_source_table();

// Output follows…
WP_Token_Map::from_precomputed_table(
	array(
		"storage_version" => "6.6.0",
		"key_length" => 2,
		"groups" => ":c\x00:d\x00:f\x00:m\x00",
		"long_words" => array(
			// :caterpillar[🐛] :cat:[🐈].
			"\x0aaterpillar\x04🐛\x03at:\x04🐈",
			// :dog:[🐶].
			"\x03og:\x04🐶",
			// :fish:[🐟].
			"\x04ish:\x04🐟",
			// :mammoth:[🦣] :monkey:[🐒].
			"\x07ammoth:\x04🦣\x06onkey:\x04🐒",
		),
		"small_words" => "",
		"small_mappings" => array()
)

While this is interesting output, none of it needs to make sense. The Token Map exported itself in a way that it can quickly load, and all details are considered internal and private. What you can do with it is save it into a standard PHP file as part of your project. Store it into a variable, or return it from a function, or use it in a function, for quick lookup.

<?php

function contains_animal( $text ) {
	$animals = WP_Token_Map::from_precomputed_table(
		array(
			"storage_version" => "6.6.0",
			"key_length" => 2,
			"groups" => ":c\x00:d\x00:f\x00:m\x00",
			"long_words" => array(
				// :caterpillar[🐛] :cat:[🐈].
				"\x0aaterpillar\x04🐛\x03at:\x04🐈",
				// :dog:[🐶].
				"\x03og:\x04🐶",
				// :fish:[🐟].
				"\x04ish:\x04🐟",
				// :mammoth:[🦣] :monkey:[🐒].
				"\x07ammoth:\x04🦣\x06onkey:\x04🐒",
			),
			"small_words" => "",
			"small_mappings" => array()
		)
	);

	...
}

When should Token Maps be used?

Measure, measure, and then measure some more. It’s not worth the added complexity in many cases to introduce this optimized data structure when simpler cases do just as well. You probably don’t need this; but if you do, consider running an experiment to see if it will help.

For example, the HTML Processor contains is_void() and is_special() to classify an element by its tagtag A directory in Subversion. WordPress uses tags to store a single snapshot of a version (3.6, 3.6.1, etc.), the common convention of tags in version control systems. (Not to be confused with post tags.) name. Those methods currently use a chain of string equalities like $tag === 'META' || $tag === 'LINK || $tag === ...'. When rewriting those methods to use a Token Map, there was no significant impact on performance. Over a run of 100,000 HTML pages classifying every tag, the Token Map was just over 100ms slower when performing case-insensitive comparisons and about 100ms faster when performing case-sensitive comparisons. The is_special() method even compares against 96 different tags in the or chain!

So like any tool, this one should only be used when appropriate. One situation where it’s very appropriate, as with HTML character references or Emoji slugs, is when parsing a string and it’s not simple to segment the tokens from the document. For example, in HTML, there are additional context-sensitive rules for whether a semicolon is needed at the end of a character reference, and that’s not something str_replace() can represent.

Another appropriate use-case for the Token Map might be finding and replacing @-mentions in a document. If your organization has a couple thousand usernames, the Token Map can serve as an efficient way to find those. Username mentions fit this niche well: there are potentially hundreds or thousands of the lookup terms; and the terms have varying lengths without a trailing marker, making it slightly more difficult to segment with a regular expression.

It’s even possible to serialize data as the token replacement, storing things like username to user id mappings.

function linkify_usernames( $input, $username_token_map ) {
	$at     = 0;
	$was_at = 0;
	$end    = strlen( $input );
	$output = '';

	while ( $at < $end ) {
		$next_at = strpos( $input, '@', $at );

		// There is no more possibility to find a username.
		if ( false === $next_at ) {
			$output .= substr( $input, $was_at );
			return $output;
		}

		// A username might have been found (username starts after the `@`).
		$packed_user_id = $username_token_map->read_token( $input, $next_at + 1, $username_length );

		// If none was found then continue.
		if ( null === $known_username ) {
			$at = $next_at + 1;
			continue;
		}

		$output .= substr( $input, $was_at, $next_at - $was_at );

		$user_id  = unpack( 'V', $packed_user_id )[1];
		$username = substr( $input, $next_at + 1, $username_length );
		$user_url = esc_url( "https://profiles.wordpress.org/{$username}/" );
		$output .= "<a href=\"{$user_url}\" data-user-id=\"{$user_id}\">{$username}</a>";
		$was_at  = $next_at + 1 + $username_length;
		$at      = $was_at;
	}

	return $output . substr( $input, $was_at );
}

Conclusion

Hopefully the code examples have been helpful here in demonstrating ways to use this new data structure in WordPress 6.6, the WP_Token_Map. It adds a new semantic utility: “does this string at this offset contain one of a known set of tokens, and if it does, what is the replacement value for that token?”

For further questions check out the development ticket in Trac: #60698. To see how it’s used in the HTML APIAPI An API or Application Programming Interface is a software intermediary that allows programs to interact with each other and share data in limited, clearly defined ways., check out the new HTML text decoder from Trac ticket #61072.

Props to @bernhard-reiter, @bph, @gziolo, @jonsurrell, @juanmaguitar
for review of this post before publishing and for catching a number of mistakes.

#6-6, #dev-note, #dev-notes, #dev-notes-6-6