WordPress 6.6 includes a helpful maintenance release to the HTML API. Included in this work are a few new features and a major improvement to the usability of the HTML Processor. This continues the steady pace of development since WordPress 6.5.

A spec-compliant text decoder.

This may be surprising, but PHP leaves us hanging if we want to properly read the text content of an HTML document. The html_entity_decode() and htmlspecialchars_decode() functions work somewhat well for pure XML documents, but HTML has more complicated decoding rules, and those rules change depending on whether the text is found inside an attribute value or in normal text. These functions default to XML and HTML4 parsing rules and require manually setting the ENT_HTML5 flag on every invocation (for example, HTML5 redefined two of HTML4’s character references), but even then they are still wrong in many cases.
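
As a rough illustration of that gap, the following plain-PHP sketch (no WordPress required; the sample strings are made up for this example) feeds html_entity_decode() two character references that lack a trailing semicolon. A browser following the HTML5 rules would render them as "Fish & Chips" and "bob:,", while PHP returns both inputs unchanged even with ENT_HTML5 set.

```php
<?php
// Illustrative inputs: character references missing their trailing semicolon.
$samples = array(
	'Fish &amp Chips', // A browser renders this text as "Fish & Chips".
	'bob&#x3a,',       // A browser renders this text as "bob:,".
);

foreach ( $samples as $raw ) {
	$decoded = html_entity_decode( $raw, ENT_QUOTES | ENT_HTML5, 'UTF-8' );

	// PHP only decodes references that end in a semicolon,
	// so each input comes back exactly as it went in.
	printf( "%-16s => %s\n", $raw, $decoded );
}

// With the semicolon present, PHP decodes the reference as expected.
echo html_entity_decode( 'Fish &amp; Chips', ENT_QUOTES | ENT_HTML5, 'UTF-8' ), "\n";
```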

Luckily you shouldn’t need to know about or call the new decoder, developed in Core-61072. It fits into get_modifiable_text(), further improving the HTML API’s implementation without requiring you to change any of your existing code. With WordPress 6.6 your existing code becomes more reliable for free.

One part of this change you might want to know about is WP_HTML_Decoder::attribute_starts_with(). This new method takes a plaintext prefix and a raw attribute value and indicates whether the decoded value starts with the given prefix. This can be invaluable for efficiently detecting strings at the start of an attribute: some attribute values are extremely large, and naive parsers can overlook content hidden behind long runs of leading zeros in a numeric character reference.

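$html = 'bob&#x00000000000000000003a,';

// PHP leaves the reference alone because it lacks a trailing semicolon.
'bob&#x00000000000000000003a,' === html_entity_decode( $html, ENT_HTML5 );
// The HTML API decodes it the way a browser would.
'bob:,' === WP_HTML_Decoder::decode_attribute( $html );
true    === WP_HTML_Decoder::attribute_starts_with( $html, 'bob:' );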

In the case of extremely long attribute values (for example, when pasting content from cloud document editors which send images as data URIs), attribute_starts_with() can avoid megabytes of memory overhead and return much more quickly than functions which decode the entire attribute value.
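
A minimal sketch of how this might be used, assuming the calling code already holds the raw, still-encoded attribute value. The helper name example_is_data_uri() is hypothetical and not part of WordPress; only WP_HTML_Decoder::attribute_starts_with() comes from the HTML API, and it requires WordPress 6.6 or newer.

```php
<?php
/**
 * Reports whether a raw attribute value decodes to something starting with "data:".
 *
 * Hypothetical helper for illustration; requires WordPress 6.6 or newer.
 *
 * @param string $raw_attribute_value Attribute value exactly as written in the HTML source.
 * @return bool Whether the decoded value starts with "data:".
 */
function example_is_data_uri( string $raw_attribute_value ): bool {
	// Only as much of the value as is needed for the comparison gets decoded,
	// so a multi-megabyte data URI is never decoded in full.
	return WP_HTML_Decoder::attribute_starts_with( $raw_attribute_value, 'data:' );
}
```

Because the comparison happens against the decoded value, an input such as data&#x00003a;image/png;base64,AAAA is still recognized even though the colon is written as a character reference.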

The new text decoder will mostly help ensure that the HTML API remains safe and reliable. There are complicated rules in parsing HTML, so as always, it’s best to leave the low-level work to the HTML API, preferring to call functions like get_attribute() and get_modifiable_text() directly instead of parsing raw text segments.
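
For instance, reading decoded values through the Tag Processor could look like the sketch below. The function name example_read_link() and the sample markup in the note that follows are illustrative assumptions; the processor methods are the ones named in this post, plus next_token() and get_token_name(), which are available in WordPress 6.5 and newer.

```php
<?php
/**
 * Collects the decoded href of the first link and all decoded text in a snippet.
 *
 * Hypothetical helper for illustration; not part of WordPress.
 *
 * @param string $html HTML to scan.
 * @return array Array with 'href' (string|true|null) and 'text' (string) keys.
 */
function example_read_link( string $html ): array {
	$processor = new WP_HTML_Tag_Processor( $html );
	$result    = array(
		'href' => null,
		'text' => '',
	);

	while ( $processor->next_token() ) {
		$name = $processor->get_token_name();

		if ( 'A' === $name && ! $processor->is_tag_closer() && null === $result['href'] ) {
			// get_attribute() returns the decoded attribute value.
			$result['href'] = $processor->get_attribute( 'href' );
		} elseif ( '#text' === $name ) {
			// get_modifiable_text() returns the decoded text content.
			$result['text'] .= $processor->get_modifiable_text();
		}
	}

	return $result;
}
```

Called with <a href="?a=1&amp;b=2">Fish &amp; Chips</a>, this should yield the href ?a=1&b=2 and the text Fish & Chips, without the calling code ever handling a character reference itself.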

An idealized view of an HTML document.

The Tag Processor was initially designed to jump from tag to tag; it was then refactored to allow scanning every kind of syntax token in an HTML document. Likewise, the HTML Processor was initially designed to jump from tag to tag, all the while also acknowledging the complex HTML parsing rules. These rules largely exist in the form of a stack machine that tracks which elements are currently open. While the HTML Processor has always maintained this stack, it has never exposed it to calling code.

In WordPress 6.6 the HTML Processor underwent a major internal refactor to report those stack events (when an element opens and when an element closes) rather than when it finds raw text that represents things like tag openers and tag closers. This is a really big change for calling code! Previously, the HTML Processor would track all elements, but only return when a tag or token appeared in an HTML document. For instance, it always knew that <p><p> represents two sibling P elements, but it only presented each opening P tag to calling code. Now, the HTML Processor presents not only the tags and tokens that exist in the raw HTML text, but also the “virtual nodes” that are implied but not textually present.

$processor = WP_HTML_Processor::create_fragment( '<h1>One</h3><h2>Two<p>Three<p>Four<h3>Five' );
while ( $processor->next_token() ) {
	$depth = $processor->get_current_depth();
	$slash = $processor->is_tag_closer() ? '/' : '';
	echo "{$depth}: {$slash}{$processor->get_token_name()}: {$processor->get_modifiable_text()}\n";
}

Let’s compare the output in WordPress 6.5 against the output in WordPress 6.6.

HTML Processor in WordPress 6.5

H1:
#text: One
/H3:
H2:
#text: Two
P:
#text: Three
P:
#text: Four
H3:
#text: Five

HTML Processor in WordPress 6.6

3: H1:
4: #text: One
2: /H1:
3: H2:
4: #text: Two
4: P:
5: #text: Three
4: /P:
4: P:
5: #text: Four
3: /P:
3: /H2:
3: H3:
4: #text: Five
0: /H3:

With the HTML API in WordPress 6.6, it’s possible to treat an HTML document in the idealized way we often think about it: where every tag has an appropriate corresponding closing tag in the right place, and no tags overlap. In WordPress 6.5, only the opening tags which appeared in the document return from next_tag(), and the </h3> closing tag appears as an H3 closing tag, even though the HTML specification indicates that it closes the already-open H1 element. In WordPress 6.6, every opening tag gets its closer, and the </h3> appears as if it were an </h1>. This is because the HTML Processor is exposing the document structure instead of the raw text.

Two new methods make working with HTML even easier:

WP_HTML_Processor->get_current_depth() returns the depth into the HTML structure where the current node is found.
WP_HTML_Processor->expects_closer() indicates if the opened node expects a closing tag or if it will close automatically when proceeding to the next token in the document. For example, text nodes and HTML comments and void elements never expect a closer.

With the help of these methods it’s possible to trivially detect when an element opens and closes, because the HTML Processor guarantees a “perfect” view of the structure.

$processor = WP_HTML_Processor::create_fragment( $block_html );
if ( ! $processor->next_tag( 'DIV' ) ) {
	return $block_html;
}

// Keep going until the processor moves shallower than the DIV, that is, until the DIV has closed.
$depth = $processor->get_current_depth();
while ( $processor->get_current_depth() >= $depth && $processor->next_token() ) {
	// Everything inside of here is inside the open DIV.
}
// get_last_error() returns null when parsing did not have to bail.
if ( null === $processor->get_last_error() ) {
	// This is where the DIV closed.
}
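
The same pattern can be wrapped into a small reusable helper. The sketch below is an illustration under stated assumptions rather than a Core function: example_get_inner_text() is a hypothetical name, WordPress 6.6 or newer is assumed, and the loop advances the processor before comparing depths so that its body never runs on the element's own closing token.

```php
<?php
/**
 * Returns the decoded text inside the first element with the given tag name.
 *
 * Hypothetical helper for illustration; assumes WordPress 6.6 or newer.
 *
 * @param string $html     HTML fragment to search.
 * @param string $tag_name Tag name to look for, e.g. 'DIV'.
 * @return string|null Inner text, or null if the element is missing or parsing bailed.
 */
function example_get_inner_text( string $html, string $tag_name ): ?string {
	$processor = WP_HTML_Processor::create_fragment( $html );
	if ( null === $processor || ! $processor->next_tag( $tag_name ) ) {
		return null;
	}

	$depth = $processor->get_current_depth();
	$text  = '';

	// Tokens inside the element report a depth at or below its own level;
	// the element's closer is the first token reported shallower than it.
	while ( $processor->next_token() && $processor->get_current_depth() >= $depth ) {
		if ( '#text' === $processor->get_token_name() ) {
			$text .= $processor->get_modifiable_text();
		}
	}

	// Unsupported markup makes the processor stop early; report that as "unknown".
	return null === $processor->get_last_error() ? $text : null;
}
```

Returning null when get_last_error() reports a problem is a design choice for this sketch: it keeps a partially-read element from being mistaken for a complete one.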

An optimized class for looking up string tokens and their associated mappings.

As part of the text decoder work the WP_Token_Map was introduced. This is a handy and efficient utility class for mapping between keys or tokens and their replacements. It’s also handy for efficient set membership; for example, to determine if a given username is found within a set of known usernames.
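
As a sketch of the replacement use case, the hypothetical helper below swaps text smilies for emoji. It assumes WordPress 6.6 or newer and relies on WP_Token_Map::from_array() and read_token(); the smiley mappings and the function name are made up for the example.

```php
<?php
/**
 * Replaces a few text smilies with emoji using a token map.
 *
 * Hypothetical helper for illustration; assumes WordPress 6.6 or newer.
 *
 * @param string $text Plain text to process.
 * @return string Text with known smilies replaced.
 */
function example_replace_smilies( string $text ): string {
	$smilies = WP_Token_Map::from_array(
		array(
			':)' => '🙂',
			':(' => '🙁',
			':D' => '😃',
		)
	);
	if ( null === $smilies ) {
		return $text; // The map could not be built; leave the text untouched.
	}

	$output = '';
	$at     = 0;
	$end    = strlen( $text );

	while ( $at < $end ) {
		$token_length = 0;
		$replacement  = $smilies->read_token( $text, $at, $token_length );

		if ( null !== $replacement ) {
			// A token starts here: emit its mapping and skip past the token.
			$output .= $replacement;
			$at     += $token_length;
		} else {
			// No token at this byte offset: copy the byte through unchanged.
			$output .= $text[ $at ];
			++$at;
		}
	}

	return $output;
}
```

For plain set membership, a map's contains() method reports whether a given word is one of its tokens.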

Read more in the Token Map announcement.

Features

The HTML Processor will now return the depth of the current node in the stack of open elements with get_current_depth(). [58191]
The HTML Processor now includes expects_closer() to indicate whether the currently-matched node expects a closing token. For example, no HTML void element expects a closer, no text node expects a closer, and none of the elements treated specially in the HTML API as atomic elements (such as SCRIPT, STYLE, TITLE, or TEXTAREA) expect a closer. [58192]
The WP_HTML_Decoder class can take a raw HTML attribute or text value and decode it, assuming that the source and destination are UTF-8. The HTML API now uses this instead of html_entity_decode() for more reliable parsing of HTML text content. [58281]
The HTML Processor now visits all real and virtual nodes: not only those present in the text of the HTML, but also those implied by what’s there or not there. [58304]
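
To see the two new methods working together, here is a hedged sketch that builds an indented outline of a fragment. example_outline() is a hypothetical helper, the "(no closer)" label exists only for illustration, and the indentation simply mirrors whatever get_current_depth() reports. WordPress 6.6 or newer is assumed.

```php
<?php
/**
 * Renders an indented, one-node-per-line outline of an HTML fragment.
 *
 * Hypothetical helper for illustration; assumes WordPress 6.6 or newer.
 *
 * @param string $html HTML fragment to outline.
 * @return string Outline text, or an empty string if no processor could be created.
 */
function example_outline( string $html ): string {
	$processor = WP_HTML_Processor::create_fragment( $html );
	if ( null === $processor ) {
		return '';
	}

	$lines = array();
	while ( $processor->next_token() ) {
		// Closers add nothing to an outline, so skip them.
		if ( $processor->is_tag_closer() ) {
			continue;
		}

		$indent = str_repeat( '  ', $processor->get_current_depth() );

		// Nodes such as text, comments, and void elements close on their own.
		$label   = $processor->expects_closer() ? '' : ' (no closer)';
		$lines[] = $indent . $processor->get_token_name() . $label;
	}

	return implode( "\n", $lines );
}
```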

Bug Fixes

Funky-comments whose contents are only a single character are now properly recognized. Previously the parser would get off track in these situations, consuming text until the next > after the funky comment. [58040]
The HTML Processor now respects the class_name argument if passed to next_tag(). Formerly it was overlooking this constraint. [58190]
The Tag Processor was incorrectly tracking the position of the last character in some tokens, internally and when bookmarking. While this bug did not affect the operation of the Tag Processor, it has been fixed so that future code which might rely upon it will work properly. [58233]
When subclassing WP_HTML_Processor the ::create_fragment() method will return the subclass instance instead of a WP_HTML_Processor instance. [58365]
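
As an illustration of the class_name fix, a lookup like the following sketch should now match only elements carrying the requested class. The helper name is hypothetical; tag_name and class_name are query keys that next_tag() accepts, and WordPress 6.6 or newer is assumed.

```php
<?php
/**
 * Reports whether a fragment contains an element with the given tag name and class.
 *
 * Hypothetical helper for illustration; assumes WordPress 6.6 or newer.
 *
 * @param string $html       HTML fragment to search.
 * @param string $tag_name   Tag name to look for, e.g. 'div'.
 * @param string $class_name CSS class the element must have.
 * @return bool Whether a matching element was found.
 */
function example_has_element_with_class( string $html, string $tag_name, string $class_name ): bool {
	$processor = WP_HTML_Processor::create_fragment( $html );
	if ( null === $processor ) {
		return false;
	}

	return $processor->next_tag(
		array(
			// Token names are reported in upper case, so normalize the input.
			'tag_name'   => strtoupper( $tag_name ),
			'class_name' => $class_name,
		)
	);
}
```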

Props to @gziolo, @jonsurrell, @juanmaguitar, and @westonruter for reviewing this post and providing helpful feedback.

#6-6, #dev-note, #dev-notes, #dev-notes-6-6, #html-api

Dennis Snell 10:38 pm on June 24, 2024

Update: Corrected get_depth() to get_current_depth() in the feature list and added a note about expects_closer() in the same list. Thanks @westonruter for noticing the mistakes.

wpsoul 7:22 pm on June 25, 2024

Still no option to generate content and put it in node
- Dennis Snell 7:25 pm on June 25, 2024
  
  That’s correct @wpsoul. It’s possible to do this by taking things into your own hands but Core is not offering that ability until it can ensure that it’s safe by default.
  
  This was a big release though, and the refactor in the HTML Processor may look small, but it mostly unlocks adding support for the remaining unsupported HTML tags. Once those are in and it’s reliable in read mode, you can be certain it will start accumulating write and modify functionality.
  - wpsoul 7:52 pm on June 25, 2024
    
    Yes, but I thought it was on plans for 6.6 with set_inner_html()
    
    https://github.com/WordPress/gutenberg/issues/59623
    - Dennis Snell 8:03 pm on June 25, 2024
      
      This is good to call out!
      
      It was originally on the roadmap but wasn’t ready in time for WordPress 6.6. The update to the HTML Processor, however, is an important piece in this puzzle.
      
      One valuable side-effect of the change is that whenever an element closes, the HTML Processor has a reference to the token or virtual node which opened the element. This makes it even more convenient to find and replace the inner or outer content because the calling code doesn’t need to manually track it.
      
      During the release cycle it became apparent that the HTML API needs to do more than simply allow the swap of raw HTML if it wants to ensure that WordPress remains safe by default. That pushed back having set_inner_html() in the public interface.
      
      If you are wanting to explore this though and are willing to work with us then reach out in Slack in the #core-html-api channel. I’m personally glad to help. The HTML API can do much more than it looks like from the public interface, but nothing is going to be part of the final package until it passes an extremely high bar for quality (not that there won’t be bugs or mistakes).

Updates to the HTML API in 6.6
