Updates to the HTML API in 6.5

WordPress 6.5 brings significant updates to the HTMLHTML HyperText Markup Language. The semantic scripting language primarily used for outputting content in web browsers. APIAPI An API or Application Programming Interface is a software intermediary that allows programs to interact with each other and share data in limited, clearly defined ways.. Notably, the Tag Processor received a major overhaul allowing it to scan every token in an HTML document and unlocking a broad array of new functionality. The HTML Processor supports much more of the HTML specification than it did when it was minimally introduced in WordPress 6.4.

New functionality in the Tag Processor.

The Tag Processor was designed to scan every tagtag A directory in Subversion. WordPress uses tags to store a single snapshot of a version (3.6, 3.6.1, etc.), the common convention of tags in version control systems. (Not to be confused with post tags.) in an HTML document, skipping all of the non-tag tokens. In WordPress 6.5 it can scan everything. Tokens are the fundamental pieces that make up a document; in the case of HTML, they are the tags, comments, doctype definitions, and text nodes. This means that it’s now possible to read the text content of an HTML document, trivializing previously-complicated operations like stripping tags or truncating HTML.

Supporting this work is the introduction of a new concept called modifiable text. Modifiable text represents content within token boundaries, or content which can be changed without affecting the structure of the document as a whole. Different tokens contain different types of modifiable text:

  • The entire span of a text node is modifiable text.
  • The inside contents of an HTML comment is modifiable text.
  • The content between the opening and closing tags of special elements is modifiable content.

In this case, special elements refer to elements whose inner contents cannot contain other HTML tags. These include the contents of SCRIPT and STYLE elements as well as the contents of the TITLE and TEXTAREA elements. There are a few more representing deprecated or invalidinvalid A resolution on the bug tracker (and generally common in software development, sometimes also notabug) that indicates the ticket is not a bug, is a support request, or is generally invalid. syntax: see the class documentation for a comprehensive list.

New Methods.

  • next_token() advances to the next token in the document, including closing tags. Unlike next_tag() there is no query for this method. It scans everything.
  • get_token_type() indicates what kind of token was found, e.g. a #tag or #text or #comment.
  • get_token_name() returns a value roughly corresponding to the DOM nodeName, e.g. SPAN or #text or html (for doctype declarations).
  • get_modifiable_text() returns the properly-decoded text content for a matched token.
  • get_comment_type() indicates why the token is a comment, since different kinds of invalid markup become HTML comments. For example, <​![CDA​TA[​text]​]> is an HTML comment of the type COMMENT_AS_CDATA_LOOKALIKE because it appears as though it was intended to be a CDATA section, which doesn’t exist within HTML content.
  • paused_at_incomplete_token() indicates if the Tag Processor reached the end of the document in the middle of a token. For example, it may have been truncated in the middle of a tag.

Example usage:

$title = null;
$text_content = '';
$processor = new WP_HTML_Tag_Processor( $html );
while ( $processor->next_token() ) {
    if ( '#text' === $processor->get_token_type() ) {
        $text_content .= $processor->get_modifiable_text();
    } elseif ( null === $title && 'TITLE' === $processor->get_token_name() ) {
        $title = $processor->get_modifiable_text();

echo $title ? $title : '(untitled post)';
echo "\n\n";
echo $text_content;

Expanded support in the HTML Processor.

WordPress is now able to recognize and properly parse most HTML. There are only a few major exceptions.

  • The HTML Processor will bail any time it encounters a FORM, MATH, SVG, TABLE, or TEMPLATE tag. These elements introduce complicated parsing rules that require additional care, testing, and design work.
  • There are times when an HTML parser needs to implicitly create an element that wasn’t in the HTML itself, or when it needs to move a tag to an earlier place in the document. This requires further design work to properly communicate that the document has retroactively changed.
  • If the HTML Processor is fed full HTML documents or snippets that exist outside of a BODY element (e.g. a document HEAD), then it will abort, as it currently only supports parsing HTML inside a BODY context.

The last point highlights that the HTML Processor has been developed following the needs of blockBlock Block is the abstract term used to describe units of markup that, composed together, form the content or layout of a webpage using the WordPress editor. The idea combines concepts of what in the past may have achieved with shortcodes, custom HTML, and embed discovery into a single consistent API and user experience.-related code and focused on rules in the HTML specification relating to how tags ought to be interpreted inside a BODY element. Once this section is complete, the others will be added. Thankfully, this is the hardest section by a long shot, and parsing full documents should follow quickly after the primary work is done.

Notes for experimenters.

If you’ve been experimenting with the HTML API and building custom subclasses of the Tag Processor then it’s important to understand two changes: handling of incomplete documents, and special HTML tags with modifiable text.

Incomplete documents.

Previously, when the Tag Processor reached the end of a document and there was an incomplete token (for instance, “<div cl“), it would return false and finish parsing the document. Now that the Tag Processor can scan and visit every token in the document, however, it will indicate if it encountered such a situation. When it does, it will still return false, but it will back up to the beginning of the token and freeze. In a future release, it will be possible to extend the document and continue processing. For now, if paused_at_incomplete_token() returns true, then you can know that there might have been more to the original HTML that was sent into the processor than it received.

Special HTML tags with modifiable text.

The HTML Processor properly tracks document structure and is the appropriate method for determining when an element starts and ends. Many simpler approaches will often seem correct enough for practical use, but these will all break down in common situations. While the Tag Processor makes it easier to estimate structure than regex-based approaches, be cautioned whenever skipping the rules that the HTML Processor implements, as even normative HTML often breaks simple mental models for translating HTML text into DOM structure.

For those who are relying on the Tag Processor to estimate structure, there’s a change in behavior that might break your subclasses: the Tag Processor no longer visits the closing tag for the group of special elements. These are tags like SCRIPT, STYLE, and TITLE. The reason for this change is that these elements cannot contain other tags inside of them. It will often look like they do (for instance, with <title>an <img> is plain text</title>), but the inner content is parsed as text and not as HTML. This change prevents misinterpreting those inner contents as HTML tags.

When the Tag Processor encounters a SCRIPT tag or any of these special elements, it will continue parsing until it finds the associated closing tag and then the inner contents will be available as modifiable text. For algorithms tracking depth in a document, they will not only need to check if the tag is_void(), but also if it’s a special element since the closing tag will no longer be separately visited.

Note that it’s still possible to visit a closing tag for a special element, but only if they are unexpected and come without a corresponding opening tag.

Follow along in the CoreCore Core is the set of software required to run WordPress. The Core Development Team builds WordPress. SlackSlack Slack is a Collaborative Group Chat Platform https://slack.com/. The WordPress community has its own Slack Channel at https://make.wordpress.org/chat/. channel #core-html-api or watch here for updated posts to be part of the conversation driving and developing this API.

Props to @stevenlinx for copy review on this post.

#dev-notes #dev-notes-6-5 #6-5