Updates to the HTML API in 6.7

After important internal changes to the HTML API in 6.6, WordPress 6.7 brings forward a major leap in support with the HTML Processor.

Major Updates

Full HTML Support

After a long time of only supporting a subset of HTML, the HTML Processor now supports practically all HTML and all HTML tags [1]. The internal work during the past few releases made it possible to add the remaining support needed to tackle all the colorful varieties of HTML that exist. Only in the rarest situations will the HTML Processor now give up, and in none of these cases is it because the processor doesn’t know how to handle a given tag.

Effectively, the HTML API is entering a new phase, where the rules for understanding HTML are finally baked in, and the API can start to proliferate into new developer APIs and replace older and buggier HTML processing code.

Ability to modify most text in a document

The first of a new set of modification methods has been added to the HTML API: set_modifiable_text(). When paused on a text node, a special self-contained element like SCRIPT, STYLE, or TITLE, or on a comment, it’s possible to replace the text content of that node.

while ( $processor->next_tag( 'SCRIPT' ) ) {
    $existing_script = $processor->get_modifiable_text();
    $prefix          = "// Careful, this is executable!\n";
    $processor->set_modifiable_text( "{$prefix}{$existing_script}" );
}

This method handles the appropriate escaping for the kind of node it is. For example, it escapes the text for a SCRIPT element differently than for a STYLE element or a raw text node. This matters because you don’t have to think about what kind of text you are replacing; the update remains reliable either way.

It’s important to understand that this operates on the level of individual text nodes. set_modifiable_text() is different from set_inner_text(). A given element may contain multiple text nodes which are separated by things like HTML comments, specific byte sequences, and invalid markup which is interpreted as a comment. For an operation like set_inner_text(), it will be best to wait until the HTML API itself supports it.
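To make the "one text node at a time" point concrete, here is a minimal sketch that walks every token in a fragment and rewrites only the plain text nodes. The helper name uppercase_text_nodes() is hypothetical and not part of WordPress; the sketch assumes it runs inside WordPress 6.7 or later, where WP_HTML_Processor is loaded.

```php
/**
 * Upper-cases every plain text node in an HTML fragment.
 *
 * Hypothetical helper for illustration only; not part of WordPress.
 * Returns null if the HTML Processor could not process the input.
 */
function uppercase_text_nodes( string $html ): ?string {
    $processor = WP_HTML_Processor::create_fragment( $html );
    if ( null === $processor ) {
        return null;
    }

    while ( $processor->next_token() ) {
        // Skip everything that is not a plain text node (tags, comments, etc.).
        if ( '#text' !== $processor->get_token_type() ) {
            continue;
        }

        // Each text node is read and replaced on its own; the API handles escaping.
        $text = $processor->get_modifiable_text();
        $processor->set_modifiable_text( strtoupper( $text ) );
    }

    // A non-null error means the processor stopped before the end of the input.
    if ( null !== $processor->get_last_error() ) {
        return null;
    }

    return $processor->get_updated_html();
}
```

For an input such as '<p>one<!-- split -->two</p>', the loop should visit "one" and "two" as two separate text nodes, which is why replacing the full inner text of an element takes more than a single set_modifiable_text() call.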

A full-parser mode in the HTML Processor

Until now, the HTML Processor has provided a slightly unexpected interface. Creating a processor involved a call to create_fragment(), which creates something called a “fragment parser.” There was a good reason for this: the fragment parser is the version that’s used in JavaScript when assigning node.innerHTML = newHtml. Unfortunately, the fragment parser cannot ingest an entire document from start to finish. If you worked with the HTML Processor you would have noticed that it gave up as soon as it found the starting <!DOCTYPE html> or <html> tokens.

In WordPress 6.7 the HTML Processor offers create_full_parser() to fill this need. Thanks to completing support for all of the remaining contexts in a document, the HTML Processor can now start with arbitrary HTML and understand how it turns into a DOM tree.

While possibly unexpected, the default implementation with the fragment parser is still the best option in most cases. If you’ve ever had to deal with HTML parsing libraries adding <!DOCTYPE html><html><head> (and other tags) to the start of every snippet of HTML that they process, it’s because these tools are performing full parses and full serialization. When working with anything other than a complete HTML document, the fragment parser remains the most appropriate tool. Process fragments of HTML with the fragment parser, and full documents with the full parser.
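As a rough illustration of choosing between the two parsers, the sketch below picks one based on a flag supplied by the caller and collects the names of the opening tags it visits. The helper list_opening_tags() and its $is_full_document parameter are hypothetical names used only for this example; it assumes WordPress 6.7 or later is loaded.

```php
/**
 * Lists the tag names of the opening tags found in the input.
 *
 * Hypothetical helper for illustration only; not part of WordPress.
 *
 * @param string $html             Input HTML.
 * @param bool   $is_full_document Whether $html is a complete document.
 * @return string[] Tag names in the order they were visited.
 */
function list_opening_tags( string $html, bool $is_full_document ): array {
    // Full documents go to the full parser; snippets go to the fragment parser.
    $processor = $is_full_document
        ? WP_HTML_Processor::create_full_parser( $html )
        : WP_HTML_Processor::create_fragment( $html );

    // Both factory methods may return null when a processor cannot be created.
    if ( null === $processor ) {
        return array();
    }

    $tags = array();
    while ( $processor->next_tag() ) {
        $tags[] = $processor->get_tag();
    }

    return $tags;
}
```

A post's content or a block's markup would typically be passed with $is_full_document set to false, while a captured page that starts with <!DOCTYPE html> would be passed with true.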

Normalization of input documents

With the HTML specification rules in place it’s possible to start building higher-level tools and helpers without having to worry about the complexities of HTML. The first of these is WP_HTML_Processor::normalize(), which gives us the HTML we always wanted. Normalizing is the process of taking any HTML input and returning a fully well-formed equivalent document. Since a picture is worth a thousand words:

$html = '</p><p class=wp class=duplicate><[GETTING[started]]><p><?not a tag?></p done=yet?><script><!--<script></script>';
echo WP_HTML_Processor::normalize( $html );
// <p></p><p class="wp">&lt;[GETTING[started]]></p><p><!--?not a tag?--></p>

This little function is packed with utility. It interprets the input HTML the way a browser would, meaning that it applies all of the complicated parsing rules, and it prints out a new serialization of that parsed structure. It accounts for attribute quoting, mismatched tags, missing tags, invalid syntax, unexpected whitespace, comments, and more: it prints out tags without duplicate attributes, with every attribute value double-quoted and all relevant syntax characters encoded, and it even removes partial tokens at the end of the input. That’s right, normalize() won’t leave you hanging, literally!

$html = 'Just a <em>sneaky but <!-- foiled attempt to hide the rest of the page';
echo WP_HTML_Processor::normalize( $html );
// Just a <em>sneaky but </em>

Significantly, there’s something extremely valuable in the output of this function: once HTML has been normalized, it’s considerably safer to process with naive parsing tools. There’s a guarantee on how the text it produces will appear, so string-based and regular-expression-based tools will instantly become more reliable; the edge cases they fail to handle will never make it out of normalize().
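Because normalize() returns null when it cannot process its input, code that feeds the result into simpler string-based tools should check for that case first. A minimal sketch, assuming WordPress 6.7 or later is loaded; normalize_or_fallback() and its $fallback parameter are hypothetical names for this example.

```php
/**
 * Returns the normalized form of $html, or $fallback when the
 * HTML Processor is unable to normalize the input.
 *
 * Hypothetical helper for illustration only; not part of WordPress.
 */
function normalize_or_fallback( string $html, string $fallback = '' ): string {
    $normalized = WP_HTML_Processor::normalize( $html );

    // null signals unsupported input, so the caller decides what to use instead.
    return null === $normalized ? $fallback : $normalized;
}
```

Falling back to an empty string rather than the raw input is a deliberate choice in this sketch: anything that reaches the downstream string or regular-expression code has then been through normalize().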

Document mode and CSS class names

CSS selectors and HTML interact via the document mode, which is either standards mode or quirks mode. In standards mode, CSS selectors match classes and IDs byte-for-byte. In quirks mode they are matched in an ASCII case-insensitive manner. Confusing!

Thankfully, the entire HTML API was audited to ensure compliance even in how CSS class selectors work, and the behavior of methods like has_class(), add_class(), and remove_class() will now respect each document’s mode. For almost every case this should be “standards mode,” so both of the HTML API processors default to it, matching class names in a byte-for-byte manner. An HTML Processor created with the full parser, though, can determine the document mode from the input HTML, so it makes the same choice a browser makes and, when appropriate, matches case-insensitively.
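The difference between the two matching rules can be sketched in plain PHP. This is a simplified illustration of the rule itself, not the HTML API's implementation: class_attribute_contains() and ascii_lowercase() are hypothetical helpers, and the sketch ignores character references and null bytes.

```php
/**
 * Lower-cases only the ASCII letters A-Z, leaving all other bytes alone.
 * Hypothetical helper for illustration only.
 */
function ascii_lowercase( string $text ): string {
    return strtr( $text, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz' );
}

/**
 * Reports whether a class attribute value contains the wanted class name,
 * using byte-for-byte matching in standards mode and ASCII
 * case-insensitive matching in quirks mode.
 * Hypothetical helper for illustration only.
 */
function class_attribute_contains( string $class_attribute, string $wanted, bool $quirks_mode ): bool {
    // Class names are separated by ASCII whitespace.
    $names = preg_split( '/[ \t\f\r\n]+/', $class_attribute, -1, PREG_SPLIT_NO_EMPTY );

    foreach ( $names as $name ) {
        $matches = $quirks_mode
            ? ascii_lowercase( $name ) === ascii_lowercase( $wanted )
            : $name === $wanted;

        if ( $matches ) {
            return true;
        }
    }

    return false;
}

var_dump( class_attribute_contains( 'wide FULL-WIDTH', 'full-width', false ) ); // bool(false)
var_dump( class_attribute_contains( 'wide FULL-WIDTH', 'full-width', true ) );  // bool(true)
```

In real code, has_class() makes this decision for you based on the document mode of the processor, so a helper like this should not be needed.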

Important changes for existing code

There should be no significant changes required to existing code written with previous versions of the HTML API. While almost all HTML documents are now supported, it’s still necessary to check whether the HTML Processor bailed while attempting to process unsupported input.
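One way to perform that check is to inspect the processor after the scanning loop has ended. The sketch below assumes WordPress 6.7 or later, where get_unsupported_exception() is available; describe_parse_failure() is a hypothetical helper name and the message wording is only an example.

```php
/**
 * Describes why an HTML Processor stopped early, or returns null
 * if no error has been recorded.
 *
 * Hypothetical helper for illustration only; not part of WordPress.
 */
function describe_parse_failure( WP_HTML_Processor $processor ): ?string {
    $error = $processor->get_last_error();
    if ( null === $error ) {
        return null;
    }

    if ( WP_HTML_Processor::ERROR_UNSUPPORTED === $error ) {
        // Added in 6.7: carries more detail about the unsupported input.
        $exception = $processor->get_unsupported_exception();

        return null !== $exception
            ? 'Unsupported HTML: ' . $exception->getMessage()
            : 'Unsupported HTML.';
    }

    return "HTML Processor stopped: {$error}";
}
```

Calling this after a while ( $processor->next_token() ) loop distinguishes "reached the end of the document" from "gave up partway through," which the loop condition alone cannot tell apart.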

There were several changes to the way null bytes and whitespace are handled in specific circumstances. These cases represent invalid sections of documents and normal HTML inputs aren’t affected by them.

Since changes were also made to the handling of ASCII casing in CSS class names, there may be cases where a class name previously matched and now won’t. For example, a browser will not match the CSS class name selector for full-width with an element containing FULL-WIDTH in its class attribute unless the document is parsed in quirks mode. The HTML API now processes its inputs in standards mode, matching the behavior of a browser, unless created with a full parser and given input whose DOCTYPE indicates quirks mode. Code that relies on has_class() need not change, but its results in WordPress 6.7 may differ at times because the handling of these class names has become more reliable.

Enhancements

  • Low-level parsing updates improved the scanning speed of the Tag Processor by 3.5-7.5%. [Core-61545]
  • When removing class names, leading whitespace in front of the class name is now removed. [Core-61531]
  • CSS class names and class selectors behave according to the document mode for the input document, or standards mode if none can be directly inferred. [Core-61531]
  • Improved spec-compliance by handling remaining HTML tags and insertion modes, including SVG, TABLE, TEMPLATE, and more. [Core-61576]
  • Added WP_HTML_Processor::create_full_parser() to process entire HTML documents from start to finish. [Core-61576]
  • Added get_qualified_tag_name() and get_qualified_attribute_name() to return case-variants for certain tags inside foreign content where the HTML rules specify. [Core-61576]
  • It’s now possible to replace the value of modifiable text. [Core-61617]
  • Added get_unsupported_exception() to provide more debugging information when the HTML Processor bails on unsupported input. [Core-61646]
  • set_attribute() returns false when WordPress rejects the update. [Core-61719]
  • Added subdivide_text_appropriately() to aid in algorithms wanting to skip null byte or whitespace-only prefixes of text nodes. [Core-61974]
  • Added normalize() and serialize() to return a normalized representation of parsed input HTML. [Core-62036]

Bug Fixes

  • class_list() now replaces null bytes with the Unicode replacement character. [Core-61531]
  • The HTML Processor now properly generates implied end tags. Previously, in some cases, it would generate the end tag but leave the element open on the stack of open elements, leading to mistakes in how it handled later tags in the document. [Core-61576]
  • The HTML Processor now respects the array form of a next_tag() query when specifying the tag_name. [Core-61581]
  • The HTML API no longer hangs when fed documents that end inside an open SCRIPT element and whose final character is - or <. [Core-61810]
  • The Tag Processor now reports all un-closed funky comments as having paused at incomplete input. Previously, when the document ended immediately after opening the funky comment, e.g. </# with no more characters, it failed to report the incomplete token. [Core-61831]

1. There are some unusual situations in HTML which are not supported. They relate to what HTML calls “fostering” and “adoption,” where a node’s placement in the document tree does not correspond to where it appears in the HTML source. This arises in roughly 0.5% of all websites on the web and isn’t currently supported in the HTML API. The deprecated PLAINTEXT element is also unsupported.

Props @jonsurrell and @fabiankaegy for review.
