Updates to the HTML API in 6.4

WordPress 6.4 includes continued development of the HTML API, including the introduction of a minimal HTML Processor in #58517 and the addition of a couple of CSS/class helpers in the Tag Processor in #59209.

A minimal HTML Processor and its breadcrumbs

When the Tag Processor was introduced in WordPress 6.2, it carried the stipulation that it did not understand HTML structure. While this is an intentional limitation so that the Tag Processor can maintain predictable performance and simplicity, it leaves certain tasks awkward and bug-prone. Consider, for example, the goal of finding an IMG element inside a surrounding DIV.

$processor = new WP_HTML_Tag_Processor( $html );

// Find the wrapping DIV element.
if ( ! $processor->next_tag( 'DIV' ) ) {
    return $html;
}

// Find the inner IMG element.
while ( $processor->next_tag( array( 'tag_closers' => 'visit' ) ) ) {
    // Abort if leaving the DIV wrapper before finding the image.
    if ( 'DIV' === $processor->get_tag() && $processor->is_tag_closer() ) {
        return $html;
    }

    if ( 'IMG' === $processor->get_tag() ) {
        do_something_to_img();
        return $processor->get_updated_html();
    }
}

This code stays safe against unexpected HTML inputs in a way that a regex-based approach would not, but it’s already a bit cumbersome, and an unexpected HTML layout can still mislead it. For example, it assumes that a closing DIV tag will appear and that it will only appear after the IMG. It will fail to recognize the nested IMG in the following HTML because the closing tag of the nested DIV appears before it.

<div class="profile">
    <div class="name">WordPress</div>
    <img src="wordpress-logo.png">
</div>

The new WP_HTML_Processor class is being built to understand not only HTML syntax but also its semantics: it understands HTML structure and all the quirks involved in handling “malformed HTML.” In WordPress 6.4 the HTML Processor is available, and it adds a new concept of “breadcrumbs” to the API.

Breadcrumbs represent the path into an HTML document for a given element and will be familiar to those who navigate around a document in the browser development tools.

A browser displays the breadcrumbs for a selected element when inspecting the document.

Breadcrumbs always start with HTML, list every ancestor of a given element, and end with the element itself as the last segment in the path. The HTML Processor introduces a few ways to use these:

  • It’s possible to search for this structure by supplying array( 'breadcrumbs' => $breadcrumbs ) inside a call to next_tag().
  • The get_breadcrumbs() method reports the entire breadcrumb array from HTML to the matched element.
  • The matches_breadcrumbs( $breadcrumbs ) method indicates whether the matched element can be found at the given breadcrumbs.
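To make the idea of a breadcrumb path concrete, here is a minimal sketch in plain PHP. It does not use the HTML API or parse any HTML: a hand-written token list stands in for the profile markup shown earlier, and the variable names are hypothetical. It also assumes that a fragment sits inside HTML and BODY, so every path starts with those two segments.

```php
<?php
// Illustrative only: a simplified token stream stands in for real HTML parsing.
// Each token is array( tag name, kind ), where kind is 'open', 'close', or 'void'.
$tokens = array(
    array( 'DIV', 'open' ),  // <div class="profile">
    array( 'DIV', 'open' ),  // <div class="name">
    array( 'DIV', 'close' ), // </div>
    array( 'IMG', 'void' ),  // <img src="wordpress-logo.png">
    array( 'DIV', 'close' ), // </div>
);

// Assumption: the fragment lives inside HTML > BODY.
$open_elements = array( 'HTML', 'BODY' );
$seen          = array();

foreach ( $tokens as list( $tag, $kind ) ) {
    if ( 'close' === $kind ) {
        // Leaving an element removes it from the path.
        array_pop( $open_elements );
        continue;
    }

    // The breadcrumbs are every open ancestor plus the element itself.
    $seen[] = implode( ' > ', array_merge( $open_elements, array( $tag ) ) );

    // Void elements such as IMG never become ancestors.
    if ( 'open' === $kind ) {
        $open_elements[] = $tag;
    }
}

print_r( $seen );
```

The third path recorded is HTML > BODY > DIV > IMG: the nested DIV has already closed by the time the IMG appears, so it is not part of the image's path.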

Using the example above, it’s possible to search for images that are direct children of a DIV.

$processor = WP_HTML_Processor::create_fragment( $html );

if ( ! $processor->next_tag( array( 'breadcrumbs' => array( 'DIV', 'IMG' ) ) ) ) {
    return $html;
}

do_something_to_img();
return $processor->get_updated_html();

Properly handling this structure comes with additional costs, so the HTML Processor cannot provide the predictable performance that the Tag Processor can. It should still be fast enough to use when rendering a block’s output, though.

One important note about using the HTML Processor is that it’s a work in progress and only supports a subset of allowable HTML. If it encounters a tag or a specific kind of HTML it doesn’t support, it aborts processing to avoid corrupting your content. The get_last_error() method reports whether the HTML Processor gave up. Currently, the HTML Processor supports only a small set of HTML, so don’t be surprised when it aborts before finding the tag you’re looking for. Each WordPress release will expand this support until it can read all possible HTML documents.

if ( $processor->next_tag( array( 'breadcrumbs' => array( 'figure', 'img' ) ) ) ) {
    // The tag was found in the HTML document.
    do_something_to_img();
    return $processor->get_updated_html();
}

if ( null === $processor->get_last_error() ) {
    // The tag was not in the HTML document.
} else {
    // It was not possible to determine if the tag was in the HTML document.
}

For more complicated queries, the matches_breadcrumbs() method can be used inside a next_tag() loop. The * value is a special wildcard term. It matches exactly one element of any tag name, so if no open element exists in its place, the match fails.

while ( $processor->next_tag( 'IMG' ) ) {
    // Skip images that are already inside a FIGURE element.
    if ( $processor->matches_breadcrumbs( array( 'FIGURE', 'IMG' ) ) ) {
        continue;
    }

    // Only process images that are great-grand-children of a BLOCKQUOTE element.
    if ( $processor->matches_breadcrumbs( array( 'BLOCKQUOTE', '*', '*', 'IMG' ) ) ) {
        do_something_to_img();
    }
}
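The wildcard rule can be sketched in plain PHP. This is only an illustration of the matching rule as described here, not the HTML Processor’s implementation: the function name sketch_matches_breadcrumbs() and its arguments are hypothetical. It compares a pattern against the trailing segments of an element’s breadcrumb path and treats * as exactly one element.

```php
<?php
/**
 * Hypothetical helper, not part of WordPress: reports whether the end of
 * $path (a full breadcrumb path) matches $pattern, where '*' stands for
 * exactly one element of any tag name.
 */
function sketch_matches_breadcrumbs( array $path, array $pattern ): bool {
    $pattern = array_values( $pattern );

    // A pattern longer than the path cannot match: each wildcard needs a real element.
    if ( count( $pattern ) > count( $path ) ) {
        return false;
    }

    // Compare only the trailing segments of the path.
    $tail = array_slice( $path, -count( $pattern ) );

    foreach ( $pattern as $i => $crumb ) {
        if ( '*' !== $crumb && strtoupper( $crumb ) !== strtoupper( $tail[ $i ] ) ) {
            return false;
        }
    }

    return true;
}

$pattern = array( 'BLOCKQUOTE', '*', '*', 'IMG' );

// Two elements sit between the BLOCKQUOTE and the IMG, so the pattern matches.
var_dump( sketch_matches_breadcrumbs( array( 'HTML', 'BODY', 'BLOCKQUOTE', 'DIV', 'P', 'IMG' ), $pattern ) ); // bool(true)

// The IMG is a direct child of the BLOCKQUOTE, so the wildcards have nothing to match.
var_dump( sketch_matches_breadcrumbs( array( 'HTML', 'BODY', 'BLOCKQUOTE', 'IMG' ), $pattern ) ); // bool(false)
```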

The HTML Processor is a subclass of the Tag Processor and so retains all the underlying methods to read and modify the HTML. In the future, it will be possible to insert and remove entire tags and to read and modify the inner markup inside a tag. In WordPress 6.4, however, the only new feature is the concept of breadcrumbs.

CSS helpers for the Tag Processor

It’s been possible to search for a tag containing a specific class with next_tag( array( 'class_name' => $class_name ) ), but not to search for a tag containing more than one class name or for a tag that lacks a given class name. In WordPress 6.4, the new has_class() method makes this possible. It does exactly what it sounds like: it reports whether a matched tag contains the given class name in its class attribute.

$processor = new WP_HTML_Tag_Processor( $html );

while ( $processor->next_tag() ) {
    // Skip an element if it's not supposed to be processed.
    if ( $processor->has_class( 'data-wp-ignore' ) ) {
        continue;
    }

    if ( $processor->has_class( 'data-wp-context' ) && $processor->has_class( 'active' ) ) {
        // Process the context…
    }
}

The has_class() method knows how to split CSS class names properly. It’s valuable to rely on this method instead of manually processing the class attribute value to avoid several common pitfalls, especially when wanting to know about the presence or absence of multiple class names.
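One such pitfall can be shown in plain PHP. The two helpers below are hypothetical and are not part of WordPress, nor are they how has_class() is implemented. They only contrast a naive substring check with a check that first splits the attribute value on HTML whitespace (space, tab, line feed, form feed, carriage return).

```php
<?php
// Hypothetical helpers for illustration only; neither is part of WordPress.

// Naive check: a substring search also matches inside longer class names.
function naive_has_class( string $class_attribute, string $wanted ): bool {
    return false !== strpos( $class_attribute, $wanted );
}

// Splits on HTML whitespace and compares whole class names.
function sketch_has_class( string $class_attribute, string $wanted ): bool {
    $names = preg_split( '/[ \t\n\f\r]+/', $class_attribute, -1, PREG_SPLIT_NO_EMPTY );
    return in_array( $wanted, $names, true );
}

// Class names separated by a tab rather than a space.
$class_attribute = "wp-block-image\tinactive";

var_dump( naive_has_class( $class_attribute, 'active' ) );    // bool(true), a false positive
var_dump( sketch_has_class( $class_attribute, 'active' ) );   // bool(false)
var_dump( sketch_has_class( $class_attribute, 'inactive' ) ); // bool(true)
```

The substring check reports "active" because it appears inside "inactive", and code that splits only on a space character would miss the tab separator.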


Props to @bph @codente @webcommsat and @nalininonstopnewsuk for peer review.
