Introducing the HTML API in WordPress 6.2

This post was co-authored by Adam Zielinski @zieladam and Dennis Snell @dmsnell

WordPress 6.2 introduces WP_HTML_Tag_Processor, a tool for block authors to adjust HTML tag attributes in block markup within PHP. It’s the first component in a new HTML processing API.

Updating HTML in WordPress has always required using uncomfortable tools. Regular expressions are difficult and prone to all kinds of errors. DOMDocument is resource-heavy, fails to handle modern HTML correctly, and isn’t available on many hosting platforms.

WP_HTML_Tag_Processor takes the first step towards bridging this gap.

The Tag Processor can reliably update HTML attributes

The Tag Processor finds specific tags and can change their attributes. Here’s an example that sets an alt attribute on the first img tag within a block of HTML.

$html = '<img src="/husky.jpg">';

$p = new WP_HTML_Tag_Processor( $html );

if ( $p->next_tag() ) {
    $p->set_attribute( 'alt', 'Husky in the snow' );
}

echo $p->get_updated_html();

// Output:
// <img alt="Husky in the snow" src="/husky.jpg">

The next_tag() method moves to the next available tag in the HTML, but it also accepts a tag name, a CSS class, or both in order to find specific tags. According to the HTML specification, tag and attribute names aren’t case-sensitive, but CSS class names are.

// Tag names match case-insensitively ('DIV' finds <div>); class names must match exactly.
if ( $p->next_tag( array( 'tag_name' => 'DIV', 'class_name' => 'block-group' ) ) ) {
    $p->remove_class( 'block-group' );
    $p->add_class( 'wp-block-group' );
}

Operations are safe by default. You can:

  • remove an attribute without first checking if it exists,
  • add a CSS class which might already be there,
  • set an attribute value without ensuring that it’s not duplicating an existing one.
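As a minimal sketch of what that safety looks like in practice, the helper below performs all three operations with no existence checks. The function name, the data-legacy attribute, and the wp-image class are illustrative assumptions rather than anything WordPress defines, and the code assumes WordPress 6.2 or later is loaded.

```php
<?php
/**
 * Hypothetical helper: tidy the first IMG tag in a chunk of HTML.
 * Requires WordPress 6.2+ for WP_HTML_Tag_Processor.
 */
function example_tidy_first_image( $html ) {
	$p = new WP_HTML_Tag_Processor( $html );

	if ( $p->next_tag( 'img' ) ) {
		// No need to check whether the attribute exists before removing it.
		$p->remove_attribute( 'data-legacy' );

		// Adding a class that is already present does not duplicate it.
		$p->add_class( 'wp-image' );

		// Setting an attribute replaces any existing value instead of adding a second copy.
		$p->set_attribute( 'loading', 'lazy' );
	}

	// If no IMG tag was found, the original HTML comes back unchanged.
	return $p->get_updated_html();
}
```

Calling example_tidy_first_image() on markup that already carries the class and a different loading value should leave a single wp-image class and a single loading attribute.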

You no longer need to worry that your code will mistake content inside a <textarea>, an attribute value, or an HTML comment for a real tag.

The Tag Processor conforms to the HTML5 specification, so you don’t have to. It automatically escapes and decodes values where necessary and knows how to handle malformed markup.

$ugly_html = <<<HTML
<textarea title='<div> elements are semantically void'>
    <div><!--<div attr-->="</div>"></div>">
</textarea>
<div></div>
HTML;

$p = new WP_HTML_Tag_Processor( $ugly_html );
if ( $p->next_tag( 'div' ) ) {
    $p->add_class( 'bold' );
}

echo $p->get_updated_html();
// Output:
// <textarea title='<div> elements are semantically void'>
//     <div><!--<div attr-->="</div>"></div>">
// </textarea>
// <div class="bold"></div>
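The escaping and decoding mentioned above can be sketched as a round trip: write a raw value with set_attribute(), then read it back from the updated markup with get_attribute(). The helper name and sample values here are hypothetical, and the sketch assumes WordPress 6.2 or later is loaded.

```php
<?php
/**
 * Hypothetical helper: store a raw title on an IMG tag and read it back.
 * Requires WordPress 6.2+ for WP_HTML_Tag_Processor.
 */
function example_title_round_trip( $raw_title ) {
	$p = new WP_HTML_Tag_Processor( '<img src="/husky.jpg">' );

	if ( $p->next_tag( 'img' ) ) {
		// Pass the plain, unescaped value; the processor escapes it for the attribute.
		$p->set_attribute( 'title', $raw_title );
	}

	$updated = $p->get_updated_html();

	// Re-read the updated markup with a fresh processor to get the decoded value.
	$reader  = new WP_HTML_Tag_Processor( $updated );
	$decoded = $reader->next_tag( 'img' ) ? $reader->get_attribute( 'title' ) : null;

	return array(
		'html'  => $updated,
		'title' => $decoded,
	);
}
```

With an input such as 'Tom & "Jerry"', the 'html' entry is expected to hold the escaped form while the 'title' entry matches the original string, so callers never escape or decode by hand.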

The Tag Processor operates fast enough to run in critical hot code paths and incurs almost no memory overhead. In WordPress 6.2 it replaces bug-prone code relying on regular expressions and string-searching to perform similar updates.

For more advanced use of the Tag Processor, read through the extensive in-class documentation and learn how to…

  • …set bookmarks to re-visit parts of the document which have already been scanned and modified.
  • …visit closing tags like </div> in addition to the opening tags.
  • …run advanced and stateful queries by visiting every tag in a document.
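The three capabilities listed above can be sketched roughly as follows. This assumes WordPress 6.2 or later and uses the tag_closers query option along with is_tag_closer(), get_tag(), set_bookmark(), seek(), and release_bookmark() as described in the in-class documentation; please verify them against your WordPress version. The function names, the bookmark name, and the has-image class are hypothetical.

```php
<?php
/**
 * Hypothetical helper: list every opening and closing tag in document order.
 * Requires WordPress 6.2+ for WP_HTML_Tag_Processor.
 */
function example_list_tags( $html ) {
	$p    = new WP_HTML_Tag_Processor( $html );
	$seen = array();

	// 'tag_closers' => 'visit' asks next_tag() to stop at closing tags too.
	while ( $p->next_tag( array( 'tag_closers' => 'visit' ) ) ) {
		// get_tag() reports the tag name in upper case.
		$seen[] = ( $p->is_tag_closer() ? '/' : '' ) . $p->get_tag();
	}

	return $seen;
}

/**
 * Hypothetical helper: add a class to the first H2, but only when an IMG
 * tag appears somewhere after it. A bookmark lets us return to the H2.
 */
function example_flag_heading_if_image_follows( $html ) {
	$p = new WP_HTML_Tag_Processor( $html );

	if ( ! $p->next_tag( 'h2' ) ) {
		return $html;
	}

	// Remember where the heading is before scanning ahead.
	$p->set_bookmark( 'heading' );

	if ( $p->next_tag( 'img' ) && $p->seek( 'heading' ) ) {
		$p->add_class( 'has-image' );
	}

	// Bookmarks hold internal state, so release them when finished.
	$p->release_bookmark( 'heading' );

	return $p->get_updated_html();
}
```

The first helper is a starting point for stateful queries such as tracking nesting depth; the second shows a decision that depends on content appearing later in the document.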

Further considerations

There are many things the Tag Processor doesn’t do: it doesn’t build a DOM document tree, find nested tags, or update a tag’s inner HTML or inner text. Work on new HTML-related APIs continues, and a future WordPress release will build upon this work to enable accessing all of a block’s attributes from within PHP (if the block supplies a block.json file), finding tags using a CSS selector, and modifying the HTML structure with new tags, removed tags, and updated inner markup.

You can keep up with further development via this overview issue in the Gutenberg GitHub repository.

#6-2, #dev-notes, #dev-notes-6-2