Progress Report: HTML API

Quick review

WordPress’s HTML API is developing at a rapid pace. Introduced in WordPress 6.2, and expanded in each release since, the HTML Processor is expected to support all HTML inputs¹ in WordPress 6.7. What does this mean for you and where does development go once “full support” has been reached?

What’s being worked on right now?

Full HTML support

The major internal changes to the HTML Processor in WordPress 6.6 opened the door to finish adding HTML support, meaning things like TEMPLATE tags, tables, SVG, and forms. These are now all supported in the HTML API.

The HTML Processor also gained a new constructor method: create_full_parser(). The full parser expects an HTML document in its entirety, including the doctype declaration, <html> tag, and all. The existing mechanism, create_fragment(), assumes that the given input exists within an existing HTML context, such as inside a <body> tag. If you’ve ever wondered why DOMDocument creates an HTML and HEAD element even when none exists, it’s because it’s performing a full document parse. While both methods have their place, create_fragment() is probably still what you want to use because WordPress rarely presents a complete HTML document to its functions, plugins, and themes.
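As a rough illustration of the difference, the sketch below runs one helper against both constructors. It assumes WordPress 6.7 or later is loaded (for example through WP-CLI); demo_first_paragraph_breadcrumbs() is a hypothetical helper name and not part of the HTML API.

```php
<?php
/**
 * Hypothetical helper (not part of the HTML API): finds the first P element
 * and returns its breadcrumbs joined into a string, or null if none is found.
 */
function demo_first_paragraph_breadcrumbs( ?WP_HTML_Processor $processor ): ?string {
    if ( null === $processor || ! $processor->next_tag( 'P' ) ) {
        return null;
    }

    return implode( ' > ', $processor->get_breadcrumbs() );
}

// Guarded so the snippet only runs where the WordPress 6.7 constructors exist.
if ( class_exists( 'WP_HTML_Processor' ) && method_exists( 'WP_HTML_Processor', 'create_full_parser' ) ) {
    // Fragment parser: the input is assumed to live inside a BODY element.
    $fragment = WP_HTML_Processor::create_fragment( '<p>Hello</p>' );

    // Full parser: the input is a whole document, doctype and all.
    $document = WP_HTML_Processor::create_full_parser(
        '<!DOCTYPE html><html><head><title>Demo</title></head><body><p>Hello</p></body></html>'
    );

    var_dump( demo_first_paragraph_breadcrumbs( $fragment ) );
    var_dump( demo_first_paragraph_breadcrumbs( $document ) );
}
```

Both calls should place the paragraph at HTML > BODY > P. The fragment parser supplies that surrounding context on its own, while the full parser reads it from the document it was given.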

All supported tags, but with a caveat

There are a few situations that the HTML Processor still won’t support for WordPress 6.7. These are cases where nodes are supposed to be re-parented, and they have remained an obstacle for the past few releases. The problem isn’t knowing how to parse these situations, but rather that they imply that parts of a document have changed after the processor has already seen them. They occur most frequently when formatting elements are implicitly closed and then need to be re-opened, and when content inside a TABLE element is not in a cell. This can be seen in the following HTML:

<table><td>1</td>2

This looks innocent enough, but the 2 belongs in front of the TABLE element. When nodes are moved around after the fact like this, it’s possible that the parser might have seen hundreds of kilobytes of content in the meantime. This disconnect between the order of the HTML text and the traversal order in the DOM tree raises complicated questions, including for situations where inner markup is being modified.

This snippet is equivalent to the following HTML.

2<table><tbody><tr><td>1</td></tr></tbody></table>

A cursory analysis of high-ranked domains on the broader internet suggests that around 1% of real web pages involve these and other unsupported situations. Luckily, most content within WordPress and all content generated by frameworks like React avoid these invalid HTML constructions.
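Because the HTML Processor aborts rather than guessing when it meets one of these cases, calling code can check whether a given input was fully processed. The following is a minimal sketch, assuming WordPress 6.4 or later is loaded; demo_is_fully_supported() is a hypothetical helper name.

```php
<?php
/**
 * Hypothetical helper: walks every token in a fragment and reports whether
 * the processor reached the end without aborting.
 */
function demo_is_fully_supported( string $html ): bool {
    $processor = WP_HTML_Processor::create_fragment( $html );
    if ( null === $processor ) {
        return false;
    }

    while ( $processor->next_token() ) {
        // Nothing to do here; the loop only advances the parser.
    }

    // A non-null error (such as WP_HTML_Processor::ERROR_UNSUPPORTED)
    // means the processor stopped before the end of the input.
    return null === $processor->get_last_error();
}

// Guarded so the snippet only runs inside WordPress.
if ( class_exists( 'WP_HTML_Processor' ) ) {
    // The table example from above, followed by a well-formed table.
    var_dump( demo_is_fully_supported( '<table><td>1</td>2' ) );
    var_dump( demo_is_fully_supported( '<table><tr><td>1</td></tr></table>' ) );
}
```

Checking get_last_error() after the loop is what separates "the document ended" from "the processor gave up", since next_token() returns false in both cases.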

Reading and modifying inner HTML

One question that has delayed the introduction of methods like set_inner_html() is how to know whether content is already escaped or not. Can a system be built that allows “trusted” input without introducing opportunities for vulnerabilities? With the expansion of support and the unlocking of new fragment parser contexts, there’s finally a way forward to address this question: do what a browser does.

To set inner content, the HTML Processor will create a new fragment parser using the current element as the context element. It will run the fragment parser to its end and transfer the parsed nodes into the target, just like node.innerHTML = 'string of <code>HTML</code>' in the DOM does. This approach seems obvious from some angles, but implies that the HTML API will be parsing (and possibly re-parsing) any contents that are being set into a target document – this is a slower approach than naïvely slicing in new HTML.

This approach has a fascinating upside though: in addition to ensuring that content is properly escaped, it will normalize the content and prevent the new HTML contents from escaping its outer node. It properly isolates HTML updates. For example, suppose one wants to set the inner HTML for a DIV element.

$processor = WP_HTML_Processor::create_fragment( '<main><div>Content here</div></main>' );
if ( $processor->next_tag( 'DIV' ) ) {
    $processor->set_inner_html( 'What will </div> do?' );
}
echo $processor->get_updated_html();

What should happen with the </div>? If the HTML API were simply replacing text or concatenating HTML, then the update would close the open <div> and the “ do?” content would now appear outside of the intended spot.

// Broken by string concatenation unaware of HTML.
<main><div>What will </div> do?</div></main>

Since the HTML API creates a fragment parser in the appropriate context, however, when it finds the </div> token it will realize that there’s no open <div> and ignore the token (note the extra space below, which in a browser will collapse into a single space). The content will remain inside the original DIV.

<main><div>What will  do?</div></main>

Things can appear fairly wacky at times, but the normalization process will leave the HTML better than when it found it.

if ( $processor->next_tag( 'DIV' ) ) {
    $processor->set_inner_html(
        '<em class=i class="dup"></p class="not-here"></div><li>1<li>2<!--'
    );
}

echo $processor->get_updated_html();

<main><div><em class="i"><p></p><li>1</li><li>2</li></em></div></main>

Modifying HTML content mostly depends on being able to create a new fragment parser in the context of the current element. Since this is an ongoing project, it is unlikely to make it into WordPress 6.7, but it should be available early in the 6.8 development cycle.

Review of all CSS semantics

The job of the HTML API is to know all of the minute details of HTML parsing so that you don’t have to. To that end, all of the CSS-related methods have been audited to ensure that they reliably match the behavior in a browser. This mostly involved examining the impact of “quirks mode,” which determines whether class selectors match names in a case-sensitive manner or not.

So if you ask whether CSS class names are matched case-sensitively or not then the HTML API will answer based on the HTML document it was provided. The Tag Processor and fragment parser will assume no-quirks mode (standards mode) while the full parser will choose based on the rules in the HTML specification.
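The sketch below asks the same has_class() question of the Tag Processor and of the full parser. It assumes WordPress 6.7 or later is loaded; demo_has_class_by_mode() is a hypothetical helper name. The test input has no doctype, which the HTML specification treats as a quirks-mode document.

```php
<?php
/**
 * Hypothetical helper: asks the Tag Processor and the full parser whether
 * the first DIV in the given HTML has the given class name.
 */
function demo_has_class_by_mode( string $html, string $class_name ): array {
    $results = array();

    // The Tag Processor assumes no-quirks (standards) mode.
    $tags = new WP_HTML_Tag_Processor( $html );
    if ( $tags->next_tag( 'DIV' ) ) {
        $results['tag_processor'] = $tags->has_class( $class_name );
    }

    // The full parser picks its mode from the document itself.
    $full = WP_HTML_Processor::create_full_parser( $html );
    if ( null !== $full && $full->next_tag( 'DIV' ) ) {
        $results['full_parser'] = $full->has_class( $class_name );
    }

    return $results;
}

// Guarded so the snippet only runs where the WordPress 6.7 API exists.
if ( class_exists( 'WP_HTML_Processor' ) && method_exists( 'WP_HTML_Processor', 'create_full_parser' ) ) {
    // The class attribute says "Wide" but the query asks for "wide".
    var_dump( demo_has_class_by_mode( '<div class="Wide">text</div>', 'wide' ) );
}
```

If the behavior matches the description above, the Tag Processor would report no match because it compares case-sensitively, while the full parser would report a match because the document lacks a doctype.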

What’s coming after reaching full tag support?

Continuing towards 100%

Parsing HTML involves no uncertainty: WordPress needs to understand all possible HTML documents. In the coming releases work will continue on the trickier design details of supporting the final unsupported bits.

In some situations an element is found outside where it should be. For example, a <p> tag might appear after a BODY element is closed, but it belongs inside the BODY element. The HTML Processor could report this at the proper breadcrumbs (because it knows it belongs at HTML > BODY > P), but that would appear as though the parser jumped backwards in the document back into the BODY. A document should only ever open and close the BODY once. Whether it’s an errant <meta> tag, a <script> after the BODY, or an HTML comment after the BODY, these situations account for about half of the situations that aren’t supported.

The other half of the unsupported situations are more complicated. They involve things which are easy in the DOM but hard in HTML. Specifically, there are times where elements get moved up higher in the tree, or earlier, or are duplicated in different parts of the tree. The HTML API knows where these elements should move, but it doesn’t know until after it has parsed and moved on from those places. It may be possible to look ahead in the document and rearrange the virtual sequence in the HTML, but this is a complicated and delicate operation that requires lots of care.

However these situations are handled, they will probably be handled in an opt-in manner. Lookahead might resolve the problem of accurately representing the document, but in some extreme cases, it may be required to fully parse the entire document before returning control to the calling code. Potentially there could be a user-selectable amount of lookahead before the processor gives up, and a small default value could strike a good balance between fully-supported parsing and surprise performance behaviors.

$processor->set_allowable_lookahead( 3 );

These unsupported situations present some design issues that also need to be solved when discussing modification of an HTML document. For example, in cases where formatting elements are reconstructed, if some code adds a new attribute to the real formatting token then the attribute value also appears on all of the recreated versions of it. How should the HTML API know if someone wanted to set an attribute only on the first element or if they wanted to set it also on all reconstructed elements?

In some cases this can all be avoided by using the HTML API to normalize a document before processing, but in other cases these questions need to be explored before making any hasty decisions. Supporting out-of-order elements makes reading and writing inner HTML significantly more complicated because it can’t return a simple substring, while auto-fixing could invoke catastrophic backtracking in the parser.
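For the normalize-first route, a minimal sketch follows. It assumes WordPress 6.7 or later, where WP_HTML_Processor::normalize() is available; demo_normalize_fragment() is a hypothetical wrapper name, and returning null on failure is one possible policy rather than a recommendation.

```php
<?php
/**
 * Hypothetical wrapper: returns a normalized copy of an HTML fragment, or
 * null when the HTML API cannot process the input.
 */
function demo_normalize_fragment( string $html ): ?string {
    // normalize() parses the fragment and serializes what it found.
    return WP_HTML_Processor::normalize( $html );
}

// Guarded so the snippet only runs where normalize() exists.
if ( class_exists( 'WP_HTML_Processor' ) && method_exists( 'WP_HTML_Processor', 'normalize' ) ) {
    $normalized = demo_normalize_fragment( '<em class=i class="dup"><li>1<li>2' );

    if ( null === $normalized ) {
        // The input hit an unsupported case; the caller decides what to do.
        echo "Could not normalize this input.\n";
    } else {
        echo $normalized, "\n";
    }
}
```

Later processing can then work on markup whose implied tags and duplicate attributes have already been resolved, which sidesteps some of the questions above but not the re-parenting cases that the processor rejects outright.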

Reading sourced block attributes on the server

Although this feature has also been pushed back a couple of releases, the work continues. There wasn’t sufficient HTML support in WordPress 6.6 to allow reading block attributes reliably enough to be usable. With the practically-complete support in WordPress 6.7 this will no longer be a problem. The primary challenge now is building an interface to parse a CSS selector in a way that properly translates into HTML API code.

When this work was first explored it relied on the Tag Processor and started searching in a top-down manner based on the selector. This isn’t the most efficient way to search, however. In rebuilding the block attribute sourcer for the HTML Processor, the parsing of the CSS selector is going to need to figure out how to search in a bottom-up manner. It’s also going to need to pay attention to things like class names, attribute values, and nth-child counts for relevant tokens while parsing.
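To make the bottom-up idea concrete, here is a small self-contained sketch that checks a descendant selector such as "main p" against a breadcrumb trail, starting from the current element and walking toward the root. It is an illustration only: demo_matches_descendant_selector() is a hypothetical function, it handles only tag names joined by descendant combinators, and it is not how the block attribute sourcer is implemented.

```php
<?php
/**
 * Hypothetical bottom-up matcher for descendant selectors made of tag names.
 *
 * @param string[] $breadcrumbs Path from the root to the current element,
 *                              e.g. array( 'HTML', 'BODY', 'MAIN', 'P' ).
 * @param string[] $selector    Tag names from a selector such as "main p",
 *                              e.g. array( 'MAIN', 'P' ).
 */
function demo_matches_descendant_selector( array $breadcrumbs, array $selector ): bool {
    if ( empty( $breadcrumbs ) || empty( $selector ) ) {
        return false;
    }

    $crumb = count( $breadcrumbs ) - 1;
    $part  = count( $selector ) - 1;

    // The right-most selector part must match the current element itself.
    if ( 0 !== strcasecmp( $breadcrumbs[ $crumb ], $selector[ $part ] ) ) {
        return false;
    }
    $crumb--;
    $part--;

    // Each remaining part must match some ancestor, in order, moving upward.
    while ( $part >= 0 && $crumb >= 0 ) {
        if ( 0 === strcasecmp( $breadcrumbs[ $crumb ], $selector[ $part ] ) ) {
            $part--;
        }
        $crumb--;
    }

    // Every selector part was consumed, so the whole selector matched.
    return $part < 0;
}

var_dump( demo_matches_descendant_selector( array( 'HTML', 'BODY', 'MAIN', 'DIV', 'P' ), array( 'MAIN', 'P' ) ) );
var_dump( demo_matches_descendant_selector( array( 'HTML', 'BODY', 'DIV', 'P' ), array( 'MAIN', 'P' ) ) );
```

Starting at the current element means most candidates are rejected after a single comparison, which is the appeal of searching bottom-up. Class names, attribute values, and nth-child counts would need extra state gathered during parsing, which this sketch leaves out.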

WordPress Bits

WordPress 6.7 is not going to see any major work on the Bits proposal, but the work is still being actively pursued. Other developments on templating, custom data types/fields, and block bindings continue to explore the space from the UI and UX angle.

Native PHP support

The HTML API will be significantly faster when it’s available in PHP, written in C instead of user-land PHP. The first proposal to move the HTML API into PHP itself is in the RFC process, adding a decode_html() function corresponding to the recently-introduced WP_HTML_Decoder class.
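Until a native function exists, the user-land decoder can be called directly. The sketch below assumes WordPress 6.6 or later is loaded and uses the WP_HTML_Decoder class; demo_decode_samples() is a hypothetical helper name. The proposed decode_html() function is left out because it is still going through the RFC process.

```php
<?php
/**
 * Hypothetical helper: decodes one sample text node and one sample
 * attribute value with the user-land decoder.
 */
function demo_decode_samples(): array {
    return array(
        // Text nodes and attribute values follow different decoding rules
        // in HTML, so the class exposes a method for each.
        'text'      => WP_HTML_Decoder::decode_text_node( 'Fish &amp; Chips &lt;3' ),
        'attribute' => WP_HTML_Decoder::decode_attribute( '?a=1&amp;b=2' ),
    );
}

// Guarded so the snippet only runs where the decoder class exists.
if ( class_exists( 'WP_HTML_Decoder' ) ) {
    var_dump( demo_decode_samples() );
}
```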

Additional notes

An HTML API debugger makes exploring HTML fun

@jonsurrell built a WordPress plugin to add a visual HTML API debugger in wp-admin and it’s built with the Interactivity API. This is a fun tool that shows how the HTML API parses a given HTML input. In addition, it provides insight into the parsing mechanics that lead to the final structure. One can learn a lot about HTML parsing just by seeing how different inputs are processed. For example, the “virtual” nodes in the output represent tags that weren’t in the HTML text but were implicitly created based on HTML’s parsing rules.

Following the progress

For updates to the HTML API keep watch here for new posts.

If you want to follow the development work you can review Trac tickets in the HTML API component or start watching new HTML API tickets from the component overview page. If you want to talk details or bugs or applications, check out the #core-html-api channel in WordPress.org Slack.

See also the previous Progress Report and Updates to the HTML API in 6.6 for recently-completed developments.

Acknowledgements

Thanks to @annezazu, @gziolo, @jonsurrell, @westonruter for review and guidance when writing this post.

  1. “All” is mostly correct, because while all HTML tags are supported, there are still situations, accounting for around 1% of all HTML documents on the internet, that get into edge cases that it can’t parse. These cases all deal with content that’s found outside of the place it should be, where the tree-order of the nodes in a DOM is rearranged from the order in which the HTML tags are found. The HTML Processor still aborts parsing when it encounters unsupported markup, so it should continue to be reliable for everything it supports. ↩︎

#html-api, #html-api-progress-report