Progress Report: HTML API

The HTML API continues to receive paced and steady development. The Tag Processor was introduced in WordPress 6.2, the HTML Processor in WordPress 6.4, and the ability to traverse all syntax tokens in a document was added in WordPress 6.5. What’s been happening since then and what’s in store for the coming releases?

Quick Review

The last Progress Report explained what the HTML API provides and why it was introduced. The HTML API is being designed to provide a reliable, safe, convenient, and efficient means for understanding and modifying HTML from the server. The driving motivation is to support all server-side needs when working with HTML, while the pace of development is governed by moving as fast as possible without compromising safety and reliability.

While the eventual goal is to provide an assortment of DOM-like methods (for example, set_inner_html()), work continues to ensure that the foundation on which these are built is steady, so that you can trust it from day one.

What’s being worked on right now?

Finish the HTML Processor.

The largest remaining goal is adding full HTML support to the HTML Processor. While the Tag Processor understands each tag, each comment, and each text node in isolation and can even change a tag’s attributes, it’s unaware of where one element begins and ends. HTML is, from a certain perspective, a shorthand script for a full DOM document, and the tags which are or are not present may not correspond to what a browser creates when parsing them. The HTML Processor takes the stream of tokens produced by the Tag Processor and builds a map of how the tags relate to elements.

<p>This is <b>bold.<p>This is also bold.</p>

For example, a <p> tag can appear after another <p> tag but before a </p> has been found. This is often referred to as having missing or overlapping tags. However, the HTML specification’s rules are clear in this situation: the second <p> tag implies the closing of the first one as well as any other elements which were already open, and it starts a new P element as a sibling of the first. The HTML Processor knows this so you don’t have to.

Unfortunately there are extensive rules and the HTML Processor doesn’t currently understand them all. Thankfully, if it doesn’t understand a situation it finds itself in, then it will “bail” by returning false and storing an error, retrievable with get_last_error().
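
As a minimal sketch of what handling a bail might look like, the helper below counts P elements and reports when the HTML Processor gave up. The function name example_count_paragraphs() is hypothetical; it assumes a WordPress environment (6.4 or later) where WP_HTML_Processor::create_fragment(), next_tag(), and get_last_error() are available.

```php
<?php
/**
 * Hypothetical helper: counts P elements in a fragment, or returns null
 * when the HTML Processor could not be created or bailed part-way through.
 */
function example_count_paragraphs( string $html ): ?int {
    $processor = WP_HTML_Processor::create_fragment( $html );
    if ( null === $processor ) {
        return null;
    }

    $count = 0;
    while ( $processor->next_tag( array( 'tag_name' => 'P' ) ) ) {
        $count++;
    }

    // next_tag() returns false both at the end of the document and on a
    // bail; a non-null get_last_error() distinguishes the two cases.
    if ( null !== $processor->get_last_error() ) {
        return null;
    }

    return $count;
}
```

Returning null instead of a partial count is a design choice for this sketch: a caller that cannot trust the whole document should fall back to leaving the HTML untouched.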

Represent virtual nodes.

The past couple of releases saw major progress in supporting more HTML elements, but one obstacle has stood in the way: so-called virtual nodes that don’t exist in the HTML text itself.

<li><p>One</p></p><p>Two</p></li>

Given the snippet of HTML above, what would you guess is the innerText of the second P element (p:nth-child(2))? You probably already know this is a trick question: it’s empty!

The rules of HTML dictate that when encountering an unexpected closing </p> tag for which there’s no corresponding opening tag, an empty P element without any attributes is created. Most tags don’t do this, but P does. If someone is using the HTML Processor to find the second P element in a document, it shouldn’t lead them astray, but how should it represent tags which in a sense don’t actually exist?

This problem has held up a number of the remaining tags because many of them lead to situations where not just one element appears, but possibly many elements are created at a very simple-looking boundary.

The HTML Processor is therefore undergoing an internal refactor to change the way it represents movement through the document. It will pause at these virtual nodes and represent them so that calling code can find them, though they will remain read-only for now. There’s a spectacular side-effect to this change though: the HTML Processor will be presenting a view of HTML in an idealized form. Because virtual nodes appear not only on openings, but also on closings, the processor will present the document as if it were entirely normative. There are no missing, unexpected, or overlapping tags; every opening tag expecting a closer will find one in the right place. It will be possible through the HTML Processor interface to reliably assume the basic structure of HTML: that’s right, all of those “what if” questions about quirky or mangled HTML are irrelevant, because the HTML Processor thinks in terms of HTML and not in terms of strings.

<a>link<a>link</a><ul><li><p><b>One<li>Two</b><li><p>Three</li></ul>Four

The snippet above looks broken, and very few parsers will know how to handle it the way a browser does. Consider, however, the sequence of tags or tokens that the HTML Processor will find (in the next snippet, imagine that the loop is printing a + for an opening tag and a - for a closing tag):

while ( $html_processor->next_token() ) { … }

[
  '+A', '#text', '-A',
  '+A', '#text', '-A',
  '+UL',
    '+LI', '+P', '+B', '#text', '-B', '-P', '-LI',
    '+LI', '+B', '#text', '-B', '-LI',
    '+LI', '+P', '#text', '-P', '-LI',
  '-UL',
  '#text'
]

While this may appear confusing, it’s worth spending some time pondering. The HTML Processor found many more tags than actually exist in the text of the HTML, because it knows it needs to create them as it steps through the document. There’s a really incredible implication here: traversing inside an element is trivial. This means that finding inner content and matching tags can be done the way we often expect or want it to, and that it’s a reliable means to do so. Every <a> opening tag will be followed by a </a> no matter what else is in the HTML. Even when dealing with nested tags and sibling tags, finding the end of an element is as simple as looking for the first closing tag of the same name at the same depth – it’s guaranteed to exist!

$div_depth = $processor->get_depth();
// Find where the DIV closes: keep advancing until reaching the closing
// DIV tag reported at the depth where the DIV was opened.
while ( $processor->next_token() ) {
    if (
        'DIV' === $processor->get_tag() &&
        $processor->is_tag_closer() &&
        $div_depth === $processor->get_depth()
    ) {
        break;
    }
    // Whatever this is, it's inside the DIV.
}
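
The same rule can be checked against the token list printed earlier without any WordPress code. The sketch below is only an illustration over plain strings: the helper name example_find_closer() is hypothetical, and it relies on the idealized guarantee described above that every opener has a matching closer.

```php
<?php
// Token list copied from the example output above.
$tokens = array(
    '+A', '#text', '-A',
    '+A', '#text', '-A',
    '+UL',
    '+LI', '+P', '+B', '#text', '-B', '-P', '-LI',
    '+LI', '+B', '#text', '-B', '-LI',
    '+LI', '+P', '#text', '-P', '-LI',
    '-UL',
    '#text',
);

/**
 * Hypothetical helper: returns the index of the token that closes the
 * opener at $open_index, or null if no such closer is found.
 */
function example_find_closer( array $tokens, int $open_index ): ?int {
    $name  = substr( $tokens[ $open_index ], 1 );
    $depth = 0; // Depth relative to the element being searched.

    for ( $i = $open_index + 1; $i < count( $tokens ); $i++ ) {
        $token = $tokens[ $i ];
        if ( '+' === $token[0] ) {
            $depth++;
        } elseif ( '-' === $token[0] ) {
            // The first same-name closer back at the starting depth ends the element.
            if ( 0 === $depth && "-{$name}" === $token ) {
                return $i;
            }
            $depth--;
        }
    }

    return null;
}

$ul_open  = array_search( '+UL', $tokens, true );
$ul_close = example_find_closer( $tokens, $ul_open );
```

Everything between $ul_open and $ul_close is the inner content of the UL, including the LI, P, and B closers that never appeared in the original HTML text.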

Eliminate text decoding problems.

Like most things in HTML, proper parsing is more complicated than it would appear at first. When decoding text content from the HTML, this is worse than it seems, because PHP currently lacks the ability to properly decode character references. Character references are what people often call “entities” (a term borrowed from XML). They start with & and what follows is either a number (in decimal or hexadecimal) that represents a Unicode code point, or a name found in a lookup table to map to a specific code point or sequence of code points (for example, &NotEqualTilde; maps to U+2242 U+0338 producing a not-approximately-equal sign ≂̸).

HTML incorporates special rules when decoding these character references to maintain backwards compatibility with common practices that predate HTML5. These mostly revolve around what happens when the final semicolon ; is missing. In many cases the ; isn’t necessary while in other cases it is. Notably, when the semicolon is optional, there are additional restrictions inside attribute values when the reference name could be ambiguous. This rule preserves cases when URL query arguments aren’t properly encoded, as every & in a URL ought to be encoded as &amp;, but often isn’t.

PHP’s html_entity_decode() currently lacks full support:

  • There are 1,730 named character references it’s unaware of.
  • &lang; and &rang; are improperly decoded as if decoding HTML4 instead of HTML5.
  • It doesn’t provide a way to govern whether the ambiguous ampersand rule should be applied.
  • It’s unable to decode named character references without a trailing ;.
  • It doesn’t handle edge cases where a character reference ends a string, such as at the end of an attribute value or right before another tag appears (at the end of a text node).

Further, HTML applies special rules to a range of code points when decoded from numeric character references. Many of you have probably seen cases where “smart quotes” turn into junk when rendered. This is because HTML may treat certain code points as if they represented bytes in the Windows-1252 encoding, but only when they come from numeric character references. This transformation is not applied in html_entity_decode() and the references are left intact.
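
Two of these gaps can be illustrated in plain PHP. This is a rough sketch, not the HTML API’s decoder: the constant EXAMPLE_C1_REMAP and the helper example_decode_code_point() are hypothetical names, the table holds only five entries from the longer remapping table in the HTML specification, and mb_chr() assumes the mbstring extension is loaded.

```php
<?php
// PHP requires the trailing semicolon, so this reference is left untouched.
$without_semicolon = html_entity_decode( '&amp', ENT_QUOTES | ENT_HTML5, 'UTF-8' );

// Partial, illustrative subset of the specification's remapping table for
// numeric character references that fall in the Windows-1252 range.
const EXAMPLE_C1_REMAP = array(
    0x80 => 0x20AC, // Euro sign.
    0x91 => 0x2018, // Left single quotation mark.
    0x92 => 0x2019, // Right single quotation mark.
    0x93 => 0x201C, // Left double quotation mark.
    0x94 => 0x201D, // Right double quotation mark.
);

/**
 * Hypothetical helper: converts one numeric code point to UTF-8, applying
 * the remapping for the few code points listed above.
 */
function example_decode_code_point( int $code_point ): string {
    $code_point = EXAMPLE_C1_REMAP[ $code_point ] ?? $code_point;
    return mb_chr( $code_point, 'UTF-8' );
}

// &#x93; in HTML is read as a left "smart quote", not as a control character.
$open_quote = example_decode_code_point( 0x93 );
```

A complete decoder also has to handle named references, missing semicolons, and invalid code points, which is why this belongs inside the HTML API rather than in calling code.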

The HTML API needs a mechanism for properly decoding text content by default, and that will likely appear in the form of two new methods:

  • WP_HTML_Decoder::decode_text_node() for decoding text found in normal markup (#text nodes).
  • WP_HTML_Decoder::decode_attribute() for decoding text found inside attribute values.

There is nothing special about these methods other than they should be a reliable mechanism for reading (in UTF-8) the actual text a browser would read for the given HTML. The HTML API knows the intricate details of HTML so you don’t have to.
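
As a hedged sketch of how the two proposed methods might be called, the helper below decodes the same raw string in both contexts. It assumes the methods land under the names given above and each accept a single raw string; the helper name example_decode_both_ways() is hypothetical.

```php
<?php
/**
 * Hypothetical helper: decodes the same raw text as a text node and as an
 * attribute value. Assumes WP_HTML_Decoder exists with the two proposed
 * static methods, each taking the raw string as its only argument.
 */
function example_decode_both_ways( string $raw ): array {
    return array(
        'text_node' => WP_HTML_Decoder::decode_text_node( $raw ),
        'attribute' => WP_HTML_Decoder::decode_attribute( $raw ),
    );
}
```

Under the ambiguous ampersand rule described earlier, an input such as ?a=1&not=2 would be expected to survive unchanged as an attribute value, while in a text node the &not would decode to ¬ even without its semicolon.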


For WordPress 6.6 these are the two primary changes planned: ensure that the HTML Processor visits all virtual nodes, and properly decode all text. For the most part that means no major changes to the interface; everything is in a sense a bug-fix to more closely conform the implementation to its design.

What’s coming after these internal improvements?

Reading and modifying sourced block attributes.

A very exciting development that started over a year ago is resurfacing: reading block attributes from the server, at least for blocks with a block.json file. The ability to read the “sourced attributes” for a block was one of the driving reasons that work on the HTML API progressed beyond the Tag Processor. It mainly comprises parsing a CSS selector, finding a matching location within an HTML document, and then reading an attribute, inner text, or inner HTML.
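
A heavily simplified sketch of the last two steps is shown below, using the existing Tag Processor. It is not the attribute sourcer itself: the selector is reduced to a bare tag name, the helper name example_read_sourced_attribute() is hypothetical, and it assumes a WordPress environment (6.2 or later) where WP_HTML_Tag_Processor is loaded.

```php
<?php
/**
 * Hypothetical helper: reads an attribute from the first tag with the given
 * name. A real attribute sourcer would match a full CSS selector and would
 * need to understand HTML structure, which the Tag Processor does not.
 */
function example_read_sourced_attribute( string $block_html, string $tag_name, string $attribute ): ?string {
    $processor = new WP_HTML_Tag_Processor( $block_html );
    if ( ! $processor->next_tag( $tag_name ) ) {
        return null;
    }

    $value = $processor->get_attribute( $attribute );

    // get_attribute() returns true for boolean attributes and null when the
    // attribute is absent; only string values are treated as sourced here.
    return is_string( $value ) ? $value : null;
}
```

For an image-like block, a call such as example_read_sourced_attribute( $html, 'img', 'src' ) would stand in for a block.json attribute sourced from the src attribute of an img selector.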

While the initial prototypes were encouraging, it was clear that it would be important to truly understand HTML structure in order to do this right. The concept of balanced tags [1] simply isn’t a very useful model for understanding HTML. Now that the HTML Processor is so much further along, however, rebuilding the attribute sourcer is turning out to be not only more reliable, but considerably simpler too!

It’s our hope that we have a working system ready by the time that WordPress 6.6 is released so that we can test it for the whole 6.7 release cycle. This is useful for individual render functions, and the Block Bindings project needs to be able to read and modify these attributes as well. By ensuring that the system is robust enough to handle whatever HTML comes its way, we can make Block Bindings work for any block attribute.

Full HTML support.

With the issue of virtual nodes taken care of, it is possible to push forward on supporting even more HTML tags and situations. One issue will remain, which is a tricky situation where something called fostering occurs. This can happen, for instance, when tags are found where they shouldn’t be, such as a <div> tag inside a TABLE element but not inside a TD or TH. Strangely enough, that DIV element will be moved up in front of the TABLE element. This implies that there’s a kind of retroactive change in the document after we’ve visited a location in it. There is currently no clear way to represent this situation or communicate it to calling code, so the HTML Processor will continue to bail in these rare scenarios.

Apart from that, the rest of the HTML support is large and tedious but straightforward. Expect that in WordPress 6.7 you will be able to send the HTML Processor almost any HTML and it will be able to fully understand the document.

Little bits of semantic meaning.

With the introduction of the block editor, WordPress largely lost Shortcodes, which were the go-to way of incorporating small tidbits of external content or meaning into a post. Shortcodes had their shortcomings, but they also had value. For more than a couple years we’ve been discussing various approaches to bringing shortcodes back: safely and without the most significant drawbacks (breaking HTML, taking over layout, ambiguous nesting rules, introducing a full page of content, etc…). The HTML API changed the game for all of these explorations because it offers a way to build a context-aware auto-escaping templating engine that can power the next generation of Shortcodes, what we have lately been calling “Bits” (Blocks are big, and Bits are small 😉).

While it’s currently possible to add a post author block into a post template, or use a block binding to replace a paragraph’s content with a custom field, Bits will open up new opportunities to place these snippets of external content anywhere you want them, even in the middle of a paragraph or image caption!

In WordPress 6.6 look for explorations in the editor for how to enter and configure Bits. Work has already started in WordPress’s backend to ensure that the Bit syntax is preserved through post saves and renders. This is going to be a large project with many different systems working together, so it won’t likely be available anytime soon, but many of the independent pieces will be appearing in the next couple releases for those who want to explore the foundations of how they work.

Following the progress

For updates to the HTML API keep watch here for new posts.

If you want to follow the development work you can review Trac tickets in the HTML API component or start watching new HTML API tickets from the component overview page. If you want to talk details or bugs or applications, check out the #core-html-api channel in WordPress.org Slack.

Acknowledgements

Thanks @gziolo, @jonsurrell, and @westonruter for helping create and edit this post.

#html-api

  1. “Balanced tags” is a best-effort guess at HTML structure based on scanning from an element’s opening tag until an appropriate closing tag is found. Opening tags along the way will increase a depth just as closing tags decrease the depth. In the idealized view of HTML that the HTML Processor provides this guess is reliable, but without that, it’s very difficult in practice with real HTML documents to reliably understand where an element opens and closes.