Dennis Snell 5:08 pm on November 21, 2025
Tags: 6-9, dev-notes, dev-notes-6-9, html-api

Updates to the HTML API in 6.9

WordPress 6.9 brings an abundance of quiet improvements to the HTML API. Updates in this release mostly represent applications of the HTML API to existing code in Core; these updates increase WordPress’ reliability, improve its security hardening, and reduce maintenance burden on the project.

Major Updates

`WP_HTML_Processor::serialize_token()` is now public.

The HTML Processor’s serialize_token() method returns a fully-normalized and well-formed representation of the currently-matched token. It was introduced in #62036 for WordPress 6.7 as a private method which performs the heavy-lifting for how the HTML API turns “junk” inputs into equivalent well-formed outputs. For example:

$html = '5 < 8 & <tag a=v a="dup"id=di></3>bl&#97rg';
echo WP_HTML_Processor::normalize( $html );
// 5 &lt; 8 &amp; <tag a="v" id="di"><!--3-->blarg</tag>

Its value outside of WP_HTML_Processor::normalize() became evident, however, particularly in the creation of “serialization builders¹” which make it possible to modify more of the HTML structure than the HTML Processor itself does. In typical HTML API loops, this method can be used to safely extract portions of the document:

// Extract the outerHTML of every paragraph element.
$processor = WP_HTML_Processor::create_fragment( $html );
$content   = '';
while ( $processor->next_tag( 'P' ) ) {
    $content .= $processor->serialize_token();
    $depth    = $processor->get_current_depth();
    while (
        $processor->next_token() &&
        $processor->get_current_depth() > $depth
    ) {
        $content .= $processor->serialize_token();
    }
    $content .= $processor->serialize_token();
    $content .= "\n\n";
}

WordPress understands JavaScript `.dataset` properties.

HTML provides a convenient mechanism tying HTML and JavaScript together through the custom data attributes on a tag. These are the attributes starting with data- like data-wp-interactive or data-post-id, and their values are available on the corresponding Element object in JavaScript through the .dataset property:

<span data-order="Carrots please!">
    What should we order?
</span>
<script>
document.body.addEventListener(
    'click',
    event => alert( event.target.dataset.order )
);
</script>

There are endless ways this integration can be used to add a level of dynamism to a site. Unfortunately, how the names of these attributes are transformed looks simpler than it is. For example, the data-wp-bind--class HTML attribute corresponds to the wpBind-Class dataset property.

To prevent confusion, WordPress 6.9 includes two new functions to map between the HTML and JavaScript names: wp_js_dataset_name() indicates what would appear on the .dataset property in a browser while wp_html_custom_data_attribute_name() indicates what name should be used in HTML to produce the .dataset property of a given name. For example:

// What would this HTML attribute name correspond to in JavaScript?
echo wp_js_dataset_name( 'data-one-two--three---four' );
// oneTwo-Three--Four

// What HTML attribute name is necessary to produce the given JavaScript name?
echo wp_html_custom_data_attribute_name( 'postId.guid' );
// data-post-id.guid
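
For readers curious about what these two functions compute, here is a minimal pure-PHP sketch of the name conversion as the HTML specification describes it for dataset. The sketch_* function names are hypothetical and exist only for illustration. This is not the Core implementation, and it skips the validation real code would need: it assumes the attribute name is already lowercase and starts with data-.

```php
<?php
/**
 * Hypothetical sketch: maps a `data-*` attribute name to its `.dataset` key.
 * Assumes a lowercase name that starts with "data-".
 */
function sketch_dataset_name( string $attribute_name ): string {
    // Drop the leading "data-" prefix.
    $suffix = substr( $attribute_name, 5 );

    // A dash followed by an ASCII lowercase letter collapses into the uppercase letter.
    return preg_replace_callback(
        '/-([a-z])/',
        static function ( array $matches ): string {
            return strtoupper( $matches[1] );
        },
        $suffix
    );
}

/**
 * Hypothetical sketch: maps a `.dataset` key back to its `data-*` attribute name.
 */
function sketch_custom_data_attribute_name( string $dataset_name ): string {
    // Each ASCII uppercase letter becomes a dash followed by its lowercase form.
    $suffix = preg_replace_callback(
        '/[A-Z]/',
        static function ( array $matches ): string {
            return '-' . strtolower( $matches[0] );
        },
        $dataset_name
    );

    return 'data-' . $suffix;
}

echo sketch_dataset_name( 'data-one-two--three---four' ), "\n";
echo sketch_custom_data_attribute_name( 'postId.guid' ), "\n";
```

Only a dash directly followed by a lowercase letter is consumed, which is why runs of dashes keep all but their last dash. In WordPress itself, prefer wp_js_dataset_name() and wp_html_custom_data_attribute_name() over a hand-rolled version like this one.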

No more hard-coding HTML string assertions in unit tests.

WordPress is full of unit tests asserting specific HTML transformations. The expected outputs for these tests are usually hard-coded and sent to $this->assertSame() to compare against the actual outputs from the code under test. Unfortunately this tends to produce a high rate of false positives because of trivialities like adding an attribute in a different order than was expected, using single-quotes around an attribute value rather than double-quotes, leaving extra whitespace or not enough, or using the mistaken self-closer on an <img> or <br> tag.

When two HTML strings produce the same result in a browser they should pass regardless of their insignificant differences. To ease the development of these kinds of tests and to reduce their false-positive rates, WordPress 6.9 introduces a new method on the WP_UnitTestCase base class: $this->assertEqualHTML().

This new test assertion verifies that two strings are equivalent representations of the same normative HTML. It compares HTML strings semantically, provides more useful output than string comparison when it fails, and is even aware of block semantics.

$this->assertEqualHTML(
  "<img src='puppy&period;jpg'   loading=lazy>",
  '<img loading="l&#97zy"src="puppy.jpg"/>'
);

 ✔︎ Is equivalent html

Time: 00:00.038, Memory: 40.00 MB

OK (1 test, 1 assertion)

This test case would pass since the arguments are two equivalent constructions of the same IMG element. With a few small changes, however, the assertion fails and succinctly highlights the differences. The addition of the block comment delimiter is for illustrative purposes only.

$this->assertEqualHTML(
    "<!-- wp:image {\"id\":5} --><img src='puppy.jpg' loading=lazy>",
    '<!-- wp:img {"id":6} --><img loading="lazy" data-priority=5 src=puppy.jpg/>'
);

 ✘ Is equivalent html
   ┐
   ├ HTML markup was not equivalent.
   ├ Failed asserting that two strings are identical.
   ┊ ---·Expected
   ┊ +++·Actual
   ┊ @@ @@
   ┊ -'BLOCK["core/image"]
   ┊ +'BLOCK["core/img"]
   ┊    {
   ┊ -····"id": 5
   ┊ +····"id": 6
   ┊    }
   ┊    <img>
   ┊ +····data-priority="5"
   ┊      loading="lazy"
   ┊ -····src="puppy.jpg"
   ┊ +····src="puppy.jpg/"
   ┊  '
   │
   ╵ /WordPress-develop/tests/phpunit/includes/abstract-testcase.php:1235
   ╵ /WordPress-develop/tests/phpunit/tests/html/equivalentHtmlTest.php:10
   ┴

Time: 00:00.038, Memory: 40.00 MB

The HTML API received minor updates.

The Tag Processor’s constructor will now cast null to an empty string. Similarly, when given a non-string input such as null, the static creator methods on the HTML Processor will return null instead of an instance of the WP_HTML_Processor class. In each case a _doing_it_wrong() notice will alert developers that these classes expect a string input. This change stops type errors from being buried, which previously led to unexpected crashes later on, such as when calling get_updated_html().
When calling set_modifiable_text() on a SCRIPT element, updates are rejected if they contain <script or </script. This is a conservative measure to avoid entering the script data double escaped state, which is prone to misinterpretation.
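
In practice, both changes suggest checking return values rather than assuming success. The following is a minimal sketch, assuming WordPress 6.9 or newer is already loaded. The helper name sketch_lazy_load_images() and the sample strings are hypothetical.

```php
<?php
// Sketch only: assumes WordPress 6.9 or newer is already loaded.

/**
 * Hypothetical helper: adds loading="lazy" to every IMG in a fragment.
 * Returns an empty string when no processor could be created.
 */
function sketch_lazy_load_images( $html ) {
    $processor = WP_HTML_Processor::create_fragment( $html );

    // The creator method may return null, for example when given a non-string input.
    if ( null === $processor ) {
        return '';
    }

    while ( $processor->next_tag( 'IMG' ) ) {
        $processor->set_attribute( 'loading', 'lazy' );
    }

    return $processor->get_updated_html();
}

// set_modifiable_text() reports whether the update was accepted.
$processor = new WP_HTML_Tag_Processor( '<script>console.log( "hi" );</script>' );
if ( $processor->next_tag( 'SCRIPT' ) ) {
    $was_updated = $processor->set_modifiable_text( 'document.write( "<script>" );' );
    if ( ! $was_updated ) {
        // The new contents were rejected; the original SCRIPT contents remain.
        error_log( 'SCRIPT update rejected by the HTML API.' );
    }
}
```

Returning an empty string for a missing processor is one possible choice. Depending on the caller, returning the original input unchanged may be more appropriate.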

Full Changelog

Enhancements

wp_js_dataset_name() and wp_html_custom_data_attribute_name() map between HTML attributes and the .dataset property in JavaScript. [#61501, PR#9953]
The WP_UnitTestCase class now contains an assertEqualHTML() method which determines if two strings represent the same normative HTML. [#63527, PR#8882]
Multiple length checks are safely skipped when processing SCRIPT content due to an early minimum-length check. [#63738, PR#9230]
Encoding detection in META tags is simplified, leading to a minor performance lift. [#63738, PR#9231]
WP_HTML_Processor::serialize_token() is now public, making it easier to mix the full safety of the HTML API with outside code modifying and combining HTML. [#63823, PR#9456]
The Tag Processor and HTML Processor handle invalid null inputs safely. [#63854, PR#9545]
set_modifiable_text() rejects additional contents inside a SCRIPT element when the contents could disturb its normal closing. [#63738, PR#9560]

Bug Fixes

Attempting to avoid the HTTP Referer problem, quirks mode is referred to as indicated_compatibility_mode. [#63391, PR#9401]
wp_kses() no longer unescapes escaped numeric character references for users without unfiltered_html, preserving more of the actual entered content in a post or comment. [#63630, PR#9099]
SCRIPT tags are properly closed in the presence of abruptly-closed HTML comments within the contents, and when the closing SCRIPT tag’s tag name is delimited by a form-feed. [#63738, PR#9397]
wp_kses() now allows some previously-missing HTML5 semantic tags and their attributes. [#63786, PR#9379]
set_attribute() directly escapes syntax characters into HTML character references to avoid problems with double-escaping logic. This ensures that all values are represented accurately in the resulting HTML. [#64054, PR#10143]

Core refactors

A number of places in Core were updated to benefit from the HTML API.

Several of the unit tests now rely on assertEqualHTML(), including those for block supports, wp_rel_nofollow(), wp_rel_ugc(), wp_kses(), post-filtering, media, and oEmbed filtering. [#59622, #63694, PR#5486, PR#9251, PR#9255, PR#9257, PR#9258, PR#9259, PR#9264]
get_url_in_content() relies on the Tag Processor to more reliably detect links. Besides improving general HTML parsing, this new version always returns the decoded href attribute, preventing confusion in downstream code. [#63694, PR#9272]
Processing for image blocks in classic themes is now performed via the HTML API rather than with PCREs. [#63694, PR#10218]

Acknowledgements

Props to @jonsurrell and @westonruter for reviewing this post.

¹ Methods to replace innerHTML and outerHTML, wrap an element, unwrap an element, insert elements, and more are possible by scanning through a document and conditionally copying the normalized tokens into an output string.


Dennis Snell 10:27 am on October 17, 2024
Tags: dev-notes, dev-notes-6-7, html-api

Updates to the HTML API in 6.7

After important internal changes to the HTML API in 6.6, WordPress 6.7 brings forward a major leap in support with the HTML Processor.

Major Updates

Full HTML Support

After a long time of only supporting a subset of HTML, the HTML Processor now supports practically all HTML and all HTML tags¹. The internal work during the past few releases made it possible to add the remaining support needed to tackle all the colorful varieties of HTML that exist. Only in the rarest situations will the HTML Processor now give up, and in none of these cases is it because the processor doesn’t know how to handle a given tag.

Effectively, the HTML API is entering a new phase, where the rules for understanding HTML are finally baked in, and the API can start to proliferate into new developer APIs and replace older and buggier HTML processing code.

Ability to modify most text in a document

The first of a new set of modification methods has been added to the HTML API: set_modifiable_text(). When paused on a text node, a special self-contained element like SCRIPT, STYLE, or TITLE, or on a comment, it’s possible to replace the text content of that node.

while ( $processor->next_tag( 'SCRIPT' ) ) {
    $existing_script = $processor->get_modifiable_text();
    $prefix          = "// Careful, this is executable!\n";
    $processor->set_modifiable_text( "{$prefix}{$existing_script}" );
}

This method will handle the appropriate escaping for the kind of node it is. For example, it will escape the text for a SCRIPT element in a different way than for a STYLE element or a raw text node. This is important, because it means you don’t have to think about what kind of text it is, but it will remain reliable.

It’s important to understand here that this operates on the level of individual text nodes. set_modifiable_text() is different than set_inner_text(). A given element may contain multiple text nodes which are separated by things like HTML comments, specific byte sequences, and invalid markup which is interpreted as a comment. For set_inner_text() it will be best to wait until the HTML API itself supports these operations.
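
To make the text-node distinction concrete, here is a minimal sketch, assuming WordPress 6.7 or newer is already loaded. The sample markup is hypothetical: a P element whose contents are split into two text nodes by a comment, so the loop visits and updates each text node separately.

```php
<?php
// Sketch only: assumes WordPress 6.7 or newer is already loaded.
$processor = new WP_HTML_Tag_Processor( '<p>one<!-- a comment -->two</p>' );

while ( $processor->next_token() ) {
    // Only plain text nodes are touched; tags and comments are skipped.
    if ( '#text' !== $processor->get_token_name() ) {
        continue;
    }

    $text = $processor->get_modifiable_text();
    $processor->set_modifiable_text( strtoupper( $text ) );
}

echo $processor->get_updated_html();
```

There is no single call here that replaces everything inside the P element. Each text node is read and rewritten on its own, while the comment between them is left alone.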

A full-parser mode in the HTML Processor

Until now, the HTML Processor has provided a slightly unexpected interface. Creating a processor involved a call to create_fragment(), which creates something called a “fragment parser.” There was a good reason for this: the fragment parser is the version that’s used in JavaScript when assigning node.innerHTML = newHtml. Unfortunately, the fragment parser cannot ingest an entire document from start to finish. If you worked with the HTML Processor you would have noticed that it gave up as soon as it found the starting <!DOCTYPE html> or <html> tokens.

In WordPress 6.7 the HTML Processor offers create_full_parser() to fill this need. Thanks to completing support for all of the remaining contexts in a document, the HTML Processor can now start with arbitrary HTML and understand how it turns into a DOM tree.

While possibly unexpected, the default implementation with the fragment processor is still the best option in most cases. If you’ve ever had to deal with HTML parsing libraries adding <!DOCTYPE html><html><head> (and other tags) to the start of every snippet of HTML that they process, it’s because these tools are performing full parses and full serialization. When working with anything other than a complete HTML document, the fragment parser remains the most appropriate tool. Process fragments of HTML with the fragment parser, and full documents with the full parser.
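
The choice between the two creator methods can be sketched as follows, assuming WordPress 6.7 or newer is already loaded. The sample inputs are hypothetical.

```php
<?php
// Sketch only: assumes WordPress 6.7 or newer is already loaded.

// A complete document, starting from the doctype: use the full parser.
$document  = '<!DOCTYPE html><html><head><title>Hello</title></head><body><p>Welcome</p></body></html>';
$processor = WP_HTML_Processor::create_full_parser( $document );
if ( null !== $processor && $processor->next_tag( 'TITLE' ) ) {
    echo $processor->get_modifiable_text(), "\n";
}

// A snippet that lives inside some larger page: use the fragment parser.
$snippet   = '<p>Welcome <img src="wave.png"></p>';
$processor = WP_HTML_Processor::create_fragment( $snippet );
if ( null !== $processor && $processor->next_tag( 'IMG' ) ) {
    $processor->set_attribute( 'alt', '' );
    echo $processor->get_updated_html(), "\n";
}
```

Both creator methods can return null, so the sketch checks for that before using the processor.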

Normalization of input documents

With the HTML specification rules in place it’s possible to start building higher-level tools and helpers without having to worry about the complexities of HTML. The first of these is WP_HTML_Processor::normalize(), which gives us the HTML we always wanted. Normalizing is the process of taking any HTML input and returning a fully well-formed equivalent document. Since a picture is worth a thousand words:

$html = '</p><p class=wp class=duplicate><[GETTING[started]]><p><?not a tag?></p done=yet?><script><!--<script></script>';
echo WP_HTML_Processor::normalize( $html );
// <p></p><p class="wp">&lt;[GETTING[started]]></p><p><!--?not a tag?--></p>

This little function is packed with utility. It interprets the input HTML the way a browser would, meaning that it applies all of the complicated parsing rules, and it prints out a new serialization of that parsed structure. Attribute quoting, mismatched tags, missing tags, invalid syntax, unexpected whitespace, comments, and more – it prints out tags without duplicates, where each value is double-quoted, where all relevant syntax characters are encoded, and it even removes partial tokens at the end of the input. That’s right, normalize() won’t leave you hanging – literally!

$html = 'Just a <em>sneaky but <!-- foiled attempt to hide the rest of the page';
echo WP_HTML_Processor::normalize( $html );
// Just a <em>sneaky but </em>

Significantly, there’s something extremely valuable in the output of this function: once HTML has been normalized, it’s considerably safer to process with naive parsing tools. There’s a guarantee on how the text it produces will appear, so string-based and regular-expression-based tools will instantly become more reliable; the edge cases they fail to handle will never make it out of normalize().

CSS selectors and HTML interact via the document mode, which is either in standards mode or quirks mode. In standards mode, CSS selectors match classes and IDs in a byte-for-byte match. Otherwise they are matched in an ASCII case-insensitive manner. Confusing!

Thankfully, the entire HTML API was audited to ensure compliance even in how CSS class selectors work, and the behavior of methods like has_class(), add_class(), and remove_class() will now respect each document’s mode. For almost every case this should be the “standards mode” and so both of the HTML API processors default to it, matching class names in a byte-for-byte manner. When creating an HTML Processor though with the full parser, since it’s able to determine the document mode from the input HTML, it will make the same choice a browser makes, and when appropriate, will match case-insensitively.

Important changes for existing code.

There should be no significant changes required to existing code written with previous versions of the HTML API. While almost all HTML documents are now supported, it’s still required to verify if the HTML Processor bailed while attempting to process unsupported input.
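
A minimal sketch of that verification, assuming WordPress 6.7 or newer is already loaded; the input markup and the link-counting task are hypothetical. It relies on get_last_error() together with get_unsupported_exception(), which is listed among the enhancements in this release.

```php
<?php
// Sketch only: assumes WordPress 6.7 or newer is already loaded.
$html      = '<p>Read the <a href="/docs">docs</a> and the <a href="/faq">FAQ</a>.</p>';
$processor = WP_HTML_Processor::create_fragment( $html );

$link_count = 0;
while ( null !== $processor && $processor->next_tag( 'A' ) ) {
    $link_count++;
}

if ( null === $processor || null !== $processor->get_last_error() ) {
    // The processor could not be created or stopped early, so the count may be incomplete.
    $exception = null === $processor ? null : $processor->get_unsupported_exception();
    $reason    = null === $exception ? 'unknown reason' : $exception->getMessage();
    error_log( "HTML Processor did not finish: {$reason}" );
} else {
    echo "Found {$link_count} links.\n";
}
```

The important part is that a loop ending does not by itself mean the whole input was processed. Code that depends on a complete scan should check for an error before trusting its result.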

There were several changes to the way null bytes and whitespace are handled in specific circumstances. These cases represent invalid sections of documents and normal HTML inputs aren’t affected by them.

Since changes were also made with the handling of ASCII casing in CSS class names, there may be cases where a class name previously matched and now won’t. For example, a browser will not match the CSS class name selector for full-width with an element containing FULL-WIDTH in its class attribute unless the document is parsed in quirks-mode. The HTML API is now processing its inputs in standards mode, matching the behavior of a browser, unless created with a full parser containing the appropriate DOCTYPE to indicate quirks mode. Code that relied on has_class() need not change, but the results in WordPress 6.7 may be different at times because the handling of these class names has become more reliable.
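
As a quick way to observe this on a given install, the following sketch checks the same class name in two casings. It assumes WordPress 6.7 or newer is already loaded, and the markup is hypothetical. Following the description above, the Tag Processor defaults to standards mode, so the exact-case check would be expected to match while the lowercase one would not.

```php
<?php
// Sketch only: assumes WordPress 6.7 or newer is already loaded.
$processor = new WP_HTML_Tag_Processor( '<div class="FULL-WIDTH">Content</div>' );

if ( $processor->next_tag( 'DIV' ) ) {
    $exact_match    = $processor->has_class( 'FULL-WIDTH' );
    $different_case = $processor->has_class( 'full-width' );

    // Compare the two results to see how class names are matched.
    var_dump( $exact_match, $different_case );
}
```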

Enhancements

Low-level parsing updates improved the scanning speed of the Tag Processor by 3.5-7.5%. [Core-61545]
When removing class names, leading whitespace in front of the class name is now removed. [Core-61531]
CSS class names and class selectors behave according to the document mode for the input document, or standards mode if none can be directly inferred. [Core-61531]
Improved spec-compliance by handling remaining HTML tags and insertion modes, including SVG, TABLE, TEMPLATE, and more. [Core-61576]
Added WP_HTML_Processor::create_full_parser() to process entire HTML documents from start to finish. [Core-61576]
Added get_qualified_tag_name() and get_qualified_attribute_name() to return case-variants for certain tags inside foreign content where the HTML rules specify. [Core-61576]
It’s now possible to replace the value of modifiable text. [Core-61617]
Added get_unsupported_exception() to provide more debugging information when the HTML Processor bails on unsupported input. [Core-61646]
set_attribute() returns false when WordPress rejects the update. [Core-61719]
Added subdivide_text_appropriately() to aid in algorithms wanting to skip null byte or whitespace-only prefixes of text nodes. [Core-61974]
Added normalize() and serialize() to return a normalized representation of parsed input HTML. [Core-62036]

Bug Fixes

class_list() now replaces null bytes with the Unicode replacement character. [Core-61531]
The HTML Processor now properly generates implied end tags. Previously, in some cases, it would generate the end tag but leave the element open on the stack of open elements, leading to mistakes in how it handled later tags in the document. [Core-61576]
The HTML Processor now respects the array form of a next_tag() query when specifying the tag_name. [Core-61581]
The HTML API no longer hangs when fed documents ending with open SCRIPT elements and which end in one of the - or < characters. [Core-61810]
The Tag Processor now reports all un-closed funky comments as having paused at incomplete input. Previously, when the document ended immediately after opening the funky comment, e.g. </# with no more characters, it failed to report the incomplete token. [Core-61831]

¹ There are some unusual situations in HTML which are not supported. They relate to what HTML calls “fostering” and “adoption,” where a node’s placement in the document tree does not correspond to where it appears in the HTML source. This arises in roughly 0.5% of all websites on the web and isn’t currently supported in the HTML API. The deprecated PLAINTEXT element is also unsupported.

Props @jonsurrell and @fabiankaegy for review.


Dennis Snell 12:30 am on September 11, 2024
Tags: html-api, html-api-progress-report

Progress Report: HTML API

Quick review

WordPress’s HTML API is developing at a rapid pace. Introduced in WordPress 6.2, and expanded in each release since, the HTML Processor is expected to support all HTML inputs¹ in WordPress 6.7. What does this mean for you and where does development go once “full support” has been reached?

What’s being worked on right now?

Full HTML support

The major internal changes to the HTML Processor in WordPress 6.6 opened the door to finish adding HTML support, meaning things like TEMPLATE tags, tables, SVG, and forms. These are now all supported in the HTML API.

The HTML Processor also gained a new constructor method: create_full_parser(). The full parser expects an HTML document in its entirety, including the doctype declaration, <html> tag, and all. The existing mechanism, create_fragment(), assumes that the given input exists within an existing HTML context, such as inside a <body> tag. If you’ve ever wondered why DOMDocument creates an HTML and HEAD element even when none exists, it’s because it’s performing a full document parse. While both methods have their place, create_fragment() is probably still what you want to use because WordPress rarely presents a complete HTML document to its functions, plugins, and themes.

All supported tags, but with a caveat

There are a few situations that the HTML Processor still won’t support for WordPress 6.7. These are cases where nodes are supposed to be re-parented and they have remained an obstacle for the past few releases. The problem isn’t knowing how to parse these situations, but rather that they imply that parts of a document have changed after the processor has already seen them. They occur most frequently when formatting elements are implicitly closed and then need to be re-opened and when content inside a TABLE element is not in a cell. Such can be seen with the following HTML:

<table><td>1</td>2

This looks innocent enough, but the 2 belongs in front of the TABLE element. When nodes are moved around after the fact like this, it’s possible that the parser might have seen hundreds of kilobytes of content in the meantime. This disconnect between the order of the HTML text and the traversal order in the DOM tree raises complicated questions, including for situations where inner markup is being modified.

This snippet is equivalent to the following HTML.

2<table><tbody><tr><td>1</td></tr></tbody></table>

A cursory analysis of high-ranked domains on the broader internet suggests that around 1% of real web pages involve these and other unsupported situations. Luckily, most content within WordPress and all content generated by frameworks like React avoid these invalid HTML constructions.

Reading and modifying inner HTML

One question that has delayed the introduction of methods like set_inner_html() is how to know whether content is already escaped or not. Can a system be built that allows “trusted” input without introducing opportunities for vulnerabilities? With the expansion of support and the unlocking of new fragment parser contexts, there’s finally a way forward to address this question: do what a browser does.

To set inner content, the HTML Processor will create a new fragment parser using the current element as the context element. It will run the fragment parser to its end and transfer the parsed nodes into the target, just like node.innerHTML = 'string of <code>HTML</code>' in the DOM does. This approach seems obvious from some angles, but implies that the HTML API will be parsing (and possibly re-parsing) any contents that are being set into a target document – this is a slower approach than naïvely slicing in new HTML.

This approach has a fascinating upside though: in addition to ensuring that content is properly escaped, it will normalize the content and prevent the new HTML contents from escaping its outer node. It properly isolates HTML updates. For example, suppose one wants to set the inner HTML for a DIV element.

$processor = WP_HTML_Processor::create_fragment( '<main><div>Content here</div></main>' );
if ( $processor->next_tag( 'DIV' ) ) {
    $processor->set_inner_html( 'What will </div> do?' );
}
echo $processor->get_updated_html();

What should happen with the </div>? If the HTML API were simply replacing text or concatenating HTML, then the update would close the open <div> and the “ do?” content would now appear outside of the intended spot.

// Broken by string concatenation unaware of HTML.
<main><div>What will </div> do?</div></main>

Since the HTML API creates a fragment parser in the appropriate context, however, when it finds the </div> token it will realize that there’s no open <div> and ignore the token (note the extra space below, which in a browser will collapse into a single space). The content will remain inside the original DIV.

<main><div>What will  do?</div></main>

Things can appear fairly wacky at times, but the normalization process will leave the HTML better than when it found it.

if ( $processor->next_tag( 'DIV' ) ) {
    $processor->set_inner_html(
        '<em class=i class="dup"></p class="not-here"></div><li>1<li>2<!--'
    );
}

echo $processor->get_updated_html();

<main><div><em class="i"><p></p><li>1</li><li>2</li></div></main>

Modifying HTML content mostly depends on being able to create a new fragment parser in the context of the current element. Since this is an ongoing project it will not likely make it into WordPress 6.7, but should be available early in the 6.8 development cycle.

Review of all CSS semantics

The job of the HTML API is to know all of the minute details of HTML parsing so that you don’t have to. To that end, all of the CSS-related methods have been audited to ensure that they reliably match the behavior in a browser. This mostly involved examining the impact of “quirks mode,” which determines if class selectors match names in a case sensitive manner or not.

So if you ask whether CSS class names are matched case-sensitively or not then the HTML API will answer based on the HTML document it was provided. The Tag Processor and fragment parser will assume no-quirks mode (standards mode) while the full parser will choose based on the rules in the HTML specification.

What’s coming after reaching full tag support?

Continuing towards 100%

Parsing HTML involves no uncertainty: WordPress needs to understand all possible HTML documents. In the coming releases work will continue on the trickier design details of supporting the final unsupported bits.

In some situations an element is found outside where it should be. For example, a <p> tag might appear after a BODY element is closed, but it belongs inside the BODY element. The HTML Processor could report this at the proper breadcrumbs (because it knows it belongs at HTML > BODY > P), but that would appear as though the parser jumped backwards in the document back into the BODY. A document should only ever open and close the BODY once. Whether it’s an errant <meta> tag, a <script> after the BODY, or an HTML comment after the BODY, these situations account for about half of the situations that aren’t supported.

The other half of the unsupported situations are more complicated. They involve things which are easy in the DOM but hard in HTML. Specifically, there are times where elements get moved up higher in the tree, or earlier, or are duplicated in different parts of the tree. The HTML API knows where these elements should move, but it doesn’t know until after it has parsed and moved on from those places. It may be possible to look ahead in the document and rearrange the virtual sequence in the HTML, but this is a complicated and delicate operation that requires lots of care.

However these situations are handled, they will probably be done so in an opt-in manner. Lookahead might resolve the problem of accurately representing the document, but in some extreme cases, it may be required to fully parse the entire document before returning control to the calling code. Potentially there could be a user-selectable amount of lookahead before the processor gives up, and a small default value could strike a good balance between fully-supported parsing and surprise performance behaviors.

$processor->set_allowable_lookahead( 3 );

These unsupported situations present some design issues that also need to be solved when discussing modification of an HTML document. For example, in cases where formatting elements are reconstructed, if some code adds a new attribute to the real formatting token then the attribute value appears also on all of the recreated versions of it. How should the HTML API know if someone wanted to set an attribute only on the first element or if they wanted to set it also on all reconstructed elements?

In some cases this can all be avoided by using the HTML API to normalize a document before processing, but in other cases these questions need to be explored before making any hasty decisions. Supporting out-of-order elements makes reading and writing inner HTML significantly more complicated because the processor can’t return a simple substring, while auto-fixing could invoke catastrophic backtracking into the parser.

Reading sourced block attributes on the server

Although this feature has also been pushed back a couple of releases, the work continues. There wasn’t sufficient HTML support in WordPress 6.6 to allow reading block attributes reliably enough to be usable. With the practically-complete support in WordPress 6.7 this will no longer be a problem. The primary challenge now is building an interface to parse a CSS selector in a way that properly translates into HTML API code.

When this work was first explored it relied on the Tag Processor and started searching in a top-down manner based on the selector. This isn’t the most efficient way to search, however. In rebuilding the block attribute sourcer for the HTML Processor, the parsing of the CSS selector is going to need to figure out how to search in a bottom-up manner. It’s also going to need to pay attention to things like class names, attribute values, and nth-child counts for relevant tokens while parsing.
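To make the bottom-up idea more concrete, here is a minimal sketch in plain PHP. It assumes a heavily simplified selector made only of tag names joined by descendant combinators, and it assumes the caller already has the list of open elements from the root down to the current node. The function name breadcrumbs_match_selector() and the selector format are hypothetical and are not part of the HTML API; class names, attribute values, and nth-child counts are left out entirely.

```php
<?php
/**
 * Hypothetical helper, not part of the HTML API.
 *
 * Checks whether the node at the end of $breadcrumbs matches a selector made
 * only of tag names and descendant combinators, such as "figure figcaption".
 *
 * @param string[] $breadcrumbs Open elements from the root down to the current node.
 * @param string   $selector    Space-separated tag names.
 */
function breadcrumbs_match_selector( array $breadcrumbs, string $selector ): bool {
	$parts = preg_split( '/\s+/', strtoupper( trim( $selector ) ), -1, PREG_SPLIT_NO_EMPTY );
	if ( empty( $parts ) || empty( $breadcrumbs ) ) {
		return false;
	}

	// Bottom-up: the right-most selector part must match the current node itself.
	$node = strtoupper( array_pop( $breadcrumbs ) );
	if ( array_pop( $parts ) !== $node ) {
		return false;
	}

	// Walk up through the ancestors, consuming the remaining parts right to left.
	$wanted = array_pop( $parts );
	while ( null !== $wanted && ! empty( $breadcrumbs ) ) {
		$ancestor = strtoupper( array_pop( $breadcrumbs ) );
		if ( $ancestor === $wanted ) {
			$wanted = array_pop( $parts );
		}
	}

	// Every selector part found a matching ancestor if nothing is left to find.
	return null === $wanted;
}

var_dump( breadcrumbs_match_selector( array( 'HTML', 'BODY', 'FIGURE', 'FIGCAPTION' ), 'figure figcaption' ) ); // bool(true)
var_dump( breadcrumbs_match_selector( array( 'HTML', 'BODY', 'P' ), 'div p' ) );                                 // bool(false)
```

Starting from the current node means a non-matching tag is rejected after a single comparison, which is the appeal of searching bottom-up while the processor streams through a document.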

WordPress Bits

WordPress 6.7 is not going to see any major work on the Bits proposal, but the work is still being actively pursued. Other developments on templating, custom data types/fields, and block bindings continue to explore the space from the UI and UX angle.

Native PHP support

The HTML API will be significantly faster when it’s available in PHP, written in C instead of user-land PHP. The first proposal to move the HTML API into PHP itself is in the RFC process, adding a decode_html() function corresponding to the recently-introduced WP_HTML_Decoder class.

Additional notes

An HTML API debugger makes exploring HTML fun

@jonsurrell built a WordPress plugin to add a visual HTML API debugger in wp-admin and it’s built with the Interactivity API. This is a fun tool that shows how the HTML API parses a given HTML input. In addition, it provides insight into the parsing mechanics that lead to the final structure. One can learn a lot about HTML parsing just by seeing how different inputs are processed. For example, the “virtual” nodes in the output represent tags that weren’t in the HTML text but were implicitly created based on HTML’s parsing rules.

Following the progress

For updates to the HTML API keep watch here for new posts.

If you want to follow the development work you can review Trac tickets in the HTML API component or start watching new HTML API tickets from the component overview page. If you want to talk details or bugs or applications, check out the #core-html-api channel in WordPress.org Slack.

See also the previous Progress Report and Updates to the HTML API in 6.6 for recently-completed developments.

Acknowledgements

Thanks to @annezazu, @gziolo, @jonsurrell, @westonruter for review and guidance when writing this post.

“All” is mostly correct: while all HTML tags are supported, around 1% of all HTML documents on the internet still run into edge cases that the HTML Processor can’t parse. These cases all deal with content that’s found outside of the place it should be, where the tree-order of the nodes in the DOM is rearranged from the order in which the HTML tags are found. The HTML Processor still aborts parsing when it encounters unsupported markup, so it should continue to be reliable for everything it supports. ↩︎

#html-api, #html-api-progress-report

Dennis Snell 9:48 pm on June 24, 2024
Tags: 6-6 ( 57 ), dev-notes ( 622 ), dev-notes-6.6 ( 19 ), html-api

Updates to the HTML API in 6.6

WordPress 6.6 includes a helpful maintenance release to the HTML API. Included in this work are a few new features and a major improvement to the usability of the HTML Processor. This continues paced development since WordPress 6.5.

A spec-compliant text decoder.
An idealized view of an HTML document.
An optimized class for looking up string tokens and their associated mappings.
Features
Bug Fixes

A spec-compliant text decoder.

This may be surprising, but PHP leaves us hanging if we want to properly read the text content of an HTML document. The html_entity_decode() and htmlspecialchars_decode() functions work somewhat well for pure XML documents, but HTML contains more complicated rules for decoding, rules which change depending on whether the text is found inside an attribute value or normal text. These functions default to XML and HTML4 parsing rules and require manually setting the ENT_HTML5 flag on every invocation (for example, HTML5 redefined two of HTML4’s character references), but are still wrong in many cases.

Luckily you shouldn’t need to know about or call the new decoder, developed in Core-61072. It fits into get_modifiable_text(), further improving the HTML API’s implementation without requiring you to change any of your existing code. With WordPress 6.6 your existing code becomes more reliable for free.

One part of this change you might want to know about is WP_HTML_Decoder::attribute_starts_with(). This new method takes a plaintext prefix and a raw attribute value and indicates if the decoded value starts with the given prefix. This can be invaluable for efficiently detecting strings at the start of an attribute, as some attributes can be extremely large, and if not careful, naive parsers can overlook content hidden behind long slides of zeros.

$html = 'bob&#x00000000000000000003a,';

'bob&#x00000000000000000003a,' === html_entity_decode( $html, ENT_HTML5 );
'bob:,' === WP_HTML_Decoder::decode_attribute( $html );
true    === WP_HTML_Decoder::attribute_starts_with( $html, 'bob:' );

In the case of extremely long attribute values (for example, when pasting content from cloud document editors which send images as data URIs), attribute_starts_with() can avoid megabytes of memory overhead and return much quicker than functions which decode the entire attribute value.

The new text decoder will mostly help ensure that the HTML API remains safe and reliable. There are complicated rules in parsing HTML, so as always, it’s best to leave the low-level work to the HTML API, preferring to call functions like get_attribute() and get_modifiable_text() directly instead of parsing raw text segments.

An idealized view of an HTML document.

The Tag Processor was initially designed to jump from tag to tag, then it was refactored to allow scanning every kind of syntax token in an HTML document. Likewise, the HTML Processor was initially designed to jump from tag to tag, all the while also acknowledging the complex HTML parsing rules. These rules largely exist in the form of a stack machine that tracks which elements are currently open. While the HTML Processor has always maintained this stack, it has never exposed it to calling code.

In WordPress 6.6 the HTML Processor underwent a major internal refactor to report those stack events (when an element opens and when an element closes) rather than when it finds raw text that represents things like tag openers and tag closers. This is a really big change for calling code! Previously, the HTML Processor would track all elements, but only return when a tag or token appeared in an HTML document. For instance, it always knew that <p><p> represents two sibling P elements, but it only presented each opening P tag to calling code. Now, the HTML Processor presents not only the tags and tokens that exist in the raw HTML text, but also the “virtual nodes” that are implied but not textually present.

$processor = WP_HTML_Processor::create_fragment( '<h1>One</h3><h2>Two<p>Three<p>Four<h3>Five' );
while ( $processor->next_token() ) {
	$depth = $processor->get_current_depth();
	$slash = $processor->is_tag_closer() ? '/' : '';
	echo "{$depth}: {$slash}{$processor->get_token_name()}: {$processor->get_modifiable_text()}\n";
}

Let’s compare the output in WordPress 6.5 against the output in WordPress 6.6.

HTML Processor in WordPress 6.5

H1:
#text: One
/H3:
H2:
#text: Two
P:
#text: Three
P:
#text: Four
H3:
#text: Five

HTML Processor in WordPress 6.6

3: H1:
4: #text: One
2: /H1:
3: H2:
4: #text: Two
4: P:
5: #text: Three
4: /P:
4: P:
5: #text: Four
3: /P:
3: /H2:
3: H3:
4: #text: Five
0: /H3:

With the HTML API in WordPress 6.6, it’s possible to treat an HTML document in the idealized way we often think about it: where every tag has an appropriate corresponding closing tag in the right place, and no tags overlap. In WordPress 6.5, only the opening tags which appeared in the document return from next_tag(), and the </h3> closing tag appears as an H3 closing tag, even though the HTML specification indicates that it closes the already-open H1 element. In WordPress 6.6, every opening tag gets its closer, and the </h3> appears as if it were an </h1>. This is because the HTML Processor is exposing the document structure instead of the raw text.

Two new methods make working with HTML even easier:

WP_HTML_Processor->get_current_depth() returns the depth into the HTML structure where the current node is found.
WP_HTML_Processor->expects_closer() indicates if the opened node expects a closing tag or if it will close automatically when proceeding to the next token in the document. For example, text nodes and HTML comments and void elements never expect a closer.

With the help of these methods it’s possible to trivially detect when an element opens and closes, because the HTML Processor guarantees a “perfect” view of the structure.

$processor = WP_HTML_Processor::create_fragment( $block_html );
if ( ! $processor->next_tag( 'DIV' ) ) {
	return $block_html;
}

$depth = $processor->get_current_depth();
while ( $processor->get_current_depth() >= $depth && $processor->next_token() ) {
	// Everything inside of here is inside the open DIV.
}
if ( null === $processor->get_last_error() ) {
	// This is where the DIV closed.
}

An optimized class for looking up string tokens and their associated mappings.

As part of the text decoder work the WP_Token_Map was introduced. This is a handy and efficient utility class for mapping between keys or tokens and their replacements. It’s also handy for efficient set membership; for example, to determine if a given username is found within a set of known usernames.

Read more in the Token Map announcement.

Features

The HTML Processor will now return the depth of the current node in the stack of open elements with get_current_depth(). [58191]
The HTML Processor now includes expects_closer() to indicate whether the currently-matched node expects a closing token. For example, no HTML void element expects a closer, no text node expects a closer, and none of the elements treated specially in the HTML API as atomic elements (such as SCRIPT, STYLE, TITLE, or TEXTAREA) expect a closer. [58192]
The WP_HTML_Decoder class can take a raw HTML attribute or text value and decode it, assuming that the source and destination are UTF-8. The HTML API now uses this instead of html_entity_decode() for more reliable parsing of HTML text content. [58281]
The HTML Processor now visits all real and virtual nodes: not only those which are present in the text of the HTML, but also those which are implied by what’s there or not there. [58304]

Bug Fixes

Funky-comments whose contents are only a single character are now properly recognized. Previously the parser would get off track in these situations, consuming text until the next > after the funky comment. [58040]
The HTML Processor now respects the class_name argument if passed to next_tag(). Formerly it was overlooking this constraint. [58190]
The Tag Processor was incorrectly tracking the position of the last character in some tokens, internally and when bookmarking. While this bug did not affect the operation of the Tag Processor, it has been fixed so that future code which might rely upon it will work properly. [58233]
When subclassing WP_HTML_Processor the ::create_fragment() method will return the subclass instance instead of a WP_HTML_Processor instance. [58365]

Props to @gziolo, @jonsurrell, @juanmaguitar, and @westonruter for reviewing this post and providing helpful feedback.

#6-6, #dev-notes, #dev-notes-6-6, #html-api

Dennis Snell 7:04 pm on May 7, 2024
Tags: html-api

Progress Report: HTML API

The HTML API continues to receive paced and steady development. The Tag Processor was introduced in WordPress 6.2, the HTML Processor in WordPress 6.4, and the ability to traverse all syntax tokens in a document was added in WordPress 6.5. What’s been happening since then and what’s in store for the coming releases?

Quick Review

The last Progress Report explained what the HTML API provides and why it was introduced. The HTML API is being designed to provide a reliable, safe, convenient, and efficient means for understanding and modifying HTML from the server. The driving motivation is to support all server-side needs when working with HTML, while the pace of development is governed by moving as fast as possible without compromising safety and reliability.

While the eventual goal is to provide an assortment of DOM-like methods (for example, set_inner_html()), work continues to ensure that the foundation on which these are built is steady, so that you can trust it from day one.

What’s being worked on right now?

Finish the HTML Processor.

The largest remaining goal is adding full support to the HTML Processor. While the Tag Processor understands each tag, each comment, each text node in isolation and can even change a tag’s attributes, it’s unaware of where one element begins and ends. HTML is, from a certain perspective, a shorthand script for a full DOM document, and the tags which are or are not present may not correspond to what a browser creates when parsing them. The HTML Processor takes the stream of tokens produced by the Tag Processor and builds a map of how the tags relate to elements.

<p>This is <b>bold.<p>This is also bold.</p>

For example, a <p> tag can appear after another <p> tag but before a </p> has been found. This is often referred to as having missing or overlapping tags. However, the HTML specification’s rules are clear in this situation: the second <p> tag implies the closing of the first one as well as any other elements which were already open, and it starts a new P element as a sibling of the first. The HTML Processor knows this so you don’t have to.

Unfortunately there are extensive rules and the HTML Processor doesn’t currently understand them all. Thankfully, if it doesn’t understand a situation it finds itself in, then it will “bail” by returning false and storing an error, retrievable with get_last_error().

Represent virtual nodes.

The past couple releases saw major progression in supporting more HTML elements, but one obstacle has stood in the way: so-called virtual nodes that don’t exist in the HTML text itself.

<li><p>One</p></p><p>Two</p></li>

Given the snippet of HTML above, what would you guess is the innerText of the second P element (p:nth-child(2))? You probably already know this is a trick question: it’s empty!

The rules of HTML dictate that when encountering an unexpected closing </p> tag for which there’s no corresponding opening tag, an empty P element without any attributes is created. Most tags don’t do this, but P does. If someone is using the HTML Processor to find the second P element in a document, it shouldn’t lead them astray, but how should it represent tags which in a sense don’t actually exist?

This problem has held up a number of the remaining tags because many of them lead to situations where not only one element appears, but possibly many elements are created between a very simple-looking boundary.

The HTML Processor is therefore undergoing an internal refactor to change the way it represents movement through the document. It will pause at these virtual nodes and represent them so that calling code can find them, though they will remain read-only for now. There’s a spectacular side-effect to this change though: the HTML Processor will be presenting a view of HTML in an idealized form. Because virtual nodes appear not only on openings, but also on closings, the processor will present the document as if it were entirely normative. There are no missing, unexpected, or overlapping tags; every opening tag expecting a closer will find one in the right place. It will be possible through the HTML Processor interface to reliably assume basic structure of HTML: that’s right, all of those “what if” questions about quirky or mangled HTML are irrelevant, because the HTML Processor thinks in terms of HTML and not in terms of strings.

<a>link<a>link</a><ul><li><p><b>One<li>Two</b><li><p>Three</li></ul>Four

This snippet above looks broken, and very few parsers will know how to handle it the way a browser does. Consider, however, the sequence of tags or tokens that the HTML Processor will find (in the next snippet, imagine that the loop is printing a + for an opening tag and a - for a closing tag):

while ( $html_processor->next_token() ) { … }

[
  '+A', '#text', '-A',
  '+A', '#text', '-A',
  '+UL',
    '+LI', '+P', '+B', '#text', '-B', '-P', '-LI',
    '+LI', '+B', '#text', '-B', '-LI',
    '+LI', '+P', '#text', '-P', '-LI',
  '-UL',
  '#text'
]
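Because every opener in that list is guaranteed a closer, the nesting can be recovered with nothing more than a counter. The following is a small sketch in plain PHP that indents the markers from the list above; the function name outline_tokens() is hypothetical and the code only works on this '+TAG' / '-TAG' / '#text' notation, not on a real processor.

```php
<?php
/**
 * Hypothetical helper: turns a flat list of '+TAG', '-TAG', and '#text'
 * markers into an indented outline by tracking the current depth.
 *
 * @param string[] $tokens Markers in document order.
 */
function outline_tokens( array $tokens ): string {
	$depth = 0;
	$lines = array();

	foreach ( $tokens as $token ) {
		if ( '-' === $token[0] ) {
			// A closer returns to the parent's level and prints nothing.
			--$depth;
			continue;
		}

		$lines[] = str_repeat( '  ', $depth ) . ltrim( $token, '+' );

		if ( '+' === $token[0] ) {
			// Everything up to the matching closer is nested one level deeper.
			++$depth;
		}
	}

	return implode( "\n", $lines );
}

$tokens = array(
	'+A', '#text', '-A',
	'+A', '#text', '-A',
	'+UL',
	'+LI', '+P', '+B', '#text', '-B', '-P', '-LI',
	'+LI', '+B', '#text', '-B', '-LI',
	'+LI', '+P', '#text', '-P', '-LI',
	'-UL',
	'#text',
);

echo outline_tokens( $tokens ), "\n";
```

The outline shows two sibling A elements followed by a UL with three LI children, and the second LI contains a B element that was never written in the source text for that list item.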

While this may appear confusing, it’s worth spending some time pondering. The HTML Processor found many more tags than actually exist in the text of the HTML, because it knows it needs to create them as it steps through the document. There’s a really incredible implication here: traversing inside an element is trivial. This means that finding inner content and matching tags can be done the way we often expect or want it to, and that it’s a reliable means to do so. Every <a> opening tag will be followed by a </a> no matter what else is in the HTML. Even when dealing with nested tags and sibling tags, finding the end of an element is as simple as looking for the first closing tag of the same name at the same depth – it’s guaranteed to exist!

$div_depth = $processor->get_current_depth();
// Find where the DIV closes.
while ( $processor->next_token() ) {
    if (
        'DIV' === $processor->get_tag() &&
        $processor->is_tag_closer() &&
        $processor->get_current_depth() < $div_depth
    ) {
        // The depth dropped below the DIV's own depth: this closer ends it.
        break;
    }
    // Whatever this is, it's inside the DIV.
}

Eliminate text decoding problems.

Like most things in HTML, proper parsing is more complicated than it would appear at first. When decoding text content from the HTML, this is worse than it seems, because PHP currently lacks the ability to properly decode character references. Character references are what people often call “entities” (a term borrowed from XML). They start with & and what follows is either a number (in decimal or hexadecimal) that represents a Unicode code point, or a name found in a lookup table to map to a specific code point or sequence of code points (for example, &NotEqualTilde; maps to U+2242 U+0338 producing a not-approximately-equal sign ≂̸).

HTML incorporates special rules when decoding these character references to maintain backwards compatibility with common practices that predate HTML5. These mostly revolve around what happens when the final semicolon ; is missing. In many cases the ; isn’t necessary while in other cases it is. Notably, when the semicolon is optional, there are additional restrictions inside attribute values when the reference name could be ambiguous. This rule preserves cases when URL query arguments aren’t properly encoded, as every & in a URL ought to be encoded as &amp;, but often isn’t.

PHP’s html_entity_decode() currently lacks full support:

There are 1,730 named character references it’s unaware of.
&lang; and &rang; are improperly decoded as if decoding HTML4 instead of HTML5.
It doesn’t provide a way to govern whether the ambiguous ampersand rule should be applied.
It’s unable to decode named character references without a trailing ;.
It doesn’t handle border cases where the character references end a string, such as at the end of an attribute value or right before another tag appears (at the end of a text node).
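The missing-semicolon limitation can be observed directly with nothing but built-in PHP. This is a minimal sketch; the ENT_QUOTES | ENT_HTML5 flags are an illustrative choice, and the comparisons only show what html_entity_decode() returns, not what a browser would produce for the same input.

```php
<?php
$flags = ENT_QUOTES | ENT_HTML5;

// With the trailing semicolon the named reference is decoded.
var_dump( '&' === html_entity_decode( '&amp;', $flags ) );             // bool(true)

// Without it the text comes back untouched, even though HTML allows
// "&amp" to be decoded without its semicolon in normal text.
var_dump( '&amp' === html_entity_decode( '&amp', $flags ) );           // bool(true)

// Numeric references are also left alone when the semicolon is missing.
var_dump( 'bob&#x3a,' === html_entity_decode( 'bob&#x3a,', $flags ) ); // bool(true)
```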

Further, HTML applies special rules to a range of code points when decoded from numeric character references. Many of you have probably seen cases where “smart quotes” turn into junk when rendered. This is because HTML may store certain code points as if they represented the Windows-1252 encoding, but only from numeric character references. This transformation is not applied in html_entity_decode() and the references are left intact.
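As a rough illustration of that Windows-1252 rule, the sketch below remaps a handful of numeric code points the way the HTML specification's replacement table does. The constant and function names are hypothetical, only a few entries of the table are included, and this is not how the HTML API itself is implemented.

```php
<?php
/**
 * A few entries from the HTML specification's replacement table for numeric
 * character references in the 0x80 to 0x9F range. The full table is longer.
 */
const C1_REPLACEMENTS = array(
	0x80 => 0x20AC, // EURO SIGN
	0x85 => 0x2026, // HORIZONTAL ELLIPSIS
	0x91 => 0x2018, // LEFT SINGLE QUOTATION MARK
	0x92 => 0x2019, // RIGHT SINGLE QUOTATION MARK
	0x93 => 0x201C, // LEFT DOUBLE QUOTATION MARK
	0x94 => 0x201D, // RIGHT DOUBLE QUOTATION MARK
	0x96 => 0x2013, // EN DASH
	0x97 => 0x2014, // EM DASH
);

/**
 * Hypothetical helper: returns the code point a numeric character reference
 * should produce, applying the replacement when one is listed above.
 */
function remap_numeric_reference( int $code_point ): int {
	return C1_REPLACEMENTS[ $code_point ] ?? $code_point;
}

// "&#x93;" names code point 0x93, which HTML treats as a left smart quote.
printf( "U+%04X\n", remap_numeric_reference( 0x93 ) ); // U+201C

// Code points outside the table pass through unchanged.
printf( "U+%04X\n", remap_numeric_reference( 0x41 ) ); // U+0041
```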

The HTML API needs a mechanism for properly decoding text content by default, and that will likely appear in the form of two new methods:

WP_HTML_Decoder::decode_text_node() for decoding text found in normal markup (#text nodes).
WP_HTML_Decoder::decode_attribute() for decoding text found inside attribute values.

There is nothing special about these methods other than they should be a reliable mechanism for reading (in UTF-8) the actual text a browser would read for the given HTML. The HTML API knows the intricate details of HTML so you don’t have to.

For WordPress 6.6 these are the two primary changes planned: ensure that the HTML Processor visits all virtual nodes, and properly decode all text. For the most part that means no major changes to the interface; everything is in a sense a bug-fix to more closely conform the implementation to its design.

What’s coming after these internal improvements?

Reading and modifying sourced block attributes.

There’s a very exciting development that is resurfacing that started over a year ago: reading block attributes from the server, at least for blocks with a block.json file. The ability to read the “sourced attributes” for a block was one of the driving reasons that work on the HTML API progressed beyond the Tag Processor. It mainly comprises parsing a CSS selector, finding a matching location within an HTML document, and then reading an attribute, inner text, or inner HTML.

While the initial prototypes were encouraging, it was clear that it would be important to truly understand HTML structure in order to do this right. The concept of balanced tags¹ simply isn’t a very useful model for understanding HTML. Now that the HTML Processor is so much further along, however, rebuilding the attribute sourcer is turning out to be not only more reliable, but considerably simpler too!
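To see why the balanced-tags model breaks down on real HTML, consider this deliberately naive sketch in plain PHP. It counts opening and closing tags with a regular expression and ignores HTML's parsing rules entirely; the function name naive_find_closer() is hypothetical and the code is meant as a counter-example, not as something to use.

```php
<?php
/**
 * Naive "balanced tags" guess. Returns the byte offset of the closing tag
 * that a simple depth counter pairs with the first opening $tag, or null
 * when the counter never returns to zero.
 */
function naive_find_closer( string $html, string $tag ): ?int {
	$pattern = '~<(/?)' . preg_quote( $tag, '~' ) . '\b[^>]*>~i';
	preg_match_all( $pattern, $html, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE );

	$depth = 0;
	foreach ( $matches as $match ) {
		$is_closer = '/' === $match[1][0];
		$depth    += $is_closer ? -1 : 1;

		if ( $is_closer && 0 === $depth ) {
			return $match[0][1];
		}
	}

	return null;
}

// Well-formed input: the guess finds the first </p>.
var_dump( naive_find_closer( '<p>One</p><p>Two</p>', 'p' ) ); // int(6)

// In HTML the second <p> closes the first P element, but the counter treats
// it as nesting and never finds where the first paragraph ends.
var_dump( naive_find_closer( '<p>One<p>Two</p>', 'p' ) );     // NULL
```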

It’s our hope that we have a working system ready by the time that WordPress 6.6 is released so that we can test it for the whole 6.7 release cycle. This is useful not only for individual render functions, but the Block Bindings project needs to be able to read and modify these attributes as well. By ensuring that the system is robust to handle whatever HTML comes its way, we can make Block Bindings work for any block attribute.

Full HTML support.

With the issue of virtual nodes taken care of it is possible to push forward on supporting even more HTML tags and situations. One issue will remain, which is a tricky situation where something called fostering occurs. This can happen, for instance, when tags are found where they shouldn’t be, such as a <div> tag inside a TABLE element but not inside a TD or TH. Strangely enough, that DIV element will be moved up in front of the TABLE element. This implies that there’s a kind of retroactive change in the document after we’ve visited a location in it. There is currently no clear way to represent this situation or communicate it to calling code, so the HTML Processor will continue to bail in these rare scenarios.

Apart from that the rest of the HTML support is large and tedious but straightforward. Expect that in WordPress 6.7 you will be able to send it almost any HTML and it will be able to fully understand the document.

Little bits of semantic meaning.

With the introduction of the block editor, WordPress largely lost Shortcodes, which were the go-to way of incorporating small tidbits of external content or meaning into a post. Shortcodes had their shortcomings, but they also had value. For more than a couple years we’ve been discussing various approaches to bringing shortcodes back: safely and without the most significant drawbacks (breaking HTML, taking over layout, ambiguous nesting rules, introducing a full page of content, etc…). The HTML API changed the game for all of these explorations because it offers a way to build a context-aware auto-escaping templating engine that can power the next generation of Shortcodes, what we have lately been calling “Bits” (Blocks are big, and Bits are small 😉).

While it’s currently possible to add a post author block into a post template, or use a block binding to replace a paragraph’s content with a custom field, Bits will open up new opportunities to place these snippets of external content anywhere you want them, even inside the middle of a paragraph or image caption!

In WordPress 6.6 look for explorations in the editor for how to enter and configure Bits. Work has already started in WordPress’s backend to ensure that the Bit syntax is preserved through post saves and renders. This is going to be a large project with many different systems working together, so it won’t likely be available anytime soon, but many of the independent pieces will be appearing in the next couple releases for those who want to explore the foundations of how they work.

Following the progress

For updates to the HTML API keep watch here for new posts.

Acknowledgements

Thanks @gziolo, @jonsurrell, and @westonruter for helping create and edit this post.

#html-api

“Balanced tags” is a best-effort guess at HTML structure based on scanning from an element’s opening tag until an appropriate closing tag is found. Opening tags along the way will increase a depth just as closing tags decrease the depth. In the idealized view of HTML that the HTML Processor provides this guess is reliable, but without that, it’s very difficult in practice with real HTML documents to reliably understand where an element opens and closes. ↩︎

Birgit Pauli-Haack 3:17 pm on March 7, 2023
Tags: 6.2 ( 90 ), dev-notes ( 622 ), dev-notes-6.2 ( 19 ), html-api

Introducing the HTML API in WordPress 6.2

This post was co-authored by Adam Zielinski @zieladam and Dennis Snell @dmsnell

WordPress 6.2 introduces WP_HTML_Tag_Processor – a tool for block authors to adjust HTML tag attributes in block markup within PHP. It’s the first component in a new HTML processing API.

Updating HTML in WordPress has always required using uncomfortable tools. Regular expressions are difficult and prone to all kinds of errors. DOMDocument is resource-heavy, fails to handle modern HTML correctly, and isn’t available on many hosting platforms.

WP_HTML_Tag_Processor takes the first step towards bridging this gap.

The Tag Processor can reliably update HTML attributes

The Tag Processor finds specific tags and can change their attributes. Here’s an example setting an alt attribute on the first img tag within a block of HTML.

$html = '<img src="/husky.jpg">';

$p = new WP_HTML_Tag_Processor( $html );

if ( $p->next_tag() ) {
    $p->set_attribute( 'alt', 'Husky in the snow' );
}

echo $p->get_updated_html();

// Output:
// <img alt="Husky in the snow" src="/husky.jpg">

The next_tag() method moves to the next available tag in the HTML, but it also accepts a tag name, a CSS class, or both in order to find specific tags. According to the HTML specification, tag and attribute names aren’t case-sensitive, but CSS class names are.

// Tag names match regardless of case; class names must match exactly.
if ( $p->next_tag( array( 'tag_name' => 'DIV', 'class_name' => 'block-GROUP' ) ) ) {
    $p->remove_class( 'block-GROUP' );
    $p->add_class( 'wp-block-group' );
}
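To visit every matching tag rather than only the first, next_tag() can drive a loop. The following is a minimal sketch, assuming it runs inside WordPress 6.2 or later where WP_HTML_Tag_Processor is loaded; the sample markup and the choice of the loading attribute are illustrative assumptions.

```php
$html = '<p><img src="/a.jpg"><IMG src="/b.jpg" class="hero"></p>';

$p = new WP_HTML_Tag_Processor( $html );

// A string argument is shorthand for matching on the tag name alone.
// Tag names are not case-sensitive, so both images are visited.
while ( $p->next_tag( 'img' ) ) {
    $p->set_attribute( 'loading', 'lazy' );
}

echo $p->get_updated_html();
```

Each call to next_tag() continues from the previous match, so the loop ends once no further img tags remain in the document.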

Operations are safe by default. You can:

remove an attribute without first checking if it exists,
add a CSS class which might already be there,
set an attribute value without ensuring that it’s not duplicating an existing one.
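The three cases above can be sketched in a few lines. This is an illustration only, assuming WordPress 6.2 or later is loaded; the markup and the attribute names (data-missing, id) are made up for the example.

```php
$p = new WP_HTML_Tag_Processor( '<div class="wp-block-group" id="old"></div>' );

if ( $p->next_tag() ) {
    // The attribute does not exist, so there is nothing to remove.
    $p->remove_attribute( 'data-missing' );

    // The class is already present, so it is not added a second time.
    $p->add_class( 'wp-block-group' );

    // The existing id is replaced rather than duplicated.
    $p->set_attribute( 'id', 'new' );
}

echo $p->get_updated_html();
```

None of these calls needs a preceding existence check; the processor works out what, if anything, has to change in the markup.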

You no longer need to worry that your code will mistake content inside a <textarea>, an attribute value, or an HTML comment for a real tag.

The Tag Processor conforms to the HTML5 specification, so you don’t have to. It automatically escapes and decodes values where necessary and knows how to handle malformed markup.

$ugly_html = <<<HTML
<textarea title='<div> elements are semantically void'>
    <div><!--<div attr-->="</div>"></div>">
</textarea>
<div></div>
HTML;

$p = new WP_HTML_Tag_Processor( $ugly_html );
if ( $p->next_tag( 'div' ) ) {
    $p->add_class( 'bold' );
}

echo $p->get_updated_html();
// Output:
// <textarea title='<div> elements are semantically void'>
//     <div><!--<div attr-->="</div>"></div>">
// </textarea>
// <div class="bold"></div>
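Escaping and decoding of attribute values can be illustrated with a round trip. This is a hedged sketch, assuming WordPress 6.2 or later is loaded; the link markup and the title text are invented for the example, and the exact escaped form in the output may vary between WordPress versions.

```php
$p = new WP_HTML_Tag_Processor( '<a title="Fish &amp; Chips">menu</a>' );

if ( $p->next_tag( 'a' ) ) {
    // Reading returns the decoded value rather than the raw markup.
    $title = $p->get_attribute( 'title' );

    // Writing escapes the value so the quotes cannot end the attribute early.
    $p->set_attribute( 'title', $title . ' "to go"' );
}

$updated = $p->get_updated_html();

// Parsing the updated markup again yields the plain, unescaped text.
$check = new WP_HTML_Tag_Processor( $updated );
$check->next_tag( 'a' );
echo $check->get_attribute( 'title' );
```

Callers pass and receive plain text on both sides; the encoded form only exists in the HTML string itself.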

The Tag Processor operates fast enough to run in critical hot code paths and incurs almost no memory overhead. In WordPress 6.2 it replaces bug-prone code relying on regular expressions and string-searching to perform similar updates.

For more advanced use of the Tag Processor, read through the extensive in-class documentation and learn how to…

…set bookmarks to re-visit parts of the document which have already been scanned and modified.
…visit closing tags like </div> in addition to the opening tags.
…run advanced and stateful queries by visiting every tag in a document.
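As a rough sketch of how these three features fit together, the example below bookmarks a list's opening tag, visits every tag (closers included) to count list items, and then returns to the bookmark to record the count. It assumes WordPress 6.2 or later is loaded; the markup, the bookmark name list, and the data-item-count attribute are illustrative choices.

```php
$p = new WP_HTML_Tag_Processor( '<ul><li>One</li><li>Two</li></ul>' );

// Remember the opening <ul> so it can be revisited after scanning ahead.
$p->next_tag( 'ul' );
$p->set_bookmark( 'list' );

// Visit every tag, including closers such as </li>, and keep a running count.
$items = 0;
while ( $p->next_tag( array( 'tag_closers' => 'visit' ) ) ) {
    if ( 'LI' === $p->get_tag() && ! $p->is_tag_closer() ) {
        ++$items;
    }
}

// Return to the bookmarked tag and store what was found.
$p->seek( 'list' );
$p->set_attribute( 'data-item-count', (string) $items );
$p->release_bookmark( 'list' );

echo $p->get_updated_html();
```

Releasing a bookmark once it is no longer needed keeps the processor from tracking positions unnecessarily.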

Further considerations

There are many things the Tag Processor doesn’t do: it doesn’t build a DOM document tree, find nested tags, or update a tag’s inner HTML or inner text. Work on new HTML-related APIs continues, and a future WordPress release will build upon this work to enable accessing all of a block’s attributes from within PHP (if the block supplies a block.json file), finding tags using a CSS selector, and modifying the HTML structure with new tags, removed tags, and updated inner markup.

You can keep up with further development via this overview issue in the Gutenberg GitHub repository.


