Progress Report: HTML API

The Tag Processor was introduced into Core with 6.2, but development on the HTML API is continuing as we build a more complete system for interacting with HTML. This post covers why the HTML API was first introduced, how to take advantage of it today, and what’s coming with the second phase, the HTML Processor.

Glossary of Terms

  • HTML API: the new PHP subsystem in WordPress designed to reliably and efficiently interact with HTML and HTML-related needs.
  • HTML Tag Processor: the first (and lower-level) interface in the HTML API. This class scans through an HTML document from tag to tag and makes it possible to read and modify attributes on existing HTML tags. Because it’s only tokenizing the input stream, it doesn’t understand HTML structure and can’t match starting and ending tags or know if something is a child element of another.
  • HTML Processor: the second (and higher-level) interface in the HTML API. This class makes it possible to interact with nested HTML structure, insert and remove tags, change inner and outer contents, and perform complicated and CSS-based queries. Because it knows the semantic rules for handling HTML it’s able to properly cope with so-called “malformed markup” including unclosed and overlapping tags.

Why was the HTML API introduced?

The Tag Processor is the culmination of many years of struggling to properly handle HTML within WordPress and within PHP. It follows perennial issues surrounding corruption of HTML, security vulnerabilities, and escaping problems. The HTML API has been designed to provide a convenient and reliable way to interact with HTML.

As many of you are already aware, the existing tools (mostly regular expressions and DOMDocument) are insufficient for conveniently and reliably querying and modifying HTML. In addition to carrying a heavy runtime cost and raising availability issues, DOMDocument misinterprets HTML in many common scenarios. For example, it thinks tags can exist inside a TEXTAREA, which they cannot, and it treats <3 as a tag rather than as the heart it is. There are too many cases to enumerate here.

DOMDocument was never the escape hatch.

Consider the contents of a function attempting to use DOMDocument appropriately.

// Make sure we parse in HTML mode and not XML mode.
$document = ( new DOMImplementation() )->createDocument( null, 'html' );
// Disable error reporting because it spits out a lot of noise.
libxml_use_internal_errors( true );
// Force HTML5 mode and UTF-8 to avoid encoding troubles.
$preamble = '<!DOCTYPE html><meta charset="utf-8"><body>';
$document->loadHTML( $preamble . $html );
// process the document somehow
// Build the output while avoiding the addition of extra tags that weren't in the source.
$output = '';
$body   = $document->getElementsByTagName( 'body' )->item( 0 );
foreach ( $body->childNodes as $node ) {
    $output .= $document->saveHTML( $node );
}
return $output;

This code looks like it should be solid and should support everything it needs to, but in fact it still fails on a variety of common kinds of input.

<script>
<!-- console.log( "<script>This is just text</script>" ); -->
</script>

In this case DOMDocument converts the --> into --&gt;, which leaves the SCRIPT element unclosed, and this will cause the following HTML to be interpreted as JavaScript instead of as HTML.

<textarea>This is not an <img src="dangerous.pdf"> because it's inside a </textarea>

Here DOMDocument finds an IMG element inside the contents of the TEXTAREA and will visit it. By interpreting these contents it opens up opportunities for user input to change the parse of the document and lead to unexpected vulnerabilities.

<a title="A &notin B">A &notin B</a>

In both the title attribute and the link text the named character reference is invalid. However, in the browser the title attribute evaluates to A &notin B while the link text evaluates to A ¬in B. DOMDocument breaks the markup by rewriting the link text to the following, which preserves the & in the rendered output instead of interpreting it as the “not in” mathematical symbol.

<a title="A &notin B">A &amp;notin B</a>

At face value these may seem like rare exceptions, but many inputs contain these kinds of issues and others that translate into corrupted rendering and worse. Each of these failures presents an opportunity for someone to construct malicious inputs which exploit them to inject unwanted behaviors into your site.

After a sequence of crashing bugs in Gutenberg related to HTML parsing, a couple of developers reached an exasperated tipping point (that started brewing at least as far back as 2016 with the introduction of srcset support) and determined to build a reliable, performant, and convenient API within WordPress for working with HTML. Typical regexp-based approaches are fast when examined in isolation but are very unreliable; attempts to improve their reliability end up in a never-ending cycle of growing complexity that sacrifices readability and eventually some of the same runtime performance that made them compelling in the first place.

Robust regular expressions are extremely confusing and still unreliable.

Given the following regex-based code, try to determine what the goal of the snippet is, what it’s trying to do.

$class_name = gutenberg_get_elements_class_name( $block );
// Like the layout hook this assumes the hook only applies to blocks with a single wrapper.
// Retrieve the opening tag of the first HTML element.
$html_element_matches = array();
preg_match( '/<[^>]+>/', $block_content, $html_element_matches, PREG_OFFSET_CAPTURE );
$first_element = $html_element_matches[0][0];
// If the first HTML element has a class attribute just add the new class
// as we do on layout and duotone.
if ( str_contains( $first_element, 'class="' ) ) {
	$content = preg_replace(
		'/' . preg_quote( 'class="', '/' ) . '/',
		'class="' . $class_name . ' ',
		$block_content,
		1
	);
} else {
	// If the first HTML element has no class attribute we should inject the attribute before the attribute at the end.
	$first_element_offset = $html_element_matches[0][1];
	$content              = substr_replace( $block_content, ' class="' . $class_name . '"', $first_element_offset + strlen( $first_element ) - 1, 0 );
}
return $content;

If you think you can quickly and clearly understand this, did you notice that it breaks if the class attribute is quoted with a single quote ' or without quotes? Did you notice that if there’s an attribute containing > that it breaks apart the tag? Did you notice that it makes the wrong replacement if we find something like <img data-custom-class="zebra" src="zebra.jpg" class="full-width">? Did you notice that it’s calling preg_quote() needlessly? Did you notice that it removes all the existing classes if there’s extra but allowable space in class = "all my existing classes"? Did you notice that it won’t find the attribute if someone spells it in uppercase with CLASS="classes"? Are you confident that you could find this code as the source of a problem that mangles an HTML page?
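Two of these failures are easy to reproduce with plain PHP and no WordPress at all. The following is a minimal sketch that isolates the two regular expressions from the snippet above; the input strings and the new-class name are made up for illustration.

```php
<?php
// Hypothetical inputs, chosen to trigger two of the failures described above.
$with_gt   = '<div title="a > b" id="wrapper">text</div>';
$with_data = '<img data-custom-class="zebra" src="zebra.jpg" class="full-width">';

// 1. The "first tag" pattern stops at the ">" inside the quoted title attribute,
//    so the captured "tag" is cut off in the middle of an attribute value.
preg_match( '/<[^>]+>/', $with_gt, $matches );
$first_element = $matches[0];
echo $first_element, "\n";
// <div title="a >

// 2. Replacing the first occurrence of 'class="' lands inside data-custom-class
//    and leaves the real class attribute untouched.
$replaced = preg_replace( '/class="/', 'class="new-class ', $with_data, 1 );
echo $replaced, "\n";
// <img data-custom-class="new-class zebra" src="zebra.jpg" class="full-width">
```

Neither result raises an error or a warning, which is part of why this kind of corruption is hard to trace back to its source.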

The above is a typical regular-expression-plus-string-replace approach and it gets complicated.

The following, on the other hand, was an actual update inside of Gutenberg to replace the above with the Tag Processor from the HTML API.

// Add the class name to the first element, presuming it's the wrapper, if it exists.
$tags = new WP_HTML_Tag_Processor( $block_content );
if ( $tags->next_tag() ) {
	$tags->add_class( gutenberg_get_elements_class_name( $block ) );
}
return $tags->get_updated_html();

Which would you rather work with?

One may wonder, “why is WordPress only now getting a reliable HTML parser?” or “why hasn’t anyone built this before?” and the answer is probably related to how complicated and major a task it is to build the required safety and security into the system (thanks Matt and Automattic for sponsoring this massive investment into WordPress).

There’s another reason though that isn’t as obvious: it’s easy to push the proverbial cart before the horse with HTML. Tools like DOMDocument expose a full DOM interface, and given that, it seems obvious that it’s the right tool for the job. Surprisingly, this is a rare need and the interface is a poor match for the kind of processing typically done on the server. Typical changes involve things like scanning the HTML for specific content, modifying attributes, or making small changes to the HTML structure. In addition, the server needs to be lean in its memory use and as fast as can be. A streaming parser is more appropriate here but the relative availability of DOMDocument and similar parsers has given the impression that a suitable tool is at hand. It distracted us from simpler approaches that work better for WordPress.

How can we use the HTML API today?

It’s easy to get started with the HTML API today by using the Tag Processor and reading through its documentation. The Tag Processor is useful for reading and writing attributes on HTML tags.

$processor = new WP_HTML_Tag_Processor( $html );

// For all images in a document:
while ( $processor->next_tag( 'img' ) ) {
    // If they already have a non-empty ALT text, leave them alone.
    $alt = $processor->get_attribute( 'alt' );
    if ( null !== $alt && ! empty( $alt ) ) {
        continue;
    }

    // Otherwise try to generate a description for the image
    // from an AI source and replace the ALT with that description.
    $src = $processor->get_attribute( 'src' );
    $alt = ai_generate_image_alt( $src );
    if ( null !== $alt ) {
        $processor->set_attribute( 'alt', $alt );
    }
}

return $processor->get_updated_html();

Here are three quick things to know about the HTML API:

  • The Tag Processor is the first in a series of new interfaces to work with HTML. It’s a low-level API on top of which more convenient functions are being built. If you want to use it today, the “sweet spot” is replacing existing regular expressions and invocations of DOMDocument where the goal is to modify HTML attributes. More complicated work is possible, but that work will be easier in the future with the expansion of the HTML API.
  • Updates through the HTML API are properly handled by default. Because it’s aware of the context in which it’s operating, you don’t need to call esc_attr(), esc_html(), or other escaping (or unescaping) functions. The inputs are plain PHP values while the outputs will be the HTML which faithfully represents those values in the browser. get_attribute() will, for example, return " (a double quote) whether the HTML attribute contains "&quot;" or '"'.
  • Today it’s only possible to modify HTML attributes: it’s not possible to insert tags, wrap a tag, replace inner HTML, etc. The HTML API work prioritizes reliability and trust over power and flexibility, so the good things to come will appear in good time as we’re able to build it and maintain full confidence in the system.
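As a rough illustration of the second point, here is a minimal sketch of setting attributes without any manual escaping. The function name is hypothetical, and the sketch assumes it runs inside WordPress 6.2 or newer; the class_exists() guard is only there so the file can be loaded outside WordPress, in which case it returns its input unchanged.

```php
<?php
/**
 * Hypothetical helper: marks the first link in $html as opening in a new tab.
 * Assumes WordPress 6.2+, where WP_HTML_Tag_Processor is available.
 */
function example_open_first_link_in_new_tab( $html ) {
    if ( ! class_exists( 'WP_HTML_Tag_Processor' ) ) {
        // Outside WordPress there is nothing safe to do; return the input untouched.
        return $html;
    }

    $processor = new WP_HTML_Tag_Processor( $html );
    if ( $processor->next_tag( 'a' ) ) {
        // Plain PHP strings go in, with no esc_attr() call; the processor
        // is responsible for encoding the double quotes in the title.
        $processor->set_attribute( 'target', '_blank' );
        $processor->set_attribute( 'title', 'Opens "in a new tab"' );
    }

    return $processor->get_updated_html();
}

echo example_open_first_link_in_new_tab( '<p><a href="/x">Read more</a></p>' ), "\n";
```

The calling code only deals with the values as PHP sees them; how those values are encoded in the resulting HTML is left to the Tag Processor.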

What’s in the future for the HTML API?

Working with HTML structure

The most obvious direction for the evolution of the HTML API is the HTML Processor, the high-level counterpart to the Tag Processor. It’s designed to query and manipulate nested HTML structure, providing the ability to do things like add or remove HTML tags, wrap or unwrap HTML tags, replace HTML tags, and more. Expect a CSS selector interface for querying specific locations in a document.

It may not make sense why this wasn’t a basic project requirement – to be able to add a new tag – but there’s a major increase in complexity moving from being able to understand the stream of tokens in an HTML document to being able to manipulate the structure those tags represent. There are far more complicated rules and edge cases (for example, what should occur when closing tags are missing or when tags overlap), and parsing the structure requires more memory and more time. The Tag Processor is designed to present predictable performance, but when examining the structure of an HTML document it’s not possible to provide those same guarantees since the performance also depends on the contents of that document. For this reason the two interfaces will remain separate and serve distinct purposes. In fact, for many cases, even when the HTML Processor is complete there will be good reason to stick with the simpler Tag Processor.

The HTML Processor is being developed and tested in stages.

The HTML Processor in its most basic form was merged into WordPress immediately after the WordPress 6.3 branch was created. It includes the most basic scaffolding necessary to provide the means for handling HTML in all its colorful renditions, but does very little at the moment.

It adds a new method for querying a location in an HTML document by “breadcrumbs.” For example, “find the next image that’s a child of a figure element.”

if ( $processor->next_tag( array( 'breadcrumbs' => array( 'figure', 'img' ) ) ) ) {
    $default_image_src = $processor->get_attribute( 'src' );
}

With the scaffolding in place it’s possible to incrementally support more and more of the HTML5 specification. It would have been possible to bring this in all at once, but the absolute minimum size of the change would involve thousands of lines of code that would be unreasonable for anyone to review or test. By building the smallest usable pieces at a time it gives space to focus on each change and the many complications it might imply.

The HTML Processor is going to be able to add and remove tags.

Before it’s finished, the HTML Processor is going to expose several functions reminiscent of DOM interfaces: set_inner_html(), set_outer_html(), parent_node(), insert_before(), insert_after(), remove_node(), replace_node() etc…

These functions will provide a safe means of modifying HTML content without breaking the document.

while ( $processor->next_tag( array( 'selector' => '[data-wp-show]' ) ) ) {
    if ( 'false' !== $processor->get_attribute( 'data-wp-show' ) ) {
        continue;
    }

    $processor->wrap_with( '<template>' ); 
}

The full details of the function names and specific interfaces are currently being explored, but in the end we are going to try and balance two competing tensions: meeting frequent needs for Core, plugin, and theme developers within WordPress; and remaining familiar to those who have worked in JavaScript and/or with the DOM.

Some functions will be intentionally distinct from the DOM to communicate the different ways this streaming interface behaves. For example, consider a case where replacing a stretch of text would break the HTML document structure. This isn’t possible when setting node.innerHTML = … from JavaScript, but it might be in the HTML API. In a case like this the function will not be called set_inner_html() because it breaks the expectations one might have for that function. Instead, it might be called something like set_raw_inner_markup() to suggest different or new expectations.

For now, the decisions for how to handle situations like these aren’t answered as they involve careful tradeoffs between performance, clarity, and convenience. This is part of current ongoing explorations as the fundamental aspect of parsing HTML structure progresses.

The HTML Processor refuses to break HTML.

It’s going to have bugs and they’ll be fixed, but the philosophy behind the HTML Processor is that if it ever encounters HTML that it can’t confidently understand and parse, then it will stop processing and refuse to proceed. While this leads to what feels like a slow start in building out the API, it also means that you can trust it from day one to do what it claims to do, and support will only improve with time. You don’t have to worry that it will break with certain kinds of input until the day comes when it has the support for it; instead, the HTML Processor will halt processing and communicate that it encountered unsupported markup.

HTML Templating

One idea that’s particularly exciting is the possibility to use the features from the HTML API (which are normally used for reading HTML) to create a context-aware auto-escaping HTML templating system. We’ve been exploring an sprintf-like function that relies on some obscure HTML syntax to create placeholders that avoid creating new syntax like {$var} or [icon]; whatever is in the template is 100% HTML and will work in any standard HTML editor.

WP_HTML::render(
    '<div class="profile"><img src="</%avatar_url>"></%display_name></div>',
    array(
        'avatar_url'   => 'https://…',
        'display_name' => 'Sal Amander'
    )
);

Traditional templating systems are convenient, but also suffer from a few problems related to their custom syntax. (The use of triple curly brackets below is only meant to illustrate the concept; any resemblance to existing templating engines is entirely coincidental.)

  • It can be difficult at times to know what is a template and what is normal HTML, and it can be unclear how syntax errors should be resolved. For example, if wanting to store placeholder values as {{{title}}}, how should this resolve? <a href="/reviews">{{{title</a> It’s not clear if that’s supposed to be normal HTML with three curly brackets or a typo adding the placeholder. It can be even more confusing if HTML exists inside the template: for example, with {{{<em>display_name</em>}}} or <p>{{{count out of {{{available}}} available</p>.
  • These kinds of template placeholders leave escaping needs up to the programmer. In <a href="{{{url}}}" title="{{{url}}}">{{{url}}}</a> there are three separate ways that the url value needs to be escaped. It’s often possible to add inline filters such as {{{url|attribute}}} or {{{url|json}}} or {{{url|script|json|titlecase|dangit}}} but this is fundamentally a broken process that relies on humans remembering to do the right thing every time.
  • While some templating systems are supported in popular editors, each one is required to manually add support and manually make choices about the syntax errors and how to represent them. In most cases, unless someone can find an appropriate extension, there’s no syntax highlighting or error linting in an editor for such an HTML template.

If WordPress can provide an HTML templating solution that only uses existing HTML syntax then all editors should at least provide syntax highlighting as HTML is already a universal language. Relying on HTML syntax means that in cases of syntax errors, everyone will agree on how to resolve the error: the browsers, text editors, WordPress, and more. Finally, by parsing a template when rendering it’s possible to know if a placeholder is found within HTML markup, within an attribute, within a script, within a URL-referencing attribute like src or href, and automatically escape provided values as is appropriate.

While this is still early in its exploratory phase and almost all of the details remain unsettled, this will open up a new, clear, and safe way of constructing HTML where WordPress developers are freed from the need to escape values or think about escaping at all. The templating system can make the escaping decisions reliably, and the calling code needs only to think about the actual data values as PHP sees them.

Expanding access to blocks from within PHP

Finally, one long-awaited feature at the end of this work is gaining access to a block’s sourced attributes on the server. Using the CSS selectors in a block.json file, the HTML Processor will be able to read all of a block’s attributes inside PHP (this is all attributes, not only those found in the JSON attributes in the block comment delimiter).

Similarly, being able to read and understand the block markup and attributes on the server also means that it will be possible to run block updates on posts in the database without loading them in the editor. All you’ll need to update those old blocks is a transformer function and a batch process to bulk-update your posts, pages, widgets, and theme templates.

These updates can happen ad-hoc on render as well and this mechanism is being explored for modifying a rendered block to replace one of its block attributes with values from custom fields and other sources. The ability to parse the HTML and the CSS selector in the block.json attribute definition is what makes this possible.

Rewiring WordPress internals

Having a reliable HTML parser opens a wide range of opportunities that didn’t previously seem possible. It was proposed during the Core merge that the Tag Processor might be able to replace functionality in wp_kses() and in the numerous functions in formatting.php. It may seem odd to consider them HTML parsers since none of the code snippets expressly communicate that’s what they’re doing, but a cursory examination revealed at least 46 different existing HTML parsers in formatting.php alone. These are a handful of regular expressions and state machines in PHP, each of which has its own quirks and failure modes, each of which has its own interface, almost all of which are very similar but subtly different, all of which are unreliable HTML parsers.

It may be possible to replace a significant amount of this legacy code with the HTML API and come out not just more reliable, but potentially faster overall as well. Any of this work will be slow, though, given the need to be careful and avoid breaking existing code, plugins, and themes. Nevertheless, the future is bright with regard to some of the longest-standing sources of corruption and breakage that are due to parsing failures.

Summary

The HTML API is a set of reliable interfaces for reading and modifying HTML from within PHP. It trades some amount of performance in order to avoid breaking pages but remains fast enough to be usable without worry. It borrows the language that HTML uses to talk about itself so that your code can clearly express what it wants in talking about tags, attributes, and classes instead of characters, regular expression patterns, and quoting mechanisms.

As the HTML API is still new, try exploring and see if the Tag Processor can clear up your HTML-mangling code, and if you are feeling adventurous, follow the ongoing work in the HTML Processor.

Following the progress

For updates to the HTML API keep watch here for new posts.

If you want to follow the development work you can review Trac tickets in the HTML API component or start watching new HTML API tickets from the component overview page.

You are invited to join the Developer Hours call on August 30, 2023 whose topic is the introduction to the HTML API.

Finally, if you want to talk details or bugs or applications, check out the #core-html-api channel in WordPress.org Slack.

Acknowledgements

Many people reviewed this post and I’d like to thank them all for their support. I will probably miss some folks, but thanks belong to @artemiosans, @bernhard-reiter, @bph, @luisherranz, @juanmaguitar, @santosguillamot, @poliuk, @zieladam, and those whom I’m unintentionally overlooking now.