A new system for simply and reliably updating HTML attributes

This call for feedback will be open until September 9th.

Let’s introduce a reliable tool WordPress could use to adjust the HTMLHTML HyperText Markup Language. The semantic scripting language primarily used for outputting content in web browsers. blockBlock Block is the abstract term used to describe units of markup that, composed together, form the content or layout of a webpage using the WordPress editor. The idea combines concepts of what in the past may have achieved with shortcodes, custom HTML, and embed discovery into a single consistent API and user experience. markup. The current practice of using basic replacements seems fine at a first glance but is easy to break. The system proposed here will help avoid these common pitfalls.

Consider this example of adding a style HTML attribute in the cover block:

preg_replace( '/class=\".*?\"/', '${0} style="' . $styles . '"', $html );

It assumes a specific HTML structure:

  • There is a class attribute
  • The style attribute isn’t already defined, as browsers ignore the repeated attributes.
  • There is no other attribute ending with the string class="", such as data-replace-class="…" or title='how to set the class="" attribute'
  • The existing class attribute uses double quotes and no single quotes or no quotes.
  • Regular HTML does not contain the class="" substring, for example in a post describing how to use the class attribute in an HTML document.

These assumptions are typically true, but only until they’re not. For example, applying a padding produces a markup such as below where the browsers ignore the second style attribute:

<!-- Formatting applied for clarity -->
<div
    class="wp-block-cover"
    style="background-image:url(/img.png);"
    style="padding-top:4px">

The original preg_replace could be patched, but eventually another assumption would break. The deeper, fundamental problem is that string replacements are not the right tool for updating HTML. They’re used out of necessity as WordPress does not provide any better tools. Well, let’s change it!

Let’s introduce a dedicated tool for reliably updating the HTML markup. It’s called WP_HTML_Walker:

$w = new WP_HTML_Walker( '<div></div>' );
$w->next_tag();
$w->set_attribute( 'style', $styles );
$updated_html = (string) $w
// <div style="display: block"></div>

Simple string replacements don’t account for nuances in HTML

The problem of updating HTML attributes frequently appears in block-related work. Recently @dmsnell and I (@zieladam) investigated how HTML attributes are typically updated in the WordPress codebase while exploring adding a CSS class to all Gutenberg blocks. We found the typical approach is to use string replacements similar to the one covered above.

Here are a few examples already in CoreCore Core is the set of software required to run WordPress. The Core Development Team builds WordPress. where we run into these nuanced problems:

The Site Logo block attempts to remove the href attribute:

// Remove the link.
preg_replace( '#<a.*?>(.*?)</a>#i', '\1', $custom_logo );

However, it also unintentionally removes all the attributes, including class and style.

The gallery block adds a CSSCSS Cascading Style Sheets. class:

preg_replace('/' . preg_quote( 'class="', '/' ) . '/', 'class="' . $class . ' ', $content, 1);

However, if there’s no existing class attribute, it will skip over the tagtag A directory in Subversion. WordPress uses tags to store a single snapshot of a version (3.6, 3.6.1, etc.), the common convention of tags in version control systems. (Not to be confused with post tags.) without adding the required class. The same technique is used in the duotone feature, and the block supports API.

As a side note, it’s easy to lean on the existing pattern of using more complicated functions such as preg_replace(). Calling preg_quote() in this example isn’t appropriate and the entire regular expression pattern does nothing more than a basic str_replace().

The block-supports APIAPI An API or Application Programming Interface is a software intermediary that allows programs to interact with each other and share data in limited, clearly defined ways. attempts to find the first HTML tag:

preg_match( '/<[^>]+>/', $block_content, $html_element_matches, PREG_OFFSET_CAPTURE );

However, it also matches non-tags like text hearts and mathematical expressions (<3, f(x) = {x, x<5; -1, x>=5}), DOCTYPE declarations, and HTML comments.

The media library adds the srcset attribute:

preg_replace( '/<img ([^>]+?)[\/ ]*>/', '<img $1' . $attr . ' />', $image );

However, a > character inside a tag attribute (e.g. title="why tacos > burritos")  would break the srcset functionality and potentially introduce a vector for injection attacks.

The list goes on, and it’s not just blocks.

Here’s an example from shortcodes:

if ( preg_match( '#((?:<a [^>]+>\s*)?<img [^>]+>(?:\s*</a>)?)(.*)#is', $content, $matches ) ) {

Media Embeds:

if ( preg_match_all( '#<(?P<tag>' . $tags . ')[^<]*?(?:>[\s\S]*?<\/(?P=tag)>|\s*\/>)#', $content, $matches ) ) {

Media Galleries:

preg_match_all( '#src=([\'"])(.+?)\1#is', $gallery, $src, PREG_SET_ORDER );

Twentynineteen:

preg_replace( '/<a\s.*?>/', $link, $item_output, 1 );

And many, many other places. The point is, this is how WordPress does it today.

Many features demand a more reliable way of updating HTML attributes. Block theming code, in particular, tends to modify block markup to apply visual styling:

The only way to reliably update HTML attributes is to follow the HTML specification. However, doing that from scratch every time a CSS class is added would only cloud the entire codebase with HTML parsing nuances and distract from the work being done. That’s why @dmsnell, @gziolo, and I (@zieladam) want to move this complexity into Core. It would be exposed as a tailored and restricted API that’s easy to use, hard to mess up, and easy to find.

The new system tokenizes HTML

WP_HTML_Walker (Pull Request 43268) recognizes HTML tags and updates their attributes. It’s reliable because it implements the same official HTML specification as WebKit, Chrome, Firefox, and other major browser engines.

Unlike full-fledged HTML parsers, the walker avoids handling malformed markup, semantic problems, and building a document tree. Any problems that are present on the input are passed on to the browser. The walker doesn’t fix HTML just as it won’t break HTML.

The tradeoff is that it only offers a simplified API to modify HTML attributes. If you want to replace an img tag with a full-fledged figure layout, this API won’t offer that functionality. Similarly, the walker won’t help you replace all the child nodes of a particular div with a completely new markup. This system is focused on finding specific HTML tags and adding, removing, or updating the attributes on those tags.

Remove the href attribute from an anchor tag:

$w = new WP_HTML_Walker( $html );
$w->next_tag( 'a' );
$w->remove_attribute( 'href' );

Add a style attribute to the first tag in the document:

$w = new WP_HTML_Walker( $html );
$w->next_tag();
$w->set_attribute( 'style', 'display: none' );

Add a CSS class to the first tag having the wp-block-media-text__content class:

$w = new WP_HTML_Walker( $html );
$w->next_tag(array(
    'class_name' => 'wp-block-media-text__content'
));
$w->add_class( 'wp-foo-bar' );

Add the srcset attribute to all image tags:

$w = new WP_HTML_Walker( $html );
while ( $w->next_tag() ) {
    if (
        isset( $w->get_attribute( 'src' ) ) &&
        ! isset( $w->get_attribute( 'srcset' )
    ) {
        $srcset = build_srcset( $w->get_attribute( 'src' ) );
        $w->set_attribute( 'srcset', $srcset );
    }
}

Processing HTML using this Core API could help avoid a broad array of mistakes that appear due to the oversimplification presented by the array of ad-hoc solutions. A common interface for operations on block markup would alleviate competition between changes. You can check the refactoring PR to see how this new API could improve code readability in the existing core blocks.

Why build a new API instead of using DOMDocument?

Using DOMDocument was extensively discussed. It’s not installed on every host so a polyfill would still be necessary. And even if it was available everywhere, it’s based on libxml2 designed to parse XML. Libxml2 does not implement the WHATWG HTML parsing spec, does not support HTML5, and brings with it a variety of parsing failures and quirks.

Like many DOM libraries, DOMDocument is a heavy interface that rewrites entire documents after several stages of transformation. In contrast, the walker exposes a focused interface closer to what string functions offer. For the kind of modifications occurring in WordPress this is a more natural and convenient approach.

If this resonates with you then please speak out before September 9th

This post will be open for feedback for the next three weeks until September 9th. After that @dmsnell, @gziolo, and @zieladam would like to merge the new API into the GutenbergGutenberg The Gutenberg project is the new Editor Interface for WordPress. The editor improves the process and experience of creating new content, making writing rich content much simpler. It uses ‘blocks’ to add richness rather than shortcodes, custom HTML etc. https://wordpress.org/gutenberg/ pluginPlugin A plugin is a piece of software containing a group of functions that can be added to a WordPress website. They can extend functionality or add new features to your WordPress websites. WordPress plugins are written in the PHP programming language and integrate seamlessly with WordPress. These can be free in the WordPress.org Plugin Directory https://wordpress.org/plugins/ or can be cost-based plugin from a third-party to power adding CSS classes to Gutenberg blocks, block layout improvements, and changes in CSS style variations.

Also see the WP_HTML_Walker Pull Request.

Props to Dennis Snell (@dmsnell), Grzegorz Ziółkowski (@gziolo), Andrei Draganescu (@andraganescu), Carolina Nymark (@poena), and Ramon Dodd (@ramonopoly) for their reviews and help in putting this proposal together.
#core, #editor, #gutenberg, #proposal