Eleven years ago, in Core-31992, someone proposed allowing non-US-ASCII email address support in WordPress. The software world has changed considerably since then: internationalized domain names and paths are uniformly handled in browsers, email systems support the wide range of Unicode characters as raw UTF-8, and UTF-8 is the only recommended text encoding for interchange between systems. This means that people are free to use their own names when communicating with others, whether they are Jake, Klára, আরিয়া, അമൽ, or any other name containing letters outside the A-Z range. Unfortunately, WordPress has not kept up with these changes, and that’s what this post is all about.
This post is a request for comment on adding that support. There are a number of complications with potentially far-reaching implications.
TL;DR
- WordPress’ email sanitization is based on US-ASCII characters and needs to be relaxed to allow for valid UTF-8, but this introduces new risks, including but not limited to: confusable characters, equivalence through normalization, and non-visible characters.
- Sites whose databases cannot store full UTF-8 may fail to save valid email addresses. This could be confusing to the site owner and to people attempting to sign up on the site unless properly communicated.
- Any additional code that assumes emails are encoded as single-byte US-ASCII will need updating, specifically because it was always an invariant before that emails would not contain multi-byte Unicode characters. Filters may start seeing characters they believed were impossible to receive.
If you have experience with email issues, deploy email services, or know about certain critical aspects of this proposal, please share your thoughts here or in Core-31992.
Unicode in email addresses was historically more complicated.
When email sprung up, servers were passing US-ASCII as a 7-bit encoding. The need to send text with characters beyond that range appeared shortly afterwards, and MIME text encoding was standardized in RFC 2047. This is what WordPress refers to in its wp_iso_descrambler() function: a mechanism for transmitting non-ASCII characters using only ASCII bytes. Critically, it only applied to certain headers and could not be applied to email addresses.
This funny-looking string indicates that it is encoded…
- with the ISO-8859-2 character set.
- using the quoted form, with escaped hex-codes for non-ASCII characters.
=?ISO-8859-2?Q?=A3=F3d=BC?=
It encodes the latin2 string "Łódź"
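To make the encoded-word above concrete, here is a minimal sketch that decodes it with Python's standard library. It is an illustration only (Python rather than PHP) and is not how wp_iso_descrambler() is implemented; the variable names are made up for this example.

```python
from email.header import decode_header

encoded_word = "=?ISO-8859-2?Q?=A3=F3d=BC?="

# decode_header() splits a header value into (payload, charset) pairs.
# For an RFC 2047 encoded-word, the payload is the raw bytes in the declared charset.
payload, charset = decode_header(encoded_word)[0]

# Interpreting those bytes with the declared charset yields the readable text.
text = payload.decode(charset)

print(payload)  # the four ISO-8859-2 bytes: A3 F3 64 BC
print(text)     # Łódź
```

Only the letter "d" travels as itself; the other three letters are sent as =XX hex escapes of their ISO-8859-2 byte values.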
While MIME encoding alleviated the problem of sending non-English content, it did nothing to remove the need for people to romanize or ASCIIize their names and institutions. Punycode opened the door for internationalized domain names, again by encoding non-US-ASCII bytes through all-ASCII characters, but this applied only to domain names and remained unrecognizable when not parsed.
This indecipherable string encodes a state machine which, when decoded, produces a UTF-8 byte stream.
xn--l8je6s7a45b.com
It encodes the Japanese domain "あーるいん.com"
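The Punycode round trip for this domain can be reproduced in a few lines. This is a sketch using Python's built-in "idna" codec, which implements the older IDNA 2003 rules; that is enough for this example but is an assumption about tooling, not a statement about what WordPress or the pull request uses.

```python
ascii_domain = "xn--l8je6s7a45b.com"

# The "idna" codec works label by label: labels carrying the "xn--" prefix
# are Punycode-decoded, while plain ASCII labels such as "com" pass through.
unicode_domain = ascii_domain.encode("ascii").decode("idna")

# Encoding the Unicode form again recovers the ASCII-compatible form.
round_trip = unicode_domain.encode("idna").decode("ascii")

print(unicode_domain)  # あーるいん.com
print(round_trip)      # xn--l8je6s7a45b.com
```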
As protocols gained more functionality for unescaped UTF-8, such as in IMAP’s UTF-8 extension, more and more servers started allowing non-US-ASCII bytes as long as they were valid UTF-8. Even still, this did not change the state for email addresses, unfortunately, as the old restrictions on that header still applied.
Eventually, major email providers started allowing and passing valid UTF-8 sequences as email addresses, making them a de-facto supported feature. A comprehensive take is standardized in RFC 6530. See last year’s talk at FOSDEM for more information.
What is the proposal for WordPress?
Allow storing Unicode email addresses. (Core-31992)
Functions like is_email(), sanitize_email() and antispambot() need to be extended to support non-ASCII addresses. PHPMailer updates in WordPress 6.9 already made it possible for WordPress to send to Unicode addresses, but it’s not possible for users to use or store them on their account.
PR#5237 unlocks saving Unicode email addresses by modifying these functions, as long as the database permits it. Its validation is locked to the behaviors of <input type=email> elements to ensure compatibility with the browser and a predictable experience.
Back in April, during WordCamp Vienna, geoTLD.group and ICANN sponsored a contributor challenge to work on this very problem. @agulbra, @akirk, @benniledl, and @dmsnell worked together and proposed a new WP_Email_Address class which can parse email addresses and return the decoded local and domain parts. This class is then used by a filter to replace the decisions from is_email() and sanitize_email() with their new counterparts: wp_is_unicode_email() and wp_sanitize_unicode_email(). This approach provides a path for interoperability with modern standards while preserving the ability to maintain the legacy behaviors, and it provides a helpful new class for structurally working with email addresses in various forms and places.
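As a rough picture of what "parse an address and return the decoded local and domain parts" can mean, here is a minimal sketch. The names ParsedAddress and parse_address are hypothetical and exist only for this illustration; this is not the WP_Email_Address API, and it skips the validation a real parser would need.

```python
from typing import NamedTuple, Optional


class ParsedAddress(NamedTuple):
    local: str   # the part before the last "@", kept exactly as given
    domain: str  # the domain, with any Punycode labels decoded to Unicode


def parse_address(address: str) -> Optional[ParsedAddress]:
    """Hypothetical helper: split at the last "@" and decode an IDN domain."""
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        return None  # no "@", or an empty local or domain part

    try:
        # ASCII domains may carry "xn--" labels; decode them for display.
        decoded = domain.encode("ascii").decode("idna")
    except UnicodeError:
        # The domain is already non-ASCII or holds a malformed label;
        # this sketch simply keeps it unchanged.
        decoded = domain

    return ParsedAddress(local, decoded)


print(parse_address("kl\u00e1ra@xn--l8je6s7a45b.com"))
print(parse_address("no-at-sign"))  # None
```

Splitting at the last "@" matters because a domain can never contain that character, while a quoted local part can.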
While Unicode email addresses should be supported, it’s still necessary to be able to apply legacy restrictions in some cases, such as for WordPress’ own sender address/RETURN FROM address, which must remain US-ASCII-only. This proposal is exclusively about supporting Unicode email addresses for WordPress user accounts.
What could go wrong with storing Unicode email addresses?
If the database or site doesn’t support UTF-8 then there is a problem, because there is no guarantee that the email addresses will be able to be stored and retrieved without corruption. The linked pull request includes a new filter which restricts Unicode email support to sites with utf8mb4 databases. That’s a solid and simple restriction that nevertheless allows the overwhelming majority of sites to support the addresses. But this restriction needs to be communicated to site owners in a clear way.
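For background on why the database charset matters, MySQL's legacy utf8 charset (utf8mb3) stores at most three bytes per character, so any character outside the Basic Multilingual Plane needs utf8mb4. The sketch below only illustrates that storage limit; needs_utf8mb4 is a hypothetical name, and this is not the filter from the pull request.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character takes four bytes in UTF-8.

    Code points above U+FFFF encode to four bytes, which the
    three-byte utf8mb3 charset cannot store.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


# "á" (U+00E1) sits in the Basic Multilingual Plane: three bytes are enough.
print(needs_utf8mb4("kl\u00e1ra@example.com"))      # False

# U+1F600 is outside the Basic Multilingual Plane: four bytes are required.
print(needs_utf8mb4("\U0001F600@example.com"))      # True
```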
Existing filters and plugin or theme code expecting all-US-ASCII email addresses might start receiving data that was never expected. Things as simple as calls to strlen() will return byte counts rather than character counts when applied to UTF-8 strings containing multi-byte characters, and validation and sanitization scripts need to be aware of the changes. For example, antispambot() needs updating because it assumes every byte is representable as a hex escape sequence, which is not the case for multibyte strings. Further, Unicode normalization means that two strings which are essentially equivalent may be treated as two distinct strings by PHP, and various functions need to agree on how to handle these to avoid conflating addresses.
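Both pitfalls, byte length versus character length and equivalent-but-different normalization forms, show up with a single accented letter. The sketch below uses Python for illustration; in PHP, strlen() would report the byte count shown here. The sample address is made up.

```python
import unicodedata

# "á" written as one precomposed code point (U+00E1).
address = "kl\u00e1ra@example.com"

char_count = len(address)                  # counts code points
byte_count = len(address.encode("utf-8"))  # counts UTF-8 bytes, like PHP's strlen()
print(char_count, byte_count)              # 17 18

# The same visible address in two normalization forms.
composed = unicodedata.normalize("NFC", address)    # "á" as U+00E1
decomposed = unicodedata.normalize("NFD", address)  # "a" followed by U+0301

# A plain string comparison treats them as different addresses...
print(composed == decomposed)                                 # False

# ...unless both sides are normalized to the same form first.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
```

Whichever form is chosen, the comparison only works if every function that stores or looks up an address applies the same one.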
Summary
The task of adding full Unicode support to identifiers in WordPress is worthwhile, despite being a broad and fuzzy challenge.
- WordPress can start parsing addresses on supporting sites using modern standards.
- Plugins can disable the modern email parsing.
- An audit of Core and plugins is necessary to uncover where assumptions about US-ASCII email characters will be broken when WordPress starts allowing Unicode email addresses.
- Your feedback will help make this process smooth and successful.
Props to Dennis Snell for help with this blog posting, as well as to Manuel Camargo, Dovid Levine, Tushar Bharti, Mukesh Panchal, and Dennis for help with the code.
#charset, #email, #unicode