Preferred Languages: Research

Following the first meeting of the Preferred Languages project, some time was spent on researching popular platforms and other systems to see how they handle the issue of setting multiple preferred languages. To recap, this is needed to show things in the most suitable language for users in case their requested language is not installed.

Browsers & Operating Systems

All major browsers and operating systems have some UI where the user can set their preferred languages. Mostly used for the Accept-Language HTTP header, one can set multiple languages and order them by preference. These systems usually know multiple variants like German and German (Switzerland), but not formal / informal variants.

Worth noting that on these systems you can often choose more settings that are influenced by the language, like temperature units and date formats. Related: #18146.

The Accept-Language HTTP Header

RFC 7231 about the HTTP/1.1 standard says the following about this header:

The “Accept-Language” header field can be used by user agents to indicate the set of natural languages that are preferred in the response. Each language-range can be given an associated quality value representing an estimate of the user’s preference for the languages specified by that range.

This header is quite powerful. For example, Accept-Language: da, en-gb;q=0.8, en;q=0.7 would mean: “I prefer Danish, but will accept British English and other types of English”. I can only recommend reading more about it in the RFC, as it helps getting a better understanding of the problems it tries to solve.

This header is usually used by websites to redirect users to the correct version or display a hint like “This content is also available in XY”. While each language in the header has a specific numeric priority, there is usually only a drag & drop interface to determine the order.

Unicode Common Locale Data Repository

The Unicode CLDR provides key building blocks for software to support the world’s languages, with the largest and most extensive standard repository of locale data available. It contains an interesting chart about Language Matching, data that is used to match the user’s desired language/locales against an application’s supported languages/locales.

There’s also a technical standard about Unicode Locale Data Markup Language (LDML), an XML format for the exchange of structured locale data. The section about Locale Inheritance and Language Matching is particularly interesting. For instance, it describes finding the most well suited language using a weighted graph and gives a better picture of dealing with more complex language fallbacks.

It also takes geographic “closeness” into account, arguing that English (Slovakia) should fall back to something within Europe (e.g. British English) in preference to something far away and unrelated like English (Singapore). It’s clearly stated that these fallbacks aren’t as simple as just saying Spanish (Mexico) -> Spanish (Spain) -> English (US).

Note that this technical standard is about finding the best supported locale based on the requested list of languages. The requested list could come from different sources, such as such as the user’s list of preferred languages in the OS Settings, or from a browser’s Accept-Language list.

Wikipedia

Wikipedia is built on the MediaWiki software, which has a hardcoded list of fallback chains for some locales. This is where MediaWiki will fall back on a different language if it cannot find what it needs. For example, French (Cajun) automatically falls back to French (France) when it doesn’t have all messages defined in it. Unfortunately, there’s only little documentation about this.

Apart from that, MediaWiki distinguishes between various kinds of languages: the site content language, the user interface language and the page content language. The latter can differ from the first two and it influences the language the user views the page in, which depends on the user’s preferences, the available languages, and the defined fallbacks.

It’s worth noting that on Wikipedia, you can not only select your preferred language, but also your preferred gender. In addition to that, you can select multiple different locales for both display and input.

Joomla

I tried the official Joomla Demo to test its language management which is a bit overwhelming at first. After installing all the available languages the user is eventually able to select their preferred language for the front end and the back end. There are a few other settings for language switching on multilingual sites, but that’s pretty much it. So basically the same settings as currently in WordPress 4.7.

Joomla User Settings

Joomla User Settings

 

Drupal

Drupal 8 has a rather powerful user interface text language detection mechanism. There is a per session, per user and per browser option in the detection settings. However, users can only choose one language, so they cannot say (in core at least) that they want German primarily and Spanish if German is not available. But the language selected by the user is part of the larger fallback system, so it may fall back further down to other options.

The Language fallback module allows defining one fallback for a language, while the Language Hierarchy module provides a GUI to change the language fallback system. It allows setting up language hierarchies where translations of a site’s content, settings and interface can fall back to parent language translations, without ever falling back to English. This module might be the most interesting one for our research.

Drupal Language Hierarchy

Drupal Language Hierarchy

TYPO3

TYPO3 has had a locale hierarchy since 2011. Quote:

TYPO3 4.6 comes with a clever fallback mechanism when a label is not found in the requested language; instead of returning the default (English) version, it allows you to define your own hierarchy of locales.

By default, French (Canada) will first use French before falling back to English. Similarly, missing labels in Brazilian Portuguese will first try to return Portuguese labels before the English ones. This feature also accommodates completely custom fallbacks.

The same goes for the underlying Flow Framework and the Neos CMS. This is only a front end thing though. On the back end, a user can only configure one language. If there’s no translation in that language, it will fall back to the site’s default and eventually to English.

TYPO3 Back end Language

TYPO3 Back end Language Setting

WordPress.com

On WordPress.com there’s a user interface language selection in the account settings. One cannot select multiple preferred languages though, but only one.

WordPress.com User Interface Language

WordPress.com User Interface Language Setting

Facebook

Facebook only has a really simple language settings. Although a user can select multiple languages they understand for use in the News Feed, they can only choose one specific locale for the overall UI.The locales that Facebook supports are referenced in the Facebook Locales XML file. That file includes multiple variants for various languages, but only one variant for others. For example, there’s only de_DE, but no de_CH. Plus, you can’t choose between formal and informal variants.

Facebook User Language Setting

Facebook User Language Setting

Summary

We’ve now covered a bunch of well-known web platforms and content management systems to see how they are handling this problem. Various techniques exist to assist with finding the right translation. Sometimes they are automated, but most of the time the choice is left up to the individual user.

This research should help with the next steps of the Preferred Languages project as these observations need to be adapted for WordPress so that we can learn from them. Feel free to leave any comments about this research in the comments. After that, I try to schedule a new meeting in December where the next steps can be discussed.

#feature-projects, #i18n, #preferred-languages