The Improved Fatal Error Protection

Following the post on Site Health mechanisms released in WordPress 5.1, the feature labelled “Fatal Error Protection” (see #44458) was reverted and therefore did not ship as part of that release.

This was necessary due to several security concerns, partly discovered by the team, partly by third-party security experts:

  • A bad request could be made to the WordPress site targeting a specific plugin, for example with a request method or parameters that said plugin does not expect. Following that, the plugin might throw an exception, causing the plugin to be paused – i.e. an attacker could intentionally use such requests to force pausing of a plugin. This was arguably the most severe concern, since, while plugins should absolutely validate parameters rather than causing a fatal error, many have weaknesses in this area.
  • A flaw in one plugin could cause another plugin to be paused, rather than the flawed plugin itself. A good example of this is exceeding the memory limit: plugin 1 runs a far too expensive procedure, but the memory limit is then actually reached while an unrelated plugin 2 is executing, so the latter is identified as the origin of the fatal error.
  • A plugin failure in the frontend could cause that plugin to be paused in the backend, although it might not have caused a fatal error there. The frontend is a “non-protected endpoint”, for which plugins or themes should never be paused.

Multiple follow-up tickets were created to mitigate these issues, but eventually the team came to the conclusion that all these tweaks would have only slightly reduced the attack surface, rather than eliminating it.

A completely new approach was needed, which would take additional time to plan and implement.

This post outlines the new proposed approach in detail. Please share it and request feedback, both from community members and people less active in the WordPress ecosystem – particularly security experts and hosting engineers. We would like to ensure that the approach is solid to proceed with before it is fully implemented.

Goal of the Fatal Error Protection

The primary goal of the feature remains the same as it was originally:

It should be possible for an administrator to access their admin backend, even in case of a fatal error.

While this does not reduce the risk factor of a PHP or extension update, it encourages users to perform an update despite the risks, as they will still be able to at least temporarily fix the problem.

If an update breaks their site, it will currently cause a so-called “white screen of death”. With the fatal error protection, it should instead display a specific screen informing the user that there is a technical problem with the site. The administrator should then be able to access their backend. There they can look at the extensions that are broken and either deactivate them to fix the problem, or find out about the error and forward it to a developer of their choice. Even in the latter case, the administrator might see enough value in deactivating the broken extensions to at least temporarily ensure the site’s frontend is accessible again.

A secondary goal of the feature is to inform the site owner about fatal errors that have occurred for visitors of their site, without the owner being involved. This ensures that in case an administrator forgets to check their site after an update or if an error has happened in a rather specific area, they still find out when their site is inaccessible.

Error logging is explicitly not a goal of the fatal error protection. While it should be possible for an administrator to see the latest error which an extension has caused and which has resulted in that extension’s pausing, no more errors than that should be stored in the database. Error logging is a different feature aimed more at developer audiences, and an implementation using the WordPress database could easily clutter it and decrease the performance of a site, thus introducing negative side effects for minor benefits. It should also be noted that the fatal error protection is not related to recovering a hacked site.

As the feature is targeted towards rather non-technical administrators who are not able to access or modify their site’s codebase, it should be enabled by default and offer an opt-out via code mechanisms such as constants or filters, so that more advanced developers can disable the feature if they prefer.

The feature is targeted for WordPress 5.2 and will be in support of bumping the minimum required PHP version of WordPress to 5.6, which will happen as part of that release as well.

Introducing a Recovery Mode

What we recognized is that the idea of pausing extensions (plugins and themes) globally is the problem. If an attacker can force-pause extensions, essentially causing them to be deactivated on protected endpoints such as the admin backend or the login page, it can have a severe impact, for example on security. A security plugin might have a random weakness that, while being an indicator of a lack of quality, is not necessarily security-relevant. However, if that plugin also added two-factor authentication to the WordPress login page, using the security-irrelevant flaw to pause the plugin would allow the attacker to bypass two-factor authentication on the site.

The idea of pausing extensions makes sense, as it is impossible to act more granularly on a fatal error due to the lack of sandboxing in WordPress. However, it needs to be ensured that pausing happens only for certain users that have sufficient capabilities to fix the issues, either by permanently deactivating the flawed extension or contacting a developer to take care of it – typically the site owner, which in the case of WordPress is arguably an administrator user.

The new approach relies on the concept of a recovery mode. Only specific users should be able to enter recovery mode, and only in recovery mode should extensions be eligible for being paused. This ensures that a fatal error does not have global impact. If an attacker intentionally triggered a fatal error in a plugin, they would gain nothing more from it than they would today.
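
The scoping rule can be illustrated with a minimal sketch. WordPress itself is written in PHP; the Python below is only an illustration, and the class and method names are hypothetical rather than part of any WordPress API. It assumes paused extensions are recorded per recovery session, so a request without a valid session always sees every extension as active.

```python
from collections import defaultdict


class PausedExtensions:
    """Hypothetical store of paused extensions, keyed by recovery session."""

    def __init__(self):
        # session_id -> set of extension slugs paused for that session only
        self._paused = defaultdict(set)

    def pause(self, session_id, extension):
        """Record a pause; refused when the request is not in recovery mode."""
        if session_id is None:
            return False
        self._paused[session_id].add(extension)
        return True

    def resume(self, session_id, extension):
        """Lift a pause for the given recovery session."""
        if session_id is not None:
            self._paused[session_id].discard(extension)

    def is_paused(self, session_id, extension):
        """Outside recovery mode (session_id is None) nothing is ever paused."""
        if session_id is None:
            return False
        return extension in self._paused.get(session_id, set())
```

With this shape, a fatal error triggered by an anonymous visitor never changes which extensions load for anyone, which is the property the recovery mode is meant to guarantee.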

The Intended User Flow

Figure: the intended user flow (props @miss_jwo)

There are two different scenarios for when a fatal error happens.

A user is browsing the site (logged-in or not).

  1. A fatal error occurs, causing the user to see a screen saying that the site is experiencing technical difficulties.
  2. At the same time, a notification email is sent to the admin email address, informing about the fatal error, including the error details and a link with a secret nonce to enter recovery mode. This process is rate-limited, so a maximum of 1 notification email will be sent per hour (filterable).
  3. If the user does not have access to the admin email address, the process stops here – they should not be able to enter recovery mode then.
  4. The user accesses their emails and clicks on the included recovery mode link.
  5. The WordPress site verifies that the nonce in the link is correct, sets a cookie containing another secret key on the user’s machine, and redirects them to the login page. At this stage, the user (or more specifically the client) has effectively entered recovery mode. On every subsequent request, the cookie will be verified by checking the secret key.
  6. The user can now log in as usual. If a fatal error occurs on the site, the extension causing it will be paused for the user, followed by a redirect to the same URL. This repeats until no fatal error occurs anymore.
  7. The user can now reliably browse the admin backend. In the dashboard, they will see a notice informing them which extensions are currently paused for them. On the Plugins / Themes screen, they can either resume these extensions (if they think the problem has been fixed) or completely deactivate them (with that action affecting the site globally, as usual).
  8. If the user wants to exit recovery mode, there is a link in the admin bar that lets them do so. Once the user clicks that, the cookie will be deleted and extensions will no longer be paused for them.
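
Steps 4 and 5 above can be sketched as follows. This is a minimal illustration in Python, not the WordPress implementation: all function names, the `store` dictionary, the two lifetime constants, and the choice of an HMAC-signed cookie are assumptions made for the example. The post only states that the link carries a secret nonce and that the cookie carries another secret key that is checked on every request.

```python
import hashlib
import hmac
import secrets
import time

LINK_TTL_SECONDS = 3600               # assumed lifetime of an emailed link
COOKIE_TTL_SECONDS = 7 * 24 * 3600    # assumed lifetime of a recovery session


def create_recovery_link_token(store, now=None):
    """Create a random link token; only its hash is kept server-side."""
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(32)
    store["token_hash"] = hashlib.sha256(token.encode()).hexdigest()
    store["created_at"] = now
    return token


def verify_recovery_link_token(store, token, now=None):
    """Check the token from the link against the stored hash and its age."""
    now = time.time() if now is None else now
    expected = store.get("token_hash")
    if expected is None or now - store["created_at"] > LINK_TTL_SECONDS:
        return False
    actual = hashlib.sha256(token.encode()).hexdigest()
    return hmac.compare_digest(expected.encode(), actual.encode())


def make_recovery_cookie(site_secret, now=None):
    """Build a cookie value "session_id|created_at|signature" (site_secret is bytes)."""
    now = int(time.time() if now is None else now)
    payload = f"{secrets.token_hex(16)}|{now}"
    sig = hmac.new(site_secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{sig}"


def verify_recovery_cookie(site_secret, cookie, now=None):
    """Return the session id if the cookie is authentic and fresh, else None."""
    now = time.time() if now is None else now
    parts = cookie.split("|")
    if len(parts) != 3:
        return None
    session_id, created_at, sig = parts
    payload = f"{session_id}|{created_at}".encode()
    expected = hmac.new(site_secret, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), sig.encode()):
        return None
    if not created_at.isdigit() or now - int(created_at) > COOKIE_TTL_SECONDS:
        return None
    return session_id
```

Keeping only a hash of the link token on the server and comparing in constant time are design choices of this sketch; they limit what a database leak or a timing probe could reveal, but the proposal itself does not prescribe them.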

No user is browsing the site.

  1. A fatal error occurs (for example in a Cron request), causing the same screen to be internally rendered (even though technically nobody will see it).
  2. At the same time, the notification email mentioned above is sent (if not over the rate limit), to inform the person associated with the site’s admin email address (typically referred to as the “site owner”) about the error.
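
The rate limit on notification emails, used in both scenarios, could look roughly like the sketch below. The class name, the `send` callback, and the in-memory timestamp are hypothetical; a real site would need to persist the timestamp between requests. The default of one email per hour is taken from the flow described above.

```python
import time


class NotificationRateLimiter:
    """Allow at most one notification per interval (one hour by default)."""

    def __init__(self, interval_seconds=3600):
        self.interval_seconds = interval_seconds
        self._last_sent = None  # time of the last email that was actually sent

    def try_send(self, send, now=None):
        """Call send() unless an email already went out within the interval."""
        now = time.time() if now is None else now
        if self._last_sent is not None and now - self._last_sent < self.interval_seconds:
            return False
        send()
        self._last_sent = now
        return True
```

Because the timestamp is only updated after `send()` returns, a failed delivery attempt in this sketch does not block the next one.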

Further Considerations

There are a number of additional considerations for the feature, other than the user flow:

  • The feature should only support pausing of extensions that are user-controllable, meaning that access to the codebase is not necessary to set them up. It therefore excludes support for drop-ins and must-use plugins.
  • At least in the first iteration of the feature, it should also not support multisite and the pausing of network plugins. While these plugins can technically be altered without access to the codebase, setting up multisite requires access to the codebase and is an advanced and, relatively speaking, rare use-case. Adding support for multisite and network plugins would require much additional exploration and work regarding UI (e.g. should only network administrators or also site administrators be able to access the backend and resume those plugins?). Furthermore, there are technical quirks that would need to be figured out (cookie constants, which are needed for the new approach, are only available after network plugins have been loaded). While the feature could generally be supported in multisite even without supporting network plugins, it was decided to explicitly disable the fatal error handler and recovery mechanism for multisite in the first iteration, to explore a solid approach in the future.
  • It is impossible to pause WordPress core, which should not be an issue as core defines which PHP versions it supports and should be compatible with all of them. Even more than today though, it is important to ensure that full compatibility with the latest PHP versions is maintained.
  • The feature cannot function properly if a connection to the database or object cache (if used) cannot be established. The implementation will account for these cases so that it does not trigger PHP notices or exceptions, but it is impossible to work around these limitations further.
  • The mechanisms which relate to initiating the recovery mode must all be error proof and therefore executed early, before extensions which might cause breakage are loaded. This technically means execution must happen after the call to register_theme_directory() in wp-settings.php. These mechanisms are:
    • Verifying a request made from a recovery link, setting the recovery mode cookie and redirecting the user.
    • Verifying the recovery mode cookie and based on that enabling recovery mode for the current request.
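
The first two considerations above amount to a simple eligibility check, sketched here in Python for illustration. The function name and the `content_dir` default are hypothetical; the sketch assumes the default WordPress layout, where regular plugins live in `wp-content/plugins`, themes in `wp-content/themes`, must-use plugins in `wp-content/mu-plugins`, and drop-ins directly inside `wp-content`.

```python
from pathlib import PurePosixPath


def extension_for_error(error_file, content_dir="/var/www/wp-content", multisite=False):
    """Return ("plugin" | "theme", slug) if the failing file belongs to a
    pausable extension, otherwise None."""
    if multisite:
        return None  # recovery is disabled for multisite in the first iteration
    try:
        relative = PurePosixPath(error_file).relative_to(content_dir)
    except ValueError:
        return None  # core file or anything else outside wp-content
    parts = relative.parts
    if len(parts) >= 2 and parts[0] == "plugins":
        return ("plugin", parts[1])
    if len(parts) >= 2 and parts[0] == "themes":
        return ("theme", parts[1])
    return None  # mu-plugins, drop-ins, uploads and so on are never paused
```

For example, an error in `/var/www/wp-content/plugins/shop/main.php` maps to the plugin `shop`, while an error in `/var/www/wp-content/mu-plugins/loader.php` or in a core file maps to nothing and is therefore only reported, not paused.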

Future Iterations

While the previously outlined approach is targeted towards WordPress 5.2, there is room for future iterations that can be released in follow-up versions. Current ideas include:

  • Improving the flow of how an administrator can receive a recovery link, since, in several cases, the admin email address does not cover such users correctly. The fatal error screen could show a link to a new screen that looks similar to the password recovery screen. Here a user could enter their email address, and if they had sufficient permissions on the site, a recovery link would be sent to their email address. In other words, in addition to the link going out to the admin email address, administrators could request their own recovery link manually.
  • Adding support for multisite and network-active plugins. Due to both technical and UI complications, and because multisite installations are relatively few and usually run by more advanced users, it was decided to omit this for the first version. Since network plugins are however still user-managed (contrary to drop-ins and must-use plugins), it would make sense to eventually support them as well. This would also require thinking about a solid flow to network-resume plugins and handle network-wide recovery mode and related user capabilities.
  • Supporting further customization. While the first version allows overriding the feature, in the future more granular adjustments might become possible. Most likely, this will happen naturally through developer requests that can then be evaluated.

Feedback Requested

We would like to get feedback on the proposal, to make sure we get the implementation right this time. In particular, security and UX require thorough reviews. Please share your thoughts, and also share this post with others you think would be interested. You can also track the implementation via #46130 and the accompanying pull request.

You can give feedback by commenting on this post or in the #core-php channel on Slack. We also have weekly meetings on Monday, 16:00 UTC.

Props @afragen, @miss_jwo, @nerrad, @spacedmonkey, @timothyblynjacobs for review.

#5-2, #servehappy