Best practices for performance measurement

This article highlights several best practices to be aware of when benchmarking and measuring performance. The majority of the best practices apply to both benchmarking and profiling, though they are particularly relevant for benchmarking to ensure clean comparisons, e.g. to assess the performance impact of a WordPress core pull request or compare the overall performance of one WordPress release with that of another release.

Gather datasets based on several requests

Possibly the most important consideration when it comes to measuring performance is that it varies. Even if you test the exact same version of a URL twice, you will likely get slightly different results. Depending on which metric you are measuring, the variance may be larger or smaller. For example, if you implement a performance enhancement like marking a script tag in the <head> with the defer attribute (which is known to improve performance) and compare the LCP metric from a single request before and after the change, it may happen that the request with the deferred scripts shows a slower result. That probably doesn’t mean that your site actually regressed in performance, though; it may just be variance. The site may have responded more slowly in the second request due to your network connection, or maybe your machine was under heavier load. In any case, you will not be able to rely on this data, since it is just from a single run for a metric as broad as LCP. Sometimes choosing a more granular metric can help to reduce the variance, but avoiding variance completely is impossible. For any performance comparison, it is critical to use data from at least several requests.

For example, here is how you can use the “benchmark-web-vitals” command to benchmark based on 20 requests:

npm run research -- benchmark-web-vitals -u http://localhost:8889 -n 20

Another example uses the “benchmark-server-timing” command to benchmark based on 100 requests:

npm run research -- benchmark-server-timing -u http://localhost:8889 -n 100

Use a consistent site setup

When benchmarking performance, the most crucial requirement is to use a consistent setup for your WordPress site — the same versions, database, content, theme, and active plugins. It is furthermore recommended to disable any unnecessary plugins and any debug mode. Basically, make sure that nothing differs between the benchmarking scenarios except for the one thing that you want to benchmark. For example, when comparing WordPress core trunk with a specific WordPress core pull request, the code from that pull request should be the only difference between the two scenarios.
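
As a quick sanity check, you can record the exact setup before each benchmarking scenario, for example with WP-CLI (a sketch assuming WP-CLI is available for the site):

# Hypothetical sanity check: confirm versions, active plugins/theme, and debug mode.
wp core version
wp plugin list --status=active
wp theme list --status=active
wp config get WP_DEBUG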

Test against local sites

To eliminate at least some of the variance, performance benchmarks should (if at all possible) be run against local sites. That way you can reduce a lot of the variance from the network connection, since you are no longer issuing requests to an external URL. Of course, this is not always feasible. There can be reasons to benchmark with a production site, e.g. benchmarking in a more realistic hosting environment. In that case, a higher number of requests is advised to counter the higher variance in network connection.
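
For example, assuming you use the @wordpress/env (wp-env) package for your local site, which serves its tests site at http://localhost:8889 by default, you could start the environment and then benchmark against that local URL:

npx wp-env start
npm run research -- benchmark-web-vitals -u http://localhost:8889 -n 20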

Choose the right set of metrics

This point is particularly important when assessing the impact of a single performance change, e.g. a WordPress core pull request. For general performance comparisons, e.g. between two WordPress releases, a good practice is to focus on overall load time performance metrics like TTFB, FCP, and LCP. You may also want to focus on other metrics not related to timing, such as CPU or memory usage. To assess individual performance changes though, it is advisable to use a more specific set of metrics, encompassing both broad and granular metrics.

For the broad metric, it is a good idea to always also capture the relevant overall performance metric for the change: For a server-side performance change, it would be good to measure TTFB, or alternatively the overall WordPress load time via the Performance Lab plugin’s Server-Timing API. For a client-side performance change, it would be good to measure LCP, or even more specifically the difference between LCP and TTFB, to eliminate any of the variance introduced by the server-side performance (since TTFB is part of LCP).

In addition to the broad metric, a more granular metric should be captured whenever possible. How that metric is defined heavily depends on the performance change to assess. For example, if the change touches something that is expected to affect the performance of the WordPress “init” action, the time that action hook takes to fire would be worth measuring. This use case can easily be handled with Server-Timing (see the sketch below). For even more granular comparisons, you can use a profiling tool like XHProf to compare the performance of an individual function before and after the change.
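
As a rough illustration, here is a minimal sketch of a hypothetical must-use plugin (not the Performance Lab implementation) that records how long the “init” action takes and exposes it as a Server-Timing header, so that the “benchmark-server-timing” command should pick it up alongside any other Server-Timing metrics:

<?php
// Hypothetical mu-plugin: measure the duration of the 'init' action.
add_action( 'init', function () {
	// Runs (roughly) as the first 'init' callback.
	$GLOBALS['my_init_start'] = microtime( true );
}, PHP_INT_MIN );

add_action( 'init', function () {
	// Runs (roughly) as the last 'init' callback.
	$GLOBALS['my_init_duration'] = microtime( true ) - $GLOBALS['my_init_start'];
}, PHP_INT_MAX );

add_action( 'send_headers', function () {
	if ( isset( $GLOBALS['my_init_duration'] ) && ! headers_sent() ) {
		// Server-Timing durations are reported in milliseconds.
		header( sprintf( 'Server-Timing: wp-init;dur=%.2f', $GLOBALS['my_init_duration'] * 1000 ), false );
	}
} );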

Achieve a relatively stable median by fine-tuning the number of requests

The number of requests to use for a comparison depends a bit on the variance you are seeing. For some performance metric comparisons, comparing medians from 5 requests may be sufficient. For others, you may need to compare medians from 500 requests. To quantify the variance and its impact, it can also be helpful to compare the minimum or maximum values from all requests, or the values at specific percentiles. In either case, it would be unreasonable to manually perform such a benchmark, which is why tooling is necessary to automate benchmarking. In order to determine a reasonable number of requests for a specific comparison, some questions you should ask yourself are:

  • How many requests do I need to make the median value relatively stable?
    • You can answer this question by trying to run a number of benchmarks against one consistent scenario.
    • Let’s say you first try 10 runs a few times and your medians are more than a few percent off between the individual “batches”; in that case, you may need to increase the number of runs. You may try increasing the number of runs to 20, or 50, for example, and at some point you may see that the median results become more stable (see the example after this list).
    • It also depends heavily on the metric you are measuring. The broader the metric, the larger the variance, and the more requests you need.
  • How much median difference is acceptable for what I am measuring?
    • This directly ties in with the previous question and is probably the most challenging question to answer: How large can the percentage difference between medians from a consistent scenario be for the data to still be considered reliable?
    • It also depends on what you are trying to achieve with the comparison: For example, do you just want to determine whether a change brings any notable performance benefit, or do you want to determine how much benefit it brings? The latter will require a lot more precise data.
  • How costly or how resource-consuming or time-consuming is each request?
    • Of course, in a perfect world, you could just always make an extremely large number of requests as that should give the most reliable data. However, when you benchmark performance on a day-to-day basis, you may not always want to wait for 1000 requests to complete between two comparisons.
    • If the mechanism that you are benchmarking is fairly “cheap” and fast (e.g. making a curl request to the URL), it’s more reasonable to argue that you can simply use more requests for the comparison to get more accurate data. If the mechanism is more expensive and slower (e.g. loading the URL in a headless browser), you may want to be more cautious with how many requests you go with.
    • Another consideration here is how much the process of making all those requests affects your machine. For example, if making the requests brings your CPU usage to excessive levels, your results may appear slower simply because your own machine is reaching its resource limits.
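
To get a feel for how many requests make the median stable, one approach is to repeat the identical benchmark a few times against an unchanged site and compare the reported medians, for example:

# Run the same benchmark three times against an unchanged site; if the
# medians differ by more than a few percent, increase the number of requests (-n).
for i in 1 2 3; do
  npm run research -- benchmark-web-vitals -u http://localhost:8889 -n 50
done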

Be mindful of inconsistencies in your own environment

Another important consideration is that running performance benchmarks can produce vastly different numbers depending on the environment they are run in. This is particularly important to keep in mind when interpreting results. When it comes to broad metrics such as TTFB, you may get data around 60ms while another contributor gets around 150ms for the same metric. That is perfectly fine and expected. Maybe they have a less powerful device, or maybe their device was under more load when they conducted their benchmark. What matters is the relative differences from the benchmarks, and that they are hopefully somewhat similar. If they are not, then either the approach chosen is still inconclusive, or one of the contributors may have had an issue in their setup. Because the results also depend on the resource usage and device used, performance benchmarks should be measured in a relatively consistent environment. It is generally advised to run the full benchmark for the comparison around the same time and to limit other usage of your device while the requests are being made. Don’t run the benchmark for the “before” dataset today and the benchmark for the “after” dataset tomorrow — run them right after each other if possible. Similarly, don’t run one of the benchmarks if you hear your computer’s fan working heavily — wait for things to calm down.

Interpret results carefully

Last but not least, any benchmarking results need to be interpreted with caution. If the performance difference between the scenarios to compare is quite small (<1%), or especially if only some benchmarks show a positive impact, such data may not be a good way to prove that a certain performance change is indeed beneficial. In that case, the approach chosen and the number of requests made may deserve another iteration to gain more confidence in the data. Alternatively, this may show that the performance change is not actually beneficial, or is potentially a micro-optimization that brings a benefit too small to surface in the metrics. It is worth highlighting that the benchmarks may contradict assumptions we have about certain performance changes.

Overall, it is safe to say that we should trust the data that we get from performance benchmarks when it is consistent. We need to question it when it seems inaccurate, which is why it’s worth sharing data in as much detail as possible. Sometimes the approach used to gather the data did not produce results as precise as needed, and in such a case we have to acknowledge that the data may not be reliable. In all other cases, if a particular performance impact, whether positive or negative, is evident from the data, we need to take it into consideration when deciding how to proceed.

Props @flixos90 @joemcgill @spacedmonkey @westonruter for contributing to this article.
