Changes to prevent search engines indexing sites.

In WordPress 5.3, the method used to discourage indexing will change on sites that enable the “Discourage search engines from indexing this site” option in the WordPress dashboard. These changes were made as part of ticket #43590.

These changes are intended to better discourage search engines from listing a site rather than only preventing them from crawling the site.

robots.txt file changes.

In previous versions of WordPress, Disallow: / was added to the robots.txt file to prevent search engines from crawling the site. This has been removed for non-public websites in WordPress 5.3.
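As a rough illustration of what the old rule did, the sketch below uses Python's standard `urllib.robotparser` to evaluate two robots.txt bodies. `OLD_RULES` mirrors the `Disallow: /` rule described above. `NEW_RULES` is a hypothetical stand-in for a file without the blanket rule, not the exact output of WordPress 5.3. The `can_crawl` helper and the example.com URL are likewise made up for this example.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt body lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The blanket rule previously added for non-public sites.
OLD_RULES = "User-agent: *\nDisallow: /\n"

# Hypothetical stand-in: an empty Disallow places no crawl restriction.
NEW_RULES = "User-agent: *\nDisallow:\n"

if __name__ == "__main__":
    page = "https://example.com/sample-page/"
    print("old rules allow crawl:", can_crawl(OLD_RULES, page))
    print("new rules allow crawl:", can_crawl(NEW_RULES, page))
```

Note that this only models crawling. It says nothing about whether a URL ends up listed in search results, which is the distinction the rest of this note is concerned with.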

As Joost de Valk writes in an explainer on search engine exclusion, disallowing crawling can have the effect of allowing a site to be indexed:

A site doesn’t have to be [crawled] to be listed. If a link points to a page, domain or wherever, Google follows that link. If the robots.txt on that domain prevents [crawling] of that page by a search engine, it’ll still show the URL in the results if it can gather … it might be worth looking at.

Meta tag changes.

Sites with the “discourage search engines from indexing this site” option enabled will display an updated robots meta tag to prevent the site from being listed in search engines: <meta name='robots' content='noindex,nofollow' />.

This meta tag asks search engines to exclude the page from their index and discourages them from crawling the rest of the website.
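One way to confirm that a page actually carries this tag is to parse its HTML and read the robots directives. The following is a minimal sketch using Python's standard `html.parser`. The `RobotsMetaParser` class and the `robots_directives` helper are names invented for this example, and the sample markup is only the tag quoted above wrapped in a bare `<head>`.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    sample = "<head><meta name='robots' content='noindex,nofollow' /></head>"
    print(sorted(robots_directives(sample)))
```

For a site with the option enabled, the returned set would be expected to contain both `noindex` and `nofollow`.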

Excluding development servers from search engines.

The most effective method to exclude development sites from being indexed by search engines is to send the HTTP header X-Robots-Tag: noindex, nofollow when serving all assets for your site: HTML pages, images, PDFs, video and other files.

As most non-HTML assets are served directly by the web server on a WordPress site, the core software is unable to set this HTTP header. You should consult your web server’s documentation or your host to ensure these assets are excluded on development sites.
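Once the web server has been configured, it can be worth verifying that static assets really are served with the header. The sketch below is one possible check in Python, not part of WordPress. `blocks_indexing` inspects a mapping of response headers, and `fetch_headers` issues a HEAD request with the standard `urllib.request` module. Both function names are hypothetical, and the check assumes the simple comma-separated form of the header without a user-agent prefix.

```python
from urllib.request import Request, urlopen


def blocks_indexing(headers) -> bool:
    """Return True if the headers include X-Robots-Tag with 'noindex'.

    `headers` is any mapping of header names to values. Header names are
    compared case-insensitively, as HTTP requires.
    """
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":
            continue
        directives = {part.strip().lower() for part in value.split(",")}
        if "noindex" in directives:
            return True
    return False


def fetch_headers(url: str) -> dict:
    """Send a HEAD request to url and return its response headers.

    This performs a real network request, so run it only against a site
    you control.
    """
    request = Request(url, method="HEAD")
    with urlopen(request) as response:
        return dict(response.headers.items())


if __name__ == "__main__":
    # Offline example using a hand-written header mapping.
    print(blocks_indexing({"X-Robots-Tag": "noindex, nofollow"}))
```

Running `blocks_indexing(fetch_headers(url))` against an image or PDF URL on a development site would then indicate whether the server is sending the header for that asset.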

#5-3, #dev-notes