Peter Wilson 3:27 am on September 2, 2019
Tags: 5.3, dev-notes

Changes to prevent search engines indexing sites.

In WordPress 5.3, the method used to discourage indexing will change on sites that enable the “Discourage search engines from indexing this site” option in the WordPress dashboard. These changes were made as part of ticket #43590.

These changes are intended to better discourage search engines from listing a site rather than only preventing them from crawling the site.

`robots.txt` file changes.

In previous versions of WordPress, `Disallow: /` was added to the `robots.txt` file to prevent search engines from crawling the site. In WordPress 5.3, this rule is no longer added for non-public websites.

As Joost de Valk writes in an explainer on search engine exclusion, disallowing crawling can have the effect of allowing a site to be indexed:

A site doesn’t have to be [crawled] to be listed. If a link points to a page, domain or wherever, Google follows that link. If the robots.txt on that domain prevents [crawling] of that page by a search engine, it’ll still show the URL in the results if it can gather … it might be worth looking at.
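The difference between blocking crawling and blocking indexing can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration, not WordPress code: the `can_crawl` helper, the `example.test` URL and both rule sets are assumptions made for the example. Under the old `Disallow: /` rule a compliant crawler never fetches the page, so it cannot see any `noindex` instruction the page carries.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Illustrative rule sets (assumptions, not copied from WordPress core).
OLD_RULES = "User-agent: *\nDisallow: /\n"           # blocks all crawling
NEW_RULES = "User-agent: *\nDisallow: /wp-admin/\n"  # pages stay crawlable

PAGE = "https://example.test/sample-page/"  # hypothetical URL

# Old behaviour: the page is never fetched, so a noindex tag is never read.
print(can_crawl(OLD_RULES, PAGE))
# New behaviour: the page can be fetched and its robots meta tag honoured.
print(can_crawl(NEW_RULES, PAGE))
```

A crawler that may fetch the page can then act on the robots meta tag, which is what actually keeps the URL out of the listings.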

Meta tag changes.

Sites with the “Discourage search engines from indexing this site” option enabled will display an updated robots meta tag to prevent the site from being listed in search engines: `<meta name='robots' content='noindex,nofollow' />`.

This meta tag requests that search engines exclude the page from indexing and discourages them from further crawling the website.
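To confirm that a rendered page carries the tag, something like the following could be used. It is a rough sketch built on Python's standard `html.parser`; the `RobotsMetaParser` class and `robots_directives` function are hypothetical names for this example, and the HTML string stands in for a page fetched from a site with the option enabled.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma separated; whitespace and case are ignored.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set:
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


sample = "<html><head><meta name='robots' content='noindex,nofollow' /></head></html>"
print(robots_directives(sample))
```

A page without the tag yields an empty set, which makes the check easy to use in a deployment script.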

Excluding development servers from search engines.

The most effective method to exclude development sites from being indexed by search engines is to include the HTTP header `X-Robots-Tag: noindex, nofollow` when serving all assets for your site: images, PDFs, video and other assets.

As most non-HTML assets are served directly by the web server on a WordPress site, the core software is unable to set this HTTP header. You should consult your web server’s documentation or your host to ensure these assets are excluded on development sites.
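Once the server is configured, it is worth checking that assets really are served with the header. The sketch below is one possible check, written in Python under a few assumptions: `blocks_indexing` is a hypothetical helper, the header dictionaries are hand-written stand-ins for real responses, and user-agent-specific values such as `googlebot: noindex` are not handled.

```python
def blocks_indexing(headers: dict) -> bool:
    """Return True if the response headers ask search engines not to index.

    Header names are matched case-insensitively. The directive "none" is
    treated as equivalent to "noindex, nofollow".
    """
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            directives = {part.strip().lower() for part in value.split(",")}
            if "noindex" in directives or "none" in directives:
                return True
    return False


# Hand-written example headers (assumptions, not captured from a real server).
image_response = {"Content-Type": "image/png", "X-Robots-Tag": "noindex, nofollow"}
pdf_response = {"Content-Type": "application/pdf"}

print(blocks_indexing(image_response))  # header present with noindex
print(blocks_indexing(pdf_response))    # header missing
```

In practice the dictionaries would come from real responses for an image, a PDF and a video on the development site.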

#5-3, #dev-notes

Daniel Llewellyn 10:57 pm on September 2, 2019

Could not a fenced `Header set` rule be added to the `.htaccess` to attempt to add that HTTP header automatically? This would help anyone running on Apache with `.htaccess` supported by the server setup. Obviously not every site which has search engines discouraged would benefit depending on their setup, but some is better than none…

`### START WordPress Robots

Header set X-Robots-Tag "noindex, nofollow"

### END WordPress Robots`
- Daniel Llewellyn 10:59 pm on September 2, 2019
  That should be:
```
### START WordPress Robots
<IfModule mod_headers.c>
    # The directive names are "noindex" and "nofollow" (no hyphens).
    Header set X-Robots-Tag "noindex, nofollow"
</IfModule>
### END WordPress Robots
```
- Peter Wilson 11:26 pm on September 2, 2019
  
  I considered this but as server configurations may change, there is no way to guarantee the .htaccess file will remain writable between enabling and disabling the setting.
  
  This could result in the X-Robots-Tag header blocking search engines once a site goes live if the file can’t be updated. Therefore, adding and removing the HTTP header is best left to the developer.
  - Luke Cavanagh 7:47 pm on September 4, 2019
    
    Also, not all WordPress sites are hosted on a LAMP stack; some use NGINX in a LEMP stack.
Ben 1:35 pm on September 3, 2019

Hi Peter,

Thanks for pushing this out. 5.3 doesn’t actually have a release schedule yet and since Google stopped supporting the noindex among other rules on September 1st, do you by chance have a guesstimation on 5.3’s release?

Also, are you adding “noindex” by default on the first WordPress install? If you’re leaving the “Discourage SEs…” checkbox at first login, having it checked by default would save us all a step! lol 😉

Keep up the great work!
Ben
- Peter Wilson 9:48 pm on September 3, 2019
  
  WordPress 5.3 is scheduled for early November. I’ve updated the page linked to on each page of this site to make the schedule easier to find.
  
  Google stopped supporting the noindex among other rules on September 1st
  
  This was a non-standard rule they supported in robots.txt files. As WordPress didn’t use this rule, the WordPress changes are unaffected by Google’s policy change.
  
  If you’re leaving the “Discourage SEs…” checkbox at first login, having it checked by default would save us all a step!
  
  Yes, the checkbox will continue to be displayed during the installation process. As most sites are intended to be indexed by search engines, it remains unchecked by default.
BackuPs 5:50 am on September 4, 2019

The most effective method to exclude development sites from being indexed by search engines is to include the HTTP header X-Robots-Tag: noindex, nofollow when serving all assets for your site: images, PDFs, video and other assets.

And how does this work in a multi site where one has maybe onle one or two website that do not need to be indexed?
- BackuPs 5:51 am on September 4, 2019
  
  Sorry: typo in my question. Here is the correct one.
  
  The most effective method to exclude development sites from being indexed by search engines is to include the HTTP header X-Robots-Tag: noindex, nofollow when serving all assets for your site: images, PDFs, video and other assets.
  
  And how does this work in a multisite where maybe only one or two websites do not need to be indexed?
  - Peter Wilson 6:20 am on September 4, 2019
    
    For the discourage search engine setting in WordPress, the option is per site.
    
    There’s no universal procedure for adding HTTP headers on a per-site basis; you’ll need to consult your web server’s documentation.
    - BackuPs 6:42 am on September 5, 2019
      
      Hi Peter ,
      
      I am a bit dissatisfied by your answer
      
      “you’ll need to consult your web server’s documentation.”
      
      WP multisite is a product of WordPress. Your article says
      
      The most effective method to exclude development sites from being indexed by search engines is to include the HTTP header X-Robots-Tag: noindex, nofollow when serving all assets for your site: images, PDFs, video and other assets.
      
      But this only applies to a WP single site. For a multisite there is no effective way to do this.
      
      Maybe adjust your article on this or drop this at the wpmultisite team so a better solution can be provided.
      - Aaron Jorbin 10:42 pm on September 5, 2019
        
        Sending the HTTP header is out of scope for WordPress in both single site and multisite. It is best done by the server software, as WordPress doesn’t handle all requests.
        
        The actual configuration is going to depend on the server you use, the way you have multisite set up, and how your current server is configured.
anonymized-21e4dd0abfe8a0337cb8692304d453ea 7:53 am on October 28, 2019

It should be noindex, dofollow.

The reason is that if you have 300 already indexed pages and you want them to be deindexed, you should allow the bots to properly crawl the whole site and discover the noindex. By adding “noindex” you are suppressing the bots from seeing the noindex tag on the other pages linked to that page. That will result in a much slower deindexation than it normally would be. The nofollow attribute will also cause the bots to crawl the site less frequently, so you’ll end up with pages that stay in the index for a very long time if they were already indexed.

Noindex, nofollow would be a good solution only if you don’t have a site indexed already.

If the site was indexed already, it should be noindex, dofollow.
- anonymized-21e4dd0abfe8a0337cb8692304d453ea 7:54 am on October 28, 2019
  
  By adding “nofollow*” you are suppressing the bots from seeing the noindex tag on the other pages linked to that page.