What Do Popular WordPress Websites Put In Their Robots.txt File?

The robots.txt file lets you control what search engine spiders should and shouldn’t crawl on your website. It’s a file that many WordPress users forget to add, and for those who do add one, it can be unclear which folders should and shouldn’t be blocked.

I think that the leaner robots.txt file from Jeff Starr is a good starting point for WordPress users, as it blocks the most important folders from the search engines, such as your admin area, the includes directory and your theme and plugin folders.

There isn’t a standardised robots.txt file for WordPress. As a result, the contents of robots.txt vary considerably from site to site. The great thing about the robots.txt file is that it is publicly viewable, so I thought it would be interesting to look at some popular WordPress websites and see how they have configured theirs.

Please bear in mind that whilst I understand how to create a robots.txt file and which directives are available (e.g. Allow, Disallow, Sitemap etc), I don’t consider myself an expert on the Robots Exclusion Protocol or on search engine optimisation. What became clear to me when researching this post is that there doesn’t seem to be a right or wrong robots.txt file. Most WordPress websites block different directories from search engines, and many website owners disagree about what you should and shouldn’t put in the file (or whether you should even have a robots.txt file at all).
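If you want to see how these directives behave in practice, Python’s built-in robotparser module can check any URL against a set of rules. This is just a quick sketch; the example.com URLs and the rules themselves are purely illustrative:

```python
from urllib.robotparser import RobotFileParser

# A minimal set of rules, modelled on the kind of file discussed below
rules = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Paths under a disallowed prefix are blocked; everything else is crawlable
print(rp.can_fetch("*", "http://example.com/wp-admin/options.php"))  # False
print(rp.can_fetch("*", "http://example.com/2013/01/hello-world/"))  # True
```

This only implements the original prefix-matching protocol, but it is a handy way to sanity-check a file before uploading it.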

WordPress

Let’s start by looking at how the official home of WordPress handles search engine robots :)

WordPress.org actually has a fairly simple robots.txt file. The main focus seems to be on blocking search functionality and the RSS feed.

User-agent: *
Disallow: /search
Disallow: /support/search.php
Disallow: /extend/plugins/search.php
Disallow: /extend/themes/search.php
Disallow: /support/rss

WordPress.com sites are a little different. All WordPress.com sites share the same robots.txt file. It sets a crawl delay for the IRLbot crawler and blocks a few folders, such as the signup and activation directories.

# If you are regularly crawling WordPress.com sites please use our firehose to receive real-time push updates instead.
# Please see http://en.wordpress.com/firehose/ for more details.
 
Sitemap: http://wordpress.com/sitemap.xml
 
User-agent: IRLbot
Crawl-delay: 3600
 
User-agent: *
Disallow: /next/
 
# har har
User-agent: *
Disallow: /activate/
 
User-agent: *
Disallow: /signup/
 
User-agent: *
Disallow: /related-tags.php
 
# MT refugees
User-agent: *
Disallow: /cgi-bin/
 
User-agent: *
Disallow:
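The Crawl-delay line above is a non-standard extension to the protocol, but Python’s robotparser understands it, which makes it easy to check how a file like WordPress.com’s would be read. A sketch using a trimmed-down copy of the rules:

```python
from urllib.robotparser import RobotFileParser

# A trimmed-down version of the WordPress.com rules shown above
rules = """\
User-agent: IRLbot
Crawl-delay: 3600

User-agent: *
Disallow: /next/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.crawl_delay("IRLbot"))     # 3600 (seconds between requests)
print(rp.crawl_delay("Googlebot"))  # None - no delay set for other crawlers
```

Note that `crawl_delay()` requires Python 3.6 or newer.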

The Suggested Robots.txt File from WordPress hasn’t been updated in years, though it is still a good indicator of which directories shouldn’t be crawled by search engines.

To help me write this article, I checked the robots.txt files of dozens and dozens of WordPress websites. To my surprise, the only website I found using the suggested robots.txt file was John Chow’s. I’m surprised more top blogs haven’t used the example from WordPress, or at least a close variation of it.

Sitemap: http://www.example.com/sitemap.xml
 
# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*
 
# Google AdSense
User-agent: Mediapartners-Google*
Disallow:
 
# digg mirror
User-agent: duggmirror
Disallow: /
 
# global
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
Disallow: /category/*/*
Disallow: */trackback/
Disallow: */feed/
Disallow: */comments/
Disallow: /*?
Allow: /wp-content/uploads/

Allow Everything

I came across a few large WordPress websites, such as ProBlogger, SitePoint and Spotify, that simply allow everything.

User-agent: *
Disallow:
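An empty Disallow value means “block nothing”, so this two-line file really does let every crawler in. You can confirm that with robotparser (a quick sketch; the URL is illustrative):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# "Disallow:" with no value blocks nothing at all
rp.parse(["User-agent: *", "Disallow:"])

print(rp.can_fetch("Googlebot", "http://example.com/wp-admin/"))  # True
```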

Block WP-Admin and WP-Includes

One of the most common robots.txt files I came across simply blocks crawlers from the wp-admin and wp-includes folders. It’s used by BuddyPress, Digging Into WordPress, WebDesignerWall, WooThemes, StudioPress and DailyBlogTips.

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

Other Examples

There are millions of WordPress websites online. Whilst many share the same robots.txt file, a large number have created something unique for themselves.

WP Mods

I can’t recall the original source/inspiration for the WP Mods robots.txt file. I suspect I started with an example I found online and made some changes along the way. The file blocks folders such as wp-admin, wp-includes and wp-content, as well as all files that start with wp- (e.g. wp-login.php and wp-register.php). Google Images is blocked, though AdSense is allowed, and the sitemaps generated by Yoast SEO are listed at the bottom.

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /trackback/
Disallow: /archives/
Disallow: /category/
Disallow: /tag/*
Disallow: /tag/
Disallow: /wp-*
Disallow: /login/
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.php$
 
User-agent: Googlebot-Image
Disallow: /
 
User-agent: Mediapartners-Google*
Disallow:
 
User-agent: ia_archiver
Disallow: /
 
User-agent: duggmirror
Disallow: /
 
Sitemap: http://www.wphub.com/category-sitemap.xml
Sitemap: http://www.wphub.com/page-sitemap.xml
Sitemap: http://www.wphub.com/post-sitemap.xml
Sitemap: http://www.wphub.com/post_tag-sitemap.xml
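Patterns such as /*.js$ rely on the * and $ wildcard extensions that Google and some other major crawlers support; they are not part of the original Robots Exclusion Protocol, and Python’s built-in robotparser treats them literally. As a rough sketch of how a wildcard-aware crawler interprets these rules, each pattern can be translated to a regular expression (the helper name is my own invention):

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    # '*' matches any run of characters; a trailing '$' anchors the match
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.compile(regex)

css_rule = robots_pattern_to_regex("/*.css$")
print(bool(css_rule.match("/wp-content/themes/style.css")))  # True
print(bool(css_rule.match("/style.css?ver=3.5")))            # False - '$' anchors the rule
```

This is why rules like Disallow: /*.css$ block the stylesheet itself but not the same URL with a query string appended.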

Mashable

Mashable block the WordPress theme and plugin folders. They also block search engines from indexing the WordPress readme.html and rpc_relay.html files. I found this a little strange, as it’s much easier to delete these files than to block them (and it’s not as if hackers can’t get them by downloading WordPress anyway). They also block a lot of other HTML files, as well as the folders where ad scripts are installed.

User-agent: *
Disallow: /adcentric
Disallow: /adinterax
Disallow: /atlas
Disallow: /doubleclick
Disallow: /eyereturn
Disallow: /eyewonder
Disallow: /klipmart
Disallow: /pointroll
Disallow: /smartadserver
Disallow: /unicast
Disallow: /viewpoint
Disallow: /addineyeV2.html
Disallow: /canvas.html
Disallow: /DARTIframe.html
Disallow: /interim.html
Disallow: /oggiPlayerLoader.htm
Disallow: /videoeggbackup.html
 
Disallow: /facebook_xd_receiver.html
Disallow: /readme.html
Disallow: /rpc_relay.html
Disallow: /twitterlists/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/

Matt Mullenweg

WordPress co-founder Matt Mullenweg has a simple robots.txt that allows Google AdSense and blocks search engines from his Dropbox folder, contact page, login page and admin area.

User-agent: *
Disallow:
 
User-agent: Mediapartners-Google*
Disallow:
 
User-agent: *
Disallow: /dropbox
Disallow: /contact
Disallow: /blog/wp-login.php
Disallow: /blog/wp-admin

Smashing Magazine

Apart from listing the Yoast-generated sitemaps, Smashing Magazine blocks all crawlers from indexing its RSS feeds. Five specific crawlers are prevented from indexing anything on the site.

Sitemap: http://www.smashingmagazine.com/post-sitemap.xml
Sitemap: http://www.smashingmagazine.com/page-sitemap.xml
Sitemap: http://www.smashingmagazine.com/category-sitemap.xml
Sitemap: http://www.smashingmagazine.com/post_tag-sitemap.xml
 
User-agent: *
Disallow: /wp-rss.php
Disallow: /wp-rss2.php
 
User-agent: MSIECrawler
Disallow: /
 
User-agent: psbot
Disallow: /
 
User-agent: Fasterfox
Disallow: /
 
User-agent: Xenu
Disallow: /
 
User-agent: SiteSucker
Disallow: /

WP Beginner

The interesting thing about the WP Beginner robots.txt is that it only sets rules for Google’s crawlers. Google Images is blocked from the wp-includes folder, though AdSense is allowed. Wp-content and wp-admin have been blocked, but wp-includes can still be crawled by the main Googlebot. Trackbacks, feeds and their gallery have been blocked too.

User-agent: Googlebot
 
Allow: /?display=wide
Disallow: /wp-content/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /feed/
Disallow: /index.php
Disallow: /*?
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: */feed/
Disallow: */trackback/
Disallow: /link.php
Disallow: /gallery2
Disallow: /gallery2/
Disallow: /refer/
 
User-agent: Googlebot-Image
Disallow: /wp-includes/
 
User-agent: Mediapartners-Google*
Disallow:

CopyBlogger

CopyBlogger have a simple robots.txt that blocks their feeds and trackbacks from being indexed.

User-agent: *
Disallow: /*/feed/
Disallow: /*/trackback/

WPMU

WPMU have not specified any rules for crawlers to obey. The only thing they declare is the location of their sitemap (last year, Peter Handley from MediaFlow left a comment on SEORoundTable stating that he does the same).

# BEGIN XML-SITEMAP-PLUGIN
Sitemap: http://wpmu.org/sitemap.xml.gz
# END XML-SITEMAP-PLUGIN

No Robots.txt File

The first thing a search engine robot looks for when it visits a website is the robots.txt file, yet many WordPress websites, such as the theme store Elegant Themes and the personal blogs of Mark Jaquith and Tim Ferriss, do not have one.

Is this necessarily a bad thing? I’m not sure. A year ago, Google employee JohnMu gave the following advice to someone on Webmaster Central:

I would recommend going even a bit further, and perhaps removing the robots.txt file completely. The general idea behind blocking some of those pages from crawling is to prevent them from being indexed. However, that’s not really necessary — websites can still be crawled, indexed and ranked fine with pages like their terms of service or shipping information indexed (sometimes that’s even useful to the user :-)).

The main argument against going without one seems to be that a 404 error is generated every time a search engine crawler requests the missing robots.txt file.
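For what it’s worth, crawlers that follow the protocol treat a missing or empty robots.txt exactly like an allow-everything file. Python’s robotparser shows the same behaviour when given no rules at all (a quick sketch; the URL is illustrative):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([])  # an empty file: no rules at all

# With nothing disallowed, every URL is crawlable
print(rp.can_fetch("Googlebot", "http://example.com/any/page/"))  # True
```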

Overview

As I noted at the beginning of this article, I am not an SEO expert, so I am not sure what the best practice is for creating a robots.txt file for your website. I think that blocking important areas of your site, such as your admin area, theme and plugin folders and trackbacks, is a good idea. Blocking important files and folders unique to your own website (e.g. www.site.com/mydocuments/) is recommended too.

There does seem to be a lack of consistency between websites over what robots.txt should actually block, and there are those who argue that it is not needed at all.

How much importance do you place on the robots.txt file? Feel free to share a link to your robots.txt file (or code from it) in the comments area together with your reasons for blocking specific directories and files.

Thanks for reading :)

Kevin

This article was authored by:

Kevin Muldoon is a professional blogger with a love of travel. He writes regularly about topics such as WordPress, Blogging, Productivity, Internet Marketing and Social Media on his personal blog, and provides technical support at Rise Forums. He can also be found on Twitter: @KevinMuldoon and Google+.

Kevin Muldoon has authored 833 posts.

Showing 5 Comments

  • Interesting and Informative post :)

  • As the robots.txt file is a site-level directive (or rather an application-level directive), and the crawlers run on predefined instruction sets (this is the basis of a Multi Agent System) from Google Search and its search bots (which index everything by default), our robots.txt file (whatever the CMS is) might not be able to block all the disallowed parts, especially if Google has fed its bots a special directive. Apparently there is a connection between Google’s other bots and its search bot.

    robots.txt is needed to perform only two things: (1) saving bandwidth and time by disallowing the CMS core files, and (2) fighting duplicate content issues.

    Ultimately, except for the owners of Google, no one perfectly knows the real configuration and logic of their Multi Agent System (the search bot in our case). That is why people search here and there and use trial and error to find an ‘optimized’ robots.txt file.

    If the robots.txt file was not needed, Google itself would have chopped it out (WP or non-WP does not matter) – http://www.google.com/robots.txt ; you can see /maps has been controlled nicely.

  • Great post, more information on search engines out there would be helpful.

  • goodwebdesign

    Good stuff. Nice to see the differences, although it looks to me like most do some sort of random thing with their robots.txt file.

  • Could you explain to me why you would disallow comments?

    Disallow: /comments/

    I want to learn more about this. Do you think having too many keyword-stuffed comments hurts your ranking?
