What Do Popular WordPress Websites Put In Their Robots.txt File?
The robots.txt file lets you control what search engine spiders should and shouldn't crawl on your website. It's a file that many WordPress users forget to add, and for those who do add one, it can be unclear which folders should and shouldn't be blocked.
I think that the leaner robots.txt file from Jeff Starr is a good starting point for WordPress users, as it blocks the most important folders from search engines, such as your admin area, your includes directory, and your theme and plugin folders.
There isn't a standardised robots.txt file for WordPress, so the contents of robots.txt vary considerably from site to site. The great thing about the robots.txt file is that it is publicly viewable, so I thought it would be interesting to look at some popular WordPress websites and see how they have configured theirs.
Please bear in mind that whilst I understand how to create a robots.txt file and what properties are available when creating one (e.g. Allow, Disallow, Sitemap etc.), I don't consider myself an expert on the Robots Exclusion Protocol or on search engine optimisation. What has become clear to me while researching this post is that there doesn't seem to be a right or wrong robots.txt file. Most WordPress websites block different directories from search engines, and many website owners disagree about what you should and shouldn't put in the file (or whether you should even have a robots.txt file at all).
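If you want to see how a crawler actually interprets those properties, Python's standard library ships a parser for them. Below is a quick sketch using urllib.robotparser against a made-up rule set (the paths and the "ExampleBot" agent are invented for illustration):

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, one directive per line, exactly as a crawler sees it.
rules = """\
Sitemap: http://www.example.com/sitemap.xml

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The first matching rule wins: the admin area is blocked, ordinary posts are not.
print(parser.can_fetch("ExampleBot", "/wp-admin/options.php"))  # False
print(parser.can_fetch("ExampleBot", "/2012/03/hello-world/"))  # True
```

This is also a handy way to sanity-check your own robots.txt before uploading it.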
Let's start by looking at how the official home of WordPress handles search engine robots.
WordPress.org actually has a fairly simple robots.txt file. The main focus seems to be on blocking search functionality and the RSS feed.
User-agent: *
Disallow: /search
Disallow: /support/search.php
Disallow: /extend/plugins/search.php
Disallow: /extend/themes/search.php
Disallow: /support/rss
WordPress.com sites are a little different. All WordPress.com sites have the same robots.txt file. It defines the crawl delay for the IRLbot crawler and blocks a few folders such as the sign up and activation directories.
# If you are regularly crawling WordPress.com sites please use our firehose to receive real-time push updates instead.
# Please see http://en.wordpress.com/firehose/ for more details.

Sitemap: http://wordpress.com/sitemap.xml

User-agent: IRLbot
Crawl-delay: 3600

User-agent: *
Disallow: /next/
# har har

User-agent: *
Disallow: /activate/

User-agent: *
Disallow: /signup/

User-agent: *
Disallow: /related-tags.php

# MT refugees
User-agent: *
Disallow: /cgi-bin/

User-agent: *
Disallow:
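The Crawl-delay line asks a specific crawler to wait a number of seconds between requests. Python's urllib.robotparser (Python 3.6+ for crawl_delay support) reads it like this; the rules below are a simplified slice of the WordPress.com file above, and "SomeOtherBot" is an invented agent name:

```python
from urllib.robotparser import RobotFileParser

# A simplified slice of the WordPress.com rules shown above.
rules = """\
User-agent: IRLbot
Crawl-delay: 3600

User-agent: *
Disallow: /activate/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# IRLbot gets its dedicated delay; everyone else falls through to the '*' group.
print(parser.crawl_delay("IRLbot"))                    # 3600
print(parser.can_fetch("SomeOtherBot", "/activate/"))  # False
```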
The suggested robots.txt file from WordPress hasn't been updated in years, though it is still a good indicator of which directories shouldn't be crawled by search engines.
To help me write this article I checked the robots.txt file of dozens and dozens of WordPress websites. To my surprise, the only website I found to be using the suggested robots.txt file was John Chow. I'm surprised more top blogs haven't used the example from WordPress, or at least a close variation of it.
Sitemap: http://www.example.com/sitemap.xml

# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*

# Google AdSense
User-agent: Mediapartners-Google*
Disallow:

# digg mirror
User-agent: duggmirror
Disallow: /

# global
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
Disallow: /category/*/*
Disallow: */trackback/
Disallow: */feed/
Disallow: */comments/
Disallow: /*?
Allow: /wp-content/uploads/
At the other end of the scale is the simplest robots.txt file of all, which allows search engines to crawl everything:

User-agent: *
Disallow:
Block WP-Admin and WP-Includes
One of the most common robots.txt files I came across simply blocks crawlers from the wp-admin and wp-includes folders. It's used by BuddyPress, Digging Into WordPress, WebDesignerWall, WooThemes, StudioPress and DailyBlogTips.
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
There are millions of WordPress websites online. Whilst many have the same robots.txt file, a large number of websites have created something unique for themselves.
I can't recall the original source/inspiration for the WP Mods robots.txt file. I suspect I started with an example I found online and then made some changes along the way. In the file I block folders such as wp-admin, wp-includes and wp-content, as well as all files that start with wp (e.g. wp-login.php and wp-register.php). Google Images is blocked, though AdSense is allowed, and the sitemaps generated by Yoast SEO are listed at the bottom.
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /trackback/
Disallow: /archives/
Disallow: /category/
Disallow: /tag/*
Disallow: /tag/
Disallow: /wp-*
Disallow: /login/
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.php$

User-agent: Googlebot-Image
Disallow: /

User-agent: Mediapartners-Google*
Disallow:

User-agent: ia_archiver
Disallow: /

User-agent: duggmirror
Disallow: /

Sitemap: http://www.wphub.com/category-sitemap.xml
Sitemap: http://www.wphub.com/page-sitemap.xml
Sitemap: http://www.wphub.com/post-sitemap.xml
Sitemap: http://www.wphub.com/post_tag-sitemap.xml
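Patterns such as Disallow: /*.js$ rely on wildcard extensions (the * wildcard and the $ end-anchor) that Google and other major crawlers support but that were not part of the original Robots Exclusion Protocol; notably, Python's urllib.robotparser does not implement them. The sketch below shows roughly how a crawler might evaluate such a pattern (rule_matches is a hypothetical helper, not part of any library):

```python
import re

def rule_matches(pattern, path):
    """Rough sketch of Googlebot-style pattern matching: '*' matches any
    run of characters and a trailing '$' anchors the match to the end of
    the URL path. A hypothetical helper, not a full implementation."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # turn the escaped '$' back into an end anchor
    return re.match(regex, path) is not None

print(rule_matches("/*.css$", "/wp-content/themes/style.css"))  # True
print(rule_matches("/*.css$", "/style.css?ver=2"))              # False (query string follows .css)
print(rule_matches("/wp-*", "/wp-login.php"))                   # True
```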
Mashable block the WordPress theme and plugin folders. They also block search engines from indexing the WordPress readme.html and rpc_relay.html files. I found this a little strange, as it's much easier to simply delete these files than block them (plus it's not as though hackers can't get these files by downloading WordPress anyway). They also block a lot of other HTML files and folders where ad scripts are installed.
User-agent: *
Disallow: /adcentric
Disallow: /adinterax
Disallow: /atlas
Disallow: /doubleclick
Disallow: /eyereturn
Disallow: /eyewonder
Disallow: /klipmart
Disallow: /pointroll
Disallow: /smartadserver
Disallow: /unicast
Disallow: /viewpoint
Disallow: /addineyeV2.html
Disallow: /canvas.html
Disallow: /DARTIframe.html
Disallow: /interim.html
Disallow: /oggiPlayerLoader.htm
Disallow: /videoeggbackup.html
Disallow: /facebook_xd_receiver.html
Disallow: /readme.html
Disallow: /rpc_relay.html
Disallow: /twitterlists/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
WordPress co-founder Matt Mullenweg has a simple robots.txt that allows Google AdSense and blocks search engines from his Dropbox folder, contact page, login page and admin area.
User-agent: *
Disallow:

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /dropbox
Disallow: /contact
Disallow: /blog/wp-login.php
Disallow: /blog/wp-admin
Apart from listing the Yoast-generated sitemaps, Smashing Magazine blocks all crawlers from indexing their RSS feed. Five specific crawlers are prevented from indexing anything on the site.
Sitemap: http://www.smashingmagazine.com/post-sitemap.xml
Sitemap: http://www.smashingmagazine.com/page-sitemap.xml
Sitemap: http://www.smashingmagazine.com/category-sitemap.xml
Sitemap: http://www.smashingmagazine.com/post_tag-sitemap.xml

User-agent: *
Disallow: /wp-rss.php
Disallow: /wp-rss2.php

User-agent: MSIECrawler
Disallow: /

User-agent: psbot
Disallow: /

User-agent: Fasterfox
Disallow: /

User-agent: Xenu
Disallow: /

User-agent: SiteSucker
Disallow: /
The interesting thing about the WP Beginner robots.txt is that it only sets rules for Google's crawlers. Googlebot-Image is blocked from the wp-includes folder, though AdSense is allowed. wp-content and wp-admin have been blocked, but wp-includes can still be crawled. Trackbacks, feeds and their gallery have been blocked too.
User-agent: Googlebot
Allow: /?display=wide
Disallow: /wp-content/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /feed/
Disallow: /index.php
Disallow: /*?
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: */feed/
Disallow: */trackback/
Disallow: /link.php
Disallow: /gallery2
Disallow: /gallery2/
Disallow: /refer/

User-agent: Googlebot-Image
Disallow: /wp-includes/

User-agent: Mediapartners-Google*
Disallow:
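This works because crawlers pick the most specific User-agent group that applies to them and ignore the rest, so Googlebot-only rules have no effect on other bots. A small sketch with Python's urllib.robotparser (simplified, invented rules rather than WP Beginner's full file):

```python
from urllib.robotparser import RobotFileParser

# Simplified rules: a Googlebot-only group plus a catch-all group
# that allows everything.
rules = """\
User-agent: Googlebot
Disallow: /wp-content/
Disallow: /wp-admin/

User-agent: *
Disallow:
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Googlebot obeys its own group; everyone else falls through to '*'.
print(parser.can_fetch("Googlebot", "/wp-admin/"))  # False
print(parser.can_fetch("Bingbot", "/wp-admin/"))    # True
```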
CopyBlogger have a simple robots.txt that blocks the feed and trackbacks from being indexed.
User-agent: *
Disallow: /*/feed/
Disallow: /*/trackback/
WPMU have not specified any rules for crawlers to obey. The only thing they confirm is the location of their sitemap (last year Peter Handley from MediaFlow left a comment on SEORoundTable stating that he does the same).
# BEGIN XML-SITEMAP-PLUGIN
Sitemap: http://wpmu.org/sitemap.xml.gz
# END XML-SITEMAP-PLUGIN
No Robots.txt File
The first thing that a search engine robot looks for when it visits a website is the robots.txt file, yet many WordPress websites, such as the theme store Elegant Themes and the personal blogs of Mark Jaquith and Tim Ferriss, do not have one.
Is this necessarily a bad thing? I’m not sure. A year ago Google employee JohnMu gave the following advice to someone on Webmaster Central:
I would recommend going even a bit further, and perhaps removing the robots.txt file completely. The general idea behind blocking some of those pages from crawling is to prevent them from being indexed. However, that's not really necessary — websites can still be crawled, indexed and ranked fine with pages like their terms of service or shipping information indexed (sometimes that's even useful to the user).
The main argument for not doing this seems to be that 404 errors are generated if a search engine crawler can’t find the robots.txt file.
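Because robots.txt always lives at the root of the site, it's easy to check any site yourself. A small sketch using Python's urllib (robots_url and has_robots_txt are hypothetical helper names, and the lookup assumes the site is reachable over HTTP):

```python
from urllib import error, request

def robots_url(site):
    """Build the robots.txt URL for a site; it always lives at the root."""
    return site.rstrip("/") + "/robots.txt"

def has_robots_txt(site, timeout=10):
    """Return True if the site serves a robots.txt, False on a 404
    (which crawlers treat as 'nothing is blocked')."""
    try:
        with request.urlopen(robots_url(site), timeout=timeout) as response:
            return response.status == 200
    except error.HTTPError as exc:
        if exc.code == 404:
            return False
        raise

print(robots_url("http://www.example.com/"))  # http://www.example.com/robots.txt
```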
As I noted at the beginning of this article, I am not an SEO expert, so I am not sure what the best practice is for creating a robots.txt file for your website. I think that blocking important areas of your site, such as your admin area, theme and plugin folders, and trackbacks, is a good idea. Blocking any private files or directories unique to your own website (e.g. www.site.com/mydocuments/) is recommended too.
There does seem to be a lack of consistency between websites on what robots.txt should actually block, and there are those who argue that it is not needed at all.
How much importance do you place on the robots.txt file? Feel free to share a link to your robots.txt file (or code from it) in the comments area together with your reasons for blocking specific directories and files.
Thanks for reading.