The robots.txt file is a very powerful file if you’re working on a site’s SEO. At the same time, it has to be used with care. It allows you to deny search engines access to certain files and folders, but that’s very often not what you want to do. Over the years, Google in particular has changed a lot in how it crawls the web, so old best practices are no longer valid. This post explains what the new best practices are and why.
Google fully renders your site
No longer is Google the dumb little kid that just fetches your site’s HTML and ignores your styling and JavaScript. It fetches everything and renders your pages completely. This means that when you deny Google access to your CSS or JavaScript files, it doesn’t like that at all. This post about Google Panda 4 shows an example of this.
The old best practice of having a robots.txt that blocks access to your wp-includes directory and your plugins directory is no longer valid. This is why, in WordPress 4.0, I opened the issue and wrote the patch to remove wp-includes/.* from the default WordPress robots.txt.
A lot of themes also use asynchronous JavaScript requests, so-called AJAX, to add content to the page. By default, WordPress used to block these. So I created the ticket for WordPress to allow Google to crawl the admin-ajax.php URL in wp-admin. This was fixed in WordPress 4.4.
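If you want to see what rules like these do to your assets, you can test them offline. The following is a minimal sketch using Python’s standard urllib.robotparser module. The rules and the asset paths are made-up examples (not an actual WordPress default), and Python’s parser uses simple prefix matching in file order, so it will not always agree with how Google evaluates Allow and Disallow rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical old-style rules of the kind described above
# (illustrative only, not an actual WordPress default).
OLD_STYLE_RULES = """\
User-agent: *
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
"""

# Hypothetical asset paths a theme or plugin might load.
ASSET_PATHS = [
    "/wp-includes/js/jquery/jquery.js",
    "/wp-content/plugins/example-plugin/style.css",
    "/wp-content/themes/example-theme/style.css",
]


def blocked_assets(robots_txt, paths, user_agent="Googlebot"):
    """Return the paths that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(user_agent, path)]


if __name__ == "__main__":
    for path in blocked_assets(OLD_STYLE_RULES, ASSET_PATHS):
        print("blocked:", path)
```

Any path this reports as blocked is a file Google may not be able to fetch when it renders the page.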
Robots.txt denies links their value
Something else is very important to keep in mind. If you block a URL with your site’s robots.txt, search engines will not crawl that page. This also means that they cannot distribute the link value pointing at that URL. So if you have a section of your site that you’d rather not have showing in the search results, but that does get a lot of links, don’t use the robots.txt file. Instead, use a robots meta tag with a value of noindex, follow. This allows search engines to properly distribute the link value for those pages across your site.
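To check that a page actually carries that tag, you could read its robots meta directives with a few lines of Python. This is a rough sketch using the standard html.parser module; the class name, the function name and the sample page are hypothetical, and it only looks at meta tags named robots, not at bot-specific ones.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Split "noindex, follow" into separate lowercase directives.
            self.directives.extend(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    """Return the robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical page that should stay out of the results but still pass on link value.
SAMPLE_PAGE = """<html><head>
<meta name="robots" content="noindex, follow">
</head><body><a href="/other-page/">Other page</a></body></html>"""

if __name__ == "__main__":
    print(robots_directives(SAMPLE_PAGE))
```

Keep in mind that a search engine can only see this tag if it is allowed to crawl the page, which is exactly why the page should not also be blocked in robots.txt.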
Our WordPress robots.txt example
So, what should be in your WordPress robots.txt? Ours is very clean now. We no longer block anything! We don’t block our /wp-content/plugins/ directory, as plugins might output JavaScript or CSS that Google needs to render the page. We also do not block our /wp-includes/ directory, as the default JavaScript files that come with WordPress, which many themes use, come from this directory.
We also do not block our /wp-admin/ folder. The reason is simple: if you block it, but link to it somewhere by chance, people will still be able to do a simple [inurl:wp-admin] query in Google and find your site. This is the type of query malicious hackers love to run. If you don’t do anything, WordPress sends (by my doing) an X-Robots-Tag HTTP header on the admin pages that prevents search engines from showing these pages in the search results, a much cleaner solution.
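If you’re curious whether a response carries such a header, the check itself is small. Here is a sketch in Python that inspects a dictionary of response headers; the function name and the sample headers are hypothetical, and it ignores the bot-specific form of the header (a value prefixed with a crawler name).

```python
def header_blocks_indexing(headers):
    """Return True if an X-Robots-Tag header in `headers` contains noindex."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() != "x-robots-tag":
            continue
        directives = {part.strip().lower() for part in value.split(",")}
        if "noindex" in directives:
            return True
    return False


# Hypothetical response headers, for illustration only.
SAMPLE_HEADERS = {
    "Content-Type": "text/html; charset=UTF-8",
    "X-Robots-Tag": "noindex",
}

if __name__ == "__main__":
    print(header_blocks_indexing(SAMPLE_HEADERS))
```

You would feed it the headers of a real response, for example the ones your browser’s developer tools show for an admin URL.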
What you should do with your robots.txt
You should log into Google Search Console and, under Crawl → Fetch as Google, use the Fetch and Render option.
If the result doesn’t look like what you see when you browse your site, or it throws errors or notices, fix them by removing the lines in your robots.txt file that block access to those URLs.
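Finding the offending lines by hand can be tedious in a long file. The sketch below, in Python, lists the Disallow lines whose path is a prefix of a blocked resource’s URL. It is a deliberate simplification: the function name and sample data are hypothetical, and it ignores user-agent groups, Allow rules and the * and $ wildcards, so treat its output as a list of candidates to review.

```python
from urllib.parse import urlparse


def blocking_lines(robots_txt, url):
    """Return (line_number, line) pairs for Disallow rules that prefix the URL path."""
    path = urlparse(url).path or "/"
    matches = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() != "disallow":
            continue
        rule = value.strip()
        # An empty Disallow blocks nothing, so skip it.
        if rule and path.startswith(rule):
            matches.append((number, raw.strip()))
    return matches


# Hypothetical robots.txt and blocked resource, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
"""
BLOCKED_URL = "https://example.com/wp-includes/js/jquery/jquery.js"

if __name__ == "__main__":
    for number, line in blocking_lines(SAMPLE_ROBOTS_TXT, BLOCKED_URL):
        print(f"line {number}: {line}")
```

After removing a line, run Fetch and Render again to confirm the page now renders the way you see it in your browser.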
Should you link to your XML Sitemap from your robots.txt?
We’ve always felt linking to your XML sitemap from your robots.txt is a bit nonsense. You should be adding it manually to your Google and Bing Webmaster Tools and make sure you look at their feedback about your XML sitemap. This is the reason our Yoast SEO plugin doesn’t add it to your robots.txt. Don’t rely on search engines to find out about your XML sitemap through your robots.txt.
Read more: ‘Google Panda 4, and blocking your CSS & JS’ »
from Yoast • The Art & Science of Website Optimization https://yoast.com/wordpress-robots-txt-example/