Robots.txt best practices & tips
-
Hey,
I was wondering if someone could give me some advice on whether I should block the robots.txt file from the average user (not from Googlebot, Yandex, etc.)?
If so, how would I go about doing this? With .htaccess, I'm guessing, but I'm not an expert.
What can people do with the information in the file? Maybe someone can give me some "best practices"? (I have a WordPress-based website.)
Thanks in advance!
-
Asking about the ideal configuration for a robots.txt file for WordPress is opening a huge can of worms. There's plenty of discussion and disagreement about exactly what's best, but a lot of it depends on the actual configuration and goals of your own website. That's too long a discussion to get into here, but below is what I can recommend as a pretty basic, failsafe version that should work for most sites:
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/

Sitemap: http://www.yoursite.com/sitemap.xml
I always prefer to explicitly declare the location of my site map, even if it's in the default location.
There are other directives you can include, but they depend more on how you have handled other aspects of your website - e.g. trackbacks, comments and search results pages, as well as feeds. This is where the list can get grey, as there are multiple ways to accomplish these things, depending on how your site is optimised, but here's a representative example:
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
Disallow: /category/*/*
Disallow: */trackback/
Disallow: */feed/
Disallow: */comments/
Disallow: /*?*
Disallow: /*?

Sorry I can't be more specific on the above example, but this is where things really come down to how you're managing your specific site, and it's a much bigger discussion. A web search for "best WordPress robots.txt file" will certainly show you the range of opinions on this.
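A quick note on the wildcard lines above: the asterisk patterns rely on the extended matching that major crawlers like Googlebot and Bingbot support (* matches any sequence of characters), which isn't part of the original robots.txt standard, so smaller crawlers may ignore those lines. Roughly, they match like this (the example URLs are made up for illustration):

Disallow: /*?*          # any URL containing a query string, e.g. /my-post/?replytocom=42
Disallow: /category/*/* # nested category archives, e.g. /category/news/local/
Disallow: */trackback/  # trackback URLs at any depth, e.g. /my-post/trackback/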
The key thing to remember with a robots.txt file is that it does not cause blocked URLs to be removed from the index; it only stops the crawlers from traversing those pages. It's designed to help the crawlers spend their time on the pages you have declared useful, instead of wasting it on pages that are more administrative in nature. A crawler has a limited amount of time to spend on your site, and you want it to spend that time looking at the valuable pages, not the backend.
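If you actually need a URL dropped from the index, rather than just left uncrawled, the usual tool is a noindex directive instead: either a meta robots tag in the page itself, or an X-Robots-Tag response header. As a rough sketch of the header approach on an Apache server (the filename is just a placeholder, and it assumes the mod_headers module is enabled):

# In .htaccess: send a noindex header for one specific file (requires mod_headers)
<FilesMatch "^private-report\.html$">
Header set X-Robots-Tag "noindex"
</FilesMatch>

One gotcha: a crawler has to be able to fetch the page to see the noindex at all, so don't also disallow that URL in robots.txt.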
Paul
-
Thanks for the detailed answer, Paul!
Do you think there is anything I should block for a WordPress website? I blocked /admin.
-
There is really no reason to block the robots.txt file from human users, Jazy. They'll never see it unless they actively go looking for it, and even if they do, it's just directives for where you want the search crawlers to go and where you want them to stay away from.
The only thing a human user will learn from this is what sections of your site you consider to be nonessential to a search crawler. Even without the robots file, if they were really interested in this information, they could acquire it in other ways.
If you're trying to use your robots.txt file to hide pages on your website that you want to keep private or don't want anyone to know about, robots.txt is the wrong place for that anyway. (That's done in .htaccess, which should be blocked from human readers.)
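For example, if there's a directory you genuinely want kept private, a password prompt via Apache basic authentication is the kind of thing that belongs in .htaccess. A minimal sketch, assuming an Apache server - the paths and names here are placeholders you'd adjust for your own setup:

# .htaccess placed inside the directory you want to protect
AuthType Basic
AuthName "Private area"
# Full server path to a password file created with the htpasswd utility
AuthUserFile /home/yoursite/.htpasswd
Require valid-user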
There's enough complexity to managing a website already; there's no reason to add more by trying to block your robots file from human users.
Hope that helps?
Paul