Robots.txt best practices & tips
-
Hey,
I was wondering if someone could give me some advice on whether I should block the robots.txt file from the average user (not from googlebot, yandex, etc)?
If so, how would I go about doing this? With .htaccess I'm guessing - but not an expert.
What can people do with the information in the file? Maybe someone can give me some "best practices"? (I have a wordpress based website)
Thanks in advance!
-
Asking about the ideal configuration for a robots.txt file for WordPress is opening a huge can of worms There's plenty of discussion and disagreement about exactly what's best, but a lot of it depends on the actual configuration and goals of your own website. That's too long a discussion to get into here, but below is what I can recommend as a pretty basic, failsafe version that should work for most sites:
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/Sitemap: http://www.yoursite.com/sitemap.xml
I always prefer to explicitly declare the location of my site map, even if it's in the default location.
There are other directives you can include, but they depend more on how you have handled other aspects of your website - e.g. trackbacks, comments and search results pages as well as feeds. This is where the list can get grey, as there are multiple ways to accomplish this, depending how your site is optimised, but here's a representative example.
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
Disallow: /category//
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
Disallow: /?
Disallow: /?Sorry I can't be more specific on the above example, but it's where things really come down to how you're managing your specific site, and are a much bigger discussion. A web search for "best WordPress robots.txt file" will certainly show you the range of opinions on this.
The key thing to remember with a robots.txt file is that it does not cause blocked URLs to be removed from the index, it only stops the crawlers from traversing those pages. It's designed to help the crawlers spend their time on the pages that you have declared useful, instead of wasting their time on pages that are more administrative in nature. A crawler has a limited amount of time to spend on your site, and you want it to spend that time looking at the valuable pages, not the backend.
Paul
-
Thanks for the detailed answer Paul!
Do you think there is anything I should block for a wordpress website? I blocked /admin.
-
There is really no reason to block the robots.txt file from human users, Jazy. They'll never see it unless they actively go looking for it, and even if they do, it's just directives for where you want the search crawlers to go and where you want them to stay away from.
The only thing a human user will learn from this, is what sections of your site you consider to be nonessential to a search crawler. Even without the robots file, if they were really interested in this information, they could acquire it in other ways.
If you're trying to use your robots.txt file to block information about pages on your website you want to keep private or you don't want anyone to know about, doing it in the robots.txt file is the wrong place anyway. (That's done in .htaccess, which should be blocked from human readers.)
There's enough complexity to managing a website, there's no reason to add more by trying to block your robots file from human users.
Hope that helps?
Paul
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
SEO & IFrame problem
Hi All, I will try and keep this as simple as possible. My product page links to a separate page with an IFrame, giving my users the option to upload artwork for the product. The IFrame contains the external file upload site (mail big file). When finished, the user can use a button link to return to the product page to continue with their order. As soon as the page with the IFrame was crawled by Google, the IFrame page started to rank in place of where my product page used to rank, yet there is no content on the page relating to my product (just a file upload). So now users are visiting the IFrame via the same query which must be an absolute headache and not useful at all. I have tried the following: 1. Added a line in body text which contains an internal link pointing to the product page using exact match anchor text for the query. (This didn't work) 2. I applied a no index tag to the IFrame, and now my product page is no longer ranking at all. Can anyone help me solve this puzzle. I believe I might be missing something. Kind regards, Adam
Technical SEO | | SO_UK0 -
Will a Robots.txt 'disallow' of a directory, keep Google from seeing 301 redirects for pages/files within the directory?
Hi- I have a client that had thousands of dynamic php pages indexed by Google that shouldn't have been. He has since blocked these php pages via robots.txt disallow. Unfortunately, many of those php pages were linked to by high quality sites mulitiple times (instead of the static urls) before he put up the php 'disallow'. If we create 301 redirects for some of these php URLs that area still showing high value backlinks and send them to the correct static URLs, will Google even see these 301 redirects and pass link value to the proper static URLs? Or will the robots.txt keep Google away and we lose all these high quality backlinks? I guess the same question applies if we use the canonical tag instead of the 301. Will the robots.txt keep Google from seeing the canonical tags on the php pages? Thanks very much, V
Technical SEO | | Voodak0 -
What are some best practices for optimizing alternate versions of a brand name?
What are the best methods for ensuring that the correct spelling/formatting of a brand name rank in the SERP when an alternate formatting/spelling of the brand name is searched. Take for example the brand name (made up for example purposes), "SuperFry". Many customers search using the term "Super Fry" (with a space). To make things worse, not only does Google not return the brand name SuperFry, but it also auto corrects to another brand name "Super-Fri". Is there a common best practice to ensure the customer finds the intended brand name when they simply add a space in the search term? I assume a quick fix would be to create an ad words campaign for the alternate spellings/formatting. What about an organic solution? Perhaps we could create a special page talking about the alternate ways to spell the brand name? Would this solution send mixed signals to Google and potential hurt the over all rankings? Thanks much for any advice!
Technical SEO | | Vspeed0 -
Google indexing despite robots.txt block
Hi This subdomain has about 4'000 URLs indexed in Google, although it's blocked via robots.txt: https://www.google.com/search?safe=off&q=site%3Awww1.swisscom.ch&oq=site%3Awww1.swisscom.ch This has been the case for almost a year now, and it does not look like Google tends to respect the blocking in http://www1.swisscom.ch/robots.txt Any clues why this is or what I could do to resolve it? Thanks!
Technical SEO | | zeepartner0 -
Robots.txt file
How do i get Google to stop indexing my old pages and start indexing my new pages even months down the line? Do i need to install a Robots.txt file on each page?
Technical SEO | | gimes0 -
Block or remove pages using a robots.txt
I want to use robots.txt to prevent googlebot access the specific folder on the server, Please tell me if the syntax below is correct User-Agent: Googlebot Disallow: /folder/ I want to use robots.txt to prevent google image index the images of my website , Please tell me if the syntax below is correct User-agent: Googlebot-Image Disallow: /
Technical SEO | | semer0 -
Robots.txt questions...
All, My site is rather complicated, but I will try to break down my question as simply as possible. I have a robots.txt document in the root level of my site to disallow robot access to /_system/, my CMS. This looks like this: # /robots.txt file for http://webcrawler.com/
Technical SEO | | Horizon
# mail webmaster@webcrawler.com for constructive criticism **User-agent: ***
Disallow: /_system/ I have another robots.txt file in another level down, which is my holiday database - www.mysite.com/holiday-database/ - this is to disallow access to /holiday-database/ControlPanel/, my database CMS. This looks like this: **User-agent: ***
Disallow: /ControlPanel/ Am I correct in thinking that this file must also be in the root level, and not in the /holiday-database/ level? If so, should my new robots.txt file look like this: # /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism **User-agent: ***
Disallow: /_system/
Disallow: /holiday-database/ControlPanel/ Or, like this: # /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism **User-agent: ***
Disallow: /_system/
Disallow: /ControlPanel/ Thanks in advance. Matt0 -
Site not being Indexed that fast anymore, Is something wrong with this Robots.txt
My wordpress site's robots.txt used to be this: User-agent: * Disallow: Sitemap: http://www.domainame.com/sitemap.xml.gz I also have all in one SEO installed and other than posts, tags are also index,follow on my site. My new posts used to appear on google in seconds after publishing. I changed the robots.txt to following and now post indexing takes hours. Is there something wrong with this robots.txt? User-agent: * Disallow: /cgi-bin Disallow: /wp-admin Disallow: /wp-includes Disallow: /wp-content/plugins Disallow: /wp-content/cache Disallow: /wp-content/themes Disallow: /wp-login.php Disallow: /wp-login.php Disallow: /trackback Disallow: /feed Disallow: /comments Disallow: /author Disallow: /category Disallow: */trackback Disallow: */feed Disallow: */comments Disallow: /login/ Disallow: /wget/ Disallow: /httpd/ Disallow: /*.php$ Disallow: /? Disallow: /*.js$ Disallow: /*.inc$ Disallow: /*.css$ Disallow: /*.gz$ Disallow: /*.wmv$ Disallow: /*.cgi$ Disallow: /*.xhtml$ Disallow: /? Disallow: /*?Allow: /wp-content/uploads User-agent: TechnoratiBot/8.1 Disallow: ia_archiverUser-agent: ia_archiver Disallow: / disable duggmirror User-agent: duggmirror Disallow: / allow google image bot to search all imagesUser-agent: Googlebot-Image Disallow: /wp-includes/ Allow: /* # allow adsense bot on entire siteUser-agent: Mediapartners-Google* Disallow: Allow: /* Sitemap: http://www.domainname.com/sitemap.xml.gz
Technical SEO | | ideas1230