How do I use the Robots.txt "disallow" command properly for folders I don't want indexed?
-
Today's sitemap webinar made me think about the disallow feature, seems opposite of sitemaps, but it also seems both are kind of ignored in varying ways by the engines.
I don't need help semantically, I got that part. I just can't seem to find a contemporary answer about what should be blocked using the robots.txt file.
For example, I have folders containing site comps for clients that I really don't want showing up in the SERPS. Is it better to not have these folders on the domain at all?
There are also security issues I've heard of that make sense, simply look at a site's robots file to see what they are hiding. It makes it easier to hunt for files when they know the directory the files are contained in. Do I concern myself with this?
Another example is a folder I have for my xml sitemap generator. I imagine google isn't going to try to index this or count it as content, so do I need to add folders like this to the disallow list?
-
Hi,
Usin;
User-agent: *
Disallow: /folder/subfolderis fine, however if you have information stored in your website that you certainly want crawled make sure it is in your site map and use ...
User-agent: *
allow: /folder/subfolderadding a no follow attribute to all of your pages wont be practical, if a spam crawler ignores the robots.txt it will ignore your no follow attribute. If anything new occurs with robots.txt check large website's robots.txt as they always update to new trends i.e
Hope this helps:)
-
Hi Jay,
There's actually a recent similar discussion at http://www.seomoz.org/q/what-reasons-exist-to-use-noindex-robots-txt regarding deciding what to block via robots.
For site comps for clients, you could also password-protect those to help hide them, or do a different domain that you have entirely excluded in robots. I've also seen services like Basecamp used for posting comps. It all depends on how much you want to hide the comps.
You do want your sitemap itself to be crawled, but I'm presuming this is in the root directory so that shouldn't be a problem. Folders like your sitemap generator and other purely-framework folders can certainly be disallowed. Blocking the files that list the version of your website (if you're using a CMS) can help prevent people from searching for opportunities to hack that version and finding your site.
Also, just do a site:domain.com search on your domain, see what's indexed, see what content from there you don't want indexed, and use that as a starting point.
Are you running on a content management system, or a custom site? For a CMS, here are example robots.txt files for several popular CMSs. http://www.stayonsearch.com/robots-txt-guide
-
You may also want to think about slapping a robots noindex on the individual pages as well.
-
You can type the following syntax:
after User-agent: *
Disallow: /foldername/subfoldername
also, you can name your sitemaps in the robots.txt file.
They can be defined as
Sitemap: http://www.yourdomain.com/sitemap.xml
If you have multiple sitemaps, you can have multiple sitemaps listed.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Disallow wildcard match in Robots.txt
This is in my robots.txt file, does anyone know what this is supposed to accomplish, it doesn't appear to be blocking URLs with question marks Disallow: /?crawler=1
Technical SEO | | AmandaBridge
Disallow: /?mobile=1 Thank you0 -
Duplicate pages with "/" and without "/"
I seem to have duplicate pages like the examples below: https://example.com https://example.com/ This is happening on 3 pages and I'm not sure why or how to fix it. The first (https://example.com) is what I want and is what I have all my canonicals set too, but that doesn't seem to be doing anything. I've also setup 301 redirects for each page with "/" to be redirected to the page without it. Doing this didn't seem to fix anything as when I use the (https://example.com/) URL it doesn't redirect to (https://example.com) like it's supposed to. This issue has been going on for some time, so any help would be much appreciated. I'm using Squarespace as the design/hosting site.
Technical SEO | | granitemountain0 -
Robots.txt on http vs. https
We recently changed our domain from http to https. When a user enters any URL on http, there is an global 301 redirect to the same page on https. I cannot find instructions about what to do with robots.txt. Now that https is the canonical version, should I block the http-Version with robots.txt? Strangely, I cannot find a single ressource about this...
Technical SEO | | zeepartner0 -
Duplicate Titles Aren't Actually Duplicate
I am seeing duplicate title errors, but when I go to fix the problem, the titles are not actually identical. Any advice? Becky
Technical SEO | | Becky_Converge0 -
Robots.txt best practices & tips
Hey, I was wondering if someone could give me some advice on whether I should block the robots.txt file from the average user (not from googlebot, yandex, etc)? If so, how would I go about doing this? With .htaccess I'm guessing - but not an expert. What can people do with the information in the file? Maybe someone can give me some "best practices"? (I have a wordpress based website) Thanks in advance!
Technical SEO | | JonathanRolande0 -
Duplicate Meta Descriptions From Pages That Don't Exist
Hi Guys I am hoping someone can help me out here. I have had a new site built with a unique theme and using wordpress as the CMS. Everything was going fine but after checking webmaster tools today I noticed something that I just cannot get my head around. Basically I am getting warnings of Duplicate page warnings on a couple of things. 1 of which i think i can understand but do not know how to get the warning to go. Firstly I get this warning of duplicate meta desciption url 1: / url 2: /about/who-we-are I understand this as the who-we-are page is set as the homepage through the wordpress reading settings. But is there a way to make the dup meta description warning disappear The second one I am getting is the following: /services/57/ /services/ Both urls lead to the same place although I have never created the services/57/ page the services/57/ page does not show on the xml sitemap but Google obviously see it because it is a warning in webmaster tools. If I press edit on services/57/ page it just goes to edit the /services/ page/ is there a way I can remove the /57/ page safely or a method to ensure Google at least does not see this. Probably a silly question but I cannot find a real comprehensive answer to sorting this. Thanks in advance
Technical SEO | | southcoasthost0 -
Sitemap coming up in Google's index?
I apologize if this question's answer is glaringly obvious, but I was using Google to view all the pages it has indexed of our site--by searching for our company and then clicking the link that says to display more results for the site. On page three, it has the sitemap indexed as if it wee just another page of our site. <cite>www.stadriemblems.com/sitemap.xml</cite> Is this supposed to happen?
Technical SEO | | UnderRugSwept0 -
Why is this url showing as "not crawled" on opensiteexplorer, but still showing up in Google's index?
The below url is showing up as "not crawled" on opensitexplorer.com, but when you google the title tag "Joel Roberts, Our Family Doctors - Doctor in Clearwater, FL" it is showing up in the Google index. Can you explain why this is happening? Thank you http://doctor.webmd.com/physician_finder/profile.aspx?sponsor=core&pid=14ef09dd-e216-4369-99d3-460aa3c4f1ce
Technical SEO | | nicole.healthline0