How do I use the Robots.txt "disallow" command properly for folders I don't want indexed?
-
Today's sitemap webinar made me think about the disallow feature, seems opposite of sitemaps, but it also seems both are kind of ignored in varying ways by the engines.
I don't need help semantically, I got that part. I just can't seem to find a contemporary answer about what should be blocked using the robots.txt file.
For example, I have folders containing site comps for clients that I really don't want showing up in the SERPS. Is it better to not have these folders on the domain at all?
There are also security issues I've heard of that make sense, simply look at a site's robots file to see what they are hiding. It makes it easier to hunt for files when they know the directory the files are contained in. Do I concern myself with this?
Another example is a folder I have for my xml sitemap generator. I imagine google isn't going to try to index this or count it as content, so do I need to add folders like this to the disallow list?
-
Hi,
Usin;
User-agent: *
Disallow: /folder/subfolderis fine, however if you have information stored in your website that you certainly want crawled make sure it is in your site map and use ...
User-agent: *
allow: /folder/subfolderadding a no follow attribute to all of your pages wont be practical, if a spam crawler ignores the robots.txt it will ignore your no follow attribute. If anything new occurs with robots.txt check large website's robots.txt as they always update to new trends i.e
Hope this helps:)
-
Hi Jay,
There's actually a recent similar discussion at http://www.seomoz.org/q/what-reasons-exist-to-use-noindex-robots-txt regarding deciding what to block via robots.
For site comps for clients, you could also password-protect those to help hide them, or do a different domain that you have entirely excluded in robots. I've also seen services like Basecamp used for posting comps. It all depends on how much you want to hide the comps.
You do want your sitemap itself to be crawled, but I'm presuming this is in the root directory so that shouldn't be a problem. Folders like your sitemap generator and other purely-framework folders can certainly be disallowed. Blocking the files that list the version of your website (if you're using a CMS) can help prevent people from searching for opportunities to hack that version and finding your site.
Also, just do a site:domain.com search on your domain, see what's indexed, see what content from there you don't want indexed, and use that as a starting point.
Are you running on a content management system, or a custom site? For a CMS, here are example robots.txt files for several popular CMSs. http://www.stayonsearch.com/robots-txt-guide
-
You may also want to think about slapping a robots noindex on the individual pages as well.
-
You can type the following syntax:
after User-agent: *
Disallow: /foldername/subfoldername
also, you can name your sitemaps in the robots.txt file.
They can be defined as
Sitemap: http://www.yourdomain.com/sitemap.xml
If you have multiple sitemaps, you can have multiple sitemaps listed.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Using one robots.txt for two websites
I have two websites that are hosted in the same CMS. Rather than having two separate robots.txt files (one for each domain), my web agency has created one which lists the sitemaps for both websites, like this: User-agent: * Disallow: Sitemap: https://www.siteA.org/sitemap Sitemap: https://www.siteB.com/sitemap Is this ok? I thought you needed one robots.txt per website which provides the URL for the sitemap. Will having both sitemap URLs listed in one robots.txt confuse the search engines?
Technical SEO | | ciehmoz0 -
I'm struggling to understand (and fix) why I'm getting a 404 error. The URL includes this "%5Bnull%20id=43484%5D" but I cannot find that anywhere in the referring URL. Does anyone know why please? Thanks
Can you help with how to fix this 404 error please? It appears that I have a redirect from one page to the other, although the referring page URL works, but it appears to be linking to another URL with this code at the end of the the URL - %5Bnull%20id=43484%5D that I'm struggling to find and fix. Thanks
Technical SEO | | Nichole.wynter20200 -
No: 'noindex' detected in 'robots' meta tag
I'm getting an error in Search Console that pages on my site show No: 'noindex' detected in 'robots' meta tag. However, when I inspect the pages html, it does not show noindex. In fact, it shows index, follow. Majority of pages show the error and are not indexed by Google...Not sure why this is happening. Unfortunately I can't post images on here but I've linked some url's below. The page below in search console shows the error above... https://mixeddigitaleduconsulting.com/ As does this one. https://mixeddigitaleduconsulting.com/independent-school-marketing-communications/ However, this page does not have the error and is indexed by Google. The meta robots tag looks identical. https://mixeddigitaleduconsulting.com/blog/leadership-team/jill-goodman/ Any and all help is appreciated.
Technical SEO | | Sean_White_Consult0 -
Redirect url that don't end with slash
Hi! I need to add in my .htaccess a way to redirect URL that don't end with / to the one thats end wth /.For example, I need http://example.com to redirect to http://www.example.com/. And, of course, this redirection should be only for URLs that don't end with /, so I don't get double // at the end. I already have the code to redirect the non-www to the www, but can't find a way to do the slash thing. Thanks!
Technical SEO | | arielbortz0 -
How to properly change your website's address in Webmaster Tools?
Hi There,We've launched a new website and as part of the update have changed our domain name - now we need to tell Google of the changes: Both sites were verified in Webmaster Tools From the old site's gear icon, we chose "Change of address" As part of the "Change of address" checklist Google presented, we added 301 redirects to redirect the old domain to the new one But now that the 301 redirects are in place, Google can no longer verify the old site And because it can no longer verify the old site, Google won't let us complete the change of address form How do we tell Google of the change of address in this instance - and has anyone else encountered this?CheersBen
Technical SEO | | cmscss0 -
My beta site (beta.website.com) has been inadvertently indexed. Its cached pages are taking traffic away from our real website (website.com). Should I just "NO INDEX" the entire beta site and if so, what's the best way to do this? Please advise.
My beta site (beta.website.com) has been inadvertently indexed. Its cached pages are taking traffic away from our real website (website.com). Should I just "NO INDEX" the entire beta site and if so, what's the best way to do this? Are there any other precautions I should be taking? Please advise.
Technical SEO | | BVREID0 -
Should we use "and" or "&"?
Our client has an ampersand in their brand name. The logo has "&", their url is spelled out. I'm trying to get them to standardize the use of the name for directories/listings. Should we use "and" or "&"?
Technical SEO | | vernonmack0 -
Robots.txt
Hi everyone, I just want to check something. If you have this entered into your robots.txt file: User-agent: *
Technical SEO | | PeterM22
Disallow: /fred/ This wouldn't block /fred-review/ from being crawled would it? Thanks0