Robots.txt Syntax

DRSearchEngOpt

I have been having a hard time finding any decent information regarding the robots.txt syntax that has been written in the last few years and I just want to verify some things as a review for myself. I have many occasions where I need to block particular directories in the URL, parameters and parameter values. I just wanted to make sure that I am doing this in the most efficient ways possible and thought you guys could help.

So let's say I want to block a particular directory called "this" and this would be an example URL:

www.domain.com/folder1/folder2/this/file.html
or
www.domain.com/folder1/this/folder2/file.html

In order for me to block any URL that contains this folder anywhere in the URL I would use:

User-agent: *
Disallow: /this/

Now lets say I have a parameter "that" I want to block and sometimes it is the first parameter and sometimes it isn't when it shows up in the URL. Would it look like this?

User-agent: *
Disallow: ?that=
Disallow: &that=

What about if there is only one value I want to block for "that" and the value is "NotThisGuy":

User-agent: *
Disallow: ?that=NotThisGuy
Disallow: &that=NotThisGuy

My big questions here are what are the most efficient ways to block a particular parameter and block a particular parameter value. Is there a more efficient way to deal with ? and & for when the parameter and value are either first or later? Secondly is there a list somewhere that will tell me all of the syntax and meaning that can be used for a robots.txt file?

Thanks!

MichaelC-15022

My advice is to go easy with robots.txt--it's a bit like dynamite, powerful, but can take your leg (or entire website) off.

I like this checker:

http://tool.motoricerca.info/robots-checker.phtml

If you look ok after running that checker, then use the built-in Google one.

Note that robots.txt syntax DOES NOT have wildcards. Apparently this doesn't stop a ton of people from using wildcards in them (to no effect, and clearly they didn't bother to test!).

Another reason to avoid disallow in robots.txt is that if you disallow the engines from looking at a page's contents, then you're ALSO stopping the link juice that might have flowed to other pages it links to.

So let's say you have 100 pages on your site that you're currently blocking with disallow in robots.txt. If instead, you put a meta robots "noindex,follow" in each of those pages, then every page linked to from those 100 pages (i.e. everything in your main menu) would get an extra 100 internal links worth of link juice.

Welcome to the Q&A Forum

Browse the forum for helpful insights and fresh discussions about all things SEO.

Robots.txt Syntax

Got a burning SEO question?

Browse Questions

Explore more categories

Related Questions

Robots.txt question

I have two sitemaps which partly duplicate - one is blocked by robots.txt but can't figure out why!

Meta canonical or simply robots.txt other domain names with same content?

Best practices for robotx.txt -- allow one page but not the others?

Could you use a robots.txt file to disalow a duplicate content page from being crawled?

Using 2 wildcards in the robots.txt file

Reciprocal Links and nofollow/noindex/robots.txt

Blocking Dynamic URLs with Robots.txt