Robots.txt

TomKing

Hi all,

Happy New Year!

I want to block certain pages on our site because my Moz Crawl Report flags them as duplicate content. That isn't strictly true; it has more to do with the problems you face when using a CMS...

Here are some examples of the pages I want to block and underneath will be what I believe to be the correct robots.txt entry...

http://www.XYZ.com/forum/index.php?app=core&module=search&do=viewNewContent&search_app=members&search_app_filters[forums][searchInKey]=&period=today&userMode=&followedItemsOnly=

Disallow: /forum/index.php?app=core&module=search

http://www.XYZ.com/forum/index.php?app=core&module=reports&rcom=gallery&imageId=980&ctyp=image

Disallow: /forum/index.php?app=core&module=reports

http://www.XYZ.com/forum/index.php?app=forums&module=post&section=post&do=reply_post&f=146&t=741&qpid=13308

Disallow: /forum/index.php?app=forums&module=post

http://www.XYZ.com/forum/gallery/sizes/182-promenade/small/

http://www.XYZ.com/forum/gallery/sizes/182-promenade/large/

Disallow: /forum/gallery/sizes/
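A quick way to sanity-check plain prefix rules like the ones above is a few lines of Python. This is only a sketch: it assumes a rule without wildcards matches when the URL's path plus query string starts with the rule text, and the helper names (path_and_query, is_blocked) are made up for illustration. It does not replace a search engine's own robots.txt tester.

```python
from urllib.parse import urlsplit

# The four Disallow values proposed above (no wildcards, plain prefixes).
DISALLOW = [
    "/forum/index.php?app=core&module=search",
    "/forum/index.php?app=core&module=reports",
    "/forum/index.php?app=forums&module=post",
    "/forum/gallery/sizes/",
]

# The example URLs from the question.
EXAMPLES = [
    "http://www.XYZ.com/forum/index.php?app=core&module=search&do=viewNewContent&search_app=members&search_app_filters[forums][searchInKey]=&period=today&userMode=&followedItemsOnly=",
    "http://www.XYZ.com/forum/index.php?app=core&module=reports&rcom=gallery&imageId=980&ctyp=image",
    "http://www.XYZ.com/forum/index.php?app=forums&module=post&section=post&do=reply_post&f=146&t=741&qpid=13308",
    "http://www.XYZ.com/forum/gallery/sizes/182-promenade/small/",
    "http://www.XYZ.com/forum/gallery/sizes/182-promenade/large/",
]


def path_and_query(url):
    """Return the part of the URL that robots.txt rules are compared against."""
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")


def is_blocked(url, rules=DISALLOW):
    """True if any rule is a prefix of the URL's path + query string."""
    target = path_and_query(url)
    return any(target.startswith(rule) for rule in rules)


if __name__ == "__main__":
    for url in EXAMPLES:
        print("blocked" if is_blocked(url) else "allowed", url)
```

Because the comparison is a prefix match on the literal string, the order of the query parameters matters: a URL that puts module= before app= would not be caught by these rules.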

Any help / advice would be much appreciated.

Many thanks

Andy

Travis_Bailey

You may be better off just doing a pattern match if your CMS generates a lot of junk URLs. You could save yourself a lot of time and heartache with the following:

User-agent: *
Disallow: /*?

That will block everything with a ? in the string. So yeah, use with caution - as always.

If you're quite certain you want to block access to the image sizes subdirectory, you may use:

User-agent: *
Disallow: /*/sizes/

More on all of that fun from Google and SEO Book.

Robots.txt is almost as unforgiving as .htaccess, especially once you start pattern matching. Make sure to test everything thoroughly before you push to a live environment. For serious. You have been warned.
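To make the "test everything" advice concrete, here is a minimal Python sketch of wildcard matching. It assumes the common convention that * matches any run of characters, a trailing $ anchors the end of the URL, and every rule is compared from the start of the path. The function names (rule_to_regex, matches) are hypothetical, and real crawlers may differ in the details, so treat it as a rough local check rather than the final word.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches any sequence of characters and a
    trailing '$' anchors the match at the end of the URL.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces, then join them with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(rule, path):
    """True if the rule matches the path (plus query) from its first character."""
    return rule_to_regex(rule).match(path) is not None


if __name__ == "__main__":
    checks = [
        ("/*?", "/forum/index.php?app=core&module=search"),
        ("/*?", "/forum/gallery/sizes/182-promenade/small/"),
        ("/*/sizes/", "/forum/gallery/sizes/182-promenade/small/"),
        ("/sizes*/", "/forum/gallery/sizes/182-promenade/small/"),
    ]
    for rule, path in checks:
        print(rule, path, matches(rule, path))
```

Note that a rule beginning with /sizes only matches paths that themselves start with /sizes. To reach a directory nested under /forum/gallery/ the pattern needs a leading wildcard or the full path.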

Google WMT and Bing WMT also provide parameter handling tools, where you tell Bing and/or Google to have their bots ignore URLs with the parameter(s) you select. If you wanted to handle it that way, it looks like ignoring the app= parameter should do the trick for most of your expressed concerns.

Good luck! explosions in the distance XD

TomKing

Thanks DC1611, I will look into the other options but I have hundreds (and I mean hundreds) of examples that I would need to investigate...

Andy

DirkC

You can quite easily check whether these rules work using Google Webmaster Tools (Crawl section > robots.txt Tester).
In the test tool you can enter the rules and check whether they block Googlebot from crawling these pages. I tried a few of the examples you gave and they seem to work.

Apart from updating your robots.txt (which seems quite a radical solution), you could also consider implementing canonical URLs for these duplicate URLs.

Another alternative is to configure URL parameters in Google Webmaster Tools (also in the Crawl section), where you can indicate which parameters need to be ignored.
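To get a feel for which parameters are worth ignoring, you can check locally which URLs would collapse into one once a parameter is dropped. The Python sketch below is purely illustrative: the function name drop_params is made up, the choice of qpid as the ignorable parameter is an assumption for the example (the second URL with qpid=13309 is hypothetical), and it says nothing about how the search engines implement the setting internally.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_params(url, ignored):
    """Return the URL with every query parameter named in `ignored` removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    # First URL is from the question; the second differs only in qpid (hypothetical).
    a = "http://www.XYZ.com/forum/index.php?app=forums&module=post&section=post&do=reply_post&f=146&t=741&qpid=13308"
    b = "http://www.XYZ.com/forum/index.php?app=forums&module=post&section=post&do=reply_post&f=146&t=741&qpid=13309"
    print(drop_params(a, {"qpid"}))
    print(drop_params(a, {"qpid"}) == drop_params(b, {"qpid"}))
```

Running a crawl export through something like this and counting how many URLs share the same stripped form gives a rough idea of how much duplication a given parameter is responsible for.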
