Trying to reduce pages crawled to within 10K limit via robots.txt

AspenFasteners

Our site has far more pages than our 10K-page PRO account allows, and most of them are not SEO-worthy. In fact, only about 2,000 pages have SEO value. Limitations of the store software mean robots.txt is the only tool I have for sculpting the rogerbot site crawl, and I am having trouble getting it to work. Our biggest problem is the 35K individual product pages and the related shopping cart links (at least another 35K); these aren't needed because they duplicate the SEO-worthy content in the product category pages.

The signature of a product page is that it is contained within a folder ending in -p. So I made the following addition to robots.txt:

User-agent: rogerbot
Disallow: /-p/

However, the latest crawl results show the 10K limit is still being exceeded. I went to Crawl Diagnostics and clicked on Export Latest Crawl to CSV. To my dismay I saw the report was overflowing with product page links:

e.g. www.aspenfasteners.com/3-Star-tm-Bulbing-Type-Blind-Rivets-Anodized-p/rv006-316x039354-coan.htm

The value in the column "Search Engine blocked by robots.txt" is FALSE. Does this mean "blocked for all search engines"? If so, FALSE is correct. If it means "blocked for rogerbot", then the page shouldn't even be in the report, as the report seems to contain only 10K pages.

Any thoughts or hints on attaining my goal would REALLY be appreciated; I've been trying for weeks now. Honestly - virtual beers for everyone!

Carlo

andresgmontero

Wow, thank you! Many of the robots.txt testers still show them as disallowed. Good to know, thank you!

AspenFasteners

Hi Andres!

Sorry, I thought I answered this earlier. If I understand correctly, wildcards ARE allowed, according to this reply to my question on the topic: http://www.seomoz.org/q/does-rogerbot-read-url-wildcards-in-robots-txt

Hope THIS reply sticks this time!

andresgmontero

Hi, as far as I know, wildcard characters (like "*") are not allowed there; each line must be an allow statement, a disallow statement, a comment, or a blank line. So before you get angry at Roger for not listening to you, go to Google Webmaster Tools > Crawler Access and test the robots.txt file. Hope it works.
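
For anyone who wants to test a rule locally before waiting for the next crawl, here is a minimal sketch in Python. It is not rogerbot's actual matcher: it assumes Google-style semantics, where a rule is matched from the start of the URL path, "*" matches any run of characters, and a trailing "$" anchors the end. Whether rogerbot honors "*" is exactly what the linked reply addresses, so treat the wildcard case as an assumption. The function name rule_matches is made up for this example; the path is the product URL quoted in the question.

```python
import re


def rule_matches(rule: str, path: str) -> bool:
    """Return True if a Disallow rule matches the URL path.

    Assumed semantics (Google-style, not confirmed for rogerbot):
    the rule is compared from the start of the path, '*' matches any
    run of characters, and a trailing '$' anchors the end of the path.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        pattern += "$"
    # re.match only matches at the beginning of the string (prefix match).
    return re.match(pattern, path) is not None


# Path of the product page quoted in the question.
PRODUCT_PATH = (
    "/3-Star-tm-Bulbing-Type-Blind-Rivets-Anodized-p/"
    "rv006-316x039354-coan.htm"
)

print(rule_matches("/-p/", PRODUCT_PATH))   # False: the path does not start with /-p/
print(rule_matches("/*-p/", PRODUCT_PATH))  # True: the wildcard covers the folder name
```

Under these assumptions, "Disallow: /-p/" only blocks URLs whose path literally begins with /-p/, which would explain why the product pages still appear in the crawl export, while "Disallow: /*-p/" would cover any top-level folder ending in -p.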
