Trying to reduce pages crawled to within 10K limit via robots.txt
-
Our site has far too many pages for our 10K page PRO account which are not SEO worthy. In fact, only about 2000 pages qualify for SEO value. Limitations of the store software only permit me to use robots.txt to sculpt the rogerbot site crawl. However, I am having trouble getting this to work. Our biggest problem is the 35K individual product pages and the related shopping cart links (at least another 35K); these aren't needed as they duplicate the SEO-worthy content in the product category pages.
The signature of a product page is that it is contained within a folder ending in -p. So I made the following addition to robots.txt:
User-agent: rogerbot
Disallow: /-p/However, the latest crawl results show the 10K limit is still being exceeded. I went to Crawl Diagnostics and clicked on Export Latest Crawl to CSV. To my dismay I saw the report was overflowing with product page links:
e.g. www.aspenfasteners.com/3-Star-tm-Bulbing-Type-Blind-Rivets-Anodized-p/rv006-316x039354-coan.htm
The value for the column "Search Engine blocked by robots.txt" = FALSE; does this mean blocked for all search engines? Then it's correct. If it means "blocked for rogerbot? Then it shouldn't even be in the report, as the report seems to only contain 10K pages.
Any thoughts or hints on trying to attain my goal would REALLY be appreciated, I've been trying for weeks now. Honestly - virtual beers for everyone!
Carlo
-
Wow! thank you, many of the robots.txt testers still show them as disallow, good to know! thank you!
-
Hi Andres!
Sorry, I thought I answered this earlier. If I understand correctly wildcards ARE allowed, according to this reply to my question on the topic: http://www.seomoz.org/q/does-rogerbot-read-url-wildcards-in-robots-txt
Hope THIS reply sticks this time!
-
Hi, as far as I know wildcard characters (like "*") are not allowed there, the line must be an allow, disallow, comment or a blank line statement, so before you get angry at Roger for not listening to you, go to Google Webmaster Tools > Crawler Access and test the robots.txt file. Hope it works.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Site Crawl -> Duplicate Page Content -> Same pages showing up with duplicates that are not
These, for example: | https://im.tapclicks.com/signup.php/?utm_campaign=july15&utm_medium=organic&utm_source=blog | 1 | 2 | 29 | 2 | 200 |
Technical SEO | | writezach
| https://im.tapclicks.com/signup.php?_ga=1.145821812.1573134750.1440742418 | 1 | 1 | 25 | 2 | 200 |
| https://im.tapclicks.com/signup.php?utm_source=tapclicks&utm_medium=blog&utm_campaign=brightpod-article | 1 | 119 | 40 | 4 | 200 |
| https://im.tapclicks.com/signup.php?utm_source=tapclicks&utm_medium=marketplace&utm_campaign=homepage | 1 | 119 | 40 | 4 | 200 |
| https://im.tapclicks.com/signup.php?utm_source=blog&utm_campaign=first-3-must-watch-videos | 1 | 119 | 40 | 4 | 200 |
| https://im.tapclicks.com/signup.php?_ga=1.159789566.2132270851.1418408142 | 1 | 5 | 31 | 2 | 200 |
| https://im.tapclicks.com/signup.php/?utm_source=vocus&utm_medium=PR&utm_campaign=52release | Any suggestions/directions for fixing or should I just disregard this "High Priority" moz issue? Thank you!0 -
Will it be possible to point diff sitemap to same robots.txt file.
Will it be possible to point diff sitemap to same robots.txt file.
Technical SEO | | nlogix
Please advice.0 -
From page 1th to page 18th @ Google
Hello Mozzers! I have a question, you may help.. How may it be possible that a page ranking well (1th result) goes from 1th result to the 18th page just in 1 day? It doesnt seem to be any kind of penalization.. I now had all suspicious outgoing links to be nofollow (they were not before), this may be a cause .. (?) Do you have any other suggestion? Thanks
Technical SEO | | socialengaged0 -
Page not cached
Hi there, we uploaded a page but unfortunately didn't realise it had noindex,nofollow in the meta tags. Google had cached it then decached it (i guess thats possible) it seems? now it will not cache even though the correct meta tags have been put in and we have sent links to it internally and externally. Anyone know why this page isn't being cached, the internal link to it is on the homepage and that gets cached almost every day. I even submitted it to webmaster tools to index.
Technical SEO | | pauledwards0 -
Can't find mistake in robots.txt
Hi all, we recently filled our robots.txt file to prevent some directories from crawling. Looks like: User-agent: * Disallow: /Views/ Disallow: /login/ Disallow: /routing/ Disallow: /Profiler/ Disallow: /LILLYPROFILER/ Disallow: /EventRweKompaktProfiler/ Disallow: /AccessIntProfiler/ Disallow: /KellyIntProfiler/ Disallow: /lilly/ now, as Google Webmaster Tools hasn't updated our robots.txt yet, I checked our robots.txt in some ckeckers. They tell me that the User agent: * contains an error. **Example:** **Line 1: Syntax error! Expected <field>:</field> <value></value> 1: User-agent: *** **`I checked other robots.txt written the same way --> they work,`** accordign to the checkers... **`Where the .... is the mistake???`** ```
Technical SEO | | accessKellyOCG0 -
How ro write a robots txt file to point to your site map
Good afternoon from still wet & humid wetherby UK... I want to write a robots text file that instruct the bots to index everything and give a specific location to the sitemap. The sitemap url is:http://business.leedscityregion.gov.uk/CMSPages/GoogleSiteMap.aspx Is this correct: User-agent: *
Technical SEO | | Nightwing
Disallow:
SITEMAP: http://business.leedscityregion.gov.uk/CMSPages/GoogleSiteMap.aspx Any insight welcome 🙂0 -
Handling 301s: Multiple pages to a single page (consolidation)
Been scouring the interwebs and haven't found much information on redirecting two serparate pages to a single new page. Here is what it boils down to: Let's say a website has two pages, both with good page authority of products that are becoming fazed out. The products, Widget A and Widget B, are still popular search terms, but they are being combined into ONE product, Widget C. While Widget A and Widget B STILL have plenty to do with Widget C, Widget C is now the new page, the main focus page, and the page you want everyone to see and Google to recognize. Now, do I 301 Widget A and Widget B pages to Widget C, ALTHOUGH Widgets A and B previously had nothing to do with one another? (Remember, we want to try and keep some of that authority the two page have had.) OR do we keep Widget A and Widget B pages "alive", take them off the main navigation, and then put a "disclaimer" on the pages announcing they are now part of Widget C and link to Widget C? OR Should Widgets A and B page be canonicalized to Widget C? Again, keep in mind, widgets A and B previously were not similar, but NOW they are and result in Widget C. (If you are confused, we can provide a REAL work example of what we are talkinga about, but decided to not be specific to our industry for this.) Appreciate any and all thoughts on this.
Technical SEO | | JU19850