Robots review
-
Anything in this that would have caused Rogerbot to stop indexing my site? It only saw 34 of 5000+ pages on the last pass. It had no problems seeing the whole site before.
User-agent: Rogerbot
Disallow: /default.aspx?*
// Keep from crawling the CMS URLs like default.aspx?Tabid=234. The real home page is home.aspx
Disallow: /ctl/
// Keep from indexing the admin controls
Disallow: ArticleAdmin
// Keep from indexing the article admin page
Disallow: articleadmin
// Same in lower case
Disallow: /images/
// Keep from indexing CMS images
Disallow: captcha
// Keep from indexing the captcha image, which appears to crawlers as a page

General rules, lacking wildcards:

User-agent: *
Disallow: /default.aspx
Disallow: /images/
Disallow: /DesktopModules/DnnForge - NewsArticles/Controls/ImageChallenge.captcha.aspx
-
Well, our crawler is supposed to respect all standard robots.txt rules, so you should be good just adding them all back in as you normally would and seeing what happens. If it doesn't go through properly, I'll ask our engineers to take a look and find out what's happening!
-
Thanks Aaron.
I will add the rules back, as I want Roger to have nearly the same experience as Google and Bing.
Is it best to add them back one at a time? That could take over a month to figure out what's happening. Is there an easier way to test? Perhaps something like the Google Webmaster Tools Crawler Access tool?
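In the meantime, here is a rough local check I can run myself (a minimal sketch, assuming Python 3 and only its standard library; the URLs are placeholders for pages on my site). Note that urllib.robotparser follows the original robots.txt spec, so I've left the wildcard line out of the sketch: it would be matched literally rather than the way Google, Bing, or presumably Rogerbot expand "*".

# Rough local check: feed the non-wildcard Rogerbot rules into Python's
# standard-library robots.txt parser and see which URLs would be fetchable.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Rogerbot
Disallow: /ctl/
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical URLs, substitute real paths from the site.
for url in [
    "http://www.example.com/home.aspx",
    "http://www.example.com/images/banner.png",
    "http://www.example.com/ctl/Edit",
]:
    verdict = "allowed" if parser.can_fetch("Rogerbot", url) else "blocked"
    print(f"{verdict:8} {url}")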
-
Hey! Sorry you didn't have a good experience with your help ticket. I talked with Chiaryn, and it sounds like there was some confusion over what you wanted removed from your crawl; the ticket mentioned that you wanted only one particular page blocked. I think she found something different in your robots.txt - the rules you outline above - so she tried to help you with that situation. Roger does honor all robots.txt parameters, so the crawl should only be limited in the ways you define, though the wildcards do open you up to a lot of blockage.
It looks like you've since removed your restrictions on Roger. Chiaryn and I spoke about it, and we'll try to help with your specific site through your ticket. Hope this helps explain! If you want to re-add those parameters and then see which pages are wrongly blocked, I'd love to work through that with you - just let us know when you've changed the robots.txt.
-
All URLs are rewritten to default.aspx?Tabid=123&Key=Var. None of these are publicly visible once the rewriter is active. I added the rule just to make sure the page is never accidentally exposed and indexed.
-
Could you clarify the URL structure for default.aspx and the true home page? I ask because if you add Disallow: /default.aspx?* (with the wildcard), it will disallow all pages within the /default.aspx structure. Just use the same rule for Rogerbot as you did in the general rules, namely Disallow: /default.aspx
Hope this helps,
Vahe
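To make the difference concrete, here is a minimal sketch of Googlebot-style wildcard matching (an illustration only, assuming Python 3; it is not Rogerbot's actual implementation, and it ignores Allow lines and longest-match precedence). It simply shows which of the thread's example URLs each Disallow value would cover:

# Each Disallow value is matched against the start of the path, and "*"
# matches any run of characters (a trailing "$" would anchor the end).
import re

def blocks(disallow_value: str, path_and_query: str) -> bool:
    # Escape everything, then restore "*" as ".*" and "$" as an end anchor.
    pattern = re.escape(disallow_value).replace(r"\*", ".*").replace(r"\$", "$")
    return re.match(pattern, path_and_query) is not None

samples = [
    "/default.aspx",
    "/default.aspx?Tabid=123&Key=Var",  # rewritten CMS URL mentioned above
    "/home.aspx",                       # the real home page
]

for rule in ("/default.aspx?*", "/default.aspx"):
    print(f"Disallow: {rule}")
    for url in samples:
        print(f"  {url:35} -> {'blocked' if blocks(rule, url) else 'allowed'}")

Under this style of matching, the bare /default.aspx value also covers the query-string URLs, while the ?* form only matches URLs that carry a query string.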
-
Actually, I asked the help desk this question (essentially) first, and the lady said she wasn't a web developer and that I should ask the community. I was a little taken aback, frankly.
-
Can't. Default.aspx is the root of the CMS, and redirecting it would take down the entire website. The rule exists only because of a brief period when Google indexed the page incorrectly.
-
Hi,
If I were you, I would 301 redirect default.aspx to the real home page. Once you do that, simply remove it from the robots.txt file.
Not only would you strengthen the true home page, you would also prevent crawl errors from occurring.
There is also a concern that people may still link to default.aspx, which could be causing search engines to index the page. This might be the reason Rogerbot has stopped crawling your site.
If that's an issue, just put a canonical tag on that URL, but still remove the robots.txt reference.
Hope this helps,
Vahe
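If you do go the 301 route, here is a quick way to verify it (a minimal sketch, assuming the third-party requests package; example.com is a placeholder for the real domain):

# Check that default.aspx answers with a single 301 pointing at the real home
# page, rather than a 302 or a redirect chain.
import requests

resp = requests.get(
    "http://www.example.com/default.aspx",
    allow_redirects=False,   # inspect the first response instead of following it
    timeout=10,
)

print("Status:  ", resp.status_code)               # expect 301
print("Location:", resp.headers.get("Location"))   # expect the real home page, e.g. /home.aspx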
-
Hi! If you don't get an answer from the community by Monday, send an email to help@seomoz.org and they'll look at it to see what might be the problem (they're not in on the weekends, otherwise I'd have you send them an email right away).
Thanks!
Keri
Related Questions
-
Robots.txt file issues on Shopify server
We have repeated issues with one of our ecommerce sites not being crawled. We receive the following message: "Our crawler was not able to access the robots.txt file on your site. This often occurs because of a server error from the robots.txt. Although this may have been caused by a temporary outage, we recommend making sure your robots.txt file is accessible and that your network and server are working correctly. Typically errors like this should be investigated and fixed by the site webmaster. Read our troubleshooting guide." Are you aware of an issue with robots.txt on the Shopify servers? It is happening at least twice a month, so it is quite an issue.
-
Block Moz (or any other robot) from crawling pages with specific URLs
Hello! Moz reports that my site has around 380 duplicate page content issues. Most of them come from dynamically generated URLs that have some specific parameters. I have sorted this out for Google in Webmaster Tools (the new Google Search Console) by blocking the pages with these parameters. However, Moz is still reporting the same amount of duplicate content pages and, to stop it, I know I must use robots.txt. The trick is that I don't want to block every page, just the pages with specific parameters. I want to do this because among these 380 pages there are some other pages with no parameters (or different parameters) that I need to take care of. Basically, I need to clean this list to be able to use the feature properly in the future. I have read through the Moz forums and found a few topics related to this, but there is no clear answer on how to block only pages with specific URLs. Therefore, I have done my research and come up with these lines for robots.txt:
User-agent: dotbot
Disallow: /*numberOfStars=0
User-agent: rogerbot
Disallow: /*numberOfStars=0
My questions: 1. Are the above lines correct, and would they block Moz (dotbot and rogerbot) from crawling only pages that have the numberOfStars=0 parameter in their URLs, leaving other pages intact? 2. Do I need to have an empty line between the two groups (I mean between "Disallow: /*numberOfStars=0" and "User-agent: rogerbot"), or does it even matter? I think this would help many people, as there is no clear answer on how to block crawling only pages with specific URLs. Moreover, this should be valid for any robot out there. Thank you for your help!
-
Will Moz crawl pages blocked by robots.txt and nofollow links?
I have over 2,000 temporary redirects in my campaign report. The redirects are mostly events like being redirected to a login page before showing the actual data. I'm thinking of adding nofollow on the links so Moz won't crawl the redirections, to reduce the notifications. Will this solve my problem?
-
Robots.txt
I have a page used as a reference that lists 150 links to blog articles. I use it in a training area of my website. I now get warnings from Moz that it has too many links, so I decided to disallow this page in robots.txt. Below is what appears in the file:
Robots.txt file for http://www.boxtheorygold.com
User-agent: *
Disallow: /blog-links/
My understanding is that this simply has Google bypass the page and not crawl it. However, in Webmaster Tools, I used the Fetch tool to check a couple of my blog articles. One returned an expected result. The other returned "access denied" due to robots.txt. Both blog article links are listed on the /blog-links/ reference page. Question: Why does Google refuse to crawl the one article (using the Fetch tool) when it is not referenced at all in the robots.txt file? Why is access denied? Should I have used a noindex on this page instead of robots.txt? I am fearful that robots.txt may be blocking many of my blog articles. Please advise. Thanks,
Ron
-
Do the SEOmoz Campaign Reports follow Robots.txt?
Hello, Do the SEOmoz Campaign Reports (that track errors and warnings for a website) follow rules I write in the robots.txt file? I've done all that I can to fix the legitimate errors with my website, as reported by the fabulous SEOmoz tools. I want to clean up my pages indexed with the search engines so I've written a few rules to exclude content from Wordpress tag URLs for instance. Will my campaign report errors and warnings also drop as a result of this?
-
Meta-Robots noFollow and Blocked by Meta-Robots
On my most recent campaign report, I have two notices that we can't find any cause for:
Meta-Robots nofollow:
http://www.fateyes.com/the-effect-of-social-media-on-the-serps-social-signals-seo/?replytocom=92
"noindex nofollow" for the page: http://www.fateyes.com/the-effect-of-social-media-on-the-serps-social-signals-seo/
Blocked by Meta-Robots:
http://www.fateyes.com/the-effect-of-social-media-on-the-serps-social-signals-seo/?replytocom=92
"noindex nofollow" for the page: http://www.fateyes.com/the-effect-of-social-media-on-the-serps-social-signals-seo/
We are unable to locate any code whatsoever that may explain this. Any ideas, anyone?
-
SEOmoz bar: NoFollow and Robots.txt
Should the MozBar pick up "nofollow" links that are handled in robots.txt? The robots.txt blocks categories, but they still show as followed (green) links when using the MozBar. Thanks! Holly. ETA: I'm assuming that Disallow: myblog.com/category/ is comparable to the nofollow tag on the category?
-
Why is Roger crawling pages that are disallowed in my robots.txt file?
I have specified the following in my robots.txt file:
Disallow: /catalog/product_compare/
Yet Roger is crawling these pages (1,357 errors). Is this a bug, or am I missing something in my robots.txt file? Here's one of the URLs that Roger pulled:
example.com/catalog/product_compare/add/product/19241/uenc/aHR0cDovL2ZyZXNocHJvZHVjZWNsb3RoZXMuY29tL3RvcHMvYWxsLXRvcHM_cD02/
Please let me know if my problem is in robots.txt or if Roger spaced this one. Thanks!