Robots.txt and canonical tag
-
The SEOmoz post at http://www.seomoz.org/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts says:
If you have a robots.txt disallow in place for a page, the canonical tag will never be seen.
Does this mean that if a page is disallowed by robots.txt, spiders do not read its HTML at all?
-
Thanks Ryan for explaining things very clearly.
-
What we know is that there have been many cases in which a page blocked in robots.txt has appeared in search results. The explanation provided is that robots.txt blocks crawlers during normal site visits, but not necessarily on visits where they are following links from other sites.
-
If spiders follow links to an article on my site, will they read the contents then? If the canonical tag is on the article page itself, will the canonical tag be seen?
-
Daylan offered a great answer, but I would like to add one exception. When crawlers from the major search engines visit your site, they will honor your robots.txt file. Sometimes, though, they will follow links from other sites to an article on your site, and during that particular visit they will not see the robots.txt file and will index your page.
This is one of the reasons why your robots.txt file should be used as sparingly as possible, and when it is used you should have a backup process in place, such as the canonical or noindex tag on the page.
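As a rough way to audit that backup, the sketch below scans a page's HTML for a rel="canonical" link and a robots meta tag containing noindex. It uses only Python's standard html.parser module; the class name, function name, and sample markup are made up for illustration, and the example.com URL is a placeholder. Keep in mind that these tags only matter if a crawler actually fetches the page.

```python
from html.parser import HTMLParser


class BackupTagFinder(HTMLParser):
    """Hypothetical helper: records a canonical URL and a noindex directive."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        attr = {name: (value or "") for name, value in attrs}
        if tag == "link" and "canonical" in attr.get("rel", "").lower().split():
            self.canonical = attr.get("href")
        elif tag == "meta" and attr.get("name", "").lower() == "robots":
            directives = [d.strip() for d in attr.get("content", "").lower().split(",")]
            if "noindex" in directives:
                self.noindex = True


def find_backup_tags(html):
    """Return (canonical_url_or_None, has_noindex) for an HTML string."""
    finder = BackupTagFinder()
    finder.feed(html)
    return finder.canonical, finder.noindex


if __name__ == "__main__":
    # Sample markup is an assumption, not taken from any real page.
    page = (
        "<html><head>"
        '<link rel="canonical" href="http://www.example.com/article">'
        "</head><body>Article text</body></html>"
    )
    print(find_backup_tags(page))
```

A page with neither tag returns (None, False), which would flag it as having no backup if robots.txt is bypassed.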
-
Thanks Daylan for your quick response. I just wanted a second opinion on whether the canonical tag will never be seen if a page is disallowed.
-
That's correct in most cases.
It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:
User-agent: *
Disallow: /
The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
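To see how a well-behaved robot would apply those two lines, here is a minimal sketch using Python's standard urllib.robotparser module. The rules are parsed from a string rather than fetched over the network, and the "ExampleBot" user agent and the is_allowed helper are names invented for this illustration. It only models a robot that chooses to obey robots.txt; it says nothing about robots that ignore the file.

```python
from urllib.robotparser import RobotFileParser

# The same rules quoted above: every robot, whole site disallowed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant robot may fetch url under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.example.com/welcome.html"
    if is_allowed(ROBOTS_TXT, "ExampleBot", url):
        print("Allowed: the robot would fetch the page and read its HTML.")
    else:
        print("Disallowed: the robot skips the page, so it never reads the HTML.")
```

Because a compliant robot stops at this check, it never downloads the page's HTML, which is why a canonical tag on a disallowed page goes unseen.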
More information available here about: