Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Robots.txt and canonical tag
-
In the SEOmoz post - http://www.seomoz.org/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts, it's being said -
If you have a robots.txt disallow in place for a page, the canonical tag will never be seen.
Does it so happen that if a page is disallowed by robots.txt, spiders DO NOT read the html code ?
-
Thanks Ryan for explaining things very clearly.
-
What we know is there have been many cases where a page that is blocked in robots.txt has appeared in search results. The explanation provided is that robots.txt blocks crawlers during normal site visits, but not necessarily on visits where they are following links from other sites.
-
If spiders follow links to an article on my site, will they read the contents then ? If the canonical tag is on article page itself, will canonical tag will be seen ?
-
Daylan offered a great answer but I would like to add one exception. When crawlers from the major SEs visit your site they will honor your robots.txt file but sometimes they will follow links from other sites to an article on your site, and during that particular visit they will not see the robots.txt file and index your page.
This is one of the reasons why your robots.txt file should be used as minimally as possible, and when it is used you should have a backup process in place such as the canonical or noindex tag on a page.
-
Thanks Daylan for your quick response. I just wanted a second opinion that canonical tag will never be seen if a page is disallowed.
-
Thats correct in most cases:
It works likes this: a robot wants to vists a Web site URL, say http://www.example.com/welcome.html. Before it does so, it firsts checks for http://www.example.com/robots.txt, and finds:
User-agent: *
Disallow: /The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
Robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
More information available here about:
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Crawl solutions for landing pages that don't contain a robots.txt file?
My site (www.nomader.com) is currently built on Instapage, which does not offer the ability to add a robots.txt file. I plan to migrate to a Shopify site in the coming months, but for now the Instapage site is my primary website. In the interim, would you suggest that I manually request a Google crawl through the search console tool? If so, how often? Any other suggestions for countering this Meta Noindex issue?
Technical SEO | | Nomader1 -
Robots.txt Syntax for Dynamic URLs
I want to Disallow certain dynamic pages in robots.txt and am unsure of the proper syntax. The pages I want to disallow all include the string ?Page= Which is the proper syntax?
Technical SEO | | btreloar
Disallow: ?Page=
Disallow: ?Page=*
Disallow: ?Page=
Or something else?0 -
Do URLs with canonical tags get indexed by Google?
Hi, we re-branded and launched a new website in February 2016. In June we saw a steep drop in the number of URLs indexed, and there have continued to be smaller dips since. We started an account with Moz and found several thousand high priority crawl errors for duplicate pages and have since fixed those with canonical tags. However, we are still seeing the number of URLs indexed drop. Do URLs with canonical tags get indexed by Google? I can't seem to find a definitive answer on this. A good portion of our URLs have canonical tags because they are just events with different dates, but otherwise the content of the page is the same.
Technical SEO | | zasite0 -
Multiple robots.txt files on server
Hi! I have previously hired a developer to put up my site and noticed afterwards that he did not know much about SEO. This lead me to starting to learn myself and applying some changes step by step. One of the things I am currently doing is inserting sitemap reference in robots.txt file (which was not there before). But just now when I wanted to upload the file via FTP to my server I found multiple ones - in different sizes - and I dont know what to do with them? Can I remove them? I have downloaded and opened them and they seem to be 2 textfiles and 2 dupplicates. Names: robots.txt (original dupplicate)
Technical SEO | | mjukhud
robots.txt-Original (original)
robots.txt-NEW (other content)
robots.txt-Working (other content dupplicate) Would really appreciate help and expertise suggestions. Thanks!0 -
Canonical Tag when using Ajax and PhantomJS
Hello, We have a site that is built using an AJAX application. We include the meta fragment tag in order to get a rendered page from PhantomJS. The URL that is rendered to google from PhantomJS then is www.oursite.com/?escaped_fragment= In the SERP google of course doesnt include the hashtag in the URL. So my question, with this setup, do i still need a canonical tag and if i do, would the canonical tag be the escaped fragment URL or the regular URL? Much Appreciated!
Technical SEO | | RevanaDigitalSEO0 -
Two different canonical tags on one page
Due to an error, some of my pages now have two canonical tags on them. One is correct and the other goes to a nonsense URL (404 page). I know I should ideally remove the incorrect ones, but it's a big manual job. Are they doing any harm? Can I just leave them there and let Google figure it out? The correct ones are higher up in the code. Will this make a difference? Any help appreciated.
Technical SEO | | ShearingsGroup0 -
Internal search : rel=canonical vs noindex vs robots.txt
Hi everyone, I have a website with a lot of internal search results pages indexed. I'm not asking if they should be indexed or not, I know they should not according to Google's guidelines. And they make a bunch of duplicated pages so I want to solve this problem. The thing is, if I noindex them, the site is gonna lose a non-negligible chunk of traffic : nearly 13% according to google analytics !!! I thought of blocking them in robots.txt. This solution would not keep them out of the index. But the pages appearing in GG SERPS would then look empty (no title, no description), thus their CTR would plummet and I would lose a bit of traffic too... The last idea I had was to use a rel=canonical tag pointing to the original search page (that is empty, without results), but it would probably have the same effect as noindexing them, wouldn't it ? (never tried so I'm not sure of this) Of course I did some research on the subject, but each of my finding recommanded one of the 3 methods only ! One even recommanded noindex+robots.txt block which is stupid because the noindex would then be useless... Is there somebody who can tell me which option is the best to keep this traffic ? Thanks a million
Technical SEO | | JohannCR0 -
Can I Disallow Faceted Nav URLs - Robots.txt
I have been disallowing /*? So I know that works without affecting crawling. I am wondering if I can disallow the faceted nav urls. So disallow: /category.html/? /category2.html/? /category3.html/*? To prevent the price faceted url from being cached: /category.html?price=1%2C1000
Technical SEO | | tylerfraser
and
/category.html?price=1%2C1000&product_material=88 Thanks!0