Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies.
Robots.txt and canonical tag
-
The SEOmoz post at http://www.seomoz.org/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts says:
If you have a robots.txt disallow in place for a page, the canonical tag will never be seen.
Does this mean that if a page is disallowed by robots.txt, spiders do not read its HTML code at all?
-
Thanks Ryan for explaining things very clearly.
-
What we know is that there have been many cases where a page blocked in robots.txt has appeared in search results. The explanation provided is that robots.txt blocks crawlers during normal site visits, but not necessarily on visits where they are following links from other sites.
-
If spiders follow links to an article on my site, will they read its contents then? If the canonical tag is on the article page itself, will the canonical tag be seen?
-
Daylan offered a great answer, but I would like to add one exception. When crawlers from the major search engines visit your site, they will honor your robots.txt file. Sometimes, though, they will follow links from other sites to an article on your site, and during that particular visit they will not see the robots.txt file and will index your page.
This is one of the reasons why your robots.txt file should be used as minimally as possible, and when it is used you should have a backup process in place such as the canonical or noindex tag on a page.
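To make the "backup" idea concrete, here is a minimal sketch of how a crawler might read the canonical and noindex hints from a page. Both live in the page's HTML, so they only take effect if the page is actually fetched. The class name, the helper function, and the sample markup are made up for illustration and are not any search engine's real code.

```python
from html.parser import HTMLParser


class IndexingHintParser(HTMLParser):
    """Collects the rel=canonical URL and any meta robots directives."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots_directives = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and (attributes.get("rel") or "").lower() == "canonical":
            # <link rel="canonical" href="..."> names the preferred URL.
            self.canonical = attributes.get("href")
        elif tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            # <meta name="robots" content="noindex, follow"> lists directives.
            content = attributes.get("content") or ""
            self.robots_directives = [
                part.strip().lower() for part in content.split(",") if part.strip()
            ]


def read_indexing_hints(html):
    """Return (canonical_url, robots_directives) found in the given HTML."""
    parser = IndexingHintParser()
    parser.feed(html)
    return parser.canonical, parser.robots_directives


# Hypothetical article page used only for this example.
SAMPLE_HTML = """<html><head>
<link rel="canonical" href="http://www.example.com/article.html">
<meta name="robots" content="noindex, follow">
</head><body>Article text</body></html>"""

canonical, directives = read_indexing_hints(SAMPLE_HTML)
print(canonical)
print(directives)
```

If the crawler never downloads the HTML because robots.txt disallows the URL, a parser like this is never run, which is why the on-page tags cannot be seen in that case.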
-
Thanks Daylan for your quick response. I just wanted a second opinion that the canonical tag will never be seen if a page is disallowed.
-
That's correct in most cases.
It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:
User-agent: *
Disallow: /
The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
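The check a well-behaved robot performs can be sketched with Python's standard urllib.robotparser module. This is only an illustration using the two rules quoted above; the is_allowed helper and the "ExampleBot" user-agent name are made up for the example, and a robot that chooses to ignore robots.txt simply skips this step.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # rules are supplied as a list of lines
    return parser.can_fetch(user_agent, url)


# The robots.txt content from the example above.
ROBOTS_TXT = [
    "User-agent: *",
    "Disallow: /",
]

allowed = is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/welcome.html")
print(allowed)  # False: a compliant robot stops here and never downloads the page
```

Because the compliant robot stops before requesting welcome.html, nothing inside that page's HTML, including a canonical tag, is read on that visit.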