Robots.txt and canonical tag
-
The SEOmoz post at http://www.seomoz.org/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts says:
If you have a robots.txt disallow in place for a page, the canonical tag will never be seen.
Does that mean that if a page is disallowed by robots.txt, spiders do not read its HTML at all?
-
Thanks Ryan for explaining things very clearly.
-
What we know is there have been many cases where a page that is blocked in robots.txt has appeared in search results. The explanation provided is that robots.txt blocks crawlers during normal site visits, but not necessarily on visits where they are following links from other sites.
-
If spiders follow links to an article on my site, will they read the contents then? If the canonical tag is on the article page itself, will the canonical tag be seen?
-
Daylan offered a great answer, but I would like to add one exception. When crawlers from the major search engines visit your site, they honor your robots.txt file. Sometimes, though, they follow links from other sites to an article on your site, and during that particular visit they may not see the robots.txt file and may index your page.
This is one of the reasons why your robots.txt file should be used as minimally as possible, and when it is used you should have a backup process in place such as the canonical or noindex tag on a page.
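To make the backup idea concrete, here is a minimal sketch of how a crawler might pick up the canonical and noindex hints once it has actually downloaded a page's HTML. The page markup, the example.com URL, and the class name IndexingHints are assumptions made up for illustration; real search engine crawlers are certainly more involved than this.

```python
from html.parser import HTMLParser


class IndexingHints(HTMLParser):
    """Collects the canonical URL and a noindex directive from a page's HTML."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <link rel="canonical" href="..."> names the preferred URL.
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")
        # <meta name="robots" content="noindex, ..."> asks not to be indexed.
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = (attrs.get("content") or "").lower()
            if "noindex" in content:
                self.noindex = True


# Hypothetical page markup, used only for this illustration.
page_html = """<html><head>
<link rel="canonical" href="http://www.example.com/article">
<meta name="robots" content="noindex, follow">
</head><body>Article text</body></html>"""

hints = IndexingHints()
hints.feed(page_html)
print(hints.canonical, hints.noindex)
```

Both hints live inside the HTML, so a crawler that never fetches the page because of a robots.txt disallow has no way to read them.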
-
Thanks Daylan for your quick response. I just wanted a second opinion that the canonical tag will never be seen if a page is disallowed.
-
That's correct in most cases.
It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:
User-agent: *
Disallow: /
The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
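The check described above can be sketched with Python's standard urllib.robotparser module. This is only an illustration of what a well-behaved robot does before requesting a page: the rules are supplied inline rather than fetched, and "ExampleBot" and the helper is_allowed are made-up names, not anything a real crawler uses.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules, user_agent, url):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules)  # rules is a list of robots.txt lines
    return parser.can_fetch(user_agent, url)


# The two lines from the example above.
rules = [
    "User-agent: *",
    "Disallow: /",
]

url = "http://www.example.com/welcome.html"
# "ExampleBot" is a hypothetical user-agent name.
allowed = is_allowed(rules, "ExampleBot", url)
print(allowed)

# A compliant robot only downloads the HTML when the check passes, so under
# "Disallow: /" it never gets as far as reading a canonical tag on the page.
```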
Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
More information available here about: