Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we're not completely removing the content (many posts will still be possible to view), we have locked both new posts and new replies.
Robots.txt and canonical tag
-
The SEOmoz post at http://www.seomoz.org/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts says:
If you have a robots.txt disallow in place for a page, the canonical tag will never be seen.
Does this mean that if a page is disallowed by robots.txt, spiders do not read its HTML code at all?
-
Thanks Ryan for explaining things very clearly.
-
What we know is there have been many cases where a page that is blocked in robots.txt has appeared in search results. The explanation provided is that robots.txt blocks crawlers during normal site visits, but not necessarily on visits where they are following links from other sites.
-
If spiders follow links to an article on my site, will they read the contents then? If the canonical tag is on the article page itself, will the canonical tag be seen?
-
Daylan offered a great answer, but I would like to add one exception. When crawlers from the major search engines visit your site they will honor your robots.txt file, but sometimes they will follow links from other sites to an article on your site, and during that particular visit they will not see the robots.txt file and will index your page.
This is one of the reasons why your robots.txt file should be used as minimally as possible, and when it is used you should have a backup process in place such as the canonical or noindex tag on a page.
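As a rough illustration of the backup tags mentioned above, the sketch below pulls the canonical URL and the robots meta directive out of a page's HTML using Python's standard-library parser. The page markup, the URL, and the class name IndexingHints are made up for the example. Note that a crawler can only act on these tags if it is allowed to fetch the page in the first place.

```python
from html.parser import HTMLParser


class IndexingHints(HTMLParser):
    """Collect the canonical URL and robots meta directive from a page's HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical = None  # href of <link rel="canonical">, if present
        self.robots = None     # content of <meta name="robots">, if present

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and (attributes.get("rel") or "").lower() == "canonical":
            self.canonical = attributes.get("href")
        elif tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            self.robots = attributes.get("content")


# Hypothetical page markup, for illustration only.
PAGE_HTML = """
<html><head>
<link rel="canonical" href="http://www.example.com/article.html">
<meta name="robots" content="noindex, follow">
</head><body>Article text</body></html>
"""

hints = IndexingHints()
hints.feed(PAGE_HTML)
print(hints.canonical)
print(hints.robots)
```

Both values live inside the HTML itself, which is why they are invisible to any crawler that never downloads the page.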
-
Thanks Daylan for your quick response. I just wanted a second opinion confirming that the canonical tag will never be seen if a page is disallowed.
-
That's correct in most cases.
It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:
User-agent: *
Disallow: /
The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
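The check a well-behaved robot performs can be sketched with Python's standard urllib.robotparser module. This is only an illustration: the rules are parsed from a local string rather than fetched over the network, and the helper name may_crawl and the user agent "ExampleBot" are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from the example, held in a string (no network access).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant robot may fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is a made-up user agent; it falls under the "*" section.
allowed = may_crawl(ROBOTS_TXT, "ExampleBot", "http://www.example.com/welcome.html")
print(allowed)  # False
```

Because a compliant robot stops here and never downloads welcome.html, it has no opportunity to read a canonical tag inside that page's HTML.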