Meta NoIndex tag and Robots Disallow
-
Hi all,
I hope you can spare some time to answer the first of a few questions.
We are running a Magento site - a layered/faceted navigation nightmare has created thousands of duplicate URLs!
Anyway, while tackling the issue, I disallowed in robots.txt any URL with a query-string parameter other than p (which I allowed for pagination).
After checking some pages in Google, I did a site:www.mydomain.com/specificpage.html search, and a few duplicates came up along with the original, showing the message:
"There is no information about this page because it is blocked by robots.txt"
I had also added meta noindex, follow to all these duplicates, but I guess it wasn't being read because of robots.txt.
So, coming to my questions:
1. Did robots.txt block access to these pages? If so, were they already in the index, so that after I disallowed them in robots.txt Googlebot could no longer read the meta noindex?
2. Does meta noindex, follow on a page actually help Googlebot decide to remove that page from the index?
I thought robots.txt would prevent indexation? But I've read this:
"Noindex is a funny thing, it actually doesn’t mean “You can’t index this”, it means “You can’t show this in search results”. Robots.txt disallow means “You can’t index this” but it doesn’t mean “You can’t show it in the search results”."
I'm a bit confused about how to use these, both to prevent duplicate content in the first place and to address dupe content once it's already in the index.
Thanks!
B
-
-
There's no real way to estimate how long the re-crawl will take, Ben. You can get a bit of an idea by looking at the crawl rate reported in Google Webmaster Tools.
Yes, asking for a page fetch and then submitting with linked pages for each of the main website sections can help speed up crawl discovery. In addition, make sure you've submitted a current sitemap and that it's being found correctly (also reported in GWT). You should also do the same in Bing Webmaster Tools. Too many sites forget about optimizing for Bing - even if it's only 20% of Google's traffic, there's no point throwing it away.
Lastly, earning some new links to different sections of the site is another great signal. This can often be effectively & quickly done using social media - especially Google+ as it gets crawled very quickly.
As far as your other question - yes, once you get the unwanted URLs out of the index, you can add the robots.txt disallow back in to optimise your crawl budget. I would strongly recommend you leave the meta-robots no-index tag in place though as a "belt & suspenders" approach to keep pages linking into those unwanted pages from triggering a re-indexing. It's OK to have both in place as long as the de-indexing has already been accomplished, as we've discussed.
Hope that answers your questions.
Paul
-
So once Google has seen the meta-noindex and has finished de-indexing the pages, I would like to block it from crawling them with robots.txt to conserve my crawl budget.
But there are still internal links on the site that point to these URLs - would the pages get back into the index in this case?
-
Hi Paul,
Thank you for your detailed answer - so I'm not going crazy.
I did try canonicals, but then realized they are more of a suggestion than a directive. I am still correcting a lot of dupe content and 404s, so I imagine Google views the site as "these guys don't know what they are doing" and may have ignored the canonical suggestion.
So what I have done is remove the robots.txt block on the pages I want de-indexed and add meta noindex, follow to those pages. From what you are saying, they should naturally de-index, after which I will put the robots.txt block back on to keep my crawl budget spent on better areas of the site.
How long, in your opinion, can it take for Googlebot to de-index the pages? Can I do anything to speed it up, such as using Fetch as Googlebot on the page and its linked pages?
Thanks again,
Ben
-
You're right to be confused, B. The terminology is unfortunate and misleading.
To answer your questions:
1. Yes.
2. Yes.
A disallow in robots.txt does nothing to remove already-indexed pages. That's not its purpose. Its only purpose is to tell the search crawlers not to waste their time crawling those pages. Even if pages have been blocked in robots, they will remain in the index if already there. Even if never crawled, and blocked in robots.txt, they can still end up indexed if some other indexed page links to them and the crawlers find those pages by following links. Again, nothing in a robots.txt disallow tells the engines to remove a page from the index, just not to waste time crawling it.
Put another way, the robots.txt disallow directive only disallows crawling - it says nothing about what to do if the page gets into the index in other ways.
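To make the "crawling only" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The `/filtered/` path and the `www.example.com` URLs are made up for illustration, and this parser does plain prefix matching, so it does not model Google's wildcard handling. Note that the only question it can answer is "may this URL be fetched?"; nothing in it says anything about the index.

```python
from urllib import robotparser

# Hypothetical robots.txt: block crawling of everything under /filtered/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /filtered/
"""


def may_crawl(url, robots_txt=ROBOTS_TXT, agent="Googlebot"):
    """Return True if robots.txt lets `agent` fetch `url`."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The duplicate under /filtered/ may not be fetched; the original may.
print(may_crawl("https://www.example.com/filtered/red.html"))
print(may_crawl("https://www.example.com/specificpage.html"))
```

A polite crawler stops at a `False` answer, which is why anything inside the blocked page, including a meta tag, goes unread.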
The meta-robots no-index tag however explicitly states to the crawler "if you arrive at this page, do not add it to the index. If it is already in the index, remove it".
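For reference, the tag in question lives in the page's head and looks like `<meta name="robots" content="noindex, follow">`. The sketch below, using only Python's standard `html.parser`, shows roughly how a crawler could pull those directives out of a page once it has fetched it. The class and function names are my own, and the sample page is invented; real crawlers are certainly more elaborate.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, follow".
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of meta-robots directives found in an HTML string."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


# Invented sample page carrying the tag discussed above.
PAGE = (
    '<html><head><meta name="robots" content="noindex, follow"></head>'
    "<body>Duplicate listing</body></html>"
)
print(robots_directives(PAGE))
```

The key point is that this only works on HTML the crawler has actually downloaded.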
And yes - as you suspected - if pages are blocked in robots.txt, the crawler obeys and doesn't visit those pages, so it can't discover the no-index command to drop them from the index. Thus the only way a page could get dropped is if a crawler followed a link from an external site and discovered the page that way - a very inefficient way of trying to get all those pages out of the index.
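The interaction can be boiled down to a toy model. This is a deliberate simplification of what is described above, not how any search engine is actually implemented: the function name and its flags are hypothetical, and it assumes that a page which is crawled and carries no no-index tag simply ends up indexed.

```python
def indexed_after_next_visit(in_index, blocked_by_robots, has_noindex,
                             linked_from_indexed_page=False):
    """Toy model: is the URL in the index after the crawler's next pass?"""
    if blocked_by_robots:
        # The page is never fetched, so any no-index tag on it goes unread.
        # It stays in the index if already there, and can still get in
        # through links from other indexed pages.
        return in_index or linked_from_indexed_page
    # The page is fetched, so the no-index tag (if any) is read and obeyed.
    # Assumption: a crawled page without no-index gets indexed.
    return not has_noindex


# Already indexed, then disallowed, with no-index added: still indexed.
print(indexed_after_next_visit(in_index=True, blocked_by_robots=True, has_noindex=True))
# Same page with the disallow removed: the tag is read and the page drops out.
print(indexed_after_next_visit(in_index=True, blocked_by_robots=False, has_noindex=True))
```

The first call is the situation described in the question; the second is the fix of lifting the disallow until the pages have been dropped.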
Bottom line - robots.txt is never the correct tool to deal with duplicate content issues. Its sole purpose is to keep the crawlers from wasting time on unimportant pages so they can spend more time finding (and therefore indexing) more important pages.
The three tools for dealing with duplicate content are meta-robots no-index tags in a page header, 301 redirects, and canonical tags. Which one to use depends on the architecture of your site, your intended purpose, and the site's technical limitations.
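As a rough illustration of what each of those three tools looks like from the crawler's side, here is a small sketch. The function name, the dictionary layout, and the example URL are all invented for this post; the tag, the link element, and the 301 status with a `Location` header are the standard mechanisms.

```python
def duplicate_fix(tool, preferred_url):
    """Describe what a duplicate URL would return under each of the three tools."""
    if tool == "noindex":
        # Page still loads, but asks to be kept out of the index.
        return {"status": 200, "headers": {},
                "head": '<meta name="robots" content="noindex, follow">'}
    if tool == "redirect":
        # Page no longer loads; visitors and crawlers are sent to the preferred URL.
        return {"status": 301, "headers": {"Location": preferred_url}, "head": ""}
    if tool == "canonical":
        # Page still loads, and hints which URL should be treated as the original.
        return {"status": 200, "headers": {},
                "head": f'<link rel="canonical" href="{preferred_url}">'}
    raise ValueError(f"unknown tool: {tool!r}")


for tool in ("noindex", "redirect", "canonical"):
    print(tool, duplicate_fix(tool, "https://www.example.com/specificpage.html"))
```

Seen this way, the trade-off is clearer: the redirect removes the duplicate for users as well, while no-index and canonical leave it reachable.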
Hope that makes sense?
Paul