Is the robots meta tag more reliable than robots.txt at preventing indexing by Google?
-
What's your experience of using the robots meta tag vs. robots.txt as a standalone solution to prevent Google indexing?
I am pretty sure the robots meta tag is more reliable. Going on my own experience, I have never had any problems with robots meta tags, but plenty with robots.txt as a standalone solution.
Thanks in advance, Luke
-
Hi there,
Regarding the X-Robots-Tag: we have had a couple of sites that were disallowed in robots.txt have their PDF, Doc, etc. files get indexed. I understand the reasoning for this. I would like to remove the disallow in robots.txt and use the X-Robots-Tag to noindex all pages as well as PDF, Doc files, etc. This is for an nginx configuration. Does anyone know what the written X-Robots-Tag would look like in this case?
-
Test what works for your site.
Use the tools below:
- https://www.deepcrawl.com/ (will give you one free full crawl)
- https://www.screamingfrog.co.uk/seo-spider/ (free up to 500 URLs)
- http://urlprofiler.com/ (14-day free trial)
- https://www.deepcrawl.com/blog/best-practice/noindex-disallow-nofollow/
- https://www.screamingfrog.co.uk/seo-spider/user-guide/general/#robots-txt
- https://www.deepcrawl.com/blog/best-practice/noindex-and-google/
There is a lot more info here:
https://www.deepcrawl.com/blog/tag/robots-txt/
Thomas
-
Hi Luke,
In order to exclude individual pages from search engine indices, the noindex meta tag is actually superior to robots.txt.
The X-Robots-Tag HTTP header is the best option, but it is much harder to use.
Block all web crawlers from all content:
User-agent: *
Disallow: /
Using the robots.txt file, you can tell a spider where it cannot go on your site. You cannot tell a search engine which URLs it cannot show in the search results. This means that not allowing a search engine to crawl a URL (called "blocking" it) does not mean that URL will not show up in the search results. If the search engine finds enough links to that URL, it will include it; it will just not know what's on that page.
If you want to reliably block a page from showing up in the search results, you need to use a meta robots noindex tag. That means the search engine has to be able to crawl that page and find the noindex tag, so the page should not be blocked by robots.txt.
What a robots.txt file does, in a nutshell, is tell search engines not to crawl a particular page, file or directory of your website. Using this helps both you and search engines such as Google. By not providing access to certain unimportant areas of your website, you can save on your crawl budget and reduce load on your server.
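To see the crawl-blocking side of this for yourself, here is a minimal Python sketch using the standard library's `urllib.robotparser`. It only answers "may this user-agent fetch this URL?", which is all robots.txt controls; it says nothing about whether the URL ends up in the index. The helper name `can_crawl` and the example.com URL are placeholders for illustration, not something from this thread.

```python
from urllib.robotparser import RobotFileParser

# The "block all crawlers from all content" rules quoted earlier.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL: any path is disallowed by the rules above.
allowed = can_crawl(ROBOTS_TXT, "Googlebot", "https://example.com/private.pdf")
print(allowed)  # False: crawling is blocked, indexing is a separate matter
```

If this returns False for a page that carries a noindex tag, the crawler never gets to read that tag, which is exactly the trap described above.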
Please note that using the robots.txt file to hide your entire website from search engines is definitely not recommended.
See the big photo: http://i.imgur.com/MM7hM4g.png
<meta name="robots" content="noindex" /> _(…)_
The robots meta tag in the above example instructs all search engines not to show the page in search results. The value of the name attribute (robots) specifies that the directive applies to all crawlers. To address a specific crawler, replace the robots value of the name attribute with the name of the crawler that you are addressing. Specific crawlers are also known as user-agents (a crawler uses its user-agent to request a page). Google's standard web crawler has the user-agent name Googlebot.
To prevent only Googlebot from indexing your page, update the tag as follows:
<meta name="googlebot" content="noindex" />
This tag now instructs Google (but no other search engines) not to show this page in its web search results. Both the name and the content attributes are non-case sensitive.
Search engines may have different crawlers for different properties or purposes. See the complete list of Google's crawlers. For example, to show a page in Google's web search results, but not in Google News, use the following meta tag:
<meta name="googlebot-news" content="noindex" />
If you need to specify multiple crawlers individually, it's okay to use multiple robots meta tags:
<meta name="googlebot" content="noindex">
<meta name="googlebot-news" content="nosnippet">
If competing directives are encountered by our crawlers we will use the most restrictive directive we find.
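If you want to check which robots meta directives a page actually serves to a given crawler, a rough Python sketch along these lines may help. It uses only the standard library's `html.parser`. The names `RobotsMetaParser`, `directives_for` and `is_indexable` are made up for this example, and pooling all matching directives together is a simplified stand-in for the "most restrictive wins" rule, not Google's actual logic.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect (name, directives) pairs from every <meta name=... content=...> tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values are treated case-insensitively, as the text says.
        name = (attr.get("name") or "").strip().lower()
        content = attr.get("content") or ""
        if name:
            directives = [d.strip().lower() for d in content.split(",") if d.strip()]
            self.tags.append((name, directives))


def directives_for(html: str, crawler: str) -> set:
    """Directives that apply to `crawler`: generic 'robots' tags plus its own tags."""
    parser = RobotsMetaParser()
    parser.feed(html)
    wanted = {"robots", crawler.lower()}
    found = set()
    for name, directives in parser.tags:
        if name in wanted:
            found.update(directives)
    return found


def is_indexable(html: str, crawler: str) -> bool:
    """Assumption for this sketch: any 'noindex' or 'none' blocks indexing."""
    return not ({"noindex", "none"} & directives_for(html, crawler))


page = '<html><head><meta name="googlebot-news" content="noindex"></head><body></body></html>'
print(is_indexable(page, "googlebot"))       # True
print(is_indexable(page, "googlebot-news"))  # False
```

Remember that this only tells you what is in the HTML; the crawler still has to be allowed to fetch the page to see it.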
This basically means that if you want to really hide something from the search engines, and thus from people using search, robots.txt won't suffice.
Indexer directives
Indexer directives are directives that are set on a per page and/or per element basis. Up until July 2007, there were two directives: the microformat rel="nofollow", which means that that link should not pass authority / PageRank, and the Meta Robots tag.
With the Meta Robots tag, you can really prevent search engines from showing pages you want to keep out of the search results. The same result can be achieved with the X-Robots-Tag HTTP header. As described earlier, the X-Robots-Tag gives you more flexibility by also allowing you to control how specific files and file types are indexed.
Example uses of the X-Robots-Tag
Using the X-Robots-Tag HTTP header
The X-Robots-Tag can be used as an element of the HTTP header response for a given URL. Any directive that can be used in a robots meta tag can also be specified as an X-Robots-Tag. Here's an example of an HTTP response with an X-Robots-Tag instructing crawlers not to index a page:
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
_(…)_
**X-Robots-Tag: noindex**
_(…)_
Multiple X-Robots-Tag headers can be combined within the HTTP response, or you can specify a comma-separated list of directives. Here's an example of an HTTP header response which has a noarchive X-Robots-Tag combined with an unavailable_after X-Robots-Tag.
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
_(…)_
**X-Robots-Tag: noarchive**
**X-Robots-Tag: unavailable_after: 25 Jun 2010 15:00:00 PST**
_(…)_
The X-Robots-Tag may optionally specify a user-agent before the directives. For instance, the following set of X-Robots-Tag HTTP headers can be used to conditionally allow showing of a page in search results for different search engines:
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
_(…)_
**X-Robots-Tag: googlebot: nofollow**
**X-Robots-Tag: otherbot: noindex, nofollow**
_(…)_
Directives specified without a user-agent are valid for all crawlers. The section below demonstrates how to handle combined directives. Both the name and the specified values are not case sensitive.
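The header format above (an optional user-agent, then a comma-separated list of directives) can be picked apart with a few lines of Python. This is a minimal sketch, not Google's parser. The function names and the `VALUED_DIRECTIVES` set are assumptions for illustration; the set only exists so that `unavailable_after: <date>` is not mistaken for a user-agent prefix.

```python
# Directives that are themselves written as "name: value".
# Assumed short list for this sketch; extend it if you use others.
VALUED_DIRECTIVES = {"unavailable_after"}


def parse_x_robots_tag(value: str):
    """Split one X-Robots-Tag header value into (user_agent or None, directives)."""
    head, sep, rest = value.partition(":")
    name = head.strip().lower()
    if sep and "," not in head and name not in VALUED_DIRECTIVES:
        # Something like "googlebot: nofollow": the prefix names a crawler.
        user_agent, body = name, rest
    else:
        # No prefix: the directives apply to all crawlers.
        user_agent, body = None, value
    directives = [d.strip().lower() for d in body.split(",") if d.strip()]
    return user_agent, directives


def blocks_indexing(header_values, crawler: str) -> bool:
    """True if any header that applies to `crawler` carries noindex (or none)."""
    crawler = crawler.lower()
    for value in header_values:
        user_agent, directives = parse_x_robots_tag(value)
        applies = user_agent is None or user_agent == crawler
        if applies and ("noindex" in directives or "none" in directives):
            return True
    return False


# The two header values from the example response above.
headers = ["googlebot: nofollow", "otherbot: noindex, nofollow"]
print(blocks_indexing(headers, "googlebot"))  # False
print(blocks_indexing(headers, "otherbot"))   # True
```

Feeding it the X-Robots-Tag values from a real response (for a PDF, say) is a quick way to confirm the server is sending what you think it is.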
- https://moz.com/learn/seo/robotstxt
- https://yoast.com/ultimate-guide-robots-txt/
- https://moz.com/blog/the-wonderful-world-of-seo-metatags
- https://yoast.com/x-robots-tag-play/
- https://www.searchenginejournal.com/x-robots-tag-simple-alternate-robots-txt-meta-tag/67138/
- https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
I hope this helps,
Tom
-
If you've recently added the "noindex" meta tag, get the page fetched in Google Webmaster Tools (GWT). Google can't act if it doesn't see the tag.
-
Hi Luke,
It's a pretty common misconception that robots.txt will prevent indexing. Its only purpose is actually to prevent crawling; anything disallowed in there is still up for indexing if it's linked to elsewhere. If you want something deindexed, your best bet is the robots meta tag, but make sure you allow crawling of the URLs to give search engine bots an opportunity to see the tag.