Can I block HTTPS URLs using the Host directive in robots.txt?
-
Hello Moz Community,
Recently, I have found that Googlebot has started crawling the HTTPS URLs of my website, which is increasing the number of duplicate pages on our site.
Instead of creating a separate robots.txt file for the HTTPS version of my website, can I use the Host directive in robots.txt to suggest to Googlebot which is the original version of the website?
Host: http://www.example.com
I was wondering if this method will work and suggest to Googlebot that the HTTPS URLs are a mirror of this website.
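For reference, here is a minimal sketch of the robots.txt I have in mind (example.com is just a placeholder for our domain):
User-agent: *
Disallow:
# Intended to point crawlers at the preferred (original) mirror of the site
Host: http://www.example.com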
Thanks for all of the great responses!
Regards,
Ramendra
-
Hi Ramendra,
To my knowledge, you can only provide directives in a robots.txt file for the exact domain on which it lives. This goes for both the http/https and www/non-www versions of a domain. That's why it's important to handle your preferred domain format with redirects that point to your canonical version: if you want http://www to be indexed, every other version should redirect to it.
There might be a workaround of some sort, but honestly, redirecting everything to your preferred version, as described above, is the direction you should take. Then you only have to manage one robots.txt file, and your indexing will line up much better with what you want.
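As a rough sketch (assuming Apache, and using www.example.com as a stand-in for your preferred host), the host-level redirect could look something like this:
RewriteEngine On
# Any request for a non-preferred hostname gets a 301 to the preferred www host
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
Adjust the target scheme (http vs. https) to whichever version you actually want indexed.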
-
Thanks Logan,
I have read somewhere that by using the Host directive in the robots.txt file we can suggest to Googlebot which is the original version of a website when there are a number of mirror sites. So I was wondering if we can prevent indexing/crawling of HTTPS URLs by using the Host directive in the robots.txt of the HTTP site.
We are using an ecommerce SaaS platform for our website, where we have only one robots.txt file, which we can use for the HTTP site.
Is there any other way to prevent indexing/crawling of HTTPS URLs?
Regards,
Ramendra
-
Hi Ramendra,
Based on what you said, it sounds like both versions of your site exist and are indexed, and you want to mitigate your duplicate content risk. If that's accurate, here are my recommendations on this:
- Robots.txt cannot be used on an HTTP site to prevent indexing/crawling of HTTPS URLs
- Google crawls HTTPS by default, so if your site is fully secure, you need to redirect HTTP URLs over to their HTTPS twins (this can be done with a single redirect rule in .htaccess, as in the sketch after this list; you don't need one-to-one redirects)
- In addition to your HTTP>HTTPS redirects, you should also use canonical tags to push your preferred version to search engines
- Your HTTPS site should have its own robots.txt file
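As a rough illustration, and assuming an Apache server where you can edit .htaccess (on a hosted SaaS platform you may instead have to look for a "force SSL" or canonical-domain setting), a blanket HTTP-to-HTTPS rule looks something like this:
RewriteEngine On
# Any request that did not arrive over HTTPS gets a 301 to the same URL on HTTPS
RewriteCond %{HTTPS} !=on
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
The canonical-tag counterpart is a <link rel="canonical" href="https://www.example.com/page/"> element (placeholder URL) in the head of each page, pointing at the HTTPS version of that page.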
Related Questions
-
Resolving 301 Redirect Chains from Different URL Versions (http, https, www, non-www)
Hi all, Our website has undergone both a redesign (with new URLs) and a migration to HTTPS in recent years. I'm having difficulties ensuring all URLs redirect to the correct version all the while preventing redirect chains. Right now everything is redirecting to the correct version but it usually takes up to two redirects to make this happen. See below for an example. How do I go about addressing this, or is this not even something I should concern myself with?
Example redirect chain (2 redirects):
http://www.theyoungfirm.com/blog/2009/index.html
301 -> https://theyoungfirm.com/blog/2009/index.html
301 -> https://theyoungfirm.com/blog/
The code below was what we added to our htaccess file. Prior to adding this, the various subdomain versions (www, non-www, http, etc.) were not redirecting properly. But ever since we added it, it's now created these additional URLs (the intermediate URL in the chain above) as a middle step before resolving to the correct URL.
RewriteEngine on
RewriteCond %{HTTP_HOST} ^www.(.*)$ [NC]
RewriteRule ^(.*)$ https://%1/$1 [R=301,L]
RewriteCond %{HTTPS} !on
RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
Your feedback is much appreciated. Thanks in advance for your help. Sincerely, Bethany
Technical SEO | theyoungfirm
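A common way to collapse the host and scheme hops into a single redirect, sketched here on the assumption that https://theyoungfirm.com (non-www) is the canonical version, is to hardcode the target in one combined rule:
RewriteEngine On
# If the host starts with www OR the request is not over HTTPS, jump straight to the canonical https non-www URL
RewriteCond %{HTTP_HOST} ^www\. [NC,OR]
RewriteCond %{HTTPS} !=on
RewriteRule ^(.*)$ https://theyoungfirm.com/$1 [R=301,L]
The remaining hop in the example (from /blog/2009/index.html to /blog/) appears to come from a separate redirect and would need to be handled on its own.
-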
Blocking pages from Moz and Alexa robots
Hello, We want to block all pages in this directory from the Moz and Alexa robots - /slabinventory/search/ Here is an example page - https://www.msisurfaces.com/slabinventory/search/granite/giallo-fiesta/los-angeles-slabs/msi/ Let me know if this is a valid disallow for what I'm trying to do.
User-agent: ia_archiver
Disallow: /slabinventory/search/*
User-agent: rogerbot
Disallow: /slabinventory/search/*
Thanks.
Technical SEO | Pushm
-
URL 301 Re-direct
Hello, If we publish a blog post with a URL which accidentally contains a number at the end (blog.companyname.com/subject-title-0), is it best practice to update the URL (e.g. to blog.companyname.com/subject-title) and put in a 301 redirect from the old one to the new one, or should it simply be left as is? I've read that 301s lose link equity and relevance, so is it really worth redirecting for the sake of a cleaner URL? Thanks for your input! John
Technical SEO | SEOCT1
-
Robots.txt vs. meta noindex, follow
Hi guys, I wonder what your opinion is concerning exclusion via the robots.txt file. Do you advise to keep using this? For example:
User-agent: *
Disallow: /sale/*
Disallow: /cart/*
Disallow: /search/
Disallow: /account/
Disallow: /wishlist/*
Or do you prefer using the meta tag 'noindex, follow' instead? I keep hearing different suggestions. I'm just curious what your opinion / suggestion is. Regards, Tom Vledder
Technical SEO | AdenaSEO
-
RegEx help needed for robots.txt potential conflict
I've created a robots.txt file for a new Magento install and used an existing site-map that was on the Magento help forums but the trouble is I can't decipher something. It seems that I am allowing and disallowing access to the same expression for pagination. My robots.txt file (and a lot of other Magento site-maps it seems) includes both:
Allow: /*?p=
and
Disallow: /?p=&
I've searched for help on RegEx and I can't see what "&" does but it seems to me that I'm allowing crawler access to all pagination URLs, but then possibly disallowing access to all pagination URLs that include anything other than just the page number? I've looked at several resources and there is practically no reference to what "&" does... Can anyone shed any light on this, to ensure I am allowing suitable access to a shop? Thanks in advance for any assistance
Technical SEO | MSTJames0
-
Do I need robots.txt and meta robots?
If I can manage to tell crawlers what I do and don't want them to crawl for my whole site via my robots.txt file, do I still need meta robots instructions?
Technical SEO | Nola5040
-
What is the sense of robots.txt?
Using robots.txt to prevent search engines from indexing a page is not a good idea, so what is the sense of robots.txt? Just for attracting robots to crawl the sitemap?
Technical SEO | jallenyang0