Can I Block https URLs using Host directive in robots.txt?

TJC.co.uk

Hello Moz Community,

Recently, I found that Googlebot has started crawling the HTTPS URLs of my website, which is increasing the number of duplicate pages on our site.

Instead of creating a separate robots.txt file for the HTTPS version of my website, can I use the Host directive in robots.txt to tell Googlebot which version of the website is the original?

Host: http://www.example.com

I was wondering whether this method will work and tell Googlebot that the HTTPS URLs are a mirror of this website.

Thanks for all of the great responses!

Regards,
Ramendra

LoganRay

Hi Ramendra,

To my knowledge, you can only provide directives in a robots.txt file for the domain on which it lives. This goes for both the http/https and www/non-www versions of a domain. This is why it's important to handle all preferred-domain formatting with redirects that point to your canonical version. So if you want the http://www version indexed, all other versions should redirect to it.

There might be a workaround of some sort, but honestly, redirecting to the preferred version as I described above is the direction you should take. Then you can manage one robots.txt file, and your indexing will align better with what you want.
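
To illustrate why a robots.txt file only covers the version of the site it lives on, here is a minimal Python sketch. It assumes the usual convention that crawlers request robots.txt from the root of the same scheme and host as the page being crawled. The function name robots_txt_url is made up for this example, and www.example.com is a placeholder domain.

```python
from urllib.parse import urlsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url.

    Hypothetical helper for illustration: it keeps the page's own
    scheme and host, because robots.txt is looked up per origin.
    """
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# The same path on HTTP and HTTPS is governed by two different files.
http_robots = robots_txt_url("http://www.example.com/products/ring")
https_robots = robots_txt_url("https://www.example.com/products/ring")
print(http_robots)
print(https_robots)
```

Because the two URLs differ, a rule placed in the HTTP file is never consulted when the HTTPS pages are crawled.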

TJC.co.uk

Thanks Logan,

I have read somewhere that the Host directive in the robots.txt file lets us tell Googlebot which version of the website is the original when there are a number of mirror sites. So I was wondering whether we can prevent indexing/crawling of HTTPS URLs by using the Host directive in the robots.txt of the HTTP site.

We use an ecommerce SaaS platform for our website, and it gives us only one robots.txt file, which we can use for the HTTP site.

Is there any other way to prevent indexing/crawling of HTTPS URLs?

Regards,
Ramendra

LoganRay

Hi Ramendra,

Based on what you said, it sounds like both versions of your site exist and are indexed, and you want to mitigate your duplicate content risk. If that's accurate, here are my recommendations on this:

1. Robots.txt on an HTTP site cannot be used to prevent indexing/crawling of HTTPS URLs.
2. Google crawls HTTPS by default, so if your site is fully secure, you need to redirect HTTP URLs to their HTTPS twins. This can be done with a single redirect rule in .htaccess; you don't need one-to-one redirects.
3. In addition to your HTTP-to-HTTPS redirects, you should also use canonical tags to signal your preferred version to search engines.
4. Your HTTPS site should have its own robots.txt file.
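
As a rough sketch of the redirect and canonical-tag recommendations above, the Python below maps any http/https or www/non-www variant of a URL to a single preferred version and builds the matching canonical tag. The preferred scheme and host (https and www.example.com) are assumptions for the example, and the function names preferred_url and canonical_tag are hypothetical. On a real site the redirect itself would be done by the server or platform; this only shows the mapping a single pattern-based rule performs.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit

# Assumed preferred version; swap in your own scheme and host.
PREFERRED_SCHEME = "https"
PREFERRED_HOST = "www.example.com"


def preferred_url(url: str) -> str:
    """Map any scheme/host variant of a URL to the preferred version.

    The path and query string are kept, so one rule covers every page
    and no one-to-one redirect list is needed.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((PREFERRED_SCHEME, PREFERRED_HOST, path, parts.query, ""))


def canonical_tag(url: str) -> str:
    """Build the canonical link element pointing at the preferred version."""
    return f'<link rel="canonical" href="{escape(preferred_url(url), quote=True)}">'


for variant in (
    "http://example.com/shoes?size=9",
    "http://www.example.com/shoes?size=9",
    "https://example.com/shoes?size=9",
):
    print(variant, "->", preferred_url(variant))
print(canonical_tag("http://example.com/shoes?size=9"))
```

All three variants resolve to the same target, which is what the redirects and the canonical tag should agree on.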
