Can I Block https URLs using Host directive in robots.txt?
-
Hello Moz Community,
Recently, I have found that Google bots has started crawling HTTPs urls of my website which is increasing the number of duplicate pages at our website.
Instead of creating a separate robots.txt file for https version of my website, can I use Host directive in the robots.txt to suggest Google bots which is the original version of the website.
Host: http://www.example.com
I was wondering if this method will work and suggest Google bots that HTTPs URLs are the mirror of this website.
Thanks for all of the great responses!
Regards,
Ramendra -
Hi Ramendra,
To my knowledge, you can only provide directives in the robots.txt file for the domain on which it lives. This goes for both http/https and www/non-www versions of domains. This is why it's important to handle all preferred domain formatting with redirects, that point to your canonicalized version. So if you want http://www to index, all other versions redirect to that.
There might be a work around of some sort, but honestly, what I described above with redirection towards preferred versions is the direction you should take. Then you can manage one robots.txt file and your indexing will align with what you want better.
-
Thanks Logan,
I have read somewhere that using Host directive in the robots.txt file we can suggest Google bots which is the original version of the website if there are number of mirror sites. So, I was wondering if we can prevent indexing/crawling of HTTPS URLs by using Host directive in robots.txt of HTTP site.
We are using an ecommerce SAAS platform for our website where we have only one robots.txt file that we can use for HTTP site.
Is there any other way to prevent indexing/crawling of HTTPS URLs?
Regards,
Ramendra -
Hi Ramendra,
Based on what you said, it sounds like both versions of your site exist and are indexed, and you want to mitigate your duplicate content risk. If that's accurate, here are my recommendations on this:
- Robots.txt cannot be used on a HTTP site to prevent indexing/crawling of HTTPS URLs
- Google crawls HTTPS by default, so if your site is fully secure, then you need to redirect (this can be done with a redirect rule in HTACCESS, you don't need to do one-to-one redirects) HTTP URLs over to their HTTPS twin
- In addition to your HTTP>HTTPS redirects, you should also use canonical tags to push your preferred version to search engines
- Your HTTPS site should have its own robots.txt file
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Can you use a seperate url for a interior product page on a site?
I have a friend that has a health insurance agency site. He wants to add a new page, for child health care insurance to his existing site. But the issue is, he brought a new URL; insurancemykidnow.com and he want's to use it for the new page. Now, I'm not sure I'm right on this, but I don't think that can be done? I'm I wrong? = Thanks in advance.
Technical SEO | | Coppell0 -
Will a Robots.txt 'disallow' of a directory, keep Google from seeing 301 redirects for pages/files within the directory?
Hi- I have a client that had thousands of dynamic php pages indexed by Google that shouldn't have been. He has since blocked these php pages via robots.txt disallow. Unfortunately, many of those php pages were linked to by high quality sites mulitiple times (instead of the static urls) before he put up the php 'disallow'. If we create 301 redirects for some of these php URLs that area still showing high value backlinks and send them to the correct static URLs, will Google even see these 301 redirects and pass link value to the proper static URLs? Or will the robots.txt keep Google away and we lose all these high quality backlinks? I guess the same question applies if we use the canonical tag instead of the 301. Will the robots.txt keep Google from seeing the canonical tags on the php pages? Thanks very much, V
Technical SEO | | Voodak0 -
Blocked jquery in Robots.txt, Any SEO impact?
I've heard that Google is now indexing links and stuff available in javascript and jquery. My webmastertools is showing that some links are blocked in robots.txt of jquery. Sorry I'm not a developer or designer. I want to know is there any impact of this on my SEO? and also how can I unblock it for the robots? Check this screenshot: http://i.imgur.com/3VDWikC.png
Technical SEO | | hammadrafique0 -
Can hosting blog posts with keyword anchor text on outbound links cause a penalty?
My site received a Google penalty for having inbound links from blog posts with over-optimized ("spammy") anchor text. I spent months getting these links removed. Yesterday - I received a link deletion request from a site that my site had linked out to (three links via keyword anchor text relevant to their company) in a blog post. The "unnatural link" penalty still hasn't been removed from my site. My question is: Does the penalty work both ways? For having inbound "unnatural" links ... AND for having outbound "unnatural" links?
Technical SEO | | RedNovaLabs910 -
Which Web host do you use?
A friend of mine has a successful website which is hosted by the company he used to use for developing his site. As he no longer uses them feels he should use it. Who do you use for hosting a small to medium sized business?
Technical SEO | | Ant710 -
Does Bing ignore robots txt files?
Bonjour from "Its a miracle is not raining" Wetherby Uk 🙂 Ok here goes... Why despite a robots text file excluding indexing to site http://lewispr.netconstruct-preview.co.uk/ is the site url being indexed in Bing bit not Google? Does bing ignore robots text files or is there something missing from http://lewispr.netconstruct-preview.co.uk/robots.txt I need to add to stop bing indexing a preview site as illustrated below. http://i216.photobucket.com/albums/cc53/zymurgy_bucket/preview-bing-indexed.jpg Any insights welcome 🙂
Technical SEO | | Nightwing0 -
Robots.txt usage
Hey Guys, I am about make an important improvement to our site's robots.txt we have large number of properties on our site and we have different views for them. List, gallery and map view. By default list view shows up and user can navigate through gallery view. We donot want gallery pages to get indexed and want to save our crawl budget for more important pages. this is one example of our site: http://www.holiday-rentals.co.uk/France/r31.htm When you click on "gallery view" URL of this site will remain same in your address bar: but when you mouse over the "gallery view" tab it will show you URL with parameter "view=g". there are number of parameters: "view=g, view=l and view=m". http://www.holiday-rentals.co.uk/France/r31.htm?view=l http://www.holiday-rentals.co.uk/France/r31.htm?view=g http://www.holiday-rentals.co.uk/France/r31.htm?view=m Now my question is: I If restrict bots by adding "Disallow: ?view=" in our robots.txt will it effect the list view too? Will be very thankful if yo look into this for us. Many thanks Hassan I will test this on some other site within our network too before putting it to important one's. to measure the impact but will be waiting for your recommendations. Thanks
Technical SEO | | holidayseo0 -
Not sure which URL to use for 301 redirect
A client has new website design completed by another developer, was launched in April of this year. No 301 redirect was set up so duplicate content is an issue. Client has had a website with same domain name for about 10 years, but has not had any SEO work completed before or since his new site design. For non-www there are 6 referring links - 1 considered to have authority, for www there are also 6 but 3 considered to have authority. More links seem to coming from www than non-www. But for one of the clients keywords they are ranked #1 for their area and that links to their non-www address. And even though no redirects set up by developer, non-www has had far more visits according to Google Analytics. So many basics that still need to be done for site: no meta-descriptions on any page, H1 and page titles could use keywords, call to action moved above fold, etc. Considering this is a new site, and new SEO work and many more inbound links needed, does it matter which address I redirect to? _Cindy Barnard
Technical SEO | | CeCeBar0