Can I Block https URLs using Host directive in robots.txt?
-
Hello Moz Community,
Recently, I have found that Google bots has started crawling HTTPs urls of my website which is increasing the number of duplicate pages at our website.
Instead of creating a separate robots.txt file for https version of my website, can I use Host directive in the robots.txt to suggest Google bots which is the original version of the website.
Host: http://www.example.com
I was wondering if this method will work and suggest Google bots that HTTPs URLs are the mirror of this website.
Thanks for all of the great responses!
Regards,
Ramendra -
Hi Ramendra,
To my knowledge, you can only provide directives in the robots.txt file for the domain on which it lives. This goes for both http/https and www/non-www versions of domains. This is why it's important to handle all preferred domain formatting with redirects, that point to your canonicalized version. So if you want http://www to index, all other versions redirect to that.
There might be a work around of some sort, but honestly, what I described above with redirection towards preferred versions is the direction you should take. Then you can manage one robots.txt file and your indexing will align with what you want better.
-
Thanks Logan,
I have read somewhere that using Host directive in the robots.txt file we can suggest Google bots which is the original version of the website if there are number of mirror sites. So, I was wondering if we can prevent indexing/crawling of HTTPS URLs by using Host directive in robots.txt of HTTP site.
We are using an ecommerce SAAS platform for our website where we have only one robots.txt file that we can use for HTTP site.
Is there any other way to prevent indexing/crawling of HTTPS URLs?
Regards,
Ramendra -
Hi Ramendra,
Based on what you said, it sounds like both versions of your site exist and are indexed, and you want to mitigate your duplicate content risk. If that's accurate, here are my recommendations on this:
- Robots.txt cannot be used on a HTTP site to prevent indexing/crawling of HTTPS URLs
- Google crawls HTTPS by default, so if your site is fully secure, then you need to redirect (this can be done with a redirect rule in HTACCESS, you don't need to do one-to-one redirects) HTTP URLs over to their HTTPS twin
- In addition to your HTTP>HTTPS redirects, you should also use canonical tags to push your preferred version to search engines
- Your HTTPS site should have its own robots.txt file
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
After you remove a 301 redirect that Google has processed, will the new URL retain any of the link equity from the old URL?
Lets say you 301 redirect URL A to URL B, and URL A has some backlinks from other sites. Say you left the 301 redirect in place for a year, and Google had already replaced the old URL with the new URL in the SERPs, would the new URL (B) retain some of the link equity from URL A after the 301 redirect was removed, or does the redirect have to remain in place forever?
Technical SEO | | johnwalkersmith0 -
Can redirect URL website also shown on the google ranking? and higher than the original website?
can redirect URL website also shown on the google ranking? and higher than the original website? For example, I create URL B which redirect to website A, and do good SEO on URL B, can URL B rank higher than my original website A?
Technical SEO | | HealthmateForever0 -
Using # in parameters?
I am trying to understand why a website would use # instead of a ? for its parameters? I have put an example of the URL below: http://www.warehousestationery.co.nz/office-supplies/adhesives-tapes-and-fastenings#prefn1=brand&prefn2=colour&prefv1=Command&prefv2=Clear Any help would be much appreciated.
Technical SEO | | CaitlinDW1 -
Google Webmaster Tools is saying "Sitemap contains urls which are blocked by robots.txt" after Https move...
Hi Everyone, I really don't see anything wrong with our robots.txt file after our https move that just happened, but Google says all URLs are blocked. The only change I know we need to make is changing the sitemap url to https. Anything you all see wrong with this robots.txt file? robots.txt This file is to prevent the crawling and indexing of certain parts of your site by web crawlers and spiders run by sites like Yahoo! and Google. By telling these "robots" where not to go on your site, you save bandwidth and server resources. This file will be ignored unless it is at the root of your host: Used: http://example.com/robots.txt Ignored: http://example.com/site/robots.txt For more information about the robots.txt standard, see: http://www.robotstxt.org/wc/robots.html For syntax checking, see: http://www.sxw.org.uk/computing/robots/check.html Website Sitemap Sitemap: http://www.bestpricenutrition.com/sitemap.xml Crawlers Setup User-agent: * Allowable Index Allow: /*?p=
Technical SEO | | vetofunk
Allow: /index.php/blog/
Allow: /catalog/seo_sitemap/category/ Directories Disallow: /404/
Disallow: /app/
Disallow: /cgi-bin/
Disallow: /downloader/
Disallow: /includes/
Disallow: /lib/
Disallow: /magento/
Disallow: /pkginfo/
Disallow: /report/
Disallow: /stats/
Disallow: /var/ Paths (clean URLs) Disallow: /index.php/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
Disallow: /catalogsearch/
Disallow: /checkout/
Disallow: /control/
Disallow: /contacts/
Disallow: /customer/
Disallow: /customize/
Disallow: /newsletter/
Disallow: /poll/
Disallow: /review/
Disallow: /sendfriend/
Disallow: /tag/
Disallow: /wishlist/
Disallow: /aitmanufacturers/index/view/
Disallow: /blog/tag/
Disallow: /advancedreviews/abuse/reportajax/
Disallow: /advancedreviews/ajaxproduct/
Disallow: /advancedreviews/proscons/checkbyproscons/
Disallow: /catalog/product/gallery/
Disallow: /productquestions/index/ajaxform/ Files Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /STATUS.txt Paths (no clean URLs) Disallow: /.php$
Disallow: /?SID=
disallow: /?cat=
disallow: /?price=
disallow: /?flavor=
disallow: /?dir=
disallow: /?mode=
disallow: /?list=
disallow: /?limit=5
disallow: /?limit=10
disallow: /?limit=15
disallow: /?limit=20
disallow: /*?limit=250 -
Can i use "nofollow" tag on product page (duplicated content)?
Hi, im working on my webstore SEO. I got descriptions from official seller like "Bosch". I got more than 15.000 items so i cant create unique content for each product. Can i use nofollow tag for each product and create great content on category pages? I dont wanna lose rankings because duplicated content. Thank you for help!
Technical SEO | | pejtupizdo0 -
Robots.txt issue - site resubmission needed?
We recently had an issue when a load of new files were transferred from our dev server to the live site, which unfortunately included the dev site's robots.txt file which had a disallow:/ instruction. Bad! Luckily I spotted it quickly and the file has been replaced. The extent of the damage seems to be that some descriptions aren't displaying and we're getting a message about robots.txt in the SERPs for a few keywords. I've done a site: search and generally it seems to be OK for 99% of our pages. Our positions don't seem to be affected right now but obviously it's not great for the CTRs on those keywords affected. My question is whether there is anything I can do to bring the updated robots.txt file to Google's attention? Or should we just wait and sit it out? Thanks in advance for your answers!
Technical SEO | | GBC0 -
Redirect non-www if using canonical url?
I have setup my website to use canonical urls on each page to point to the page i wish Google to refer to. At the moment, my non-www domain name is not redirected to www domain. Is this required if i have setup the canonical urls? This is the tag i have on my index.php page rel="canonical" href="http://www.mydomain.com.au" /> If i browse to http://mydomain.com.au should the link juice pass to http://www.armourbackups.com.au? Will this solve duplicate content problems? Thanks
Technical SEO | | blakadz0 -
Robots.txt question
What is this robots.txt telling the search engines? User-agent: * Disallow: /stats/
Technical SEO | | DenverKelly0