Blocking https from being crawled
-
I have an ecommerce site where the https versions of some pages are being crawled. I'm wondering whether the solution below will fix the issue.
Let's say www.example.com is my domain.
In the nav there is a login page, www.example.com/login, which redirects to https://www.example.com/login.
If I just disallowed /login in the robots.txt file, wouldn't that stop Google from following the redirect and indexing that stuff?
The redirect part is what I am questioning.
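For reference, the rule I have in mind is just a one-line disallow in the robots.txt at the root of www.example.com (a sketch of the plan, not live yet):

# Keep crawlers out of the login page that redirects to https
User-agent: *
Disallow: /login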
-
Correct: once /login gets redirected to https://www.example.com/login, all the nav links etc. are https.
What I ended up doing was blocking /login in robots.txt, adding canonicals on the https pages, and nofollowing the /login link in the nav that triggers the redirect.
Will see what happens now.
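Roughly what the nofollowed nav link looks like now (simplified markup, just to illustrate):

<!-- Login link in the nav; rel="nofollow" so crawlers don't follow it into the https redirect -->
<a href="/login" rel="nofollow">Login</a>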
-
So, the "/login" page gets redirected to https: and then every link on that page goes secure and Google crawls them all? I think blocking the "/login" page is a perfectly good way to go here - cut the crawl path, and you'll cut most of the problem.
You could request removal of "/login" in Google Webmaster Tools, too. Sometimes I find that robots.txt isn't great at removing pages that are already indexed. I would definitely add the canonical as well, if it's feasible. Cutting the path may not cut the pages that have already been indexed under https:.
Sorry, I'd actually reverse that:
(1) Add the canonicals, and let Google sweep up the duplicates
(2) A few weeks later, block the "/login" page
Sounds counter-intuitive, but if you block the crawl path to the https: pages first, then Google won't crawl the canonical tags on those versions. Use canonical to clean up the index, and then block the page to prevent future problems.
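Just to be concrete about the canonical: it's a tag in the head of the https: copy of each page, pointing back at the http: URL you want indexed, along these lines (the path here is just an example):

<!-- On https://www.example.com/some-page, point Google at the http: version -->
<link rel="canonical" href="http://www.example.com/some-page" />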
-
Gotcha. Yeah, I mentioned above that I was going to add a canonical as well as a noindex meta tag, but I was curious how Google would handle the redirect that was happening.
Thanks for your help.
-
Yeah, I was going to nofollow the link in the nav and add a meta tag, but I was curious how the robots.txt file would handle this since the URL is a redirect.
Thanks for your input.
-
Are the pages that are being crawled under https also available under http? If yes, can you just add a canonical tag on these pages pointing to the http version? That should fix it. And if your login page is the only entry point, your fix will help as well. But then, as Rebekah said, what if somebody is linking to your https pages? I would suggest you look into adding a canonical tag on these pages pointing to http, if that makes sense and is doable.
-
You can disallow the https portion of the site in robots.txt, but remember robots.txt isn't always a surefire way of keeping an area of your site from being crawled. If there is other important content to crawl that's reachable only through the secured page, be careful you aren't blocking robots from it.
If the page is linked from other places on the web, and those links don't include nofollow, search engines may still crawl it. Can you change the link in your navigation to nofollow as well? I would also add a meta noindex tag to the page itself, and a canonical tag on the https versions.
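The noindex part would just be a meta robots tag in the head of the login page, something like this (sketch only):

<!-- On www.example.com/login: keep the page out of the index even if it gets crawled -->
<meta name="robots" content="noindex">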
Related Questions
-
Forced Redirects/HTTP<>HTTPS 301 Question
Hi All, Sorry for what's about to be a long-ish question, but tl;dr: Has anyone else had experience with a 301 redirect at the server level between HTTP and HTTPS versions of a site in order to maintain accurate social media share counts? This is new to me and I'm wondering how common it is.
I'm having issues with this forced redirect between HTTP/HTTPS as outlined below and am struggling to find any information that will help me to troubleshoot this or better understand the situation. If anyone has any recommendations for things to try or sources to read up on, I'd appreciate it. I'm especially concerned about any issues that this may be causing at the SEO level and the known-unknowns.
A magazine I work for recently relaunched after switching platforms from Atavist to Newspack (which is run via WordPress). Since then, we've been having some issues with 301s, but they relate to new stories that are native to our new platform/CMS and have had zero URL changes. We've always used HTTPS.
Basically, the preview for any post we make linking to the new site, including these new (non-migrated) pages, shows up on Facebook with a 301 in the title and no image. This also overrides the social media metadata we set through Yoast Premium. I ran some of the links through the Facebook debugger and it appears that Facebook is reading these links to our site (using https) as redirects to http that then redirect to https.
I was told by our tech support person on Newspack's team that this is intentional, so that Facebook will maintain accurate share counts versus separate share counts for http/https; however, this forced redirect seems to be failing if we can't post our links with any metadata. (The only way to reliably fix it is by adding a query parameter to each URL which, obviously, still gives us inaccurate share counts.)
This is the first time I've encountered this intentional redirect thing and I've asked a few times for more information about how it's set up just for my own edification, but all I can get is that it's something managed at the server level and is designed to prevent separate share counts for HTTP and HTTPS. Has anyone encountered this method before, and can anyone either explain it to me or point me in the direction of a resource where I can learn more about how it's configured, as well as the pros and cons? I'm especially concerned about our SEO with this and how this may impact the way search engines read our site. So far, nothing's come up on scans, but I'd like to stay one step ahead of this. Thanks in advance!
Technical SEO | ogiovetti
-
Google Webmaster Tools is saying "Sitemap contains urls which are blocked by robots.txt" after Https move...
Hi Everyone, I really don't see anything wrong with our robots.txt file after our https move that just happened, but Google says all URLs are blocked. The only change I know we need to make is changing the sitemap URL to https. Anything you all see wrong with this robots.txt file?

# robots.txt
# This file is to prevent the crawling and indexing of certain parts of your site
# by web crawlers and spiders run by sites like Yahoo! and Google. By telling these
# "robots" where not to go on your site, you save bandwidth and server resources.
# This file will be ignored unless it is at the root of your host:
# Used:    http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/wc/robots.html
# For syntax checking, see:
# http://www.sxw.org.uk/computing/robots/check.html

# Website Sitemap
Sitemap: http://www.bestpricenutrition.com/sitemap.xml

# Crawlers Setup
User-agent: *

# Allowable Index
Allow: /*?p=
Allow: /index.php/blog/
Allow: /catalog/seo_sitemap/category/

# Directories
Disallow: /404/
Disallow: /app/
Disallow: /cgi-bin/
Disallow: /downloader/
Disallow: /includes/
Disallow: /lib/
Disallow: /magento/
Disallow: /pkginfo/
Disallow: /report/
Disallow: /stats/
Disallow: /var/

# Paths (clean URLs)
Disallow: /index.php/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
Disallow: /catalogsearch/
Disallow: /checkout/
Disallow: /control/
Disallow: /contacts/
Disallow: /customer/
Disallow: /customize/
Disallow: /newsletter/
Disallow: /poll/
Disallow: /review/
Disallow: /sendfriend/
Disallow: /tag/
Disallow: /wishlist/
Disallow: /aitmanufacturers/index/view/
Disallow: /blog/tag/
Disallow: /advancedreviews/abuse/reportajax/
Disallow: /advancedreviews/ajaxproduct/
Disallow: /advancedreviews/proscons/checkbyproscons/
Disallow: /catalog/product/gallery/
Disallow: /productquestions/index/ajaxform/

# Files
Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /STATUS.txt

# Paths (no clean URLs)
Disallow: /*.php$
Disallow: /*?SID=
disallow: /*?cat=
disallow: /*?price=
disallow: /*?flavor=
disallow: /*?dir=
disallow: /*?mode=
disallow: /*?list=
disallow: /*?limit=5
disallow: /*?limit=10
disallow: /*?limit=15
disallow: /*?limit=20
disallow: /*?limit=25

Technical SEO | vetofunk
-
Increase in Crawl Errors
I had a problem with a lot of crawl errors (on Google Search Console) a while back, due to the removal of a shopping cart. I thought I'd dealt with this & Google seemed to agree (see attached pic), but now they're all back with a vengeance! The crawl errors are all the old shop pages that I thought I'd made clear weren't there anymore. The sitemaps (using Yoast on WordPress to generate these) all updated 16 Aug, but the increase didn't happen till 18-20. How do I make it clear to Google that these pages are gone forever? (Attached: Screen-Shot-2016-08-22-at-10.19.05.png)
Technical SEO | abisti2
-
HTTP to HTTPS Transition, Large Drop in Search Traffic
My URL is: https://www.seattlecoffeegear.com
We implemented https across the site on Friday. Saturday and Sunday search traffic was normal/slightly higher than normal (in analytics) and slightly down in GWT. Today, it has dropped significantly in both, to about half of normal search traffic. From everything we can see, we implemented this correctly:
(1) 301 redirected all http requests to https (and yes, they go to the correct page and not to the homepage 😉 )
(2) Rewrote hardcoded internal links
(3) Registered/submitted sitemaps from https in Bing and GWT
(4) Used fetch and render to ensure Google could reach the site and also was redirected appropriately from http to https versions
(5) Ensured robots.txt does not block https or secure
We also use a CDN (though I don't think that impacts anything) and have had no customer issues with accessing or using the website since the transition.
Is there anything else I might be missing that could correlate to a drop in search impressions, or is this just a waiting game of a few days to let Google sort through the change we've made and reindex everything (it dropped to 0 indexed for a day and is now up to 1744 of our 2180 pages indexed)?
Thank you so much for any input!
Kaylie
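For context, the redirect in step (1) is the usual server-level force-to-https rule, roughly along these lines as a generic Apache sketch (simplified, and not necessarily our exact configuration):

# Illustrative sketch only: send every http request to its https equivalent with a 301
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]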
Technical SEO | Marketing.SCG
-
Duplicate Page Title Crawl Error Issue
In the last crawl for one of our client websites, the duplicate page title and duplicate page content numbers were very high. Every page is being read twice: http://www.barefootparadisevacations.com and http://barefootparadisevacations.com are being read as two different pages with the same page title. After the last crawl I used our built-in redirect tool to redirect the URLs, but the most recent crawl showed the same issue. Is this issue really hurting our rankings and, if so, any suggestions on a fix for the problem? Thank you!
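For reference, the kind of rule the redirect tool is meant to apply would look something like this generic Apache sketch (illustrative only; the tool may implement it differently):

# Illustrative sketch: 301 the bare domain to the www hostname so only one version gets crawled
RewriteEngine On
RewriteCond %{HTTP_HOST} ^barefootparadisevacations\.com$ [NC]
RewriteRule ^ http://www.barefootparadisevacations.com%{REQUEST_URI} [R=301,L]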
Technical SEO | LoveMyPugs
-
Google insists robots.txt is blocking... but it isn't.
I recently launched a new website. During development, I'd enabled the option in WordPress to prevent search engines from indexing the site. When the site went public (over 24 hours ago), I cleared that option. At that point, I added a specific robots.txt file that only disallowed a couple of directories of files. You can view the robots.txt at http://photogeardeals.com/robots.txt
Google (via Webmaster Tools) is insisting that my robots.txt file contains a "Disallow: /" on line 2 and that it's preventing Google from indexing the site and preventing me from submitting a sitemap. These errors are showing both in the sitemap section of Webmaster Tools as well as the Blocked URLs section. Bing's webmaster tools are able to read the site and sitemap just fine. Any idea why Google insists I'm disallowing everything even after telling it to re-fetch?
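To illustrate the mismatch: Google is behaving as if the old development-era file were still live, like the first block below, while the file actually being served only disallows a couple of directories, along the lines of the second block (the directory names here are placeholders, not the real ones):

# What Google appears to think is live (blocks the whole site)
User-agent: *
Disallow: /

# What the live robots.txt actually does (placeholder directory names)
User-agent: *
Disallow: /private-dir-one/
Disallow: /private-dir-two/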
Technical SEO | ahockley
-
Can I crawl a password protected domain with SEOmoz?
Hi everyone, Just wondered if anybody has been able to use the SEOmoz site crawler on password-protected domains? On Screaming Frog you are prompted for the username and password when you set the crawler running, but SEOmoz doesn't prompt for credentials. It seems you can only crawl sites that are live and publicly available - can anyone confirm if this is the case? Cheers, M
Technical SEO | edlondon
-
Google Webmaster tools vs SeoMOZ Crawl Diagnostics
Hi Guys, I was just looking over my weekly report and crawl diagnostics. What I've noticed is that the data gathered by SEOmoz is different from Google Webmaster Tools diagnostics. The number of errors - in particular duplicate page titles, duplicate content, and pages not found - is much higher than what Google Webmaster Tools reports. I'm a bit confused and don't know which data is more accurate. Please help!
Technical SEO | Tolod