Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Google indexing despite robots.txt block
-
Hi
This subdomain has about 4'000 URLs indexed in Google, although it's blocked via robots.txt: https://www.google.com/search?safe=off&q=site%3Awww1.swisscom.ch&oq=site%3Awww1.swisscom.ch
This has been the case for almost a year now, and it does not look like Google tends to respect the blocking in http://www1.swisscom.ch/robots.txt
Any clues why this is or what I could do to resolve it?
Thanks!
-
It sounds like Martijn solved your problem, but I still wanted to add that robots.txt exclusions keep search bots from reading pages that are disallowed, but it does not stop those pages from being returned in search results. When those pages do appear, a lot of times they'll have a page description along the lines of "A description of this page is not available due to this sites robots.txt".
If you want to ensure that pages are kept out of search engines results, you have to use the noindex meta tag on each page.
-
Yes, I think the crucial point is that addressing googlebot wouldn't resolve the specific problem I have here.
I would have tried adressing googlebot otherwise. But to be honest, I wouldn't have expected a much different result than specifying all user agents. Googlebot should be part of that exclusion in any case.
-
I thought that value was a bit outdated, turns out to be still accepted. Although it probably only address this issue for him in Google and I assume it will still remain one in other search engines.
Besides that the problem offered a way better solution in allowing Google not on the HTTPS site.
-
Specifically for Googlebot. I'm pretty surprised people would disagree - Stephan Spencer recommended this in a personal conversation with me.
-
Did you mean a noindex tags for robots or a specific one for googlebot? With the second one I probably get the downvotes.
-
People who are disagreeing with this, explain your reasoning.
-
A noindex tag specific to Googlebot would also be a good idea.
-
You're welcome, it was mostly due to noticing that the first snippet, the homepage, had no snippet and the rest of the pages did have one. That led me to looking at their URL structure. Good luck fixing it!
-
100 points for you Martijn, thanks! I'm pretty sure you've found the problem and I'll go about fixing it. Gotta get used to having https used more frequently now...
-
Hi Phillipp,
You almost got me with this one, but it's fairly simple. In your question you're pointing at the robots.txt of your HTTP page. But it's mostly your HTTP**S **pages that are indexed and if you look at that robots.txt file it's pretty clear why these pages are indexed: https://www1.swisscom.ch/robots.txt all the pages that are indexed match with one of your Allow statements are the complete Disallow. Hopefully that provides you with the insight on how to fix your issue.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Not all images indexed in Google
Hi all, Recently, got an unusual issue with images in Google index. We have more than 1,500 images in our sitemap, but according to Search Console only 273 of those are indexed. If I check Google image search directly, I find more images in index, but still not all of them. For example this post has 28 images and only 17 are indexed in Google image. This is happening to other posts as well. Checked all possible reasons (missing alt, image as background, file size, fetch and render in Search Console), but none of these are relevant in our case. So, everything looks fine, but not all images are in index. Any ideas on this issue? Your feedback is much appreciated, thanks
Technical SEO | | flo_seo1 -
Pages are Indexed but not Cached by Google. Why?
Hello, We have magento 2 extensions website mageants.com since 1 years google every 15 days cached my all pages but suddenly last 15 days my websites pages not cached by google showing me 404 error so go search console check error but din't find any error so I have cached manually fetch and render but still most of pages have same 404 error example page : - https://www.mageants.com/free-gift-for-magento-2.html error :- http://webcache.googleusercontent.com/search?q=cache%3Ahttps%3A%2F%2Fwww.mageants.com%2Ffree-gift-for-magento-2.html&rlz=1C1CHBD_enIN803IN804&oq=cache%3Ahttps%3A%2F%2Fwww.mageants.com%2Ffree-gift-for-magento-2.html&aqs=chrome..69i57j69i58.1569j0j4&sourceid=chrome&ie=UTF-8 so have any one solutions for this issues
Technical SEO | | vikrantrathore0 -
Robots.txt on subdomains
Hi guys! I keep reading conflicting information on this and it's left me a little unsure. Am I right in thinking that a website with a subdomain of shop.sitetitle.com will share the same robots.txt file as the root domain?
Technical SEO | | Whittie0 -
How to block text on a page to be indexed?
I would like to block the spider indexing a block of text inside a page , however I do not want to block the whole page with, for example , a noindex tag. I have tried already with a tag like this : chocolate pudding chocolate pudding However this is not working for my case, a travel related website. thanks in advance for your support. Best regards Gianluca
Technical SEO | | CharmingGuy0 -
Robots.txt and Multiple Sitemaps
Hello, I have a hopefully simple question but I wanted to ask to get a "second opinion" on what to do in this situation. I am working on a clients robots.txt and we have multiple sitemaps. Using yoast I have my sitemap_index.xml and I also have a sitemap-image.xml I do put them in google and bing by hand but wanted to have it added into the robots.txt for insurance. So my question is, when having multiple sitemaps called out on a robots.txt file does it matter if one is before the other? From my reading it looks like you can have multiple sitemaps called out, but I wasn't sure the best practice when writing it up in the file. Example: User-agent: * Disallow: Disallow: /cgi-bin/ Disallow: /wp-admin/ Disallow: /wp-content/plugins/ Sitemap: http://sitename.com/sitemap_index.xml Sitemap: http://sitename.com/sitemap-image.xml Thanks a ton for the feedback, I really appreciate it! :) J
Technical SEO | | allstatetransmission0 -
Blocking Affiliate Links via robots.txt
Hi, I work with a client who has a large affiliate network pointing to their domain which is a large part of their inbound marketing strategy. All of these links point to a subdomain of affiliates.example.com, which then redirects the links through a 301 redirect to the relevant target page for the link. These links have been showing up in Webmaster Tools as top linking domains and also in the latest downloaded links reports. To follow guidelines and ensure that these links aren't counted by Google for either positive or negative impact on the site, we have added a block on the robots.txt of the affiliates.example.com subdomain, blocking search engines from crawling the full subddomain. The robots.txt file is the following code: User-agent: * Disallow: / We have authenticated the subdomain with Google Webmaster Tools and made certain that Google can reach and read the robots.txt file. We know they are being blocked from reading the affiliates subdomain. However, we added this affiliates subdomain block a few weeks ago to the robots.txt, but links are still showing up in the latest downloads report as first being discovered after we added the block. It's been a few weeks already, and we want to make sure that the block was implemented properly and that these links aren't being used to negatively impact the site. Any suggestions or clarification would be helpful - if the subdomain is being blocked for the search engines, why are the search engines following the links and reporting them in the www.example.com subdomain GWMT account as latest links. And if the block is implemented properly, will the total number of links pointing to our site as reported in the links to your site section be reduced, or does this not have an impact on that figure?From a development standpoint, it's a much easier fix for us to adjust the robots.txt file than to change the affiliate linking connection from a 301 to a 302, which is why we decided to go with this option.Any help you can offer will be greatly appreciated.Thanks,Mark
Technical SEO | | Mark_Ginsberg0 -
Should I block robots from URLs containing query strings?
I'm about to block off all URLs that have a query string using robots.txt. They're mostly URLs with coremetrics tags and other referrer info. I figured that search engines don't need to see these as they're always better off with the original URL. Might there be any downside to this that I need to consider? Appreciate your help / experiences on this one. Thanks Jenni
Technical SEO | | ShearingsGroup0 -
Does Google index XML files?
Does Google or other search engines include XML files in their index? More specifically, I am wondering how Google knows the difference between an xml filetype and an RSS feed.
Technical SEO | | nicole.healthline0