Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Will disallowing URL's in the robots.txt file stop those URL's being indexed by Google
-
I found a lot of duplicate title tags showing in Google Webmaster Tools. When I visited the URL's that these duplicates belonged to, I found that they were just images from a gallery that we didn't particularly want Google to index. There is no benefit to the end user in these image pages being indexed in Google.
Our developer has told us that these urls are created by a module and are not "real" pages in the CMS.
They would like to add the following to our robots.txt file
Disallow: /catalog/product/gallery/
QUESTION: If the these pages are already indexed by Google, will this adjustment to the robots.txt file help to remove the pages from the index?
We don't want these pages to be found.
-
That's why I mentioned: "eventually". But thanks for the added information. Hopefully it's clear now for the original poster.
-
Looking at this video - https://www.youtube.com/watch?v=KBdEwpRQRD0&feature=youtu.be Matt Cutts advises to use the noindex tag on every individual page. However, this is very time consuming if you're dealing wit a large volume of pages.
The other option he recommends is to use the robots.txt file as well as the URL removal tool in GWMT, Although this is the second choice option, it does seem easier for us to implement than the noindex tag.
-
Hi,
Yes, if you put any url in the robots.txt it will not be shown in the search results after some time even if your pages were already indexed. Because when your disallow urls in the robots.txt , Google will stop crawling that page and eventually will stop indexing those pages.
-
Hi Nico
Great response thanks.
This is certainly something I'm taking into consideration and will question my developer about this.
-
Thanks Thomas.
I'm now finding out from my developer is we are able to noindex these pages with the meta robots.
If this is something that isn't possible, it's likely that we'll add to the robots.txt as you did.
Either way I think will be progress to different degrees.
-
I don' think Martijn's statement is quite correct as I have made different experiences in an accidental experiment. Crawling is not the same as indexing. Google will put pages it cannot crawl into the index ... and they will stay there unless removed somehow. They will probably only show up for specific searches, though
Completely agree, I have done the same for a website I am doing work with, ideally we would noindex with meta robots however that isn't possible. So instead we added to the robots.txt, the number of indexed pages have dropped, yet when you search exactly it just says the description can't be reached.
So I was happy with the results as they're now not ranking for the terms they were.
-
I don' think Martijn's statement is quite correct as I have made different experiences in an accidental experiment. Crawling is not the same as indexing. Google will put pages it cannot crawl into the index ... and they will stay there unless removed somehow. They will probably only show up for specific searches, though
In September 2015 I catapulted a website from ~3.000 to 130.000 indexed pages (roughly). 127.000 were essentially canonicalised duplicates (yes, it did make sense) but also blocked by robots.txt - but put into the index nonetheless. The problem was a dynamically generated parameter, always different, always blocked by robots.
The title was equal to the link text; the description became "A description for this result is not available because of this site's robots.txt – learn more." (If Google cannot crawl a URL Google will usually take titles from links pointing to that URL). No sign of disappearing. In fact, Google was happy to add more and more to its index ...
At the start of December 2015 I removed the robots.txt block - Google could now read the canonicals or noindex on the URLs ... the pages only began dropping out, slowly and in bunches of a few thousand in March 2016 - probably due to the very low relevancy and crawl budget assigned to them. Right now there are still about 24.000 pages in the index.
So my answer would be: No - disabling crawling in the robots.txt will NOT remove a page from the index. For that you need to noindex them (which sometimes also works if done in robots.txt, I've heard). Disallowing URLs in the robots.txt will very likely drop pages to the end of useful results, though, as Andy described. (I don't know if this has any influence on the general evaluation of the site as a whole; I'd guess not.)
Regards
Nico
-
Thanks Martijn. This is what I was assuming would happen. However, I got a confusing message from my developer which said the following,
"won't remove the URL's from the index but it will mean that they will only show up for very specific searches that customers are extremely unlikely to use. It will also increase Asgard's crawl budget as Google and Bing won't try to crawl these URLs. Would you be happy with this solution?"
I would tend to still agree with your statement though.
-
Yes they will be eventually. As you disallow Google to crawl the URLs it will probably start hiding the descriptions for some of these image pages soon as they can't crawl them anymore. Then at some point they'll stop looking at them at all.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Change Google's version of Canonical link
Hi My website has millions of URLs and some of the URLs have duplicate versions. We did not set canonical all these years. Now we wanted to implement it and fix all the technical SEO issues. I wanted to consolidate and redirect all the variations of a URL to the highest pageview version and use that as the canonical because all of these variations have the same content. While doing this, I found in Google search console that Google has already selected another variation of URL as canonical and not the highest pageview version. My questions: I have millions of URLs for which I have to do 301 and set canonical. How can I find all the canonical URLs that Google has autoselected? Search Console has a daily quota of 100 or something. Is it possible to override Google's version of Canonical? Meaning, if I set a variation as Canonical and it is different than what Google has already selected, will it change overtime in Search Console? Should I just do a 301 to highest pageview variation of the URL and not set canonicals at all? This way the canonical that Google auto selected might get redirected to the highest pageview variation of the URL. Any advice or help would be greatly appreciated.
Intermediate & Advanced SEO | | SDCMarketing0 -
How does educational organization schema interact with Google's knowledge graph?
Hi there! I was just wondering if the granular options of the Organization schema, like Educational Organization (http://schema.org/EducationalOrganization) and CollegeOrUniversity (http://schema.org/CollegeOrUniversity) schema work the same when it comes to pulling data into the knowledge graph. I've typically always used the Organization schema for customers but was wondering if there are any drawbacks for going deep into the hierarchy of schema. Cheers 😄
Intermediate & Advanced SEO | | Corbec8880 -
If I block a URL via the robots.txt - how long will it take for Google to stop indexing that URL?
If I block a URL via the robots.txt - how long will it take for Google to stop indexing that URL?
Intermediate & Advanced SEO | | Gabriele_Layoutweb0 -
Should I include URLs that are 301'd or only include 200 status URLs in my sitemap.xml?
I'm not sure if I should be including old URLs (content) that are being redirected (301) to new URLs (content) in my sitemap.xml. Does anyone know if it is best to include or leave out 301ed URLs in a xml sitemap?
Intermediate & Advanced SEO | | Jonathan.Smith0 -
Wildcarding Robots.txt for Particular Word in URL
Hey All, So I know that this isn't a standard robots.txt, I'm aware of how to block or wildcard certain folders but I'm wondering whether it's possible to block all URL's with a certain word in it? We have a client that was hacked a year ago and now they want us to help remove some of the pages that were being autogenerated with the word "viagra" in it. I saw this article and tried implementing it https://builtvisible.com/wildcards-in-robots-txt/ and it seems that I've been able to remove some of the URL's (although I can't confirm yet until I do a full pull of the SERPs on the domain). However, when I test certain URL's inside of WMT it still says that they are allowed which makes me think that it's not working fully or working at all. In this case these are the lines I've added to the robots.txt Disallow: /*&viagra Disallow: /*&Viagra I know I have the solution of individually requesting URL's to be removed from the index but I want to see if anybody has every had success with wildcarding URL's with a certain word in their robots.txt? The individual URL route could be very tedious. Thanks! Jon
Intermediate & Advanced SEO | | EvansHunt0 -
Google indexed wrong pages of my website.
When I google site:www.ayurjeewan.com, after 8 pages, google shows Slider and shop pages. Which I don't want to be indexed. How can I get rid of these pages?
Intermediate & Advanced SEO | | bondhoward0 -
Remove URLs that 301 Redirect from Google's Index
I'm working with a client who has 301 redirected thousands of URLs from their primary subdomain to a new subdomain (these are unimportant pages with regards to link equity). These URLs are still appearing in Google's results under the primary domain, rather than the new subdomain. This is problematic because it's creating an artificial index bloat issue. These URLs make up over 90% of the URLs indexed. My experience has been that URLs that have been 301 redirected are removed from the index over time and replaced by the new destination URL. But it has been several months, close to a year even, and they're still in the index. Any recommendations on how to speed up the process of removing the 301 redirected URLs from Google's index? Will Google, or any search engine for that matter, process a noindex meta tag if the URL's been redirected?
Intermediate & Advanced SEO | | trung.ngo0 -
Best way to permanently remove URLs from the Google index?
We have several subdomains we use for testing applications. Even if we block with robots.txt, these subdomains still appear to get indexed (though they show as blocked by robots.txt. I've claimed these subdomains and requested permanent removal, but it appears that after a certain time period (6 months)? Google will re-index (and mark them as blocked by robots.txt). What is the best way to permanently remove these from the index? We can't use login to block because our clients want to be able to view these applications without needing to login. What is the next best solution?
Intermediate & Advanced SEO | | nicole.healthline0