Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
What happens to crawled URLs subsequently blocked by robots.txt?
-
We have a very large store with 278,146 individual product pages. Since these are all various sizes and packaging quantities of less than 200 product categories my feeling is that Google would be better off making sure our category pages are indexed.
I would like to block all product pages via robots.txt until we are sure all category pages are indexed, then unblock them. Our product pages rarely change, no ratings or product reviews so there is little reason for a search engine to revisit a product page.
The sales team is afraid blocking a previously indexed product page will result in in it being removed from the Google index and would prefer to submit the categories by hand, 10 per day via requested crawling.
Which is the better practice?
-
@aspenfasteners To my understanding, disallowing a page or folder in robots.txt does not remove pages from Google's index. It merely gives a directive to not crawl those pages/folders. In fact, when pages are accidentally indexed and one wants to remove them from the index, it is important to actually NOT disallow them in robots.txt, so that Google can crawl those pages and discover the meta NOINDEX tags on the pages. The meta NOINDEX tags are the directive to remove a page from the index, or to not index it in the first place. This is different than a robots.txt directive, whcih is intended to allow or disallow crawling. Crawling does not equal indexing.
So, you could keep the pages indexable, and simply block them in your robots.txt file, if you want. If they've already been indexed, they should not disappear quickly (they might, over time though). BUT if they haven't been indexed yet, this would prevent them from being discovered.
All of that said, from reading your notes, I don't think any of this is warranted. The speed at which Google discovers pages on a website is very fast. And existing indexed pages shouldn't really get in the way of new discovery. In fact, they might help the category pages be discovered, if they contain links to the categories.
I would create a categories sitemap xml file, link to that in your robots.txt, and let that do the work of prioritizing the categories for crawling/discovery and indexation.
-
@aspenfasteners to answer your question: "do we KNOW that Google will immediately de-index URL's blocked by robots.txt?"
Google will not immediately de-index URLs that are blocked by robots.txt, based on my experience. I've dealt with very similar situation but with much greater scale - around 8M automatically generated pages that got into Google index. It may take a year or more to de-index these pages completely. Of course, every case is different, but based on my understanding, if you block these low-quality product pages, Google will slowly start re-evaluating these pages, and it will start with the ones that get some traffic.
Here is what happens when Google re-evaluates your individual product pages:
When deciding, whether to keep a page in its index or not, Google takes into account multiple factors, and one of the most important ones is how many backlinks (both internal and external) are leading to a page. Other factors - content quality, if the page is similar or duplicate to another page, Core Web Vitals score, amount of your crawl budget, and, of course, external backlinks (which is irrelevant for your case).
If you are afraid of loosing some traffic that comes to these product pages, or you have other concerns, just do a smaller experiment: take a sample of 1000-2000 pages, block them in robots.txt or by adding meta robots "noindex, follow" directive, and observe Google's reaction in 1-6 weeks, depending on your crawl budget.
Another thing to check:
If you use Screaming Frog, it has a nice feature to show internal pagerank and the number of internal incoming links that lead to every page. As a rule of thumb, if an individual product page has at least 10 internal incoming links from canonicalized pages, there is a high probability it will get indexed.
-
@terentyev - sorry, can't edit my questions once submitted and I wait for approval (why?) the statement should read my question SHOULD be very specific, whereas my original question was much more general - you answered that question very nicely. Sorry for any misunderstanding
-
@terentyev thanks for the reply. We have no reason to believe these URL's are backlinked. These aren't consumer products that individual are interested in, our site is a wholesale B2B selling very narrow categories in bulk quantities typically for manufacturing. Therefore, almost zero chance for backlinks anywhere for something as specific as a particular size/material/package quantity of a product.
We have already initiated a canonicalization project started but we are stuck between two concerns from sales, 1) we can't wait for canonicalization (which is complex) we need sales now and 2) don't touch robots.txt because MAYBE the individual products are indexed.
So that is why my question is very specific - do we KNOW that Google will immediately de-index URL's blocked by robots.txt?
-
@aspenfasteners thanks for interesting question.
to summarize my understanding:- you have ~300K individual product pages, many of them are duplicates; eg. a single product can have multiple characteristics (eg. size or quantity) but the pages are essentially the same.
- your goal is to index 200 product categories that contain a collection of these products, and remove the low-quality duplicate individual pages from Google index in the long run.
- my assumption is that these 300K product pages have been historically accumulating some backlinks, which is one of the reasons why they are indexed.
If I am right about the 1 and 2, then you should not block these individual product pages, but rather add canonical URLs to them, which should point to the respective category page that you want to get indexed.
Once you have these canonicals implemented, you should wait for a few months or more for Google to pass the link equity to your 200 product category pages, and once it is done, you are free to block them from indexing on robots.txt + meta tag on the page itself, and maybe even x-robots-tag. The way how to block them - it is a different discussion. Let me know if you want to learn more on the best approach.
So, here is my checklist for this URL migration:
- add canonicals pointing from product pages to category pages.
- make sure that all category pages are well interlinked between each other, and the individual product pages are linked to several category pages (eg. a product A should be linked to category A, and also to similar categories B & C). As a rule of thumb, make sure that each category page has at least 10 incoming links from other category pages.
- Make sure that all these category pages are linked from your homepage
- Make sure that sitemap contains only self-canonicalized pages.
- Make sure that these category pages have good core web vitals metrics, compared to your competitors on SERP.
- In 2-3 months, when you see that Google indexes the category pages, and crawling of product pages have been reduced significantly, and the ranks of the category pages have gone up, it is ok to block these 300K pages from crawling.
As to manually submitting the categories by hand, I doubt it will help, especially if the product pages have a lot of backlinks. I've seen many cases when Google disregards the robots.txt directives if a page has good backlinks and traffic.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
If my website do not have a robot.txt file, does it hurt my website ranking?
After a site audit, I find out that my website don't have a robot.txt. Does it hurt my website rankings? One more thing, when I type mywebsite.com/robot.txt, it automatically redirect to the homepage. Please help!
Intermediate & Advanced SEO | | binhlai0 -
Attack of the dummy urls -- what to do?
It occurs to me that a malicious program could set up thousands of links to dummy pages on a website: www.mysite.com/dynamicpage/dummy123 www.mysite.com/dynamicpage/dummy456 etc.. How is this normally handled? Does a developer have to look at all the parameters to see if they are valid and if not, automatically create a 301 redirect or 404 not found? This requires a table lookup of acceptable url parameters for all new visitors. I was thinking that bad url names would be rare so it would be ok to just stop the program with a message, until I realized someone could intentionally set up links to non existent pages on a site.
Intermediate & Advanced SEO | | friendoffood1 -
"noindex, follow" or "robots.txt" for thin content pages
Does anyone have any testing evidence what is better to use for pages with thin content, yet important pages to keep on a website? I am referring to content shared across multiple websites (such as e-commerce, real estate etc). Imagine a website with 300 high quality pages indexed and 5,000 thin product type pages, which are pages that would not generate relevant search traffic. Question goes: Does the interlinking value achieved by "noindex, follow" outweigh the negative of Google having to crawl all those "noindex" pages? With robots.txt one has Google's crawling focus on just the important pages that are indexed and that may give ranking a boost. Any experiments with insight to this would be great. I do get the story about "make the pages unique", "get customer reviews and comments" etc....but the above question is the important question here.
Intermediate & Advanced SEO | | khi50 -
Will a disclaimer affect Crawling?
Hello everyone! My German users will have to get a disclaimer according to German laws, now my question is the following: Will a disclaimer affect crawling? What's the best practice to have regarding this? Should I have special care in this? What's the best disclaimer technique? A Plain HTML page? Something overlapping the site? Thank you all!
Intermediate & Advanced SEO | | NelsonF0 -
Robots.txt, does it need preceding directory structure?
Do you need the entire preceding path in robots.txt for it to match? e.g: I know if i add Disallow: /fish to robots.txt it will block /fish
Intermediate & Advanced SEO | | Milian
/fish.html
/fish/salmon.html
/fishheads
/fishheads/yummy.html
/fish.php?id=anything But would it block?: en/fish
en/fish.html
en/fish/salmon.html
en/fishheads
en/fishheads/yummy.html
**en/fish.php?id=anything (taken from Robots.txt Specifications)** I'm hoping it actually wont match, that way writing this particular robots.txt will be much easier! As basically I'm wanting to block many URL that have BTS- in such as: http://www.example.com/BTS-something
http://www.example.com/BTS-somethingelse
http://www.example.com/BTS-thingybob But have other pages that I do not want blocked, in subfolders that also have BTS- in, such as: http://www.example.com/somesubfolder/BTS-thingy
http://www.example.com/anothersubfolder/BTS-otherthingy Thanks for listening0 -
Robots.txt: Can you put a /* wildcard in the middle of a URL?
We have noticed that Google is indexing the language/country directory versions of directories we have disallowed in our robots.txt. For example: Disallow: /images/ is blocked just fine However, once you add our /en/uk/ directory in front of it, there are dozens of pages indexed. The question is: Can I put a wildcard in the middle of the string, ex. /en/*/images/, or do I need to list out every single country for every language in the robots file. Anyone know of any workarounds?
Intermediate & Advanced SEO | | IHSwebsite0 -
Robots.txt is blocking Wordpress Pages from Googlebot?
I have a robots.txt file on my server, which I did not develop, it was done by the web designer at the company before me. Then there is a word press plugin that generates a robots.txt file. How Do I unblock all the wordpress pages from googlebot?
Intermediate & Advanced SEO | | ENSO0 -
Multiple URLs for the same page
I am working with a client and recently discovered that they have several URLs that go to the same page. http://www.maps.com/FunFacts.aspx
Intermediate & Advanced SEO | | WebMarketingandDesign
http://www.maps.com/funfacts.aspx
http://www.maps.com/FunFacts.aspx?nav=FF
http://www.maps.com/FunFacts.aspx?nav=FS
http://www.maps.com/funfacts.aspx?nav=FF
http://www.maps.com/funfacts.aspx?nav=ffhttp://www.maps.com/FunFacts.aspx?nav=MShttp://www.maps.com/funfacts.aspx?nav=
http://www.maps.com/FunFacts.aspx?nav=FF#
http://www.maps.com/FunFacts
http://www.maps.com/funfacts.aspx?.nav=FF I am afraid this is happening all over the site. So, my question is: Is this hurting the SEO and how? If so what is the best way to go about fixing this problem? Thanks for your help!0