Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
What happens to crawled URLs subsequently blocked by robots.txt?
-
We have a very large store with 278,146 individual product pages. Since these are all various sizes and packaging quantities of less than 200 product categories my feeling is that Google would be better off making sure our category pages are indexed.
I would like to block all product pages via robots.txt until we are sure all category pages are indexed, then unblock them. Our product pages rarely change, no ratings or product reviews so there is little reason for a search engine to revisit a product page.
The sales team is afraid blocking a previously indexed product page will result in in it being removed from the Google index and would prefer to submit the categories by hand, 10 per day via requested crawling.
Which is the better practice?
-
@aspenfasteners To my understanding, disallowing a page or folder in robots.txt does not remove pages from Google's index. It merely gives a directive to not crawl those pages/folders. In fact, when pages are accidentally indexed and one wants to remove them from the index, it is important to actually NOT disallow them in robots.txt, so that Google can crawl those pages and discover the meta NOINDEX tags on the pages. The meta NOINDEX tags are the directive to remove a page from the index, or to not index it in the first place. This is different than a robots.txt directive, whcih is intended to allow or disallow crawling. Crawling does not equal indexing.
So, you could keep the pages indexable, and simply block them in your robots.txt file, if you want. If they've already been indexed, they should not disappear quickly (they might, over time though). BUT if they haven't been indexed yet, this would prevent them from being discovered.
All of that said, from reading your notes, I don't think any of this is warranted. The speed at which Google discovers pages on a website is very fast. And existing indexed pages shouldn't really get in the way of new discovery. In fact, they might help the category pages be discovered, if they contain links to the categories.
I would create a categories sitemap xml file, link to that in your robots.txt, and let that do the work of prioritizing the categories for crawling/discovery and indexation.
-
@aspenfasteners to answer your question: "do we KNOW that Google will immediately de-index URL's blocked by robots.txt?"
Google will not immediately de-index URLs that are blocked by robots.txt, based on my experience. I've dealt with very similar situation but with much greater scale - around 8M automatically generated pages that got into Google index. It may take a year or more to de-index these pages completely. Of course, every case is different, but based on my understanding, if you block these low-quality product pages, Google will slowly start re-evaluating these pages, and it will start with the ones that get some traffic.
Here is what happens when Google re-evaluates your individual product pages:
When deciding, whether to keep a page in its index or not, Google takes into account multiple factors, and one of the most important ones is how many backlinks (both internal and external) are leading to a page. Other factors - content quality, if the page is similar or duplicate to another page, Core Web Vitals score, amount of your crawl budget, and, of course, external backlinks (which is irrelevant for your case).
If you are afraid of loosing some traffic that comes to these product pages, or you have other concerns, just do a smaller experiment: take a sample of 1000-2000 pages, block them in robots.txt or by adding meta robots "noindex, follow" directive, and observe Google's reaction in 1-6 weeks, depending on your crawl budget.
Another thing to check:
If you use Screaming Frog, it has a nice feature to show internal pagerank and the number of internal incoming links that lead to every page. As a rule of thumb, if an individual product page has at least 10 internal incoming links from canonicalized pages, there is a high probability it will get indexed.
-
@terentyev - sorry, can't edit my questions once submitted and I wait for approval (why?) the statement should read my question SHOULD be very specific, whereas my original question was much more general - you answered that question very nicely. Sorry for any misunderstanding
-
@terentyev thanks for the reply. We have no reason to believe these URL's are backlinked. These aren't consumer products that individual are interested in, our site is a wholesale B2B selling very narrow categories in bulk quantities typically for manufacturing. Therefore, almost zero chance for backlinks anywhere for something as specific as a particular size/material/package quantity of a product.
We have already initiated a canonicalization project started but we are stuck between two concerns from sales, 1) we can't wait for canonicalization (which is complex) we need sales now and 2) don't touch robots.txt because MAYBE the individual products are indexed.
So that is why my question is very specific - do we KNOW that Google will immediately de-index URL's blocked by robots.txt?
-
@aspenfasteners thanks for interesting question.
to summarize my understanding:- you have ~300K individual product pages, many of them are duplicates; eg. a single product can have multiple characteristics (eg. size or quantity) but the pages are essentially the same.
- your goal is to index 200 product categories that contain a collection of these products, and remove the low-quality duplicate individual pages from Google index in the long run.
- my assumption is that these 300K product pages have been historically accumulating some backlinks, which is one of the reasons why they are indexed.
If I am right about the 1 and 2, then you should not block these individual product pages, but rather add canonical URLs to them, which should point to the respective category page that you want to get indexed.
Once you have these canonicals implemented, you should wait for a few months or more for Google to pass the link equity to your 200 product category pages, and once it is done, you are free to block them from indexing on robots.txt + meta tag on the page itself, and maybe even x-robots-tag. The way how to block them - it is a different discussion. Let me know if you want to learn more on the best approach.
So, here is my checklist for this URL migration:
- add canonicals pointing from product pages to category pages.
- make sure that all category pages are well interlinked between each other, and the individual product pages are linked to several category pages (eg. a product A should be linked to category A, and also to similar categories B & C). As a rule of thumb, make sure that each category page has at least 10 incoming links from other category pages.
- Make sure that all these category pages are linked from your homepage
- Make sure that sitemap contains only self-canonicalized pages.
- Make sure that these category pages have good core web vitals metrics, compared to your competitors on SERP.
- In 2-3 months, when you see that Google indexes the category pages, and crawling of product pages have been reduced significantly, and the ranks of the category pages have gone up, it is ok to block these 300K pages from crawling.
As to manually submitting the categories by hand, I doubt it will help, especially if the product pages have a lot of backlinks. I've seen many cases when Google disregards the robots.txt directives if a page has good backlinks and traffic.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Link juice through URL parameters
Hi guys, hope you had a fantastic bank holiday weekend. Quick question re URL parameters, I understand that links which pass through an affiliate URL parameter aren't taken into consideration when passing link juice through one site to another. However, when a link contains a tracking URL parameter (let's say gclid=), does link juice get passed through? We have a number of external links pointing to our main site, however, they are linking directly to a unique tracking parameter. I'm just curious to know about this. Thanks, Brett
Intermediate & Advanced SEO | | Brett-S0 -
Inactive Products - Inactive URLs
Hi, In our website www.viatrading.com we have many products that might be in stock or not depending on availability. Until now, when a product was not available anymore, we took this page down (and redirected to its product category page). And, only if the product was available again, we re-activated the URL - this might be days, months or even years later. To make this more SEO-friendly, we decided now that while a product is not available, instead or deactivating/redirecting the page, we will leave it online and just add a message saying "This product is currently not available". If we do this, we will automatically re-activate about 500 products pages at once. 1. Just to make sure, is it harmful for SEO to keep activating/deactivating URLs this way? 2. Since most of these pages have been deindexed for a long time due to being redirected - have they lost all their SEO juice? 3. How can we better activate these old 500 pages - is it ok activating them all at once? Thank you,
Intermediate & Advanced SEO | | viatrading11 -
Replace dynamic paramenter URLs with static Landing Page URL - faceted navigation
Hi there, got a quick question regarding faceted navigation. If a specific filter (facet) seems to be quite popular for visitors. Does it make sense to replace a dynamic URL e.x http://www.domain.com/pants.html?a_type=239 by a static, more SEO friendly URL e.x http://www.domain.com/pants/levis-pants.html by creating a proper landing page for it. I know, that it is nearly impossible to replace all variations of this parameter URLs by static ones but does it generally make sense to do this for the most popular facets choose by visitors. Or does this cause any issues? Any help is much appreciated. Thanks a lot in advance
Intermediate & Advanced SEO | | ennovators0 -
Dilemma about "images" folder in robots.txt
Hi, Hope you're doing well. I am sure, you guys must be aware that Google has updated their webmaster technical guidelines saying that users should allow access to their css files and java-scripts file if it's possible. Used to be that Google would render the web pages only text based. Now it claims that it can read the css and java-scripts. According to their own terms, not allowing access to the css files can result in sub-optimal rankings. "Disallowing crawling of Javascript or CSS files in your site’s robots.txt directly harms how well our algorithms render and index your content and can result in suboptimal rankings."http://googlewebmastercentral.blogspot.com/2014/10/updating-our-technical-webmaster.htmlWe have allowed access to our CSS files. and Google bot, is seeing our webapges more like a normal user would do. (tested it in GWT)Anyhow, this is my dilemma. I am sure lot of other users might be facing the same situation. Like any other e commerce companies/websites.. we have lot of images. Used to be that our css files were inside our images folder, so I have allowed access to that. Here's the robots.txt --> http://www.modbargains.com/robots.txtRight now we are blocking images folder, as it is very huge, very heavy, and some of the images are very high res. The reason we are blocking that is because we feel that Google bot might spend almost all of its time trying to crawl that "images" folder only, that it might not have enough time to crawl other important pages. Not to mention, a very heavy server load on Google's and ours. we do have good high quality original pictures. We feel that we are losing potential rankings since we are blocking images. I was thinking to allow ONLY google-image bot, access to it. But I still feel that google might spend lot of time doing that. **I was wondering if Google makes a decision saying, hey let me spend 10 minutes for google image bot, and let me spend 20 minutes for google-mobile bot etc.. or something like that.. , or does it have separate "time spending" allocations for all of it's bot types. I want to unblock the images folder, for now only the google image bot, but at the same time, I fear that it might drastically hamper indexing of our important pages, as I mentioned before, because of having tons & tons of images, and Google spending enough time already just to crawl that folder.**Any advice? recommendations? suggestions? technical guidance? Plan of action? Pretty sure I answered my own question, but I need a confirmation from an Expert, if I am right, saying that allow only Google image access to my images folder. Sincerely,Shaleen Shah
Intermediate & Advanced SEO | | Modbargains1 -
If I own a .com url and also have the same url with .net, .info, .org, will I want to point them to the .com IP address?
I have a domain, for example, mydomain.com and I purchased mydomain.net, mydomain.info, and mydomain.org. Should I point the host @ to the IP where the .com is hosted in wpengine? I am not doing anything with the .org, .info, .net domains. I simply purchased them to prevent competitors from buying the domains.
Intermediate & Advanced SEO | | djlittman0 -
Should comments and feeds be disallowed in robots.txt?
Hi My robots file is currently set up as listed below. From an SEO point of view is it good to disallow feeds, rss and comments? I feel allowing comments would be a good thing because it's new content that may rank in the search engines as the comments left on my blog often refer to questions or companies folks are searching for more information on. And the comments are added regularly. What's your take? I'm also concerned about the /page being blocked. Not sure how that benefits my blog from an SEO point of view as well. Look forward to your feedback. Thanks. Eddy User-agent: Googlebot Crawl-delay: 10 Allow: /* User-agent: * Crawl-delay: 10 Disallow: /wp- Disallow: /feed/ Disallow: /trackback/ Disallow: /rss/ Disallow: /comments/feed/ Disallow: /page/ Disallow: /date/ Disallow: /comments/ # Allow Everything Allow: /*
Intermediate & Advanced SEO | | workathomecareers0 -
URL Error or Penguin Penalty?
I am currently having a major panic as our website www.uksoccershop.com has been largely dropped from Google. We have not made any changes recently and I am not sure why this is happening, but having heard all sorts of horror stories of penguin update, I am fearing the worst. If you google "uksoccershop" you will see that the homepage does not rank. We previously ranked in the top 3 for "football shirts" but now we don't, although on page 2, 3 and 4 you will see one of our category pages ranking (this didn't used to happen). Some rankings are intact, but many have disappeared completely and in some cases been replaced by other pages on our site. I should point out our existing rankings have been consistently there for 5-6 years until today. I logged into webmaster tools and thankfully there is no warning message from Google about spam, etc, but what we do have is 35,000 URL errors for pages which are accessible. An example of this is: | URL: | http://www.uksoccershop.com/categories/5_295_327.html | | Error details In Sitemaps Linked from Last crawled: 6/20/12First detected: 6/15/12Googlebot couldn't access the contents of this URL because the server had an internal error when trying to process the request. These errors tend to be with the server itself, not with the request. Is it possible this is the cause of the issue (we are not currently sure why the URL's are being blocked) and if so, how severe is it and how recoverable?If that is unlikely to cause the issue, what would you recommend our next move is?All help is REALLY REALLY appreciated 🙂
Intermediate & Advanced SEO | | ukss19840 -
Removing dashes in our URLs?
Hi Forum, Our site has an errant product review module that is resulting in about 9-10 404 errors per day on Google Webmaster Tools. We've found that by changing our product page URLs to only include 2 dashes, the module stops causing 404 errors for that page. Does changing our URL from "oursite.com/girls-pink-yoga-capri.html" to "oursite.com/girlspink-yoga-capri.html" hurt our SEO for a search for "girls pink yoga capri"? If so, by how much (assuming everthing else on the page is optimized properly) Thanks for your input.
Intermediate & Advanced SEO | | pano0