Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
De-indexing millions of pages - would this work?
-
Hi all,
We run an e-commerce site with a catalogue of around 5 million products.
Unfortunately, we have let Googlebot crawl and index tens of millions of search URLs, the majority of which are very thin of content or duplicates of other URLs. In short: we are in deep. Our bloated Google-index is hampering our real content to rank; Googlebot does not bother crawling our real content (product pages specifically) and hammers the life out of our servers.
Since having Googlebot crawl and de-index tens of millions of old URLs would probably take years (?), my plan is this:
- 301 redirect all old SERP URLs to a new SERP URL.
- If new URL should not be indexed, add meta robots noindex tag on new URL.
- When it is evident that Google has indexed most "high quality" new URLs, robots.txt disallow crawling of old SERP URLs. Then directory style remove all old SERP URLs in GWT URL Removal Tool
- This would be an example of an old URL:
www.site.com/cgi-bin/weirdapplicationname.cgi?word=bmw&what=1.2&how=2 - This would be an example of a new URL:
www.site.com/search?q=bmw&category=cars&color=blue
I have to specific questions:
- Would Google both de-index the old URL and not index the new URL after 301 redirecting the old URL to the new URL (which is noindexed) as described in point 2 above?
- What risks are associated with removing tens of millions of URLs directory style in GWT URL Removal Tool? I have done this before but then I removed "only" some useless 50 000 "add to cart"-URLs.Google says themselves that you should not remove duplicate/thin content this way and that using this tool tools this way "may cause problems for your site".
And yes, these tens of millions of SERP URLs is a result of a faceted navigation/search function let loose all to long.
And no, we cannot wait for Googlebot to crawl all these millions of URLs in order to discover the 301. By then we would be out of business.Best regards,
TalkInThePark -
Thanks a lot, Tom. Time will tell...
Just one last thing:
what damage are you (and Google) thinking of when advising against removing URLs on a large scale through GWMT?Personally, I think Google says so only because they want to keep as much information possible in their index.
-
Thanks for the PM, I can now appreciate the problem a little more.
I think it's something that you should not rush. What you've done seems the best thing you can do for now.
Longer term, I'd look at your CMS options!
-
Yes, I have put a conditional meta robots "noindex" on all pages whose URL contains more than 2 GET elements. It is also present on URLs containing parameters of little or no SEO value (e.g. the "price" parameter).
Regarding the nofollow directive, my plan is to not put it in the head but on the individual links pointing to URLs that should not be indexed. If we happen to get a backlink to one of these noindexed pages, I want the link value to get passed on to listed product pages.
My big worrie is what should I do if this de-indexation process takes forever...
-
If you could put a conditional meta tag in to the source code, that will show the nofollow tag if the URL contains more than 3 GET elements, then that might help?
You seem to have already thought hard about your options, and they sound ok. Let's just wait to see whether any Gurus are about to shout stop!
-
Thanks for answering that quickly, Tom!
We cannot robots.txt disallow all URLs. We get quite a lot of organic traffic to these URLs. In july, organic traffic landing on results pages gave us approximately $85 000 in revenue. Also, what is good to know is that pages resulting from searching and browsing share the same URL - the search phrase is treated as just another filtering parameter in the URL.
Keeping the same URL structure is part of my preferred, 2-step solution:
- Meta Robots "noindex" unwanted results pages (the overwhelming majority)
- When our Google index has shrunken enough, put rel=nofollow on internal links pointing to those results pages in order to prevent bots from crawling them.
I have actually implemented step 1 (as of yesterday). The solution I was describing in my original post is my last resort solution. I wanted to get a professional opinion on that one in order to know if I should rule it out or not.
Unfortunately, I cannot disclose our company name here (I have a feeling our competitors use Seomoz as well :)). But I'll send you some links in a private message.
-
If I were you I'd keep the same URL structure. You're correct in thinking this won't be a quick fix.
First, use the robots.txt to disallow robots access to the search pages.
Don't remove all results just yet from GWT, this will be a long task and might damage your sites performance.
Could you provide some links to your site? I'll have a closer look.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Are image pages considered 'thin' content pages?
I am currently doing a site audit. The total number of pages on the website are around 400... 187 of them are image pages and coming up as 'zero' word count in Screaming Frog report. I needed to know if they will be considered 'thin' content by search engines? Should I include them as an issue? An answer would be most appreciated.
Technical SEO | | MTalhaImtiaz0 -
How to stop google from indexing specific sections of a page?
I'm currently trying to find a way to stop googlebot from indexing specific areas of a page, long ago Yahoo search created this tag class=”robots-nocontent” and I'm trying to see if there is a similar manner for google or if they have adopted the same tag? Any help would be much appreciated.
Technical SEO | | Iamfaramon0 -
Should i index or noindex a contact page
Im wondering if i should noindex the contact page im doing SEO for a website just wondering if by noindexing the contact page would it help SEO or hurt SEO for that website
Technical SEO | | aronwp0 -
Is the Authority of Individual Pages Diluted When You Add New Pages?
I was wondering if the authority of individual pages is diluted when you add new pages (in Google's view). Suppose your site had 100 pages and you added 100 new pages (without getting any new links). Would the average authority of the original pages significantly decrease and result in a drop in search traffic to the original pages? Do you worry that adding more pages will hurt pages that were previously published?
Technical SEO | | Charlessipe0 -
Should I put meta descriptions on pages that are not indexed?
I have multiple pages that I do not want to be indexed (and they are currently not indexed, so that's great). They don't have meta descriptions on them and I'm wondering if it's worth my time to go in and insert them, since they should hypothetically never be shown. Does anyone have any experience with this? Thanks! The reason this is a question is because one member of our team was linking to this page through Facebook to send people to it and noticed random text on the page being pulled in as the description.
Technical SEO | | Viewpoints0 -
How to Stop Google from Indexing Old Pages
We moved from a .php site to a java site on April 10th. It's almost 2 months later and Google continues to crawl old pages that no longer exist (225,430 Not Found Errors to be exact). These pages no longer exist on the site and there are no internal or external links pointing to these pages. Google has crawled the site since the go live, but continues to try and crawl these pages. What are my next steps?
Technical SEO | | rhoadesjohn0 -
How Does Google's "index" find the location of pages in the "page directory" to return?
This is my understanding of how Google's search works, and I am unsure about one thing in specific: Google continuously crawls websites and stores each page it finds (let's call it "page directory") Google's "page directory" is a cache so it isn't the "live" version of the page Google has separate storage called "the index" which contains all the keywords searched. These keywords in "the index" point to the pages in the "page directory" that contain the same keywords. When someone searches a keyword, that keyword is accessed in the "index" and returns all relevant pages in the "page directory" These returned pages are given ranks based on the algorithm The one part I'm unsure of is how Google's "index" knows the location of relevant pages in the "page directory". The keyword entries in the "index" point to the "page directory" somehow. I'm thinking each page has a url in the "page directory", and the entries in the "index" contain these urls. Since Google's "page directory" is a cache, would the urls be the same as the live website (and would the keywords in the "index" point to these urls)? For example if webpage is found at wwww.website.com/page1, would the "page directory" store this page under that url in Google's cache? The reason I want to discuss this is to know the effects of changing a pages url by understanding how the search process works better.
Technical SEO | | reidsteven750 -
How to stop my webmail pages not to be indexed on Google ??
when i did a search in google for Site:mywebsite.com , for a list of pages indexed. Surprisingly the following come up " Webmail - Login " Although this is associated with the domain , this is a completely different server , this the rackspace email server browser interface I am sure that there is nothing on the website that links or points to this.
Technical SEO | | UIPL
So why is Google indexing it ? & how do I get it out of there. I tried in webmaster tool but I could not , as it seems like a sub-domain. Any ideas ? Thanks Naresh Sadasivan0