Huge Google index with irrelevant pages
-
Hi,
I run a site about sports matches. Every match has its own page, and the pages are generated automatically from the DB. The pages aren't duplicates, but over time some of them end up looking fairly similar. After a match finishes, its page has no internal links or sitemap entry, but it's still reachable by direct URL and stays in Google's index. So over time we've accumulated more than 100,000 indexed pages.
Since past matches have no significance, they aren't linked, and a recurring match-up can end up looking like duplicate content, what would you suggest we do once a match is finished (no longer linked, but still in the index and appearing in the SERPs)?
-
301 redirect the match page to its match category page, which sits higher in the hierarchy and is always relevant?
-
Use rel=canonical pointing to the match category?
-
Do nothing?
*A 301 redirect will shrink my index status, and some say a high index status is good...
*Is it safe to 301 redirect 100,000 pages at once? Wouldn't that look strange to Google?
*Would rel=canonical remove the past match pages from the index?
What do you think?
Thanks,
Assaf.
-
-
In terms of what you've written, blocking a page via robots.txt doesn't remove it from the index; it simply prevents the crawlers from reaching the page. If you block a page via robots.txt, the page remains in the index - Google just can't go back to the page to see whether anything has changed. So if you were to block the page via robots.txt and also add a noindex tag to it, Google wouldn't be able to see the noindex tag and remove the page from the index, because the page is blocked via robots.txt.
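To illustrate the conflict (the folder name here is purely hypothetical): if robots.txt contains

User-agent: *
Disallow: /old-matches/

and the pages under /old-matches/ carry

<meta name="robots" content="noindex">

then Google never re-fetches those pages, so it never sees the noindex tag, and the URLs simply stay in the index.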
If you moved all of your old content to a different folder and blocked that folder via robots.txt, Google wouldn't remove those pages from the index on its own. To remove them, you would have to go into Webmaster Tools and use the URL removal tool on that new folder - if Google sees the folder is blocked via robots.txt, then and only then will it remove the content from the index. The folder has to be blocked via robots.txt first in order to remove the whole folder with the URL removal tool.
I'm not sure, though, whether this would keep working in the future - if you removed a folder from the index and later moved more previously indexed content into that folder, I'm not sure what would happen to it. Either way, Google has to come back and recrawl a page to see that it has moved to the new folder before removing it from the index, so the content will only drop out once Google recrawls the old URLs.
So I still think a better way to remove the content from the index is to add the noindex tag to the old pages. To help the search engines reach those old pages, make sure there is still a path they can follow to get to them.
Another good idea I saw in a forum post here a while ago: create a sitemap containing all of the old pages that are indexed and that you want removed. Put the noindexed pages into that sitemap and submit it through the Webmaster Tools sitemap interface - you'll then be able to monitor the progress of deindexation over time. Webmaster Tools reports how many of a sitemap's URLs are indexed, so you can check back later to see how many of those old pages are still in the index; that's a good indicator of how the deindexation is progressing.
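As a rough sketch (these URLs are made up for illustration, based on your example structure), a separate sitemap file listing only the old, noindexed match pages might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.domain.com/sport/match/T1vT2</loc></url>
  <url><loc>http://www.domain.com/sport/match/T3vT4</loc></url>
</urlset>

Submit it in Webmaster Tools as its own sitemap and watch its indexed-URL count fall as the pages drop out.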
-
Dear Mark,
*I've sent you a private message.
I'm starting to understand I have a much bigger problem.
*My index status contains 120k pages, while only 2,000 are currently relevant.
Your suggestion is that after a match finishes I programmatically add a noindex tag to the page and Google will remove it from its index. That could work for relatively new pages, but very old pages have no links OR sitemap entry, so it could take a very long time to clear the index because they're rarely crawled - if at all.
- A more aggressive approach would be to change the site architecture and restrict, via robots.txt, the folder that holds all the past, irrelevant pages.
So if today a match URL looks like this: www.domain.com/sport/match/T1vT2
restrict www.domain.com/sport/match/ in robots.txt,
and from now on create all new matches in a different folder, like: www.domain.com/sport/new-match-dir/T1vT2
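(For reference, a minimal robots.txt for that setup - just a sketch based on the paths above - would be:

User-agent: *
Disallow: /sport/match/

leaving /sport/new-match-dir/ open to crawling.)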
-
Is this a good solution?
-
Wouldn't Google penalize me for removing a directory with 100k pages?
-
If it's a good approach, how long will it take Google to clear all those pages from its index?
I know it's a long one, and I'll really appreciate your response.
Thanks a lot,
Assaf.
-
There are a bunch of articles out there, but each case is different - here are a few:
http://www.searchenginejournal.com/the-holy-grail-of-panda-recovery-a-1-year-case-study/45683/
You can contact me via private message here on the forum, and I can try to take a more in-depth look at your site if you can give me some more detailed info.
-
Yes. When the first Panda update rolled out, I lost 50% of my traffic from Google and haven't really recovered since.
-
Before we talk about recovering from a Panda hit - are you sure you actually got hit by Panda?
-
Thanks, Mark!
Any good articles on how to recover from Panda?
-
Exactly - I'd build a strategy more around promoting pages that will have long-lasting value.
If you use a noindex, follow tag, the page will continue to pass link juice throughout the site; it's just that the individual page carrying the tag won't be included in the search results and won't be part of the index. For the tag to work, Google first has to crawl the page and see the tag - so it doesn't happen instantaneously. If Google only crawls these deeper pages once every few weeks, once a month, or even less often, it may take a while for them to be removed from the index.
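For clarity, that's the standard robots meta tag placed in the page's head section - a generic example, not specific to any platform:

<meta name="robots" content="noindex, follow">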
-
Hi Mark,
-
These pages are very important while they're relevant (before the match finishes) - they're the source of most of our traffic, which comes from long-tail searches.
-
Some of these pages have inbound links, and it would be a shame to lose all that juice.
-
Would noindex remove the pages from the Google index? How long would it take? Wouldn't noindexing a huge number of pages also look suspicious?
-
by "evergreen pages" - you mean pages that are always relevant like League page / Sport page etc...?
Thanks,
Assaf.
-
-
Hi Assaf,
(I'm not stalking you, I just think you've raised another interesting question)
In terms of index status/size, you don't want to create a massive index of empty, low-value pages - this is food for Google's Panda algorithm and will not be good for your site in the long run. It'll get a Panda smack if it hasn't already.
To remove these pages from the index, instead of doing hundreds of thousands of 301 redirects, which your server won't like either, I'd recommend adding the noindex meta tag to the pages.
I'd put a rule in your CMS that noindexes those pages after a certain point in time. Make sure you also have evergreen pages on your site that can serve as landing pages for the search engines and won't need to be removed after a short period of time. Those are the pages you'll want to focus your outreach and link-building efforts on.
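Purely as a sketch of that kind of CMS rule (the function and field names below are made up, not taken from any particular platform), the template logic could look something like this in Python:

from datetime import datetime, timedelta

# Hypothetical rule: once a match has been finished for longer than the
# cutoff, emit a noindex,follow robots tag so the page drops out of the
# index but keeps passing link equity; otherwise leave it indexable.
NOINDEX_AFTER = timedelta(days=7)

def robots_meta_tag(match_end_time, now=None):
    now = now or datetime.utcnow()
    if match_end_time is not None and now - match_end_time > NOINDEX_AFTER:
        return '<meta name="robots" content="noindex, follow">'
    return '<meta name="robots" content="index, follow">'

The cutoff is arbitrary here; the point is simply that the tag gets generated automatically based on when the match finished.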
Mark