Huge google index with un-relevant pages
-
Hi,
i run a site about sport matches, every match has a page and the pages are generated automatically from the DB. pages are not duplicated, but over time some look a little bit similar. after a match finishes it has no internal links or sitemap entry, but it's reachable by direct URL and continues to be on google index. so over time we have more than 100,000 indexed pages.
since past matches have no significance and they're not linked and a match can repeat and it may look like duplicate content....what you suggest us to do:
when a match is finished - not linked, but appears on the index and SERP
-
301 redirect the match Page to the match Category which is a higher hierarchy and is always relevant?
-
use rel=canonical to the match Category
-
do nothing....
*301 redirect will shrink my index status, some say a high index status is good...
*is it safe to 301 redirect 100,000 pages at once - wouldn't it look strange to google?
*would canonical remove the past matches pages from the index?
what do you think?
Thanks,
Assaf.
-
-
In terms of what you've written, blocking a page via robots.txt doesn't remove it from the index. It simply prevents the crawlers from reaching the page. So if you block a page via robots.txt, the page remains in the index, Google just can't go back to the page and see if anything has changed. So if you were to block the page via robots.txt, and add a noindex tag to the page, Google won't be able to see the page with the noindex tag to remove it from the index because it's blocked via robots.txt.
If you moved all of your old content to a different folder, and block that folder via robots.txt, Google won't remove those pages from the index. In order to remove them from the index, you would have to go in to Webmaster Tools and use the URL removal tool to remove that new folder from the index - if they see it's blocked via robots.txt, then and only then they'll remove the content from the index - it has to be blocked via robots.txt first in order to remove the whole folder with the URL removal tool.
I'm not sure though if this would work for the future - if you removed a folder from the index, and then added more content that was indexed previously afterwards, I'm not sure what would happen to that new content moved to that folder. Either way, Google will have to come back and recrawl the page to see that it has moved to the new folder, and then remove it from the index. So either way, the content will only be removed once Google recrawls the old content.
So I still think a better way to remove the content from the index is to add the noindex tag to the old pages. To facilitate the search engines reaching these old pages, I'd make sure there is a way the engines can get to them - make sure there is a path they can take to reach them.
Another good idea I saw on a forum post here a while ago would be to create a sitemap containing all of these old pages you have indexed and want removed. Add the noindex tag to the sitemap - using the Webmaster tools sitemap interface, you'll then be able to monitor the progress of deindexation over time - by checking how many pages on the sitemap/s of the old content are originally indexed as reported by webmaster tools, and then you can see later on how many of those pages are still indexed, this will be a good indicator for you of the progress of the deindexation.
-
Dear Mark,
*i've sent you a private message.
i'm starting to understand i've a much bigger problem.
*my index status contain 120k pages while only 2000 are currently relevant.
your suggestion is - after a match finishes pragmatically add to the page and google will remove it from it's index. it could work for relatively new pages but since very old pages don't have links OR sitemap entry it could take a very long time to clear the index cause they're rarely crawled - if at all.
- more aggressive approach would be to change this site architecture and restrict by robot.txt the folder that holds all the past irrelevant pages.
so if today a match URL is like this: www.domain.com/sport/match/T1vT2
restrict www.domain.com/sport/match/ on robots.txt
and from now on create all new matches on different folder like: www.domain.com/sport/new-match-dir/T1vT2
-
is this a good solution?
-
wouldn't google penalize me for removing a directory with 100k pages?
-
if it's a good approach, how much time it will take for google to clear all those pages from it's index?
I know it's a long one and i'll really appreciate your response.
Thanks a lot,
Assaf.
-
there are a bunch of articles out there, but each case is different - here are a few:
http://www.searchenginejournal.com/the-holy-grail-of-panda-recovery-a-1-year-case-study/45683/
You can contact me via private message here on the forum and I can try to take a more in depth look at your site if you can give me some more detailed info.
-
yes. when the 1st Panda update was rolled out i've lost 50% of the traffic from google and haven't really recovered since.
-
Are you sure you got hit by Panda before we talk about a Panda hit?
-
Thanks Mark!
any good article about how to recover from Panda?
-
Exactly - I'd build a strategy more around promoting pages that will have long lasting value.
If you use the tag noindex, follow, it will continue to spread link juice throughout the site, it's just the individual page with the tag will not be included in the search results and will not be part of the index. In order for the tag to work, they first have to crawl the page and see the tag - so it doesn't happen instantaneously - if they crawl these deeper pages once every few weeks, once a month, or even longer, it may take a while for these pages to be removed from the index.
-
Hi Mark
-
these pages are very important when they are relevant (before the match finished) - they are the source of most of our traffic which come from long tail searches.
-
some of these pages have inbound link and it would be a shame to lose all this juice.
-
would noindex remove the pages from the google index? how much time it would take? wouldn't a huge noindex also look suspicious?
-
by "evergreen pages" - you mean pages that are always relevant like League page / Sport page etc...?
Thanks,
Assaf.
-
-
Hi Assaf,
(I'm not stalking you, I just think you've raised another interesting question)
In terms of index status/size, you don't want to create a massive index of empty/low value pages - this is food for Google's Panda algorithm, and will not be good for your site in the long run. It'll get a Panda smack if it hasn't already.
To remove these pages from the index, instead of doing hundreds of thousands of 301 redirects, which your server won't like either, I'd recommend adding the noindex meta tag to the pages.
I'd put a rule in your cms that after a certain point in time, you noindex those pages. Make sure you also have evergreen pages on your site that can serve as landing pages for the search engines and which won't need to be removed after a short period of time. These are the pages you'll want to focus your outreach and link building efforts on.
Mark
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Pages canonicaled to another appearing before the canonical on google searches
Hello, When I do this google search, this page(amandine roses category) appears before the one it is canonical-ed to(this multi-product version of amandine roses). This happens often with this multi-product template, where they don't rank as well as their category version(that are canonical to the multi-product version). Can someone maybe point us in the right direction on what the issue may be? What can be improved?
Intermediate & Advanced SEO | | globalrose.com0 -
Why would my total number of indexed pages stop increasing?
I have an ecommerce marketplace that has new items added daily. In search consoloe my pages have always gone up almost every week. It hasn't increased in 5 weeks. We haven't made any changes to the site and the sitemap looks good. Any ideas on what I should look for?
Intermediate & Advanced SEO | | EcommerceSite0 -
Why is Google ranking irrelevant / not preferred pages for keywords?
Over the past few months we have been chipping away at duplicate content issues. We know this is our biggest issue and is working against us. However, it is due to this client also owning the competitor site. Therefore, product merchandise and top level categories are highly similar, including a shared server. Our rank is suffering major for this, which we understand. However, as we make changes, and I track and perform test searches, the pages that Google ranks for keywords never seems to match or make sense, at all. For example, I search for "solid scrub tops" and it ranks the "print scrub tops" category. Or the "Men Clearance" page is ranking for keyword "Women Scrub Pants". Or, I will search for a specific brand, and it ranks a completely different brand. Has anyone else seen this behavior with duplicate content issues? Or is it an issue with some other penalty? At this point, our only option is to test something and see what impact it has, but it is difficult to do when keywords do not align with content.
Intermediate & Advanced SEO | | lunavista-comm0 -
"Null" appearing as top keyword in "Content Keywords" under Google index in Google Search Console
Hi, "Null" is appearing as top keyword in Google search console > Google Index > Content Keywords for our site http://goo.gl/cKaQ4K . We do not use "null" as keyword on site. We are not able to find why Google is treating "null" as a keyword for our site. Is anyone facing such issue. Thanks & Regards
Intermediate & Advanced SEO | | vivekrathore0 -
Does Google still don't index Hashtag Links ? No chance to get a Search Result that leads directly to a section of a page? or to one of numeras Hashtag Pages in a single HTML page?
Does Google still don't index Hashtag Links ? No chance to get a Search Result that leads directly to a section of a page? or to one of numeras Hashtag Pages in a single HTML page? If I have 4 or 5 different hashtag link section pages , consolidated into one HTML Page, no chance to get one of the Hashtag Pages to appear as a search result? like, if under one Single Page Travel Guide I have two essential sections: #Attractions #Visa no chance to direct search queries for Visa directly to the Hashtag Link Section of #Visa? Thanks for any help
Intermediate & Advanced SEO | | Muhammad_Jabali0 -
Will I lose traffic from Google for re-directing a page?
I’m currently planning to a retire a discontinued product and put a 301 redirect to a related product (although not identical). The thing is, I’m still getting significant traffic from people searching for the old product by name. Would Google send this traffic to the new pages via the re-direct? Is Google likely to display the new page in place of the old page for similar queries or will it serve other content? I’d like to answer this question so that I can decide between the two following approaches: 1) Retiring the old page immediately and putting a 301 redirect to the new related pages. This will have the advantage of transferring the value of any link signals / referring traffic. Traffic will also land on the new pages directly without having to click through from another page. We would have a dynamic message telling users that the old product had been retired depending on whether they had visited out site before. 2) Keep the old product pages temporarily so that we don’t lose the traffic from the search engines. We would then change the old pages to advise users that the old product was now retired, but that we have other products that might solve their problems. When this organic traffic decreases over time, then we will proceed with the re-direct as above. I am worried though that the old product pages might outrank the new product pages. I’d really appreciate some advice with this. I’ve been reading lots of articles, but it seems like there are different opinions on this. I understand that I will lose between 10% - 15% of page rank as per the Matt Cutts video.
Intermediate & Advanced SEO | | RG_SEO0 -
Google Not Indexing XML Sitemap Images
Hi Mozzers, We are having an issue with our XML sitemap images not being indexed. The site has over 39,000 pages and 17,500 images submitted in GWT. If you take a look at the attached screenshot, 'GWT Images - Not Indexed', you can see that the majority of the pages are being indexed - but none of the images are. The first thing you should know about the images is that they are hosted on a content delivery network (CDN), rather than on the site itself. However, Google advice suggests hosting on a CDN is fine - see second screenshot, 'Google CDN Advice'. That advice says to either (i) ensure the hosting site is verified in GWT or (ii) submit in robots.txt. As we can't verify the hosting site in GWT, we had opted to submit via robots.txt. There are 3 sitemap indexes: 1) http://www.greenplantswap.co.uk/sitemap_index.xml, 2) http://www.greenplantswap.co.uk/sitemap/plant_genera/listings.xml and 3) http://www.greenplantswap.co.uk/sitemap/plant_genera/plants.xml. Each sitemap index is split up into often hundreds or thousands of smaller XML sitemaps. This is necessary due to the size of the site and how we have decided to pull URLs in. Essentially, if we did it another way, it may have involved some of the sitemaps being massive and thus taking upwards of a minute to load. To give you an idea of what is being submitted to Google in one of the sitemaps, please see view-source:http://www.greenplantswap.co.uk/sitemap/plant_genera/4/listings.xml?page=1. Originally, the images were SSL, so we decided to reverted to non-SSL URLs as that was an easy change. But over a week later, that seems to have had no impact. The image URLs are ugly... but should this prevent them from being indexed? The strange thing is that a very small number of images have been indexed - see http://goo.gl/P8GMn. I don't know if this is an anomaly or whether it suggests no issue with how the images have been set up - thus, there may be another issue. Sorry for the long message but I would be extremely grateful for any insight into this. I have tried to offer as much information as I can, however please do let me know if this is not enough. Thank you for taking the time to read and help. Regards, Mark Oz6HzKO rYD3ICZ
Intermediate & Advanced SEO | | edlondon0 -
Google cached pages and search terms
Here's something I noticed. We have a rank A page and it's ranking 10 on Google search results. When I hover my mouse over our search result, Google gives us a preview, but Google also highlights in red where the search keyword is present on the page. Reviewing our page, even though we have it as the h1 header and intro paragraph, Google is highlighting it half way down the page. Any ideas why? I review rank 1 - 5 and Google highlights the keyword on the intro paragraph and h1 header Have you guys experienced anything like this? It makes me think..Google could be crawling my site and thinking I haven't got it in the h1 or intro paragraph etc.. Thoughts?
Intermediate & Advanced SEO | | Bio-RadAbs0