Indexed pages and current pages - Big difference?

Nathan.Smith

Our website shows ~22k pages in the sitemap, but ~56k are showing as indexed on Google through the "site:" command. Firstly, how much attention should we be paying to the discrepancy? If we should be worried, what's the best way to find the cause of the difference?

The domain canonical is set, so I can't really figure out whether we've got a problem or not.

grasshopper

Hi Nathan,

The delta between the number of pages returned by the site: operator and the number of pages in your sitemap could be down to a number of issues:

1) Your XML sitemap may represent only a percentage of the total number of valid content URLs that your site is capable of generating.

a) Often sites will only generate XML sitemaps for URLs that someone has decided are "important", when the total number of URLs is much larger.

2) Your XML sitemap contains ALL the valid content URLs that your site is capable of generating, but search engines are somehow finding more URLs.

a) Look in Google Webmaster Tools under Optimization >> HTML improvements >> Duplicate title tags

i) Do the pages with duplicate titles have duplicate page content? If so, your publishing platform is allowing multiple URLs to render the same content, which is a bug that needs to be fixed

b) Run a crawler like Xenu's Link Sleuth or Screaming Frog against your site, and see how many URLs it discovers. Export the results to Excel and look for weird URLs.

i) The usual culprits for duplicate content include incorrect canonicalization (www vs non-www, URLs ending in /index.html vs just /, etc)

ii) Look for URLs ending with strange query strings (affiliate tracking, session IDs, etc)

c) Use the site: operator in other engines (Bing, blekko, etc) and compare the numbers they return. If that number is larger than the number Google is returning, start looking for weird URL patterns.
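If you'd rather not eyeball the crawl export in Excel, here is a rough Python sketch of the same idea: collapse each crawled URL to a comparison key (strip www, strip a trailing index.html, drop tracking/session parameters) and list the keys that more than one URL maps to. The parameter names in NOISE_PARAMS and the example.com URLs are assumptions for illustration only; swap in whatever your own crawl turns up.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical tracking/session parameters; adjust to what your site really uses.
NOISE_PARAMS = {"sessionid", "sid", "affid", "utm_source", "utm_medium", "utm_campaign"}


def normalize(url):
    """Collapse common duplicate-URL variants into one comparison key."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]          # www vs non-www
    path = parts.path or "/"
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]  # /index.html vs just /
    # Keep only query parameters that are not known tracking/session noise.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in NOISE_PARAMS]
    query = urlencode(kept)
    return host + path + ("?" + query if query else "")


def find_duplicates(urls):
    """Return {key: [urls]} for every key reached by more than one URL."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    crawled = [
        "http://www.example.com/",
        "http://example.com/index.html",
        "http://example.com/a?sid=1",
        "http://example.com/a",
        "http://example.com/b",
    ]
    for key, members in find_duplicates(crawled).items():
        print(key, members)
```

Any group it prints is a set of URLs that probably render the same content, which is a good place to start looking.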

Also, I'm not sure what you mean by "the domain canonical is set". If you're referring to use of the canonical link element for every URL, there are plenty of ways that can go wrong. E.g., if your CMS requires that each published URL have rel="canonical", but allows URLs to be published with and without the trailing /index.html, you can end up with a canonical link element on the non-canonical version of the URL that points to itself, further confusing engines. Something to look into.
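To spot-check this, something like the Python sketch below could be run over pages you have already downloaded. It only uses the standard library's html.parser; the function name canonical_status and the status labels are made up for this example, and it compares the href to the page URL as a plain string, so relative canonicals would need resolving first.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonicals.append(attr["href"])


def canonical_status(page_url, html):
    """Classify how a page's canonical link element relates to its own URL."""
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.canonicals:
        return "missing"
    if len(set(finder.canonicals)) > 1:
        return "conflicting"
    return "self" if finder.canonicals[0] == page_url else "points elsewhere"


if __name__ == "__main__":
    page = '<html><head><link rel="canonical" href="http://example.com/a/index.html"></head></html>'
    print(canonical_status("http://example.com/a/index.html", page))
    print(canonical_status("http://example.com/a/", page))
```

A "self" result on a URL ending in /index.html is the case described above: the non-canonical version is declaring itself canonical.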

bronxpad

You might have a duplicate content issue. Check that you have a proper 301 redirect in place and a canonical link element in the head of your pages. If these aren't set properly, the search engines will see the www and non-www versions of your site as duplicates. Also remember that search engines by default place a trailing / at the end of the URL.

Here are two links that can help if this is the issue.

http://www.webconfs.com/how-to-redirect-a-webpage.php/

http://www.mattcutts.com/blog/rel-canonical-html-head/

Hope this helps. Good luck.

deltasystems

Yes, this is a potentially significant problem. The easiest way to troubleshoot is to run the 'site:' command again and go to the last page of results. You should see pages there that aren't in your sitemap. Very likely duplicated content.
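If you collect the URLs you find that way into a list, a small Python sketch like this one can tell you which of them are missing from the sitemap. It assumes a standard sitemaps.org XML sitemap already loaded as text; the function names and the example.com URLs are just for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps (sitemaps.org protocol 0.9).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text}


def not_in_sitemap(indexed_urls, xml_text):
    """Return indexed URLs that the sitemap does not list, sorted."""
    known = sitemap_urls(xml_text)
    return sorted(url for url in set(indexed_urls) if url not in known)


if __name__ == "__main__":
    sitemap = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>http://example.com/</loc></url>"
        "<url><loc>http://example.com/a</loc></url>"
        "</urlset>"
    )
    indexed = ["http://example.com/a", "http://example.com/a?sid=1"]
    print(not_in_sitemap(indexed, sitemap))
```

The comparison is an exact string match, so http vs https or a trailing slash will count as a difference, which is usually what you want when hunting duplicates.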

If you are having a rough time troubleshooting, post a link and I'll be glad to take a peek.
