Indexed pages and current pages - Big difference?
-
Our website shows ~22k pages in the sitemap, but ~56k are showing as indexed on Google through the "site:" command. Firstly, how much attention should we be paying to the discrepancy? If we should be worried, what's the best way to find the cause of the difference?
The domain canonical is set, so I can't really figure out whether we've got a problem or not.
-
Hi Nathan,
The delta between the number of pages returned by the site: operator and the number of pages in your sitemap could be down to a number of issues:
- Your XML sitemap may represent only a percentage of the total number of valid content URLs that your site is capable of generating.
  a) Often sites will only generate XML sitemaps for URLs that someone has decided are "important", when the total number of URLs is much larger.
- Your XML sitemap contains ALL the valid content URLs that your site is capable of generating, but search engines are somehow finding more URLs.
  a) Look in Google Webmaster Tools under Optimization >> HTML improvements >> Duplicate title tags.
    i) Do the pages with duplicate titles have duplicate page content? If so, your publishing platform is allowing multiple URLs to render the same content, which is a bug that needs to be fixed.
  b) Run a crawler like Xenu's Link Sleuth or Screaming Frog against your site, and see how many URLs it discovers. Export the results to Excel and look for weird URLs.
    i) The usual culprits for duplicate content include incorrect canonicalization (www vs non-www, URLs ending in /index.html vs just /, etc).
    ii) Look for URLs ending with strange query strings (affiliate tracking, session IDs, etc).
  c) Use the site: operator in other engines (Bing, blekko, etc) and compare the numbers they return. Especially if this number is larger than the number Google is returning, start looking for weird URL patterns.
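If the crawler export is too large to eyeball in Excel, the "look for weird URLs" step can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the function names and the example.com URLs are made up for illustration, and it assumes that www vs non-www, a trailing /index.html, and any query string never change the page content, which will not hold for every site (paginated or filtered listings, for instance).

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize(url):
    """Collapse common duplicate-URL variants into one comparable key."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # treat www and non-www as the same host
    path = parts.path or "/"
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]  # /foo/index.html -> /foo/
    # Assumption: query strings (session IDs, tracking codes) do not change content,
    # so they are dropped along with the scheme and fragment.
    return host + path


def group_variants(urls):
    """Return only the keys that more than one distinct URL maps to."""
    groups = defaultdict(set)
    for url in urls:
        groups[normalize(url)].add(url.strip())
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}


if __name__ == "__main__":
    crawled = [
        "http://www.example.com/",
        "http://example.com/index.html",
        "http://example.com/?sessionid=123",
        "http://example.com/about/",
    ]
    for key, variants in group_variants(crawled).items():
        print(key, "->", variants)
```

Any key that comes back with several variants is a candidate for the duplicate content described above; the larger groups are usually the quickest way to spot the pattern responsible.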
Also, I'm not sure what you mean by "the domain canonical is set". If you're referring to use of the canonical link element on every URL, there are plenty of ways that can go wrong. For example, if your CMS requires that each published URL have rel="canonical", but allows URLs to be published both with and without the trailing /index.html, you can end up with a canonical link element on the non-canonical version of the URL, further confusing the engines. Something to look into.
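One way to check the canonical link element on a saved copy of a page is sketched below, using only Python's standard library. The class and function names and the example.com URLs are hypothetical, and the comparison is a plain string match, so it assumes the href is written as an absolute URL.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonicals.append(attr["href"])


def canonical_status(page_url, html):
    """Classify how the page's canonical link element relates to its own URL."""
    finder = CanonicalFinder()
    finder.feed(html)
    found = finder.canonicals
    if not found:
        return "missing"
    if len(set(found)) > 1:
        return "conflicting"  # more than one distinct canonical on the same page
    return "self" if found[0] == page_url else "points elsewhere"


if __name__ == "__main__":
    html = '<head><link rel="canonical" href="http://example.com/widgets/index.html"></head>'
    print(canonical_status("http://example.com/widgets/index.html", html))
    print(canonical_status("http://example.com/widgets/", html))
```

If both /widgets/ and /widgets/index.html come back as "self", each version is declaring itself canonical, which is the situation described above.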
-
You might have a duplicate content issue. You will want to check that you have the proper 301 redirect and a canonical link element in the head of your pages. If you don't have these set properly, the search engines will see the www and non-www versions of your site as duplicates. Also remember that search engines by default place a trailing slash (/) at the end of the URL.
Here are two links that can help if this is the issue.
http://www.webconfs.com/how-to-redirect-a-webpage.php/
http://www.mattcutts.com/blog/rel-canonical-html-head/
Hope this helps. Good Luck
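As a rough way to test the www versus non-www redirect mentioned above, the sketch below requests a URL without following redirects and classifies the answer. The function names and example.com URLs are hypothetical, it assumes the server answers HEAD requests and sends an absolute URL in the Location header, and it ignores any query string.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url):
    """Request url without following redirects; return (status, Location header)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        conn.request("HEAD", parts.path or "/")
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def classify(status, location, canonical_url):
    """Judge the response a non-canonical variant gave, relative to the canonical URL."""
    if status in (301, 308) and location == canonical_url:
        return "ok"  # permanent redirect straight to the canonical URL
    if status in (302, 303, 307):
        return "temporary redirect"  # redirects, but not permanently
    if status == 200:
        return "duplicate"  # the variant serves content itself
    return "check manually"


if __name__ == "__main__":
    canonical = "http://www.example.com/"
    status, location = fetch_status("http://example.com/")
    print(classify(status, location, canonical))
```

Anything other than "ok" for the non-canonical host is worth a closer look.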
-
Yes, this is a potentially significant problem. The easiest way to troubleshoot is to run the 'site:' command again and go to the last page of results. You should see pages there that aren't in your sitemap. Those are very likely duplicate content.
If you are having a rough time troubleshooting, post a link and I'll be glad to take a peek.
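If you collect the URLs that 'site:' returns into a list by hand, comparing them against the sitemap can be sketched as below. The function names and example.com URLs are hypothetical, and it assumes a standard XML sitemap (not a sitemap index file) that uses the sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> values found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text}


def not_in_sitemap(indexed_urls, xml_text):
    """Return the indexed URLs that the sitemap does not list, sorted."""
    known = sitemap_urls(xml_text)
    return sorted(url for url in set(indexed_urls) if url not in known)


if __name__ == "__main__":
    sitemap = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>http://example.com/</loc></url>"
        "<url><loc>http://example.com/about/</loc></url>"
        "</urlset>"
    )
    indexed = [
        "http://example.com/",
        "http://example.com/?sessionid=9",
        "http://example.com/about/",
    ]
    print(not_in_sitemap(indexed, sitemap))
```

The URLs this prints are the ones to inspect for duplicated content.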