Total Indexed 1.5M vs 83k submitted by sitemap. What?
-
We recently took a good look at one of our content site's sitemap and tried to cut out a lot of crap that had gotten in there such as .php, .xml, .htm versions of each page. We also cut out images to put in a separate image sitemap.
The sitemap generated 83,000+ URLs for google to crawl (this partially used the Yoast Wordpress plugin to generate)
In webmaster tools in the index status section is showing that this site has a total index of 1.5 million.
With our sitemap coming back with 83k and google indexing 1.5 million pages, is this a sign of a CMS gone rogue? Is it an indication that we could be pumping out error pages or empty templates, or junk pages that we're cramming into Google's bot?
I would love to hear what you guys think. Is this normal? Is this something to be concerned about? Should our total index more closely match our sitemap page count?
-
As well as parameters mentioned you may possibly have heaps of duplicating categories, tags etc. What I would also do is start searching Google with something like site:www.example.com/directory/ or possibly site:www.example.com/category/directory/directory/ so you are tightly narrowing down the results, switch to 100 results per page and manually look for clues.
-
If you have 1.5 million pages and you think your sitemap is comprehensive at 83,000 then yes, your CMS is needlessly generating pages. It's usually not a big deal from a ranking standpoint, but it can make other important issues hard to detect. I would clean it up, but that's a business call you'll have to make.
The first step is diagnosing where are the URLs are coming from. What you do next will depend, but I will give you the best advice I can without knowing what types of extraneous URLs you have and how Google is treating them:
First, I'd start with WMT > Crawl > URL Parameters. Quite often your CMS will generate URLs, and Google usually knows how to handle them. If there are a lot of URL parameters, Google them and see if they're exactly the same as other pages. If they are, make sure you have canonical tags in place to point them to the main version. There's more you can do with parameters, but it'll depend on what you find so I won't go into more detail. As a general rule, though, a CMS should not generate a page unless it is uniquely useful as differentiated landing page or a page for people to link to.
Also check for parameters in your analytics program. They could actually be messing up your pageview data depending on how you report.There's a post on fixing that in GA here:
http://blog.crazyegg.com/2013/03/29/remove-url-parameters-from-google-analytics-reports/
Next I'd look at the "Advanced" tab in WMT > Google Index > Index Status . Are there a lot of URLs removed? If so, check on these pages and see why they're removed and why they exist.
I would also run a crawl with Xenu and Screaming Frog to make sure crawlers are finding a reasonable number of pages and that they're not getting stuck in crawl loops. (crawling variations of a page endlessly). These kinds of issues can prevent new pages from being indexed on time because Google is wasting time (your crawl budget) running in circles.
-
Rob,
Your sitemap is but an indication to Google about urls on your domain. The sitemap does not limit google to crawling or indexing only the urls listed on it, nor is it a directive that tells google to remove urls from the index that it has already crawled. As stated in GWT, use **robots.txt **to specify how search engines should crawl your site, or request **removal **of URLs from Google's search results with the URL removal tool Google webmaster tools under the "google index" link.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Google Indexing Stopped
Hello Team, A month ago, Google was indexing more than 2,35,000 pages, now has reduced to 11K. I have cross-checked almost everything including content, backlinks and schemas. Everything is looking fine, except the server response time, being a heavy website, or may be due to server issues, the website has an average loading time of 4 secs. Also, I would like to mention that I have been using same server since I have started working on the website, and as said above a month ago the indexing rate was more than 2.3 M, now reduced to 11K. nothing changed. As I have tried my level best on doing research for the same, so please if you had any such experiences, do share your valuable solutions to this problem.
Intermediate & Advanced SEO | | jeffreyjohnson0 -
Yoast XML Sitemap Taxonomies
Hey all, Quick question about taxonomies and sitemap settings. On our e-commerce site, we noindex product tags and post tags. Under the "Taxonomies" settings in Yoast I'm seeing taxonomies such as Product Color (pa_color). Would it be wise to remove such taxonomies from our sitemap if we already include product colors, attributes, etc. in our page titles and product descriptions? Thanks in advance, Andrew
Intermediate & Advanced SEO | | mostcg0 -
How Submit to different Countries
Hey There, One of My client Needs To rank his site in multiple Location , like He is in Australia But wants To Promote his Site in Canada, india, USA,So What are the things i can do For Rank in Others Country. Please any Expert Help. Thanx
Intermediate & Advanced SEO | | nupuriepl0 -
XML Sitemaps - how to create the perfect XML Sitemap
Hello, We have a site that is not updated very often - currently we have a script running to create/update the XML sitemap every time a page is added/edited or deleted. I have a few questions about best practices for creating XML sitemaps. 1. If the site is not updated for months on end - is it a bad idea to force the script to update i.e. changing the dates once a month? Will google noticed nothing has changed just the date i.e. all the content on the site is exactly the same. Will they start penalising you for updating an XML sitemap when there is nothing new about the website?
Intermediate & Advanced SEO | | JohnW-UK
2. Is it worth automating the XML file to link into Bing/Google to update via webmaster tools - as I say even if the site is never updated?
3. Is the use of "priorities" necessary?
4. The changefreq - does that mean Google/Bing expects to see a new file ever month?
5. The ordering of the pages - the script seems pretty random and put the pages in a random order - should we make it order the pages with the most important ones first? Should the home page always be first?
6. Below is a sample of how our XML sitemap appears - is there anything that we should change? i.e. all marked up properly? This XML file does not appear to have any style information associated with it. The document tree is shown below.
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><url><loc>http://www.domain.com</loc>
<lastmod>2013-11-06</lastmod>
<changefreq>monthly</changefreq></url>
<url><loc>http://www.domain.com/contact/</loc>
<lastmod>2013-11-06</lastmod>
<changefreq>monthly</changefreq></url>
<url><loc>http://www.domain.com/sitemap/</loc>
<lastmod>2013-11-06</lastmod>
<changefreq>monthly</changefreq></url></urlset> Hope someone can help enlighten us to best practices0 -
Is this link being indexed?
link text Deadline: Monday, Sep 30, 2013 link text I appreciate the help guys!
Intermediate & Advanced SEO | | jameswalkerson0 -
To index search results or not?
In its webmaster guidelines, Google says not to index search results " that don't add much value for users coming from search engines." I've noticed several big brands index search results, and am wondering if it is generally OK to index search results with high engagement metrics (high PVPV, time on site, etc). We have an database of content, and it seems one of the best ways to get this content in search engines would be to allow indexing of search results (to capture the long tail) rather than build thousands of static URLs. Have any smaller brands had success with allowing indexing of search results? Any best practices or recommendations?
Intermediate & Advanced SEO | | nicole.healthline0 -
Sitemaps and subdomains
At the beginning of our life-cycle, we were just a wordpress blog. However, we just launched a product created in Ruby. Because we did not have time to put together an open source Ruby CMS platform, we left the blog in wordpress and app in rails. Thus our web app is at http://www.thesquarefoot.com and our blog is at http://blog.thesquarefoot.com. We did re-directs such that if the URL does not exist at www.thesquarefoot.com it automatically forwards to blog.thesquarefoot.com. What is the best way to handle sitemaps? Create one for blog.thesquarefoot.com and for http://www.thesquarefoot.com and submit them separately? We had landing pages like http://www.thesquarefoot.com/houston in wordpress, which ranked well for Find Houston commercial real estate, which have been replaced with a landing page in Ruby, so that URL works well. The url that was ranking well for this word is now at blog.thesquarefoot.com/houston/? Should i delete this page? I am worried if i do, we will lose ranking, since that was the actual page ranking, not the new one. Until we are able to create an open source Ruby CMS and move everything over to a sub-directory and have everything live in one place, I would love any advice on how to mitigate damage and not confuse Google. Thanks
Intermediate & Advanced SEO | | TheSquareFoot0 -
Where to link to HTML Sitemap?
After searching this morning and finding unclear answers I decided to ask my SEOmoz friends a few questions. Should you have an HTML sitemap? If so, where should you link to the HTML sitemap from? Should you use a noindex, follow tag? Thank you
Intermediate & Advanced SEO | | cprodigy290