Total Indexed 1.5M vs 83k submitted by sitemap. What?
-
We recently took a good look at the sitemap for one of our content sites and tried to cut out a lot of crap that had gotten in there, such as .php, .xml, and .htm versions of each page. We also cut out images so we could put them in a separate image sitemap.
The sitemap now contains 83,000+ URLs for Google to crawl (it was partially generated with the Yoast WordPress plugin).
In Webmaster Tools, the Index Status section shows that this site has a total of 1.5 million pages indexed.
With our sitemap coming back with 83k and Google indexing 1.5 million pages, is this a sign of a CMS gone rogue? Is it an indication that we could be pumping out error pages, empty templates, or junk pages that we're cramming into Google's bot?
I would love to hear what you guys think. Is this normal? Is this something to be concerned about? Should our total index more closely match our sitemap page count?
-
As well as the parameters already mentioned, you may have heaps of duplicated categories, tags, etc. I would also start searching Google with something like site:www.example.com/directory/ or site:www.example.com/category/directory/directory/ so you narrow the results down tightly, then switch to 100 results per page and look manually for clues.
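To decide which directories are worth a site: search, it can help to see how your known URLs are spread across the site first. This is only a rough sketch: `count_by_directory` and `site_query` are hypothetical helper names, and the URL list is assumed to come from your own sitemap or a crawl export.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_by_directory(urls, depth=1):
    """Count URLs by their first `depth` path segments."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        # URLs with no path segments are counted under the site root.
        prefix = "/" + "/".join(segments[:depth]) + "/" if segments else "/"
        counts[prefix] += 1
    return counts


def site_query(host, prefix):
    """Build the site: search string for one directory prefix."""
    return f"site:{host}{prefix}"


if __name__ == "__main__":
    # Placeholder URLs; substitute the list from your own sitemap or crawl.
    sample = [
        "http://www.example.com/category/widgets/",
        "http://www.example.com/category/gadgets/page/2/",
        "http://www.example.com/tag/blue/",
    ]
    for prefix, count in count_by_directory(sample).most_common():
        print(count, site_query("www.example.com", prefix))
```

If Google's count for a directory is far larger than the count you get from your own list, that directory is a good place to start looking manually.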
-
If you have 1.5 million pages and you think your sitemap is comprehensive at 83,000 then yes, your CMS is needlessly generating pages. It's usually not a big deal from a ranking standpoint, but it can make other important issues hard to detect. I would clean it up, but that's a business call you'll have to make.
The first step is diagnosing where the URLs are coming from. What you do next will depend on that, but I will give you the best advice I can without knowing what types of extraneous URLs you have and how Google is treating them:
First, I'd start with WMT > Crawl > URL Parameters. Quite often your CMS will generate URLs with parameters, and Google usually knows how to handle them. If there are a lot of URL parameters, Google them and see if the resulting pages are exactly the same as other pages. If they are, make sure you have canonical tags in place to point them to the main version. There's more you can do with parameters, but it'll depend on what you find, so I won't go into more detail. As a general rule, though, a CMS should not generate a page unless it is uniquely useful as a differentiated landing page or as a page for people to link to.
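As a rough illustration of the parameter check, the sketch below groups URLs that differ only by their query string and prints the canonical tag each variant would carry. It assumes that no parameter changes the page content, which you would need to confirm by looking at the pages; the function names are made up for this example.

```python
from collections import defaultdict
from html import escape
from urllib.parse import urlsplit, urlunsplit


def strip_parameters(url):
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def group_parameter_variants(urls):
    """Map each parameter-free URL to the variants that collapse onto it."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_parameters(url)].append(url)
    # Keep only the base URLs that have more than one variant.
    return {base: found for base, found in groups.items() if len(found) > 1}


def canonical_tag(url):
    """Build the canonical link element pointing at the parameter-free URL."""
    return f'<link rel="canonical" href="{escape(strip_parameters(url))}">'


if __name__ == "__main__":
    # Placeholder URLs; substitute a list from your own crawl.
    sample = [
        "http://www.example.com/shoes/",
        "http://www.example.com/shoes/?sort=price",
        "http://www.example.com/hats/",
    ]
    for base, found in group_parameter_variants(sample).items():
        print(base, len(found), canonical_tag(found[0]))
```

If a parameter does change what the page shows (pagination, for example), stripping it is the wrong canonical target, so treat the output as a list of candidates to review.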
Also check for parameters in your analytics program. They could actually be messing up your pageview data, depending on how you report. There's a post on fixing that in GA here:
http://blog.crazyegg.com/2013/03/29/remove-url-parameters-from-google-analytics-reports/
Next I'd look at the "Advanced" tab in WMT > Google Index > Index Status. Are there a lot of URLs removed? If so, check on these pages and see why they're removed and why they exist.
I would also run a crawl with Xenu and Screaming Frog to make sure crawlers are finding a reasonable number of pages and that they're not getting stuck in crawl loops (crawling variations of a page endlessly). These kinds of issues can prevent new pages from being indexed on time because Google is wasting time (your crawl budget) running in circles.
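Once you have a list of crawled URLs, a crude filter like the one below can surface likely crawl loops. It is a heuristic sketch, not a feature of either crawler: the function name and both thresholds are assumptions you would tune for your own site.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_crawl_loop(url, max_repeats=2, max_depth=10):
    """Flag URLs whose path is very deep or repeats the same segment."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    most_common = Counter(segments).most_common(1)
    # A segment appearing more than max_repeats times suggests a loop.
    return bool(most_common) and most_common[0][1] > max_repeats


if __name__ == "__main__":
    # Placeholder URLs; substitute the list from your own crawl.
    sample = [
        "http://www.example.com/blog/2014/post/",
        "http://www.example.com/blog/page/blog/page/blog/page/",
    ]
    for url in sample:
        print(looks_like_crawl_loop(url), url)
```

A flagged URL is only a candidate; some sites legitimately repeat a segment, so check a few by hand before changing anything.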
-
Rob,
Your sitemap is only an indication to Google of the URLs on your domain. The sitemap does not limit Google to crawling or indexing only the URLs listed in it, nor is it a directive that tells Google to remove URLs it has already crawled from the index. As stated in GWT, use **robots.txt** to specify how search engines should crawl your site, or request **removal** of URLs from Google's search results with the URL removal tool in Google Webmaster Tools under the "Google Index" link.
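Before you publish robots.txt rules, you can check which of your URLs they would block. Here is a minimal sketch using Python's standard `urllib.robotparser`; the rules and URLs are placeholders, and `blocked_urls` is a hypothetical helper. Note that this parser does simple prefix matching and does not understand the wildcard patterns Google supports, so treat it as an approximation.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    # Example rules only; write your own based on what you find.
    rules = "User-agent: *\nDisallow: /tag/\nDisallow: /backup/\n"
    sample = [
        "http://www.example.com/tag/blue/",
        "http://www.example.com/widgets/",
    ]
    for url in blocked_urls(rules, sample):
        print("blocked:", url)
```

Running your sitemap URLs through the same check is a quick way to make sure the rules do not block pages you want crawled.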