Total Indexed 1.5M vs 83k submitted by sitemap. What?
-
We recently took a good look at the sitemap for one of our content sites and tried to cut out a lot of crap that had gotten in there, such as .php, .xml, and .htm versions of each page. We also cut out images so we could put them in a separate image sitemap.
The sitemap generated 83,000+ URLs for Google to crawl (it was partially generated with the Yoast WordPress plugin).
In Webmaster Tools, the Index Status section shows that this site has a total of 1.5 million indexed pages.
With our sitemap coming back with 83k and Google indexing 1.5 million pages, is this a sign of a CMS gone rogue? Is it an indication that we could be pumping out error pages, empty templates, or junk pages that we're cramming into Google's bot?
I would love to hear what you guys think. Is this normal? Is this something to be concerned about? Should our total index more closely match our sitemap page count?
-
As well as the parameters mentioned, you may have heaps of duplicated category pages, tag pages, etc. What I would also do is start searching Google with something like site:www.example.com/directory/ or possibly site:www.example.com/category/directory/directory/ so you are tightly narrowing down the results, then switch to 100 results per page and manually look for clues.
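If you already have a list of URLs from a crawl or an export, a rough way to decide which directories to run those site: searches on is to tally URLs by their leading directories. This is only a sketch: the function names and the sample URLs are made up for illustration, and it only tells you where your own list is heaviest, not what Google has indexed.

```python
from collections import Counter
from urllib.parse import urlsplit

def directory_prefix(url, depth=1):
    """Return host plus the first `depth` directories of the URL path."""
    parts = urlsplit(url)
    # Drop the leading empty piece and the trailing file name (or empty piece).
    directories = [s for s in parts.path.split("/")[1:-1] if s][:depth]
    prefix = "/".join(directories)
    return f"{parts.netloc}/{prefix}/" if prefix else f"{parts.netloc}/"

def site_queries(urls, depth=1, top=5):
    """Build site: queries for the directories holding the most URLs."""
    counts = Counter(directory_prefix(u, depth) for u in urls)
    return [(f"site:{prefix}", n) for prefix, n in counts.most_common(top)]

# Hypothetical sample data, not taken from the site in question.
sample_urls = [
    "http://www.example.com/tag/news/",
    "http://www.example.com/tag/food/",
    "http://www.example.com/category/a/b/",
    "http://www.example.com/about.html",
]
queries = site_queries(sample_urls)
```

Raising `depth` gives you the tighter site:www.example.com/category/directory/ style of query for directories that still return too many results.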
-
If you have 1.5 million pages and you think your sitemap is comprehensive at 83,000 then yes, your CMS is needlessly generating pages. It's usually not a big deal from a ranking standpoint, but it can make other important issues hard to detect. I would clean it up, but that's a business call you'll have to make.
The first step is diagnosing where the URLs are coming from. What you do next will depend on that, but I will give you the best advice I can without knowing what types of extraneous URLs you have and how Google is treating them:
First, I'd start with WMT > Crawl > URL Parameters. Quite often your CMS will generate URLs with parameters, and Google usually knows how to handle them. If there are a lot of URL parameters, search Google for those URLs and see if the pages are exactly the same as other pages. If they are, make sure you have canonical tags in place to point them to the main version. There's more you can do with parameters, but it'll depend on what you find, so I won't go into more detail. As a general rule, though, a CMS should not generate a page unless it is uniquely useful as a differentiated landing page or a page for people to link to.
Also check for parameters in your analytics program. They could actually be messing up your pageview data, depending on how you report. There's a post on fixing that in GA here:
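To get a feel for how many parameter variations you have, you could group a URL list (from a crawl or your analytics export) by the URL with its query string stripped. This is a minimal sketch with made-up function names and sample URLs. It assumes the parameter-free URL is the main version, which won't be true if a parameter actually changes the content, so check a few by hand before pointing canonicals anywhere.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def group_by_base(urls):
    """Map each parameter-free URL to every variant that shares it."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_query(url)].append(url)
    return dict(groups)

def parameter_duplicates(urls):
    """Keep only base URLs that show up under more than one variant."""
    return {
        base: variants
        for base, variants in group_by_base(urls).items()
        if len(variants) > 1
    }

# Hypothetical sample data for illustration only.
sample_urls = [
    "http://www.example.com/recipes/",
    "http://www.example.com/recipes/?sort=date",
    "http://www.example.com/recipes/?sort=title&page=2",
    "http://www.example.com/about/",
]
duplicates = parameter_duplicates(sample_urls)
```

Each key in `duplicates` is a candidate canonical target, and its list shows the variants that would point to it.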
http://blog.crazyegg.com/2013/03/29/remove-url-parameters-from-google-analytics-reports/
Next I'd look at the "Advanced" tab in WMT > Google Index > Index Status. Are there a lot of URLs removed? If so, check on these pages and see why they were removed and why they exist.
I would also run a crawl with Xenu and Screaming Frog to make sure crawlers are finding a reasonable number of pages and that they're not getting stuck in crawl loops (crawling variations of a page endlessly). These kinds of issues can prevent new pages from being indexed on time because Google is wasting time (your crawl budget) running in circles.
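Once you have a crawl export, one crude way to spot possible crawl loops is to flag URLs whose paths are unusually deep or keep repeating the same segment. This is a heuristic sketch, not something Xenu or Screaming Frog does for you. The function names and the thresholds (`max_depth=8`, `max_repeats=2`) are arbitrary assumptions you would tune for your own site.

```python
from collections import Counter
from urllib.parse import urlsplit

def path_segments(url):
    """Split the URL path into its non-empty segments."""
    return [s for s in urlsplit(url).path.split("/") if s]

def looks_like_loop(url, max_depth=8, max_repeats=2):
    """Flag URLs that are very deep or repeat one segment too often."""
    segments = path_segments(url)
    if len(segments) > max_depth:
        return True
    counts = Counter(segments)
    return any(n > max_repeats for n in counts.values())

def suspicious_urls(urls, **thresholds):
    """Return the subset of URLs that trip the loop heuristic."""
    return [u for u in urls if looks_like_loop(u, **thresholds)]

# Hypothetical crawl output for illustration only.
crawled = [
    "http://www.example.com/2013/03/03/post-title/",
    "http://www.example.com/tag/news/tag/news/tag/news/",
    "http://www.example.com/a/b/c/d/e/f/g/h/i/",
]
flagged = suspicious_urls(crawled)
```

Anything it flags is only a lead to check manually; a legitimately deep archive will trip it too.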
-
Rob,
Your sitemap is only an indication to Google of the URLs on your domain. The sitemap does not limit Google to crawling or indexing only the URLs listed in it, nor is it a directive that tells Google to remove URLs it has already crawled from the index. As stated in GWT, use **robots.txt** to specify how search engines should crawl your site, or request **removal** of URLs from Google's search results with the URL removal tool in Google Webmaster Tools, under the "Google Index" link.
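Before relying on robots.txt, it can help to test which URLs your rules would actually block. Here is a small sketch using Python's standard `urllib.robotparser`. The rules and URLs below are hypothetical examples, not the site's real file, and Python's parser may not interpret every rule (wildcards, for instance) the same way Googlebot does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /search/
"""

def build_parser(robots_text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def blocked_urls(parser, urls, user_agent="Googlebot"):
    """Return the URLs the given user agent may not crawl."""
    return [u for u in urls if not parser.can_fetch(user_agent, u)]

parser = build_parser(ROBOTS_TXT)
candidates = [
    "http://www.example.com/recipes/",
    "http://www.example.com/search/?q=pie",
    "http://www.example.com/wp-admin/options.php",
]
blocked = blocked_urls(parser, candidates)
```

Running your junk URL patterns through something like this shows which ones the rules cover and which ones would still be crawled.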