Why do I have so many extra indexed pages?
-
Stats-
Webmaster Tools Indexed Pages- 96,995
Site: Search- 97,800 Pages
Sitemap Submitted- 18,832
Sitemap Indexed- 9,746
I went through the search results through page 28 and every item it showed was correct. How do I figure out where these extra 80,000 items are coming from? I tried crawling the site with screaming frog awhile back but it locked because of so many urls. The site is a Magento site so there are a million urls, but I checked and all of the canonicals are setup properly. Where should I start looking?
-
It ended up being my search results. I was able to use the site operator to break it down.
-
To ensure Screaming Frog can handle the crawl you could chunk up the site and crawl it in parts, e.g. by each subdirectory. This can be done within the 'configuration' menu under 'include'. There's loads of tutorials online.
You can also use exclude to ensure it doesn't crawl unnecessary pages, images or scripts for example on wordpress I often block wp-content
Definitely sounds like a problem with query parameters being indexed though and its often good to ensure these are addressed in the search console.
-
1. Your first one is interesting. I actually haven't been in there before. There are 96 rows and everyone of them is set to let Googlebot Decide. Do you think I should change that up?
2. Not sure on how many images we have but it is a lot. Not we do not have an image sitemap.
I tried Screaming Frog and it couldn't handle it. After about 1.5 million urls it kept locking up. I just setup a free trial for Deep Crawl. It can only do 10,000 but I will see if it has anything worthwhile.
-
- Have you checked out the parameters settings in Google Search Console to find out how many pages Google has found for your site with the same parameters? That might give some insights on that side.
- How many images do you have across the site? Do you have image sitemaps for these kind of pages.
What I would advise + what you've already been trying is to get a full crawl by either using ScreamingFrog or Deepcrawl. This will provide you with better insights into how many pages a search engine can really find.
-
I wouldn't say it is doing fine. Before I started they launched a new site and messed up the 301 redirects. Traffic hasn't recovered yet.
For Robots I am using the Inchoo robots.txt-http://inchoo.net/ecommerce/ultimate-magento-robots-txt-file-examples/ maybe it is a parameters issue, but I can't figure out how to see all my indexed pages.
I tried doing a search for both inurl:= site:www.site.com and inurl:? site:www.site.com and nothing showed up unless I am missing something.
I can't figure out how to check if some of the canonicalized urls are indexed. The pages are all identical though.
We have less then 100 out of stock items.
-
As long as your organic traffic is doing fine I shouldn't be too concerned. That being said:
- Is your robots.txt or search console disallowing crawler access to parameters like '?count=' or '?color='?
- Is your robots.txt disallowing crawler access to urls that have a 'noindex' but were indexed before they got noindex?
- You can also take a couple of parameters from your site and test if any url's have been indexed, by using the 'inurl:parameter site:www.site.com' query.
- Are some of the canonicalized urls indexed anyway? This may indicate that page content is different enough for Google to index both versions.
- If there's a ton of articles that go in and out of stock and use dynamic ID's, Google may keep these in their index. Do out of stock articles return a 404 or are they kept alive?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Crawling/indexing of near duplicate product pages
Hi, Hope someone can help me out here. This is the current situation: We sell stones/gravel/sand/pebbles etc. for gardens. I will take a type of pebbles and the corresponding pages/URL's to illustrate my question --> black beach pebbles. We have a 'top' product page for black beach pebbles on which you can find different types of quantities (differing from 20kg untill 1600 kg). There is not any search volume related to the different quantities The 'top' page does not link to the pages for the different quantities The content on the pages for the different quantities is not exactly the same (different price + slightly different content). But a lot of the content is the same. Current situation:
Intermediate & Advanced SEO | | AMAGARD
- Most pages for the different quantities do not have internal links (about 95%) But the sitemap does contain all of these pages. Because the sitemap contains all these URL's, google frequently crawls them (I checked the logfiles) and has indexed them. Problems: Google spends its time crawling irrelevant pages --> our entire website is not that big, so these quantity URL's kind of double the total number of URL's. Having url's in the sitemap that do not have an internal link is a problem on its own All these pages are indexed so all sorts of gravel/pebbles have near duplicates. My solution: remove these URL's from the sitemap --> that will probably stop Google from regularly crawling these pages Putting a canonical on the quantity pages pointing to the top-product page. --> that will hopefully remove the irrelevant (no search volume) near duplicates from the index My questions: To be able to see the canonical, google will need to crawl these pages. Will google still do that after removing them from the sitemap? Do you agree that these pages are near duplicates and that it is best to remove them from the index? A few of these quantity pages do have intenral links (a few procent of them) because of a sale campaign. So there will be some (not much) internal links pointing to non-canonical pages. Would that be a problem? Thanks a lot in advance for your help! Best!1 -
How to optimize count of interlinking by increasing Interlinking count of chosen landing pages and decreasing for less important pages within the site?
We have taken out our interlinking counts (Only Internal Links and not Outbound Links) through Google WebMaster tool and discovered that the count of interlinking of our most significant pages are less as compared to of less significant pages. Our objective is to reverse the existing behavior by increasing Interlinking count of important pages and reduce the count for less important pages so that maximum link juice could be transferred to right pages thereby increasing SEO traffic.
Intermediate & Advanced SEO | | vivekrathore0 -
How long to re-index a page after being blocked
Morning all! I am doing some research at the moment and am trying to find out, just roughly, how long you have ever had to wait to have a page re-indexed by Google. For this purpose, say you had blocked a page via meta noindex or disallowed access by robots.txt, and then opened it back up. No right or wrong answers, just after a few numbers 🙂 Cheers, -Andy
Intermediate & Advanced SEO | | Andy.Drinkwater0 -
What to do when your home page an index for a series of pages.
I have created an index stack. My home page is http://www.southernwhitewater.com The home page is the index itself and the 1st page http://www.southernwhitewater.com/nz-adventure-tours-whitewater-river-rafting-hunting-fishing My home page (if your look at it through moz bat for chrome bar} incorporates all the pages in the index. Is this Bad? I would prefer to index each page separately. As per my site index in the footer What is the best way to optimize all these pages individually and still have the customers arrive at the top to a picture. rel= canonical? Any help would be great!! http://www.southernwhitewater.com
Intermediate & Advanced SEO | | VelocityWebsites0 -
Do you add 404 page into robot file or just add no index tag?
Hi, got different opinion on this so i wanted to double check with your comment is. We've got /404.html page and I was wondering if you would add this page to robot text so it wouldn't be indexed or would you just add no index tag? What would be the best approach? Thanks!
Intermediate & Advanced SEO | | Rubix0 -
Does a page on a site with high domain authority build page authority easier? i.e. less inbound links?
Is this also why people build backlinks to their BBB profiles, Yellowpages Profiles, etc. i.e. why do people build backlinks to other pages that link to them? Wouldn't it be more beneficial to just build that backlink directly to your target?
Intermediate & Advanced SEO | | adriandg0 -
Landing page indexed and ranking in less then 24 hours
Hi, I got a landing page which went up last night about 11pm. Its been indexed and ranked since then. Its a EMD and has about 600 words of unqiue content. It currently sits on page 9 for what I would say is a non competitive term (the top result is not an EMD and has 10 backlinks from the same site, which has no PR). Now my question is this: Would you say that page 9 is the given position Google thinks this website should sit at? Or because its so new could I very much expect some more movement? Basically up the rankings? Cheers
Intermediate & Advanced SEO | | activitysuper0 -
Pop Up Pages Being Indexed, Seen As Duplicate Content
I offer users the opportunity to email and embed images from my website. (See this page http://www.andertoons.com/cartoon/6246/ and look under the large image for "Email to a Friend" and "Get Embed HTML" links.) But I'm seeing the ensuing pop-up pages (Ex: http://www.andertoons.com/embed/5231/?KeepThis=true&TB_iframe=true&height=370&width=700&modal=true and http://www.andertoons.com/email/6246/?KeepThis=true&TB_iframe=true&height=432&width=700&modal=true) showing up in Google. Even worse, I think they're seen as duplicate content. How should I deal with this?
Intermediate & Advanced SEO | | andertoons0