Locating Duplicate Pages
-
Hi,
Our website consists of approximately 15,000 pages; however, according to our Google Webmaster Tools account, Google has around 26,000 pages for us in its index.
I have run the site through half a dozen sitemap generators and they all discover only the 15,000 pages that we know about. I have also gone through the site thoroughly to try to find any sections where we might be inadvertently generating duplicate pages, without success.
It has been over six months since we made any structural changes (at which point we set up 301s to the new locations), so I'd like to think that the majority of these old pages have been removed from the Google index. Additionally, the number of pages in the index doesn't appear to be going down by any discernible amount week on week.
I'm certain it's nothing to worry about; however, for my own peace of mind I'd like to confirm that the additional 11,000 pages are just old results that will eventually disappear from the index and that we're not generating any duplicate content.
Unfortunately, there doesn't appear to be a way to download a list of the 26,000 pages that Google has indexed so that I can compare it against our sitemap. Obviously I know about site:domain.com; however, this only returns the first 1,000 results, which all check out fine.
I was wondering if anybody knew of any methods or tools that we could use to identify these 11,000 extra pages in the Google index, so we can confirm that they're just old pages which haven’t fallen out of the index yet and that they’re not going to cause us a problem.
Thanks guys!
-
It's cool. Sorry, the point I was making is that, irrespective of what you search for, the page that is returned is http://www.refreshcartridges.co.uk/advanced_search_result.php (with nothing after the .php), and as such the search results page couldn't spawn multiple pages which could be indexed by Google.
-
Hmm, I'm not too knowledgeable about PHP pages. Sorry!
-
Sorry, I'm not sure what happened to that bit.ly address - The actual address of the website is www.refreshcartridges.co.uk.
Ah, I see what you mean about the search results now; however, this hopefully shouldn't be an issue because, for security (our web guy said something about injections), the URL that is returned irrespective of what is searched for is http://www.refreshcartridges.co.uk/advanced_search_result.php
Thanks again!
-
I can't get that link to work.
What I said before still applies with physical input (this is what I assumed when I said it).
For example, a user inputs the words "snakes and dogs" and clicks search. The new URL is "www.yoursite.com/search?q=snakes and dogs". All of these generated search URLs need noindex meta tags, or Google will flag them as duplicate content because, for example, this page and the result for "dogs and snakes" generate almost the same page.
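To make that concrete, here is a minimal Python sketch of how a template might decide to emit the tag. The path prefixes in `SEARCH_PATHS` and the function name `robots_meta_for` are assumptions for illustration, not anything taken from your site's code.

```python
from urllib.parse import urlsplit

# Assumed path prefixes that serve internal search results (hypothetical).
SEARCH_PATHS = ("/search", "/advanced_search_result.php")


def robots_meta_for(url: str) -> str:
    """Return a noindex robots meta tag for search-result URLs, else ''."""
    path = urlsplit(url).path
    if path.startswith(SEARCH_PATHS):
        # Keep the page out of the index but still let links be followed.
        return '<meta name="robots" content="noindex, follow">'
    return ""
```

The returned string would go in the page's head section; ordinary product and category pages get nothing added.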
Does that make sense?
It is in Google's Webmaster Guidelines that you should noindex these pages.
-
Many thanks for your input on this. I have actually looked at this through the HTML improvements section of GWMT; however, I am showing only a few dozen duplicated titles / descriptions, and this is simply due to the product categories being almost identical (for example, HP Deskjet 500 and HP Deskjet 500+).
-
Many thanks for your response. Our site is an eCommerce site that doesn't employ tags as such and our categories are all accounted for in the 15,000 page figure.
-
We did have this at the beginning of the year, when we used ?dispmode=grid and ?dispmode=list to change the way our results were displayed. This has been rectified, however, by completely removing the option; any instance of dispmode in the URL now forces a 301 to the correct master page. There are still a few hundred instances of this dispmode being present in the Google index, but 99% of them have fallen out now.
I have checked and double checked and we don't seem to have any issues like this at present.
-
I'm not certain if this is the case, as our search engine requires physical input in order to yield a result. I don't know if it helps, but the URL is http://bit.ly/4Cogchww if you fancy taking a look.
-
Thanks for your reply. Indeed our website does force www. if someone were to attempt to navigate to us without prefixing www.
-
Hi Chris,
Google Webmaster Tools has a tool that helps identify duplicate HTML, and maybe you can use that to see if the 11,000 pages are duplicates. If they are, I am assuming they would have duplicate title tags, etc., which the tool may discover.
-
Have you checked for instances where a page parameter is being seen as another version of the same page? One of the sites I work for had an issue a few months back where every instance of a product page was being flagged as duplicate content because of an oversight. We had one of our coders write a clause into the page where every time a page loaded with a parameter such as ?color=72 it would canonicalize it to the page minus the parameter. This decreased our duplicate content warnings quickly and effectively.
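I don't have our coder's actual clause to hand, but a rough Python sketch of the idea would be something like the following. The function names are made up for illustration, and it assumes every query parameter can safely be dropped, which may not hold for your site.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Drop the query string and fragment so ?color=72 maps to the base page."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def canonical_link_tag(url: str) -> str:
    """Build the <link rel="canonical"> element for the page's <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}">'
```

Every parameterised variant of a product page then points at the same canonical address.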
-
It could be that your tags and categories are considered individual pages and are therefore creating their own permalinks, e.g. http://www.example.com/keyword, http://www.example.com/tag/keyword and http://www.example.com/category/keyword. Another way would be to check the sitemaps you have in Webmaster Tools and compare those to each other. Just a suggestion.
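If you want to compare sitemaps without eyeballing them, a small Python sketch along these lines might do. It assumes both files follow the standard sitemaps.org urlset format and that you have already loaded them as text; the function names are just illustrative.

```python
import xml.etree.ElementTree as ET
from typing import List, Set

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> Set[str]:
    """Collect every <loc> URL from a sitemap document."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    }


def only_in_first(first_xml: str, second_xml: str) -> List[str]:
    """URLs listed in the first sitemap but missing from the second."""
    return sorted(sitemap_urls(first_xml) - sitemap_urls(second_xml))
```

Running it in both directions shows which URLs each sitemap has that the other lacks.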
-
Does your website force 'www.'?
yourdomain.com and www.yourdomain.com are treated as separate sites and can have different pages spidered.
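As a rough illustration of what "forcing www." means, here is a Python sketch that works out where a non-www request should be 301-redirected. The function name is hypothetical, and in practice this is usually handled in the web server configuration rather than application code.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def www_redirect_target(url: str) -> Optional[str]:
    """Return the www. version of the URL, or None if it already uses www."""
    parts = urlsplit(url)
    if parts.netloc.lower().startswith("www."):
        return None  # Already on the preferred host; no redirect needed.
    return urlunsplit(
        (parts.scheme, "www." + parts.netloc, parts.path, parts.query, parts.fragment)
    )
```

If the function returns a URL, the server would answer with a 301 to that address so only one hostname ends up indexed.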
-
Be sure to try different combinations of 'site:www.domain.com' and 'site:domain.com'. They will all yield different results.
Sounds to me like you probably have an internal search engine that is generating search results pages based off the search term, and each different results page is a piece of duplicate content.