Locating Duplicate Pages
-
Hi,
Our website consists of approximately 15,000 pages; however, according to our Google Webmaster Tools account, Google has around 26,000 pages for us in its index.
I have run through half a dozen sitemap generators and they all discover only the 15,000 pages that we know about. I have also gone through the site thoroughly, without success, trying to find any sections where we might be inadvertently generating duplicate pages.
It has been over six months since we made any structural changes (at which point we set up 301s to the new locations), so I'd like to think that the majority of these old pages have been removed from the Google index. However, the number of pages in the index doesn't appear to be going down by any discernible amount week on week.
I'm certain it's nothing to worry about; however, for my own peace of mind I'd like to confirm that the additional 11,000 pages are just old results that will eventually disappear from the index and that we're not generating any duplicate content.
Unfortunately there doesn't appear to be a way to download a list of the 26,000 pages that Google has indexed so that I can compare it against our sitemap. I know about site:domain.com, but this only returns the first 1,000 results, which all check out fine.
Does anybody know of any methods or tools that we could use to identify these 11,000 extra pages in the Google index, so we can confirm that they're just old pages which haven't fallen out of the index yet and that they're not going to cause us a problem?
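If a list of indexed URLs can be assembled from some source (that is the hard part, and purely an assumption here), the comparison against the sitemap itself is simple. A minimal sketch in Python, assuming a standard sitemaps.org XML file; the function names are made up for illustration:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def normalise(url: str) -> str:
    """Lower-case the host and drop any fragment so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def sitemap_urls(sitemap_xml: str) -> set[str]:
    """Collect every <loc> entry from the sitemap."""
    root = ET.fromstring(sitemap_xml)
    return {normalise(loc.text) for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text}


def unexpected_urls(indexed: list[str], sitemap_xml: str) -> list[str]:
    """Return indexed URLs that the sitemap does not know about, sorted."""
    known = sitemap_urls(sitemap_xml)
    return sorted({normalise(u) for u in indexed} - known)
```

Whatever comes back from `unexpected_urls` is the set worth inspecting by hand: old redirected URLs, parameter variants, or non-www copies.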
Thanks guys!
-
It's cool. Sorry, the point I was making is that, irrespective of what you search for, the page that is returned is http://www.refreshcartridges.co.uk/advanced_search_result.php (with nothing after the .php), and as such the search results page couldn't spawn multiple pages which could be indexed by Google.
-
Hmm, I'm not too knowledgeable about php pages. Sorry!
-
Sorry, I'm not sure what happened to that bit.ly address - The actual address of the website is www.refreshcartridges.co.uk.
Ah, I see what you mean about the search results now. Hopefully this shouldn't be an issue, though: for security (our web guy said something about injections), the URL that is returned, irrespective of what is searched for, is http://www.refreshcartridges.co.uk/advanced_search_result.php
Thanks again!
-
I can't get that link to work.
What I said before still applies with physical input (this is what I assumed when I said it).
For example, a user enters the words "snakes and dogs" and clicks search. The new URL is "www.yoursite.com/search?q=snakes and dogs". All of these generated URLs need noindex meta tags, or Google will flag them as duplicate content because, for example, this page and the result for "dogs and snakes" are almost the same page.
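A rough sketch of that idea in Python, in case it helps: decide per request which robots meta tag goes in the page head. The paths in `SEARCH_PATHS` and the function name are assumptions for illustration, not how any particular shop platform does it.

```python
from urllib.parse import urlsplit

# Hypothetical paths of internal search result pages; adjust to the real site.
SEARCH_PATHS = {"/search", "/advanced_search_result.php"}


def robots_meta(url: str) -> str:
    """Return the robots meta tag to place in the page's <head>."""
    path = urlsplit(url).path
    if path in SEARCH_PATHS:
        # Keep the results page out of the index but let crawlers follow its links.
        return '<meta name="robots" content="noindex, follow">'
    return '<meta name="robots" content="index, follow">'
```

So `robots_meta("http://www.yoursite.com/search?q=snakes+and+dogs")` gives the noindex tag, while an ordinary product or category URL keeps the default.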
Does that make sense?
It is in Google's Webmaster Guidelines that you should noindex these pages.
-
Many thanks for your input on this. I have actually looked at this through the HTML Improvements section of GWMT; however, it shows only a few dozen duplicated titles / descriptions, and this is simply due to some product categories being almost identical (for example HP Deskjet 500 and HP Deskjet 500+).
-
Many thanks for your response. Our site is an eCommerce site that doesn't employ tags as such and our categories are all accounted for in the 15,000 page figure.
-
We did have this at the beginning of the year, when we used ?dispmode=grid and ?dispmode=list to change the way our results were displayed. This has since been rectified: we removed the option completely, and any instance of dispmode in the URL forces a 301 to the correct master page. There are still a few hundred instances of dispmode URLs in the Google index, but 99% of them have fallen out now.
I have checked and double checked and we don't seem to have any issues like this at present.
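For anyone wanting to do the same, the redirect rule amounts to something like the sketch below. It is written in Python purely for illustration (it is not our actual site code), and `redirect_target` is a made-up name:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def redirect_target(url: str, dropped=("dispmode",)):
    """Return the URL to 301 to if `url` carries a retired parameter, else None."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Keep every parameter except the retired ones.
    kept = [(key, value) for key, value in pairs if key not in dropped]
    if len(kept) == len(pairs):
        return None  # Nothing to strip, so serve the page normally.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

A request for a category page with `?dispmode=grid&page=2` would be redirected to the same page with just `?page=2`, and a URL without dispmode is left alone.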
-
I'm not certain that this is the case, as our search engine requires physical input in order to yield a result. I don't know if it helps, but the URL is http://bit.ly/4Cogchww if you fancy taking a look.
-
Thanks for your reply. Indeed our website does force www. if someone were to attempt to navigate to us without prefixing www.
-
Hi Chris,
Google Webmaster Tools has a report that helps identify duplicate HTML, and maybe you can use that to see whether the 11,000 pages are duplicates. If they are, I would assume they share duplicate title tags and so on, which the tool may discover.
-
Have you checked for instances where a page parameter is being seen as another version of the same page? One of the sites I work for had an issue a few months back where every instance of a product page was being flagged as duplicate content because of an oversight. We had one of our coders write a clause into the page where every time a page loaded with a parameter such as ?color=72 it would canonicalize it to the page minus the parameter. This decreased our duplicate content warnings quickly and effectively.
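To give a rough idea of what that clause does, here is a sketch in Python (not our actual code). It assumes every query parameter can be dropped from the canonical URL, which will not be true for every site, and `canonical_link` is just an illustrative name:

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link(url: str) -> str:
    """Build a canonical tag pointing at the URL with its query string removed."""
    parts = urlsplit(url)
    # Assumption: all parameters (e.g. ?color=72) are presentation-only.
    clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return f'<link rel="canonical" href="{escape(clean, quote=True)}">'
```

The tag goes in the head of every variant, so `?color=72` and `?color=15` both point back to the one product page.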
-
It could be that your tags and categories are considered individual pages and are therefore creating their own permalinks, e.g. http://www.example.com/keyword, http://www.example.com/tag/keyword and http://www.example.com/category/keyword. Another approach would be to check the sitemaps you have in Webmaster Tools and compare them with each other. Just a suggestion.
-
Does your website force 'www.'?
Both yourdomain.com and www.yourdomain.com are separate sites and can have different pages spidered.
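If you have any list of crawled or indexed URLs to hand, a quick way to spot this is to look for paths that show up under more than one hostname. A minimal Python sketch, with made-up function names:

```python
from collections import defaultdict
from urllib.parse import urlsplit


def hosts_by_path(urls):
    """Map each path (plus query string) to the set of hostnames it was seen under."""
    seen = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        key = parts.path + ("?" + parts.query if parts.query else "")
        seen[key].add(parts.netloc.lower())
    return seen


def split_across_hosts(urls):
    """Return the paths that appear under more than one hostname, sorted."""
    return sorted(path for path, hosts in hosts_by_path(urls).items() if len(hosts) > 1)
```

Any path it returns exists on both the www and non-www versions of the site and is a candidate for a 301 to the preferred host.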
-
Be sure to try different combinations of 'site:www.domain.com' and 'site:domain.com'. They will all yield different results.
Sounds to me like you probably have an internal search engine that is generating search results pages based off the search term, and each different results page is a piece of duplicate content.