Locating Duplicate Pages
-
Hi,
Our website consists of approximately 15,000 pages; however, according to our Google Webmaster Tools account, Google has around 26,000 pages for us in its index.
I have run through half a dozen sitemap generators and they all discover only the 15,000 pages that we know about. I have also gone through the site thoroughly to find any sections where we might be inadvertently generating duplicate pages, without success.
It has been over six months since we made any structural changes (at which point we put 301s in place to the new locations), so I'd like to think that the majority of these old pages have been removed from the Google index. Additionally, the number of pages in the index doesn't appear to be going down by any discernible amount week on week.
I'm fairly certain it's nothing to worry about; however, for my own peace of mind, I'd like to confirm that the additional 11,000 pages are just old results that will eventually disappear from the index, and that we're not generating any duplicate content.
Unfortunately, there doesn't appear to be a way to download a list of the 26,000 pages that Google has indexed so that I can compare it against our sitemap. Obviously, I know about site:domain.com; however, this only returns the first 1,000 results, which all check out fine.
I was wondering if anybody knew of any methods or tools we could use to identify these 11,000 extra pages in the Google index, so we can confirm that they're just old pages which haven't fallen out of the index yet and won't cause us a problem?
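To show the sort of comparison I'm after, here's a minimal Python sketch. There's no official way to export the full list of indexed URLs, so the "indexed" list would have to be assembled by hand from site: queries, log files, etc. — the function names and example URLs below are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Sitemap files use this XML namespace on every element.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_xml):
    """Extract every <loc> value from a sitemap XML string."""
    root = ET.fromstring(sitemap_xml)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")}

def extra_urls(indexed, sitemap_xml):
    """Return the indexed URLs that do not appear in the sitemap."""
    return set(indexed) - sitemap_urls(sitemap_xml)
```

Given even a partial list of indexed URLs, `extra_urls` would surface the ones we don't know about.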
Thanks guys!
-
It's cool. Sorry, the point I was making is that, irrespective of what you search for, the page returned is http://www.refreshcartridges.co.uk/advanced_search_result.php (with nothing after the .php), and as such the search results page couldn't spawn multiple pages that could be indexed by Google.
-
Hmm, I'm not too knowledgeable about PHP pages. Sorry!
-
Sorry, I'm not sure what happened to that bit.ly address; the actual address of the website is www.refreshcartridges.co.uk.
Ah, I see what you mean about the search results now. Hopefully this shouldn't be an issue, as for security reasons (our web guy said something about injections) the URL returned, irrespective of what is searched for, is http://www.refreshcartridges.co.uk/advanced_search_result.php
Thanks again!
-
I can't get that link to work.
What I said before still applies to physically typed searches (that's what I assumed when I said it).
For example, a user inputs the words "snakes and dogs" and clicks search. The new URL is www.yoursite.com/search?q=snakes+and+dogs. All these search-generated URLs need noindex meta tags, or Google will flag them as duplicate content because, for example, this page and the result for "dogs and snakes" generate almost the same page.
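As a concrete sketch of what I mean (the URL is a made-up example), every internal search results page would carry a robots meta tag in its head:

```html
<!-- In the <head> of every internal search results page,
     e.g. www.yoursite.com/search?q=snakes+and+dogs -->
<meta name="robots" content="noindex, follow">
```

noindex keeps the results page out of the index, while follow still lets Googlebot crawl the links it contains.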
Does that make sense?
It is in Google's Webmaster Guidelines that you should noindex these pages.
-
Many thanks for your input on this. I have actually looked at this through the HTML improvements section of GWMT; however, I am showing only a few dozen duplicated titles / descriptions, and this is simply due to some product categories being almost identical (for example, HP Deskjet 500 and HP Deskjet 500+).
-
Many thanks for your response. Our site is an eCommerce site that doesn't employ tags as such and our categories are all accounted for in the 15,000 page figure.
-
We did have this at the beginning of the year, when we used ?dispmode=grid and ?dispmode=list to change the way our results were displayed. This has since been rectified by removing the option completely; any instance of dispmode present in the URL now forces a 301 to the correct master page. A few hundred dispmode URLs are still present in the Google index, but 99% of the original set has fallen out now.
I have checked and double checked and we don't seem to have any issues like this at present.
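For anyone facing the same thing, the redirect amounts to something like this .htaccess sketch (assuming Apache with mod_rewrite — not our exact code, and note that this simple version drops the entire query string, which is only safe when dispmode is the sole parameter):

```apache
# If dispmode appears anywhere in the query string,
# 301 to the same path with no query string at all.
RewriteEngine On
RewriteCond %{QUERY_STRING} (^|&)dispmode= [NC]
RewriteRule (.*) /$1? [R=301,L]
```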
-
I'm not certain this is the case, as our search engine requires physical input in order to yield a result. I don't know if it helps, but the URL is http://bit.ly/4Cogchww if you fancy taking a look.
-
Thanks for your reply. Indeed, our website does force www. if someone attempts to navigate to us without the www. prefix.
-
Hi Chris,
Google Webmaster Tools has a tool that helps identify duplicate HTML, and maybe you can use that to see if the 11,000 pages are duplicates. If they are, I am assuming they should have duplicate title tags, etc., which the tool may discover.
-
Have you checked for instances where a page parameter is being seen as another version of the same page? One of the sites I work for had an issue a few months back where every instance of a product page was being flagged as duplicate content because of an oversight. We had one of our coders write a clause into the page so that every time it loaded with a parameter such as ?color=72, it would canonicalize to the page minus the parameter. This decreased our duplicate content warnings quickly and effectively.
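A minimal sketch of what that clause emits (example.com and the color parameter are illustrative, not the site's real URLs):

```html
<!-- Output on /product.php?color=72, pointing search engines
     at the parameter-free version of the page -->
<link rel="canonical" href="http://www.example.com/product.php">
```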
-
It could be that your tags and categories are considered individual pages and are therefore creating their own permalinks, e.g. http://www.example.com/keyword, http://www.example.com/tag/keyword, and http://www.example.com/category/keyword. Another approach would be to check the sitemaps you have in Webmaster Tools and compare them against each other. Just a suggestion.
-
Does your website force 'www.'?
yourdomain.com and www.yourdomain.com are treated as separate sites and can have different pages spidered.
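For reference, a minimal sketch of forcing the www. host, assuming Apache with mod_rewrite (yourdomain.com is a placeholder):

```apache
# 301 any bare-domain request to the www. hostname
RewriteEngine On
RewriteCond %{HTTP_HOST} ^yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [R=301,L]
```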
-
Be sure to try different combinations of site:www.domain.com and site:domain.com; they can yield different results.
Sounds to me like you probably have an internal search engine that is generating search results pages based off the search term, and each different results page is a piece of duplicate content.