Locating Duplicate Pages
-
Hi,
Our website consists of approximately 15,000 pages however according to our Google Webmaster Tools account Google has around 26,000 pages for us in their index.
I have run through half a dozen sitemap generators and they all only discover the 15,000 pages that we know about. I have also thoroughly gone through the site to attempt to find any sections where we might be inadvertently generating duplicate pages without success.
It has been over six months since we did any structural changes (at which point we did 301's to the new locations) and so I'd like to think that the majority of these old pages have been removed from the Google Index. Additionally, the number of pages in the index doesn't appear to be going down by any discernable factor week on week.
I'm certain it's nothing to worry about however for my own peace of mind I'd like to just confirm that the additional 11,000 pages are just old results that will eventually disappear from the index and that we're not generating any duplicate content.
Unfortunately there doesn't appear to be a way to download a list of the 26,000 pages that Google has indexed so that I can compare it against our sitemap. Obviously I know about site:domain.com however this only returned the first 1,000 results which all checkout fine.
I was wondering if anybody knew of any methods or tools that we could use to attempt to identify these 11,000 extra pages in the Google index so we can confirm that they're just old pages which haven’t fallen out of the index yet and that they’re not going to be causing us a problem?
Thanks guys!
-
It's cool. Sorry, the point I was making is that irrespective of what you search for the page that is returned is http://www.refreshcartridges.co.uk/advanced_search_result.php (with nothing after the .php) and as such the search results page couldn't spurn multiple pages which could be indexed by Google.
-
Hmm, I'm not too knowledgeable about php pages. Sorry!
-
Sorry, I'm not sure what happened to that bit.ly address - The actual address of the website is www.refreshcartridges.co.uk.
Ah, I see what you mean about the search results now however this hopefully shouldn't be an issue as for security (our web guy said something about injections) the URL that is returned irrespective of what is searched for is http://www.refreshcartridges.co.uk/advanced_search_result.php
Thanks again!
-
I can't get that link to work.
What I said before still applies with physical input (this is what I assumed when I said it).
For example, user inputs the words "snakes and dogs" and clicks search. The new URL is "www.yoursite.com/search?q=snakes and dogs" All these weird URL pages need noindex meta tags or Google will flag them as duplicate content because, for example, this page and the result for "dogs and snakes" generate almost the same page.
Does that make sense?
It is in Google's Webmaster Guidelines that you should noindex these pages. -
Many thanks for your input on this. I have actually looked at this through the HTML improvements section of GWMT however I am showing only a few dozen duplicated titles / descriptions and this is simply due to the product categories being almost identical (for example HP Deskjet 500 and HP Deskjet 500+)
-
Many thanks for your response. Our site is an eCommerce site that doesn't employ tags as such and our categories are all accounted for in the 15,000 page figure.
-
We did have this at the beginning of the year when we used a ?dispmode=grid and ?dispmode=list to change the way our results were displayed. This has been rectified however by us completely removing the option and any instances of dispmode present in the URL force a 301 to the correct master page. There are still a few hundred instances of this dispmode being present in the Google index but 99% of them have fallen out now.
I have checked and double checked and we don't seem to have any issues like this at present.
-
I'm not certain if this is the case as our search engine requires physical input in order to yield a result. I don't know if it helps but the URL is http://bit.ly/4Cogchww if you fancy taking a look
-
Thanks for your reply. Indeed our website does force www. if someone were to attempt to navigate to us without prefixing www.
-
Hi Chris,
Google Webmaster has a tool that helps identify duplicate HTMLs and maybe you can use that to see if the 11,000 pages are duplicate. IF they are, I am assuming they should have the duplicate Title Tag and etc. which the tool may discover.
-
Have you checked for instances where a page parameter is being seen as another version of the same page? One of the sites I work for had an issue a few months back where every instance of a product page was being flagged as duplicate content because of an oversight. We had one of our coders write a clause into the page where every time a page loaded with a parameter such as ?color=72 it would canonicalize it to the page minus the parameter. This decreased our duplicate content warnings quickly and effectively.
-
it could be that your tags and categories are considered individual pages and therefore creating their own permalink: ex: http:www.example.com/keyword, and http://www.example.com/tag/keyword and http://www.example.com/category/keyword. Another way would be to check the sitemaps you have in webmaster tools and compare those to each other. Just a suggestion.
-
Does your website force 'www.'?
Both yourdomain.com and www.yourdomain.com are separate sites and can have different pages spidered.
-
Be sure to try different combinations of 'site:www.domain.com' and 'site:domain.com'. They will all yield different results.
Sounds to me like you probably have an internal search engine that is generating search results pages based off the search term, and each different results page is a piece of duplicate content.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Duplicate Content with ?Page ID's in WordPress
Hi there, I'm trying to figure out the best way to solve a duplicate content problem that I have due to Page ID's that WordPress automatically assigns to pages. I know that in order for me to resolve this I have to use canonical urls but the problem for me is I can't figure out the URL structure. Moz is showing me thousands of duplicate content errors that are mostly related to Page IDs For example, this is how a page's url should look like on my site Moz is telling me there are 50 duplicate content errors for this page. The page ID for this page is 82 so the duplicate content errors appear as follows and so on. For 47 more pages. The problem repeats itself with other pages as well. My permalinks are set to "Post Name" so I know that's not an issue. What can I do to resolve this? How can I use canonical URLs to solve this problem. Any help will be greatly appreciated.
On-Page Optimization | | SpaMedica0 -
Description tag not showing in the SERPs because page is blocked by Robots, but the page isn't blocked. Any help?
While checking some SERP results for a few pages of a site this morning I noticed that some pages were returning this message instead of a description tag, A description for this result is not avaliable because of this site's robot.s.txt The odd thing is the page isn't blocked in the Robots.txt. The page is using Yoast SEO Plugin to populate meta data though. Anyone else had this happen and have a fix?
On-Page Optimization | | mac22330 -
Why isn't our site being shown on the first page of Google for a query using the exact domain, when its pages are indeed indexed by Google
When I type our domain.com as a query into Google, I only see one of our pages on the homepage, and it's in 4th position. It seems though, that all pages of the site are indexed by google when I type in the query "site:domain.com". There was an issue at the site launch, where the robots.txt file was left active for around two weeks. Would this have been responsible for the fact that another domain ranks #1 when we type in our own domain? It has been around a couple of months now since the site was launched. Thanks in advance.
On-Page Optimization | | featherseo0 -
I have a question about on page links or duplicate contant
Ok help me out here friends. I’m working with the warnings and errors for my site. I have two problems that relate to each other and I want to know if you had to choose what problem what would you choose. I’m running into some duplicate content and title errors because under categories for my products there are so many products that it creates more than one page and with each new page it has the same title or same content on the page. I tried to make this less in some cases by showing more products per page like 100 items and in most cases per category it will only show one page now. Now some times there’s still more than one page and also this creates too many links now on that category page. So I think I can get rid of all the to many on page links but it will make more pages and duplicate content and title tag. What would you guys do?
On-Page Optimization | | Dataken0 -
Duplicate Page Content Question
This article was published on fastcompany.com on March 19th. http://www.fastcompany.com/magazine/164/designing-facebook It did not receive much traffic, so it was re-posted on Co.Design today (March 27th) where it has received significantly more traffic. http://www.fastcodesign.com/1669366/facebook-agrees-the-secret-to-its-future-success-is-design My question is if google will dock us for reprinting/reusing content on another site (even if it is a sister site within the same company). If they do frown on that, is there a proper way to attribute the content to the source material/site (fastcompany.com)?
On-Page Optimization | | DanAsadorian0 -
301 redirects from several sub-pages to one sub-page
Hi! I have 14 sub-pages i deleted earlier today. But ofcourse Google can still find them, and gives everyone that gives them a go a 404 error. I have come to the understading that this wil hurt the rest of my site, at least as long as Google have them indexed. These sub-pages lies in 3 different folders, and i want to redirect them to a sub-page in a folder number 4. I have already an htaccess file, but i just simply cant get it to work! It is the same file as i use for redirecting trafic from mydomain.no to www.mydomain.no, and i have tried every kind of variation i can think of with the sub-pages. Has anyone perhaps had the same problem before, or for any other reason has the solution, and can help me with how to compose the htaccess file? 🙂 You have to excuse me if i'm using the wrong terms, missing something i should have seen under water while wearing a blindfold, or i am misspelling anything. I am neither very experienced with anything surrounding seo or anything else that has with internet to do, nor am i from an englishspeaking country. Hope someone here can light up my path 🙂 Thats at least something you can say in norwegian...
On-Page Optimization | | MarieA1 -
Location in keyword terms
I'm optimizing a website for a dentist and I'm looking for the best approach to incorporating the location into the keyword terms. For example if a dental practice in Boston has a page on Cosmetic Dentistry what would be the best approach for optimizing for "Boston Cosmetic Dentist", "Boston Teeth Whitening" and "Cosmetic Dentist in Boston"? How should I handle the repetition of the location name? Will I get the best results by using the full keyword terms several times on the page "example a" or will "example b" provide similar results? Title Tag: a) Boston Cosmetic Dentist | Boston Teeth Whitening | Cosmetic Dentist in Boston
On-Page Optimization | | OptioPublishing
b) Boston Cosmetic Dentist | Teeth Whitening H1
a) Boston Cosmetic Dentist | Boston Teeth Whitening | Cosmetic Dentist in Boston
b) Boston Cosmetic Dentist | Teeth Whitening keywords to sprinkle through content
a) Boston Cosmetic Dentist, Boston Teeth Whitening, Cosmetic Dentist in Boston
b) Boston Cosmetic Dentist, Teeth Whitening etc... It's important to rank for all 3 keywords but the pages would be flooded with the words Dentist and Boston if I use each phrase exactly. Any suggestions? Thanks in advance,
Jason0 -
Duplicate content Issue
I'm getting a report of duplicate title and content on: http://www.website.com/ http://www.website.com/index.php Of course, they're the same pages but does this need to be corrected somehow. Thanks!
On-Page Optimization | | dbaxa-2613380