Locating Duplicate Pages
-
Hi,
Our website consists of approximately 15,000 pages; however, according to our Google Webmaster Tools account, Google has around 26,000 pages for us in its index.
I have run through half a dozen sitemap generators and they all discover only the 15,000 pages that we know about. I have also gone through the site thoroughly to try to find any sections where we might be inadvertently generating duplicate pages, without success.
It has been over six months since we made any structural changes (at which point we set up 301s to the new locations), so I'd like to think that the majority of these old pages have been removed from the Google index. Additionally, the number of pages in the index doesn't appear to be going down by any discernible amount week on week.
I'm fairly sure it's nothing to worry about; however, for my own peace of mind I'd like to confirm that the additional 11,000 pages are just old results that will eventually disappear from the index and that we're not generating any duplicate content.
Unfortunately there doesn't appear to be a way to download a list of the 26,000 pages that Google has indexed so that I can compare it against our sitemap. I know about site:domain.com; however, this only returns the first 1,000 results, which all check out fine.
I was wondering if anybody knew of any methods or tools that we could use to attempt to identify these 11,000 extra pages in the Google index so we can confirm that they're just old pages which haven’t fallen out of the index yet and that they’re not going to be causing us a problem?
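For the comparison step itself, here is a minimal sketch in Python, assuming you have managed to collect some list of indexed URLs by whatever means. The `indexed` list, the sample sitemap and the example.com URLs are all made up for illustration; it only shows how the diff against the sitemap could work once such a list exists.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


def unexpected_urls(indexed, known):
    """Return indexed URLs that the sitemap does not list, sorted."""
    return sorted(set(indexed) - set(known))


# Hypothetical sample data, not taken from the real site.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/hp-deskjet-500</loc></url>
</urlset>"""

indexed = [
    "http://www.example.com/",
    "http://www.example.com/hp-deskjet-500",
    "http://www.example.com/hp-deskjet-500?dispmode=grid",
]

extras = unexpected_urls(indexed, sitemap_urls(sample_sitemap))
```

Anything left in `extras` is a URL the sitemap generators never found, which is where old or duplicate pages would show up.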
Thanks guys!
-
It's cool. Sorry, the point I was making is that, irrespective of what you search for, the page that is returned is http://www.refreshcartridges.co.uk/advanced_search_result.php (with nothing after the .php), and as such the search results page couldn't spawn multiple pages which could be indexed by Google.
-
Hmm, I'm not too knowledgeable about php pages. Sorry!
-
Sorry, I'm not sure what happened to that bit.ly address - The actual address of the website is www.refreshcartridges.co.uk.
Ah, I see what you mean about the search results now. Hopefully this shouldn't be an issue, though: for security reasons (our web guy said something about injections), the URL that is returned, irrespective of what is searched for, is http://www.refreshcartridges.co.uk/advanced_search_result.php
Thanks again!
-
I can't get that link to work.
What I said before still applies with physical input (this is what I assumed when I said it).
For example, a user inputs the words "snakes and dogs" and clicks search. The new URL is "www.yoursite.com/search?q=snakes and dogs". All these weird URL pages need noindex meta tags or Google will flag them as duplicate content because, for example, this page and the result for "dogs and snakes" generate almost the same page.
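As a rough sketch of what that could look like on the server side, assuming the page template can call a helper when it builds the head section: the `SEARCH_PATHS` set and the `robots_meta` function are hypothetical names for illustration, not part of any particular shopping cart.

```python
from urllib.parse import urlsplit

# Assumed paths of the internal search results pages; adjust to the real site.
SEARCH_PATHS = {"/search", "/advanced_search_result.php"}


def robots_meta(url):
    """Return a noindex robots meta tag for search pages, else an empty string."""
    path = urlsplit(url).path
    if path in SEARCH_PATHS:
        # "follow" still lets crawlers follow the product links on the page.
        return '<meta name="robots" content="noindex, follow">'
    return ""


tag = robots_meta("http://www.yoursite.com/search?q=snakes+and+dogs")
```

The check is on the path only, so every search term gets the same tag no matter what comes after the question mark.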
Does that make sense?
It is in Google's Webmaster Guidelines that you should noindex these pages.
-
Many thanks for your input on this. I have actually looked at this through the HTML Improvements section of GWMT; however, I am showing only a few dozen duplicated titles / descriptions, and this is simply due to the product categories being almost identical (for example, HP Deskjet 500 and HP Deskjet 500+).
-
Many thanks for your response. Our site is an eCommerce site that doesn't employ tags as such and our categories are all accounted for in the 15,000 page figure.
-
We did have this at the beginning of the year, when we used ?dispmode=grid and ?dispmode=list to change the way our results were displayed. This has since been rectified: we removed the option completely, and any instance of dispmode in the URL now forces a 301 to the correct master page. There are still a few hundred instances of dispmode present in the Google index, but 99% of them have fallen out now.
I have checked and double checked and we don't seem to have any issues like this at present.
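For anyone wanting to do something similar, this is a minimal sketch of the redirect decision in Python. It is not our actual code: the `redirect_target` helper and the example.com URLs are made up, and the real 301 would be issued by whatever the web server or framework provides.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def redirect_target(url, dropped=("dispmode",)):
    """Return the URL to 301 to, or None when no redirect is needed."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key not in dropped]
    if len(kept) == len(pairs):
        return None  # no dropped parameter present, serve the page as usual
    # Rebuild the URL with the remaining parameters only.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


target = redirect_target("http://www.example.com/inkjet?page=2&dispmode=list")
```

Here `target` keeps `page=2` and loses `dispmode`, so other parameters survive the redirect.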
-
I'm not certain if this is the case as our search engine requires physical input in order to yield a result. I don't know if it helps but the URL is http://bit.ly/4Cogchww if you fancy taking a look
-
Thanks for your reply. Indeed our website does force www. if someone were to attempt to navigate to us without prefixing www.
-
Hi Chris,
Google Webmaster Tools has a feature that helps identify duplicate HTML, and maybe you can use that to see if the 11,000 pages are duplicates. If they are, I am assuming they would have duplicate title tags etc., which the tool may discover.
-
Have you checked for instances where a page parameter is being seen as another version of the same page? One of the sites I work for had an issue a few months back where every instance of a product page was being flagged as duplicate content because of an oversight. We had one of our coders write a clause into the page where every time a page loaded with a parameter such as ?color=72 it would canonicalize it to the page minus the parameter. This decreased our duplicate content warnings quickly and effectively.
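A rough sketch of that kind of clause in Python, assuming the canonical version of a page is simply the URL with every parameter stripped. That assumption will not hold if some parameters really do select different content, and the function names and example.com URL are hypothetical.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def canonical_link_tag(url):
    """Return a rel=canonical link tag pointing at the parameter-free URL."""
    return '<link rel="canonical" href="%s">' % escape(canonical_url(url), quote=True)


tag = canonical_link_tag("http://www.example.com/product?color=72")
```

The tag goes in the head of every variant, so ?color=72 and any other variant all point back at the one page.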
-
It could be that your tags and categories are considered individual pages and are therefore creating their own permalinks, e.g. http://www.example.com/keyword, http://www.example.com/tag/keyword and http://www.example.com/category/keyword. Another approach would be to check the sitemaps you have in Webmaster Tools and compare them to each other. Just a suggestion.
-
Does your website force 'www.'?
Both yourdomain.com and www.yourdomain.com are separate sites and can have different pages spidered.
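If it doesn't, the usual fix is a 301 from the bare domain to the www. version. Here is a minimal sketch of the decision in Python. The `www_redirect` helper is a made-up name, and it naively prefixes any host that lacks www., so it would need adjusting for a site with other subdomains.

```python
from urllib.parse import urlsplit, urlunsplit


def www_redirect(url):
    """Return the www. URL to 301 to, or None if the host already starts with www."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.lower().startswith("www."):
        return None
    # Keep path, query and fragment so deep links land on the same page.
    return urlunsplit(
        (parts.scheme, "www." + host, parts.path, parts.query, parts.fragment)
    )


target = www_redirect("http://yourdomain.com/page?x=1")
```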
-
Be sure to try different combinations of 'site:www.domain.com' and 'site:domain.com'. They will all yield different results.
Sounds to me like you probably have an internal search engine that is generating search results pages based off the search term, and each different results page is a piece of duplicate content.