Internal Duplicate Content Question...

tdawson09

We are looking for an internal duplicate content checker that is capable of crawling a site that has over 300,000 pages. We have looked over Moz's duplicate content tool and it seems like it is somewhat limited in how deep it crawls. Are there any suggestions on the best "internal" duplicate content checker that crawls deep in a site?

BlueprintMarketing

If you want a free test crawl, use this:

https://www.deepcrawl.com/forms/free-crawl-report/

Please remember that URIs and URLs are different, so your site with 300,000 URLs might have 600,000 URIs. If you want to see how it works for free, you can sign up for a free crawl of your first 10,000 pages.

I am not affiliated with the company aside from being a very happy customer.

BlueprintMarketing

Far and away the best is going to be DeepCrawl. It automatically connects to Google Webmaster Tools and Analytics.

It can crawl continuously. The real advantage is setting it to five URLs per second; depending on the speed of your server, it will hold that rate consistently. I would not go over five pages per second. Make sure that you pick the dynamic IP option if you do not have a strong web application firewall. If you do pick a single static IP, you can crawl the entire site without issue by whitelisting it. This is my personal opinion, but what you're asking for can be accomplished in practically no time compared to other systems by using DeepCrawl (deepcrawl.com).
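As a rough sanity check on the five-URLs-per-second setting, the crawl time can be estimated with simple arithmetic. This is only a sketch: the function name `estimated_crawl_hours` is made up for illustration, and it assumes the crawler sustains the configured rate for the whole crawl, which in practice depends on the server.

```python
def estimated_crawl_hours(url_count: int, urls_per_second: float) -> float:
    """Return the hours needed to fetch url_count URLs at a constant rate."""
    if urls_per_second <= 0:
        raise ValueError("urls_per_second must be positive")
    # Total seconds = URLs / rate; divide by 3600 to convert to hours.
    return url_count / urls_per_second / 3600


if __name__ == "__main__":
    # 300,000 is the page count from the question; 600,000 is the
    # "might have twice as many URIs" case mentioned earlier in the thread.
    for count in (300_000, 600_000):
        hours = estimated_crawl_hours(count, 5)
        print(f"{count:,} URLs at 5/sec: about {hours:.1f} hours")
```

At five per second, 300,000 URLs come to roughly 16.7 hours, and 600,000 would take about 33.3 hours.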

It will show you the duplicate content inside your website: duplicate URLs, duplicate title tags, you name it.
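To make the idea of an internal duplicate check concrete, here is a minimal sketch of exact-duplicate detection by hashing page text. It is not how DeepCrawl or any other tool mentioned here works internally; the names `fingerprint` and `find_duplicates` and the example.com URLs are assumptions for illustration, and it presumes you already have the crawled text for each URL. It only catches pages that are identical after whitespace and case normalization, not near-duplicates.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash text after lowercasing and collapsing runs of whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose text shares a fingerprint; keep groups of two or more."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


if __name__ == "__main__":
    # Hypothetical crawl output: URL -> extracted page text (or title tag).
    pages = {
        "https://example.com/widgets": "Blue widgets for sale",
        "https://example.com/widgets?ref=nav": "Blue  widgets for SALE ",
        "https://example.com/about": "About our company",
    }
    for group in find_duplicates(pages):
        print("Duplicate group:", group)
```

The same function can be run over title tags instead of body text to list duplicate titles.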

https://www.deepcrawl.com/knowledge/best-practice/seven-duplicate-content-issues/

https://www.deepcrawl.com/knowledge/news/google-webmaster-hangout-highlights-08102015/

You have a decent-sized website, and I would recommend adding the free edition of Robotto (Robotto.org). Robotto can detect whether a preferred www or non-www option has been configured correctly.
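For anyone who wants to check the www versus non-www setup by hand, a minimal sketch follows. It is not Robotto's method; `final_url` and `same_preferred_host` are hypothetical helper names, and the assumption is that a correctly configured site redirects both variants to the same scheme and hostname.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def final_url(url: str, timeout: float = 10.0) -> str:
    """Follow redirects and return the URL that was finally served."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


def same_preferred_host(final_www: str, final_bare: str) -> bool:
    """True when both variants end up on the same scheme and hostname."""
    a, b = urlsplit(final_www), urlsplit(final_bare)
    return (a.scheme, a.hostname) == (b.scheme, b.hostname)


if __name__ == "__main__":
    # example.com is a placeholder; substitute your own domain.
    www = final_url("http://www.example.com/")
    bare = final_url("http://example.com/")
    print("Preferred host configured:", same_preferred_host(www, bare))
```

If the two variants resolve to different hosts, every page is reachable at two addresses, which is itself a source of internal duplicate content.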

A lot of issues with web application firewalls, CDNs, you name it can be detected using this tool, and the combination of the two is a real one-two punch. I honestly think that you will be happy with this tool. I have had issues with anything local, like Screaming Frog, when crawling such large websites; you do not want to depend on your desktop RAM. I hope you will let me know if this is a good solution for you. I know that it works very, very well, and it will not stop crawling until it finds everything. Your site will be finished within 24 hours.

tdawson09

Correct, Thomas. We are not looking to restructure the site at this time but we are looking for a program that will crawl 300,000 plus pages and let us know which internal pages are duplicated.

BlueprintMarketing

If the tool has to crawl more than a crawl depth of 100, it is very common to find something that's able to do it. Like I said, DeepCrawl, Screaming Frog, and Moz can, but you're talking about finding content that shouldn't be restructured.

BlueprintMarketing

If you're looking for the most powerful tool for crawling websites, deepcrawl.com is the king. Screaming Frog is good, but it is dependent on the RAM on your desktop and does not have as many features as DeepCrawl.

https://www.deepcrawl.com/knowledge/news/google-webmaster-hangout-highlights-08102015/

Ria_

Check out Siteliner. I've never tried it with a site that big, personally. But it's free, so worth a shot to see what you can get out of it.
