Should all pages on a site be included in either your sitemap or robots.txt?

RossFruin

I don't have a specific scenario here; I'm just curious, because I fairly often come across sites that have, for example, 20,000 pages but only 1,000 in their sitemap. If they consider only 1,000 of their URLs worth including in the sitemap and having indexed, should the others be excluded using robots.txt or a page-level exclusion? Is there any point in having pages that are in neither, leaving it up to Google to decide?

RossFruin

Thanks guys!

CleverPhD

You bet - Cheers!

Ron_McCabe

CleverPhD,

You are correct. I have found that these little housekeeping issues like eliminating duplicate content really do make a big difference.

Ron

CleverPhD

I think Ron's point was that duplicates are not "real" pages. If you count only your "real" pages, but Google indexes both those pages and the duplicate versions of them, the number of pages indexed can be higher than the number of pages on your site. The issue then is that Google's index holds duplicate versions of the same page, so which one will rank for a given key term? You could be competing against yourself. That is why it is so important to deal with crawl issues.
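To make the counting concrete, here is a minimal Python sketch of how several indexed URLs can collapse into one "real" page. The parameter names in `IGNORED_PARAMS` and the example.com URLs are assumptions for illustration, not anything from a real site, and a real audit would need rules that match the site's own URL scheme.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical query parameters that change presentation, not content.
IGNORED_PARAMS = {"sort", "order", "print", "sessionid"}


def canonical_form(url: str) -> str:
    """Collapse a URL to the form its 'real' page would have."""
    parts = urlsplit(url)
    # Keep only the parameters that select different content.
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in IGNORED_PARAMS]
    # Treat /widgets and /widgets/ as the same page.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, urlencode(sorted(kept)), ""))


def group_duplicates(urls):
    """Map each canonical form to the list of URL variants that share it."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_form(url)].append(url)
    return dict(groups)


indexed = [
    "https://example.com/widgets",
    "https://example.com/widgets/",
    "https://example.com/widgets?sort=price",
    "https://example.com/widgets?sort=name&print=1",
    "https://example.com/about",
]
groups = group_duplicates(indexed)
print(len(indexed), "indexed URLs,", len(groups), "real pages")
```

Here five indexed URLs map to two real pages, which is the "more pages indexed than actual pages" situation described above.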

RossFruin

Thank you. Just curious, how would the number of pages indexed be higher than the number of actual pages?

Ron_McCabe

I think you are looking at the number of pages indexed, which is generally higher than the number of pages on your website. There is a point to marking things up: put a noindex on any pages that you do not want indexed, and properly mark up the pages that you specifically do want indexed. It is really important that you eliminate duplicate pages. A common source of these duplicates is improper tags on the blog. Make sure that your tags are set up in a logical hierarchy, like your sitemap. This will assist the search engines when they re-index your pages.

Hope this helps,

Ron

CleverPhD

You want to have as many pages in the index as possible, as long as they are high-quality pages with original content. If you publish quality original articles on a regular basis, you want all of those pages indexed. Yes, from a practical perspective you may only be able to focus on tweaking the SEO on a portion of them, but if you have good SEO processes in place as you produce those pages, they will rank long term for a broad range of terms and bring traffic.

If you have 20,000 pages because you have an online catalog with 345 different ways to sort the same set of results, or because you have keyword search URLs, printer-friendly versions, or shopping cart pages, you do not want those indexed. These pages are typically low-quality, thin-content pages and/or duplicates, and they do you no favors. You would want to use the noindex meta tag or a canonical where appropriate. The reality is that out of the 20,000 pages, probably only a subset are the "originals", and you don't want to waste Google's time crawling the rest.
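As a rough illustration of "noindex or canonical where appropriate", the Python sketch below picks a head tag for a URL. The path prefixes in `NOINDEX_PREFIXES`, the parameter names in `SORT_PARAMS`, and the function name `head_tag` are all hypothetical; a real site would substitute its own URL conventions and wire this into its page templates.

```python
from html import escape
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL conventions; adjust to the site's own structure.
NOINDEX_PREFIXES = ("/search", "/cart", "/checkout", "/print")
SORT_PARAMS = {"sort", "order"}


def head_tag(url: str) -> str:
    """Return the tag to place in <head> for this URL, or '' for none."""
    parts = urlsplit(url)

    # Search results, cart, and printer-friendly pages: keep out of the index.
    if parts.path.startswith(NOINDEX_PREFIXES):
        return '<meta name="robots" content="noindex">'

    # Sorted views of a listing: point to the unsorted original instead.
    params = parse_qs(parts.query)
    if SORT_PARAMS.intersection(params):
        original = f"{parts.scheme}://{parts.netloc}{parts.path}"
        return f'<link rel="canonical" href="{escape(original, quote=True)}">'

    # An original page: no exclusion needed.
    return ""


for url in ("https://example.com/cart",
            "https://example.com/widgets?sort=price",
            "https://example.com/widgets"):
    print(url, "->", head_tag(url) or "(no tag)")
```

One design point worth keeping in mind: a crawler has to fetch a page to see its noindex or canonical tag, so a URL that is also disallowed in robots.txt will never have that tag read.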

A good concept to look up here is crawl budget, or crawl optimization:

http://searchengineland.com/how-i-think-crawl-budget-works-sort-of-59768

http://www.blindfiveyearold.com/crawl-optimization
