Should all pages on a site be included in either your sitemap or robots.txt?
-
I don't have any specific scenario here; I'm just curious, as I fairly often come across sites that have, for example, 20,000 pages but only 1,000 in their sitemap. If they consider only 1,000 of their URLs worth including in the sitemap and indexing, should the others be excluded using robots.txt or a page-level exclusion? Or is there a point to having pages that are included in neither and leaving it up to Google to decide?
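To be concrete, by "excluded using robots.txt" I mean a crawl block like the sketch below (the paths are invented for illustration), and by a "page-level exclusion" I mean something like a meta robots noindex tag on the individual page.

```
# robots.txt at the site root - tells crawlers not to fetch matching URLs at all
User-agent: *
Disallow: /internal-search/
Disallow: /print/
```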
-
Thanks guys!
-
You bet - Cheers!
-
Clever PHD,
You are correct. I have found that these little housekeeping issues like eliminating duplicate content really do make a big difference.
Ron
-
I think Ron's point was that if you have a bunch of duplicates, the dupes are not "real" pages if you are only counting "real" pages. So if Google indexes your "real" pages plus the duplicate versions of them, you can end up with more pages indexed than you actually have. The issue then is that you have duplicate versions of the same page in Google's index, so which one will rank for a given key term? You could be competing against yourself. That is why it is so important to deal with crawl issues.
-
Thank you. Just curious, how would the number of pages indexed be higher than the number of actual pages?
-
I think you are looking at the number of pages indexed, which is generally higher than the number of pages actually on your website. There is a point to marking things up so that there is a noindex directive on any pages that you do not want indexed, as well as properly marking up the pages that you do specifically want indexed. It is really important that you eliminate duplicate pages. A common source of these duplicates is improper tags on a blog. Make sure that your tags are set up in a logical hierarchy, like your sitemap. This will assist the search engines when they re-index your pages.
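For example, a page-level directive along these lines (generic HTML, not tied to any particular platform) keeps a page out of the index while still letting search engines follow the links on it:

```
<!-- placed in the <head> of a page you do not want indexed -->
<meta name="robots" content="noindex, follow">
```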
Hope this helps,
Ron
-
You want to have as many pages in the index as possible, as long as they are high-quality pages with original content. If you publish quality original articles on a regular basis, you want all of those pages indexed. Yes, from a practical perspective you may only be able to focus on tweaking the SEO of a portion of them, but if you have good SEO processes in place as you produce those pages, they will rank long-term for a broad range of terms and bring in traffic.
If you have 20,000 pages because you have an online catalog with 345 different ways to sort the same set of results, or you have keyword-search URLs, printer-friendly pages, or shopping-cart pages, you do not want those indexed. Those pages are typically low-quality/thin-content pages and/or duplicates, and they do you no favors. You would want to use the noindex meta tag or a canonical tag where appropriate. The reality is that out of the 20,000 pages there is probably only a subset that are the "originals," so you don't want to waste Google's time crawling the rest.
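As a quick illustration (the URLs here are invented), a sort-order variant of a catalog page could point a canonical tag at the original version:

```
<!-- in the <head> of https://www.example.com/fridges?sort=price-asc -->
<link rel="canonical" href="https://www.example.com/fridges">
```

Google generally consolidates those variants onto the canonical URL, though it treats the tag as a hint rather than a directive; for pages that should never appear at all, such as shopping-cart pages, the noindex meta tag is the stronger signal.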
A good concept to look up here is "crawl budget" or "crawl optimization":
http://searchengineland.com/how-i-think-crawl-budget-works-sort-of-59768
Related Questions
-
Validated pages in GSC show 5x more pages than a site:domain.com search?
Hi Mozzers, when checking the coverage report in GSC I am seeing over 649,000 valid pages (https://cl.ly/ae46ec25f494), but when performing site:domain.com I am only seeing 130,000 pages. Which one is the source of truth, especially since I have checked some of these "valid" pages and noticed they're not even indexed?
Intermediate & Advanced SEO | Ty19860
-
Keyword stuffing on category pages - eCommerce site
Hi there, fellow Mozzers. I work for a wine company, and I have a theory that some of our category pages are not ranking as well as they could due to keyword stuffing. The best example is our Champagne category page, which we are trying to rank for the keyword "Champagne"; it currently ranks around 6th. However, when I load the page into Moz, it tells me that I might be stuffing, which I am not, BUT my products might be giving both Moz and Google that impression. Our product names for any given Champagne are "Champagne - {name}", and the producer is "Champagne {producer name}". On the category page we list 44 Champagnes, each displayed with the name of the wine, the name of the producer, AND the district, which means we have 132 mentions of the word "Champagne" plus the content text that I have written. I am wondering how good Google is at identifying that this is in fact not stuffing, but rather functionality that creates this high density of the keyword. Is there anything I can do? We could change it so that "Champagne" is not listed on every product, but I believe that would make the usability suffer a bit, not a lot, but it's a question of balance. I would like to hear if anyone has encountered a similar problem, and whether it is in fact a problem.
Intermediate & Advanced SEO | Nikolaj-Landrock2
-
Should I include URLs that are 301'd or only include 200 status URLs in my sitemap.xml?
I'm not sure if I should be including old URLs (content) that are being 301-redirected to new URLs (content) in my sitemap.xml. Does anyone know if it is best to include or leave out 301'd URLs in an XML sitemap?
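For context, here is the kind of entry I mean; the URL is just a placeholder:

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- should this only ever be the final 200-status destination,
         or is it OK to also list the old URL that now 301s to it? -->
    <loc>https://www.example.com/new-page/</loc>
  </url>
</urlset>
```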
Intermediate & Advanced SEO | Jonathan.Smith0
-
What to do about similar product pages on major retail site
Hi all, I have a dilemma and I'm hoping the community can guide me in the right direction. We're working with a major retailer on launching a local deals section of their website (what I'll call the "local site"). The company has 55 million products for one brand, and 37 million for another. The main site (I'll call it the ".com version") is fairly well SEO'd with flat architecture, clean URLs, microdata, canonical tag, good product descriptions, etc. If you were looking for a refrigerator, you would use the faceted navigation and go from department > category > sub-category > product detail page. The local site's purpose is to "localize" all of the store inventory and have weekly offers and pricing specials. We will use a similar architecture as .com, except it will be under a /local/city-state/... sub-folder. Ideally, if you're looking for a refrigerator in San Antonio, Texas, then the local page should prove to be more relevant than the .com generic refrigerator pages. (the local pages have the addresses of all local stores in the footer and use the location microdata as well - the difference will be the prices.) MY QUESTION IS THIS: If we pull the exact same product pages/descriptions from the .com database for use in the local site, are we creating a duplicate content problem that will hurt the rest of the site? I don't think I can canonicalize to the .com generic product page - I actually want those local pages to show up at the top. Obviously, we don't want to copy product descriptions across root domains, but how is it handled across the SAME root domain? Ideally, it would be great if we had a listing from both the .com and the /local pages in the SERPs. What do you all think? Ryan
Intermediate & Advanced SEO | RyanKelly0
-
Should I disallow my country sub-folders via robots.txt?
Hello, my website is in English by default, with Spanish in a sub-folder. Because of my Joomla platform, Google is listing hundreds of soft-404 links for French, Chinese, German, etc. sub-folders. Again, I never created these country sub-folder URLs, but Google is crawling them. Is it best to just "Disallow" these sub-folders as in the example below, then "mark as fixed" in the crawl errors section of Google Webmaster Tools?
User-agent: *
Disallow: /de/
Disallow: /fr/
Disallow: /cn/
Thank you, Shawn
Intermediate & Advanced SEO | Shawn1240
-
How to Build a High-Quality eCommerce Site Despite Low-Quality Web Pages?
Today I was reading the Official Google Webmaster Central Blog post "More guidance on building high-quality sites," and I found one interesting statement there: "Low-quality content on some parts of a website can impact the whole site's rankings." Why do I want to discuss this topic? Because I have made a big change on my website via narrow-by-search (faceted) navigation, and I want to give a specific example. This is my category page: http://www.vistastores.com/patio-umbrellas The left narrow-by-search section creates a separate page for each specific product attribute.
California Umbrella: http://www.vistastores.com/patio-umbrellas/shopby/manufacturer-california-umbrella From that page, the following page is also accessible: http://www.vistastores.com/patio-umbrellas/shopby/canopy-shape-search-octagonal/manufacturer-california-umbrella
Sunbrella Patio Umbrellas: http://www.vistastores.com/patio-umbrellas/shopby/canopy-fabric-search-sunbrella A similar story applies here; the following page is accessible from it: http://www.vistastores.com/patio-umbrellas/shopby/canopy-fabric-search-sunbrella/finish-search-wood
My website has 100+ categories and 11,000 products. I have checked the indexed pages in Google for my website: https://www.google.com/search?q=info%3Awww.vistastores.com&pws=0&gl=US#hl=en&safe=off&pws=0&gl=US&q=site:www.vistastores.com&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=910893d99351c8f7&biw=1366&bih=547 It shows me 35,000+ crawled pages, which were created by the left-navigation section. So, will these be considered low-quality pages? I want to improve my website's performance without deleting these pages.
Intermediate & Advanced SEO | CommercePundit0
-
Robots.txt 404 problem
I've just set up a WordPress site with a hosting company that only allows you to install WordPress in http://www.myurl.com/folder as opposed to the root folder. I now have the problem that the robots.txt file only works at http://www.myurl.com/folder/robots.txt. Of course, Google is looking for it at http://www.myurl.com/robots.txt and getting a 404 error. How can I get around this? Is there a way to tell Google in Webmaster Tools to use a different path to locate it? I'm stumped.
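Would a rewrite rule at the web root, roughly like the sketch below, be a sane workaround? (This assumes an Apache host with mod_rewrite enabled and that /folder is where WordPress actually lives.)

```
# .htaccess in the web root
RewriteEngine On
# serve the subfolder copy whenever /robots.txt is requested
RewriteRule ^robots\.txt$ /folder/robots.txt [L]
```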
Intermediate & Advanced SEO | SamCUK0
-
One-page WordPress site - what are the steps for SEO?
Hello, I am launching 5 sites on exact-match keyword domains. I am developing the sites in WordPress as one-page sales-funnel sites. What do I need to do to optimize them? I'd really appreciate any bullet points or directions. Thanks
Intermediate & Advanced SEO | brianmaher0