Should all pages on a site be included in either your sitemap or robots.txt?
-
I don't have any specific scenario here but just curious as I come across sites fairly often that have, for example, 20,000 pages but only 1,000 in their sitemap. If they only think 1,000 of their URL's are ones that they want included in their sitemap and indexed, should the others be excluded using robots.txt or a page level exclusion? Is there a point to having pages that are included in neither and leaving it up to Google to decide?
-
Thanks guys!
-
You bet - Cheers!
-
Clever PHD,
You are correct. I have found that these little housekeeping issues like eliminating duplicate content really do make a big difference.
Ron
-
I thinks Ron's point was that if you have a bunch of duplicates, the dups are not "real" pages, if you are only counting "real" pages. Therefore, if Google indexes your "real" pages and the dup versions of them, you can have more pages indexed. That is the issue then that you have duplicate versions of the same page in Google's index and so which will rank for a given key term? You could be competing against yourself. That is why it is so important you deal with crawl issues.
-
Thank you. Just curious, how would the number of pages indexed be higher than the number of actual pages?
-
I think you are looking at the pages indexed which is generally a higher number than those on your web site. There is a point to marking things up so that there is a no follow on any pages that you do not want indexed as well as properly marking up the web pages that you do specifically want indexed. It is really important that you eliminate duplicate pages. A common source of these duplicates is improper tags on the blog. Make sure that your tags are set up in a logical hierarchy like your site map. This will assist the search engines when they re index your page.
Hope this helps,
Ron
-
You want to have as many pages in the index as possible, as long as they are high quality pages with original content - if you publish quality original articles on a regular basis, you want to have all those pages indexed. Yes, from a practical perspective you may only be able to focus on tweaking the SEO on a portion of them, but if you have good SEO processes in place as you produce those pages, they will rank long term for a broad range of terms and bring traffic..
If you have 20,000 pages as you have an online catalog and you have 345 different ways to sort the same set of page results, or if you have keyword search URLs, or printer friendly version pages or your shopping cart pages, you do not want those indexed. These pages are typically, low quality/thin content pages and/or are duplicates and those do you no favor. You would want to use the noindex meta tag or canonical where appropriate. The reality is that out of the 20,000 pages, there are probably only a subset that are the "originals" and so you dont want to waste Googles time in crawling those pages.
A good concept here to look up is Crawl Budget or Crawl Optimization
http://searchengineland.com/how-i-think-crawl-budget-works-sort-of-59768
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Robots.txt and redirected backlinks
Hey there, since a client's global website has a very complex structure which lead to big duplicate content problems, we decided to disallow crawler access and instead allow access to only a few relevant subdirectories. While indexing has improved since this I was wondering if we might have cut off link juice. Since several backlinks point to the disallowed root directory and are from there redirected (301) to the allowed directory I was wondering if this could cause any problems? Example: If there is a backlink pointing to example.com (disallowed in robots.txt) and is redirected from there to example.com/uk/en (allowed in robots.txt). Would this cut off the link juice? Thanks a lot for your thoughts on this. Regards, Jochen
Intermediate & Advanced SEO | | Online-Marketing-Guy0 -
After adding a ssl certificate to my site I encountered problems with duplicate pages and page titles
Hey everyone! After adding a ssl certificate to my site it seems that every page on my site has duplicated it's self. I think that is because it has combined the www.domainname.com and domainname.com. I would really hate to add a rel canonical to every page to solve this issue. I am sure there is another way but I am not sure how to do it. Has anyone else ran into this problem and if so how did you solve it? Thanks and any and all ideas are very appreciated.
Intermediate & Advanced SEO | | LovingatYourBest0 -
Robots.txt - blocking JavaScript and CSS, best practice for Magento
Hi Mozzers, I'm looking for some feedback regarding best practices for setting up Robots.txt file in Magento. I'm concerned we are blocking bots from crawling essential information for page rank. My main concern comes with blocking JavaScript and CSS, are you supposed to block JavaScript and CSS or not? You can view our robots.txt file here Thanks, Blake
Intermediate & Advanced SEO | | LeapOfBelief0 -
Robots.txt - Do I block Bots from crawling the non-www version if I use www.site.com ?
my site uses is set up at http://www.site.com I have my site redirected from non- www to the www in htacess file. My question is... what should my robots.txt file look like for the non-www site? Do you block robots from crawling the site like this? Or do you leave it blank? User-agent: * Disallow: / Sitemap: http://www.morganlindsayphotography.com/sitemap.xml Sitemap: http://www.morganlindsayphotography.com/video-sitemap.xml
Intermediate & Advanced SEO | | morg454540 -
Dilemma about "images" folder in robots.txt
Hi, Hope you're doing well. I am sure, you guys must be aware that Google has updated their webmaster technical guidelines saying that users should allow access to their css files and java-scripts file if it's possible. Used to be that Google would render the web pages only text based. Now it claims that it can read the css and java-scripts. According to their own terms, not allowing access to the css files can result in sub-optimal rankings. "Disallowing crawling of Javascript or CSS files in your site’s robots.txt directly harms how well our algorithms render and index your content and can result in suboptimal rankings."http://googlewebmastercentral.blogspot.com/2014/10/updating-our-technical-webmaster.htmlWe have allowed access to our CSS files. and Google bot, is seeing our webapges more like a normal user would do. (tested it in GWT)Anyhow, this is my dilemma. I am sure lot of other users might be facing the same situation. Like any other e commerce companies/websites.. we have lot of images. Used to be that our css files were inside our images folder, so I have allowed access to that. Here's the robots.txt --> http://www.modbargains.com/robots.txtRight now we are blocking images folder, as it is very huge, very heavy, and some of the images are very high res. The reason we are blocking that is because we feel that Google bot might spend almost all of its time trying to crawl that "images" folder only, that it might not have enough time to crawl other important pages. Not to mention, a very heavy server load on Google's and ours. we do have good high quality original pictures. We feel that we are losing potential rankings since we are blocking images. I was thinking to allow ONLY google-image bot, access to it. But I still feel that google might spend lot of time doing that. **I was wondering if Google makes a decision saying, hey let me spend 10 minutes for google image bot, and let me spend 20 minutes for google-mobile bot etc.. or something like that.. , or does it have separate "time spending" allocations for all of it's bot types. I want to unblock the images folder, for now only the google image bot, but at the same time, I fear that it might drastically hamper indexing of our important pages, as I mentioned before, because of having tons & tons of images, and Google spending enough time already just to crawl that folder.**Any advice? recommendations? suggestions? technical guidance? Plan of action? Pretty sure I answered my own question, but I need a confirmation from an Expert, if I am right, saying that allow only Google image access to my images folder. Sincerely,Shaleen Shah
Intermediate & Advanced SEO | | Modbargains1 -
Does site size (page count) effect search ranking?
If a company has a handful of large sites that function as collection of unique portals into client-specific content (password protected), will it have any positive effect on search ranking to migrate all of the sites to one URL structure.
Intermediate & Advanced SEO | | trideagroup0 -
How to associate content on one page to another page
Hi all, I would like associate content on "Page A" with "Page B". The content is not the same, but we want to tell Google it should be associated. Is there an easy way to do this?
Intermediate & Advanced SEO | | Viewpoints1 -
Key page of site not ranking at all
Our site has the largest selection of dog clothes on the Internet. We're been (every so slowly) creeping up in the rankings for the "dog clothes" term, but for some reason only rank for our home page. Even though the home page (and every page on the domain) has links pointing to our specific Dog Clothes page, that page doesn't even rank anywhere when searching Google with "dog clothes site:baxterboo.com". http://www.google.com/webhp?source=hp&q=dog+clothes+site:baxterboo.com&#sclient=psy&hl=en&site=webhp&source=hp&q=dog+clothes+site:baxterboo.com&btnG=Google+Search&aq=f&aqi=&aql=&oq=dog+clothes+site:baxterboo.com&pbx=1&bav=on.2,or.r_gc.r_pw.&fp=f4efcaa1b8c328f Pages 2+ of product results from that page rank, but not the base page. It's not excluded in robots.txt, All on site links to that page use the same URL. That page is loaded with more text that includes the keywords. I don't believe there's duplicated content. What am I missing? Has the page somehow been penalized?
Intermediate & Advanced SEO | | BBPets0