XML and Disallow
-
I was curious about any potential side effects of a client using a catch-all approach: generating their XML sitemap with a spider, then disallowing some of the directories listed in that sitemap in robots.txt.
For example:
The XML sitemap contains 500 URLs.
50 of those URLs contain /dirw/.
I don't want anything under /dirw/ indexed because those pages are fairly useless: no content, one image. They use the robots.txt file to "Disallow: /dirw/".
Let's say they do this for maybe 3 separate directories, making up roughly 30% of the URLs in the XML sitemap.
I am advising that they redo the sitemaps because that shouldn't be too difficult, but I am curious about the actual ramifications, if there are any, beyond "it isn't a clear and concise indication to the search engines and therefore should be made such."
Thanks!
-
Hi Thomas,
I don't think there is technically a problem with adding URLs to a sitemap and then blocking some of them with robots.txt.
I wouldn't do it, however, and I would give the same advice you did: regenerate the sitemap without this content. The main reason is that it goes against the main goals of a sitemap: helping bots crawl your site and providing valuable metadata (https://support.google.com/webmasters/answer/156184?hl=en). Another advantage of a clean sitemap is that Google reports the percentage of URLs in each sitemap that are indexed. From that perspective, URLs that are blocked by robots.txt have no use in a sitemap. Webmaster Tools will normally report errors to let you know that there are issues with the sitemap.
Taking it one step further, Google could consider you a bit of a lousy webmaster if you keep these URLs in the sitemap. I'm not sure this is the case, but for something that can easily be corrected, I wouldn't take the risk (even if it's a very minor one).
There are crawlers (like Screaming Frog) that can generate sitemaps while respecting the directives in robots.txt. In my opinion this would be a better option.
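If regenerating with a crawler isn't an option, the same cleanup can be done on the existing file. Here is a minimal sketch in Python, using only the standard library, that drops every sitemap entry robots.txt disallows. The function name `filter_sitemap`, the example.com URLs, and the sample data are all made up for illustration; this is not how any particular crawler does it.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical sample data standing in for the client's real files.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /dirw/
"""

SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/dirw/photo-1</loc></url>
  <url><loc>https://example.com/dirw/photo-2</loc></url>
</urlset>
"""


def filter_sitemap(sitemap_xml: str, robots_txt: str, user_agent: str = "Googlebot") -> str:
    """Return the sitemap XML without the URLs that robots.txt disallows."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    # Keep the default sitemap namespace unprefixed when serializing.
    ET.register_namespace("", SITEMAP_NS)
    root = ET.fromstring(sitemap_xml)

    # Iterate over a copy of the list because entries are removed in the loop.
    for url in list(root.findall(f"{{{SITEMAP_NS}}}url")):
        loc = url.findtext(f"{{{SITEMAP_NS}}}loc", default="").strip()
        if not rules.can_fetch(user_agent, loc):
            root.remove(url)

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(filter_sitemap(SAMPLE_SITEMAP, SAMPLE_ROBOTS))
```

One caveat: Python's `urllib.robotparser` does plain prefix matching, so rules that rely on Google-style wildcards (`*`, `$`) would need a different parser.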
rgds,
Dirk
-
For syntax I think you'll want:
User-agent: *
Disallow: /dirw/
If the content of /dirw/ isn't worthwhile to the engines, then it should be fine to disallow. It's important to note, though, that Google asks that CSS and JavaScript not be disallowed. Run the site through their PageSpeed Insights tool to see how this setup currently impacts that interaction. Cheers!
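To sanity-check a rule like this before deploying it, something along these lines can help. It is a rough sketch using Python's standard `urllib.robotparser`; the helper name `blocked_urls` and the example.com URLs (including the CSS and JavaScript paths) are assumptions for illustration, so swap in real URLs from the site.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /dirw/
"""

# Hypothetical URLs: one page that should be blocked, plus assets that should not be.
CHECK_URLS = [
    "https://example.com/dirw/photo-1",
    "https://example.com/assets/site.css",
    "https://example.com/assets/app.js",
    "https://example.com/about",
]


def blocked_urls(robots_txt: str, urls: list[str], user_agent: str = "Googlebot") -> list[str]:
    """Return the subset of urls that the given robots.txt disallows."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [u for u in urls if not rules.can_fetch(user_agent, u)]


if __name__ == "__main__":
    for url in blocked_urls(ROBOTS_TXT, CHECK_URLS):
        print("blocked:", url)
```

If a CSS or JavaScript URL ever shows up in the blocked list, that is the case Google asks you to avoid.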