If I disallow an unfriendly URL via robots.txt, will its friendly counterpart still be indexed?
-
Our not-so-lovely CMS loves to render pages regardless of the URL structure, just as long as the page name itself is correct. For example, it will render any URL like the following as the same page:
example.com/really/dumb/duplicative/URL/123.html
To help combat this, we are creating mod_rewrite rules with friendly URLs, so all such permutations would simply render as example.com/123
I understand robots.txt respects the wildcard (*), so I was considering adding this to our robots.txt:
Disallow: */123.html
If I move forward, will this block all of the potential permutations of the directories preceding 123.html yet not block our friendly example.com/123?
Oh, and yes, we do use the canonical tag religiously - we're just mucking with the robots.txt as an added safety net.
-
Yeah, if you could solve this via .htaccess that would be great, especially if you have link equity flowing into any of those URLs.
I'd go one step further than Irving and highly recommend canonical tags on those URLs. Since, as you said, it's all one page with infinite URL possibilities, the canonical should be easy to implement.
Best of luck!
-
Thanks - however, the meta tag won't work in this case because it's technically one page with an infinite number of names via the URL (remember, the CMS only depends on the 123.html and ignores the directories preceding it). If I applied NOINDEX within the meta tag, then the version I do want to get indexed would not be indexed.
The question was really around "will the internal rewrite of /123.html to just /123 be impacted if we disallow */123.html" - and since the rewrite happens before the bot sees it, I presume the answer is "no, it will not be impacted: 123.html will be blocked yet /123 will still be indexed."
Now, after I posted the question I realized this is a case where I should use a "greedy" 301 redirect via .htaccess rather than try to block permutations of the URL via robots.txt. So I decided not to go the robots.txt route and instead do a 301 redirect via regex:
*/123.html to /123 (that's obviously not perfect regex, but you see my point)
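For reference, a rough .htaccess sketch of that greedy redirect (Apache mod_rewrite assumed; the numeric capture and the loop guard are illustrative, not our exact rules):
RewriteEngine On
# Match against the originally requested path (THE_REQUEST), so the rule
# doesn't loop with the internal rewrite that serves the friendly /123
RewriteCond %{THE_REQUEST} \s/(?:[^?\s]*/)?(\d+)\.html[?\s]
# Any *.html request gets 301-redirected to its bare numeric name, e.g.
# /really/dumb/duplicative/URL/123.html -> /123
RewriteRule \.html$ /%1 [R=301,L]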
-
That disallow rule will block all files named 123.html in any folder deeper than the root (e.g. /a/123.html or /a/b/123.html), while the extensionless /123 won't match it.
This, together with the canonical (absolute, not relative), will probably cover you, but it is really recommended to get a robots noindex meta tag on these duplicate pages as well: bots coming in from an external link pointing to such a page could still get it indexed, and the canonical is a suggestion, not a rule.
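For illustration, the two tags being recommended here would look something like this on the duplicate renderings (URLs are placeholders):
<!-- absolute, not relative -->
<link rel="canonical" href="https://example.com/123">
<!-- belt-and-braces noindex on the duplicates -->
<meta name="robots" content="noindex">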
Related Questions
-
Robots.txt blocked internal resources Wordpress
Intermediate & Advanced SEO | Mat_C
Hi all, We've recently migrated a Wordpress website from staging to live, but the robots.txt was deleted. I've created the following new one:
User-agent: *
Allow: /
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Allow: /wp-admin/admin-ajax.php
However, in the site audit on SemRush, I now get the mention that a lot of pages have issues with blocked internal resources in the robots.txt file. These blocked internal resources are all cached and minified CSS elements: links, images and scripts. Does this mean that Google won't crawl some parts of these pages with blocked resources correctly, and thus won't be able to follow these links and index the images? In other words, is this any cause for concern regarding SEO? Of course I can change the robots.txt again, but will URLs like https://example.com/wp-content/cache/minify/df983.js end up in the index? Thanks for your thoughts!
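A commonly suggested pattern for this situation (an illustrative sketch, not a confirmed fix for this site) is to re-allow the blocked asset types, relying on Google's rule that the longest matching path wins:
User-agent: *
Disallow: /wp-content/cache/
# These Allow rules are longer matches than the Disallow above, so for
# Googlebot they win: the folder stays blocked but minified assets stay crawlable
Allow: /wp-content/cache/*.css
Allow: /wp-content/cache/*.js
-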
Google robots.txt test - not picking up syntax errors?
Intermediate & Advanced SEO | McTaggart
I just ran a robots.txt file through "Google robots.txt Tester" as there was some unusual syntax in the file that didn't make any sense to me, e.g.:
/url/?*
/url/?
/url/*
and so on. I would use ? and not ? for example and what is ? for! Yet "Google robots.txt Tester" did not highlight the issues... I then fed the sitemap through http://www.searchenginepromotionhelp.com/m/robots-text-tester/robots-checker.php and that tool actually picked up my concerns. Can anybody explain why Google didn't - or perhaps it isn't supposed to pick up such errors? Thanks, Luke
-
Robots.txt and redirected backlinks
Intermediate & Advanced SEO | Online-Marketing-Guy
Hey there, since a client's global website has a very complex structure which led to big duplicate content problems, we decided to disallow crawler access and instead allow access to only a few relevant subdirectories. While indexing has improved since then, I was wondering if we might have cut off link juice. Several backlinks point to the disallowed root directory and are redirected (301) from there to an allowed directory - could this cause any problems? Example: a backlink points to example.com (disallowed in robots.txt) and is redirected from there to example.com/uk/en (allowed in robots.txt). Would this cut off the link juice? Thanks a lot for your thoughts on this. Regards, Jochen
-
Woo Commerce Woo Compare URLs Indexing?
Intermediate & Advanced SEO | Kelly3330
Hi, I am using Wordpress/Woo Commerce for my site Thetotspot.co.uk, and I am getting a lot of temporary redirects registering in Moz for woo compare / add-to-cart links like this one:
http://www.thetotspot.co.uk/?action=yith-woocompare-add-product&id=1412&_wpnonce=a5560b1b07
Has anyone come across this - how did you solve it? I am currently using Yoast SEO and have noindexed archives and archive pages etc.
-
Robots.txt help
Intermediate & Advanced SEO | IceIcebaby
Hi Moz Community, Google is indexing some developer pages from a previous website where I currently work: ddcblog.dev.examplewebsite.com/categories/sub-categories I was wondering how to include these in a robots.txt file so they no longer appear on Google. Can I do it under our homepage GWT account, or do I have to have a separate account set up for these URL types? As always, your expertise is greatly appreciated, -Reed
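Worth noting: robots.txt is resolved per host, so the dev subdomain needs its own file (and its own GWT/Search Console property). A minimal sketch, served at ddcblog.dev.examplewebsite.com/robots.txt:
# Must live on the subdomain itself; the main domain's robots.txt
# has no effect on ddcblog.dev.examplewebsite.com
User-agent: *
Disallow: /
-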
Should we use URL parameters or plain URLs?
Intermediate & Advanced SEO | Peekabo
Hi, the development team and I are having a heated discussion about one of the more important things in life, i.e. URL structures on our site. Let's say we are creating an AirBNB clone, and we want to be found when people search for "apartments new york". As we have both houses and apartments in all cities in the U.S., it would make sense for our URL to at least include these, so clone.com/Apartments/New-York, but users are also able to filter on price and size. This isn't really relevant for Google, and we all agree that clone.com/Apartments/New-York should be canonical for all apartment/New York searches. But what should the URL look like for people with a max price of $300 and 100 sqft? clone.com/Apartments/New-York?price=30&size=100 or (we are using Node.js, so no problem) clone.com/Apartments/New-York/Price/30/Size/100? The developers hate URL parameters with a vengeance, think the last version is the preferable and most user-readable one, and say that as long as we use a canonical on everything to clone.com/Apartments/New-York it won't matter for good old Google. I think the URL parameters are the way to go, for two reasons. One is that Google might figure out by itself that the price parameter doesn't matter (https://support.google.com/webmasters/answer/1235687?hl=en), and it is also possible in Webmaster Tools to actually tell Google that it shouldn't worry about a parameter. We have agreed to disagree on this point, and let the wisdom of Moz decide what we ought to do. What do you all think?
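Whichever URL shape wins, the "canonical on everything" plan the question describes would amount to a tag like this on every filtered variant (a sketch reusing the question's own URLs):
<!-- emitted on clone.com/Apartments/New-York?price=30&size=100 and on
     clone.com/Apartments/New-York/Price/30/Size/100 alike -->
<link rel="canonical" href="https://clone.com/Apartments/New-York">
-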
How long will Google take to read my robots.txt after updating?
Intermediate & Advanced SEO | Tintanus
I updated www.egrecia.es/robots.txt two weeks ago and I still haven't solved the Duplicate Title and Content issues on the website. The Google SERP doesn't show those URLs any more, but neither SEOMOZ Crawl Errors nor Google Webmaster Tools recognize the change. How long will it take?
-
Block all but one URL in a directory using robots.txt?
Intermediate & Advanced SEO | nicole.healthline
Is it possible to block all but one URL with robots.txt? For example, with domain.com/subfolder/example.html: if we block the /subfolder/ directory, we want all URLs except the exact-match URL domain.com/subfolder to be blocked.
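A sketch under standard robots.txt semantics: because rules are prefix matches, disallowing the directory with its trailing slash leaves the bare URL crawlable.
User-agent: *
# Blocks domain.com/subfolder/example.html and everything else under
# /subfolder/, but not the bare domain.com/subfolder - without the
# trailing slash it never matches this prefix
Disallow: /subfolder/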