If I disallow an unfriendly URL via robots.txt, will its friendly counterpart still be indexed?
-
Our not-so-lovely CMS loves to render pages regardless of the URL structure, just as long as the page name itself is correct. For example, it will render the following as the same page:
example.com/really/dumb/duplicative/URL/123.html
To help combat this, we are creating mod_rewrite rules with friendly URLs, so all of the above would simply render as example.com/123
I understand robots.txt respects the wildcard (*), so I was considering adding this to our robots.txt:
Disallow: */123.html
If I move forward, will this block all of the potential permutations of the directories preceding 123.html yet not block our friendly example.com/123?
Oh, and yes, we do use the canonical tag religiously - we're just mucking with the robots.txt as an added safety net.
-
Yeah, if you could solve this via .htaccess that would be great, especially if you have link equity flowing into any of those URLs.
I'd go one step further than Irving and highly recommend canonical tags on those URLs. Since, as you said, it's all one page with infinite URL possibilities, the canonical should be easy to implement.
Best of luck!
-
Thanks. However, the meta tag won't work in this case because it's technically one page with an infinite number of names via the URL (remember, the CMS only depends on the 123.html and ignores the directories preceding it). If I applied NOINDEX in the meta tag, then the version I do want indexed would not be indexed either.
The question was really "will the internal rewrite of /123.html to just /123 be impacted if we disallow */123.html?" Since the rewrite happens before the bot sees it, I presume the answer is "no, it will not be impacted": 123.html will be blocked, yet /123 will still be indexed.
Now, after I posted the question, I realized this is a case where I should use a "greedy" 301 redirect via .htaccess rather than try to block permutations of the URL via robots.txt. So I decided not to go the robots.txt route and instead do a 301 redirect via regex:
*/123.html to /123 (that's obviously not perfect regex, but you see my point)
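To make the intent of that "greedy" redirect concrete, here is a minimal Python sketch of the matching logic only. It is not mod_rewrite syntax, and the pattern is an assumption: it treats any path ending in a page name plus .html as a legacy URL, whatever directories precede it. The names LEGACY_URL and redirect_target are hypothetical.

```python
import re
from typing import Optional

# Hypothetical pattern: an optional run of directories (ignored by the CMS),
# followed by the page name and a .html extension at the end of the path.
LEGACY_URL = re.compile(r"^/(?:.*/)?([^/]+)\.html$")


def redirect_target(path: str) -> Optional[str]:
    """Return the friendly path to 301 to, or None if no redirect applies."""
    match = LEGACY_URL.match(path)
    if match is None:
        return None
    # Keep only the page name; drop the directories and the extension.
    return "/" + match.group(1)


if __name__ == "__main__":
    print(redirect_target("/really/dumb/duplicative/URL/123.html"))  # /123
    print(redirect_target("/123"))  # None, already the friendly URL
```

Because the friendly URL never matches the pattern, it cannot redirect to itself, which is the property you would want to preserve when translating this into an actual rewrite rule.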
-
That Disallow rule will block all files named 123.html in any folder deeper than the root.
This, together with the canonical tag (absolute, not relative), will probably cover you, but it is strongly recommended to put a robots noindex meta tag on these duplicate pages as well. A bot arriving via an external link to one of those pages could still get it indexed, and the canonical is a suggestion, not a rule.
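For anyone who wants to check what the Disallow pattern discussed above would catch, here is a rough Python sketch. It assumes the wildcard semantics described in RFC 9309 and Google's robots.txt documentation ('*' matches any run of characters, a trailing '$' anchors the end, everything else is a prefix match). The helper names are hypothetical, and real crawlers may differ in details.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches zero or more characters, a trailing
    '$' anchors the end of the path, otherwise the pattern is a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path: str, pattern: str) -> bool:
    """True if the URL path matches the Disallow pattern."""
    return robots_pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    rule = "*/123.html"
    for path in ("/really/dumb/duplicative/URL/123.html", "/123.html", "/123"):
        print(path, is_disallowed(path, rule))
```

Under these assumed semantics the friendly /123 is not matched, so it stays crawlable. Note that '*' can also match nothing, so in this sketch a root-level /123.html matches the rule too.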