If i disallow unfriendly URL via robots.txt, will its friendly counterpart still be indexed?
-
Our not-so-lovely CMS loves to render pages regardless of the URL structure, just as long as the page name itself is correct. For example, it will render the following as the same page:
example.com/really/dumb/duplicative/URL/123.html
To help combat this, we are creating mod rewrites with friendly urls, so all of the above would simply render as example.com/123
I understand robots.txt respects the wildcard (*), so I was considering adding this to our robots.txt:
Disallow: */123.html
If I move forward, will this block all of the potential permutations of the directories preceding 123.html yet not block our friendly example.com/123?
Oh, and yes, we do use the canonical tag religiously - we're just mucking with the robots.txt as an added safety net.
-
Yeah, if you could solve this via .htaccess that would be great, especially if you have link equity flowing into any of those URLs.
I'd go one step further than Irving and highly recommend canonical tags on those URLs. Since, as you said, it's all one page with infinite URL possibilities, the canonical should be easy to implement.
Best of luck!
-
Thanks, however, the meta tag won't work in this case because it's technically one page with an infinite amount of names via the URL (remember, the CMS only depends on the 123.html and ignores the directories preceding it). If I applied the NOINDEX within the meta, then the version I do want to get indexed would not be indexed.
The question was really around "will the internal rewrite of /123.html to just /123 be impacted if we disallow */123.html" - and since the rewrite happens before the bot sees it, I presume the answer is "no, it will not be impacted: 123.html will be blocked yet /123 will still be indexed.
Now, after I posted the question I realized this is the case where I should use a "greedy" 301 redirect via htaccess rather than try to block permutations of the URL via robots.txt. So I decided to not go the robots.txt route and instead do a 301 redirect via regex:
*/123.html to /123 (that's obviously not perfect regex, but you see my point)
-
that disallow command will block all files with the name 123.html in any folder deeper that the root.
This with the canonical (absolute not relative) will probably cover you, but it is really recommended to get a robots noindex meta tag on these duplicate pages as well. Bots coming in from an external link pointing to that page could result in the page getting indexed, also the canonical is a suggestion not a rule.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Why a certain URL ( a category URL ) disappears?
the page hasn't been spammed. - links are natural - onpage grader is perfect - there are useful high ranking articles linking to the page...pretty much everything is okay.....also all of my websites pages are okay and none of them has disappeared only this one ( the most important category of my site. )
Intermediate & Advanced SEO | | mohamadalieskandariii0 -
Robots.txt Disallowed Pages and Still Indexed
Alright, I am pretty sure I know the answer is "Nothing more I can do here." but I just wanted to double check. It relates to the robots.txt file and that pesky "A description for this result is not available because of this site's robots.txt". Typically people want the URL indexed and the normal Meta Description to be displayed but I don't want the link there at all. I purposefully am trying to robots that stuff outta there.
Intermediate & Advanced SEO | | DRSearchEngOpt
My question is, has anybody tried to get a page taken out of the Index and had this happen; URL still there but pesky robots.txt message for meta description? Were you able to get the URL to no longer show up or did you just live with this? Thanks folks, you are always great!0 -
Moving career site to new URL from main site. Will it hurt SEO for main page?
For one of our clients we are building a career site and putting it under a different URL and hosting service (mainly due to security concerns of hosting it under the same host and domain). almost 100% of the incoming traffic to their current career section (which it is in a sub-folder) receives traffic for branded keywords (brand + job/career/employment), that is, there are no job position specific keywords. The client is now worried that after moving the site, the inbound traffic to the main site will be severely affected as well as the SERP results. My questions are, will the non-career related SERPs be affected? I don't see how will they be but I could be wrong If no, how could we reassure her that the SEO to the main site wont be affected? are there any case studies of a similar case (splitting part of the website under a new URL and hosting service?) Thank you for your help. PS: this is my first post so please forgive me if this has been asked before. I could not find a good response.
Intermediate & Advanced SEO | | rflores0 -
Received "Googlebot found an extremely high number of URLs on your site:" but most of the example URLs are noindexed.
An example URL can be found here: http://symptom.healthline.com/symptomsearch?addterm=Neck%20pain&addterm=Face&addterm=Fatigue&addterm=Shortness%20Of%20Breath A couple of questions: Why is Google reporting an issue with these URLs if they are marked as noindex? What is the best way to fix the issue? Thanks in advance.
Intermediate & Advanced SEO | | nicole.healthline0 -
Effect duration of robots.txt file.
in my web site there is demo site in that also, index in Google but no need it now.so i have created robots file and upload to server yesterday.in the demo folder there are some html files,and i wanna remove all these in demo file from Google.but still in web master tools it showing User-agent: *
Intermediate & Advanced SEO | | innofidelity
Disallow: /demo/ How long this will take to remove from Google ? And are there any alternative way doing that ?0 -
Why will google not index my pages?
About 6 weeks ago we moved a subcategory out to becomne a main category using all the same content. We also removed 100's of old products and replaced these with new variation listings to remove duplicate content issues. The problem is google will not index 12 critcal pages and our ranking have slumped for the keywords in the categories. What can i do to entice google to index these pages?
Intermediate & Advanced SEO | | Towelsrus0 -
Disallowed Pages Still Showing Up in Google Index. What do we do?
We recently disallowed a wide variety of pages for www.udemy.com which we do not want google indexing (e.g., /tags or /lectures). Basically we don't want to spread our link juice around to all these pages that are never going to rank. We want to keep it focused on our core pages which are for our courses. We've added them as disallows in robots.txt, but after 2-3 weeks google is still showing them in it's index. When we lookup "site: udemy.com", for example, Google currently shows ~650,000 pages indexed... when really it should only be showing ~5,000 pages indexed. As another example, if you search for "site:udemy.com/tag", google shows 129,000 results. We've definitely added "/tag" into our robots.txt properly, so this should not be happening... Google showed be showing 0 results. Any ideas re: how we get Google to pay attention and re-index our site properly?
Intermediate & Advanced SEO | | udemy0 -
URL Shorteners. Are they SEO Friendly?
Do URL shortener services like bit.ly act as 301 redirects? I was thinking about utilizing one for longer query based URLs and didn't want to risk losing link juice. Thanks for the insight! Regards - Kyle
Intermediate & Advanced SEO | | kchandler0