If i disallow unfriendly URL via robots.txt, will its friendly counterpart still be indexed?
-
Our not-so-lovely CMS loves to render pages regardless of the URL structure, just as long as the page name itself is correct. For example, it will render the following as the same page:
example.com/really/dumb/duplicative/URL/123.html
To help combat this, we are creating mod rewrites with friendly urls, so all of the above would simply render as example.com/123
I understand robots.txt respects the wildcard (*), so I was considering adding this to our robots.txt:
Disallow: */123.html
If I move forward, will this block all of the potential permutations of the directories preceding 123.html yet not block our friendly example.com/123?
Oh, and yes, we do use the canonical tag religiously - we're just mucking with the robots.txt as an added safety net.
-
Yeah, if you could solve this via .htaccess that would be great, especially if you have link equity flowing into any of those URLs.
I'd go one step further than Irving and highly recommend canonical tags on those URLs. Since, as you said, it's all one page with infinite URL possibilities, the canonical should be easy to implement.
Best of luck!
-
Thanks, however, the meta tag won't work in this case because it's technically one page with an infinite amount of names via the URL (remember, the CMS only depends on the 123.html and ignores the directories preceding it). If I applied the NOINDEX within the meta, then the version I do want to get indexed would not be indexed.
The question was really around "will the internal rewrite of /123.html to just /123 be impacted if we disallow */123.html" - and since the rewrite happens before the bot sees it, I presume the answer is "no, it will not be impacted: 123.html will be blocked yet /123 will still be indexed.
Now, after I posted the question I realized this is the case where I should use a "greedy" 301 redirect via htaccess rather than try to block permutations of the URL via robots.txt. So I decided to not go the robots.txt route and instead do a 301 redirect via regex:
*/123.html to /123 (that's obviously not perfect regex, but you see my point)
-
that disallow command will block all files with the name 123.html in any folder deeper that the root.
This with the canonical (absolute not relative) will probably cover you, but it is really recommended to get a robots noindex meta tag on these duplicate pages as well. Bots coming in from an external link pointing to that page could result in the page getting indexed, also the canonical is a suggestion not a rule.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Robots.txt - Do I block Bots from crawling the non-www version if I use www.site.com ?
my site uses is set up at http://www.site.com I have my site redirected from non- www to the www in htacess file. My question is... what should my robots.txt file look like for the non-www site? Do you block robots from crawling the site like this? Or do you leave it blank? User-agent: * Disallow: / Sitemap: http://www.morganlindsayphotography.com/sitemap.xml Sitemap: http://www.morganlindsayphotography.com/video-sitemap.xml
Intermediate & Advanced SEO | | morg454540 -
Using folder blocked by robots.txt before uploaded to indexed folder - is that OK?
I have a folder "testing" within my domain which is a folder added to the robots.txt. My web developers use that folder "testing" when we are creating new content before uploading to an indexed folder. So the content is uploaded to the "testing" folder at first (which is blocked by robots.txt) and later uploaded to an indexed folder, yet permanently keeping the content in the "testing" folder. Actually, my entire website's content is located within the "testing" - so same URL structure for all pages as indexed pages, except it starts with the "testing/" folder. Question: even though the "testing" folder will not be indexed by search engines, is there a chance search engines notice that the content is at first uploaded to the "testing" folder and therefore the indexed folder is not guaranteed to get the content credit, since search engines see the content in the "testing" folder, despite the "testing" folder being blocked by robots.txt? Would it be better that I password protecting this "testing" folder? Thx
Intermediate & Advanced SEO | | khi50 -
Difference in Number of URLS in "Crawl, Sitemaps" & "Index Status" in Webmaster Tools, NORMAL?
Greetings MOZ Community: Webmaster Tools under "Index Status" shows 850 URLs indexed for our website (www.nyc-officespace-leader.com). The number of URLs indexed jumped by around 175 around June 10th, shortly after we launched a new version of our website. No new URLs were added to the site upgrade. Under Webmaster Tools under "Crawl, Site maps", it shows 637 pages submitted and 599 indexed. Prior to June 6th there was not a significant difference in the number of pages shown between the "Index Status" and "Crawl. Site Maps". Now there is a differential of 175. The 850 URLs in "Index Status" is equal to the number of URLs in the MOZ domain crawl report I ran yesterday. Since this differential developed, ranking has declined sharply. Perhaps I am hit by the new version of Panda, but Google indexing junk pages (if that is in fact happening) could have something to do with it. Is this differential between the number of URLs shown in "Index Status" and "Crawl, Sitemaps" normal? I am attaching Images of the two screens from Webmaster Tools as well as the MOZ crawl to illustrate what has occurred. My developer seems stumped by this. He has submitted a removal request for the 175 URLs to Google, but they remain in the index. Any suggestions? Thanks,
Intermediate & Advanced SEO | | Kingalan1
Alan0 -
Incorrect cached page indexing in Google while correct page indexes intermittently
Hi, we are a South African insurance company. We have a page http://www.miway.co.za/midrivestyle which has a 301 redirect to http://www.miway.co.za/car-insurance. Problem is that the former page is ranking in the index rather than the latter. The latter page does index occasionally in the same position, but rarely. This is primarily for search phrases like "car insurance" and "car insurance quotes". The ranking was knocked down the index with Penquin 2.0. It was not ranking at all but we have managed to recover to 12/13. This abnormally has only been occurring since the recovery. The correct page does index for other search terms like "insurance for car". Your help would be appreciated, thanks!
Intermediate & Advanced SEO | | miway0 -
Robots.txt: Syntax URL to disallow
Did someone ever experience some "collateral damages" when it's about "disallowing" some URLs? Some old URLs are still present on our website and while we are "cleaning" them off the site (which takes time), I would like to to avoid their indexation through the robots.txt file. The old URLs syntax is "/brand//13" while the new ones are "/brand/samsung/13." (note that there is 2 slash on the URL after the word "brand") Do I risk to erase from the SERPs the new good URLs if I add to the robots.txt file the line "Disallow: /brand//" ? I don't think so, but thank you to everyone who will be able to help me to clear this out 🙂
Intermediate & Advanced SEO | | Kuantokusta0 -
Robots.txt error message in Google Webmaster from a later date than the page was cached, how is that?
I have error messages in Google Webmaster that state that Googlebot encountered errors while attempting to access the robots.txt. The last date that this was reported was on December 25, 2012 (Merry Christmas), but the last cache date was November 16, 2012 (http://webcache.googleusercontent.com/search?q=cache%3Awww.etundra.com/robots.txt&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a). How could I get this error if the page hasn't been cached since November 16, 2012?
Intermediate & Advanced SEO | | eTundra0 -
Will our PA be retained after URL updates?
Our web hosting company recently applied a seo update to our site to deal with canonicalization issues and also rewrote all urls to lower case. As a result our PA is now 1 on all pages its effected. I took this up with them and they had this to say. "I must confess I’m still a bit lost however can assure you our consolidation tech uses a 301 permanent redirect for transfers. This should ensure any back link equity isn’t lost. For instance this address: http://www.towelsrus.co.uk/towels-bath-sheets/aztex/egyptian-cotton-Bath-sheet_ct474bd182pd2731.htm Redirects to this page: http://www.towelsrus.co.uk/towels-bath-sheets/aztex/egyptian-cotton-bath-sheet_ct474bd182pd2731.htm And the redirect returns 301 header response – as discussed in your attached forum thread extract" Firstly, is canonicalization working as the number of duplicate pages shot up last week and also will we get our PA back? Thanks Craig
Intermediate & Advanced SEO | | Towelsrus0 -
Disallowed Pages Still Showing Up in Google Index. What do we do?
We recently disallowed a wide variety of pages for www.udemy.com which we do not want google indexing (e.g., /tags or /lectures). Basically we don't want to spread our link juice around to all these pages that are never going to rank. We want to keep it focused on our core pages which are for our courses. We've added them as disallows in robots.txt, but after 2-3 weeks google is still showing them in it's index. When we lookup "site: udemy.com", for example, Google currently shows ~650,000 pages indexed... when really it should only be showing ~5,000 pages indexed. As another example, if you search for "site:udemy.com/tag", google shows 129,000 results. We've definitely added "/tag" into our robots.txt properly, so this should not be happening... Google showed be showing 0 results. Any ideas re: how we get Google to pay attention and re-index our site properly?
Intermediate & Advanced SEO | | udemy0