Robots Disallow Backslash - Is it right command
-
Bit skeptical, as due to dynamic url and some other linkage issue, google has crawled url with backslash and asterisk character
ex - www.xyz.com/\/index.php?option=com_product
www.xyz.com/\"/index.php?option=com_product
Now %5c is the encoded version of \ - backslash & %22 is encoded version of asterisk
Need to know for command :-
User-agent: * Disallow: \As am disallowing all backslash url through this - will it only remove the backslash url which are duplicates or the entire site,
-
Thanks, you seem lucky to me.. Almost after 2 month i have got the code for making all these encoded url's redirect correctly. Finally, now if one types
http://www.mycarhelpline.com/\"/index.php?option=com_latestnews&view=list&Itemid=10
then he's redirected through 301 to the correct url
http://www.mycarhelpline.com/index.php?option=com_latestnews&view=list&Itemid=10
-
Hello Gagan,
I think the best way to handle this would be using the rel canonical tag or rewriting the URLs to get rid of the parameters and replace them with something more user-friendly.
The rel canonical tag would be the easiest way out of those two. I notice the version without the encoding (e.g. http://www.mycarhelpline.com/index.php?option=com_latestnews&view=list&Itemid=10 ) have a rel canonical tag that correctly references itself as the canonical version. However, the encoded URLs (e.g. http://www.mycarhelpline.com/\"/index.php?option=com_latestnews&view=list&Itemid=10) which is actually http://www.mycarhelpline.com/\"/index.php?option=com_latestnews&view=list&Itemid=10 does NOT have a rel canonical tag.
If the version with the backslash had a rel canonical tag stating that the following URL is canonical it would solve your issue, I think.
Canonical URL:
http://www.mycarhelpline.com/index.php?option=com_latestnews&view=list&Itemid=10 -
Sure, If i show you some url they are crawled as :-
Sample Incorrect URLs crawled and reported as duplicate one in Google Webmaster & Moz too
|
http://www.mycarhelpline.com/\"/index.php?option=com_latestnews&view=list&Itemid=10
| http://www.mycarhelpline.com/\"/index.php?option=com_newcar&view=category&Itemid=2 |
|
Correct URL
http://www.mycarhelpline.com/index.php?option=com_latestnews&view=list&Itemid=10
http://www.mycarhelpline.com/index.php?option=com_newcar&view=search&Itemid=2
What we found online
Since URLs often contain characters outside the ASCII set, the URL has to be converted into a valid ASCII format. URL encoding replaces unsafe ASCII characters with a "%" followed by two hexadecimal digits. URLs cannot contain spaces.
%22 reflects - " and %5c as \ (forward slash)
We intend to remove these duplicate one created having %22 and %5c within them..
Many thanks
-
I am not entirely sure I understood your question as intended, but I will do my best to answer.
I would not put this in my robots.txt flie because it could possibly be misunderstood as a forward slash, in which case your entire domain would be blocked:
Disallow: \
We can possibly provide you with some alternative suggestions on how to keep Google from crawling those pages if you could share some real examples.
It may be best to rewrite/redirect those URls instead since they don't seem to be the canonical version you intend to be presented to the user.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Robots.txt question
I notice something weird in Google robots. txt tester I have this line Disallow: display= in my robots.text but whatever URL I give to test it says blocked and shows this line in robots.text for example this line is to block pages like http://www.abc.com/lamps/floorlamps?display=table but if I test http://www.abc.com/lamps/floorlamps or any page it shows as blocked due to Disallow: display= am I doing something wrong or Google is just acting strange? I don't think pages with no display= are blocked in real.
Intermediate & Advanced SEO | | rbai0 -
Robots.txt Blocking - Best Practices
Hi All, We have a web provider who's not willing to remove the wildcard line of code blocking all agents from crawling our client's site (user-agent: *, Disallow: /). They have other lines allowing certain bots to crawl the site but we're wondering if they're missing out on organic traffic by having this main blocking line. It's also a pain because we're unable to set up Moz Pro, potentially because of this first line. We've researched and haven't found a ton of best practices regarding blocking all bots, then allowing certain ones. What do you think is a best practice for these files? Thanks! User-agent: * Disallow: / User-agent: Googlebot Disallow: Crawl-delay: 5 User-agent: Yahoo-slurp Disallow: User-agent: bingbot Disallow: User-agent: rogerbot Disallow: User-agent: * Crawl-delay: 5 Disallow: /new_vehicle_detail.asp Disallow: /new_vehicle_compare.asp Disallow: /news_article.asp Disallow: /new_model_detail_print.asp Disallow: /used_bikes/ Disallow: /default.asp?page=xCompareModels Disallow: /fiche_section_detail.asp
Intermediate & Advanced SEO | | ReunionMarketing0 -
How do I get my site to rank in the right country?
I work with an organization in the United States that purchased a 3 letter .com domain from another company based in the Netherlands. After transferring the domain and setting up the new site we've completed the following: Registered the site in Webmaster Tools and selected the United States as the international target target Used Schema Markup on the US business address Built links from US based websites Setup a Google My Business site and verified the address All site content is in english Requested removal of links from sites in the Netherlands that were linking to the old website Disavowed all links from the Netherlands that refused to remove links to the old website Despite these actions, our website rarely shows up in Google.com search results. However, we're in the top 10 for many of the same queries when performing searches at Google.nl - Interestingly enough, if we set our location for the US and do a search in Google.nl we do not show up in the search results.
Intermediate & Advanced SEO | | brianspatterson
It has been about 18 months since we've completed these actions and Google still assumes our website is best served and targeted towards people in the Netherlands. Is there anything else we can do to fix this?Thanks!0 -
Use Canonical or Robots.txt for Map View URL without Backlink Potential
I have a Page X with lots of unique content. This page has a "Map view" option, which displays some of the info from Page X, but a lot is ommitted. Questions: Should I add canonical even though Map View URL does not display a lot of info from Page X or adding to robots.txt or noindex, follow? I don't see any back links coming to Map View URL Should Map View page have unique H1, title tag, meta des?
Intermediate & Advanced SEO | | khi50 -
Block in robots.txt instead of using canonical?
When I use a canonical tag for pages that are variations of the same page, it basically means that I don't want Google to index this page. But at the same time, spiders will go ahead and crawl the page. Isn't this a waste of my crawl budget? Wouldn't it be better to just disallow the page in robots.txt and let Google focus on crawling the pages that I do want indexed? In other words, why should I ever use rel=canonical as opposed to simply disallowing in robots.txt?
Intermediate & Advanced SEO | | YairSpolter0 -
Robots.txt Syntax
I have been having a hard time finding any decent information regarding the robots.txt syntax that has been written in the last few years and I just want to verify some things as a review for myself. I have many occasions where I need to block particular directories in the URL, parameters and parameter values. I just wanted to make sure that I am doing this in the most efficient ways possible and thought you guys could help. So let's say I want to block a particular directory called "this" and this would be an example URL: www.domain.com/folder1/folder2/this/file.html
Intermediate & Advanced SEO | | DRSearchEngOpt
or
www.domain.com/folder1/this/folder2/file.html In order for me to block any URL that contains this folder anywhere in the URL I would use: User-agent: *
Disallow: /this/ Now lets say I have a parameter "that" I want to block and sometimes it is the first parameter and sometimes it isn't when it shows up in the URL. Would it look like this? User-agent: *
Disallow: ?that=
Disallow: &that= What about if there is only one value I want to block for "that" and the value is "NotThisGuy": User-agent: *
Disallow: ?that=NotThisGuy
Disallow: &that=NotThisGuy My big questions here are what are the most efficient ways to block a particular parameter and block a particular parameter value. Is there a more efficient way to deal with ? and & for when the parameter and value are either first or later? Secondly is there a list somewhere that will tell me all of the syntax and meaning that can be used for a robots.txt file? Thanks!0 -
I have two sitemaps which partly duplicate - one is blocked by robots.txt but can't figure out why!
Hi, I've just found two sitemaps - one of them is .php and represents part of the site structure on the website. The second is a .txt file which lists every page on the website. The .txt file is blocked via robots exclusion protocol (which doesn't appear to be very logical as it's the only full sitemap). Any ideas why a developer might have done that?
Intermediate & Advanced SEO | | McTaggart0 -
Using 2 wildcards in the robots.txt file
I have a URL string which I don't want to be indexed. it includes the characters _Q1 ni the middle of the string. So in the robots.txt can I use 2 wildcards in the string to take out all of the URLs with that in it? So something like /_Q1. Will that pickup and block every URL with those characters in the string? Also, this is not directly of the root, but in a secondary directory, so .com/.../_Q1. So do I have to format the robots.txt as //_Q1* as it will be in the second folder or just using /_Q1 will pickup everything no matter what folder it is on? Thanks.
Intermediate & Advanced SEO | | seo1234560