Robots.txt usage
-
Hey Guys,
I am about to make an important improvement to our site's robots.txt.
We have a large number of properties on our site, with different views for them: list, gallery and map view. By default the list view shows up, and the user can switch to the gallery view.
We do not want the gallery pages to get indexed, and we want to save our crawl budget for more important pages.
This is one example from our site:
http://www.holiday-rentals.co.uk/France/r31.htm
When you click on "gallery view" the URL will remain the same in your address bar, but when you mouse over the "gallery view" tab it will show you the URL with the parameter "view=g". There are a number of parameters: "view=g", "view=l" and "view=m".
http://www.holiday-rentals.co.uk/France/r31.htm?view=l
http://www.holiday-rentals.co.uk/France/r31.htm?view=g
http://www.holiday-rentals.co.uk/France/r31.htm?view=m
Now my question is:
If I restrict bots by adding "Disallow: ?view=" to our robots.txt, will it affect the list view too?
I would be very thankful if you could look into this for us.
Many thanks
Hassan
I will test this on another site within our network before applying it to the important one, so I can measure the impact, but I will be waiting for your recommendations. Thanks
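To be concrete, this is roughly the rule I have in mind (just a sketch, nothing is live yet; I understand wildcard rules may need a leading /*):

User-agent: *
# Intended to block every URL containing the view parameter --
# but would this also catch the list view (?view=l)?
Disallow: /*?view=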
-
Others are right, by the way, that canonical may be better, but if you insist on the robots.txt restriction you should add two patterns for each parameter:
Disallow: /*?view=m
Disallow: /*?view=m*
so that you block the URLs that have the parameter at the end and also the ones that have it in the middle.
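A sketch of what that would look like in the file (assuming Googlebot-style wildcard support; in practice the trailing * is harmless but redundant, because robots.txt rules are prefix matches):

User-agent: *
# Either pattern matches ?view=m whether it ends the URL or is followed by
# more text, e.g. /France/r31.htm?view=m&page=2
Disallow: /*?view=m
Disallow: /*?view=m*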
-
I had a similar issue with my website: there were many ways of sorting a list of items (date, title, etc.) which ended up causing duplicate content. We solved the issue a couple of days ago by restricting the "sorted" pages using the robots.txt file. HOWEVER, this morning I found this text in the Google Webmaster Tools support section:
"Google no longer recommends blocking crawler access to duplicate content on your website, whether with a robots.txt file or other methods. If search engines can't crawl pages with duplicate content, they can't automatically detect that these URLs point to the same content and will therefore effectively have to treat them as separate, unique pages. A better solution is to allow search engines to crawl these URLs, but mark them as duplicates by using the rel="canonical" link element, the URL parameter handling tool, or 301 redirects. In cases where duplicate content leads to us crawling too much of your website, you can also adjust the crawl rate setting in Webmaster Tools."
Source: http://www.google.com/support/webmasters/bin/answer.py?answer=66359
I haven't seen any negative effect on my site (yet), but I would agree with SuperlativB in the sense that YOU might be better off using "canonical" tags on these links:
http://www.holiday-rentals.co.uk/...?view=l
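For example, on the gallery and map view URLs the canonical link element would point back to the default page, something along these lines (a sketch using the France page from above; adjust to your own URLs):

<!-- in the <head> of /France/r31.htm?view=g and /France/r31.htm?view=m -->
<link rel="canonical" href="http://www.holiday-rentals.co.uk/France/r31.htm" />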
-
If these parameters are not at the very end of the URL, you should add * after the letter of the parameter in the restriction as well.
You got my point, thanks for looking into this. Our search page loads with the list view by default, so the parameter is not in the URL, but view=l still represents the list view.
I want to disallow both parameters, "view=g" and "view=m", in any URL from bots.
If these parameters are sometimes in the middle and sometimes at the end of the URL, what workaround would you suggest for both cases?
Thanks for looking into this...
-
You can do the restriction you want, but if I get it right, m stands for map view, g stands for gallery view and l stands for list view. So if you want the list view to be indexed and the map and gallery views not to be indexed, you should add two lines of restriction:
Disallow: /*?view=m
Disallow: /*?view=g
If these parameters are not at the very end of the URL, you should add * after the letter of the parameter in the restriction as well.
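Put together, a minimal sketch of the file might look like this (assuming the parameter always appears as ?view=...; if it can also follow another parameter as &view=..., add matching rules for that form too, and test it in the Webmaster Tools robots.txt tester before rolling it out):

User-agent: *
# Block gallery and map views; the default URLs and ?view=l stay crawlable
Disallow: /*?view=g
Disallow: /*?view=m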
-
Sounds like this is something canonical could solve for you. If you disallow ?view=* you would disallow every "?view" URL on your site; if you are unsure, you should go for an exact match rather than blocking them all.
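If you do want the stricter, exact-match style, Googlebot's robots.txt syntax supports a $ end-of-URL anchor, so something like this would only block URLs that end with the parameter (a sketch; the canonical approach above avoids all of this):

User-agent: *
Disallow: /*?view=g$
Disallow: /*?view=m$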