Robots.txt wildcards - the devs had a disagreement - which is correct?
-
Hi – the lead website developer assumed that this wildcard: Disallow: /shirts/?* would block URLs containing a ? within this directory and within all of its subdirectories.
The second developer suggested that this wildcard would only block URLs where the ? comes immediately after /shirts/ - for example: /shirts?minprice=10&maxprice=20 - but argued that the directive would not block URLs featuring a ? in subdirectories - e.g. /shirts/blue?mprice=100&maxp=20
So which of the developers is correct?
Beyond that, I assumed that the ? should feature a * on each side of it – i.e. /*?* – to work as intended above. Am I correct in assuming that?
-
Thanks Logan - much appreciated, as ever - that really helps. If I was to add another * to Allow: /*?resultspage= - so Allow: /*?*resultspage= - what would happen then?
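A rough Python sketch of the wildcard matching Google documents shows the difference the extra * would make (this is an approximation, not Googlebot's actual code, and the two-parameter URL below is a hypothetical example):

```python
import re

def robots_match(pattern: str, path: str) -> bool:
    # Robots.txt patterns are prefix matches; '*' matches any run of characters
    return re.match("^" + re.escape(pattern).replace(r"\*", ".*"), path) is not None

# resultspage is the first parameter: both patterns match
print(robots_match("/*?resultspage=", "/shirts/?resultspage=01"))    # True
print(robots_match("/*?*resultspage=", "/shirts/?resultspage=01"))   # True

# resultspage is not the first parameter (hypothetical URL):
# only the pattern with the extra * still matches
print(robots_match("/*?resultspage=", "/shirts/?color=navy&resultspage=01"))   # False
print(robots_match("/*?*resultspage=", "/shirts/?color=navy&resultspage=01"))  # True
```

Under those rules, the extra * would simply broaden the Allow so it also covers results pages where resultspage isn't the first parameter in the query string.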
-
Ok, gotcha. Add the following directives:
Disallow: /shirts/*?*
This prevents crawling of the following:
- /shirts/golden/?minprice=10&maxprice=20
- /shirts/?minprice=10&maxprice=20
Allow: /*?resultspage=
Allows crawling of the following:
- /shirts/navy/?resultspage=02
- /shirts/?resultspage=01
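As a rough illustration of why the Allow overrides the Disallow for the results pages, here's a minimal Python sketch of Google's documented precedence rule - among the rules that match a URL, the longest pattern wins, and Allow beats Disallow on ties. It's a simplification, not Googlebot's actual implementation:

```python
import re

# The two directives above, as (kind, pattern) pairs
RULES = [("disallow", "/shirts/*?*"), ("allow", "/*?resultspage=")]

def robots_match(pattern: str, path: str) -> bool:
    # Prefix match; '*' stands for any run of characters
    return re.match("^" + re.escape(pattern).replace(r"\*", ".*"), path) is not None

def is_allowed(path: str) -> bool:
    hits = [(kind, pat) for kind, pat in RULES if robots_match(pat, path)]
    if not hits:
        return True  # no rule matches, so crawling is permitted by default
    # Longest pattern wins; on a tie, Allow beats Disallow
    hits.sort(key=lambda rule: (len(rule[1]), rule[0] == "allow"), reverse=True)
    return hits[0][0] == "allow"

for url in ["/shirts/golden/?minprice=10&maxprice=20",
            "/shirts/?minprice=10&maxprice=20",
            "/shirts/navy/?resultspage=02",
            "/shirts/?resultspage=01"]:
    print(url, "->", "allowed" if is_allowed(url) else "blocked")
```

Allow: /*?resultspage= (15 characters) is longer than Disallow: /shirts/*?* (11 characters), which is why the pagination URLs stay crawlable.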
-
Thanks Logan - much appreciated - the aim would be to prevent bots from crawling any parameter'd URL, but only in the products section - and not even all of those, see below.
I noticed the shirt URLs can produce many pages of results - e.g. if you look for a type of shirt you can get up to 20 pages of results - and the resulting URLs also feature a ?
So you end up with - for example - /shirts/?resultspage=01 and then /shirts/?resultspage=02, or /shirts/navy/?resultspage=01 and /shirts/navy/?resultspage=02 - and so on - and it would be good for those to stay indexable somehow. So I wonder how I can override the disallow-parameters instruction in robots.txt for specific paths, and even for individual pages?
-
Disallow: /shirts/?* will only block URLs where the parameter string begins immediately after /shirts/. If you want to block /shirts/golden/?minprice=10&maxprice=20 you'll have to add an asterisk before and after the ? (i.e. Disallow: /shirts/*?*).
What's the end goal here? Preventing bots from crawling any parameter'd URL?
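If you want to sanity-check that behaviour outside of a tester, here's a short Python approximation of the documented matching (a sketch, not Googlebot's actual code):

```python
import re

def robots_match(pattern: str, path: str) -> bool:
    # Prefix match; '*' stands for any run of characters
    return re.match("^" + re.escape(pattern).replace(r"\*", ".*"), path) is not None

# The original pattern only matches a ? directly after /shirts/
print(robots_match("/shirts/?*", "/shirts/?minprice=10&maxprice=20"))         # True
print(robots_match("/shirts/?*", "/shirts/golden/?minprice=10&maxprice=20"))  # False

# An asterisk before the ? covers the subdirectories as well
print(robots_match("/shirts/*?*", "/shirts/golden/?minprice=10&maxprice=20")) # True
```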
-
I suppose the nub of the disagreement is this: would Disallow: /shirts/?* block /shirts/?minprice=10&maxprice=20 and also block URLs further down the URL directory structure - e.g. /shirts/mens/navyblue/?minprice=10&maxprice=20?
-
Thanks Logan - the lead website developer assumed that this wildcard: Disallow: /shirts/?* would block URLs containing a ? within this directory and within all of its subdirectories.
If I amended the URL to /shirts/?minprice=10&maxprice=20, would robots.txt work as intended right there? And would it work as intended further down the directory structure of the URLs? E.g. /shirts/golden/?minprice=10&maxprice=20
Hi Luke,
The second developer is correct... well, more correct than the first. Your example of /shirts?minprice=10&maxprice=20 would not be blocked by this directive, since there's no slash after shirts.
For future reference, you can test how directives function in Google Search Console. Under the 'Crawl' menu, there's a robots.txt tester in which you can manually edit the robots.txt directives (edits there don't apply to the live file) and enter test URLs to see which directive, if any, would prevent crawling.
You are correct in your assumption that a * on either side of the ? would prevent crawling of both /shirts/blue?mprice=100&maxp=20 and /shirts/?minprice=10&maxprice=20.
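To make that concrete, here's the same kind of quick Python approximation of the matching rules (prefix match with * as a wildcard - a sketch, not Googlebot's implementation), run against the URLs discussed above:

```python
import re

def robots_match(pattern: str, path: str) -> bool:
    # Prefix match; '*' stands for any run of characters
    return re.match("^" + re.escape(pattern).replace(r"\*", ".*"), path) is not None

# No slash after 'shirts', so Disallow: /shirts/?* never applies here
print(robots_match("/shirts/?*", "/shirts?minprice=10&maxprice=20"))    # False

# With a * on either side of the ?, both example URLs are caught
print(robots_match("/shirts/*?*", "/shirts/blue?mprice=100&maxp=20"))   # True
print(robots_match("/shirts/*?*", "/shirts/?minprice=10&maxprice=20"))  # True
```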