Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
What does Disallow: /french-wines/?* actually do - robots.txt
-
Hello Mozzers - Just wondering what this robots.txt instruction means: Disallow: /french-wines/?*
Does it stop Googlebot crawling and indexing URLs in that "French Wines" folder - specifically the URLs that include a question mark?
Would it stop the crawling of deeper folders - e.g. /french-wines/rhone-region/ that include a question mark in their URL?
I think this has been done to block URLs containing query strings.
Thanks, Luke
-
Glad to help, Luke!
-
Thanks Logan for your help with this - much appreciated. Really helpful!
-
Disallow: /?* is the same thing as Disallow:/?, since the asterisk is a wildcard, both of those disallows prevent any URL that begins with /? from being crawled.
And yes, it is incredibly easy to disallow the wrong thing! The robots.txt tester in Search Console (under the Crawl menu) is very helpful for figuring out what a disallow will catch and what it will let by. I highly recommend testing any new disallows there before releasing them into the wild.
-
Thanks again Logan.
What would Disallow: /?* do because that is what the site I am looking at has implemented. Perhaps it works both ways around?
I imagine it's easy to disallow the wrong thing or possibly not disallow the right thing. Ugh.
-
Disallow: /*?
This disallow literally says to crawlers 'if a URL starts with a slash (all URLs) and has a parameter, don't crawl it'. The * is a wildcard that says anything between / and ? is applicable to the disallow.
It's very easy to disallow the wrong this especially in regards to parameters, for this reason I always do these 2 things rather than using robots.txt:
- Set the purpose of each parameter in Search Console - Go to Crawl > URL Parameters to configure for your site
- Self-referring canonicals - most people disallow URLs with parameters in robots.txt to prevent indexing, but this only prevents crawling. A self-referring canonical pointing to the root level of that URL will prevent indexing or URLs with parameters.
Hope that's helpful!
-
Thanks Logan - I was just reading: Disallow: /*? # block any URL that includes a ? (and thus a query string) - do you know why the ? comes before the * in this case?
-
Hi Luke,
You are correct that this was done to block URLs with parameters. However, since there's no wildcard (the asterisk) before the folder name, the URL would have to start with /french-wines/. This disallow is really only preventing crawling on the single URL www.yoursite.com/french-wines/ with any parameters appended.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Ogranization Schema/Microformat for a content/brand website | Travel
Hi, One of our clients have a website specific to a place, for eg. California Tourism in which they publish local information related to tourism, blogs & other useful content. I want to understand how useful is to publish Organization Schema on such website mentioning the actual Organization, which in this case is a Travel Agency? Or any other schema would fit in for such websites?
Intermediate & Advanced SEO | | ds9.tech0 -
SEO Best Practices regarding Robots.txt disallow
I cannot find hard and fast direction about the following issue: It looks like the Robots.txt file on my server has been set up to disallow "account" and "search" pages within my site, so I am receiving warnings from the Google Search console that URLs are being blocked by Robots.txt. (Disallow: /Account/ and Disallow: /?search=). Do you recommend unblocking these URLs? I'm getting a warning that over 18,000 Urls are blocked by robots.txt. ("Sitemap contains urls which are blocked by robots.txt"). Seems that I wouldn't want that many urls blocked. ? Thank you!!
Intermediate & Advanced SEO | | jamiegriz0 -
Help article / Knowledge base SEO consideration
Hi everyone, I am in the process of building the knowledge base for our SaaS product and I am afraid it could impact us negatively on the SEO side because of: Thin content on pages containing short answers to specific questions Keyword cannibalisation between some of our blog articles and the knowledge base articles I didn't find much on the impact of knowledge bases on SEO when I searched on Google. So I'm hoping we can use this thread to share a few thoughts and best practices on this topic. Below is a bit more details on the issues I face, any tips on how to address them would be most welcome. 1. Thin content: Some articles will have thin content by design: the H1 will be a specific question and there will be only 2 or 3 lines of text answering it in the article. I think creating a dedicated article per question is better than grouping 20 questions on one article from a UX point of view, because this will enable us to direct users more quickly to the answer when they use the live search function inside the software (help widget) or on the knowledge base (saves them the need to scrolling a long article to find the answer). Now the issue is that this will result in lots of pages with thin content. A workaround could be to have both a detailed FAQ style page with all the questions and answers, and individual articles for each question on top of that. The FAQ style page could be indexed in Google while the individual articles would have either a noIndex directive or a rel canonical to the FAQ style page. Have any of you faced similar issues when setting-up your knowledge base? Which approach would you recommend? 2.Keyword cannibalisation: There will be, to some extend, a level of keyword cannibalisation between our blog articles (which rank well) and some of the knowledge base articles. While we want both types of articles to appear in search, we don't want the "How to do XYZ" blog article containing practical tips to compete with the "How to do XYZ in the software" knowledge base article. Do you have any advice on how to achieve that? Having a specific Schema.org (or equivalent) type of markup to differentiate between the 2 types of articles would have been ideal but I couldn't find anything relating to help articles specifically when I searched.
Intermediate & Advanced SEO | | tbps0 -
Robots.txt - Do I block Bots from crawling the non-www version if I use www.site.com ?
my site uses is set up at http://www.site.com I have my site redirected from non- www to the www in htacess file. My question is... what should my robots.txt file look like for the non-www site? Do you block robots from crawling the site like this? Or do you leave it blank? User-agent: * Disallow: / Sitemap: http://www.morganlindsayphotography.com/sitemap.xml Sitemap: http://www.morganlindsayphotography.com/video-sitemap.xml
Intermediate & Advanced SEO | | morg454540 -
301 v/s 302 Redirection on Homepage (Multilingual)
Hello, Our website: http://www.luxresorts.com currently has a default 302 redirection to http://www.luxresorts.com/en. We would like to do a 301 redirection instead of a 302 to http://www.luxresorts.com. Our concern is that the site is multilingual and we wonder what effect would the 301 redirection have on search engine crawlers and how would this appear on SERP. When a search is done on Google.com, the English version of our website appears and when on Google.FR, the French version appears. Would the 301 redirection change the way our website appear on Google? Grateful if you could help us out in understanding the pros and cons/best practices for our concern. Thanks in advance. Tej Luchmun.
Intermediate & Advanced SEO | | luxresorts0 -
404 Errors with my RSS Feed/sitemap
In my google webmasters I just started getting 404 errors that I'm not sure how to redirect. I'm getting quite a few that are ending in /feed/ for instance /nyc-accident-injury/feed/
Intermediate & Advanced SEO | | jsmythd
contact-us-thank-you/feed/ and then also a problem with my sitemap I guess? With /site-map/?postsort=tags The domain is pulversthompson.com0 -
Recovering from robots.txt error
Hello, A client of mine is going through a bit of a crisis. A developer (at their end) added Disallow: / to the robots.txt file. Luckily the SEOMoz crawl ran a couple of days after this happened and alerted me to the error. The robots.txt file was quickly updated but the client has found the vast majority of their rankings have gone. It took a further 5 days for GWMT to file that the robots.txt file had been updated and since then we have "Fetched as Google" and "Submitted URL and linked pages" in GWMT. In GWMT it is still showing that that vast majority of pages are blocked in the "Blocked URLs" section, although the robots.txt file below it is now ok. I guess what I want to ask is: What else is there that we can do to recover these rankings quickly? What time scales can we expect for recovery? More importantly has anyone had any experience with this sort of situation and is full recovery normal? Thanks in advance!
Intermediate & Advanced SEO | | RikkiD220 -
Could you use a robots.txt file to disalow a duplicate content page from being crawled?
A website has duplicate content pages to make it easier for users to find the information from a couple spots in the site navigation. Site owner would like to keep it this way without hurting SEO. I've thought of using the robots.txt file to disallow search engines from crawling one of the pages. Would you think this is a workable/acceptable solution?
Intermediate & Advanced SEO | | gregelwell0