Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Robots.txt: how to exclude sub-directories correctly?
-
Hello here,
I am trying to figure out the correct way to tell SEs to crawls this:
http://www.mysite.com/directory/
But not this:
http://www.mysite.com/directory/sub-directory/
or this:
http://www.mysite.com/directory/sub-directory2/sub-directory/...
But with the fact I have thousands of sub-directories with almost infinite combinations, I can't put the following definitions in a manageable way:
disallow: /directory/sub-directory/
disallow: /directory/sub-directory2/
disallow: /directory/sub-directory/sub-directory/
disallow: /directory/sub-directory2/subdirectory/
etc...
I would end up having thousands of definitions to disallow all the possible sub-directory combinations.
So, is the following way a correct, better and shorter way to define what I want above:
allow: /directory/$
disallow: /directory/*
Would the above work?
Any thoughts are very welcome! Thank you in advance.
Best,
Fab.
-
I mentioned both. You add a meta robots to noindex and remove from the sitemap.
-
But google is still free to index a link/page even if it is not included in xml sitemap.
-
Install Yoast Wordpress SEO plugin and use that to restrict what is indexed and what is allowed in a sitemap.
-
I am using wordpress, Enfold theme (themeforest).
I want some files to be accessed by google, but those should not be indexed.
Here is an example: http://prntscr.com/h8918o
I have currently blocked some JS directories/files using robots.txt (check screenshot)
But due to this I am not able to pass Mobile Friendly Test on Google:Â http://prntscr.com/h8925z (check screenshot)
Is its possible to allow access, but use a tag like noindex in the robots.txt file. Or is there any other way out.
-
Yes, everything looks good, Webmaster Tools gave me the expected results with the following directives:
allow: /directory/$
disallow: /directory/*
Which allows this URL:
http://www.mysite.com/directory/
But doesn't allow the following one:
http://www.mysite.com/directory/sub-directory2/...
This page also gives an update similar to mine:
https://support.google.com/webmasters/answer/156449?hl=en
I think I am good! Thanks

-
Thank you Michael, it is my understanding then that my idea of doing this:
allow: /directory/$
disallow: /directory/*
Should work just fine. I will test it within Google Webmaster Tools, and let you know if any problems arise.
In the meantime if anyone else has more ideas about all this and can confirm me that would be great!
Thank you again.
-
I've always stuck to Disallow and followed -
"This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:"
http://www.robotstxt.org/robotstxt.html
From https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt this seems contradictory
|
/*| equivalent to / | equivalent to / | Equivalent to "/" -- the trailing wildcard is ignored. |I think this post will be very useful  for you - http://moz.com/community/q/allow-or-disallow-first-in-robots-txt
-
Thank you Michael,
Google and other SEs actually recognize the "allow:" command:
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
The fact is: if I don't specify that, how can I be sure that the following single command:
disallow: /directory/*
Doesn't prevent SEs to spider the /directory/ index as I'd like to?
-
As long as you dont have directories somewhere in /* that you want indexed then I think that will work. Â There is no allow so you don't need the first line just
disallow: /directory/*
You can test out here-Â https://support.google.com/webmasters/answer/156449?rd=1
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Using a Reverse Proxy and 301 redirect to appear Sub Domain as Sub Directory - what are the SEO Risks?
We’re in process to move WordPress blog URLs from subdomains to sub-directory. We aren’t moving blog physically, but using reverse proxy and 301 redirection to do this. Blog subdomain URL is https://blog.example.com/ and destination sub-directory URL is https://www.example.com/blog/ Our main website is e-commerce marketplace which is YMYL site. This is on Windows server. Due to technical reasons, we can’t physically move our WordPress blog to the main website. Following is our Technical Setup Setup a reverse proxy at https://www.example.com/blog/ pointing to https://blog.example.com/ Use a 301 redirection from https://blog.example.com/ to https://www.example.com/blog/ with an exception if a traffic is coming from main WWW domain then it won’t redirect. Thus, we can eliminate infinite loop. Change all absolute URLs to relative URLs on blog Change the sitemap URL from https://blog.example.com/sitemap.xml to https://www.example.com/blog/sitemap.xml and update all URLs mentioned within the sitemap. SEO Risk Evaluation We have individual GA Tracking ID and individual Google Search Console Properties for main website and blog. We will not merge them. Keep them separate as they are. Keeping this in mind, I am evaluating SEO Risks factors Right now when we receive traffic from main website to blog (or vice versa) then it is considered as referral traffic and new cookies are set for Google Analytics. What’s going to happen when its on the same domain? Which type of settings change should I do in Blog’s Google Search Console? (A). Do I need to request “Change of Address” in the Blog’s search console property? (B). Should I re-submit the sitemap? Do I need to re-submit the blog sitemap from the https://www.example.com/ Google Search Console Property? Main website is e-commerce marketplace which is YMYL website, and blog is all about content. So does that impact SEO? Will this dilute SEO link juice or impact on the main website ranking because following are the key SEO Metrices. (A). Main website’s Avg Session Duration is about 10 minutes and bounce rate is around 30% (B). Blog’s Avg Session Duration is 33 seconds and bounce rate is over 92%
Intermediate & Advanced SEO | | joshibhargav_200 -
H1 and Schema Codes Set Up Correctly?
Greetings: It was pointed out to me that the h1 tags on my website (www.nyc-officespace-leader.com) all had exactly the same text and that duplication may be contributing to the very low page authority for most URLs. The duplicate h1 appears in line 54-54 (see below) of the home page: www.nyc-officespace-leader.com: itemscope itemtype="http://schema.org/LocalBusiness" style="position:absolute;top:-9999em;"> <span<br>itemprop="name">Metro Manhattan Office Space</span<br> <img< p="">But the above refers to schema" so is this really duplicate H1 or is there an exception if the H1 is within a schema? Also, I was told that the company street address and city and state were set up incorrectly as part of an alt tag. However these items also appear as schema in lines 49-68 shown below: Dangerous for me to perform surgery on the code without being certain about these key items!! Could ask my developer, however they may be uncomfortable considering that they set this up in the 1st place. So the view of neutral professionals would be highly welcome! itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
Intermediate & Advanced SEO | | Kingalan1
<span<br>itemprop="streetAddress">347 5th Ave #1008
<span<br>itemprop="addressLocality">New York
<span<br>itemprop="addressRegion">NY
<span<br>itemprop="postalCode">10016<div<br>itemprop="brand" itemscope itemtype="http://schema.org/Organization">
---------------------------------------------------------------------------</div<br></span<br></span<br></span<br></span<br></img<>0 -
How do I know if I am correctly solving an uppercase url issue that may be affecting Googlebot?
We have a large e-commerce site (10k+ SKUs). https://www.flagandbanner.com. As I have begun analyzing how to improve it I have discovered that we have thousands of urls that have uppercase characters. For instance: https://www.flagandbanner.com/Products/patriotic-paper-lanterns-string-lights.asp. This is inconsistently applied throughout the site. I directed our website vendor to fix the issue and they placed 301 redirects via a rule to the web.config file. Any url that contains an uppercase character now displays as a lowercase. However, as I use screaming frog to monitor our site, I see all these 301 redirects--thousands of them. The XML sitemap still shows the the uppercase versions. We have had indexing issues as well. So I'm wondering what is the most effective way to make sure that I'm not placing an extra burden on Googlebot when they index our site? Should I have just not cared about the uppercase issue and let it alone?
Intermediate & Advanced SEO | | webrocket0 -
SEO Best Practices regarding Robots.txt disallow
I cannot find hard and fast direction about the following issue: It looks like the Robots.txt file on my server has been set up to disallow "account" and "search" pages within my site, so I am receiving warnings from the Google Search console that URLs are being blocked by Robots.txt. (Disallow: /Account/ and Disallow: /?search=). Do you recommend unblocking these URLs? I'm getting a warning that over 18,000 Urls are blocked by robots.txt. ("Sitemap contains urls which are blocked by robots.txt"). Seems that I wouldn't want that many urls blocked. ? Thank you!!
Intermediate & Advanced SEO | | jamiegriz0 -
Block in robots.txt instead of using canonical?
When I use a canonical tag for pages that are variations of the same page, it basically means that I don't want Google to index this page. But at the same time, spiders will go ahead and crawl the page. Isn't this a waste of my crawl budget? Wouldn't it be better to just disallow the page in robots.txt and let Google focus on crawling the pages that I do want indexed? In other words, why should I ever use rel=canonical as opposed to simply disallowing in robots.txt?
Intermediate & Advanced SEO | | YairSpolter0 -
Best way to block a sub-domain from being indexed
Hello, The search engines have indexed a sub-domain I did not want indexed its on old.domain.com and dev.domain.com - I was going to password them but is there a best practice way to block them. My main domain default robots.txt says :- Sitemap: http://www.domain.com/sitemap.xml global User-agent: *
Intermediate & Advanced SEO | | JohnW-UK
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
Disallow: /category//
Disallow: */trackback/
Disallow: */feed/
Disallow: /comments/
Disallow: /?0 -
Do you add 404 page into robot file or just add no index tag?
Hi, got different opinion on this so i wanted to double check with your comment is. We've got /404.html page and I was wondering if you would add this page to robot text so it wouldn't be indexed or would you just add no index tag? What would be the best approach? Thanks!
Intermediate & Advanced SEO | | Rubix0 -
Meta NoIndex tag and Robots Disallow
Hi all, I hope you can spend some time to answer my first of a few questions 🙂 We are running a Magento site - layered/faceted navigation nightmare has created thousands of duplicate URLS! Anyway, during my process to tackle the issue, I disallowed in Robots.txt anything in the querystring that was not a p (allowed this for pagination). After checking some pages in Google, I did a site:www.mydomain.com/specificpage.html and a few duplicates came up along with the original with
Intermediate & Advanced SEO | | bjs2010
"There is no information about this page because it is blocked by robots.txt" So I had added in Meta Noindex, follow on all these duplicates also but I guess it wasnt being read because of Robots.txt. So coming to my question. Did robots.txt block access to these pages? If so, were these already in the index and after disallowing it with robots, Googlebot could not read Meta No index? Does Meta Noindex Follow on pages actually help Googlebot decide to remove these pages from index? I thought Robots would stop and prevent indexation? But I've read this:
"Noindex is a funny thing, it actually doesn’t mean “You can’t index this”, it means “You can’t show this in search results”. Robots.txt disallow means “You can’t index this” but it doesn’t mean “You can’t show it in the search results”. I'm a bit confused about how to use these in both preventing duplicate content in the first place and then helping to address dupe content once it's already in the index. Thanks! B0