Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies.
Robots.txt: how to exclude sub-directories correctly?
-
Hello here,
I am trying to figure out the correct way to tell search engines (SEs) to crawl this:
http://www.mysite.com/directory/
But not this:
http://www.mysite.com/directory/sub-directory/
or this:
http://www.mysite.com/directory/sub-directory2/sub-directory/...
But because I have thousands of sub-directories in almost infinite combinations, I can't list the following rules in any manageable way:
disallow: /directory/sub-directory/
disallow: /directory/sub-directory2/
disallow: /directory/sub-directory/sub-directory/
disallow: /directory/sub-directory2/subdirectory/
etc...
I would end up having thousands of definitions to disallow all the possible sub-directory combinations.
So, is the following a correct, better and shorter way to define what I want above?
allow: /directory/$
disallow: /directory/*
Would the above work?
Any thoughts are very welcome! Thank you in advance.
Best,
Fab.
-
I mentioned both: you add a meta robots noindex tag and remove the page from the sitemap.
-
But Google is still free to index a link/page even if it is not included in the XML sitemap.
-
Install the Yoast WordPress SEO plugin and use that to restrict what is indexed and what is allowed in a sitemap.
-
I am using WordPress with the Enfold theme (ThemeForest).
I want some files to be accessible to Google, but they should not be indexed.
Here is an example: http://prntscr.com/h8918o
I have currently blocked some JS directories/files using robots.txt (check screenshot).
But because of this I am not able to pass Google's Mobile-Friendly Test: http://prntscr.com/h8925z (check screenshot)
Is it possible to allow access but use a tag like noindex in the robots.txt file? Or is there any other way out?
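On the "any other way out" part: robots.txt only controls crawling, and Google does not support a noindex rule inside it. For files that cannot carry a meta robots tag (JS, CSS and so on), the documented alternative is to leave them crawlable and send an `X-Robots-Tag: noindex` HTTP response header. On a WordPress site that header would normally be set in the web server configuration or by a plugin; the Python below is only a rough sketch of the idea as a WSGI wrapper. The name `noindex_scripts`, the suffix list and the `/assets/app.js` path are all made up for illustration.

```python
def noindex_scripts(app, suffixes=(".js", ".css")):
    """Wrap a WSGI app so responses for matching paths carry X-Robots-Tag: noindex.

    The files stay crawlable (nothing is blocked in robots.txt); the header
    only asks search engines not to index them.
    """
    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start(status, headers, exc_info=None):
            # Add the header only for the file types we want kept out of the index.
            if path.endswith(suffixes):
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)

    return wrapped


def demo_app(environ, start_response):
    # Stand-in for the real site (hypothetical).
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


if __name__ == "__main__":
    seen = {}

    def fake_start(status, headers, exc_info=None):
        seen["headers"] = headers

    site = noindex_scripts(demo_app)
    for path in ("/assets/app.js", "/about/"):
        site({"PATH_INFO": path}, fake_start)
        print(path, seen["headers"])
```

With the JS directories unblocked in robots.txt, Google can fetch them for rendering (which is what the Mobile-Friendly Test needs), while the header keeps the files themselves out of the index.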
-
Yes, everything looks good. Webmaster Tools gave me the expected results with the following directives:
allow: /directory/$
disallow: /directory/*
Which allows this URL:
http://www.mysite.com/directory/
But doesn't allow the following one:
http://www.mysite.com/directory/sub-directory2/...
This page also gives an update similar to mine:
https://support.google.com/webmasters/answer/156449?hl=en
I think I am good! Thanks
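For anyone who wants to sanity-check rules like these offline, here is a minimal Python sketch of the matching behaviour Google documents for robots.txt: `*` matches any run of characters, a trailing `$` pins the end of the path, the longest matching pattern wins, and an allow wins a tie. The function names `pattern_to_regex` and `is_allowed` are made up for this illustration. It is not Google's parser, and other crawlers may resolve conflicts differently.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' pins the end of the
    URL path. Everything else is a literal prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(rules, path):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, pattern) tuples. The longest matching
    pattern wins; on equal length, allow beats disallow. A path that no rule
    matches is allowed.
    """
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


if __name__ == "__main__":
    rules = [("allow", "/directory/$"), ("disallow", "/directory/*")]
    for path in (
        "/directory/",
        "/directory/sub-directory/",
        "/directory/sub-directory2/sub-directory/page.html",
    ):
        print(path, "allowed" if is_allowed(rules, path) else "blocked")
```

Under these assumptions `/directory/` comes out as allowed and the sub-directory URLs as blocked. Note that `/directory/$` and `/directory/*` are the same length, so it is the allow-wins-ties rule that keeps `/directory/` crawlable. With the allow line removed, `/directory/` is blocked too, because `/directory/*` also matches the bare directory URL.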
-
Thank you Michael, it is my understanding then that my idea of doing this:
allow: /directory/$
disallow: /directory/*
Should work just fine. I will test it within Google Webmaster Tools, and let you know if any problems arise.
In the meantime, if anyone else has more ideas about all this and can confirm it, that would be great!
Thank you again.
-
I've always stuck to Disallow and followed -
"This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:"
http://www.robotstxt.org/robotstxt.html
From https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt this seems contradictory:
/* | equivalent to / | equivalent to / | Equivalent to "/" -- the trailing wildcard is ignored.
I think this post will be very useful for you - http://moz.com/community/q/allow-or-disallow-first-in-robots-txt
-
Thank you Michael,
Google and other SEs actually recognize the "allow:" command:
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
The fact is: if I don't specify that, how can I be sure that the following single command:
disallow: /directory/*
Doesn't prevent SEs from crawling the /directory/ index page, as I'd like them to?
-
As long as you don't have directories somewhere in /* that you want indexed, then I think that will work. There is no allow, so you don't need the first line, just:
disallow: /directory/*
You can test it out here - https://support.google.com/webmasters/answer/156449?rd=1