Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Robots.txt: how to exclude sub-directories correctly?
-
Hello here,
I am trying to figure out the correct way to tell SEs to crawls this:
http://www.mysite.com/directory/
But not this:
http://www.mysite.com/directory/sub-directory/
or this:
http://www.mysite.com/directory/sub-directory2/sub-directory/...
But with the fact I have thousands of sub-directories with almost infinite combinations, I can't put the following definitions in a manageable way:
disallow: /directory/sub-directory/
disallow: /directory/sub-directory2/
disallow: /directory/sub-directory/sub-directory/
disallow: /directory/sub-directory2/subdirectory/
etc...
I would end up having thousands of definitions to disallow all the possible sub-directory combinations.
So, is the following way a correct, better and shorter way to define what I want above:
allow: /directory/$
disallow: /directory/*
Would the above work?
Any thoughts are very welcome! Thank you in advance.
Best,
Fab.
-
I mentioned both. You add a meta robots to noindex and remove from the sitemap.
-
But google is still free to index a link/page even if it is not included in xml sitemap.
-
Install Yoast Wordpress SEO plugin and use that to restrict what is indexed and what is allowed in a sitemap.
-
I am using wordpress, Enfold theme (themeforest).
I want some files to be accessed by google, but those should not be indexed.
Here is an example: http://prntscr.com/h8918o
I have currently blocked some JS directories/files using robots.txt (check screenshot)
But due to this I am not able to pass Mobile Friendly Test on Google:Â http://prntscr.com/h8925z (check screenshot)
Is its possible to allow access, but use a tag like noindex in the robots.txt file. Or is there any other way out.
-
Yes, everything looks good, Webmaster Tools gave me the expected results with the following directives:
allow: /directory/$
disallow: /directory/*
Which allows this URL:
http://www.mysite.com/directory/
But doesn't allow the following one:
http://www.mysite.com/directory/sub-directory2/...
This page also gives an update similar to mine:
https://support.google.com/webmasters/answer/156449?hl=en
I think I am good! Thanks
-
Thank you Michael, it is my understanding then that my idea of doing this:
allow: /directory/$
disallow: /directory/*
Should work just fine. I will test it within Google Webmaster Tools, and let you know if any problems arise.
In the meantime if anyone else has more ideas about all this and can confirm me that would be great!
Thank you again.
-
I've always stuck to Disallow and followed -
"This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:"
http://www.robotstxt.org/robotstxt.html
From https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt this seems contradictory
|
/*
| equivalent to / | equivalent to / | Equivalent to "/" -- the trailing wildcard is ignored. |I think this post will be very useful  for you - http://moz.com/community/q/allow-or-disallow-first-in-robots-txt
-
Thank you Michael,
Google and other SEs actually recognize the "allow:" command:
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
The fact is: if I don't specify that, how can I be sure that the following single command:
disallow: /directory/*
Doesn't prevent SEs to spider the /directory/ index as I'd like to?
-
As long as you dont have directories somewhere in /* that you want indexed then I think that will work. Â There is no allow so you don't need the first line just
disallow: /directory/*
You can test out here-Â https://support.google.com/webmasters/answer/156449?rd=1
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Using a Reverse Proxy and 301 redirect to appear Sub Domain as Sub Directory - what are the SEO Risks?
We’re in process to move WordPress blog URLs from subdomains to sub-directory. We aren’t moving blog physically, but using reverse proxy and 301 redirection to do this. Blog subdomain URL is https://blog.example.com/ and destination sub-directory URL is https://www.example.com/blog/ Our main website is e-commerce marketplace which is YMYL site. This is on Windows server. Due to technical reasons, we can’t physically move our WordPress blog to the main website. Following is our Technical Setup Setup a reverse proxy at https://www.example.com/blog/ pointing to https://blog.example.com/ Use a 301 redirection from https://blog.example.com/ to https://www.example.com/blog/ with an exception if a traffic is coming from main WWW domain then it won’t redirect. Thus, we can eliminate infinite loop. Change all absolute URLs to relative URLs on blog Change the sitemap URL from https://blog.example.com/sitemap.xml to https://www.example.com/blog/sitemap.xml and update all URLs mentioned within the sitemap. SEO Risk Evaluation We have individual GA Tracking ID and individual Google Search Console Properties for main website and blog. We will not merge them. Keep them separate as they are. Keeping this in mind, I am evaluating SEO Risks factors Right now when we receive traffic from main website to blog (or vice versa) then it is considered as referral traffic and new cookies are set for Google Analytics. What’s going to happen when its on the same domain? Which type of settings change should I do in Blog’s Google Search Console? (A). Do I need to request “Change of Address” in the Blog’s search console property? (B). Should I re-submit the sitemap? Do I need to re-submit the blog sitemap from the https://www.example.com/ Google Search Console Property? Main website is e-commerce marketplace which is YMYL website, and blog is all about content. So does that impact SEO? Will this dilute SEO link juice or impact on the main website ranking because following are the key SEO Metrices. (A). Main website’s Avg Session Duration is about 10 minutes and bounce rate is around 30% (B). Blog’s Avg Session Duration is 33 seconds and bounce rate is over 92%
Intermediate & Advanced SEO | | joshibhargav_200 -
No index detected in robots meta tag GSC issue_Help Please
Hi Everyone, We just did a site migration ( URL structure change, site redesign, CMS change). During migration, dev team messed up badly on a few things including SEO. The old site had pages canonicalized and self canonicalized <> New site doesn't have anything (CMS dev error) so we are working retroactively to add canonicalization mechanism The legacy site had URL’s ending with a trailing slash “/” <> new site got redirected to Set of url’s without “/” New site action : All robots are allowed: A new sitemap is submitted to google search console So here is my problem (it been a long 24hr night for me 🙂 ) 1. Now when I look at GSC homepage URL it says that old page is self canonicalized and currently in index (old page with a trailing slash at the end of URL). 2. When I try to perform a live URL test, I get the message "No: 'noindex' detected in 'robots' meta tag" , so indexation cant be done. I have no idea where noindex is coming from. 3. Robots.txt in search console still showing old file ( no noindex there ) I tried to submit new file but old one still coming up. When I click on "See live robots.txt" I get current robots. 4. I see that old page is still canonicalized and attempting to index redirected old page might be confusing google Hope someone can help to get the new page indexed! I really need it 🙂 Please ping me if you need more clarification. Thank you ! Thank you
Intermediate & Advanced SEO | | bgvsiteadmin1 -
Directory with Duplicate content? what to do?
Moz keeps finding loads of pages with duplicate content on my website. The problem is its a directory page to different locations. E.g if we were a clothes shop we would be listing our locations: www.sitename.com/locations/london www.sitename.com/locations/rome www.sitename.com/locations/germany The content on these pages is all the same, except for an embedded google map that shows the location of the place. The problem is that google thinks all these pages are duplicated content. Should i set a canonical link on every single page saying that www.sitename.com/locations/london is the main page? I don't know if i can use canonical links because the page content isn't identical because of the embedded map. Help would be appreciated. Thanks.
Intermediate & Advanced SEO | | nchlondon0 -
Disallow URLs ENDING with certain values in robots.txt?
Is there any way to disallow URLs ending in a certain value? For example, if I have the following product page URL: http://website.com/category/product1, and I want to disallow /category/product1/review, /category/product2/review, etc. without disallowing the product pages themselves, is there any shortcut to do this, or must I disallow each gallery page individually?
Intermediate & Advanced SEO | | jmorehouse0 -
Robots.txt, does it need preceding directory structure?
Do you need the entire preceding path in robots.txt for it to match? e.g: I know if i add Disallow:Â /fish to robots.txt it will block /fish
Intermediate & Advanced SEO | | Milian
/fish.html
/fish/salmon.html
/fishheads
/fishheads/yummy.html
/fish.php?id=anything But would it block?: en/fish
en/fish.html
en/fish/salmon.html
en/fishheads
en/fishheads/yummy.html
**en/fish.php?id=anything (taken from Robots.txt Specifications)** I'm hoping it actually wont match, that way writing this particular robots.txt will be much easier! As basically I'm wanting to block many URL that have BTS- in such as: http://www.example.com/BTS-something
http://www.example.com/BTS-somethingelse
http://www.example.com/BTS-thingybob But have other pages that I do not want blocked, in subfolders that also have BTS- in, such as: http://www.example.com/somesubfolder/BTS-thingy
http://www.example.com/anothersubfolder/BTS-otherthingy Thanks for listening0 -
How is Google crawling and indexing this directory listing?
We have three Directory Listing pages that are being indexed by Google: http://www.ccisolutions.com/StoreFront/jsp/ http://www.ccisolutions.com/StoreFront/jsp/html/ http://www.ccisolutions.com/StoreFront/jsp/pdf/ How and why is Googlebot crawling and indexing these pages? Nothing else links to them (although the /jsp.html/ and /jsp/pdf/ both link back to /jsp/). They aren't disallowed in our robots.txt file and I understand that this could be why. If we add them to our robots.txt file and disallow, will this prevent Googlebot from crawling and indexing those Directory Listing pages without prohibiting them from crawling and indexing the content that resides there which is used to populate pages on our site? Having these pages indexed in Google is causing a myriad of issues, not the least of which is duplicate content. For example, this file <tt>CCI-SALES-STAFF.HTML</tt> (which appears on this Directory Listing referenced above - http://www.ccisolutions.com/StoreFront/jsp/html/) clicks through to this Web page: http://www.ccisolutions.com/StoreFront/jsp/html/CCI-SALES-STAFF.HTML This page is indexed in Google and we don't want it to be. But so is the actual page where we intended the content contained in that file to display: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff As you can see, this results in duplicate content problems. Is there a way to disallow Googlebot from crawling that Directory Listing page, and, provided that we have this URL in our sitemap: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff, solve the duplicate content issue as a result? For example: Disallow: /StoreFront/jsp/ Disallow: /StoreFront/jsp/html/ Disallow: /StoreFront/jsp/pdf/ Can we do this without risking blocking Googlebot from content we do want crawled and indexed? Many thanks in advance for any and all help on this one!
Intermediate & Advanced SEO | | danatanseo0 -
Merging Domains... Sub-domains, Directories or Seperate Sites?
Hello! I am hoping you can help me decide the best path to take here... A little background: I'm moving to a new company that has three old domains (the oldest is 10 years old), which get a lot of traffic from their e-letters. Until recently they have not cared about SEO. So the websites have some structural, coding, URL and other issues. The sites are indexed, but have a problem getting crawled and/or indexed for new content - haven't delved into this yet but am certain I will be able to fix any of these issues. These three domains are PR4, PR4, PR5 and contain hundreds of unique articles. Here's the question... They want to move these three sites **to their main company site (PR4) and create sub domains for each one. ** I am wondering if this is a good idea or not. I have merged sites before (creating categories and/or directories) and the end result is that the ONE big site, is much for effective than TWO smaller, less authoritative sites. But the sub domain idea is something I am unsure about from an SEO perspective. Should we do this with sub domains? Or do you think we should keep the sites separate? How do Panda and Penguin play into this? Thanks in advance for the help! SD P.S. I'm not a huge advocate in using PR as a measurement tool, but since I can't reveal the actual domains, I figured I would list it as a reference point.
Intermediate & Advanced SEO | | essdee0 -
Blocking Dynamic URLs with Robots.txt
Background: My e-commerce site uses a lot of layered navigation and sorting links. Â While this is great for users, it ends up in a lot of URL variations of the same page being crawled by Google. Â For example, a standard category page: www.mysite.com/widgets.html ...which uses a "Price" layered navigation sidebar to filter products based on price also produces the following URLs which link to the same page: http://www.mysite.com/widgets.html?price=1%2C250 http://www.mysite.com/widgets.html?price=2%2C250 http://www.mysite.com/widgets.html?price=3%2C250 As there are literally thousands of these URL variations being indexed, so I'd like to use Robots.txt to disallow these variations. Question: Is this a wise thing to do? Â Or does Google take into account layered navigation links by default, and I don't need to worry. To implement, I was going to do the following in Robots.txt: User-agent: * Disallow: /*? Disallow: /*= ....which would prevent any dynamic URL with a '?" or '=' from being indexed. Â Is there a better way to do this, or is this a good solution? Thank you!
Intermediate & Advanced SEO | | AndrewY1