Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Robots.txt, does it need preceding directory structure?
-
Do you need the entire preceding path in robots.txt for it to match?
e.g:
I know if i add Disallow: /fish to robots.txt it will block
/fish
/fish.html
/fish/salmon.html
/fishheads
/fishheads/yummy.html
/fish.php?id=anythingBut would it block?:
en/fish
en/fish.html
en/fish/salmon.html
en/fishheads
en/fishheads/yummy.html
**en/fish.php?id=anything(taken from Robots.txt Specifications)** I'm hoping it actually wont match, that way writing this particular robots.txt will be much easier!
As basically I'm wanting to block many URL that have BTS- in such as:
http://www.example.com/BTS-something
http://www.example.com/BTS-somethingelse
http://www.example.com/BTS-thingybobBut have other pages that I do not want blocked, in subfolders that also have BTS- in, such as:
http://www.example.com/somesubfolder/BTS-thingy
http://www.example.com/anothersubfolder/BTS-otherthingyThanks for listening
-
Yes this is what I thought, but wanted some second opinions.
Although I wouldn't actually need a wild card after BTS, as just leaving it open is the same as using a wildcard:
/fish*.......... Equivalent to "/fish" -- the trailing wildcard is ignored. https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt Thanks for the link, I'll take a look
-
You're right in with the **Disallow: /fish **in the robots file blocking all those initial links, but if you wanted to block everything inside the /en/ folder, you would need to do disallow: /en/fish
You could use a wildcard in the robots.txt file to do something along the lines of Disallow: /BTS-*
This _'should' _work, but it's always worth checking using a tool to make sure it's all implemented correctly. Distilled did a post a while back about a JS tool which allows you to test if robots.txt files work correctly which can be found here - http://www.distilled.net/blog/seo/js-bookmarklet-for-checking-if-a-page-is-blocked-by-robots-txt/
In addition to this, you could also use the 'blocked URLs' tool in GWT to see if the pages are successfully blocked once you've implemented the code.
Hope this helps!
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Taxonomy question - best approach for site structure
Hi all, I'm working on a dentist's website and want some advice on the best way to lay out the navigation. I would like to know which structure will help the site work naturally. I feel the second example would be better as it would focus the 'power' around the type of treatment and get that to rank better. .com/assessment/whitening
Intermediate & Advanced SEO | | Bee159
.com/assessment/straightening
.com/treatment/whitening
.com/treatment/straightening or .com/whitening/assessment
.com/straightening/assessment
.com/whitening/treatment
.com/straightening/treatment Please advise, thanks.0 -
Will disallowing URL's in the robots.txt file stop those URL's being indexed by Google
I found a lot of duplicate title tags showing in Google Webmaster Tools. When I visited the URL's that these duplicates belonged to, I found that they were just images from a gallery that we didn't particularly want Google to index. There is no benefit to the end user in these image pages being indexed in Google. Our developer has told us that these urls are created by a module and are not "real" pages in the CMS. They would like to add the following to our robots.txt file Disallow: /catalog/product/gallery/ QUESTION: If the these pages are already indexed by Google, will this adjustment to the robots.txt file help to remove the pages from the index? We don't want these pages to be found.
Intermediate & Advanced SEO | | andyheath0 -
Structured Data + Meta Descriptions
Hey All, Was just looking through some google pages on best practices for meta descriptions and came across this little tidbit. "Include clearly tagged facts in the description. The meta description doesn't just have to be in sentence format; it's also a great place to include structured data about the page. For example, news or blog postings can list the author, date of publication, or byline information. This can give potential visitors very relevant information that might not be displayed in the snippet otherwise. Similarly, product pages might have the key bits of information—price, age, manufacturer—scattered throughout a page. A good meta description can bring all this data together. For example, the following meta description provides detailed information about a book. " This is the first time I have seen suggested use of structured data in meta descriptions. Does this totally replace a regular meta description or will it work in conjunction with the regular meta description? If I provide both structured data and text, will the SERP display text and the structured data the way it was previously displayed? Or will the 150 -160 character limit take precedence and just cut off all info after that?
Intermediate & Advanced SEO | | Whebb0 -
Should I use meta noindex and robots.txt disallow?
Hi, we have an alternate "list view" version of every one of our search results pages The list view has its own URL, indicated by a URL parameter I'm concerned about wasting our crawl budget on all these list view pages, which effectively doubles the amount of pages that need crawling When they were first launched, I had the noindex meta tag be placed on all list view pages, but I'm concerned that they are still being crawled Should I therefore go ahead and also apply a robots.txt disallow on that parameter to ensure that no crawling occurs? Or, will Googlebot/Bingbot also stop crawling that page over time? I assume that noindex still means "crawl"... Thanks 🙂
Intermediate & Advanced SEO | | ntcma0 -
How to structure articles on a website.
Hi All, Key to a successful website is quality content - so the Gods of Google tell me. Embrace your audience with quality feature rich articles on your products or services, hints and tips, how to, etc. So you build your article page with all the correct criteria; Long Tail Keyword or phrases hitting the URL, heading, 1st sentance, etc. My question is this
Intermediate & Advanced SEO | | Mark_Ch
Let's say you have 30 articles, where would you place the 30 articles for SEO purposes and user experiences. My thought are:
1] on the home page create a column with a clear heading "Useful articles" and populate the column with links to all 30 articles.
or
2] throughout your website create link references to the articles as part of natural information flow.
or
3] Create a banner or impact logo on the all pages to entice your audience to click and land on dedicated "articles page" Thanks Mark0 -
Archiving a festival website - subdomain or directory?
Hi guys I look after a festival website whose program changes year in and year out. There are a handful of mainstay events in the festival which remain each year, but there are a bunch of other events which change each year around the mainstay programming.This often results in us redoing the website each year (a frustrating experience indeed!) We don't archive our past festivals online, but I'd like to start doing so for a number of reasons 1. These past festivals have historical value - they happened, and they contribute to telling the story of the festival over the years. They can also be used as useful windows into the upcoming festival. 2. The old events (while no longer running) often get many social shares, high quality links and in some instances still drive traffic. We try out best to 301 redirect these high value pages to the new festival website, but it's not always possible to find a similar alternative (so these redirects often go to the homepage) Anyway, I've noticed some festivals archive their content into a subdirectory - i.e. www.event.com/2012 However, I'm thinking it would actually be easier for my team to archive via a subdomain like 2012.event.com - and always use the www.event.com URL for the current year's event. I'm thinking universally redirecting the content would be easier, as would cloning the site / database etc. My question is - is one approach (i.e. directory vs. subdomain) better than the other? Do I need to be mindful of using a subdomain for archival purposes? Hope this all makes sense. Many thanks!
Intermediate & Advanced SEO | | cos20300 -
Recovering from robots.txt error
Hello, A client of mine is going through a bit of a crisis. A developer (at their end) added Disallow: / to the robots.txt file. Luckily the SEOMoz crawl ran a couple of days after this happened and alerted me to the error. The robots.txt file was quickly updated but the client has found the vast majority of their rankings have gone. It took a further 5 days for GWMT to file that the robots.txt file had been updated and since then we have "Fetched as Google" and "Submitted URL and linked pages" in GWMT. In GWMT it is still showing that that vast majority of pages are blocked in the "Blocked URLs" section, although the robots.txt file below it is now ok. I guess what I want to ask is: What else is there that we can do to recover these rankings quickly? What time scales can we expect for recovery? More importantly has anyone had any experience with this sort of situation and is full recovery normal? Thanks in advance!
Intermediate & Advanced SEO | | RikkiD220 -
Using 2 wildcards in the robots.txt file
I have a URL string which I don't want to be indexed. it includes the characters _Q1 ni the middle of the string. So in the robots.txt can I use 2 wildcards in the string to take out all of the URLs with that in it? So something like /_Q1. Will that pickup and block every URL with those characters in the string? Also, this is not directly of the root, but in a secondary directory, so .com/.../_Q1. So do I have to format the robots.txt as //_Q1* as it will be in the second folder or just using /_Q1 will pickup everything no matter what folder it is on? Thanks.
Intermediate & Advanced SEO | | seo1234560