Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies.
Are robots.txt wildcards still valid? If so, what is the proper syntax for setting this up?
-
I've got several URLs that I need to disallow in my robots.txt file. For example, I've got several documents that I don't want indexed, and filters that are getting flagged as duplicate content. Rather than typing in thousands of URLs, I was hoping that wildcards were still valid.
-
Great job. I just wanted to add this from Google Webmasters
http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html
and this from Google Developers
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
-
Yup, wildcard syntax is indeed still valid. However, I can only confirm that the big 3 (Google, Yahoo and Bing) actively observe it. Other, secondary search engines may not.
In your case you are probably looking for syntax along the lines of:
User-agent: *
Disallow: /*.pdf$
This blocks every user agent from any URL that ends in .pdf (the $ ties the match to the end of the URL, so pdf.txt would not be blocked in this case).
Keep an eye on how you write the rules. Rules are matched as prefixes from the start of the path, so missing a trailing slash can block everything that begins with the same characters rather than just the one directory, and leaving off the strict end symbol ($) can block any URL that contains the phrase rather than just the filenames that end in it.
Also keep in mind that if you are using URL rewriting, this may play into how you need to block things: crawlers only see the public (rewritten) URLs, so those are what your rules need to match. Remember too that disallowing access in robots.txt does NOT prevent search engines from indexing the data. It is up to each crawler whether it honors the request, and a disallowed URL can still end up in the index if other pages link to it. So if it is very important to keep a file away from search engines, robots.txt may not be the way to do it.
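To make the effect of * and $ concrete, here is a minimal Python sketch of how a wildcard-aware crawler might test a path against a Disallow pattern. It follows the matching behaviour Google documents (* matches any run of characters, a trailing $ anchors the end, everything else is a literal prefix match). The function names, the /private paths and the filter parameter are made up for illustration, and this is not any search engine's actual code. Python's built-in urllib.robotparser is deliberately not used here, since it does plain prefix matching and does not interpret these wildcards.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    Hypothetical helper: '*' becomes "any run of characters", a trailing
    '$' anchors the end of the URL, and all other characters are literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_blocked(pattern: str, path: str) -> bool:
    """Return True if `path` falls under the Disallow `pattern`."""
    # re.match anchors at the start of the path, like a robots.txt prefix rule.
    return robots_pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    examples = [
        ("/*.pdf$", "/docs/report.pdf"),        # ends in .pdf
        ("/*.pdf$", "/docs/report.pdf.txt"),    # .pdf is not at the end
        ("/*.pdf", "/docs/report.pdf.txt"),     # no $, so .pdf anywhere matches
        ("/private", "/private-files/a.html"),  # no trailing slash: prefix match
        ("/private/", "/private-files/a.html"), # trailing slash: directory only
        ("/*?filter=", "/shoes?filter=red"),    # hypothetical filter parameter
    ]
    for pattern, path in examples:
        print(f"{pattern!r:14} {path!r:26} blocked={is_blocked(pattern, path)}")
```

Running the script prints one line per pattern and path pair, which is a quick way to sanity-check a rule before putting it in a live robots.txt. Each search engine's own testing tool remains the authority on how that engine reads the file.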