Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Are robots.txt wildcards still valid? If so, what is the proper syntax for setting this up?
-
I've got several URL's that I need to disallow in my robots.txt file. For example, I've got several documents that I don't want indexed and filters that are getting flagged as duplicate content. Rather than typing in thousands of URL's I was hoping that wildcards were still valid.
-
Great job. I just wanted to add this from Google Webmasters
http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html
and this from Google Developers
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
-
Yup wildcard syntax is indeed still valid. However I can only confirm that the big 3 (Google, Yahoo and Bing) actively observe it. Other secondary search engines may not.
In your case you are probably looking for a syntax along the lines of:
User-agent: *
Disallow: /*.pdf$ This would set that any user agent should be blocked from any file name that ends in .pdf (a $ ties it to the end so pdf.txt would not be blocked in this case)Keep an eye on how you block them. Missing a trailing slash could block a directory rather than a file, or not appending a strict symbol ($) could mean that phrases throughout a directory could be blocked rather than just a filename.
Also keep in mind if you are using URL re-writing this may play into how you need to block things; and you may also want to remember that disallowing access in a robot.txt does NOT prevent search engines from indexing the data, it is up to them if they honor the request. So if it is very important to block the file access from search engines then robots.txt may not be the way to do it.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Robots.txt allows wp-admin/admin-ajax.php
Hello, Mozzers!
Technical SEO | | AndyKubrin
I noticed something peculiar in the robots.txt used by one of my clients: Allow: /wp-admin/admin-ajax.php What would be the purpose of allowing a search engine to crawl this file?
Is it OK? Should I do something about it?
Everything else on /wp-admin/ is disallowed.
Thanks in advance for your help.
-AK:2 -
Can I still monitor noindex, nofollow pages with Google Analytics?
I have a private/login site where all pages are noindex, nofollow. Can I still monitor external site links with Google Analytics?
Technical SEO | | jasmine.silver0 -
Robot.txt : How to block a specific file type in several subdirectories ?
Hello everyone ! I need help setting up a robot.txt. I'm trying to block all pdf files in particular directories so I'm using this command. In the example below the line is blocking all .gif in the entire site. Block files of a specific file type (for example, .gif) | Disallow: /*.gif$ 2 questions : Can I use this command to specify one particular directory in which I want to block pdf files ? Will this line be recognized by googlebots ? Disallow: /fileadmin/xxxxxxx/xxx/xxxxxxx/*.pdf$ Then I realized that I would have to write as many lines as many directories there are in which I want to block pdf files. Let's say I want to block pdf files in all these 3 directories /fileadmin/directory1 /fileadmin/directory1/sub1 /fileadmin/directory1/sub1/pdf Is there a pattern-matching rule I could use to blocks access to pdf files in all subdirectories instead of writing 3x the above line for each subdirectory ? For exemple : Disallow: /fileadmin/directory1*/ Many thanks in advance for any insight you may have.
Technical SEO | | LabeliumUSA0 -
How can I stop a tracking link from being indexed while still passing link equity?
I have a marketing campaign landing page and it uses a tracking URL to track clicks. The tracking links look something like this: http://this-is-the-origin-url.com/clkn/http/destination-url.com/ The problem is that Google is indexing these links as pages in the SERPs. Of course when they get indexed and then clicked, they show a 400 error because the /clkn/ link doesn't represent an actual page with content on it. The tracking link is set up to instantly 301 redirect to http://destination-url.com. Right now my dev team has blocked these links from crawlers by adding Disallow: /clkn/ in the robots.txt file, however, this blocks the flow of link equity to the destination page. How can I stop these links from being indexed without blocking the flow of link equity to the destination URL?
Technical SEO | | UnbounceVan0 -
Removed Subdomain Sites Still in Google Index
Hey guys, I've got kind of a strange situation going on and I can't seem to find it addressed anywhere. I have a site that at one point had several development sites set up at subdomains. Those sites have since launched on their own domains, but the subdomain sites are still showing up in the Google index. However, if you look at the cached version of pages on these non-existent subdomains, it lists the NEW url, not the dev one in the little blurb that says "This is Google's cached version of www.correcturl.com." Clearly Google recognizes that the content resides at the new location, so how come the old pages are still in the index? Attempting to visit one of them gives a "Server Not Found" error, so they are definitely gone. This is happening to a couple of sites, one that was launched over a year ago so it doesn't appear to be a "wait and see" solution. Any suggestions would be a huge help. Thanks!!
Technical SEO | | SarahLK0 -
Robots.txt on http vs. https
We recently changed our domain from http to https. When a user enters any URL on http, there is an global 301 redirect to the same page on https. I cannot find instructions about what to do with robots.txt. Now that https is the canonical version, should I block the http-Version with robots.txt? Strangely, I cannot find a single ressource about this...
Technical SEO | | zeepartner0 -
Should I block Map pages with robots.txt?
Hello, I have a website that was started in 1999. On the website I have map pages for each of the offices listed on my site, for which there are about 120. Each of the 120 maps is in a whole separate html page. There is no content in the page other than the map. I know all of the offices love having the map pages so I don't want to remove the pages. So, my question is would these pages with no real content be hurting the rankings of the other pages on our site? Therefore, should I block the pages with my robots.txt? Would I also have to remove these pages (in webmaster tools?) from Google for blocking by robots.txt to really work? I appreciate your feedback, thanks!
Technical SEO | | imaginex0 -
How to create a delayed 301 redirect that still passes juice?
My company is merging one of our sites into another site. At first I was just going to create a 301 redirect from domainA.com to domainB.com but we decided that would be too confusing for customers expecting to see domainA.com so we want to create a page that says something like "We've moved. please visit domainB.com or be redirected after 10 seconds". My question is, how do I create a redirect that has a delay and will this still pass the same amount of juice that a regular 301 redirect would? I've heard that meta refreshes are considered spammy by Google.
Technical SEO | | bewoldt0