Robot.txt pattern matching
-
Hola fellow SEO peoples!
Site: http://www.sierratradingpost.com
robot: http://www.sierratradingpost.com/robots.txt
Please see the following line: Disallow: /keycodebypid~*
We are trying to block URLs like this:
http://www.sierratradingpost.com/keycodebypid~8855/for-the-home~d~3/kitchen~d~24/
but we still find them in the Google index.
1. we are not sure if we need to specify the robot to use pattern matching.
2. we are not sure if the format is correct. Should we use Disallow: /keycodebypid*/ or /*keycodebypid/ or even /*keycodebypid~/?
What is even more confusing is that the meta robot command line says "noindex" - yet they still show up. <meta name="robots" content="noindex, follow, noarchive" />
Thank you!
-
ok, so not sure sure this was shared. Matt Cutts talking on this same subject.
| | <cite class="kvm">www.youtube.com/watch?v=I2giR-WKUfY</cite> |
-
John, The article was a real eye-opener!Thanks again!
-
Somehow Google is finding these pages, but you're disallowing the Googlebot from reading the page, so it doesn't know anything about the meta noindex tag on the page. If you have meta noindex tags on all of these pages, you can remove that line in your robots.txt preventing bots from reading these pages, and as Google crawls these pages, they should remove them from their SERPs.
-
Great point! I will remember that. However I have both the disallow line in the robots.txt file and I also have the noindex meta command. Yet Google shows 3000 of them!?!?!?!
http://www.google.com/search?q=site%3Awww.sierratradingpost.com+keycodebypid
-
Well done John!!!
-
Hi,
then you have the robots.txt and the meta tag. I think its better the metatag (http://www.seomoz.org/learn-seo/robotstxt)
Have you WebMaster Tools in your web? you can test your robots.txt file (http://www.google.com/support/webmasters/bin/answer.py?answer=156449)
-
Here's a good SEOMoz post about this: http://www.seomoz.org/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts. What's most likely happening is that the disallow in robots.txt is preventing the bots from indexing the page, so they're not going to find the meta noindex tag. If people link to one of these pages externally, the disallow in robots.txt does not prevent the page from appearing in search results.
The robots.txt syntax you're using now looks correct to me for what you're trying to do.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Exclude root url in robots.txt ?
Hi, I have the following setup: www.example.com/nl
Technical SEO | | mikehenze
www.example.com/de
www.example.com/uk
etc
www.example.com is 301'ed to www.example.com/nl But now www.example.com is ranking instead of www.example.com/nl
Should is block www.example.com in robots.txt so only the subfolders are being ranked?
Or will i lose my ranking by doing this.0 -
Staging & Development areas should be not indexable (i.e. no followed/no index in meta robots etc)
Hi I take it if theres a staging or development area on a subdomain for a site, who's content is hence usually duplicate then this should not be indexable i.e. (no-indexed & nofollowed in metarobots) ? In order to prevent dupe content probs as well as non project related people seeing work in progress or finding accidentally in search engine listings ? Also if theres no such info in meta robots is there any other way it may have been made non-indexable, or at least dupe content prob removed by canonicalising the page to the equivalent page on the live site ? In the case in question i am finding it listed in serps when i search for the staging/dev area url, so i presume this needs urgent attention ? Cheers Dan
Technical SEO | | Dan-Lawrence0 -
Robots.txt anomaly
Hi, I'm monitoring a site thats had a new design relaunch and new robots.txt added. Over the period of a week (since launch) webmaster tools has shown a steadily increasing number of blocked urls (now at 14). In the robots.txt file though theres only 12 lines with the disallow command, could this be occurring because a line in the command could refer to more than one page/url ? They all look like single urls for example: Disallow: /wp-content/plugins
Technical SEO | | Dan-Lawrence
Disallow: /wp-content/cache
Disallow: /wp-content/themes etc, etc And is it normal for webmaster tools reporting of robots.txt blocked urls to steadily increase in number over time, as opposed to being identified straight away ? Thanks in advance for any help/advice/clarity why this may be happening ? Cheers Dan0 -
Should I block robots from URLs containing query strings?
I'm about to block off all URLs that have a query string using robots.txt. They're mostly URLs with coremetrics tags and other referrer info. I figured that search engines don't need to see these as they're always better off with the original URL. Might there be any downside to this that I need to consider? Appreciate your help / experiences on this one. Thanks Jenni
Technical SEO | | ShearingsGroup0 -
Use of Robots.txt file on a job site
We are performing SEO on a large niche Job Board. My question revolves around the thought of no following all the actual job postings from their clients as they only last for 30 to 60 days. Anybody have any idea on the best way to handle this?
Technical SEO | | WebTalent0 -
SERP change every few days in a certain pattern
Every 2-3 days, presumably each time we are crawled we go from 2nd for a chosen keyword - to 4th for a chosen keyword. Then 2-3 days later - back to 2nd. It's always the same two listings that come above us, and go back underneath - why would this be?
Technical SEO | | Benj251 -
SeoMoz robot is not able to crawl my website.
Hi, SeoMoz robot crawls only two web pages of my website. I contacts seomoz team and they told me that the problem is because of Javascript use. What is the solution to this? Should I contact my webdesign company and ask them to remove Javascript code?
Technical SEO | | ashish2110 -
Robots.txt File Redirects to Home Page
I've been doing some site analysis for a new SEO client and it has been brought to my attention that their robots.txt file redirects to their homepage. I was wondering: Is there a benfit to setup your robots.txt file to do this? Will this effect how their site will get indexed? Thanks for your response! Kyle Site URL: http://www.radisphere.net/
Technical SEO | | kchandler0