Robots.txt pattern matching
-
Hola fellow SEO peoples!
Site: http://www.sierratradingpost.com
robots.txt: http://www.sierratradingpost.com/robots.txt
Please see the following line: Disallow: /keycodebypid~*
We are trying to block URLs like this:
http://www.sierratradingpost.com/keycodebypid~8855/for-the-home~d~3/kitchen~d~24/
but we still find them in the Google index.
1. We are not sure whether we need to tell the robot explicitly to use pattern matching.
2. We are not sure whether the format is correct. Should we use Disallow: /keycodebypid*/ or /*keycodebypid/ or even /*keycodebypid~/?
What is even more confusing is that the meta robots tag on these pages says "noindex", yet they still show up: <meta name="robots" content="noindex, follow, noarchive" />
Thank you!
-
OK, not sure whether this was shared already: Matt Cutts talking about this same subject.
www.youtube.com/watch?v=I2giR-WKUfY
-
John, the article was a real eye-opener! Thanks again!
-
Somehow Google is finding these pages, but you're disallowing Googlebot from reading them, so it doesn't know anything about the meta noindex tag on each page. If you have meta noindex tags on all of these pages, you can remove the line in your robots.txt that prevents bots from reading them, and as Google crawls these pages, it should remove them from its SERPs.
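To make that interaction concrete, here is a rough Python sketch. It is only an illustration of the reasoning above, not how any search engine is actually implemented, and the names `MetaRobotsParser`, `meta_robots_directives` and `noindex_visible_to_crawler` are made up for this example. It reads the meta robots tag out of a page and shows that a crawler blocked by robots.txt never gets to see the noindex directive.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def meta_robots_directives(html):
    """Return the set of meta robots directives found in an HTML string."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


def noindex_visible_to_crawler(html, blocked_by_robots_txt):
    """A crawler that is disallowed from fetching the page never reads its HTML,
    so it cannot honor a noindex tag placed there (simplified model)."""
    if blocked_by_robots_txt:
        return False
    return "noindex" in meta_robots_directives(html)


page = '<html><head><meta name="robots" content="noindex, follow, noarchive" /></head></html>'
print(meta_robots_directives(page))
print(noindex_visible_to_crawler(page, blocked_by_robots_txt=True))
print(noindex_visible_to_crawler(page, blocked_by_robots_txt=False))
```

With the Disallow rule in place the noindex tag stays invisible to the crawler; once the rule is removed, the tag can be read on the next crawl.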
-
Great point! I will remember that. However, I have both the Disallow line in the robots.txt file and the noindex meta tag, yet Google shows 3,000 of them!?
http://www.google.com/search?q=site%3Awww.sierratradingpost.com+keycodebypid
-
Well done John!!!
-
Hi,
So you have both the robots.txt rule and the meta tag. I think the meta tag is the better option (http://www.seomoz.org/learn-seo/robotstxt).
Do you have Webmaster Tools set up for your site? You can test your robots.txt file there (http://www.google.com/support/webmasters/bin/answer.py?answer=156449).
-
Here's a good SEOMoz post about this: http://www.seomoz.org/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts. What's most likely happening is that the disallow in robots.txt is preventing the bots from crawling the page, so they're never going to see the meta noindex tag. And if people link to one of these pages externally, the disallow in robots.txt does not prevent the URL from appearing in search results.
The robots.txt syntax you're using now looks correct to me for what you're trying to do.
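If you want to sanity-check the candidate patterns yourself, here is a small Python sketch. It assumes Google-style wildcard semantics (`*` matches any sequence of characters, a trailing `$` anchors the end of the URL, and a rule otherwise matches as a prefix of the path); other crawlers may behave differently. The helper names `robots_pattern_to_regex` and `is_blocked` are hypothetical and exist only for this example.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex,
    assuming '*' is a wildcard and a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal chunk and join the chunks with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path, pattern):
    """True if the URL path is matched by the Disallow pattern."""
    return robots_pattern_to_regex(pattern).match(path) is not None


path = "/keycodebypid~8855/for-the-home~d~3/kitchen~d~24/"
candidates = [
    "/keycodebypid~*",
    "/keycodebypid*/",
    "/*keycodebypid/",
    "/*keycodebypid~/",
]
for pattern in candidates:
    print(pattern, is_blocked(path, pattern))
```

Under these assumptions, `/keycodebypid~*` and `/keycodebypid*/` both match the example URL, while `/*keycodebypid/` and `/*keycodebypid~/` do not, because the URL has `~8855` between `keycodebypid` and the next slash. Since rules match as prefixes, the trailing `*` in `/keycodebypid~*` is redundant but harmless.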