Robot.txt pattern matching
-
Hola fellow SEO peoples!
Site: http://www.sierratradingpost.com
robot: http://www.sierratradingpost.com/robots.txt
Please see the following line: Disallow: /keycodebypid~*
We are trying to block URLs like this:
http://www.sierratradingpost.com/keycodebypid~8855/for-the-home~d~3/kitchen~d~24/
but we still find them in the Google index.
1. we are not sure if we need to specify the robot to use pattern matching.
2. we are not sure if the format is correct. Should we use Disallow: /keycodebypid*/ or /*keycodebypid/ or even /*keycodebypid~/?
What is even more confusing is that the meta robot command line says "noindex" - yet they still show up. <meta name="robots" content="noindex, follow, noarchive" />
Thank you!
-
ok, so not sure sure this was shared. Matt Cutts talking on this same subject.
| | <cite class="kvm">www.youtube.com/watch?v=I2giR-WKUfY</cite> |
-
John, The article was a real eye-opener!Thanks again!
-
Somehow Google is finding these pages, but you're disallowing the Googlebot from reading the page, so it doesn't know anything about the meta noindex tag on the page. If you have meta noindex tags on all of these pages, you can remove that line in your robots.txt preventing bots from reading these pages, and as Google crawls these pages, they should remove them from their SERPs.
-
Great point! I will remember that. However I have both the disallow line in the robots.txt file and I also have the noindex meta command. Yet Google shows 3000 of them!?!?!?!
http://www.google.com/search?q=site%3Awww.sierratradingpost.com+keycodebypid
-
Well done John!!!
-
Hi,
then you have the robots.txt and the meta tag. I think its better the metatag (http://www.seomoz.org/learn-seo/robotstxt)
Have you WebMaster Tools in your web? you can test your robots.txt file (http://www.google.com/support/webmasters/bin/answer.py?answer=156449)
-
Here's a good SEOMoz post about this: http://www.seomoz.org/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts. What's most likely happening is that the disallow in robots.txt is preventing the bots from indexing the page, so they're not going to find the meta noindex tag. If people link to one of these pages externally, the disallow in robots.txt does not prevent the page from appearing in search results.
The robots.txt syntax you're using now looks correct to me for what you're trying to do.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Will a robots.txt disallow apply to a 301ed URL?
Hi there, I have a robots.txt query which I haven't tried before and as we're nearing a big time for sales, I'm hesitant to just roll out to live! Say for example, in my robots.txt I disallow the URL 'example1.html'. In reality, 'example1.html' 301s/302s to 'example2.html'. Would the robots.txt directive also apply to 'example2.html' (disallow) or as it's a separate URL, would the directive be ignored as it's not valid? I have a feeling that as it's a separate URL, the robots disallow directive won't apply. However, just thought I'd sense-check with the community.
Technical SEO | | ecommercebc0 -
Robots.txt & Mobile Site
Background - Our mobile site is on the same domain as our main site. We use a folder approach for our mobile site abc.com/m/home.html We are re-directing traffic to our mobile site vie device detection and re-direction exists for a handful of pages of our site ie most of our pages do not redirect the user to a mobile equivalent page. Issue – Our mobile pages are being indexed in desktop Google searches Input Required – How should we modify our robots.txt so that the desktop google index does not index our mobile pages/urls User-agent: Googlebot-Mobile Disallow: /m User-agent: `YahooSeeker/M1A1-R2D2` Disallow: /m User-agent: `MSNBOT_Mobile` Disallow: /m Many thanks
Technical SEO | | CeeC-Blogger0 -
Is having no robots.txt file the same as having one and allowing all agents?
The site I am working on currently has no robots.txt file. However, I have just uploaded a sitemap and would like to point the robots.txt file to it. Once I upload the robots.txt file, if I allow access to all agents, is this the same as when the site had no robots.txt file at all; do I need to specify crawler access on can the robots.txt file just contain the link to the sitemap?
Technical SEO | | pugh0 -
What does it mean by 'blocked by Meta Robot'? How do I fix this?
When i get my crawl diagnostics, I am getting a blocked by Meta Robot, which means that my page is not being indexed in the search engines... obviously this is a major issue for organic traffic!!! What does it actually mean, and how can i fix it?
Technical SEO | | rolls1230 -
Robots.txt question
Hello, What does the following command mean - User-agent: * Allow: / Does it mean that we are blocking all spiders ? Is Allow supported in robots.txt ? Thanks
Technical SEO | | seoug_20050 -
Invisible robots.txt?
So here's a weird one... Client comes to me for some simple changes, turns out there are some major issues with the site, one of which is that none of the correct content pages are showing up in Google, just ancillary (outdated) ones. Looks like an issue because even the main homepage isn't showing up with a "site:domain.com" So, I add to Webmaster Tools and, after an hour or so, I get the red bar of doom, "robots.txt is blocking important pages." I check it out in Webmasters and, sure enough, it's a "User agent: * Disallow /" ACK! But wait... there's no robots.txt to be found on the server. I can go to domain.com/robots.txt and see it but nothing via FTP. I upload a new one and, thankfully, that is now showing but I've never seen that before. Question is: can a robots.txt file be stored in a way that can't be seen? Thanks!
Technical SEO | | joshcanhelp0 -
Robots.txt
Hi there, My question relates to the robots.txt file. This statement: /*/trackback Would this block domain.com/trackback and domain.com/fred/trackback ? Peter
Technical SEO | | PeterM220 -
SeoMoz robot is not able to crawl my website.
Hi, SeoMoz robot crawls only two web pages of my website. I contacts seomoz team and they told me that the problem is because of Javascript use. What is the solution to this? Should I contact my webdesign company and ask them to remove Javascript code?
Technical SEO | | ashish2110