Robots.txt assistance
-
I want to block all the inner archive news pages of my website in robots.txt - we don't have R&D capacity to set up rel=next/prev or create a central page that all inner pages would have a canonical back to, so this is the solution.
The first page I want indexed reads:
http://www.xxxx.news/?p=1
All subsequent pages that I want blocked because they don't contain any new content read:
http://www.xxxx.news/?p=2
http://www.xxxx.news/?p=3
etc.
There are currently 245 inner archived pages, and I would like to set it up so that future pages are automatically blocked, since we are always writing new news pieces. Any advice about what code I should use for this?
Thanks!
-
Thanks for all the input and advice!
We are a gaming site that publishes industry news 2-3 times a week, but that is not our main source of income.
-
"I mentioned at the end that being a content site and if that generates revenue that they should consider investing some money in that direction"
Absolutely.
-
Thanks Andy. I did see that and that is why I mentioned at the end that being a content site and if that generates revenue that they should consider investing some money in that direction.
If they are short on money/resources/capacity and the robots.txt solution could actually negatively impact indexation of the content that is producing/justifying the current level of money/resources/capacity, they could end up in a worse position than where they started, i.e. having less money/resources/capacity.
-
If you read the original post again, Sara says "we don't have R&D capacity".
They wouldn't be able to do all this.
-Andy
-
I think you are missing something here if you want to get these pages out of the index. Plus, your use of Robots may harm how Google finds and ranks your actual news items.
First, you have to add the noindex meta tag to pages 2-N in your pagination. Let Google crawl them and take them out of the index.
If you just add them to robots.txt, Google will not crawl them, but it will also not remove them from the index.
Once you get them out of the index, keeping those tags in place will prevent reindexation and you don't have to add them to Robots.txt.
More importantly, you want pages 2-N to be spidered but not indexed. You want Google to crawl your paginated pages to find all of your deep content. Otherwise, unless you have an XML or HTML sitemap, or some other crawlable navigational aid, you are actually preventing Google from crawling and then ranking your content.
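A minimal sketch of the tagging logic described above, in Python. The function name `robots_meta_tag` is hypothetical and not taken from any CMS; it assumes the archive template can read the `p` value and print the returned tag inside `<head>`. Page 1 stays indexable, and pages 2-N get `noindex, follow` so they can still be crawled for links while being kept out of the index.

```python
def robots_meta_tag(page: int) -> str:
    """Return the robots meta tag for archive page `page` (the ?p= value).

    Hypothetical helper for illustration: page 1 is indexable, every later
    page is crawlable but marked noindex.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page == 1:
        return '<meta name="robots" content="index, follow">'
    # Pages 2-N: let crawlers follow the links, but keep the page out of the index.
    return '<meta name="robots" content="noindex, follow">'


if __name__ == "__main__":
    for page in (1, 2, 245, 246):
        print(page, robots_meta_tag(page))
```

Because the rule depends only on the page number, archive pages added in the future would be covered without any further changes.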
Read this Moz post
http://moz.com/learn/seo/robotstxt
There is a section titled "Why Meta Robots is Better than Robots.txt" that will confirm my points.
Lastly. Step back a second. If you are a news/content site and this helps you to generate revenue, and you have a bunch of news pages, and this is important content, spend some money on Development to implement the rel=next/prev. It is worth it to get Google crawling your stuff properly.
Good luck!
-
Definitely something to test. I'm not sure of the rules that Google will apply with this and which way round works.
-Andy
-
I think it has to be the other way around:
Disallow: /?p=*
Allow: /?p=1
You want to first disallow everything with the p parameter and then allow the first page. You should test it, but I think Andy's example will still block the first page, which you've just allowed.
-
I haven't actually done this myself, but I suspect that pattern matching is your solution here.
However, what you want to be able to do is disallow the whole pattern and then allow just the first page:
Allow: /?p=1
Disallow: /?p=*
What I don't have the answer to is whether this will work by first allowing page 1 and then blocking all the others. I don't have a method for blocking this via robots.txt, as this is normally handled with the other solutions you mention.
You can try it, though, through Webmaster Tools:
https://support.google.com/webmasters/answer/156449?hl=en
- On the Webmaster Tools Home page, click the site you want.
- Under Crawl, click Blocked URLs.
- If it's not already selected, click the **Test robots.txt** tab.
- Copy the content of your robots.txt file, and paste it into the first box.
- In the URLs box, list the site to test against.
- In the User-agents list, select the user-agents you want.
-Andy