Robots.txt assistance
-
I want to block all the inner archive news pages of my website in robots.txt - we don't have R&D capacity to set up rel=next/prev or create a central page that all inner pages would have a canonical back to, so this is the solution.
The first page I want indexed reads:
http://www.xxxx.news/?p=1all subsequent pages that I want blocked because they don't contain any new content read:
http://www.xxxx.news/?p=2
http://www.xxxx.news/?p=3
etc....There are currently 245 inner archived pages and I would like to set it up so that future pages will automatically be blocked since we are always writing new news pieces. Any advice about what code I should use for this?
Thanks!
-
Thanks for all the input and advice!
We are a gaming site that publishes industry news 2-3 times a week, but that is not our main source of income
-
"I mentioned at the end that being a content site and if that generates revenue that they should consider investing some money in that direction"
Absolutely.
-
Thanks Andy. I did see that and that is why I mentioned at the end that being a content site and if that generates revenue that they should consider investing some money in that direction.
If they are short on money/resources/capacity and the robots.txt solution could actually negatively impact indexation of content that is producing/justifying the current level of money/resources/capacity they could end up in worse position than where they started, i.e. having less money/resources/capacity.
-
If you read the original post again, Sara says "we don't have R&D capacity".
They wouldn't be able to do all this.
-Andy
-
I think you are missing something here if you want to get these pages out of the index. Plus, your use of Robots may harm how Google finds and ranks your actual news items.
First, you have to add the noindex meta tag to pages 2-N in your pagination. Let Google crawl them and take them out of the index.
If you just add them to robots.txt, Google will not crawl, but will also not remove them from the index.
Once you get them out of the index, keeping those tags in place will prevent reindexation and you don't have to add them to Robots.txt.
More importantly, you want pages 2-N being spidered but not indexed. You want Google to crawl your paginated pages to find all of your deep content. Otherwise, unless you have a XML or HTML sitemap, or some other crawlable navigational aid, you are actually preventing Google from crawling and then ranking your content.
Read this Moz post
http://moz.com/learn/seo/robotstxt
There is a section titled "Why Meta Robots is Better than Robots.txt" that will confirm my points.
Lastly. Step back a second. If you are a news/content site and this helps you to generate revenue, and you have a bunch of news pages, and this is important content, spend some money on Development to implement the rel=next/prev. It is worth it to get Google crawling your stuff properly.
Good luck!
-
Definitely something to test. I'm not sure of the rules that Google will apply with this and which way round works.
-Andy
-
I think it has to be the other way around: Disallow: /?p=* Allow: /?p=1 as you want to first disallow everything with the P parameter but then allow the first page. You should test it but I think in Andy's example you will still block the first page which you've just allowed.
-
I haven't actually done this myself, but I suspect that pattern matching is your solution here.
However, what you want to be able to do is disallow the whole pattern and then allow just the first page:
Allow: /?p=1 Disallow: /?p=*
The thing I don't have the answer to, is if this will work by first allowing the page 1, and then blocking all others. I don't have a method for this in blocking via robots as this is normally handed with other solutions you mention.
You can try it though through Webmaster tools:
https://support.google.com/webmasters/answer/156449?hl=en- On the Webmaster Tools Home page, click the site you want.
- Under Crawl, click Blocked URLs.
- If it's not already selected, click the** Test robots.txt** tab.
- Copy the content of your robots.txt file, and paste it into the first box.
- In the URLs box, list the site to test against.
- In the User-agents list, select the user-agents you want.
-Andy
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Google robots.txt test - not picking up syntax errors?
I just ran a robots.txt file through "Google robots.txt Tester" as there was some unusual syntax in the file that didn't make any sense to me... e.g. /url/?*
Intermediate & Advanced SEO | | McTaggart
/url/?
/url/* and so on. I would use ? and not ? for example and what is ? for! - etc. Yet "Google robots.txt Tester" did not highlight the issues... I then fed the sitemap through http://www.searchenginepromotionhelp.com/m/robots-text-tester/robots-checker.php and that tool actually picked up my concerns. Can anybody explain why Google didn't - or perhaps it isn't supposed to pick up such errors? Thanks, Luke0 -
Large robots.txt file
We're looking at potentially creating a robots.txt with 1450 lines in it. This will remove 100k+ pages from the crawl that are all old pages (I know, the ideal would be to delete/noindex but not viable unfortunately) Now the issue i'm thinking is that a large robots.txt will either stop the robots.txt from being followed or will slow our crawl rate down. Does anybody have any experience with a robots.txt of that size?
Intermediate & Advanced SEO | | ThomasHarvey0 -
Question about Syntax in Robots.txt
So if I want to block any URL from being indexed that contains a particular parameter what is the best way to put this in the robots.txt file? Currently I have-
Intermediate & Advanced SEO | | DRSearchEngOpt
Disallow: /attachment_id Where "attachment_id" is the parameter. Problem is I still see these URL's indexed and this has been in the robots now for over a month. I am wondering if I should just do Disallow: attachment_id or Disallow: attachment_id= but figured I would ask you guys first. Thanks!0 -
Meta canonical or simply robots.txt other domain names with same content?
Hi, I'm working with a new client who has a main product website. This client has representatives who also sells the same products but all those reps have a copy of the same website on another domain name. The best thing would probably be to shut down the other (same) websites and redirect 301 them to the main, but that's impossible in the minding of the client. First choice : Implement a conical meta for all the URL on all the other domain names. Second choice : Robots.txt with disallow for all the other websites. Third choice : I'm really open to other suggestions 😉 Thank you very much! 🙂
Intermediate & Advanced SEO | | Louis-Philippe_Dea0 -
Panda Updates - robots.txt or noindex?
Hi, I have a site that I believe has been impacted by the recent Panda updates. Assuming that Google has crawled and indexed several thousand pages that are essentially the same and the site has now passed the threshold to be picked out by the Panda update, what is the best way to proceed? Is it enough to block the pages from being crawled in the future using robots.txt, or would I need to remove the pages from the index using the meta noindex tag? Of course if I block the URLs with robots.txt then Googlebot won't be able to access the page in order to see the noindex tag. Anyone have and previous experiences of doing something similar? Thanks very much.
Intermediate & Advanced SEO | | ianmcintosh0 -
Will disallowing in robots.txt noindex a page?
Google has indexed a page I wish to remove. I would like to meta noindex but the CMS isn't allowing me too right now. A suggestion o disallow in robots.txt would simply stop them crawling I expect or is it also an instruction to noindex? Thanks
Intermediate & Advanced SEO | | Brocberry0 -
Reciprocal Links and nofollow/noindex/robots.txt
Hypothetical Situations: You get a guest post on another blog and it offers a great link back to your website. You want to tell your readers about it, but linking the post will turn that link into a reciprocal link instead of a one way link, which presumably has more value. Should you nofollow your link to the guest post? My intuition here, and the answer that I expect, is that if it's good for users, the link belongs there, and as such there is no trouble with linking to the post. Is this the right way to think about it? Would grey hats agree? You're working for a small local business and you want to explore some reciprocal link opportunities with other companies in your niche using a "links" page you created on your domain. You decide to get sneaky and either noindex your links page, block the links page with robots.txt, or nofollow the links on the page. What is the best practice? My intuition here, and the answer that I expect, is that this would be a sneaky practice, and could lead to bad blood with the people you're exchanging links with. Would these tactics even be effective in turning a reciprocal link into a one-way link if you could overlook the potential immorality of the practice? Would grey hats agree?
Intermediate & Advanced SEO | | AnthonyMangia0 -
Robots.txt: Link Juice vs. Crawl Budget vs. Content 'Depth'
I run a quality vertical search engine. About 6 months ago we had a problem with our sitemaps, which resulted in most of our pages getting tossed out of Google's index. As part of the response, we put a bunch of robots.txt restrictions in place in our search results to prevent Google from crawling through pagination links and other parameter based variants of our results (sort order, etc). The idea was to 'preserve crawl budget' in order to speed the rate at which Google could get our millions of pages back in the index by focusing attention/resources on the right pages. The pages are back in the index now (and have been for a while), and the restrictions have stayed in place since that time. But, in doing a little SEOMoz reading this morning, I came to wonder whether that approach may now be harming us... http://www.seomoz.org/blog/restricting-robot-access-for-improved-seo
Intermediate & Advanced SEO | | kurus
http://www.seomoz.org/blog/serious-robotstxt-misuse-high-impact-solutions Specifically, I'm concerned that a) we're blocking the flow of link juice and that b) by preventing Google from crawling the full depth of our search results (i.e. pages >1), we may be making our site wrongfully look 'thin'. With respect to b), we've been hit by Panda and have been implementing plenty of changes to improve engagement, eliminate inadvertently low quality pages, etc, but we have yet to find 'the fix'... Thoughts? Kurus0