Robots.txt assistance
-
I want to block all of the inner archive news pages of my website in robots.txt. We don't have the R&D capacity to set up rel=next/prev or to create a central page that all of the inner pages would canonicalize back to, so this is the solution.
The first page I want indexed reads:
http://www.xxxx.news/?p=1
All subsequent pages, which I want blocked because they don't contain any new content, read:
http://www.xxxx.news/?p=2
http://www.xxxx.news/?p=3
etc.
There are currently 245 inner archive pages, and I would like to set this up so that future pages are automatically blocked, since we are always writing new news pieces. Any advice about what code I should use for this?
Thanks!
-
Thanks for all the input and advice!
We are a gaming site that publishes industry news 2-3 times a week, but that is not our main source of income.
-
"I mentioned at the end that being a content site and if that generates revenue that they should consider investing some money in that direction"
Absolutely.
-
Thanks Andy. I did see that, and that is why I mentioned at the end that, as a content site, if the content generates revenue they should consider investing some money in that direction.
If they are short on money/resources/capacity, and the robots.txt solution could actually negatively impact indexation of the content that is producing/justifying the current level of money/resources/capacity, they could end up in a worse position than where they started, i.e. having even less money/resources/capacity.
-
If you read the original post again, Sara says "we don't have R&D capacity".
They wouldn't be able to do all this.
-Andy
-
I think you are missing something here if you want to get these pages out of the index. Plus, your use of robots.txt may harm how Google finds and ranks your actual news items.
First, you have to add the noindex meta tag to pages 2-N in your pagination. Let Google crawl them and take them out of the index.
If you just add them to robots.txt, Google will not crawl them, but it will also not remove them from the index.
Once you get them out of the index, keeping those tags in place will prevent re-indexation, and you don't have to add them to robots.txt.
More importantly, you want pages 2-N to be spidered but not indexed. You want Google to crawl your paginated pages to find all of your deep content. Otherwise, unless you have an XML or HTML sitemap, or some other crawlable navigational aid, you are actually preventing Google from crawling and then ranking your content.
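As a reference, the tag itself is just a single line in the head of each paginated page (pages 2-N only, not page 1); exactly where you add it depends on your CMS:
<meta name="robots" content="noindex, follow">
The "follow" part keeps Google passing through the links on those pages to your individual articles even though the pages themselves drop out of the index.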
Read this Moz post:
http://moz.com/learn/seo/robotstxt
There is a section titled "Why Meta Robots is Better than Robots.txt" that will confirm my points.
Lastly, step back a second. If you are a news/content site, this helps you generate revenue, you have a bunch of news pages, and this is important content, then spend some money on development to implement rel=next/prev. It is worth it to get Google crawling your stuff properly.
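For reference, rel=next/prev is just a pair of link tags in the head of each paginated page; using your ?p= URLs as placeholders, page 2 would carry something like:
<link rel="prev" href="http://www.xxxx.news/?p=1">
<link rel="next" href="http://www.xxxx.news/?p=3">
with each page pointing at its neighbours (the first page only needs rel=next and the last page only rel=prev), so the development work is mostly templating rather than anything exotic.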
Good luck!
-
Definitely something to test. I'm not sure of the rules that Google will apply with this and which way round works.
-Andy
-
I think it has to be the other way around: Disallow: /?p=* then Allow: /?p=1, as you want to first disallow everything with the p parameter and then allow the first page. You should test it, but I think with Andy's example you will still block the first page you've just allowed.
-
I haven't actually done this myself, but I suspect that pattern matching is your solution here.
However, what you want to be able to do is disallow the whole pattern and then allow just the first page:
Allow: /?p=1
Disallow: /?p=*
The thing I don't have the answer to is whether this will work by first allowing page 1 and then blocking all the others. I don't have a proven method for blocking this via robots.txt, as this is normally handled with the other solutions you mention.
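A rough, untested sketch of what I mean, using the $ end-of-URL marker that Google supports so the Allow only matches page 1 and not /?p=10, /?p=100 and so on:
User-agent: *
Allow: /?p=1$
Disallow: /?p=*
Google's documentation suggests it goes by the most specific (longest) matching rule rather than the order of the lines, which should let the Allow win for page 1, but I'd still verify it rather than take my word for it.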
You can test it through Webmaster Tools:
https://support.google.com/webmasters/answer/156449?hl=en
- On the Webmaster Tools Home page, click the site you want.
- Under Crawl, click Blocked URLs.
- If it's not already selected, click the **Test robots.txt** tab.
- Copy the content of your robots.txt file, and paste it into the first box.
- In the URLs box, list the site to test against.
- In the User-agents list, select the user-agents you want.
-Andy