Robots.txt assistance
-
I want to block all the inner archive news pages of my website in robots.txt - we don't have R&D capacity to set up rel=next/prev or create a central page that all inner pages would have a canonical back to, so this is the solution.
The first page I want indexed reads:
http://www.xxxx.news/?p=1all subsequent pages that I want blocked because they don't contain any new content read:
http://www.xxxx.news/?p=2
http://www.xxxx.news/?p=3
etc....There are currently 245 inner archived pages and I would like to set it up so that future pages will automatically be blocked since we are always writing new news pieces. Any advice about what code I should use for this?
Thanks!
-
Thanks for all the input and advice!
We are a gaming site that publishes industry news 2-3 times a week, but that is not our main source of income
-
"I mentioned at the end that being a content site and if that generates revenue that they should consider investing some money in that direction"
Absolutely.
-
Thanks Andy. I did see that and that is why I mentioned at the end that being a content site and if that generates revenue that they should consider investing some money in that direction.
If they are short on money/resources/capacity and the robots.txt solution could actually negatively impact indexation of content that is producing/justifying the current level of money/resources/capacity they could end up in worse position than where they started, i.e. having less money/resources/capacity.
-
If you read the original post again, Sara says "we don't have R&D capacity".
They wouldn't be able to do all this.
-Andy
-
I think you are missing something here if you want to get these pages out of the index. Plus, your use of Robots may harm how Google finds and ranks your actual news items.
First, you have to add the noindex meta tag to pages 2-N in your pagination. Let Google crawl them and take them out of the index.
If you just add them to robots.txt, Google will not crawl, but will also not remove them from the index.
Once you get them out of the index, keeping those tags in place will prevent reindexation and you don't have to add them to Robots.txt.
More importantly, you want pages 2-N being spidered but not indexed. You want Google to crawl your paginated pages to find all of your deep content. Otherwise, unless you have a XML or HTML sitemap, or some other crawlable navigational aid, you are actually preventing Google from crawling and then ranking your content.
Read this Moz post
http://moz.com/learn/seo/robotstxt
There is a section titled "Why Meta Robots is Better than Robots.txt" that will confirm my points.
Lastly. Step back a second. If you are a news/content site and this helps you to generate revenue, and you have a bunch of news pages, and this is important content, spend some money on Development to implement the rel=next/prev. It is worth it to get Google crawling your stuff properly.
Good luck!
-
Definitely something to test. I'm not sure of the rules that Google will apply with this and which way round works.
-Andy
-
I think it has to be the other way around: Disallow: /?p=* Allow: /?p=1 as you want to first disallow everything with the P parameter but then allow the first page. You should test it but I think in Andy's example you will still block the first page which you've just allowed.
-
I haven't actually done this myself, but I suspect that pattern matching is your solution here.
However, what you want to be able to do is disallow the whole pattern and then allow just the first page:
Allow: /?p=1 Disallow: /?p=*
The thing I don't have the answer to, is if this will work by first allowing the page 1, and then blocking all others. I don't have a method for this in blocking via robots as this is normally handed with other solutions you mention.
You can try it though through Webmaster tools:
https://support.google.com/webmasters/answer/156449?hl=en- On the Webmaster Tools Home page, click the site you want.
- Under Crawl, click Blocked URLs.
- If it's not already selected, click the** Test robots.txt** tab.
- Copy the content of your robots.txt file, and paste it into the first box.
- In the URLs box, list the site to test against.
- In the User-agents list, select the user-agents you want.
-Andy
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Robots.txt was set to disallow for 14 days
We updated our website and accidentally overwrote our robots file with a version that prevented crawling ( "Disallow: /") We realized the issue 14 days later and replaced after our organic visits began to drop significantly and we quickly replace the robots file with the correct version to begin crawling again. With the impact to our organic visits, we have a few and any help would be greatly appreciated - Will the site get back to its original status/ranking ? If so .. how long would that take? Is there anything we can do to speed up the process ? Thanks
Intermediate & Advanced SEO | | jc42540 -
Will disallowing URL's in the robots.txt file stop those URL's being indexed by Google
I found a lot of duplicate title tags showing in Google Webmaster Tools. When I visited the URL's that these duplicates belonged to, I found that they were just images from a gallery that we didn't particularly want Google to index. There is no benefit to the end user in these image pages being indexed in Google. Our developer has told us that these urls are created by a module and are not "real" pages in the CMS. They would like to add the following to our robots.txt file Disallow: /catalog/product/gallery/ QUESTION: If the these pages are already indexed by Google, will this adjustment to the robots.txt file help to remove the pages from the index? We don't want these pages to be found.
Intermediate & Advanced SEO | | andyheath0 -
Robots.txt - blocking JavaScript and CSS, best practice for Magento
Hi Mozzers, I'm looking for some feedback regarding best practices for setting up Robots.txt file in Magento. I'm concerned we are blocking bots from crawling essential information for page rank. My main concern comes with blocking JavaScript and CSS, are you supposed to block JavaScript and CSS or not? You can view our robots.txt file here Thanks, Blake
Intermediate & Advanced SEO | | LeapOfBelief0 -
Block subdomain directory in robots.txt
Instead of block an entire sub-domain (fr.sitegeek.com) with robots.txt, we like to block one directory (fr.sitegeek.com/blog).
Intermediate & Advanced SEO | | gamesecure
'fr.sitegeek.com/blog' and 'wwww.sitegeek.com/blog' contain the same articles in one language only labels are changed for 'fr' version and we suppose that duplicate content cause problem for SEO. We would like to crawl and index 'www.sitegee.com/blog' articles not 'fr.sitegeek.com/blog'. so, suggest us how to block single sub-domain directory (fr.sitegeek.com/blog) with robot.txt? This is only for blog directory of 'fr' version even all other directories or pages would be crawled and indexed for 'fr' version. Thanks,
Rajiv0 -
Do you add 404 page into robot file or just add no index tag?
Hi, got different opinion on this so i wanted to double check with your comment is. We've got /404.html page and I was wondering if you would add this page to robot text so it wouldn't be indexed or would you just add no index tag? What would be the best approach? Thanks!
Intermediate & Advanced SEO | | Rubix0 -
If i disallow unfriendly URL via robots.txt, will its friendly counterpart still be indexed?
Our not-so-lovely CMS loves to render pages regardless of the URL structure, just as long as the page name itself is correct. For example, it will render the following as the same page: example.com/123.html example.com/dumb/123.html example.com/really/dumb/duplicative/URL/123.html To help combat this, we are creating mod rewrites with friendly urls, so all of the above would simply render as example.com/123 I understand robots.txt respects the wildcard (*), so I was considering adding this to our robots.txt: Disallow: */123.html If I move forward, will this block all of the potential permutations of the directories preceding 123.html yet not block our friendly example.com/123? Oh, and yes, we do use the canonical tag religiously - we're just mucking with the robots.txt as an added safety net.
Intermediate & Advanced SEO | | mrwestern0 -
Robots.txt unblock
I'm currently having trouble with what appears to be a cached version of robots.txt. I'm being told via errors in my Google sitemap account that I'm denying Googlebot access to the entire site. I uploaded clean and "Allow" robots.txt yesterday, but receive the same error. I've tried "Fetch as Googlebot" on the index and other pages, but still the error. Here is the latest: | Denied by robots.txt |
Intermediate & Advanced SEO | | Elchanan
| 11/9/11 10:56 AM | As I said, there in not blocking on the robots.txt for 24 hours. HELP!0 -
Robots.txt: Link Juice vs. Crawl Budget vs. Content 'Depth'
I run a quality vertical search engine. About 6 months ago we had a problem with our sitemaps, which resulted in most of our pages getting tossed out of Google's index. As part of the response, we put a bunch of robots.txt restrictions in place in our search results to prevent Google from crawling through pagination links and other parameter based variants of our results (sort order, etc). The idea was to 'preserve crawl budget' in order to speed the rate at which Google could get our millions of pages back in the index by focusing attention/resources on the right pages. The pages are back in the index now (and have been for a while), and the restrictions have stayed in place since that time. But, in doing a little SEOMoz reading this morning, I came to wonder whether that approach may now be harming us... http://www.seomoz.org/blog/restricting-robot-access-for-improved-seo
Intermediate & Advanced SEO | | kurus
http://www.seomoz.org/blog/serious-robotstxt-misuse-high-impact-solutions Specifically, I'm concerned that a) we're blocking the flow of link juice and that b) by preventing Google from crawling the full depth of our search results (i.e. pages >1), we may be making our site wrongfully look 'thin'. With respect to b), we've been hit by Panda and have been implementing plenty of changes to improve engagement, eliminate inadvertently low quality pages, etc, but we have yet to find 'the fix'... Thoughts? Kurus0