Blocking poor quality content areas with robots.txt
-
I found an interesting discussion on Search Engine Roundtable where Barry Schwartz and others were discussing using robots.txt to block low-quality content areas affected by Panda.
http://www.seroundtable.com/google-farmer-advice-13090.html
The article is a bit dated. I was wondering what current opinions are on this.
We have some dynamically generated content pages that we tried to improve after Panda. Resources have been limited and, alas, they are still there. Until we can officially remove them, I thought it might be a good idea to just block the entire directory. I would also remove them from my sitemaps and resubmit. There are links coming in, but I could redirect the important ones (I was going to do that anyway). Thoughts?
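For reference, a directory-level block like the one described would just be a couple of lines in robots.txt (the directory name below is hypothetical):

    User-agent: *
    Disallow: /generated-pages/

This tells compliant crawlers not to fetch anything under that path.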
-
If the page no longer exists and you remove the robots.txt directive for that directory, it shouldn't make much difference. Google may start reporting those URLs as 404s, since it knows the files used to exist and there is no longer a directive telling it to stay out of the directory. I don't see any harm in leaving the block in place, but I also don't see many issues arising from removing it.
-
Hey Mark - Thank you, this is really helpful.
This is really great advice for deindexing the pages when they still actually do exist.
One more question, though. Once we actually remove them, once the directory no longer exists, there's no point in keeping the robots.txt disallow, right? At that point, if the pages are still in the index, only the URL removal tool will be useful.
I read this: https://support.google.com/webmasters/answer/59819?hl=en
While the webmaster guidelines say you need to use robots.txt, I don't see how that can be a requirement for pages that no longer exist; Google shouldn't be able to crawl pages that are gone. Also, if the directory is blocked in robots.txt but there are a few redirects within it, those redirects would not be followed. And I don't think adding a line to robots.txt every time we remove something is good practice. Thoughts?
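To illustrate the redirect point, here's a minimal sketch of the server-side 301s for the URLs that have inbound links, assuming Apache with mod_alias (the paths are hypothetical); for Google to follow these, the directory can't stay blocked in robots.txt:

    # 301 the handful of old URLs that have inbound links (hypothetical paths)
    Redirect 301 /generated-pages/popular-page https://www.example.com/better-page/
    Redirect 301 /generated-pages/linked-page https://www.example.com/other-page/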
-
Blocking a page or folder in robots.txt doesn't remove it from a search engine's index; it just prevents the engines from recrawling it. For pages, folders, or sites that have never been crawled, robots.txt can keep them from being crawled and read in the first place. But blocking pages that have already been crawled will not, on its own, remove them from the index.
To remove this low quality content, you can do one of two things:
- Add a meta robots noindex tag to the content you want to remove. This tells the engines to drop the page from the index and that the content shouldn't be there; in effect, it's dead to them.
- After blocking the folder via robots.txt, go into Webmaster Tools and use the URL removal tool on the folder or domain.
I usually recommend option 1, because it works across multiple engines, doesn't require a webmaster tools account for each engine separately, and gives you much finer control over exactly which pages are removed.
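A minimal sketch of option 1: the tag goes in the head of each page you want dropped, and the page must stay crawlable (i.e., not blocked in robots.txt) so the engines can actually see the tag:

    <head>
      <!-- tells compliant search engines to drop this page from their index -->
      <meta name="robots" content="noindex">
    </head>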
But you are on the right track with the sitemaps: don't include links to the noindexed pages in the sitemap.
Good luck,
Mark