Blocking poor quality content areas with robots.txt
-
I found an interesting discussion on Search Engine Roundtable where Barry Schwartz and others discussed using robots.txt to block low-quality content areas affected by Panda.
http://www.seroundtable.com/google-farmer-advice-13090.html
The article is a bit dated. I was wondering what current opinions are on this.
We have some dynamically generated content pages that we tried to improve after Panda. Resources have been limited and, alas, they are still there. Until we can officially remove them, I thought it might be a good idea to just block the entire directory. I would also remove those pages from my sitemaps and resubmit. There are links coming in, but I could redirect the important ones (I was going to do that anyway). Thoughts?
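For reference, blocking the whole directory would only take a couple of lines in robots.txt - a minimal sketch, assuming (hypothetically) that the dynamic pages all live under a /generated/ folder:

User-agent: *
Disallow: /generated/

We'd pair that with pulling those URLs out of the sitemap files before resubmitting.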
-
If the pages no longer exist and you remove the robots.txt directive for that directory, it shouldn't make much difference. Google may start reporting them as 404s, since it knows the files used to exist and there's no longer a directive telling it to stay out of the directory. I don't see any harm in leaving the directive there, but I also don't see many issues arising from removing it.
-
Hey Mark - Thank you, this is really helpful.
This is great advice for deindexing the pages while they still exist.
One more question, though. Once we actually remove them and the directory no longer exists, there's no point in keeping the robots.txt disallow, right? At that point, if they're still in the index, only the URL removal tool will be useful.
I read this: https://support.google.com/webmasters/answer/59819?hl=en
While the webmaster guidelines say you need to use robots.txt, I don't see how that's a requirement for pages that no longer exist. Google won't find anything to crawl once they're gone. Also, if the directory is blocked in robots.txt but there are a few redirects within it, those redirects would not be seen. I also don't think adding a line to robots.txt every time we remove something is good practice. Thoughts?
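To illustrate the redirect point: the important inbound links would get a plain 301 at the server level - a rough sketch, assuming Apache and hypothetical paths:

Redirect 301 /generated/popular-widget-page /widgets/popular-widget

Googlebot has to be able to fetch the old URL to see that 301, so if /generated/ stays disallowed in robots.txt the redirect would never be discovered.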
-
Blocking a page or folder in robots.txt doesn't remove the page from a search engine's index; it just prevents the engines from recrawling it. For pages/folders/sites that were never crawled, robots.txt can keep them from being crawled and read in the first place. But blocking already-crawled pages with robots.txt will not be enough on its own to remove them from the index.
To remove this low quality content, you can do one of two things:
- Add a meta robots noindex tag to the content you want to remove - this tells the engines to drop the page from the index and treat the content as if it isn't there - in effect, it's dead to them (see the snippet below)
- After blocking the folder via robots.txt, go into Webmaster Tools and use the URL removal tool on the folder or domain.
I usually recommend option 1, because it works across engines, doesn't require a separate webmaster tools account for each engine, is easier to manage, and gives you much finer control over exactly which pages are removed.
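As a quick sketch of option 1, the tag is a single line in the <head> of each page you want dropped (a generic example - adapt it to your own templates):

<meta name="robots" content="noindex">

One caveat: leave those pages unblocked in robots.txt until they have actually dropped out of the index - the engines need to recrawl a page to see its noindex tag.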
But you are on the right track with the sitemaps - don't include the noindexed pages in the sitemap.
Good luck,
Mark