Blocking poor quality content areas with robots.txt
-
I found an interesting discussion on Search Engine Roundtable where Barry Schwartz and others were discussing using robots.txt to block low-quality content areas affected by Panda.
http://www.seroundtable.com/google-farmer-advice-13090.html
The article is a bit dated. I was wondering what current opinions are on this.
We have some dynamically generated content pages which we tried to improve after Panda. Resources have been limited and, alas, they are still there. Until we can officially remove them, I thought it might be a good idea to just block the entire directory in robots.txt. I would also remove them from my sitemaps and resubmit. There are links coming in, but I could redirect the important ones (I was going to do that anyway). Thoughts?
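To be concrete, the kind of block I have in mind would be something like this (the directory name is just a placeholder, not our real path):

```
# Hypothetical example - block crawling of the dynamically generated section
User-agent: *
Disallow: /dynamic-content/
```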
-
If the pages no longer exist and you remove the robots.txt rule for that directory, it shouldn't make much difference. Google could start reporting them as 404s, since it knows the files used to exist and there's no longer a rule telling it to stay out of that directory. I don't see any harm in leaving the rule in place, but I also don't see many issues arising from removing it.
-
Hey Mark - Thank you, this is really helpful.
This is really great advice for deindexing the pages while they still exist.
One more question, though. Once we actually remove them and the directory no longer exists, there's no point in keeping the robots.txt disallow, right? At that point, if they're still in the index, only the URL removal tool will be useful.
I read this: https://support.google.com/webmasters/answer/59819?hl=en
While the webmaster guidelines say to use robots.txt, I don't see how that's a requirement for pages that no longer exist; once they're gone, there's nothing left for Google to crawl. Also, if the directory is blocked in robots.txt but there are a few redirects within it, those redirects would never be seen (example below). I also don't think adding a line to robots.txt every time we remove something is good practice. Thoughts?
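To illustrate the redirect point: say I set up a 301 for one of the important URLs (Apache mod_alias syntax, paths are hypothetical):

```
# Hypothetical example - 301 an important old URL to its replacement
Redirect 301 /dynamic-content/widget-guide /guides/widgets
```

If the whole /dynamic-content/ directory stays disallowed in robots.txt, Googlebot never requests the old URL, so it never sees that 301.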
-
When you block a page or folder in robots.txt, it doesn't remove the page from the search engine's index; it just prevents them from recrawling it. For pages, folders, or sites that have never been crawled, robots.txt can keep them from being crawled and read in the first place. But blocking pages that have already been crawled will not be enough on its own to remove them from the index.
To remove this low-quality content, you can do one of two things:
- Add a meta robots noindex tag to the pages you want removed - this tells the engines to drop the page from their index; in effect, the content is dead to them (a quick example follows after this list)
- After blocking the folder via robots.txt, go into Webmaster Tools and use the URL removal tool on the folder or domain.
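Here's a minimal sketch of option 1. One caveat: for this to work, the page has to stay crawlable (not disallowed in robots.txt), or the engines will never see the tag.

```html
<!-- In the <head> of each low-quality page you want dropped from the index -->
<!-- The page must NOT be disallowed in robots.txt, or crawlers never see this tag -->
<meta name="robots" content="noindex, follow">
```

If the pages are dynamically generated and the template is hard to edit, an X-Robots-Tag: noindex HTTP response header does the same job.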
I usually recommend option 1, because it works across multiple engines, doesn't require a separate webmaster tools account for each engine, and gives you more precise control over exactly which pages get removed.
But you are on the right track with the sitemaps - don't include links to the noindexed pages in the sitemap.
Good luck,
Mark