Blocking poor quality content areas with robots.txt
-
I found an interesting discussion on Search Engine Roundtable where Barry Schwartz and others discussed using robots.txt to block low-quality content areas affected by Panda.
http://www.seroundtable.com/google-farmer-advice-13090.html
The article is a bit dated. I was wondering what current opinions are on this.
We have some dynamically generated content pages that we tried to improve after Panda. Resources have been limited and, alas, they are still there. Until we can officially remove them, I thought it might be a good idea to just block the entire directory. I would also remove those pages from my sitemaps and resubmit. There are links coming in, but I could redirect the important ones (I was going to do that anyway). Thoughts?
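For reference, blocking the whole directory would only take a couple of lines in robots.txt - a minimal sketch, assuming (hypothetically) that the dynamic pages all live under a /generated/ folder:

User-agent: *
Disallow: /generated/

We'd pair that with pulling those URLs out of the sitemap files before resubmitting.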
-
If the pages no longer exist and you remove the robots.txt directive for that directory, it shouldn't make much difference. Google may start reporting them as 404s, since it knows the files used to exist and there's no longer a directive telling it to stay out of the directory. I don't see any harm in leaving the directive there, but I also don't see many issues arising from removing it.
-
Hey Mark - Thank you, this is really helpful.
This is great advice for deindexing the pages while they still exist.
One more question, though. Once we actually remove them and the directory no longer exists, there's no point in keeping the robots.txt disallow, right? At that point, if they're still in the index, only the URL removal tool will be useful.
I read this: https://support.google.com/webmasters/answer/59819?hl=en
While the webmaster guidelines say you need to use robots.txt, I don't see how that's a requirement for pages that no longer exist. Google won't find anything to crawl once they're gone. Also, if the directory is blocked in robots.txt but there are a few redirects within it, those redirects would not be seen. I also don't think adding a line to robots.txt every time we remove something is good practice. Thoughts?
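To illustrate the redirect point: the important inbound links would get a plain 301 at the server level - a rough sketch, assuming Apache and hypothetical paths:

Redirect 301 /generated/popular-widget-page /widgets/popular-widget

Googlebot has to be able to fetch the old URL to see that 301, so if /generated/ stays disallowed in robots.txt the redirect would never be discovered.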
-
Blocking a page or folder in robots.txt doesn't remove the page from a search engine's index; it just prevents the engines from recrawling it. For pages/folders/sites that were never crawled, robots.txt can keep them from being crawled and read in the first place. But blocking already-crawled pages with robots.txt will not be enough on its own to remove them from the index.
To remove this low quality content, you can do one of two things:
- Add a meta robots noindex tag to the content you want to remove - this tells the engines to drop the page from the index and treat the content as if it isn't there - in effect, it's dead to them (see the snippet below)
- After blocking the folder via robots.txt, go into Webmaster Tools and use the URL removal tool on the folder or domain.
I usually recommend option 1, because it works across engines, doesn't require a separate webmaster tools account for each engine, is easier to manage, and gives you much finer control over exactly which pages are removed.
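As a quick sketch of option 1, the tag is a single line in the <head> of each page you want dropped (a generic example - adapt it to your own templates):

<meta name="robots" content="noindex">

One caveat: leave those pages unblocked in robots.txt until they have actually dropped out of the index - the engines need to recrawl a page to see its noindex tag.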
But you are on the right track with the sitemaps - don't include the noindexed pages in the sitemap.
Good luck,
Mark