Can URLs blocked with robots.txt hurt your site?
-
We have about 20 testing environments blocked by robots.txt, and these environments contain duplicates of our indexed content. They are all appearing in Google's index as "blocked by robots.txt". Can they still count against us or hurt us?
I know the best practice for permanently removing these would be to use the noindex tag, but I'm wondering whether they can still hurt us if we leave them the way they are.
-
90% likely not. First of all, check whether Google has actually indexed them; if not, your robots.txt should do the job. However, I would reinforce that by making sure those URLs are out of your sitemap file and that your robots.txt disallows are set for ALL user agents (*), not just Google.
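A minimal sketch of that (assuming each test environment serves its own robots.txt and everything on that environment should be blocked):
User-agent: *
Disallow: /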
Google's duplicate-content policies are tough, but Google will always respect simple controls such as robots.txt.
I had a case in the past where a customer had a dedicated IP and Google somehow found it, so you could see both the domain's pages and the IP's pages, all identical. We simply added an .htaccess rule to redirect requests for the IP to the domain, and even though the situation had been like that for a long time, it doesn't seem to have affected them. In theory Google penalizes duplicate content, but not in cases like this; it comes down to behavior.
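Something along these lines in .htaccess would do it (only a sketch, with a placeholder IP and domain rather than the real ones):
# Redirect requests that arrive on the bare IP (placeholder) to the canonical domain (placeholder)
RewriteEngine On
RewriteCond %{HTTP_HOST} ^203\.0\.113\.10$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]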
Regards!
-
I've seen people say that in "rare" cases, URLs blocked by robots.txt will be shown as search results, but there's no way I can imagine that happening when they're duplicates of your content.
Robots.txt lets a search engine know not to crawl a directory - but if another resource links to it, the engine may know the URL exists, just not what's on it. It won't know whether the page is noindexed, because it never crawls it - but since it knows the URL exists, it could, rarely, return it. With duplicate content there will always be a better result available, so that better result will be returned and your test sites should not be.
As far as hurting your site: no way. Unless a page WAS allowed, is a duplicate, is now NOT allowed, and hasn't been recrawled since. Even in that case, I can't imagine it would hurt you much. I wouldn't worry about it.
(Also, noindex doesn't matter on these pages, at least to Google. Google will see the robots.txt disallow first and will not crawl the page. Until they crawl the page, it doesn't matter whether it has one word or 300 directives on it; they'll never see them. So noindex really wouldn't help unless a page had already slipped through.)
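For reference, the noindex in question is the on-page robots meta tag; this one-line sketch is what Google would have to crawl the page in order to see:
<!-- only takes effect on a page crawlers are allowed to fetch -->
<meta name="robots" content="noindex">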
-
I don't believe they are going to hurt you. It's more of a warning that, if you were trying to have these pages indexed, they can't be accessed at the moment. When you don't want them indexed, as in this case, I don't believe you are suffering because of it.
Related Questions
-
Duplicate URLs on eCommerce site caused by parameters
Hi there, We have a client with a large eCommerce site that has about 1,500 duplicate URLs caused by parameters in the URLs (such as the sort parameter, where the list of products is sorted by price, age, etc.).
Example: www.example.com/cars/toyota
First duplicate URL: www.example.com/cars/toyota?sort=price-ascending
Second duplicate URL: www.example.com/cars/toyota?sort=price-descending
Third duplicate URL: www.example.com/cars/toyota?sort=age-descending
Originally we had advised adding robots.txt rules to block search engines from crawling the URLs with parameters, but this hasn't been done. My question: if we add the robots.txt rules now and exclude all URLs with filters, how long will it take for Google to disregard the duplicate URLs? We could ask the developers to add canonical tags to all the duplicates, but there are about 1,500 of them... Thanks in advance for any advice!
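For context, the kind of robots.txt rule being described would look roughly like the sketch below, which assumes sort is the first or only parameter in the query string:
User-agent: *
Disallow: /*?sort=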
Intermediate & Advanced SEO | | Gabriele_Layoutweb0 -
Robots.txt: how to exclude sub-directories correctly?
Hello here, I am trying to figure out the correct way to tell SEs to crawl this:
http://www.mysite.com/directory/
But not this:
http://www.mysite.com/directory/sub-directory/
or this:
http://www.mysite.com/directory/sub-directory2/sub-directory/...
But given that I have thousands of sub-directories with almost infinite combinations, I can't write the following definitions in a manageable way:
Disallow: /directory/sub-directory/
Disallow: /directory/sub-directory2/
Disallow: /directory/sub-directory/sub-directory/
Disallow: /directory/sub-directory2/subdirectory/
etc... I would end up having thousands of definitions to disallow all the possible sub-directory combinations. So, is the following a correct, better and shorter way to define what I want above:
Allow: /directory/$
Disallow: /directory/*
Would the above work? Any thoughts are very welcome! Thank you in advance. Best, Fab.
Intermediate & Advanced SEO | | fablau1 -
Why are these results being shown as blocked by robots.txt?
If you perform this search, you'll see that all the m. results are blocked by robots.txt: http://goo.gl/PRrlI. But when I reviewed the robots.txt file (http://goo.gl/Hly28), I didn't see anything that would block crawlers from these pages. Any ideas why these are showing as blocked?
Intermediate & Advanced SEO | | nicole.healthline0 -
Robot.txt error
I currently have this under my robots.txt file:
User-agent: *
Disallow: /authenticated/
Disallow: /css/
Disallow: /images/
Disallow: /js/
Disallow: /PayPal/
Disallow: /Reporting/
Disallow: /RegistrationComplete.aspx
WebMatrix 2.0
On Webmaster > Health Check > Blocked URL I copy and paste the above code, then click on Test and everything looks OK; but when I log out and log back in, I see the code below under Blocked URL:
User-agent: *
Disallow: /
WebMatrix 2.0
Currently, Google doesn't index my domain and I don't understand why this is happening. Any ideas? Thanks, Seda
Intermediate & Advanced SEO | | Rubix0 -
Why does this site have a PR 0 rank? Can anyone figure this out? LegionSafety.com
Our site dropped in PR and we haven't done anything, so we're not sure why it dropped. Does anyone have any recommendations?
Intermediate & Advanced SEO | | legionsafety0 -
Can I, in Google's good graces, check for Googlebot to turn on/off tracking parameters in URLs?
Basically, we use a number of parameters in our URLs for event tracking. Google could be crawling an infinite number of these URLs. I'm already using the canonical tag to point at the non-tracking versions of those URLs... but that doesn't stop the crawling. I want to know if I can do conditional 301s or just detect the user agent as a way to know when NOT to append those parameters. I'm just trying to follow their guidelines about letting bots crawl without things like session IDs, but they don't tell you HOW to do this. Thanks!
Intermediate & Advanced SEO | | KenShafer0 -
How can using guest bloggers HURT me?
Hi Mozzers, I want to start accepting guest bloggers on my site to create content.
What potential pitfalls should I watch out for? Should they only link a certain number of times to their site? Per article? Overall?
What ratio of followed vs. nofollowed links?
Should I avoid linking to "hazardous" sites I don't want to associate myself with? How can I identify a site that will hurt my link profile?
What's the best way to make sure the guest bloggers aren't plagiarizing? Thanks!
Intermediate & Advanced SEO | | Travis-W0 -
10,000 New Pages of New Content - Should I Block in Robots.txt?
I'm almost ready to launch a redesign of a client's website. The new site has over 10,000 new product pages, which contain unique product descriptions but do feature some text similar to other products throughout the site. An example of the page similarity would be the following two products:
Brown leather 2 seat sofa
Brown leather 4 seat corner sofa
Obviously, the products are different, but the pages feature very similar terms and phrases. I'm worried that the Panda update will mean these pages are sandboxed and/or penalised. Would you block the new pages? Add them gradually? What would you recommend in this situation?
Intermediate & Advanced SEO | | cmaddison0