Robots.txt gone wild
-
Hi guys, a site we manage, http://hhhhappy.com, received an alert through Webmaster Tools yesterday that it can't be crawled. No changes were made to the site.
I don't know a huge amount about robots.txt configuration, except that Yoast by default sets it to disallow crawling of the wp-admin folder and nothing else. I checked this against all our other sites and the settings are the same. And yet, 12 hours after the issue started, the site is still not being crawled and meta data is not showing in search results. Any ideas what may have triggered this?
-
Hi Radi!
Have Matt and/or Martijn answered your question? If so, please mark one or both of their responses "Good Answer."
Otherwise, what's still tripping you up?
-
Have you checked the site's uptime recently? Sometimes Google isn't able to reach your robots.txt file, and because of that it will temporarily stop crawling your site.
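If you want to keep an eye on this yourself, a quick availability check is easy to script. Below is a minimal sketch, assuming Python 3 and only the standard library; the URL is the site from this thread, and the comments reflect how Google generally treats robots.txt fetch results (200 and 404 are fine, 5xx or unreachable can pause crawling).

```python
# Minimal sketch: check whether a robots.txt is reachable.
# Assumes Python 3, standard library only.
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def check_robots(url):
    req = Request(url, headers={"User-Agent": "robots.txt availability check"})
    try:
        with urlopen(req, timeout=10) as resp:
            # 200: crawlers can read the rules as normal
            print(f"{url} -> HTTP {resp.getcode()}")
    except HTTPError as e:
        # 404 is generally treated as "no restrictions";
        # a 5xx here is the case that can make Google pause crawling
        print(f"{url} -> HTTP {e.code}")
    except URLError as e:
        # DNS or connection trouble: the crawler can't reach the file at all
        print(f"{url} -> unreachable: {e.reason}")

check_robots("http://hhhhappy.com/robots.txt")
```

Run from cron every few minutes, something like this gives you a log you can correlate against the timestamps of Search Console alerts.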
-
Are you getting the message in Search Console that there were errors crawling your page?
This typically means that your host was temporarily down when Google landed on your page. These types of things happen all the time and are no big deal.
Your homepage cache shows a crawl date of today, so I'm assuming things are working properly. If you really want to find out, try doing a "Fetch" of your site in Search Console:
Crawl > Fetch as Google > Fetch (big red button)
You should get a status of "Complete." If you get anything else there should be an error message with it. If so, paste that here.
I have checked the site's headers, cache, and crawlability with Screaming Frog, and everything is fine. This seems like one of those temporary messages, but if the problem persists definitely let us know!
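If you want a second opinion on the robots.txt itself, independent of Search Console, Python's standard library ships a robots.txt parser. Here's a minimal sketch, assuming the file from this thread is still live; the two test URLs are just illustrative.

```python
# Minimal sketch: parse the live robots.txt and ask what Googlebot may fetch.
# Assumes Python 3, standard library only.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://hhhhappy.com/robots.txt")
rp.read()  # download and parse the live file

for url in ("http://hhhhappy.com/", "http://hhhhappy.com/wp-admin/"):
    verdict = "allowed" if rp.can_fetch("Googlebot", url) else "blocked"
    print(f"{url} -> {verdict}")
```

With the one-rule file the host quotes below, the homepage should come back allowed and /wp-admin/ blocked; anything else suggests a different robots.txt is being served to you than to Googlebot.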
-
Our host has just offered this response, which doesn't get me any closer:
Hi Radi,
It looks like your site has its own robots.txt file, which is not blocking any user agents. The only thing it's doing is blocking bots from crawling your admin area:
```
User-agent: *
Disallow: /wp-admin/
```
This is a standard robots.txt file, and from a hosting standpoint you shouldn't be having any issues with Google indexing your site. To test this, I curled the site as Googlebot and received a 200 OK response:
```
curl -A "Googlebot/2.1" -IL http://hhhhappy.com

HTTP/1.1 200 OK
Date: Sat, 05 Mar 2016 22:17:26 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Set-Cookie: __cfduid=d3177a1baa04623fb2573870f1d4b4bac1457216246; expires=Sun, 05-Mar-17 22:17:26 GMT; path=/; domain=.hhhhappy.com; HttpOnly
X-Cacheable: bot
Cache-Control: max-age=10800, must-revalidate
X-Cache: HIT: 17
X-Cache-Group: bot
X-Pingback: http://hhhhappy.com/xmlrpc.php
Link: <http://hhhhappy.com/>; rel=shortlink
Expires: Thu, 19 Nov 1981 08:52:00 GMT
X-Type: default
X-Pass-Why:
Set-Cookie: X-Mapping-fjhppofk=2C42B261F74DA203D392B5EC5BF07833; path=/
Server: cloudflare-nginx
CF-RAY: 27f0f02445920f09-IAD
```
I didn't see any plugins on your site that looked like they would overwrite robots.txt, but I urge you to take another look at them, and then dive into your site's settings for the meta value that Googlebot would pick up. Everything on our end seems to be giving the green light.
Please let us know if you have any other questions or issues in the meantime.
Cheers,
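Following up on the host's suggestion to check the meta value Googlebot would pick up: a stray noindex in a robots meta tag would explain missing search results even with a clean robots.txt. A minimal sketch for spotting one, assuming Python 3 and the standard library (the URL is this thread's site):

```python
# Minimal sketch: fetch a page and print any robots/googlebot meta tags,
# so a stray "noindex" is easy to spot. Python 3 standard library only.
from html.parser import HTMLParser
from urllib.request import urlopen

class RobotsMetaFinder(HTMLParser):
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        name = (a.get("name") or "").lower()
        if tag == "meta" and name in ("robots", "googlebot"):
            print(f'<meta name="{name}" content="{a.get("content") or ""}">')

html = urlopen("http://hhhhappy.com/", timeout=10).read().decode("utf-8", "replace")
RobotsMetaFinder().feed(html)
```

No output means no robots meta tag at all, which search engines treat the same as index,follow. Note this only sees the raw HTML; a tag injected by JavaScript, or an X-Robots-Tag HTTP header (none appears in the curl output above), would need a separate check.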
Related Questions
-
Scary bug in Search Console: all our pages reported as being blocked by robots.txt after HTTPS migration
We just migrated to HTTPS and, two days ago, created a new property in Search Console for the https domain. Webmaster Tools for the https domain now shows, for every page in our sitemap, the warning: "Sitemap contains urls which are blocked by robots.txt." The Search Console dashboard also shows a red triangle warning that our root domain is blocked by robots.txt.
1) When I test the URLs in the Search Console robots.txt testing tool, all looks fine.
2) When I fetch as Google and render the page, it renders and indexes without problem (it would not if it were really blocked by robots.txt).
3) We temporarily emptied the robots.txt completely, submitted it in Search Console, and uploaded the sitemap again: same warnings, even though no robots.txt was online.
4) We ran a Screaming Frog crawl of the whole website and it indicates that no page is blocked by robots.txt.
5) We carefully reviewed the whole robots.txt and it does not contain any row that blocks relevant content on our site or our root domain (the same robots.txt was online for the last decade on the http version without problem).
6) In Bing Webmaster Tools I could upload the sitemap, and so far no error has been reported.
7) We resubmitted the sitemaps: same issue.
8) I already see our root domain with https in the Google SERPs.
The site is https://www.languagecourse.net. Since the site has significant traffic, if Google really did interpret our site as blocked by robots.txt for any reason, we would be in serious trouble.
This is really scary, so even if it is just a bug in Search Console that does not affect crawling of the site, it would be great if someone from Google could look into the reason for it, since for a site owner this can raise cortisol to unhealthy levels. Has anybody ever experienced the same problem? Does anybody have an idea where we could report this issue?
Intermediate & Advanced SEO | lcourse
-
Huge increase in server errors and robots.txt
Hi Moz community! Wondering if someone can help? One of my clients (an online fashion retailer) has seen a huge increase in server errors (500s and 503s) over the last six weeks, and it has got to the point where people cannot access the site because of them. The client recently changed hosting companies to deal with this; the new host has just told us they removed the DNS records once the name servers were changed, and they have now fixed this and are waiting for the name servers to propagate again. These errors also correlate with a huge decrease in pages blocked by the robots.txt file, which makes me think someone has perhaps changed it and not told anyone... Anyone have any ideas here? It would be greatly appreciated! 🙂 I've been chasing this up with the dev agency and the hosting company for weeks, to no avail. Massive thanks in advance 🙂
Intermediate & Advanced SEO | labelPR
-
Best practices for robots.txt -- allow one page but not the others?
So, we have a page, like domain.com/searchhere, but its results pages are being crawled (and shouldn't be); the results look like domain.com/searchhere?query1. If I block /searchhere? will it also block crawlers from the single page /searchhere (because I still want that page to be indexed)? What is the recommended best practice for this?
Intermediate & Advanced SEO | nicole.healthline
-
301 redirect or robots.txt on an interstitial page
Hey guys, I have an affiliate tracking system that works like this: an affiliate puts up a certain code on his site, for example www.domain.com/track/aff_id. This URL leads to a page where the hit is counted and analysed, and then 302-redirects to my sales page with the affiliate's ID in the URL: www.mysalespage.com/?=aff_id. However, we've noticed recently that one affiliate seems to be ranking for our own name, and the URL Google indexed was his tracking URL (domain.com/track/aff_id). Which is strange, because there is absolutely nothing on that page; it's just an interstitial page so that our stats tracking software can properly filter hits. To remove the affiliate's URL from showing up in the SERPs, I've come up with two solutions: 1) change the redirect on his track page to a 301, or 2) change our robots.txt to block all domain.com/track/ pages from being indexed. My question is: if I 301 redirect instead of 302, will it keep the affiliate from outranking me for my own name AND pass on link juice, or should I simply block Google from crawling the interstitial tracking pages?
Intermediate & Advanced SEO | CrakJason
-
Bing Disappearance Act - SERPs gone/extremely reduced for no apparent reason
Around November 23rd/25th, SERPs on Bing largely disappeared for my website. I did a site relaunch with new, optimized content and proper redirects on November 12th. How can I tell if my site has been blocked by Bing? How long does a sitemap take to be indexed by Bing? Is this normal for sites that push massive amounts of new content at Bing? Funny thing is, things have only gotten better on Google, which I know is unrelated, but it's funny how Bing makes things so easy yet so difficult! Would appreciate any thoughts, help, etc. on my Bing disappearing act. Domain: www.laptopmd.com
Intermediate & Advanced SEO | LMDNYC
-
New server update + wrong robots.txt = lost SERP rankings
Over the weekend, we moved our store to a new server. Before the switch, the new server had a robots.txt file that disallowed its contents from being indexed (we didn't want duplicate pages from both old and new servers). When we finally made the switch, we somehow forgot to remove that robots.txt file, so the new pages weren't indexed. We quickly put our good robots.txt in place and submitted a request for a re-crawl of the site. The problem is that many of our search rankings have changed. We were ranking #2 for some keywords, and now we're not showing up at all. Is there anything we can do? Google Webmaster Tools says that the next crawl could take up to weeks! Any suggestions will be much appreciated.
Intermediate & Advanced SEO | 9Studios
-
Should I robots block this directory?
There are about 43k pages indexed in this directory, and while they're helpful to end users, I don't see them being a great source of unique content for search engines. Would you robots-block or meta noindex,nofollow these pages in the /blissindex/ directory? E.g.:
http://www.careerbliss.com/blissindex/petsmart-index-980481/
http://www.careerbliss.com/blissindex/att-index-1043730/
http://www.careerbliss.com/blissindex/facebook-index-996632/
Intermediate & Advanced SEO | CareerBliss
-
Block an entire subdomain with robots.txt?
Is it possible to block an entire subdomain with robots.txt? I write for a blog that has its root domain as well as a subdomain pointing to the exact same IP. Getting rid of the subdomain is not an option, so I'd like to explore other options to avoid duplicate content. Any ideas?
Intermediate & Advanced SEO | kylesuss12