Is robots.txt a must-have for a 150-page, well-structured site?
-
Looking at my logs, I see dozens of 404 errors each day from different bots trying to load robots.txt. I have a small site (150 pages) with clean navigation that lets the bots index the whole site (which they are doing). There are no secret areas I don't want the bots to find (the secret areas are behind a login, so the bots won't see them anyway).
I have used rel=nofollow for internal links that point to my Login page.
Is there any reason to include a generic robots.txt file that contains "User-agent: *"?
I have a minor reason: to stop getting 404 errors and clean up my error logs so I can find other issues that may exist. But I'm wondering whether not having a robots.txt file is effectively the same as having a blank default file (or a one-line file giving all bots full access)?
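For reference, the minimal "allow everything" file I have in mind would be something like this (as I understand it, an empty Disallow means nothing is blocked):
User-agent: *
Disallow: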
-
Thanks, Keri. No, it's a hand-built blog. No CMS.
I think Googlebot is doing a good job of indexing my site. The site is small, and when I search for my content I do find it in Google. I was pretty sure Google worked the way you describe. So it sounds like sitemaps are an optional hint, and perhaps not needed for relatively small sites (a couple hundred pages of well-linked content). Thanks.
-
The phrase "blog entries" makes me ask are you on a CMS like Wordpress, or are the blog entries pages you are creating from scratch?
If you're on WP or another CMS, you'll want a robots.txt so that your admin, plugin, and other directories aren't crawled (see the example below). On the plus side, WP (and other CMSs) have plugins that will generate a sitemap.xml file for you as you add pages.
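A typical starting point looks something like this (the exact paths depend on your install):
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/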
Google will find pages even if you don't have a sitemap, or forget to add them to it. The sitemap is a way to let Google know what is out there, but it a) isn't required for Google to index a page and b) won't force Google to index a page.
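For reference, a bare-bones sitemap.xml looks roughly like this (the URL and date are just placeholders):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/blog/some-post/</loc>
    <lastmod>2012-01-15</lastmod>
  </url>
</urlset>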
-
Thanks, Keith. Makes sense.
So how important is an XML sitemap for a 150-page site with clean navigation? As near as I can tell (from the site: command), my whole site is already being indexed by Google. Does a sitemap buy me anything? What happens if my sitemap is partial (i.e., if I forget to add new pages to it, but I do link to them from my other indexed pages, will the new pages still get indexed)? I'm a little worried about sitemap maintenance as I add new blog entries and so on...
-
Hi Mike...
I am sure you are always going to get a range of opinions on this kind of question.
I think that for your site the answer may simply be that having a robots.txt file is more of a "belt and braces", safe-harbour-type thing – the same goes for, say, whether you should have a keywords meta tag. Many say these pieces of code are of marginal value, but when you are competing head to head for a #1 listing (i.e. 35%+ of the clicks), you should use every option and weapon available. Furthermore, if your site is likely to grow significantly or eventually have content/files that you want excluded, it's just a "tidy" thing to have had in place over time.
Also, don't forget that robots.txt best practice is to also include a pointer to your XML sitemap(s).
Here is an example from one of our sites...
User-agent: *
Disallow: /design_examples.xml
Disallow: /case_studies.xml

User-agent: Googlebot-Image
Disallow: /

Sitemap: http://www.sitetopleveldomain.com/sitemap.xml
In this example there are two root files specifically excluded for all bots, and the site has also excluded the Google Images bot entirely: they were getting a lot of traffic from image searches and then seeing the same copyrighted images turn up on a hundred junk sites. This doesn't stop image scraping, but it certainly makes the images harder to find.
In relation to the “or 1-line file giving all bots all access” part of your question...
Some bots (most notably Google) now support an additional field called "Allow:".
As the name suggests, "Allow:" lets you explicitly indicate which files/folders CAN be crawled, typically as an exception to a broader Disallow rule. However, this field is not part of the original robots.txt protocol and so isn't universally supported; my suggestion would be to test it on your site for a week, as it might confuse some less intelligent crawlers.
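For example (the paths here are hypothetical), a carve-out might look like this:
User-agent: Googlebot
Disallow: /images/
Allow: /images/logo.png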
So, in summary, my recommendation is to keep a simple robots.txt file, test whether the Allow: field works for you, and make sure the file points to your XML sitemap – although wearing a belt and braces might not be a good look, at least your pants are unlikely to fall down.
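Putting that together, a simple starting point for a site like yours might be something like this (the sitemap URL is just a placeholder):
User-agent: *
Disallow:

Sitemap: http://www.yoursite.com/sitemap.xml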