Robots.txt and Magento
-
HI,
I am working on getting my robots.txt up and running and I'm having lots of problems with the robots.txt my developers generated. www.plasticplace.com/robots.txt
I ran the robots.txt through a syntax checking tool (http://www.sxw.org.uk/computing/robots/check.html) This is what the tool came back with: http://www.dcs.ed.ac.uk/cgi/sxw/parserobots.pl?site=plasticplace.com There seems to be many errors on the file.
Additionally, I looked at our robots.txt in the WMT and they said the crawl was postponed because the robots.txt is inaccessible. What does that mean?
A few questions:
1. Is there a need for all the lines of code that have the “#” before it? I don’t think it’s necessary but correct me if I'm wrong.
2. Furthermore, why are we blocking so many things on our website? The robots can’t get past anything that requires a password to access anyhow but again correct me if I'm wrong.
3. Is there a reason Why can't it just look like this:
User-agent: *
Disallow: /onepagecheckout/
Disallow: /checkout/cart/
I do understand that Magento has certain folders that you don't want crawled, but is this necessary and why are there so many errors?
-
Yes your short robots.txt idea would create a huge problem.
In your Magento admin if you click in the menu Catalog > URL Rewrite Management
You will see the magento feature that creates all the "pretty urls", in that page you will see a table. If get value from Target path column and copy and paste after your site domain, for example domain.com/value_in_target_path...
You'll see that the page loads fine, you don't want Google to rank those pages with the "messy" URL so that's why you need all those stuff in your robots.txt
-
I am bit confused. Are you saying that technically my Magento site has two different urls that can both be indexed; one with a (messy) url and another with a vanity url? This would create major duplicate content issues! The robots.txt would not solve such a complex issue.
Am I missing something?
-
My developer said they custom configured it to block the files they needed according to Magento.
You think I can simply make it look like this:
User-agent: *
Disallow: /onepagecheckout/
Disallow: /checkout/cart/
and then disable it in WMT?
-
3. Is there a reason Why can't it just look like this:
Yes, It would generate a lot of duplicates issues, for example your robots.txt you have the follow line:
Disallow: /catalog/category/view/ -> That's the "real" category URL, you can access any category on magento by /catalog/category/view/id or by the "pretty" URL. Because you disallow the "real: URL only the pretty URL will be viable for search engines. This same rule apply for many other parts of the robots.txt.
-
I assume this is a robots.txt that has been automatically created by Magento? - or has it been created by a developer?
I ran it through a tool and it showed 1 error and 10 warnings - so i would say you definitely need to do something about it.
The reason for all those disallows is to try and stop search engine indexing them (whether they would even find them to index them if they were not there is debatable).
What you could do is set up robots.txt as you have suggested and then stop the SE's indexing the directories or pages you don't want in appropriate webmaster tools.
I don't like displaying a lot of 'don't index' paths in the robots texts as it is pretty much telling any hacker or nasty spider where your weak points may be.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
No index tag robots.txt
Hi Mozzers, A client's website has a lot of internal directories defined as /node/*. I already added the rule 'Disallow: /node/*' to the robots.txt file to prevents bots from crawling these pages. However, the pages are already indexed and appear in the search results. In an article of Deepcrawl, they say you can simply add the rule 'Noindex: /node/*' to the robots.txt file, but other sources claim the only way is to add a noindex directive in the meta robots tag of every page. Can someone tell me which is the best way to prevent these pages from getting indexed? Small note: there are more than 100 pages. Thanks!
Technical SEO | | WeAreDigital_BE
Jens0 -
Shopify robots blocking stylesheets causing inconsistent mobile-friendly test results?
One of our shopify sites suffered an extreme rankings drop. Recent Google algorithm updates include mobile first so I tested the site and our team got different mobile-friendly test results. However, search console is also flagging pages as not mobile friendly. So, while us end-users see the site as OK on mobile, this may not be the case for Google? I researched more about inconsistent mobile test results and found answers that say it may be due to robots.txt blocking stylesheets. Do you recognise any directory blocked that might be affecting Google's rendering? We can't edit shopify robots.txt unfortunately. Our dev said the only thing that stands out to him is Disallow: /design_theme_id and the rest shouldn't be hindering Google bots. Here are some of the files blocked: Disallow: /admin
Technical SEO | | nhhernandez
Disallow: /cart
Disallow: /orders
Disallow: /checkout
Disallow: /9103034/checkouts
Disallow: /9103034/orders
Disallow: /carts
Disallow: /account
Disallow: /collections/+
Disallow: /collections/%2B
Disallow: /collections/%2b
Disallow: /blogs/+
Disallow: /blogs/%2B
Disallow: /blogs/%2b
Disallow: /design_theme_id
Disallow: /preview_theme_id
Disallow: /preview_script_id
Disallow: /discount/*
Disallow: /gift_cards/*
Disallow: /apple-app-site-association0 -
How can I make it so that robots.txt is not ignored due to a URL re-direct?
Recently a site moved from blog.site.com to site.com/blog with an instruction like this one: /etc/httpd/conf.d/site_com.conf:94: ProxyPass /blog http://blog.site.com
Technical SEO | | rodelmo4
/etc/httpd/conf.d/site_com.conf:95: ProxyPassReverse /blog http://blog.site.com It's a Wordpress.org blog that was set as a subdomain, and now is being redirected to look like a directory. That said, the robots.txt file seems to be ignored by Google bot. There is a Disallow: /tag/ on that file to avoid "duplicate content" on the site. I have tried this before with other Wordpress subdomains and works like a charm, except for this time, in which the blog is rendered as a subdirectory. Any ideas why? Thanks!0 -
Magento SEO question
Hello Moz Community, I am wondering if these magento settings are correct for seo. www.domain.com 301 > www.domain.com/main-language www.domain.com/main-language/main-keyword (index & follow) www.domain.com/main-language/main-keyword/shopby/size-m (index & follow & canonicalized to www.domain.com/main-language/main-keyword) All layered navigation links are no-follow
Technical SEO | | mhenze0 -
3,511 Pages Indexed and 3,331 Pages Blocked by Robots
Morning, So I checked our site's index status on WMT, and I'm being told that Google is indexing 3,511 pages and the robots are blocking 3,331. This seems slightly odd as we're only disallowing 24 pages on the robots.txt file. In light of this, I have the following queries: Do these figures mean that Google is indexing 3,511 pages and blocking 3,331 other pages? Or does it mean that it's blocking 3,331 pages of the 3,511 indexed? As there are only 24 URLs being disallowed on robots.text, why are 3,331 pages being blocked? Will these be variations of the URLs we've submitted? Currently, we don't have a sitemap. I know, I know, it's pretty unforgivable but the old one didn't really work and the developers are working on the new one. Once submitted, will this help? I think I know the answer to this, but is there any way to ascertain which pages are being blocked? Thanks in advance! Lewis
Technical SEO | | PeaSoupDigital0 -
How to use robots.txt to block areas on page?
Hi, Across the categories/product pages on out site there are archives/shipping info section and the texts are always the same. Would this be treated as duplicated content and harmful for seo? How can I alter robots.txt to tell google not to crawl those particular text Thanks for any advice!
Technical SEO | | LauraHT0 -
No descripton on Google/Yahoo/Bing, updated robots.txt - what is the turnaround time or next step for visible results?
Hello, New to the MOZ community and thrilled to be learning alongside all of you! One of our clients' sites is currently showing a 'blocked' meta description due to an old robots.txt file (eg: A description for this result is not available because of this site's robots.txt) We have updated the site's robots.txt to allow all bots. The meta tag has also been updated in WordPress (via the SEO Yoast plugin) See image here of Google listing and site URL: http://imgur.com/46wajJw I have also ensured that the most recent robots.txt has been submitted via Google Webmaster Tools. When can we expect these results to update? Is there a step I may have overlooked? Thank you,
Technical SEO | | adamhdrb
Adam 46wajJw0 -
Meta-robots Nofollow on logins and admins
In my SEO MOZ reports I am getting over 400 errors as Meta-robots Nofollow. These are all leading to my admin login page which I do not want robots in. Should I put some code on these pages so the robots know this and don't attempt to and I do not get these errors in my reports?
Technical SEO | | Endora0